[ https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317086#comment-15317086 ]

Steve Loughran commented on HADOOP-13203:
-----------------------------------------

It looks like, as people note, the move may make forward seeking, or a mix of 
seek + read() calls, more expensive. More specifically, it could well accelerate 
a sequence of readFully() offset calls, but not handle so well situations of 
seek(pos) + read(pos, n) + seek(pos + n + n2): the kind of pattern the forward 
skipping could handle.

Even regarding readFully() calls, it isn't going to handle well any mix of 
read() + readFully(), as the first read() will have triggered a read to the end 
of the file.
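To make the trade-off concrete, here is a minimal sketch of the forward-skip-versus-reopen decision being discussed. All names here (SKIP_LIMIT, streamPos, etc.) are illustrative, not the actual S3AInputStream fields, and the threshold is an assumption:

```java
// Sketch only: models the cost asymmetry between a cheap forward skip on the
// open HTTP stream and an expensive abort-and-reopen for backward/long seeks.
public class LazySeekSketch {
    // Assumed cutoff: forward skips cheaper than a reopen up to this distance.
    static final long SKIP_LIMIT = 64 * 1024;

    long streamPos;      // current position of the open HTTP stream
    long reopens;        // counts expensive abort + new GET cycles
    long skippedBytes;   // bytes discarded by cheap forward skips

    void seek(long target) {
        long diff = target - streamPos;
        if (diff >= 0 && diff <= SKIP_LIMIT) {
            skippedBytes += diff;   // read-and-discard on the existing connection
        } else {
            reopens++;              // backward seek or long jump: abort + reopen
        }
        streamPos = target;
    }
}
```

Under this model, a seek(pos) + read(pos, n) + seek(pos + n + n2) sequence stays on one connection whenever n2 is small, which is exactly what a pure shorten-the-request change would give up.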

It seems to me that one could actually get something of both approaches if all 
reads specified a block length, such as 64KB. On sustained forward reads, when 
the block boundary was reached it would read forward. On mixed seek/read 
operations, where the range of the read is unknown, this would significantly 
optimise random access use, rather than only those workloads which exclusively 
used one read pattern.
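A sketch of what that bounded-request idea might look like; blockSize and contentLength are assumed inputs here, not real S3A configuration keys:

```java
// Illustrative only: a fixed read-ahead block bounds every ranged GET, so
// sequential readers pay one reopen per block boundary while random readers
// never over-request the rest of the object.
public class RangeSketch {
    /** End (exclusive) of the ranged GET for a read starting at pos. */
    static long requestEnd(long pos, long blockSize, long contentLength) {
        // Never request past EOF, and never more than one block ahead.
        return Math.min(pos + blockSize, contentLength);
    }
}
```

With a 64KB block, a random read near the start of a 1MB object requests only 64KB rather than the remaining megabyte, while a read near EOF is clamped to the object length.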

And here's the problem: right now we don't know which API/file use modes are in 
widespread use against S3. We don't have the data. I can see what you're 
highlighting: the current mechanism is very expensive for backwards seeks. But 
we have just optimised forward seeking *and* instrumented the code to collect 
detail on what's actually going on.

# I don't want to rush into a change which has the potential to make some 
existing codepaths worse, especially as we don't know how the FS gets used.
# I'd really like to see collected statistics on FS usage across a broad 
dataset. Anyone here is welcome to contribute to this; it should include 
statistics gathered in downstream use.

I'm very tempted to argue this should be an S3a phase III improvement: it has 
ramifications, and we should do it well. We are, with the metrics, in a 
position to understand those ramifications and, if not in a rush, implement 
something which works well for a broad set of uses.

> S3a: Consider reducing the number of connection aborts by setting correct 
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, 
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when 
> invoking S3AInputStream::reopen(). As a part of lazySeek(), the stream 
> sometimes has to be closed and reopened. But a lot of the time the stream was 
> closed with abort(), leaving the internal HTTP connection unusable. This 
> incurs a lot of connection establishment cost in some jobs. It would be good 
> to set the correct value for the stream length to avoid connection aborts. 
> I will post the patch once the AWS tests pass on my machine.
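The mechanism the description alludes to can be sketched as a close-versus-abort decision: if the ranged GET was sized to roughly the bytes actually needed, close() can drain the few remaining bytes and return the connection to the pool; if the request covered the whole remaining object, draining is too expensive and abort() is forced. DRAIN_THRESHOLD here is an illustrative constant, not a real S3A setting:

```java
// Hypothetical sketch of why an over-long request forces abort() while a
// correctly sized one lets the HTTP connection be reused.
public class CloseSketch {
    // Assumed cutoff below which draining is cheaper than reconnecting.
    static final long DRAIN_THRESHOLD = 4096;

    /** Whether close() can drain the remaining bytes and pool the connection. */
    static boolean canReuse(long bytesRemainingInRequest) {
        // With requestedStreamLen == contentLength, an early close leaves most
        // of the object unread, so this returns false and abort() is needed;
        // a request sized to the read leaves little or nothing to drain.
        return bytesRemainingInRequest >= 0
            && bytesRemainingInRequest <= DRAIN_THRESHOLD;
    }
}
```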



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
