[ 
https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HADOOP-13203:
--------------------------------------
    Attachment: stream_stats.tar.gz
                HADOOP-13203-branch-2-004.patch

There is a corner case wherein closing the stream should use 
{{requestedStreamLen}} instead of {{contentLength}} to avoid a connection abort. 
This would be visible when long-running services in the cluster exercise this 
codepath. Fixed this in the latest patch.
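To make the corner case concrete, here is a minimal sketch of the close-time decision, not the actual S3AInputStream code: the names {{shouldAbort}} and the CLOSE_THRESHOLD value are assumptions for illustration. The point is that measuring the remaining bytes against the end of the ranged GET ({{requestedStreamLen}}) rather than the whole file ({{contentLength}}) lets a fully-consumed partial read drain and reuse the connection instead of aborting it.

```java
// Hypothetical sketch only; names and threshold are illustrative, not the real S3A code.
public class CloseDecision {
    // Assumed threshold: abort if more than this many bytes would have to be drained.
    static final long CLOSE_THRESHOLD = 4096;

    /**
     * Decide whether the underlying HTTP stream must be aborted on close.
     * Using the requested stream length (the end of the ranged GET) keeps the
     * remaining-byte count accurate for partial-range reads; using the file's
     * contentLength would overcount and force an abort.
     */
    static boolean shouldAbort(long pos, long requestedStreamLen) {
        long remaining = requestedStreamLen - pos;
        return remaining > CLOSE_THRESHOLD;
    }

    public static void main(String[] args) {
        // A ranged 16 KB footer read that was fully consumed: nothing remains,
        // so the connection can be drained and reused.
        System.out.println(shouldAbort(16384, 16384));   // false

        // The same close measured against a ~7 MB contentLength would look like
        // megabytes remaining and trigger an abort.
        System.out.println(shouldAbort(16384, 7549954)); // true
    }
}
```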

Also, I gathered the stream access profiles for a couple of TPC-DS and TPC-H 
queries by printing the stream statistics during close in the cluster where I 
tested it. Attaching those logs herewith. Please note that this was done with 
the ORC data format, which reads the footer first and then starts reading the 
stripe information.

1. In TPC-DS, most of the files are small, so they end up with a single 
backward seek during file reading. I.e., the reader reads the 
postscript/footer/meta details as the first operation and then seeks backwards 
to read the data portion of the file. Without the patch, it would abort the 
connection because the difference between the file length and the current 
position would be much higher than CLOSE_THRESHOLD.

e.g. log:
{noformat}2016-06-15 09:00:31,546 [INFO] [TezChild] |s3a.S3AFileSystem|: 
S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450967/000456_0
 pos=4162453 nextReadPos=4162453 contentLength=7630589 
StreamStatistics{OpenOperations=4, CloseOperations=4, Closed=4, Aborted=0, 
SeekOperations=3, ReadExceptions=0, ForwardSeekOperations=2, 
BackwardSeekOperations=1, BytesSkippedOnSeek=5963, 
BytesBackwardsOnSeek=7629525, BytesRead=740946, BytesRead excluding 
skipped=734983, ReadOperations=91, ReadFullyOperations=0, ReadsIncomplete=85}}
{noformat}
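Plugging the numbers from the log above into the same decision (CLOSE_THRESHOLD value assumed for illustration) shows why the pre-patch behavior aborts: at pos=4162453 against contentLength=7630589 roughly 3.3 MB appear to remain, while against the end of the actual requested range nothing does.

```java
// Arithmetic sketch using the logged values above; threshold is assumed.
public class LogMath {
    public static void main(String[] args) {
        long closeThreshold = 4096;    // assumed, for illustration only
        long pos = 4162453;            // from the log above
        long contentLength = 7630589;  // from the log above

        // Measured against the whole file, ~3.3 MB seem to remain,
        // far above the threshold, so close() would abort the connection.
        long remainingVsFile = contentLength - pos;
        System.out.println(remainingVsFile);                  // 3468136
        System.out.println(remainingVsFile > closeThreshold); // true

        // If the ranged GET only requested up to pos (hypothetical
        // requestedStreamLen == pos), nothing remains and the connection
        // can be drained and reused instead of aborted.
        long requestedStreamLen = pos;
        System.out.println(requestedStreamLen - pos > closeThreshold); // false
    }
}
```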

There are also file accesses without any backward seeks, wherein the reader 
fetches the standard 16 KB of footer information and closes the file without 
any additional reads.
e.g. log:
{noformat}
2016-06-15 09:00:28,590 [INFO] [TezChild] |s3a.S3AFileSystem|: 
S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450993/000213_0
 pos=7549954 nextReadPos=7549954 contentLength=7549954 
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, 
SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, 
BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, 
BytesRead=16384, BytesRead excluding skipped=16384, ReadOperations=1, 
ReadFullyOperations=0, ReadsIncomplete=0}}
{noformat}

2. The TPC-H dataset contains relatively large files (e.g., each file in the 
lineitem table is around 1 GB in the overall 1 TB TPC-H dataset). In such 
cases, equal numbers of forward and backward seeks happen (around 24 of each 
per file in the log). The patch avoids connection aborts on the backward seeks.
e.g. log:
{noformat}
2016-06-15 09:26:26,671 [INFO] [TezChild] |s3a.S3AFileSystem|: 
S3AInputStream{s3a://xyz/tpch_flat_orc_1000.db/lineitem/000041_0 pos=728756230 
nextReadPos=728756230 contentLength=739566852 
StreamStatistics{OpenOperations=72, CloseOperations=72, Closed=72, Aborted=0, 
SeekOperations=48, ReadExceptions=0, ForwardSeekOperations=24, 
BackwardSeekOperations=24, BytesSkippedOnSeek=167662, 
BytesBackwardsOnSeek=737556392, BytesRead=244894978, BytesRead excluding 
skipped=244727316, ReadOperations=28457, ReadFullyOperations=0, 
ReadsIncomplete=28217}}
{noformat}


> S3a: Consider reducing the number of connection aborts by setting correct 
> length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, 
> HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch, 
> HADOOP-13203-branch-2-004.patch, stream_stats.tar.gz
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when 
> invoking S3AInputStream::reopen().  As part of lazySeek(), sometimes the 
> stream has to be closed and reopened. But lots of times the stream was closed 
> with abort(), causing the internal HTTP connection to be unusable. This incurs 
> lots of connection-establishment cost in some jobs.  It would be good to set 
> the correct value for the stream length to avoid connection aborts. 
> I will post the patch once the AWS tests pass on my machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
