[ https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-15192:
------------------------------------
    Fix Version/s: 2.8.0

> S3A listStatus excessively slow -hurts Spark job partitioning
> -------------------------------------------------------------
>
>                 Key: HADOOP-15192
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Michel Lemay
>            Priority: Minor
>             Fix For: 2.8.0
>
>
> Symptoms:
>  - CloudWatch metrics for S3 show an unexpectedly large number of 4xx 
> errors in our bucket.
>  - Recursive file listing performance is abysmal (15 minutes on our 
> bucket, compared to less than 2 minutes using the CLI `aws s3 ls`).
> Analysis:
>  - In the CloudTrail logs for this bucket, we found that the listing 
> generates one 404 (NoSuchKey) error per folder listed recursively.
>  - Spark recursively calls FileSystem::listStatus (the S3AFileSystem 
> implementation from hadoop-aws:2.7.3), which in turn calls getFileStatus 
> to determine whether each path is a directory.
>  - It turns out that this call to getFileStatus yields a 404 when the path 
> is a directory but does not end with a slash. It then retries with the 
> slash appended, incurring one extra, unneeded call to S3 (a sketch of 
> this probe order follows below).
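> A minimal sketch of that probe order, assuming the AWS SDK v1 client that 
> S3A builds on; this is illustrative only, not the actual S3AFileSystem 
> code, and the class and method names here are invented for the example:
> {code:java}
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.model.AmazonS3Exception;
>
> // Illustrative sketch: why probing a directory path without a trailing
> // slash costs one 404 plus one extra HEAD request per folder.
> class DirectoryProbeSketch {
>   static boolean looksLikeDirectory(AmazonS3 s3, String bucket, String key) {
>     // First probe: HEAD the key as given. For a "directory" path this
>     // misses, which S3 reports as a 404 -- the errors seen in CloudTrail.
>     if (headExists(s3, bucket, key)) {
>       return false;                        // plain object, i.e. a file
>     }
>     // Second probe: retry with the trailing slash re-appended. This is
>     // the extra round trip per folder described above.
>     return headExists(s3, bucket, key + "/");
>   }
>
>   private static boolean headExists(AmazonS3 s3, String bucket, String key) {
>     try {
>       s3.getObjectMetadata(bucket, key);   // issues a HEAD request
>       return true;
>     } catch (AmazonS3Exception e) {
>       if (e.getStatusCode() == 404) {
>         return false;
>       }
>       throw e;
>     }
>   }
> }
> {code}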
> Questions:
>  - Why is the trailing slash removed in the first place? (The Hadoop Path 
> class normalizes paths by stripping trailing slashes at construction.)
>  - S3AFileSystem::listStatus needs to know whether the path is a 
> directory. However, when listing files recursively it is a common usage 
> pattern to already have that FileStatus object in hand, so the extra 
> lookup is an unneeded performance penalty. The base FileSystem class 
> could offer an optimized API that exploits this (or the unoptimized call 
> from listLocatedStatus(recursive=true) down to listStatus could be fixed).
>  - I might be wrong on this last bullet, but I believe the S3 object API 
> fetches every object under a prefix (not just the current level) and 
> filters the results. If that is the case, there should be an opportunity 
> for an efficient recursive listStatus implementation for S3 using 
> paginated calls against the top-level folder only (see the sketch after 
> this list).
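> A rough sketch of that flat-listing idea, again assuming the AWS SDK v1 
> client; the class name and the listing loop are illustrative assumptions, 
> not an existing Hadoop or S3A API:
> {code:java}
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.model.ListObjectsRequest;
> import com.amazonaws.services.s3.model.ObjectListing;
> import com.amazonaws.services.s3.model.S3ObjectSummary;
> import java.util.ArrayList;
> import java.util.List;
>
> // Illustrative sketch: one paginated listing with no delimiter returns
> // every key under the prefix, with no per-directory getFileStatus probe.
> class FlatRecursiveListSketch {
>   static List<String> listAllKeys(AmazonS3 s3, String bucket, String prefix) {
>     List<String> keys = new ArrayList<>();
>     ListObjectsRequest request = new ListObjectsRequest()
>         .withBucketName(bucket)
>         .withPrefix(prefix);        // no delimiter => flat, recursive result
>     ObjectListing page = s3.listObjects(request);
>     while (true) {
>       for (S3ObjectSummary summary : page.getObjectSummaries()) {
>         keys.add(summary.getKey());
>       }
>       if (!page.isTruncated()) {
>         break;
>       }
>       page = s3.listNextBatchOfObjects(page);  // next page, up to 1000 keys
>     }
>     return keys;
>   }
> }
> {code}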
>  
> Note: all of this is in the context of Spark jobs reading hundreds of 
> thousands of Parquet files organized and partitioned hierarchically, as 
> recommended. Every time we read the dataset, Spark recursively lists all 
> files and folders to discover the partitions (folder names).
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
