[ https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-15192.
-------------------------------------
    Resolution: Duplicate

> S3A listStatus excessively slow
> -------------------------------
>
>                 Key: HADOOP-15192
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Michel Lemay
>            Priority: Minor
>
> Symptoms:
> - CloudWatch metrics for S3 show an unexpectedly large number of 4xx errors
>   in our bucket.
> - Performance when listing files recursively is abysmal: 15 minutes on our
>   bucket, compared to less than 2 minutes using the CLI `aws s3 ls`.
>
> Analysis:
> - In the CloudTrail logs for this bucket, we found that it generates one
>   404 (NoSuchKey) error per folder listed recursively.
> - Spark recursively calls FileSystem::listStatus (the S3AFileSystem
>   implementation from hadoop-aws:2.7.3), which in turn calls getFileStatus
>   to determine whether each path is a directory.
> - It turns out that this call to getFileStatus yields a 404 when the path
>   is a directory but does not end with a slash. It then retries with the
>   slash appended, incurring one extra, unneeded call to S3.
>
> Questions:
> - Why is the trailing slash removed in the first place? (The Hadoop Path
>   class normalizes paths by stripping trailing slashes on construction.)
> - S3AFileSystem::listStatus needs to know whether the path is a directory.
>   However, it is a common usage pattern to already have that FileStatus
>   object in hand when recursively listing files, so the extra probe is a
>   needless performance penalty. The base FileSystem class could offer an
>   optimized API that exploits this (or fix the unoptimized call from
>   listLocatedStatus(recursive=true) down to listStatus).
> - I might be wrong on this last bullet, but I think the S3 object API
>   fetches every object under a prefix (not just the current level) and
>   filters them out. If that is the case, there should be an opportunity for
>   an efficient recursive listStatus implementation for S3 that uses
>   paginated calls against the top-level folder only.
>
> Note: all of this is in the context of Spark jobs reading hundreds of
> thousands of Parquet files organized and partitioned hierarchically, as
> recommended. Every time we read the data, Spark recursively lists all files
> and folders to discover what the partitions (folder names) are.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
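The double-probe behaviour described in the analysis above can be illustrated with a small self-contained simulation. This is a sketch, not Hadoop's actual code: the bucket contents are made up, and the helper names `head`, `list_prefix`, `get_file_status`, and `list_recursive` are hypothetical stand-ins for S3 HEAD/LIST requests and the S3A probe logic.

```python
# Illustrative simulation (not Hadoop source): an in-memory "bucket" showing
# why per-directory status probes generate one 404 per directory, while a
# single flat LIST over the whole prefix generates none.

BUCKET = {
    "data/part=1/file-a.parquet",
    "data/part=2/file-b.parquet",
    "data/part=2/file-c.parquet",
}

stats = {"head": 0, "404": 0, "list": 0}

def head(key):
    """Simulated HEAD request: succeeds only if the exact key exists."""
    stats["head"] += 1
    if key in BUCKET:
        return True
    stats["404"] += 1          # a directory "path" is not an object key
    return False

def list_prefix(prefix):
    """Simulated LIST request (one page) returning keys under a prefix."""
    stats["list"] += 1
    return sorted(k for k in BUCKET if k.startswith(prefix))

def get_file_status(path):
    """Mimics the probe order described in the report: HEAD the bare path
    first (404s for a directory), then retry it as a directory prefix."""
    if head(path):
        return "file"
    if list_prefix(path + "/"):
        return "dir"
    raise FileNotFoundError(path)

def list_recursive(path):
    """Recursive listing via per-child status probes: one 404 per dir."""
    out = []
    children = sorted({k[len(path) + 1:].split("/", 1)[0]
                       for k in list_prefix(path + "/")})
    for child in children:
        child_path = path + "/" + child
        if get_file_status(child_path) == "file":
            out.append(child_path)
        else:
            out.extend(list_recursive(child_path))
    return out

files = list_recursive("data")
print(files)
print(stats)               # one 404 for each directory probed

# The flat alternative: one paginated LIST of the whole prefix, zero 404s.
flat = list_prefix("data/")
print(flat)
```

In this model every directory visited costs one 404 plus a retry, while the flat listing returns the same keys in a single (paginated) LIST per page with no failed probes, which is consistent with the `aws s3 ls` comparison in the report.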