[ https://issues.apache.org/jira/browse/HADOOP-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760065#comment-16760065 ]
Steve Loughran commented on HADOOP-16090:
-----------------------------------------

bq. Note that issuing a DELETE request without specifying a version ID will always create a new delete marker, even if one already exists (AWS S3 Developer Guide)

Hmmm. This is interesting.

> deleteUnnecessaryFakeDirectories() creates unnecessary delete markers in a versioned S3 bucket
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16090
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16090
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Dmitri Chmelev
>            Priority: Minor
>
> The fix to avoid calls to getFileStatus() for each path component in
> deleteUnnecessaryFakeDirectories() (HADOOP-13164) results in the accumulation
> of delete markers in versioned S3 buckets. That patch replaced the
> getFileStatus() checks with a single batch delete request built from all
> ancestor keys of a given path. Since the delete request does not check
> whether the fake directories exist, it creates a delete marker for every
> path component that did not exist (or was previously deleted). Note that
> issuing a DELETE request without specifying a version ID will always create
> a new delete marker, even if one already exists ([AWS S3 Developer
> Guide|https://docs.aws.amazon.com/AmazonS3/latest/dev/RemDelMarker.html]).
>
> Since deleteUnnecessaryFakeDirectories() is called as a callback on
> successful writes and on renames, delete markers accumulate rather quickly,
> and their rate of accumulation is inversely proportional to the depth of the
> path. In other words, directories closer to the root will have more delete
> markers than the leaves.
>
> This behavior negatively impacts the performance of getFileStatus() when it
> has to issue a listObjects() request (especially v1), because the delete
> markers have to be examined while the request searches for the first
> current, non-deleted version of an object following a given prefix.
>
> I did a quick comparison against 3.x and the issue is still present:
> https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2947
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
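
The accumulation pattern described in the issue can be illustrated with a small toy model. This is only a sketch in Python, not the S3A code or the AWS SDK: `ancestor_keys`, `VersionedBucket`, and `write_file` are hypothetical names, the object keys are made up, and the model assumes fake directory keys end in "/" and that every unversioned DELETE adds exactly one delete marker, as the quoted developer guide sentence states.

```python
from collections import Counter
from typing import List


def ancestor_keys(key: str) -> List[str]:
    """Return the fake-directory keys ("a/", "a/b/", ...) above an object key."""
    parts = key.strip("/").split("/")[:-1]  # drop the final (file) component
    return ["/".join(parts[: i + 1]) + "/" for i in range(len(parts))]


class VersionedBucket:
    """Toy stand-in for a versioned bucket (hypothetical, not an AWS API)."""

    def __init__(self) -> None:
        self.delete_markers = Counter()

    def delete_objects(self, keys: List[str]) -> None:
        # Assumption from the issue text: a DELETE without a version ID
        # always adds a new delete marker, whether or not the key exists
        # or already has a marker.
        for key in keys:
            self.delete_markers[key] += 1


def write_file(bucket: VersionedBucket, key: str) -> None:
    """Model a successful write followed by the unconditional parent cleanup."""
    bucket.delete_objects(ancestor_keys(key))


bucket = VersionedBucket()
for key in [
    "data/2019/01/a.csv",
    "data/2019/01/b.csv",
    "data/2019/02/c.csv",
    "data/2018/12/d.csv",
]:
    write_file(bucket, key)

# Print marker counts per fake directory, shallowest paths first.
for key, count in sorted(bucket.delete_markers.items(), key=lambda kv: kv[0].count("/")):
    print(f"{count:2d}  {key}")
```

In this model four writes leave four markers on `data/`, three on `data/2019/`, and one or two on each leaf directory, which matches the observation that directories closer to the root collect markers fastest.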