[ https://issues.apache.org/jira/browse/HDFS-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747297#comment-17747297 ]
ASF GitHub Bot commented on HDFS-17120:
---------------------------------------

sadanand48 commented on code in PR #5885:
URL: https://github.com/apache/hadoop/pull/5885#discussion_r1274429488


##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java:
##########
@@ -316,6 +303,42 @@ protected void doBuildListingWithSnapshotDiff(
     }
   }
 
+  /**
+   * Handle create Diffs and add to the copyList.
+   * If the path is a directory, iterate it recursively and add the paths
+   * to the result copyList.
+   *
+   * @param fileListWriter the list for holding processed results
+   * @param context The DistCp context with associated input options
+   * @param sourceRoot The rootDir of the source snapshot
+   * @param sourceFS the source Filesystem
+   * @param fileStatuses store the result fileStatuses to add to the copyList
+   * @param diff the SnapshotDiff report
+   * @throws IOException
+   */
+  protected void addCreateDiffsToFileListing(SequenceFile.Writer fileListWriter,
+      DistCpContext context, Path sourceRoot, FileSystem sourceFS,
+      List<FileStatusInfo> fileStatuses, DiffInfo diff) throws IOException {
+    addToFileListing(fileListWriter, sourceRoot, diff.getTarget(), context);
+
+    FileStatus sourceStatus = sourceFS.getFileStatus(diff.getTarget());

Review Comment:
   I have now changed it to use a flag.


> Support snapshot diff based copylisting for flat paths.
> -------------------------------------------------------
>
>                 Key: HDFS-17120
>                 URL: https://issues.apache.org/jira/browse/HDFS-17120
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sadanand Shenoy
>            Assignee: Sadanand Shenoy
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, for diff-based copyListing, which is used during the distcpSync step
> of an incremental copy, the SimpleCopyListing implementation is
> used by default.
> In its implementation, it iterates through the DiffReport and, if the
> DiffType is Create and the path is a directory, it recursively traverses the
> directory and adds the subpaths to the resultant copyList.
> This works fine for implementations of snapshotDiff that include only
> top-level directories as part of their DiffReport. If a snapshotDiff
> implementation instead outputs flat paths that include both the directory and
> its sub-directory subpaths in the DiffReport, this leads to duplicate paths in
> the copyList and a DuplicateFileException is thrown.
>
> For example, the Ozone filesystem implementation of snapdiff between two
> snapshots shows all subpaths as part of the diff.
> {code:java}
> [~]# ozone sh snapshot create vol11/buck1 snap1
> [~]# ozone sh snapshot create vol11/buck2 snap1
> [~]# ozone fs -mkdir ofs://ozone1/vol11/buck1/dir1
> [~]# ozone fs -mkdir ofs://ozone1/vol11/buck1/dir1/dir11
> [~]# ozone fs -mkdir ofs://ozone1/vol11/buck1/dir1/dir11/dir111
> [~]# ozone sh snapshot create vol11/buck1 snap2
> [~]# ozone sh snapshot diff vol11/buck1 snap1 snap2
> Difference between snapshot: snap1 and snapshot: snap2
> +    ./dir1
> +    ./dir1/dir11
> +    ./dir1/dir11/dir111 {code}
> We can see that even though dir11 and dir111 are subpaths, they are present in
> the snapdiff. This is not the case for HDFS, though.
> This Jira aims to create a new copyListing implementation, used for diff-based
> copyListing, that doesn't traverse the directory but only adds the paths that
> are present in the diff.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
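
To illustrate the duplicate-path problem described in the issue, here is a minimal, self-contained sketch. It does not use the Hadoop or DistCp APIs: the class `FlatDiffListingSketch`, the in-memory `SOURCE_TREE`, and the methods `recursiveListing`, `flatListing` and `hasDuplicates` are all hypothetical names made up for this example. It only assumes that a "Create" diff entry is expanded by walking everything under that path (the recursive strategy), and compares that with adding only the paths that appear in the diff (the flat strategy).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the DistCp implementation.
class FlatDiffListingSketch {

    // Stand-in for the source file system: every path that exists in it.
    static final Set<String> SOURCE_TREE = new TreeSet<>(List.of(
        "dir1", "dir1/dir11", "dir1/dir11/dir111"));

    // Recursive strategy: add each created path, then every descendant
    // found under it in the source tree.
    static List<String> recursiveListing(List<String> createdPaths) {
        List<String> copyList = new ArrayList<>();
        for (String created : createdPaths) {
            copyList.add(created);
            for (String path : SOURCE_TREE) {
                if (path.startsWith(created + "/")) {
                    copyList.add(path);
                }
            }
        }
        return copyList;
    }

    // Flat strategy: trust the diff and add only the paths it reports.
    static List<String> flatListing(List<String> createdPaths) {
        return new ArrayList<>(createdPaths);
    }

    // True if any path appears more than once in the copy list.
    static boolean hasDuplicates(List<String> copyList) {
        return new HashSet<>(copyList).size() != copyList.size();
    }

    public static void main(String[] args) {
        // Diff that reports only the top-level directory.
        List<String> topLevelDiff = List.of("dir1");
        // Diff that reports every created subpath (flat paths).
        List<String> flatDiff =
            List.of("dir1", "dir1/dir11", "dir1/dir11/dir111");

        System.out.println("top-level diff, recursive: "
            + recursiveListing(topLevelDiff));
        System.out.println("flat diff, recursive:      "
            + recursiveListing(flatDiff)
            + " duplicates=" + hasDuplicates(recursiveListing(flatDiff)));
        System.out.println("flat diff, flat:           "
            + flatListing(flatDiff)
            + " duplicates=" + hasDuplicates(flatListing(flatDiff)));
    }
}
```

In this sketch the recursive strategy is correct for a top-level-only diff but lists dir1/dir11 and dir1/dir11/dir111 more than once when the diff is already flat, while the flat strategy lists each reported path once. The sketch only detects the duplicates; it does not model how or where DistCp raises DuplicateFileException.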