Mark Christiaens created HDFS-13604:
---------------------------------------
Summary: DistCp filtering conflicts with snapshotting
Key: HDFS-13604
URL: https://issues.apache.org/jira/browse/HDFS-13604
Project: Hadoop HDFS
Issue Type: Bug
Components: distcp
Reporter: Mark Christiaens

DistCp has an option to filter out (not copy) files that match one of the patterns listed in a filters file. DistCp also has options that optimize incremental copying based on snapshots present at the source and target locations. When both options are enabled, files that should be copied from source to target are missing on the target.

To reproduce the issue:
* Create two directories, {{source}} and {{target}}.
* In {{source}}, put two files, {{A}} and {{B}}, with some random content.
* Create a filter file that filters {{A}} (so blocks copying {{A}}).
* Create a snapshot, {{snapshot_old}}, of the {{source}} directory.
* Use {{distcp}} to copy the content of {{source}} to {{target}}.
* As expected, the {{target}} directory will contain only file {{B}}. {{A}} is filtered.
* Take a snapshot of the {{target}} directory, {{snapshot_old}}.
* In the {{source}} directory, rename {{A}} to {{C}}.
* Take a new snapshot of the {{source}} directory, {{snapshot_new}}.
* Now, perform an incremental {{distcp}} copy using the created snapshots so as to optimize the incremental copy process: {{distcp -update -filters filters.txt -diff snapshot_old snapshot_new ... ...}}
* You will find that the file {{C}} (the renamed {{A}}) is not copied to the {{target}} directory.

I suspect that the reason for this is that {{distcp}} concludes, from analyzing the difference between {{snapshot_old}} and {{snapshot_new}} on the source, that {{A}} was renamed to {{C}}. This can be confirmed by using {{snapshotDiff}} to compare the two snapshots: it reports that {{A}} has been renamed to {{C}}. {{distcp}} then seems to assume that the data for {{C}} is already present in the {{target}} directory and only needs to be renamed. However, due to the filtering, {{A}} is _not_ present on the target and cannot be renamed to {{C}}.

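The suspected mechanism can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions, not DistCp's actual code: directories and snapshots are modeled as plain dicts mapping file name to content, the function names ({{full_copy}}, {{snapshot_diff}}, {{diff_sync}}, {{is_filtered}}) are hypothetical, filter patterns are treated as regular expressions, and renames are detected by matching content, whereas HDFS tracks them internally.

```python
import re

def is_filtered(name, filters):
    """True if the name matches any filter pattern (modeled as regexes)."""
    return any(re.search(pattern, name) for pattern in filters)

def full_copy(source, target, filters):
    """Initial copy: transfer every file that no filter pattern excludes."""
    for name, data in source.items():
        if not is_filtered(name, filters):
            target[name] = data

def snapshot_diff(old, new):
    """Toy snapshot diff: identical content under a new name counts as a rename."""
    removed = {n: d for n, d in old.items() if n not in new}
    added = {n: d for n, d in new.items() if n not in old}
    ops = []
    for new_name, data in added.items():
        old_name = next((n for n, d in removed.items() if d == data), None)
        if old_name is not None:
            ops.append(("rename", old_name, new_name))
            del removed[old_name]
        else:
            ops.append(("create", new_name))
    ops.extend(("delete", name) for name in removed)
    return ops

def diff_sync(target, ops, new_snapshot, filters):
    """Hypothesized incremental sync: replay renames and deletes on the target
    and copy data only for files the diff reports as newly created."""
    for op in ops:
        if op[0] == "rename":
            _, old_name, new_name = op
            # Assumption: the data is expected to be on the target already.
            # If the old name was filtered out, there is nothing to rename.
            if old_name in target:
                target[new_name] = target.pop(old_name)
        elif op[0] == "delete":
            target.pop(op[1], None)
        elif op[0] == "create" and not is_filtered(op[1], filters):
            target[op[1]] = new_snapshot[op[1]]

filters = [r"^A$"]                                   # the filter file excludes A
snapshot_old = {"A": "content-a", "B": "content-b"}  # source at snapshot_old
target = {}
full_copy(snapshot_old, target, filters)             # target now holds only B

snapshot_new = {"C": "content-a", "B": "content-b"}  # A was renamed to C
ops = snapshot_diff(snapshot_old, snapshot_new)
diff_sync(target, ops, snapshot_new, filters)

print(ops)             # [('rename', 'A', 'C')]
print(sorted(target))  # ['B']
```

In this model the only operation in the diff is the rename of {{A}} to {{C}}. Because {{A}} never reached the target, the rename has nothing to act on, no data is copied for {{C}}, and nothing signals the omission.
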
Although the final {{distcp}} run fails to create a copy of {{C}} in the {{target}} directory, {{distcp}} does not report any failure, nor can I find any trace of errors in the logs of the jobs that {{distcp}} creates to execute the actual copy.

So, some options:
* Combining {{-diff}} and {{-filters}} could be disallowed.
* {{distcp}} could assume that files that have been filtered are _not_ present and should be replicated in ordinary fashion.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org