Mark Christiaens created HDFS-13604:
---------------------------------------

             Summary: DistCp filtering conflicts with snapshotting
                 Key: HDFS-13604
                 URL: https://issues.apache.org/jira/browse/HDFS-13604
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: distcp
            Reporter: Mark Christiaens


DistCp has an option to filter out (not copy) files whose paths match one of 
the patterns listed in a filter file.  DistCp also has options that optimize 
incremental copying based on snapshots present at the source and target 
locations.  When both options are enabled, files that should be copied from 
source to target are missing on the target.

To reproduce the issue:
 * Create two directories, {{source}} and {{target}}.
 * In {{source}}, put two files, {{A}} and {{B}}, with some random content.
 * Create a filter file whose pattern matches {{A}}, so that {{A}} is excluded 
from the copy.
 * Create a snapshot, {{snapshot_old}}, of the {{source}} directory.
 * Use {{distcp}} with the filter file to copy the content of {{source}} to 
{{target}}.
 * As expected, the {{target}} directory will contain only file {{B}}.  {{A}} 
is filtered.
 * Take a snapshot of the {{target}} directory, also named {{snapshot_old}}.
 * In the {{source}} directory, rename {{A}} to {{C}}.
 * Take a new snapshot of the source directory, {{snapshot_new}}.
 * Now, perform an incremental {{distcp}} copy using the created snapshots so 
as to optimize the incremental copy process: {{distcp -update -filters 
filters.txt -diff snapshot_old snapshot_new ... ...}}
 * You will find that {{C}} (the renamed {{A}}) is not copied to the 
{{target}} directory.
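
The steps above can be written out as a command sequence. The following is a 
minimal sketch that only builds and prints the commands and executes nothing. 
The HDFS paths under {{/tmp/distcp-repro}}, the local files {{A}} and {{B}}, 
and the helper name {{repro_commands}} are assumptions made for illustration. 
It also assumes that {{filters.txt}} is a local file holding one regular 
expression per line (for example {{.*/A}}) and that both directories must be 
made snapshottable before snapshots can be taken.

```python
import shlex


def repro_commands(source="/tmp/distcp-repro/source",
                   target="/tmp/distcp-repro/target",
                   filters_file="filters.txt"):
    """Return the reproduction steps as argv lists (nothing is executed).

    All paths are hypothetical placeholders; adjust them to your cluster.
    """
    return [
        ["hdfs", "dfs", "-mkdir", "-p", source, target],
        # A and B are local files with arbitrary content.
        ["hdfs", "dfs", "-put", "A", "B", source],
        # Snapshots require the directories to be snapshottable.
        ["hdfs", "dfsadmin", "-allowSnapshot", source],
        ["hdfs", "dfsadmin", "-allowSnapshot", target],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_old"],
        # Initial copy: A is filtered, so only B should reach the target.
        ["hadoop", "distcp", "-update", "-filters", filters_file,
         source, target],
        ["hdfs", "dfs", "-createSnapshot", target, "snapshot_old"],
        ["hdfs", "dfs", "-mv", f"{source}/A", f"{source}/C"],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_new"],
        # Shows how HDFS describes the change between the two snapshots.
        ["hdfs", "snapshotDiff", source, "snapshot_old", "snapshot_new"],
        # Incremental copy that combines -filters with -diff.
        ["hadoop", "distcp", "-update", "-filters", filters_file,
         "-diff", "snapshot_old", "snapshot_new", source, target],
        # Inspect the result: the report says C is missing here.
        ["hdfs", "dfs", "-ls", target],
    ]


if __name__ == "__main__":
    for argv in repro_commands():
        print(shlex.join(argv))
```

Printing the commands instead of running them keeps the sketch safe to try 
anywhere; the output can be reviewed and pasted into a shell on a test cluster.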

I suspect that the reason for this is that {{distcp}} concludes from analyzing 
the difference between {{snapshot_old}} and {{snapshot_new}} that {{A}} was 
renamed to {{C}}. This can be confirmed by using {{snapshotDiff}} to compare 
the two snapshots: it reports that {{A}} has been renamed to {{C}}.

{{distcp}} seems to then assume that the data for {{C}} is already present in 
the {{target}} directory and only needs to be renamed.  However, due to the 
filtering, {{A}} is _not_ present on the target and cannot be renamed to 
{{C}}.
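
To make the suspected mechanism concrete, here is a toy model in Python. It is 
not DistCp's code: directories are plain dictionaries, and the function names 
{{is_filtered}}, {{full_copy}} and {{rename_only_sync}} are hypothetical. It 
assumes that a filter excludes a path when the pattern matches the whole path, 
and that the snapshot-based sync replays renames without copying any data.

```python
import re


def is_filtered(path, filters):
    # Assumption: a path is excluded when a pattern matches it in full.
    return any(re.fullmatch(pattern, path) for pattern in filters)


def full_copy(source, filters):
    """Model of the initial copy: everything except filtered paths."""
    return {p: data for p, data in source.items()
            if not is_filtered(p, filters)}


def rename_only_sync(target, renames):
    """Model of the suspected -diff behaviour: replay renames, copy no data."""
    result = dict(target)
    for old, new in renames:
        if old in result:
            result[new] = result.pop(old)
        # If 'old' was never copied, nothing happens and no error is raised.
    return result


filters = ["A"]
source_old = {"A": b"aaa", "B": b"bbb"}
source_new = {"C": b"aaa", "B": b"bbb"}   # A has been renamed to C
renames = [("A", "C")]                    # what the snapshot diff reports

target = full_copy(source_old, filters)   # only B is copied
target = rename_only_sync(target, renames)
print(sorted(target))  # prints ['B']: C is missing although no filter matches it
```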

Although the final {{distcp}} fails to create a copy of {{C}} in the 
{{target}} directory, it does not report any failure, nor can I find any trace 
of errors in the logs of the jobs that {{distcp}} creates to execute the 
actual copy.

So, some options:
 * Combining {{-diff}} and {{-filters}} could be disallowed.
 * {{distcp}} could assume that files that have been filtered are _not_ present 
and should be replicated in ordinary fashion.
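
Both options can be sketched on a toy model in which a directory is a plain 
dictionary. This is an illustration only, not a patch for DistCp: the names 
{{validate_options}} and {{apply_renames}} are hypothetical, and filters are 
assumed to exclude a path when the pattern matches the whole path.

```python
import re


def is_filtered(path, filters):
    # Assumption: a path is excluded when a pattern matches it in full.
    return any(re.fullmatch(pattern, path) for pattern in filters)


def validate_options(argv):
    """Option 1: refuse to combine -diff with -filters."""
    if "-diff" in argv and "-filters" in argv:
        raise ValueError("-diff cannot be combined with -filters")


def apply_renames(target, source_new, renames, filters):
    """Option 2: treat a filtered rename source as absent on the target."""
    result = dict(target)
    for old, new in renames:
        if is_filtered(old, filters) or old not in result:
            # The old path never reached the target, so copy the data
            # in the ordinary way unless the new path is filtered too.
            if not is_filtered(new, filters):
                result[new] = source_new[new]
        else:
            # The old path exists on the target: a rename is enough.
            result[new] = result.pop(old)
    return result


target = {"B": b"bbb"}                    # A was filtered during the first copy
source_new = {"B": b"bbb", "C": b"aaa"}   # A has been renamed to C
fixed = apply_renames(target, source_new, [("A", "C")], ["A"])
print(sorted(fixed))  # prints ['B', 'C']
```

The sketch does not cover a file that is renamed from an unfiltered name to a 
filtered one; a real fix would have to decide what to do in that case.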
