[ 
https://issues.apache.org/jira/browse/HADOOP-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626739#action_12626739
 ] 

Chris Douglas commented on HADOOP-3939:
---------------------------------------

* Would it make sense to require either \-update or \-overwrite if \-delete is 
specified? Without either of these options, the semantics are a little 
confusing. For example:
** In this case, the destination (b/a) doesn't exist yet. Everything in b that 
isn't from the source is deleted, which seems reasonable.
{noformat}
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
Found 4 items
-rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:46 /user/someuser/b/part-00000
-rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:46 /user/someuser/b/part-00001
-rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:46 /user/someuser/b/part-00002
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 11:51:18 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 11:51:18 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host/user/someuser/b/part-00000
Deleted hdfs://host/user/someuser/b/part-00001
Deleted hdfs://host/user/someuser/b/part-00002
[snip]
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
Found 2 items
drwxr-xr-x   - someuser somegroup          0 2008-08-28 11:51 /user/someuser/b/a
{noformat}
** Here, the destination (b/a) already exists, but it is deleted anyway, as 
though \-overwrite were specified.
{noformat}
$ bin/hadoop fs -lsr a b
-rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
-rw-r--r--   1 someuser somegroup  105177784 2008-08-28 11:51 /user/someuser/b/part-00000
-rw-r--r--   1 someuser somegroup  105177884 2008-08-28 11:51 /user/someuser/b/part-00001
-rw-r--r--   1 someuser somegroup  105177754 2008-08-28 11:51 /user/someuser/b/part-00002
drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:34 /user/someuser/b/a
-rw-r--r--   1 someuser somegroup  105177784 2008-08-28 13:34 /user/someuser/b/a/part-00000
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 13:35:14 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 13:35:14 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host:8020/user/someuser/b/part-00000
Deleted hdfs://host:8020/user/someuser/b/part-00001
Deleted hdfs://host:8020/user/someuser/b/part-00002
Deleted hdfs://host:8020/user/someuser/b/a
[snip]
$ bin/hadoop fs -lsr a b
-rw-r--r--   1 someuser somegroup      92934 2008-08-11 21:42 /user/someuser/a/part-00000
drwxr-xr-x   - someuser somegroup          0 2008-08-28 13:35 /user/someuser/b/a
-rw-r--r--   1 someuser somegroup      92934 2008-08-28 13:35 /user/someuser/b/a/part-00000
{noformat}

Requiring \-update or \-overwrite alongside \-delete would also help prevent 
casual errors, which can become serious mistakes if the Trash is disabled.
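
The dependency proposed above could be enforced when the arguments are parsed. The following is only a minimal sketch, assuming a hypothetical Option enum and validate method; neither name is taken from the patch or from DistCp itself.

```java
import java.util.EnumSet;

final class DeleteOptionCheck {
  // Hypothetical stand-in for DistCp's option flags; not the actual DistCp type.
  enum Option { DELETE, UPDATE, OVERWRITE }

  /** Rejects -delete unless -update or -overwrite is also present. */
  static void validate(EnumSet<Option> flags) {
    if (flags.contains(Option.DELETE)
        && !flags.contains(Option.UPDATE)
        && !flags.contains(Option.OVERWRITE)) {
      throw new IllegalArgumentException(
          "-delete must be specified together with -update or -overwrite");
    }
  }

  public static void main(String[] args) {
    // Accepted: -delete combined with -update.
    validate(EnumSet.of(Option.DELETE, Option.UPDATE));
    // Rejected: -delete on its own.
    try {
      validate(EnumSet.of(Option.DELETE));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing before any listing or deletion starts means a mistyped command cannot remove anything from the destination.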
* It might help to always add a message saying that FsShell failed, and to set 
the original exception as the cause, rather than:
{noformat}
+            } catch(Exception e) {
+              throw e instanceof IOException? (IOException)e: new IOException(e);
+            }
{noformat}
* When \-delete is specified, the client is doing a lot of work to recursively 
list the destination, then to delete individual files there. In the future it 
might make sense to leave it to the maps to delete entries, since the source 
list is sorted. The client (or a reduce) would have to do some work on the 
boundaries, but it should scale well. The current patch is clearer given 
distcp's current organization, though.
* The fix to FileStatus makes sense, but when is the Path null?
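
For the FsShell failure handling suggested above, one possible shape is to wrap every failure in an IOException that names the failing step and carries the original exception as its cause. This is only a sketch: the class name, the runDelete method, and the use of a Callable in place of the real FsShell call are assumptions for illustration, not code from the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class FsShellFailure {
  /**
   * Runs a delete command and reports any failure as an IOException that
   * says FsShell failed and keeps the original exception as its cause.
   * The Callable is a hypothetical stand-in for the FsShell invocation.
   */
  static void runDelete(String path, Callable<Integer> shellCommand)
      throws IOException {
    try {
      shellCommand.call();
    } catch (Exception e) {
      // Always add context; the original exception stays reachable via getCause().
      throw new IOException("FsShell failed to delete " + path, e);
    }
  }

  public static void main(String[] args) {
    try {
      runDelete("/user/someuser/b/part-00000", () -> {
        throw new IllegalStateException("simulated failure");
      });
    } catch (IOException e) {
      System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
    }
  }
}
```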
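
The idea of exploiting the sorted source list can be illustrated with a single merge pass over two sorted listings. This is a rough sketch under stated assumptions: both listings are sorted lists of paths relative to their roots, the class and method names are hypothetical, and the boundary handling between map splits mentioned above is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortedDeleteCandidates {
  /**
   * Returns the entries of sortedDst that do not appear in sortedSrc.
   * Both inputs must be sorted in natural String order.
   */
  static List<String> missingFromSource(List<String> sortedSrc,
                                        List<String> sortedDst) {
    List<String> toDelete = new ArrayList<>();
    int s = 0;
    for (String dst : sortedDst) {
      // Skip source entries that sort before the current destination entry.
      while (s < sortedSrc.size() && sortedSrc.get(s).compareTo(dst) < 0) {
        s++;
      }
      // No matching source entry: the destination entry is a delete candidate.
      if (s == sortedSrc.size() || !sortedSrc.get(s).equals(dst)) {
        toDelete.add(dst);
      }
    }
    return toDelete;
  }

  public static void main(String[] args) {
    List<String> src = Arrays.asList("a/part-00000");
    List<String> dst = Arrays.asList(
        "a/part-00000", "part-00000", "part-00001", "part-00002");
    System.out.println(missingFromSource(src, dst));
  }
}
```

Because each list is read once and in order, a contiguous range of the destination listing could in principle be handled independently of the others, which is what would let the work move from the client into the maps.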

> DistCp should support an option for deleting non-existing files.
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3939
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: tools/distcp
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 3939_20080825.patch, 3939_20080825b.patch, 
> 3939_20080826.patch
>
>
> One use case of DistCp is to sync two directories.  Currently, DistCp has an 
> -update option for overwriting dst files if src is different from dst.  
> However, that is not enough for a sync.  If some files exist in dst but not 
> in src, there is no easy way to delete them.  We should add a new option, 
> say -delete, so that DistCp deletes the files in dst that do not exist in src.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
