[ https://issues.apache.org/jira/browse/HADOOP-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626739#action_12626739 ]
Chris Douglas commented on HADOOP-3939:
---------------------------------------
* Would it make sense to require either \-update or \-overwrite if \-delete is
specified? Without either of these options, the semantics are a little
confusing. For example:
** In this case, the copy target (b/a) doesn't exist yet. Everything in b that
isn't from the source is deleted, which seems reasonable.
{noformat}
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42
/user/someuser/a/part-00000
Found 4 items
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 11:46
/user/someuser/b/part-00000
-rw-r--r-- 1 someuser somegroup 105177884 2008-08-28 11:46
/user/someuser/b/part-00001
-rw-r--r-- 1 someuser somegroup 105177754 2008-08-28 11:46
/user/someuser/b/part-00002
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 11:51:18 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 11:51:18 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host/user/someuser/b/part-00000
Deleted hdfs://host/user/someuser/b/part-00001
Deleted hdfs://host/user/someuser/b/part-00002
[snip]
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42
/user/someuser/a/part-00000
Found 2 items
drwxr-xr-x - someuser somegroup 0 2008-08-28 11:51 /user/someuser/b/a
{noformat}
** Here, the copy target (b/a) already exists, but it is deleted and re-copied
anyway, as though \-overwrite were specified.
{noformat}
$ bin/hadoop fs -lsr a b
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42
/user/someuser/a/part-00000
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 11:51
/user/someuser/b/part-00000
-rw-r--r-- 1 someuser somegroup 105177884 2008-08-28 11:51
/user/someuser/b/part-00001
-rw-r--r-- 1 someuser somegroup 105177754 2008-08-28 11:51
/user/someuser/b/part-00002
drwxr-xr-x - someuser somegroup 0 2008-08-28 13:34 /user/someuser/b/a
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 13:34
/user/someuser/b/a/part-00000
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 13:35:14 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 13:35:14 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host:8020/user/someuser/b/part-00000
Deleted hdfs://host:8020/user/someuser/b/part-00001
Deleted hdfs://host:8020/user/someuser/b/part-00002
Deleted hdfs://host:8020/user/someuser/b/a
[snip]
$ bin/hadoop fs -lsr a b
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42
/user/someuser/a/part-00000
drwxr-xr-x - someuser somegroup 0 2008-08-28 13:35 /user/someuser/b/a
-rw-r--r-- 1 someuser somegroup 92934 2008-08-28 13:35
/user/someuser/b/a/part-00000
{noformat}
Making \-delete depend on \-update or \-overwrite would also help prevent
casual errors, and potentially serious mistakes when the Trash is disabled.
* It might help to always throw an IOException whose message says that FsShell
failed, with the original exception set as its cause, rather than:
{noformat}
+ } catch(Exception e) {
+ throw e instanceof IOException? (IOException)e: new IOException(e);
+ }
{noformat}
* When \-delete is specified, the client is doing a lot of work to recursively
list the destination, then to delete individual files there. In the future it
might make sense to leave it to the maps to delete entries, since the source
list is sorted. The client (or a reduce) would have to do some work on the
boundaries, but it should scale well. The current patch is clearer given
distcp's current organization, though.
* The fix to FileStatus makes sense, but when is the Path null?
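The first two suggestions above could look roughly like the sketch below. It is illustrative only and is not taken from the patch: the class name DeleteOptionSketch, both method names, and the message wording are hypothetical, and the real DistCp argument parsing is organized differently.

```java
import java.io.IOException;

/** Hypothetical helper; none of these names exist in DistCp. */
public class DeleteOptionSketch {

    /**
     * Rejects -delete unless -update or -overwrite is also given,
     * so the user has to state how existing destination files are treated.
     */
    static void checkDeleteOption(boolean delete, boolean update,
                                  boolean overwrite) {
        if (delete && !update && !overwrite) {
            throw new IllegalArgumentException(
                "-delete requires either -update or -overwrite");
        }
    }

    /**
     * Always wraps the failure, so the message names FsShell and the
     * original exception stays reachable through getCause().
     */
    static IOException wrapShellFailure(String action, Exception e) {
        return new IOException("FsShell failed while " + action, e);
    }

    public static void main(String[] args) {
        // -delete together with -update passes the check.
        checkDeleteOption(true, true, false);

        // -delete alone is rejected before anything is removed.
        try {
            checkDeleteOption(true, false, false);
        } catch (IllegalArgumentException expected) {
            System.out.println(expected.getMessage());
        }

        IOException wrapped = wrapShellFailure(
            "deleting destination entries", new RuntimeException("boom"));
        System.out.println(wrapped.getMessage());
    }
}
```

Wrapping unconditionally, instead of rethrowing an IOException unchanged, means every failure on this path carries the same context in its message while the stack trace of the underlying problem is still preserved as the cause.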
> DistCp should support an option for deleting non-existing files.
> ----------------------------------------------------------------
>
> Key: HADOOP-3939
> URL: https://issues.apache.org/jira/browse/HADOOP-3939
> Project: Hadoop Core
> Issue Type: New Feature
> Components: tools/distcp
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: 3939_20080825.patch, 3939_20080825b.patch,
> 3939_20080826.patch
>
>
> One use case of DistCp is to sync two directories. Currently, DistCp has an
> -update option for overwriting dst files if src is different from dst.
> However, this is not enough for sync. If some files exist in dst but not in
> src, there is no easy way to delete them. We should add a new option, say
> -delete, so that DistCp will delete the dst files that do not exist in src.