[ http://issues.apache.org/jira/browse/HADOOP-341?page=comments#action_12418999 ]
Arun C Murthy commented on HADOOP-341:
--------------------------------------

Forgot to add: the above patch (distcp.patch) significantly refactors CopyFiles.java by providing a base CopyFilesMapper class, which is subclassed by DFSCopyFilesMapper (which contains Milind's existing code) and HttpCopyFilesMapper (for http-based sources). In future we can add other protocols (ftp?) by creating new subclasses (e.g. FtpCopyFilesMapper).

thanks,
Arun

PS: Apologies for the extra spam.

> Enhance distcp to handle *http* as a 'source protocol'.
> -------------------------------------------------------
>
>          Key: HADOOP-341
>          URL: http://issues.apache.org/jira/browse/HADOOP-341
>      Project: Hadoop
>         Type: Improvement
>   Components: util
>     Reporter: Arun C Murthy
>  Attachments: distcp.patch
>
> Requirements:
>   Presently distcp recursively copies a directory from one dfs to another,
>   i.e. both source and destination are of the *dfs* protocol.
>   Enhance it to handle *http* as the source protocol, i.e. support copying
>   files from arbitrary http-based sources into the dfs.
> Design:
>   Follow distcp's current design: one map task per file which needs to be
>   copied.
>   Caveat: distcp handles *recursive* copying by listing sub-directories; this
>   is not as feasible with an http-based source, since things like
>   'fancy-indexing' might not be enabled on the web-server (for all
>   sub-locations recursively too), and even if it is enabled it would mean
>   tedious parsing of the html served to glean the sub-directories etc. Hence
>   the idea is to support an input file (via a -f option) which contains a list
>   of the http-based urls which represent multiple source files.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
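
The mapper hierarchy described in the comment, together with the -f URL list from the design notes, could look roughly like the sketch below. This is not the code in distcp.patch: it leaves out the Hadoop MapReduce API entirely, the copy() method and the mapperFor() and parseUrlList() helpers are hypothetical names, and the one-URL-per-line file format is an assumption, since the issue only says the file "contains a list" of urls. Only the class names CopyFilesMapper, DFSCopyFilesMapper and HttpCopyFilesMapper come from the comment.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the code in distcp.patch. */
public class CopyFilesSketch {

    /** Base class: shared logic would live here, protocol details in subclasses. */
    static abstract class CopyFilesMapper {
        /** Hypothetical hook: copy one source file. Returns a description instead of doing I/O. */
        abstract String copy(URI src, String dstPath);
    }

    /** dfs-to-dfs copy (the existing behaviour, per the comment). */
    static class DFSCopyFilesMapper extends CopyFilesMapper {
        @Override
        String copy(URI src, String dstPath) {
            return "dfs copy " + src + " -> " + dstPath;
        }
    }

    /** http-to-dfs copy (the new source protocol). */
    static class HttpCopyFilesMapper extends CopyFilesMapper {
        @Override
        String copy(URI src, String dstPath) {
            return "http copy " + src + " -> " + dstPath;
        }
    }

    /** Picks a mapper from the URI scheme; a new protocol means one more subclass and one more branch. */
    static CopyFilesMapper mapperFor(URI src) {
        String scheme = src.getScheme();
        if ("dfs".equals(scheme)) {
            return new DFSCopyFilesMapper();
        }
        if ("http".equals(scheme)) {
            return new HttpCopyFilesMapper();
        }
        throw new IllegalArgumentException("Unsupported source protocol: " + scheme);
    }

    /** Parses the contents of a -f style list, assuming one URL per line; blank lines are skipped. */
    static List<URI> parseUrlList(String contents) {
        List<URI> urls = new ArrayList<>();
        for (String line : contents.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(URI.create(trimmed));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder URLs and destination path, for illustration only.
        String list = "http://example.com/data/part-0\n\nhttp://example.com/data/part-1\n";
        for (URI src : parseUrlList(list)) {
            // In distcp each entry would become one map task; here we just print.
            System.out.println(mapperFor(src).copy(src, "/user/example/input"));
        }
    }
}
```

Listing the sources explicitly sidesteps the directory-listing problem noted in the caveat: each line of the file maps to exactly one file to copy, so no html index needs to be parsed.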
