Hi,

As I mentioned earlier here:
http://mail-archives.apache.org/mod_mbox/apex-dev/201602.mbox/%3CCAHekGF9xNa6qvvt4ySGBC4SmCN7_Hn2r9rj2SQSV%2BE1Cc5A0fQ%40mail.gmail.com%3E

I am proposing an HDFS file copy module.
The JIRA created for this work is available here:
https://issues.apache.org/jira/browse/APEXMALHAR-2013

Please note that this work is related to, but different from,
https://issues.apache.org/jira/browse/APEXMALHAR-2009 which covers a
concrete operator for writing data to HDFS tuple by tuple.

The main difference is that the file copy module has to retain the block
sequence of each file. Thus, we need to pass on additional information
such as FileMetaData and BlockMetaData from the upstream operator.

Use case
------------
This module can be used with the HDFS input module to copy files from HDFS
to HDFS.
Large files will be copied block by block.

Functionality
-----------------

   1. Write files to HDFS using the FileMetaData, BlockMetaData and
   BlockData emitted by the HDFS input module.
   2. Synchronize block data so that the original block sequence from the
   source is retained.
   3. Support copying multiple files, recursive copy of a directory
   structure, etc.
   4. Provide metrics that summarize the progress of the file copy.
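
To illustrate item 2, here is a minimal sketch of how out-of-order blocks
could be buffered and written in their original sequence. The class name
BlockSequencer, its methods, and the in-memory sink are hypothetical and
only for illustration; they are not part of Malhar or of the proposed
module, and I am assuming each block carries a zero-based index within its
file.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: reassembles the blocks of one file in their
 * original order, even when they arrive out of order.
 */
final class BlockSequencer {
    // Stand-in for the destination file; a real module would write to HDFS.
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // Blocks that arrived before their predecessors, keyed by block index.
    private final Map<Integer, byte[]> pending = new HashMap<>();
    private final int totalBlocks;
    private int nextIndex = 0;

    BlockSequencer(int totalBlocks) {
        this.totalBlocks = totalBlocks;
    }

    /** Accepts one block and flushes every block that is now in sequence. */
    void accept(int blockIndex, byte[] data) {
        pending.put(blockIndex, data);
        byte[] next;
        // Drain buffered blocks for as long as the next expected one is present.
        while ((next = pending.remove(nextIndex)) != null) {
            sink.write(next, 0, next.length);
            nextIndex++;
        }
    }

    /** True once all blocks of the file have been written in order. */
    boolean isComplete() {
        return nextIndex == totalBlocks;
    }

    /** Bytes written so far, in original block order. */
    byte[] contents() {
        return sink.toByteArray();
    }
}
```

With three blocks arriving in the order 2, 0, 1, the sketch writes block 0
as soon as it arrives, holds block 2 until block 1 shows up, and then
flushes both. The trade-off is that early blocks are held in memory until
their predecessors arrive.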

Let me know your thoughts on this. You may post your comments on the JIRA
https://issues.apache.org/jira/browse/APEXMALHAR-2013

~ Yogi
