Hi,

As I mentioned earlier here:
http://mail-archives.apache.org/mod_mbox/apex-dev/201602.mbox/%3CCAHekGF9xNa6qvvt4ySGBC4SmCN7_Hn2r9rj2SQSV%2BE1Cc5A0fQ%40mail.gmail.com%3E

I am proposing an HDFS file copy module. The JIRA created for this work is available here:
https://issues.apache.org/jira/browse/APEXMALHAR-2013

Please note that this work is related to, but different from, the work tracked in https://issues.apache.org/jira/browse/APEXMALHAR-2009 which covers a concrete operator for writing data to HDFS tuple by tuple. The main difference is that, in the case of the file copy module, the block sequence of a file has to be retained. Thus, we need to pass on additional information such as FileMetaData and BlockMetaData from the upstream operator.

Use case
------------
This module can be used with the HDFS input module to copy files from HDFS to HDFS. Large files will be copied block by block.

Functionality
-----------------
1. Writing files to HDFS using the FileMetaData, BlockMetaData, and BlockData emitted by the HDFS input module.
2. Block data has to be synchronized to retain the original sequence from the source.
3. Support for copying multiple files, recursive copy of a directory structure, etc.
4. Metrics for summary information on the progress of the file copy.

Let me know your thoughts on this. You may post your comments on the JIRA:
https://issues.apache.org/jira/browse/APEXMALHAR-2013

~ Yogi
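
P.S. To make item 2 of the Functionality list a bit more concrete, here is a rough sketch of how blocks arriving out of order could be buffered and written in source order. This is only an illustration in plain Java, not the proposed implementation: BlockSequencer, its fields, and the use of an integer block index are all hypothetical, and it does not use the actual FileMetaData / BlockMetaData classes or any Apex or HDFS API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Malhar): writes the blocks of one file in source order. */
class BlockSequencer {
    private final OutputStream out;   // destination stream for the copied file
    private final int totalBlocks;    // number of blocks expected for this file
    private final Map<Integer, byte[]> pending = new HashMap<>(); // blocks that arrived early
    private int nextIndex = 0;        // index of the next block that may be written

    BlockSequencer(OutputStream out, int totalBlocks) {
        this.out = out;
        this.totalBlocks = totalBlocks;
    }

    /** Accepts a block in any arrival order and writes every block that is now in sequence. */
    void accept(int blockIndex, byte[] data) throws IOException {
        if (blockIndex < nextIndex || blockIndex >= totalBlocks || pending.containsKey(blockIndex)) {
            throw new IllegalArgumentException("unexpected block index " + blockIndex);
        }
        pending.put(blockIndex, data);
        byte[] next;
        // Drain the buffer for as long as the next expected block is available.
        while ((next = pending.remove(nextIndex)) != null) {
            out.write(next);
            nextIndex++;
        }
    }

    /** True once every expected block has been written to the destination. */
    boolean isComplete() {
        return nextIndex == totalBlocks;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream dest = new ByteArrayOutputStream();
        BlockSequencer seq = new BlockSequencer(dest, 3);
        // Blocks arrive out of order; the destination still receives them in order.
        seq.accept(2, "C".getBytes(StandardCharsets.UTF_8));
        seq.accept(0, "A".getBytes(StandardCharsets.UTF_8));
        seq.accept(1, "B".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(dest.toByteArray(), StandardCharsets.UTF_8)
                + " complete=" + seq.isComplete());
    }
}
```

In this sketch a block that arrives early is held in memory until all of its predecessors have been written. For large files a real module would presumably need to bound that buffer or stage blocks elsewhere; that choice is left open here.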
