[ https://issues.apache.org/jira/browse/HADOOP-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299060#comment-14299060 ]
Chris Nauroth commented on HADOOP-11281:
----------------------------------------

HADOOP-11525 discusses a fix for the same problem. Since that one already has some design discussion and a proposed patch, I've decided to close HADOOP-11281 as a duplicate.

> Add flag to fs.shell to skip _COPYING_ file
> -------------------------------------------
>
>                 Key: HADOOP-11281
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/s3
>       Environment: Hadoop 2.2, but the behavior is present in all versions.
>                    AWS EMR 3.0.4
>            Reporter: Corby Wilson
>            Priority: Critical
>
> Amazon S3 does not have a rename operation.
> When you use the hadoop shell or distcp, Hadoop first uploads the file under a name with the ._COPYING_ suffix, then renames it to the final output path.
> Code (org/apache/hadoop/fs/shell/CommandWithDestination.java):
>     PathData tempTarget = target.suffix("._COPYING_");
>     targetFs.setWriteChecksum(writeChecksum);
>     targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
>     targetFs.rename(tempTarget, target);
> The problem is that on rename we actually have to download the file again (through an InputStream) and then upload it again.
> For very large files (>= 5 GB) we have to use multipart upload.
> So if we are processing several TB of multi-GB files, we are actually writing each file to S3 twice and reading it from S3 once.
> It would be nice to have a flag or core-site.xml setting that tells Hadoop to skip the temporary copy and just write the file once.
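For illustration only, a rough sketch of how a skip-temp-file option could look in the shell copy path. This is not the HADOOP-11525 patch; the configuration key fs.shell.direct.write and the helper method below are invented for this example, while Path.suffix, FileSystem.create, FileSystem.rename, and IOUtils.copyBytes are existing Hadoop APIs.

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CopyToTargetSketch {

      /**
       * Writes 'in' to 'target'. When the hypothetical fs.shell.direct.write
       * setting is true, the stream goes straight to the final path; otherwise
       * it follows the current behavior of writing target._COPYING_ and renaming.
       */
      static void copyStreamToTarget(InputStream in, FileSystem targetFs, Path target,
                                     Configuration conf) throws IOException {
        boolean directWrite = conf.getBoolean("fs.shell.direct.write", false);

        if (directWrite) {
          // One upload, no rename. On S3 this avoids the second copy entirely.
          IOUtils.copyBytes(in, targetFs.create(target), 4096, true);
        } else {
          // Current behavior: write to target._COPYING_, then rename. On S3 the
          // rename is implemented as copy + delete, so large objects are written twice.
          Path tempTarget = target.suffix("._COPYING_");
          IOUtils.copyBytes(in, targetFs.create(tempTarget), 4096, true);
          if (!targetFs.rename(tempTarget, target)) {
            throw new IOException("rename failed: " + tempTarget + " -> " + target);
          }
        }
      }
    }

The trade-off is that a direct write loses the publish-by-rename behavior, so readers could observe a partially written file; that is presumably why the shell writes to the temporary name in the first place, and why any such setting would need to be opt-in.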