[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596233#comment-16596233
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:20 PM:
---------------------------------------------------------------

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, and also surface an exception if the underlying filesystem reports 
that the operation did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-00000-of-00045. No further information provided by underlying filesystem.
        at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
        at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
        at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
        at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 
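
The pattern described above (create missing parent directories, then turn a 
failed rename into an exception) can be sketched roughly as follows. This is 
not the BEAM-4861 code: it uses java.io.File on the local filesystem as a 
stand-in, because File.renameTo() also reports failure only through a boolean 
result. The class name StrictRename is hypothetical.

```java
import java.io.File;
import java.io.IOException;

/** Illustrative only: a local-filesystem stand-in for a boolean-returning rename. */
final class StrictRename {

  /** Renames src to dst, creating dst's parent directories if they are missing. */
  static void rename(File src, File dst) throws IOException {
    File parent = dst.getAbsoluteFile().getParentFile();
    if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
      throw new IOException("Unable to create directory " + parent);
    }
    // renameTo() signals failure only via its return value, so surface a
    // false result as an exception instead of silently dropping it.
    if (!src.renameTo(dst)) {
      throw new IOException(
          "Unable to rename resource " + src + " to " + dst
              + ". No further information provided by underlying filesystem.");
    }
  }
}
```

Whether a rename onto an existing target returns false depends on the 
underlying filesystem, so the sketch only shows how a reported failure is 
surfaced, not when one occurs.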

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so for me it is correct 
to fail if the output exists. I'd rather be forced to delete manually than 
accidentally be able to overwrite TBs of data.
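
To make the two behaviours concrete, here is a minimal sketch of both policies 
on the local filesystem using java.nio.file. It is only an analogy, not Beam or 
HDFS code, and the class and method names (MovePolicies, moveFailIfExists, 
moveOverwrite) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative only: the two possible policies for an existing target. */
final class MovePolicies {

  /** Fails with an IOException (typically FileAlreadyExistsException) if dst exists. */
  static void moveFailIfExists(Path src, Path dst) throws IOException {
    // Without REPLACE_EXISTING, Files.move refuses to clobber an existing target.
    Files.move(src, dst);
  }

  /** Replaces dst without warning if it already exists. */
  static void moveOverwrite(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

With the first policy the existing output and the temporary file are both left 
in place when the move fails, so nothing is lost; with the second the old 
output is gone as soon as the call returns.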


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, and also surface an exception if the underlying filesystem reports 
that the operation did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-00000-of-00045. No further information provided by underlying filesystem.
        at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
        at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
        at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
        at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists so for me it sounds 
wrong - I'd rather be forced to delete manually than accidentally be able to 
overwrite TBs of data.

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implement move by 
> copy+delete. It would be better to use a rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)