[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596831#comment-16596831
 ] 

Tim Robertson commented on BEAM-5036:
-------------------------------------

[~reuvenlax], I believe those scenarios are covered, as you describe. I closed 
the PR today because I uncovered one other scenario by running a simple 
Avro-to-Avro file conversion job twice.

What should happen on filesystems that support an atomic rename() when the 
destination files already exist before the job starts? Beam 2.6.0 simply 
overwrites them on all filesystems. If we change to use rename(), then GCS and 
S3 will still overwrite, while HDFS would either throw an exception, if we 
merge the [PR for 5036|https://github.com/apache/beam/pull/6285], or do nothing 
(which is clearly wrong) if we do not.
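
To make the two behaviours concrete, here is a minimal sketch that uses 
java.nio on a local filesystem as a stand-in. It does not call the Beam 
FileSystem API or any GCS, S3 or HDFS client; the class, enum and method names 
are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustration only: local java.nio stand-in, not the Beam FileSystem API.
public class RenameSemanticsSketch {

  // Hypothetical outcome labels for this sketch.
  enum Outcome { MOVED, REJECTED }

  // "Overwrite" policy: an existing destination is silently replaced.
  static Outcome moveOverwriting(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
    return Outcome.MOVED;
  }

  // "Strict" policy: an existing destination is left untouched and reported.
  static Outcome moveStrict(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst); // fails if dst already exists
      return Outcome.MOVED;
    } catch (FileAlreadyExistsException e) {
      return Outcome.REJECTED;
    }
  }

  static Path write(Path p, String s) throws IOException {
    return Files.write(p, s.getBytes(StandardCharsets.UTF_8));
  }

  static String read(Path p) throws IOException {
    return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("rename-sketch");
    // Output left behind by a first run of the job.
    Path dst = write(dir.resolve("output.avro"), "first run");

    // Second run, strict policy: the earlier output survives.
    Path tmp = write(dir.resolve("temp-1"), "second run");
    System.out.println(moveStrict(tmp, dst) + " -> " + read(dst));

    // Second run, overwrite policy: the earlier output is replaced.
    System.out.println(moveOverwriting(tmp, dst) + " -> " + read(dst));
  }
}
```

Whichever policy is chosen, the point is that it should be the same on every 
filesystem and stated in the rename() contract.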

The Beam FileSystem [rename() 
documentation|https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java#L108] 
doesn't cover this scenario, and I can't find it mentioned anywhere in the 
FileBasedSink docs either.

MapReduce and Spark jobs (on HDFS at least) fail at startup in this scenario, 
presumably to prevent existing data from being overwritten inadvertently. I 
think that is the better behaviour, but it is not what Beam has done to date.
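
A fail-at-startup check of that kind could look roughly like the sketch below. 
It is again written against local java.nio paths, not against Beam, MapReduce 
or Spark code; OutputPreflightSketch and requireAbsent are hypothetical names, 
and this is not a claim about how those frameworks implement their check.

```java
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Illustration only: a pre-flight check run before any data is written.
public class OutputPreflightSketch {

  // Hypothetical helper: throws if any planned destination already exists.
  static void requireAbsent(List<Path> destinations)
      throws FileAlreadyExistsException {
    for (Path dst : destinations) {
      if (Files.exists(dst)) {
        throw new FileAlreadyExistsException(
            dst.toString(), null, "output already exists; refusing to start");
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Treat the command-line arguments as the planned output files.
    Path[] planned = new Path[args.length];
    for (int i = 0; i < args.length; i++) {
      planned[i] = Paths.get(args[i]);
    }
    requireAbsent(Arrays.asList(planned));
    System.out.println("No existing outputs found; safe to start.");
  }
}
```

A check like this is not atomic with the later writes, so it only guards 
against the common case of re-running a job over an earlier run's output.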

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() method in FileBasedSink.WriteOperation implements the move 
> as copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)