[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597502#comment-16597502
 ] 

Tim Robertson commented on BEAM-5036:
-------------------------------------

Thanks to everyone for contributing to this. [~JozoVilcek] I've come to a 
similar conclusion overnight and think we need to do one of:
 # surface {{FileAlreadyExistsException}} as well as {{FileNotFoundException}} 
from {{FileSystem.rename()}} and let the caller decide (here I presume we would 
opt to overwrite: delete the target only if the source still exists, and then 
retry the rename)
 # document and implement that {{FileSystem.rename()}} will always replace 
existing files for all filesystems
 # expose a {{forceOverwrite}} flag / option and use it here
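
As a rough sketch of what option 1 could look like from the caller's side, here is the delete-and-retry idea written against plain {{java.nio.file}} rather than Beam's filesystem API. The class and method names are hypothetical and nothing here is Beam code; it only relies on {{Files.move()}} throwing {{FileAlreadyExistsException}} when the target exists and {{NoSuchFileException}} when the source is missing.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Hypothetical helper (not part of Beam) illustrating option 1 with java.nio.file. */
final class RenameWithOverwrite {

  /**
   * Moves src to dst. If dst already exists, deletes it and retries once,
   * but only while src is still present.
   */
  public static void moveOverwriting(Path src, Path dst) throws IOException {
    try {
      // No REPLACE_EXISTING option, so an existing dst surfaces as an exception.
      Files.move(src, dst);
    } catch (FileAlreadyExistsException e) {
      if (!Files.exists(src)) {
        // Source vanished in the meantime (e.g. a concurrent retry finished): keep dst.
        return;
      }
      Files.delete(dst);
      Files.move(src, dst);
    } catch (NoSuchFileException e) {
      // Source is missing: treat as "already moved" only if the target is there.
      if (!Files.exists(dst)) {
        throw e;
      }
    }
  }
}
```

The delete followed by the second move is not atomic, so a real implementation would still need to decide what happens if another worker recreates the target in between.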

I propose we open a separate issue to explore optimising rename for GCS. 
I had simply overlooked the rewrite option (sorry, I am not all that familiar 
with GCS).

I still have some concern about overwriting output files that already exist, 
though. If "run 1" produced 45 Avro file parts but "run 2" for some reason 
split differently and produced only 43, wouldn't anything using a glob on the 
directory read incorrect data (i.e. the 2 extra parts left over from run 1)? 
This would be relevant for bounded pipelines, but possibly also for a 
restart / recovery in a streaming scenario?
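
To make the concern concrete, here is a small self-contained sketch using only {{java.nio.file}}. The file naming ({{part-00000.avro}}, without the shard count in the name) and the class are assumptions for illustration, and the files hold a plain text label rather than real Avro data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: two runs with different part counts writing to one directory. */
final class StalePartsDemo {

  /** Writes numParts files part-00000.avro, part-00001.avro, ..., overwriting existing ones. */
  public static void writeRun(Path dir, String runLabel, int numParts) throws IOException {
    for (int i = 0; i < numParts; i++) {
      Path part = dir.resolve(String.format("part-%05d.avro", i));
      // Files.write truncates an existing file by default, i.e. it overwrites.
      Files.write(part, runLabel.getBytes(StandardCharsets.UTF_8));
    }
  }

  /** Returns the run label stored in every file matching the glob part-*.avro. */
  public static List<String> readGlob(Path dir) throws IOException {
    List<String> labels = new ArrayList<>();
    try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*.avro")) {
      for (Path p : parts) {
        labels.add(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
      }
    }
    return labels;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("stale-parts");
    writeRun(dir, "run1", 45); // first run: 45 parts
    writeRun(dir, "run2", 43); // second run: only 43 parts, overwriting the first 43
    List<String> labels = readGlob(dir);
    long stale = labels.stream().filter("run1"::equals).count();
    System.out.println(labels.size() + " parts matched, " + stale + " left over from run 1");
  }
}
```

Under these assumptions the glob still matches 45 files, 2 of which come from run 1, so overwriting alone does not remove the stale parts.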

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
