[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597108#comment-16597108 ]
Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:43 AM:
-------------------------------------------------------------

[~timrobertson100] I guess that if the rename() API stays defensive (throws an exception when the target exists), then the component invoking it must decide whether it is OK to overwrite the target. In the case of `WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the target from source data, and I believe it should obey. If this is not desired (e.g. the target dir must be empty), the job should fail early, during launch.

Re-running a batch job is not the only case. In streaming, after a failure (or an upgrade), a job can be started from a checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated. The operator must either delete the target before `rename()` or use `rename(overwrite = true)`, if such a choice existed in the API.

was (Author: jozovilcek):

[~timrobertson100] I guess if rename() API is kept defensive (throw exception when target exists) that component invoking it must decide it is OK to overwrite target or not. In case of `WriteOperation.moveToOutput()`, it clearly is instructed to reconstruct the target from source data and I believe it should obey. Job should be failed soon, during the launch if this is not desired (e.g. target dir must be empty).

Re-run batch job is not the only case. In streaming, in case of failure job can be start from checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated again. Operator must either delete target before `rename()` or use `rename(overwrite = true)` if such choice would exists in API.
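To make the two options above concrete, here is a minimal sketch that uses plain `java.nio.file` on a local filesystem rather than Beam's own filesystem API. The class `RenameSketch`, the helper `rename(source, target, overwrite)` and its `overwrite` flag are hypothetical names chosen for illustration; nothing here claims that Beam offers such a flag. The `moveToOutput` method below is only an analogy for the behaviour discussed in the comment, not the actual `WriteOperation` code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RenameSketch {

    /**
     * Hypothetical rename helper (not a Beam API).
     * With overwrite == false it stays defensive: Files.move throws
     * FileAlreadyExistsException when the target already exists.
     * With overwrite == true an existing target file is replaced.
     */
    static void rename(Path source, Path target, boolean overwrite) throws IOException {
        if (overwrite) {
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        } else {
            Files.move(source, target);
        }
    }

    /**
     * Analogy for a finalize step that was told to (re)build the target
     * from source data: a leftover target from an earlier attempt is
     * replaced instead of failing the job.
     */
    static void moveToOutput(Path tempFile, Path finalFile) throws IOException {
        rename(tempFile, finalFile, true);
    }

    /** Alternative when only a defensive rename is available: delete first. */
    static void deleteThenRename(Path tempFile, Path finalFile) throws IOException {
        Files.deleteIfExists(finalFile);
        rename(tempFile, finalFile, false);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Path finalFile = Files.writeString(dir.resolve("part-0"), "from first attempt");
        Path tempFile = Files.writeString(dir.resolve("part-0.tmp"), "from re-run");

        // Simulates a re-run after restoring from a checkpoint.
        moveToOutput(tempFile, finalFile);
        System.out.println(Files.readString(finalFile));
    }
}
```

Note that delete-then-rename is not atomic: a failure between the two steps leaves no target at all, which is one reason an overwrite option on the rename itself may be preferable where the filesystem supports it.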
> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by
> copy+delete. It would be better to use a rename(), which can be much more
> effective for some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this
> for the case of the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)