[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597108#comment-16597108
 ] 

Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:43 AM:
-------------------------------------------------------------

[~timrobertson100] I guess if the rename() API is kept defensive (throws an 
exception when the target exists), then the component invoking it must decide 
whether it is OK to overwrite the target. In the case of 
`WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the 
target from the source data, and I believe it should obey. If that is not 
desired (e.g. the target dir must be empty), the job should fail early, during 
launch.
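
A minimal sketch of such a launch-time check, assuming a local directory and 
plain java.nio rather than Beam's own filesystem API. The class and method 
names (`LaunchCheckSketch`, `requireEmptyTarget`) are hypothetical and only 
illustrate the "fail early" idea:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class LaunchCheckSketch {

  // Hypothetical pre-flight check: reject the job before any data is written
  // if the target directory already contains entries.
  static void requireEmptyTarget(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return; // nothing there yet, so nothing can be overwritten
    }
    try (Stream<Path> entries = Files.list(dir)) {
      if (entries.findAny().isPresent()) {
        throw new IllegalStateException("target dir is not empty: " + dir);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("launch-check");
    requireEmptyTarget(dir); // passes: the directory is empty

    Files.createFile(dir.resolve("part-0"));
    try {
      requireEmptyTarget(dir);
    } catch (IllegalStateException e) {
      System.out.println("launch rejected: " + e.getMessage());
    }
  }
}
```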

A re-run batch job is not the only case. In streaming, after a failure (or an 
upgrade), a job can be restarted from a checkpoint. Depending on the checkpoint 
time and runner guarantees, some operations can be re-run and output files 
recreated. The operator must either delete the target before `rename()` or use 
`rename(overwrite = true)`, if such a choice existed in the API.
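
To illustrate the two options, here is a sketch using java.nio's `Files.move` 
as a stand-in for a filesystem rename. This is not Beam's API, and 
`IdempotentMoveSketch` and its method names are made up for the example. 
`Files.move` without options is defensive in the sense above, while 
`StandardCopyOption.REPLACE_EXISTING` plays the role of `overwrite = true`:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class IdempotentMoveSketch {

  // Defensive rename: fails if the target file already exists.
  static void renameDefensive(Path src, Path dst) throws IOException {
    Files.move(src, dst);
  }

  // Option 1: the caller deletes the target, then does a defensive rename.
  static void deleteThenRename(Path src, Path dst) throws IOException {
    Files.deleteIfExists(dst);
    Files.move(src, dst);
  }

  // Option 2: the rename itself is told to overwrite the target.
  static void renameOverwrite(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }

  private static Path write(Path p, String content) throws IOException {
    return Files.write(p, content.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("move-sketch");
    // Output left behind by an earlier attempt, before the job was restarted.
    Path dst = write(dir.resolve("output-00000"), "stale");

    try {
      renameDefensive(write(dir.resolve("tmp-1"), "replay 1"), dst);
    } catch (FileAlreadyExistsException e) {
      System.out.println("defensive rename refused to overwrite the target");
    }

    deleteThenRename(write(dir.resolve("tmp-2"), "replay 2"), dst);
    renameOverwrite(write(dir.resolve("tmp-3"), "replay 3"), dst);
    System.out.println(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8));
  }
}
```

Note that option 1 is two separate steps, so a failure between the delete and 
the rename leaves no target at all until the operation is retried.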


> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement the move 
> as copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



