[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597563#comment-16597563
 ] 

Reuven Lax commented on BEAM-5036:
----------------------------------

Once you get to the rename step, the set of files to rename should be
deterministic. This isn't currently true for the Flink runner (it is for
Dataflow) because support for @RequiresStableInput is not yet fully
implemented there; without stable input to the rename step, many things
can go wrong. The Flink implementation of stable input will block the
rename step from executing until the snapshot is finalized, which means
that a rollback will only roll back that far and will not regenerate new
output files. This does work in the current Spark runner (I believe) by
forcing an RDD checkpoint.

Of course, if the user manually reruns a pipeline, this can happen.
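
To illustrate why a stable input to the rename step matters, here is a
minimal sketch in plain java.nio (not Beam code). It assumes a retry is
handed exactly the same temp-to-final mapping as the first attempt, in
which case the step can be written to be idempotent. The names
IdempotentRename and renameAll are hypothetical and exist only for this
example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

final class IdempotentRename {
    /**
     * Renames every temp file to its final name. Re-running with the SAME
     * mapping is harmless; a different mapping on retry would leave
     * duplicate or orphaned output, which is the failure mode discussed above.
     */
    static void renameAll(Map<Path, Path> tempToFinal) throws IOException {
        for (Map.Entry<Path, Path> entry : tempToFinal.entrySet()) {
            Path src = entry.getKey();
            Path dst = entry.getValue();
            if (Files.exists(src)) {
                // First attempt, or a retry that got interrupted before this file.
                Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
            } else if (!Files.exists(dst)) {
                // Neither side exists: the mapping does not match what was written.
                throw new IOException("Neither " + src + " nor " + dst + " exists");
            }
            // Otherwise src is gone and dst is present: an earlier attempt
            // already renamed this file, so there is nothing to do.
        }
    }
}
```

Calling renameAll twice with the same map leaves a single final file per
entry. If the map changed between attempts (non-deterministic input), the
second call would either fail or produce extra outputs.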

On Thu, Aug 30, 2018 at 7:11 AM Tim Robertson (JIRA) <j...@apache.org>



> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move as
> copy+delete. It would be better to use rename(), which can be much more
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this
> for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E
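
The rename-versus-copy+delete trade-off in the description above can be
sketched with the JDK's own file API. This is only an illustration under
the assumption of a local java.nio filesystem; it is not Beam's
FileSystems API or the actual FileBasedSink code, and the class name
MoveToOutputSketch and its method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class MoveToOutputSketch {
    /**
     * Tries a rename first and falls back to copy+delete when the
     * filesystem cannot rename atomically (for example across stores).
     * Returns true if the rename path was taken.
     */
    static boolean moveToOutput(Path src, Path dst) throws IOException {
        try {
            // A rename is typically a metadata-only operation.
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback: copies every byte, then removes the source.
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(src);
            return false;
        }
    }
}
```

Either branch leaves the data at dst and removes src; the difference is
cost, since the fallback is proportional to file size.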



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
