[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597563#comment-16597563 ]
Reuven Lax commented on BEAM-5036: ---------------------------------- Once you get to the rename step, the set of files to rename should be deterministic. This isn't currently true for the Flink runner (it is for Dataflow) because support for @RequiresStableInput is fully implemented, however without stable input to the rename step many things can go wrong. The Flink implementation of stable input will block the rename step from executing until the snapshot is finalized, which means that a rollback will only rollback that far and not regenerate new output files. This does work in the current Spark runner (I believe) by forcing an RDD checkpoint. Of course if the user manually rerurns a pipeline this can happen. On Thu, Aug 30, 2018 at 7:11 AM Tim Robertson (JIRA) <j...@apache.org> > Optimize FileBasedSink's WriteOperation.moveToOutput() > ------------------------------------------------------ > > Key: BEAM-5036 > URL: https://issues.apache.org/jira/browse/BEAM-5036 > Project: Beam > Issue Type: Improvement > Components: io-java-files > Affects Versions: 2.5.0 > Reporter: Jozef Vilcek > Assignee: Tim Robertson > Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > moveToOutput() methods in FileBasedSink.WriteOperation implements move by > copy+delete. It would be better to use a rename() which can be much more > effective for some filesystems. > Filesystem must support cross-directory rename. BEAM-4861 is related to this > for the case of HDFS filesystem. > Feature was discussed here: > http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)