[ https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415985#comment-15415985 ]
ASF GitHub Bot commented on BEAM-522:
-------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/779

> Update FileSink.finalize_write() to be idempotent
> -------------------------------------------------
>
>                 Key: BEAM-522
>                 URL: https://issues.apache.org/jira/browse/BEAM-522
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>            Assignee: Chamikara Jayalath
>
> Currently FileSink.finalize_write() in fileio.py [1] performs the following
> operations.
> (1) Obtains a list of temporary files as a side input.
> (2) Renames each temporary file to the location where the final output should
> be stored.
>
> The iobase.Sink.finalize_write() operation should be idempotent, since runner
> implementations may call it multiple times due to task failures.
>
> The current implementation is not idempotent: if the operation is re-run after
> only a subset of the files has been renamed, it may fail because some files
> can no longer be found at the source location (for example, [2] for GCS
> files).
>
> We can fix this by checking whether the destination file already exists
> before performing the rename, and skipping the rename for files that are
> already present at the destination.
>
> [1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
> [2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
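
For illustration, the fix proposed in the issue description (skip the rename when the destination already exists) might look roughly like the sketch below. This is not the change made in the pull request: it uses the local filesystem through Python's `os` module instead of the Beam file or GCS APIs, and the function name `finalize_renames` and the dict-shaped input are hypothetical.

```python
import os


def finalize_renames(temp_to_final):
    """Move each temporary file to its final path, tolerating re-runs.

    temp_to_final is a dict mapping temporary path -> final path. This
    input shape is an assumption made for the sketch; the real sink
    receives its temporary files as a side input.
    Returns the list of final paths renamed during this call.
    """
    renamed = []
    for temp_path, final_path in sorted(temp_to_final.items()):
        if os.path.exists(final_path):
            # An earlier attempt already moved this file, so skip it
            # instead of failing on the missing source.
            continue
        # If the source is missing as well, os.rename raises, which
        # surfaces a genuine error rather than hiding it.
        os.rename(temp_path, final_path)
        renamed.append(final_path)
    return renamed
```

Calling the function a second time with the same mapping performs no renames and returns an empty list, which is the idempotent behavior the issue asks for. The existence check and the rename are not atomic, so this sketch assumes that attempts do not run concurrently.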