[ https://issues.apache.org/jira/browse/BEAM-5036?focusedWorklogId=145771&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-145771 ]
ASF GitHub Bot logged work on BEAM-5036:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Sep/18 19:08
            Start Date: 19/Sep/18 19:08
    Worklog Time Spent: 10m
      Work Description: lukecwik commented on issue #6289: [BEAM-5036] Optimize the FileBasedSink WriteOperation.moveToOutput()
URL: https://github.com/apache/beam/pull/6289#issuecomment-422923094

The design[1] by the original author did define some rename/delete semantics. I believe the original APIs were very strict: they threw errors in many situations and had knobs that required the user of the API to handle more "edge" cases. I filed BEAM-5425[2] because I believe we can make our API significantly easier to use by not being so strict.

We expect that our system will need to be resilient in the case of failure, and I believe rename/delete semantics should make it easier to write such code. For example, if you want to rename a set of files you can't just do:

    // retry up to three times
    for (int i = 0; i < 3; i++) {
      try {
        filesystem.rename(srcs, dests, some set of options);
        return;
      } catch (failure) {
      }
    }

since subsequent calls will fail if any of the files were already renamed. In most cases the user will need to check what was renamed and then fix things up based on how the rename failed. It would be much easier if delete(files)/rename(srcs, dests) didn't need any flags and could be called with the same lists over and over again, with each partial success letting subsequent calls make progress.

I guess that having a FileSystems[3] class with static helpers was meant to address this. Unfortunately I don't have enough time to pay close attention to this space, and the underlying reasons for the choices we made in the past are fleeting.

cc @chamik...@google.com, since he has also had some interest in this space in the past.
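The retry-friendly semantics proposed in the comment above could look roughly like the following. This is only a sketch built on `java.nio.file`, not Beam's `FileSystems` API; the class and method names (`IdempotentRename`, `renameAll`, `renameWithRetry`) are hypothetical. It assumes that a missing source whose destination already exists was moved by an earlier attempt, which is not safe if the destination could already have existed for another reason.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not part of Beam: a rename that can be re-invoked
// with the same lists after a partial failure.
final class IdempotentRename {

  // Renames srcs[i] to dests[i]. Pairs already completed by an earlier
  // (partially successful) call are skipped instead of raising an error.
  static void renameAll(List<Path> srcs, List<Path> dests) throws IOException {
    if (srcs.size() != dests.size()) {
      throw new IllegalArgumentException("srcs and dests must have the same size");
    }
    for (int i = 0; i < srcs.size(); i++) {
      Path src = srcs.get(i);
      Path dest = dests.get(i);
      // Assumption: source gone + destination present means "already renamed".
      if (Files.notExists(src) && Files.exists(dest)) {
        continue;
      }
      // Still throws (NoSuchFileException) if both source and destination are missing.
      Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  // Because renameAll tolerates earlier partial success, the naive retry
  // loop from the comment becomes safe: each attempt only makes progress.
  static void renameWithRetry(List<Path> srcs, List<Path> dests, int maxAttempts)
      throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        renameAll(srcs, dests);
        return;
      } catch (IOException e) {
        last = e; // remember the failure and try again with the same lists
      }
    }
    throw last;
  }
}
```

With this shape, calling `IdempotentRename.renameWithRetry(srcs, dests, 3)` a second time after a full success is a no-op, so the caller does not need to inspect which files were already renamed.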
1: https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
2: https://issues.apache.org/jira/browse/BEAM-5425
3: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java

On Wed, Sep 19, 2018 at 8:53 AM Ismaël Mejía <notificati...@github.com> wrote:

> Interesting question, it is ideal to define and document a clear default
> behavior for rename in all Beam filesystems (since there are no options
> allowed in the API).
> HDFS users probably will expect that the default rename behavior does NOT
> overwrite (as HDFS works), and also because this implies possible data
> loss, but I am not sure if there is a strong reason for other Filesystems
> to do overwrite by default (e.g. Local).
> cc @lukecwik <https://github.com/lukecwik> too for eventual extra
> feedback since the original authors of Beam FileSystems are not in the
> project anymore.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/beam/pull/6289#issuecomment-422856503>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AJnK7AbywAnJ4M-2Gffuaj5lNpWMPh4pks5ucmh-gaJpZM4WQHrB>
> .

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 145771)
    Time Spent: 2.5h  (was: 2h 20m)

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by
> copy+delete. It would be better to use a rename(), which can be much more
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this
> for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
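To illustrate the difference the issue description draws between move-by-rename and move-by-copy+delete, here is a small local-filesystem sketch using `java.nio.file`. It is not the `FileBasedSink` implementation; the class `MoveToOutputSketch` and its method are hypothetical names, and the fallback behaviour is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch, not Beam code: prefer a rename, fall back to copy+delete.
final class MoveToOutputSketch {

  // Moves a temporary file to its final output location.
  // Returns true if a single atomic rename was used, false if the
  // copy+delete fallback was needed.
  static boolean moveToOutput(Path tmp, Path out) throws IOException {
    // A cross-directory rename requires the target directory to exist.
    Path parent = out.getParent();
    if (parent != null) {
      Files.createDirectories(parent);
    }
    try {
      // A rename touches only metadata on most filesystems, so it avoids
      // rewriting the file's bytes.
      Files.move(tmp, out, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (AtomicMoveNotSupportedException e) {
      // Fallback: the copy+delete strategy described in the issue.
      Files.copy(tmp, out, StandardCopyOption.REPLACE_EXISTING);
      Files.delete(tmp);
      return false;
    }
  }
}
```

Whether an existing destination is replaced by the atomic move is implementation specific in `java.nio.file`, which mirrors the overwrite-by-default question raised in the quoted thread.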