[ 
https://issues.apache.org/jira/browse/BEAM-5036?focusedWorklogId=145771&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-145771
 ]

ASF GitHub Bot logged work on BEAM-5036:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Sep/18 19:08
            Start Date: 19/Sep/18 19:08
    Worklog Time Spent: 10m 
      Work Description: lukecwik commented on issue #6289: [BEAM-5036] Optimize 
the FileBasedSink WriteOperation.moveToOutput()
URL: https://github.com/apache/beam/pull/6289#issuecomment-422923094
 
 
   The design[1] by the original author did define some rename/delete
   semantics.
   
   I do believe that the original APIs were very strict: they threw errors in
   many situations and had knobs that required the user of the API to handle
   more "edge" cases. I filed BEAM-5425[2] because I believe we can make our
   API significantly easier to use by not being so strict. We expect that our
   system will need to be resilient in the case of failure, and I believe the
   rename/delete semantics should make it easier to write such code. For
   example, if you want to rename a set of files, you can't just do:
   // retry up to three times
   for (int i = 0; i < 3; i++) {
     try {
       filesystem.rename(srcs, dests /*, some set of options */);
       return;
     } catch (IOException e) {
       // swallow the failure and retry
     }
   }
   since subsequent calls will fail if any of the files were already renamed
   by an earlier attempt. In most cases the user needs to check what was
   renamed and then fix things up based on how the rename failed.
   
   It would be much easier if delete(files)/rename(srcs, dests) didn't need
   any flags and could be called with the same lists over and over again,
   with each partial success ensuring that subsequent calls make progress. I
   guess that the FileSystems[3] class with its static helpers was meant to
   address this. Unfortunately, I don't have enough time to pay close
   attention to this space, and my memory of the underlying reasons for the
   choices we made in the past is fading.
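
   As a rough illustration of the repeatable rename described above, here is
   a minimal sketch against local files using java.nio.file rather than
   Beam's FileSystems API. The class IdempotentRename and its methods are
   hypothetical names, and overwriting an existing destination is an
   assumption made for the sketch, not a statement about what Beam does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper for illustration; not part of Beam's FileSystems API. */
final class IdempotentRename {
  private IdempotentRename() {}

  /**
   * Renames srcs.get(i) to dests.get(i). A pair whose source is gone and whose
   * destination exists is assumed to have been renamed by an earlier attempt
   * and is skipped, so the same lists can be passed again after a failure.
   */
  static void rename(List<Path> srcs, List<Path> dests) throws IOException {
    if (srcs.size() != dests.size()) {
      throw new IllegalArgumentException("srcs and dests must have the same size");
    }
    for (int i = 0; i < srcs.size(); i++) {
      Path src = srcs.get(i);
      Path dest = dests.get(i);
      if (Files.notExists(src) && Files.exists(dest)) {
        continue; // already done by a previous, partially successful call
      }
      // Assumption: replacing an existing destination is acceptable here.
      Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  /** Retries with the same lists; each attempt only redoes unfinished pairs. */
  static void renameWithRetries(List<Path> srcs, List<Path> dests, int attempts)
      throws IOException {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        rename(srcs, dests);
        return;
      } catch (IOException e) {
        last = e; // remember the failure and try again
      }
    }
    throw last != null ? last : new IOException("no attempts were made");
  }
}
```

   With this shape, the naive retry loop becomes safe: a call such as
   IdempotentRename.renameWithRetries(srcs, dests, 3) can repeat after a
   partial failure without the caller first working out which files moved.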
   
   cc @chamik...@google.com, since he has also had some interest in this space
   in the past.
   
   1: https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#
   2: https://issues.apache.org/jira/browse/BEAM-5425
   3: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
   
   On Wed, Sep 19, 2018 at 8:53 AM Ismaël Mejía <notificati...@github.com>
   wrote:
   
   > Interesting question. Ideally we would define and document a clear default
   > behavior for rename across all Beam filesystems (since the API allows no
   > options).
   > HDFS users will probably expect the default rename behavior NOT to
   > overwrite (as in HDFS), also because overwriting implies possible data
   > loss, but I am not sure whether there is a strong reason for other
   > filesystems (e.g. Local) to overwrite by default.
   > cc @lukecwik <https://github.com/lukecwik> too for any extra feedback,
   > since the original authors of Beam FileSystems are no longer in the
   > project.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 145771)
    Time Spent: 2.5h  (was: 2h 20m)

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
>                 Key: BEAM-5036
>                 URL: https://issues.apache.org/jira/browse/BEAM-5036
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>    Affects Versions: 2.5.0
>            Reporter: Jozef Vilcek
>            Assignee: Tim Robertson
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to this 
> for the HDFS filesystem.
> The feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
