[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620565#comment-16620565 ]

Tim Robertson edited comment on BEAM-5036 at 9/19/18 1:19 PM:

BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] ) and I'll aim to complete this one in time for 2.8.0

was (Author: timrobertson100):
BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] )

> Optimize FileBasedSink's WriteOperation.moveToOutput()
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Affects Versions: 2.5.0
> Reporter: Jozef Vilcek
> Assignee: Tim Robertson
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> moveToOutput() methods in FileBasedSink.WriteOperation implements move by
> copy+delete. It would be better to use a rename() which can be much more
> effective for some filesystems.
> Filesystem must support cross-directory rename. BEAM-4861 is related to this
> for the case of HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597108#comment-16597108 ]

Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:49 AM:

[~timrobertson100] I guess that if the rename() API is kept defensive (throwing an exception when the target exists), the component invoking it must decide whether it is OK to overwrite the target. In the case of `WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the target from the source data, and I believe it should obey. If that is not desired (e.g. the target dir must be empty), the job should fail early, during launch.

Re-running a batch job is not the only case. In streaming, in case of failure (or upgrade), a job can be started from a checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated. The operator must either delete the target before `rename()` or use `rename(overwrite = true)`, if such a choice existed in the API (delete/overwrite, of course, only when the source is present and there is actually data to be moved).

was (Author: jozovilcek):
[~timrobertson100] I guess that if the rename() API is kept defensive (throwing an exception when the target exists), the component invoking it must decide whether it is OK to overwrite the target. In the case of `WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the target from the source data, and I believe it should obey. If that is not desired (e.g. the target dir must be empty), the job should fail early, during launch.

Re-running a batch job is not the only case. In streaming, in case of failure (or upgrade), a job can be started from a checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated. The operator must either delete the target before `rename()` or use `rename(overwrite = true)`, if such a choice existed in the API.
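The overwrite policy argued for in this comment can be sketched with `java.nio.file` standing in for a Beam filesystem. This is only an illustration under that assumption: `MoveWithOverwrite` and `moveToOutput` are hypothetical names, not Beam APIs, and a local filesystem does not reproduce HDFS, GCS, or S3 behaviour.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveWithOverwrite {

  /**
   * Moves a temp file to its final destination, replacing any existing target.
   * Returns false and leaves the target untouched when the source is missing
   * (for example, because an earlier attempt already moved it).
   */
  public static boolean moveToOutput(Path source, Path target) throws IOException {
    if (!Files.exists(source)) {
      return false; // nothing to move, so do not delete or overwrite the target
    }
    Path parent = target.getParent();
    if (parent != null) {
      Files.createDirectories(parent); // create missing parent directories
    }
    Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
    return true;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("beam-5036-overwrite");
    Path temp = dir.resolve("temp-shard");
    Path out = dir.resolve("out").resolve("shard-0.txt");
    Files.write(temp, "new".getBytes(StandardCharsets.UTF_8));
    Files.createDirectories(out.getParent());
    Files.write(out, "old".getBytes(StandardCharsets.UTF_8));

    System.out.println(moveToOutput(temp, out)); // first attempt moves the file
    System.out.println(moveToOutput(temp, out)); // a retry finds no source
    System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
  }
}
```

Checking for the source first is what makes a re-run after a checkpoint restart harmless in this sketch: a retried move whose source is already gone neither deletes nor replaces the output written by the earlier attempt.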
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597108#comment-16597108 ]

Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:43 AM:

[~timrobertson100] I guess that if the rename() API is kept defensive (throwing an exception when the target exists), the component invoking it must decide whether it is OK to overwrite the target. In the case of `WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the target from the source data, and I believe it should obey. If that is not desired (e.g. the target dir must be empty), the job should fail early, during launch.

Re-running a batch job is not the only case. In streaming, in case of failure (or upgrade), a job can be started from a checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated. The operator must either delete the target before `rename()` or use `rename(overwrite = true)`, if such a choice existed in the API.

was (Author: jozovilcek):
[~timrobertson100] I guess that if the rename() API is kept defensive (throwing an exception when the target exists), the component invoking it must decide whether it is OK to overwrite the target. In the case of `WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the target from the source data, and I believe it should obey. If that is not desired (e.g. the target dir must be empty), the job should fail early, during launch.

Re-running a batch job is not the only case. In streaming, in case of failure, a job can be started from a checkpoint. Depending on checkpoint time and runner guarantees, some operations can be re-run and output files recreated. The operator must either delete the target before `rename()` or use `rename(overwrite = true)`, if such a choice existed in the API.
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596565#comment-16596565 ]

Neville Li edited comment on BEAM-5036 at 8/29/18 4:23 PM:

Yeah, that's what I figured. So there's no way to reduce this overhead on GCS unless GCS starts to support an efficient object {{rename}}.

was (Author: sinisa_lyh):
Yeah, that's why I figured. So there's no way to reduce this overhead on GCS unless GCS starts to support an efficient object {{rename}}.
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596353#comment-16596353 ]

Tim Robertson edited comment on BEAM-5036 at 8/29/18 1:51 PM:

Thanks [~echauchot]

GCS and S3 have no notion of a rename, so {{rename()}} is implemented as copy (overwrite) and delete (see links in comment above).

was (Author: timrobertson100):
Thanks [~echauchot]

GCS and S3 have no notion of a rename; they are copy (overwrite) and delete (see links in comment above).
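A rename emulated as copy (overwrite) followed by delete, as described here for GCS and S3, can be sketched on a local filesystem. This is an illustration only: `CopyDeleteRename` is a hypothetical name, `java.nio.file` stands in for the object-store clients, and nothing below is taken from the Beam `S3FileSystem` or `GcsFileSystem` sources.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyDeleteRename {

  /** Emulates a rename as a copy that overwrites the target, then a delete of the source. */
  public static void rename(Path source, Path target) throws IOException {
    // Step 1: copy the content; an existing target is silently replaced.
    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    // Step 2: remove the source only after the copy has succeeded.
    Files.delete(source);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("beam-5036-copy-delete");
    Path source = dir.resolve("temp-shard");
    Path target = dir.resolve("output-shard");
    Files.write(source, "new".getBytes(StandardCharsets.UTF_8));
    Files.write(target, "old".getBytes(StandardCharsets.UTF_8));

    rename(source, target);
    System.out.println(Files.exists(source));
    System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
  }
}
```

Because the copy step replaces the target, this style of rename never fails on an existing output, which is the silent-overwrite behaviour under discussion in this thread.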
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596336#comment-16596336 ]

Etienne Chauchot edited comment on BEAM-5036 at 8/29/18 1:40 PM:

I have mixed emotions regarding this:
- silent "fail" (or overwrite in that case) is usually bad
- but local filesystems like ext4, for example, silently overwrite the file in that case.
- if distributed filesystems like HDFS tend to fail in that case, maybe people are used to that behavior in this big data ecosystem

=> What is the general consensus among big data filesystem technologies? What is the behavior of GS and S3 in that case? We could test them and then implement what the majority does.

was (Author: echauchot):
I have mixed emotions regarding this:
- silent "fail" (or overwrite in that case) is usually bad
- but local filesystems like ext4, for example, silently overwrite the file in that case.
- if distributed filesystems like HDFS tend to fail in that case, maybe people are used to that behavior in this big data ecosystem

=> What is the general consensus among big data filesystem technologies? What is the behavior of GS in that case? We could test them and then implement what the majority does.
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233 ]

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:20 PM:

The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing, but also surface an exception if the underlying operation reports that it did not complete. This means it will fail with an exception if the target file already exists:

{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code}

The original implementation using copy() would overwrite files without warning. Do we wish to silently overwrite files when issuing a rename()?

I am used to Hadoop operations failing if the output already exists, so for me it is correct to fail if the output exists - I'd rather be forced to delete manually than accidentally be able to overwrite TBs of data.

was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing, but also surface an exception if the underlying operation reports that it did not complete. This means it will fail with an exception if the target file already exists:

{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code}

The original implementation using copy() would overwrite files without warning. Do we wish to silently overwrite files when issuing a rename()?

I am used to Hadoop operations failing if the output already exists, so for me it sounds wrong - I'd rather be forced to delete manually than accidentally be able to overwrite TBs of data.
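The defensive behaviour preferred in this comment (fail rather than overwrite) can be sketched with `java.nio.file`, whose `Files.move` refuses to replace an existing target unless `REPLACE_EXISTING` is passed. This is a local-filesystem illustration, not the `HadoopFileSystem` change from BEAM-4861; `DefensiveRename` and the error message wording are assumptions of the sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DefensiveRename {

  /** Renames source to target, refusing to overwrite an existing target. */
  public static void rename(Path source, Path target) throws IOException {
    try {
      // Without REPLACE_EXISTING, Files.move fails if the target already exists.
      Files.move(source, target);
    } catch (FileAlreadyExistsException e) {
      // Surface the reason instead of a bare "unable to rename" message.
      throw new IOException(
          "Unable to rename " + source + " to " + target + ": target already exists", e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("beam-5036-defensive");
    Path temp = dir.resolve("temp-shard");
    Path out = dir.resolve("output-shard");
    Files.write(temp, "new".getBytes(StandardCharsets.UTF_8));
    Files.write(out, "old".getBytes(StandardCharsets.UTF_8));

    try {
      rename(temp, out);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
    // The existing output is left untouched by the failed rename.
    System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
  }
}
```

Wrapping the exception with an explicit reason addresses the "No further information provided by underlying filesystem" problem visible in the stack trace, at least where the underlying filesystem can tell the caller why the rename failed.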
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233 ]

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:03 PM:

The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing, but also surface an exception if the underlying operation reports that it did not complete. This means it will fail with an exception if the target file already exists:

{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code}

The original implementation using copy() would overwrite files without warning. Do we wish to silently overwrite files when issuing a rename()?

I am used to Hadoop operations failing if the output already exists, so for me it sounds wrong - I'd rather be forced to delete manually than accidentally be able to overwrite TBs of data.

was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now create directories if missing, but also surface an exception if the underlying operation reports that it did not complete. This means it will fail with an exception if the target file already exists:

{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code}

The original implementation using copy() would overwrite files without warning. Do we wish to silently overwrite files when issuing a rename()?
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337 ]

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:58 PM:

For info on the other rename() methods:
* {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
* {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
* {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
* {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now implements {{rename()}} by creating missing parent directories and doing the move

The move across different filesystems is not (fully) supported because {{FileSystems.rename}} gets only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325]. It is not clear to me what might happen if the source were a {{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not currently clear to me where we can best do the check - we could simply log a warning before the call to rename().

was (Author: timrobertson100):
For info on the other rename() methods:
* {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
* {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
* {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
* {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now implements {{rename()}} by creating missing parent directories and doing the move

The move across different filesystems is not (fully) supported because {{FileSystems.rename}} gets only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325]. It is not clear to me what might happen if the source were a {{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not currently clear to me where we can best do the check.
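The "log a warning before the call to rename()" idea can be sketched as a scheme comparison on the source and destination specs. Everything here is hypothetical: `CrossFileSystemCheck`, `schemeOf`, and `sameScheme` are illustrative names rather than Beam APIs, and treating a spec without a scheme as `file` is an assumption of this sketch.

```java
import java.net.URI;
import java.util.Locale;
import java.util.logging.Logger;

public class CrossFileSystemCheck {
  private static final Logger LOG = Logger.getLogger(CrossFileSystemCheck.class.getName());

  /** Returns the lower-cased URI scheme; a missing scheme is treated as "file" (assumption). */
  public static String schemeOf(String spec) {
    String scheme = URI.create(spec).getScheme();
    return scheme == null ? "file" : scheme.toLowerCase(Locale.ROOT);
  }

  /** Returns true when both specs share a scheme; logs a warning and returns false otherwise. */
  public static boolean sameScheme(String source, String target) {
    String sourceScheme = schemeOf(source);
    String targetScheme = schemeOf(target);
    if (!sourceScheme.equals(targetScheme)) {
      LOG.warning(
          "rename() across filesystems: " + sourceScheme + " -> " + targetScheme
              + " (" + source + " to " + target + ")");
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(sameScheme("hdfs://ha-nn/tmp/temp-shard", "hdfs://ha-nn/tmp/output-shard"));
    System.out.println(sameScheme("hdfs://ha-nn/tmp/temp-shard", "gs://bucket/output-shard"));
  }
}
```

A matching scheme is only a necessary condition. It does not settle the open question raised in the comment about a Hadoop filesystem that spans several underlying filesystems, so a check like this can warn about obvious mismatches but cannot prove that a rename stays within one filesystem.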
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337 ]

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:26 PM:

For info on the other rename() methods:
* {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
* {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
* {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
* {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now implements {{rename()}} by creating missing parent directories and doing the move

The move across different filesystems is not (fully) supported because {{FileSystems.rename}} gets only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325]. It is not clear to me what might happen if the source were a {{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not currently clear to me where we can best do the check.

was (Author: timrobertson100):
For info on the other FileSystem rename():
* {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
* {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
* {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
* {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now implements {{rename()}} by creating missing parent directories and doing the move

The move across different filesystems is not (fully) supported because {{FileSystems.rename}} gets only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325]. It is not clear to me what might happen if the source were a {{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not currently clear to me where we can best do the check.
[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()
[ https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593439#comment-16593439 ]

Tim Robertson edited comment on BEAM-5036 at 8/27/18 10:09 AM:

Thanks [~reuvenlax]

1. Adding a cross-FS check seems reasonable as a precaution.
2. Please see [this comment|https://issues.apache.org/jira/browse/BEAM-4861?focusedCommentId=16593406=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593406] on BEAM-4861, where we have a decision to make about the HDFS parent directory not existing. I'd appreciate your and [~JozoVilcek]'s thoughts on that (and others').

was (Author: timrobertson100):
Thanks [~reuvenlax]

1. Adding a cross-FS check seems reasonable as a precaution.
2. Please see this comment on BEAM-4861, where we have a decision to make about the HDFS parent directory not existing. I'd appreciate your and [~JozoVilcek]'s thoughts on that (and others').