[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-09-19 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620565#comment-16620565
 ] 

Tim Robertson edited comment on BEAM-5036 at 9/19/18 1:19 PM:
--

BEAM-5429 has been created for the GCS implementation (CC [~sinisa_lyh]), and 
I'll aim to complete this one in time for 2.8.0.


was (Author: timrobertson100):
BEAM-5429 is created for the GCS implementation ( CC [~sinisa_lyh] )

> Optimize FileBasedSink's WriteOperation.moveToOutput()
> --
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Affects Versions: 2.5.0
> Reporter: Jozef Vilcek
> Assignee: Tim Robertson
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by 
> copy+delete. It would be better to use rename(), which can be much more 
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to 
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E
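
As a rough illustration of the difference described in the issue, here is a minimal sketch using plain java.nio.file on a local filesystem, not Beam's FileSystems API. The class and method names (MoveStrategies, copyThenDelete, renameInPlace) are hypothetical and are not part of Beam.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not part of Beam.
final class MoveStrategies {

  /** Move implemented as copy+delete: the source contents are written a second time. */
  static void copyThenDelete(Path source, Path target) throws IOException {
    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    Files.delete(source);
  }

  /** Move implemented as a single rename, creating the target directory if it is missing. */
  static void renameInPlace(Path source, Path target) throws IOException {
    Path parent = target.getParent();
    if (parent != null) {
      // A cross-directory rename needs the destination directory to exist.
      Files.createDirectories(parent);
    }
    Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

Both methods leave the same final state; the point of the issue is that, where the filesystem supports it, the second form can avoid rewriting the data.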



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Jozef Vilcek (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597108#comment-16597108
 ] 

Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:49 AM:
-

[~timrobertson100] I guess that if the rename() API is kept defensive (throwing 
an exception when the target exists), then the component invoking it must decide 
whether it is OK to overwrite the target. In the case of 
`WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the 
target from the source data, and I believe it should obey. The job should fail 
early, during launch, if this is not desired (e.g. the target dir must be empty).

Re-running a batch job is not the only case. In streaming, after a failure (or 
an upgrade), a job can be restarted from a checkpoint. Depending on the 
checkpoint time and runner guarantees, some operations can be re-run and output 
files recreated. The operator must either delete the target before `rename()` or 
use `rename(overwrite = true)`, if such an option existed in the API 
(delete/overwrite, of course, only when the source is present and there is 
actually data to be moved).
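
The two behaviours discussed above could look roughly like the following sketch. It uses plain java.nio.file on a local filesystem rather than Beam's FileSystem API, and the names (MoveSketch, moveDefensive, moveOverwriting) are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not part of Beam.
final class MoveSketch {

  /** Defensive move: Files.move without options fails if the target already exists. */
  static void moveDefensive(Path source, Path target) throws IOException {
    Files.move(source, target);
  }

  /**
   * Overwriting move, as a re-run after a failure or checkpoint restore might need.
   * The target is only replaced when the source is actually present.
   *
   * @return true if a file was moved, false if there was nothing to move
   */
  static boolean moveOverwriting(Path source, Path target) throws IOException {
    if (!Files.exists(source)) {
      // Nothing to move (e.g. a previous attempt already finalized it): keep the target.
      return false;
    }
    Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
    return true;
  }
}
```

With this shape, the caller (here standing in for moveToOutput()) chooses between the two, which is the decision the comment argues belongs to the invoking component.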


was (Author: jozovilcek):
[~timrobertson100] I guess if rename() API is kept defensive (throw exception 
when target exists) that component invoking it must decide it is OK to 
overwrite target or not. In case of `WriteOperation.moveToOutput()`, it clearly 
is instructed to reconstruct the target from source data and I believe it 
should obey. Job should be failed soon, during the launch if this is not 
desired (e.g. target dir must be empty).

Re-run batch job is not the only case. In streaming, in case of failure ( or 
upgrade ), job can be started from checkpoint. Depending on checkpoint time and 
runner guarantees, some operations can be re-run and output files recreated 
again. Operator must either delete target before `rename()` or use 
`rename(overwrite = true)` if such choice would exists in API.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-30 Thread Jozef Vilcek (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597108#comment-16597108
 ] 

Jozef Vilcek edited comment on BEAM-5036 at 8/30/18 6:43 AM:
-

[~timrobertson100] I guess that if the rename() API is kept defensive (throwing 
an exception when the target exists), then the component invoking it must decide 
whether it is OK to overwrite the target. In the case of 
`WriteOperation.moveToOutput()`, it is clearly instructed to reconstruct the 
target from the source data, and I believe it should obey. The job should fail 
early, during launch, if this is not desired (e.g. the target dir must be empty).

Re-running a batch job is not the only case. In streaming, after a failure (or 
an upgrade), a job can be restarted from a checkpoint. Depending on the 
checkpoint time and runner guarantees, some operations can be re-run and output 
files recreated. The operator must either delete the target before `rename()` or 
use `rename(overwrite = true)`, if such an option existed in the API.


was (Author: jozovilcek):
[~timrobertson100] I guess if rename() API is kept defensive (throw exception 
when target exists) that component invoking it must decide it is OK to 
overwrite target or not. In case of `WriteOperation.moveToOutput()`, it clearly 
is instructed to reconstruct the target from source data and I believe it 
should obey. Job should be failed soon, during the launch if this is not 
desired (e.g. target dir must be empty).

Re-run batch job is not the only case. In streaming, in case of failure job can 
be start from checkpoint. Depending on checkpoint time and runner guarantees, 
some operations can be re-run and output files recreated again. Operator must 
either delete target before `rename()` or use `rename(overwrite = true)` if 
such choice would exists in API.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Neville Li (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596565#comment-16596565
 ] 

Neville Li edited comment on BEAM-5036 at 8/29/18 4:23 PM:
---

Yeah, that's what I figured. So there's no way to reduce this overhead on GCS 
unless GCS starts to support an efficient object {{rename}}.


was (Author: sinisa_lyh):
Yeah that's why I figured. So there's no way to reduce this overhead on GCS 
unless if GCS starts to support efficient object {{rename}}.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596353#comment-16596353
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 1:51 PM:
--

Thanks [~echauchot]
GCS and S3 have no notion of a rename; in both, rename is implemented as copy 
(overwrite) and delete (see links in comment above).


was (Author: timrobertson100):
Thanks [~echauchot]
Gcs and S3 have no notion of a rename they are copy (overwrite) and delete (see 
links in comment above).



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Etienne Chauchot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596336#comment-16596336
 ] 

Etienne Chauchot edited comment on BEAM-5036 at 8/29/18 1:40 PM:
-

I have mixed feelings about this: 
- a silent "fail" (or overwrite, in this case) is usually bad
- but local filesystems such as ext4 silently overwrite the file in that case
- if distributed filesystems like HDFS tend to fail in that case, maybe people 
are used to that behavior in this big data ecosystem

=> What is the general consensus among big data filesystem technologies? What is 
the behavior of GCS and S3 in that case? We could test them and then implement 
what the majority does.


was (Author: echauchot):
I have mixed emotions regarding this: 
- silent "fail" (or overwrite in that case) is usually bad
- but local filesystems like ext4 for example silently overwrite the file in 
that case.
- if distributed filesystems like HDFS tend to fail in that case, maybe people 
are used to that behavior in this big data echosystem
=> What is the general consensus among big data fs technologies ? What is the 
behavior of GS in that case ? We could test them and then implement what the 
majority says 



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:20 PM:
---

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so to me it is correct 
to fail if the output exists - I'd rather be forced to delete manually than be 
able to accidentally overwrite TBs of data.
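
To make the "surfaces an exception" behaviour concrete, here is a minimal sketch of the idea, not the actual BEAM-4861 change. The BooleanRename interface is a hypothetical stand-in for a filesystem whose rename reports failure through a boolean return value (as Hadoop's FileSystem.rename does); the message text mirrors the stack trace above.

```java
import java.io.IOException;

// Hypothetical sketch, for illustration only; not the code from BEAM-4861.
final class RenameOrThrow {

  /** Stand-in for an underlying filesystem whose rename returns false on failure. */
  interface BooleanRename {
    boolean rename(String source, String destination) throws IOException;
  }

  /** Turns a "false" result into an exception so the failure cannot go unnoticed. */
  static void rename(BooleanRename fs, String source, String destination)
      throws IOException {
    if (!fs.rename(source, destination)) {
      throw new IOException(
          String.format(
              "Unable to rename resource %s to %s. "
                  + "No further information provided by underlying filesystem.",
              source, destination));
    }
  }
}
```

Because the underlying call gives no reason for the failure, a caller cannot tell "target already exists" apart from other causes without checking for the target itself.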


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now creates directories 
if missing, but also surfaces an exception if the underlying operation reports 
the operation did not complete.

This means it will fail with exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists so for me it sounds 
wrong - I'd rather be forced to delete manually than accidentally be able to 
overwrite TBs of data.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-29 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596233#comment-16596233
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:03 PM:
---

The changes (yet to be merged) to rename() in BEAM-4861 now create directories 
if missing, but also surface an exception if the underlying operation reports 
that it did not complete.

This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()? I am used to 
Hadoop operations failing if the output already exists, so to me silently 
overwriting sounds wrong - I'd rather be forced to delete manually than be able 
to accidentally overwrite TBs of data.


was (Author: timrobertson100):
The changes (yet to be merged) to rename() in BEAM-4861 now creates directories 
if missing, but also surfaces an exception if the underlying operation reports 
the operation did not complete.

This means it will fail with exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d to hdfs://ha-nn/tmp/es-2012.txt-0-of-00045. No further information provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code} 

The original implementation using copy() would overwrite files without warning. 

Do we wish to silently overwrite files when issuing a rename()?



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:58 PM:
--

For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
 * {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now 
implements {{rename()}} by creating missing parent directories and then doing 
the move

A move across different filesystems is not (fully) supported because 
{{FileSystems.rename}} looks up only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were a 
{{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not 
currently clear to me where we can best do the check - we could simply log a 
warning before the call to rename().
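
One possible shape for such a check, sketched with plain java.net.URI rather than Beam's ResourceId: compare the scheme and authority of source and destination before renaming. The class and method names are hypothetical, and treating "same scheme and authority" as "same filesystem" is an assumption that may not hold for every Hadoop configuration.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical helper, for illustration only; not part of Beam.
final class CrossFileSystemCheck {

  /**
   * Returns true when both URIs share a scheme (compared case-insensitively)
   * and an authority, e.g. hdfs://ha-nn/a and hdfs://ha-nn/b.
   */
  static boolean sameFileSystem(URI source, URI destination) {
    String sourceScheme = source.getScheme();
    String destinationScheme = destination.getScheme();
    boolean sameScheme =
        sourceScheme == null
            ? destinationScheme == null
            : sourceScheme.equalsIgnoreCase(destinationScheme);
    return sameScheme
        && Objects.equals(source.getAuthority(), destination.getAuthority());
  }
}
```

A caller could log a warning, or fail, when sameFileSystem returns false before handing the pair to rename().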


was (Author: timrobertson100):
For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-28 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595337#comment-16595337
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/28/18 5:26 PM:
--

For info on the other rename() methods:
 * {{S3FileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory if necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164] and then doing a file move
 * {{HadoopFileSystem}}, following BEAM-4861 (fixed and ready to merge), now 
implements {{rename()}} by creating missing parent directories and then doing 
the move

A move across different filesystems is not (fully) supported because 
{{FileSystems.rename}} looks up only the [filesystem for the source resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were a 
{{HadoopFileSystem}}, which itself can span multiple filesystems. It is also not 
currently clear to me where we can best do the check.


was (Author: timrobertson100):
For info on the other FileSystem rename():
 * {{S3FileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java#L597]
 * {{GcsFileSystem}} implements {{rename()}} as a [copy and 
delete|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java#L122]
 * {{LocalFileSystem}} implements {{rename()}} by [making the parent directory 
if 
necessary|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/LocalFileSystem.java#L164]
 and then does a file move
 * {{HDFSFileSystem}} following BEAM-4861 (fixed and ready to merge) now 
implements {{rename()}} by creating missing parent directories and doing the 
move

The move across different filesystems is not (fully) supported because the 
{{FileSystems.rename}} gets only the [filesystem for the source 
resource|https://github.com/apache/beam/blob/release-2.6.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L325].
 It is not clear to me what might happen if the source were an 
{{HDFSFilesystem}} which itself can span multiple Filesystems. It is also not 
currently clear to me where we can best do the check.



[jira] [Comment Edited] (BEAM-5036) Optimize FileBasedSink's WriteOperation.moveToOutput()

2018-08-27 Thread Tim Robertson (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593439#comment-16593439
 ] 

Tim Robertson edited comment on BEAM-5036 at 8/27/18 10:09 AM:
---

Thanks [~reuvenlax]

1. Adding a cross-FS check seems reasonable as a precaution.

2. Please see [this 
comment|https://issues.apache.org/jira/browse/BEAM-4861?focusedCommentId=16593406=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593406]
 on BEAM-4861, where we have a decision to make about the HDFS parent directory 
not existing. I'd appreciate your thoughts and [~JozoVilcek]'s on that (and 
anyone else's).


was (Author: timrobertson100):
Thanks [~reuvenlax]

1. Adding a cross FS check seems reasonable as a precaution.

2. Please see this comment on BEAM-4861 where we have a decision to make on the 
HDFS parent directory not existing. Appreciate your and [~JozoVilcek] thoughts 
on that (and others).
