[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2018-07-24 Thread Jozef Vilcek (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554249#comment-16554249
 ] 

Jozef Vilcek commented on BEAM-3268:


I believe I am facing this issue with windowed writes on Flink runner with Beam 
2.5.0. Accessing files announced by FileIO write results in map function fails 
from time to time and get's working by spinning some wait time.

Can it be that bug is still present for some cases?

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Eugene Kirpichov
>Priority: Major
> Fix For: 2.5.0
>
> Attachments: comparison.png
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2018-04-18 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442731#comment-16442731
 ] 

Eugene Kirpichov commented on BEAM-3268:


They are, if it's fused with another ParDo. c.output(x) passes x through the 
whole fusion tree rather than buffering it until the current ProcessElement is 
complete.

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Eugene Kirpichov
>Priority: Major
> Attachments: comparison.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2018-04-17 Thread Reuven Lax (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441270#comment-16441270
 ] 

Reuven Lax commented on BEAM-3268:
--

[~jkff] I don't understand your comment. Outputs are not sent to receivers 
until the ParDo has finished processing data, which is after the file copy.

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Eugene Kirpichov
>Priority: Major
> Attachments: comparison.png
>
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2017-11-28 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269877#comment-16269877
 ] 

Eugene Kirpichov commented on BEAM-3268:


cc: [~reuvenlax]

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Reuven Lax
> Attachments: comparison.png
>
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2017-11-28 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269876#comment-16269876
 ] 

Eugene Kirpichov commented on BEAM-3268:


Yeah this is a bug, because the transforms that produce 
perDestinationOutputFilenames produce them before the files are actually 
copied, e.g. 
https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L942

One fix is to reorder that code (and the respective code in 
FinalizeWindowedFn). Another fix is to insert a reshuffle somewhere around 
https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L816
 , which is less brittle - I would prefer the latter.

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Reuven Lax
> Attachments: comparison.png
>
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner

2017-11-28 Thread Chamikara Jayalath (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269746#comment-16269746
 ] 

Chamikara Jayalath commented on BEAM-3268:
--

cc: [~jkff]

> getPerDestinationOutputFilenames() is getting processed before write is 
> finished on dataflow runner
> ---
>
> Key: BEAM-3268
> URL: https://issues.apache.org/jira/browse/BEAM-3268
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.3.0
>Reporter: Kamil Szewczyk
>Assignee: Chamikara Jayalath
> Attachments: comparison.png
>
>
> While running filebased-io-test we found dataflow-runnner misbehaving. We run 
> tests using single pipeline and without using Reshuffling between writing and 
> reading dataflow jobs are unsuccessful because the runner tries to access the 
> files that were not created yet. 
> On the picture the difference between execution of writting is presented. On 
> the left there is working example with Reshuffling added and on the right 
> without it.
> !comparison.png|thumbnail!
> Steps to reproduce: substitute your-bucket-name wit your valid bucket.
> {code:java}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests 
> -DintegrationTestPipelineOptions='["--runner=dataflow", 
> "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner
> {code}
> Then look on the cloud console and job should fail.
> Now add Reshuffling to 
> sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
>  as in the example.
> {code:java}
> .getPerDestinationOutputFilenames().apply(Values.create())
> .apply(Reshuffle.viaRandomKey());
> PCollection consolidatedHashcode = testFilenames
> {code}
> and trigger previously used maven command to see it working in the console 
> right now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)