[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998557#comment-15998557 ]

Davor Bonaci commented on BEAM-1582:

+1 -- this makes total sense to me.

> ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
> --
>
> Key: BEAM-1582
> URL: https://issues.apache.org/jira/browse/BEAM-1582
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Priority: Minor
> Labels: flake
>
> See:
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-spark/2788/testReport/junit/org.apache.beam.runners.spark.translation.streaming/ResumeFromCheckpointStreamingTest/testWithResume/
> After some digging in it appears that a second firing occurs (though only one
> is expected) but it doesn't come from a stale state (state is empty before it
> fires).
> Might be a retry happening for some reason, which is OK in terms of
> fault-tolerance guarantees (at-least-once), but not so much in terms of flaky
> tests.
> I'm looking into this hoping to fix this ASAP.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
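To illustrate the point in the issue description that a retry is acceptable under at-least-once guarantees yet still breaks a test expecting a single firing, here is a minimal sketch. It is not Beam or Spark code: the class `DuplicateFiringSketch`, the methods `deliver`, `exactCheck`, and `atLeastOnceCheck`, and the pane name are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a retried firing; all names here are assumptions for illustration. */
class DuplicateFiringSketch {

  /** Delivers every pane once, then redelivers the panes at the retried indices. */
  static List<String> deliver(List<String> panes, Set<Integer> retriedIndices) {
    List<String> observed = new ArrayList<>(panes);
    for (int i = 0; i < panes.size(); i++) {
      if (retriedIndices.contains(i)) {
        observed.add(panes.get(i)); // simulated retry: same pane arrives again
      }
    }
    return observed;
  }

  /** Strict check: the observed firings must match the expected ones exactly. */
  static boolean exactCheck(List<String> observed, List<String> expected) {
    return observed.equals(expected);
  }

  /** Lenient check: duplicates are tolerated, only the distinct firings matter. */
  static boolean atLeastOnceCheck(List<String> observed, List<String> expected) {
    return new HashSet<>(observed).equals(new HashSet<>(expected));
  }

  public static void main(String[] args) {
    List<String> expected = Arrays.asList("pane-1");
    List<String> observed = deliver(expected, Collections.singleton(0));
    System.out.println("observed firings: " + observed);
    System.out.println("exact check passes: " + exactCheck(observed, expected));
    System.out.println("at-least-once check passes: " + atLeastOnceCheck(observed, expected));
  }
}
```

Whether a lenient assertion is appropriate for this test is a separate question; the sketch only shows why the strict form flakes when a retry happens.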
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997742#comment-15997742 ]

Aviem Zur commented on BEAM-1582:

I don't think we should disable this test, as it is a very important one.

In the last 30 builds for which Jenkins keeps history, this test has not failed once:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_ValidatesRunner_Spark/

So I'm not sure why this is such an issue?
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997070#comment-15997070 ]

Davor Bonaci commented on BEAM-1582:

Is there something we can do here to reduce the impact of the flake?

I'm not quite sure how @ValidatesRunner annotation would help. If we cannot make this one work, perhaps we should disable the test for a little bit.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985205#comment-15985205 ]

Aviem Zur commented on BEAM-1582:

AFAIK there is no current solution for this in sight for the near future. [~amitsela] can probably elaborate.

The current workaround was to annotate this test with {{@ValidatesRunner}}.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981532#comment-15981532 ]

Ahmet Altay commented on BEAM-1582:

Hey [~amitsela] any updates on this one? (cc: [~aviemzur])
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931859#comment-15931859 ]

Amit Sela commented on BEAM-1582:

Moved tests that use checkpoint recovery to post-commit.
Keeping open since the problem is still there.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929222#comment-15929222 ]

Eugene Kirpichov commented on BEAM-1582:

Still happening and failing precommits quite often.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903277#comment-15903277 ]

ASF GitHub Bot commented on BEAM-1582:

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/2168
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897224#comment-15897224 ]

ASF GitHub Bot commented on BEAM-1582:

GitHub user amitsela opened a pull request:

    https://github.com/apache/beam/pull/2168

    [BEAM-1582, BEAM-1562] Stop streaming tests on EOT Watermark.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [ ] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
       Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `` in the title with the actual Jira issue
       number, if there is one.
     - [ ] If this contribution is large, please file an Apache
       [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amitsela/beam stop-streaming-tests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2168

commit 7d171bf5a69e6eb3f7ae58343a06c4a48865feaf
Author: Sela
Date: 2017-03-04T19:04:02Z

    Test runner to stop on EOT watermark, or timeout.

commit 899021fa4bf969c93288379bf847dd7c06ec5f09
Author: Sela
Date: 2017-03-04T19:05:25Z

    Remove timeout since it is already a pipeline option.

commit 6b8a37f66cd4a0ca29d5343f3dffbd95202ecb6f
Author: Sela
Date: 2017-03-04T19:07:16Z

    Advance to infinity at the end of pipelines.

commit 909201ae2f1bca762fcdd56c0db3bb1841965b54
Author: Sela
Date: 2017-03-04T19:07:59Z

    Add EOT watermark and expected assertions test options.

commit e4f66dfb13feea93b0b4c797620c3fa9080652f0
Author: Sela
Date: 2017-03-04T19:09:38Z

    SparkPipelineResult should avoid returning null, and handle exceptions better.

commit 73cfebd770fd4b5c6566051172b3e31faaf2c4e9
Author: Sela
Date: 2017-03-04T19:11:36Z

    Make ResumeFromCheckpointStreamingTest use TestSparkRunner and stop on EOT watermark.

commit d71207a75ef38dcdbf893fa032753db2875e4d3b
Author: Sela
Date: 2017-03-04T20:08:05Z

    fixup! post-rebase import order.

commit 4d1222f2cc653d31bc4cfa5e516af08b5e5e53a6
Author: Sela
Date: 2017-03-05T10:52:58Z

    Stop the context and update the state BEFORE throwing the exception.
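The idea named in the first commit message (stop the test run once the watermark reaches end-of-time, or give up on timeout) could look roughly like the sketch below. This is not the code from the pull request: the class `StopOnEndOfTimeSketch`, the method `awaitEndOfTime`, the constant `END_OF_TIME`, and the polling approach are all assumptions made for illustration, and the watermark is modelled as a plain `long` supplier rather than anything Beam or Spark provides.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Toy "stop on end-of-time watermark, or timeout" loop; all names are hypothetical. */
class StopOnEndOfTimeSketch {

  /** Assumed end-of-time marker for this sketch only. */
  static final long END_OF_TIME = Long.MAX_VALUE;

  enum Outcome { REACHED_END_OF_TIME, TIMED_OUT, INTERRUPTED }

  /** Polls the watermark until it reaches END_OF_TIME or the timeout elapses. */
  static Outcome awaitEndOfTime(LongSupplier watermark, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      if (watermark.getAsLong() >= END_OF_TIME) {
        return Outcome.REACHED_END_OF_TIME;
      }
      if (System.nanoTime() - deadline >= 0) {
        return Outcome.TIMED_OUT;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return Outcome.INTERRUPTED;
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicLong watermark = new AtomicLong(0L);
    // Simulated source that advances the watermark to end-of-time after a short delay.
    Thread source = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      watermark.set(END_OF_TIME);
    });
    source.start();
    System.out.println(awaitEndOfTime(watermark::get, 2_000, 10));
    source.join();
    // A watermark that never advances ends in a timeout instead.
    System.out.println(awaitEndOfTime(() -> 0L, 100, 10));
  }
}
```

Returning an outcome instead of throwing keeps the caller free to stop the streaming context before failing the test, which matches the ordering concern in the last commit message above.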
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897223#comment-15897223 ]

ASF GitHub Bot commented on BEAM-1582:

Github user amitsela closed the pull request at:

    https://github.com/apache/beam/pull/2159
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895864#comment-15895864 ]

ASF GitHub Bot commented on BEAM-1582:

GitHub user amitsela opened a pull request:

    https://github.com/apache/beam/pull/2159

    [BEAM-1582, BEAM-1562] Stop streaming tests on EOT Watermark.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amitsela/beam stop-streaming-tests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2159.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2159

commit 2faeb449d727ccfdc0aeabc9fc78cbeee5ff0dc3
Author: Sela
Date: 2017-03-04T19:04:02Z

    Test runner to stop on EOT watermark, or timeout.

commit 18fa962cf3d5fb676bb8f94adecf02a5086d6cfb
Author: Sela
Date: 2017-03-04T19:05:25Z

    Remove timeout since it is already a pipeline option.

commit c7f4c2c3c0cddf8c357fb202dfedddb0bf9f6335
Author: Sela
Date: 2017-03-04T19:07:16Z

    Advance to infinity at the end of pipelines.

commit 0c57773985f8610dd3df2eec00e3dba5f32b00c3
Author: Sela
Date: 2017-03-04T19:07:59Z

    Add EOT watermark and expected assertions test options.

commit 5d6c3688f0840bc1a17272b078ba128a71233553
Author: Sela
Date: 2017-03-04T19:09:38Z

    SparkPipelineResult should avoid returning null, and handle exceptions better.

commit eb841f01bab6ead2ba68f866591a85995989943b
Author: Sela
Date: 2017-03-04T19:11:36Z

    Make ResumeFromCheckpointStreamingTest use TestSparkRunner and stop on EOT watermark.

commit 71dfaf3bf916dc704328d6e9ea64c33cc3464bec
Author: Sela
Date: 2017-03-04T20:08:05Z

    fixup! post-rebase import order.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895861#comment-15895861 ]

Amit Sela commented on BEAM-1582:

Could be related to SPARK-16480 so that the last {{CheckpointMark}} is not properly checkpointed.
[jira] [Commented] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
[ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891188#comment-15891188 ]

Amit Sela commented on BEAM-1582:

Looks like the flake happens when the entire input is re-read.

We inject 4 elements into Kafka before the first run, and 2 more before the second. When all is well, printing the number of elements read by SparkUnboundedSource's {{readUnboundedStream}} JavaDStream says 4 (sometimes 1 followed by 3) in the first run, and 2 in the second, but in failures it reads 6 in the second.

This would happen if the checkpoints of the readers are not persisted for some reason, causing KafkaIO to use the default "earliest" and so read everything. This happens even though the checkpoint interval is the batch interval.

I will check if there's a way to guarantee/block on checkpointing.
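The 4 / 2 versus 4 / 6 element counts described above can be reproduced with a toy model. The sketch below is not Beam, Spark, or Kafka code: the class `CheckpointResumeSketch`, the in-memory `Log`, and the methods `run` and `simulate` are hypothetical stand-ins, and the fallback to offset 0 is only an assumed analogue of an "earliest" reset policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of a resumable reader; every name here is an assumption for illustration. */
class CheckpointResumeSketch {

  /** Hypothetical stand-in for a topic partition: an append-only list of records. */
  static final class Log {
    private final List<String> records = new ArrayList<>();

    void append(String record) {
      records.add(record);
    }

    int endOffset() {
      return records.size();
    }

    List<String> readFrom(int offset) {
      return new ArrayList<>(records.subList(offset, records.size()));
    }
  }

  /**
   * Reads everything from the checkpointed offset onward. Without a checkpoint
   * the reader starts at offset 0, mimicking an "earliest" reset.
   */
  static List<String> run(Log log, Optional<Integer> checkpointedOffset) {
    return log.readFrom(checkpointedOffset.orElse(0));
  }

  /** Returns the element counts of the first and second run. */
  static int[] simulate(boolean checkpointPersisted) {
    Log log = new Log();
    for (int i = 0; i < 4; i++) {
      log.append("k" + i); // 4 elements injected before the first run
    }
    List<String> first = run(log, Optional.empty());

    // The reader position after the first run is either persisted or lost.
    Optional<Integer> checkpoint =
        checkpointPersisted ? Optional.of(log.endOffset()) : Optional.<Integer>empty();

    log.append("k4"); // 2 more elements injected before the second run
    log.append("k5");
    List<String> second = run(log, checkpoint);
    return new int[] {first.size(), second.size()};
  }

  public static void main(String[] args) {
    int[] healthy = simulate(true);
    int[] flaky = simulate(false);
    System.out.println("checkpoint persisted: " + healthy[0] + " then " + healthy[1]); // 4 then 2
    System.out.println("checkpoint lost: " + flaky[0] + " then " + flaky[1]); // 4 then 6
  }
}
```

The model deliberately ignores batching, so it does not show the "1 followed by 3" split in the first run; it only shows how a lost reader position turns the second run's 2 elements into 6.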