[jira] [Work logged] (BEAM-3798) Performance tests flaky due to Dataflow transient errors
[ https://issues.apache.org/jira/browse/BEAM-3798?focusedWorklogId=81021=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-81021 ] ASF GitHub Bot logged work on BEAM-3798: Author: ASF GitHub Bot Created on: 15/Mar/18 23:26 Start Date: 15/Mar/18 23:26 Worklog Time Spent: 10m Work Description: lukecwik closed pull request #4871: [BEAM-3798] Remove error check on dataflow when getting batch job state URL: https://github.com/apache/beam/pull/4871 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/TestDataflowRunner.java b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/TestDataflowRunner.java index e163fe8d674..8679a952284 100644 --- a/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/TestDataflowRunner.java +++ b/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/TestDataflowRunner.java @@ -181,7 +181,7 @@ private boolean waitForStreamingJobTermination( } /** - * Return {@code true} if the job succeeded or {@code false} if it terminated in any other manner. + * Return {@code true} if job state is {@code State.DONE}. {@code false} otherwise. */ private boolean waitForBatchJobTermination( DataflowPipelineJob job, ErrorMonitorMessagesHandler messageHandler) { @@ -195,7 +195,7 @@ private boolean waitForBatchJobTermination( return false; } - return job.getState() == State.DONE && !messageHandler.hasSeenError(); + return job.getState() == State.DONE; } } diff --git a/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/TestDataflowRunnerTest.java b/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/TestDataflowRunnerTest.java index f382e4b6ed2..cf54556a093 100644 --- a/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/TestDataflowRunnerTest.java +++ b/runners/google-cloud-dataflow-java/src/test/java/org/apache/beam/runners/dataflow/TestDataflowRunnerTest.java @@ -121,6 +121,40 @@ public void testRunBatchJobThatSucceeds() throws Exception { assertEquals(mockJob, runner.run(p, mockRunner)); } + /** + * Job success on Dataflow means that it handled transient errors (if any) successfully + * by retrying failed bundles. + */ + @Test + public void testRunBatchJobThatSucceedsDespiteTransientErrors() throws Exception { +Pipeline p = Pipeline.create(options); +PCollection pc = p.apply(Create.of(1, 2, 3)); +PAssert.that(pc).containsInAnyOrder(1, 2, 3); + +DataflowPipelineJob mockJob = Mockito.mock(DataflowPipelineJob.class); +when(mockJob.getState()).thenReturn(State.DONE); +when(mockJob.getProjectId()).thenReturn("test-project"); +when(mockJob.getJobId()).thenReturn("test-job"); +when(mockJob.waitUntilFinish(any(Duration.class), any(JobMessagesHandler.class))) + .thenAnswer( +invocation -> { + JobMessage message = new JobMessage(); + message.setMessageText("TransientError"); + message.setTime(TimeUtil.toCloudTime(Instant.now())); + message.setMessageImportance("JOB_MESSAGE_ERROR"); + ((JobMessagesHandler) invocation.getArguments()[1]).process(Arrays.asList(message)); + return State.DONE; +}); + +DataflowRunner mockRunner = Mockito.mock(DataflowRunner.class); +when(mockRunner.run(any(Pipeline.class))).thenReturn(mockJob); + +TestDataflowRunner runner = TestDataflowRunner.fromOptionsAndClient(options, mockClient); +when(mockClient.getJobMetrics(anyString())) + .thenReturn(generateMockMetricResponse(true /* success */, true /* tentative */)); +assertEquals(mockJob, runner.run(p, mockRunner)); + } + /** * Tests that when a batch job terminates in a failure state even if all assertions * passed, it throws an error to that effect. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 81021) Time Spent: 40m (was: 0.5h) > Performance tests flaky due to Dataflow transient errors > > > Key: BEAM-3798 > URL:
[jira] [Work logged] (BEAM-3798) Performance tests flaky due to Dataflow transient errors
[ https://issues.apache.org/jira/browse/BEAM-3798?focusedWorklogId=80892=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-80892 ] ASF GitHub Bot logged work on BEAM-3798: Author: ASF GitHub Bot Created on: 15/Mar/18 17:16 Start Date: 15/Mar/18 17:16 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #4871: [BEAM-3798] Remove error check on dataflow when getting batch job state URL: https://github.com/apache/beam/pull/4871#issuecomment-373454679 @lukecwik could you take a look? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 80892) Time Spent: 0.5h (was: 20m) > Performance tests flaky due to Dataflow transient errors > > > Key: BEAM-3798 > URL: https://issues.apache.org/jira/browse/BEAM-3798 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Łukasz Gajowy >Assignee: Thomas Groh >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Performance tests are flaky due to transient errors that happened during data > processing (eg. SocketTimeoutException while connecting to DB). Currently > exceptions that happen on Dataflow runner but are retried successfully, fail > the test regardless of the final job state (giving a false-negative result). > Possible solution for batch scenarios: > We could "rethrow" exceptions that happened due to transient errors *only* if > the job status is other than DONE. > Possible solution for streaming scenarios: > (don't know yet) > [Link to discussion on dev list > |https://lists.apache.org/thread.html/e480f8181913dc81d2d4cd1430557a646537473ccf29fe6390229098@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-3798) Performance tests flaky due to Dataflow transient errors
[ https://issues.apache.org/jira/browse/BEAM-3798?focusedWorklogId=80863=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-80863 ] ASF GitHub Bot logged work on BEAM-3798: Author: ASF GitHub Bot Created on: 15/Mar/18 15:05 Start Date: 15/Mar/18 15:05 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #4871: [BEAM-3798] Remove error check on dataflow when getting batch job state URL: https://github.com/apache/beam/pull/4871#issuecomment-373408006 Run Dataflow ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 80863) Time Spent: 20m (was: 10m) > Performance tests flaky due to Dataflow transient errors > > > Key: BEAM-3798 > URL: https://issues.apache.org/jira/browse/BEAM-3798 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Łukasz Gajowy >Assignee: Thomas Groh >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Performance tests are flaky due to transient errors that happened during data > processing (eg. SocketTimeoutException while connecting to DB). Currently > exceptions that happen on Dataflow runner but are retried successfully, fail > the test regardless of the final job state (giving a false-negative result). > Possible solution for batch scenarios: > We could "rethrow" exceptions that happened due to transient errors *only* if > the job status is other than DONE. > Possible solution for streaming scenarios: > (don't know yet) > [Link to discussion on dev list > |https://lists.apache.org/thread.html/e480f8181913dc81d2d4cd1430557a646537473ccf29fe6390229098@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-3798) Performance tests flaky due to Dataflow transient errors
[ https://issues.apache.org/jira/browse/BEAM-3798?focusedWorklogId=80862=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-80862 ] ASF GitHub Bot logged work on BEAM-3798: Author: ASF GitHub Bot Created on: 15/Mar/18 15:04 Start Date: 15/Mar/18 15:04 Worklog Time Spent: 10m Work Description: lgajowy opened a new pull request #4871: [BEAM-3798] Remove error check on dataflow when getting batch job state URL: https://github.com/apache/beam/pull/4871 In TestDataflowRunner beam can rely solely on Dataflow job state (for batch jobs). It it's "DONE", then it means that Dataflow handled any errors on it's own (eg. by retrying failed bundles). We don't need to throw them further. This PR does not fix a streaming scenario. We should fix it on separate PR later. It still fixes Performance tests that we have now. Follow this checklist to help us incorporate your contribution quickly and easily: - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [x] Write a pull request description that is detailed enough to understand: - [x] What the pull request does - [x] Why it does it - [x] How it does it - [x] Why this approach - [x] Each commit in the pull request should have a meaningful subject line and body. - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 80862) Time Spent: 10m Remaining Estimate: 0h > Performance tests flaky due to Dataflow transient errors > > > Key: BEAM-3798 > URL: https://issues.apache.org/jira/browse/BEAM-3798 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Łukasz Gajowy >Assignee: Thomas Groh >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Performance tests are flaky due to transient errors that happened during data > processing (eg. SocketTimeoutException while connecting to DB). Currently > exceptions that happen on Dataflow runner but are retried successfully, fail > the test regardless of the final job state (giving a false-negative result). > Possible solution for batch scenarios: > We could "rethrow" exceptions that happened due to transient errors *only* if > the job status is other than DONE. > Possible solution for streaming scenarios: > (don't know yet) > [Link to discussion on dev list > |https://lists.apache.org/thread.html/e480f8181913dc81d2d4cd1430557a646537473ccf29fe6390229098@%3Cdev.beam.apache.org%3E] -- This message was sent by Atlassian JIRA (v7.6.3#76005)