[ https://issues.apache.org/jira/browse/BEAM-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334134#comment-15334134 ]
ASF GitHub Bot commented on BEAM-350:
-------------------------------------

GitHub user dhalperi opened a pull request:

https://github.com/apache/incubator-beam/pull/476

[BEAM-350] DataflowPipelineJob: Retry messages, metrics, and status polls

At some point in the past, we decided to use a rawDataflowClient that does not do retries when checking job status, because it was best-effort reporting to users. The purported goal was to not clutter the log with networking errors.

However, since that time, we have:

* Added the ability to suppress logs (emit only at DEBUG level or not at all) when retrying.
* Increased reliability of the job checking status so that these errors are less frequent and more indicative of quota or other issues.
* Started using the metrics in tests, where we do need to retry transient issues (BEAM-350).

So let's drop the raw transport client and just use the one that retries.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhalperi/incubator-beam test-pipeline-runne

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #476

----
commit 127d7f48bb50e42f879ea4299f78889132ebc4b7
Author: Dan Halperin <dhalp...@google.com>
Date: 2016-06-16T15:57:18Z

DataflowPipelineJob: Retry messages, metrics, and status polls

At some point in the past, we decided to use a rawDataflowClient that does not do retries when checking job status, because it was best-effort reporting to users. The purported goal was to not clutter the log with networking errors.

However, since that time, we have:

* Added the ability to suppress logs (emit only at DEBUG level or not at all) when retrying.
* Increased reliability of the job checking status so that these errors are less frequent and more indicative of quota or other issues.
* Started using the metrics in tests, where we do need to retry transient issues (BEAM-350).

So let's drop the raw transport client and just use the one that retries.

----

> beam_PostCommit_RunnableOnService_GoogleCloudDataflow flaky
> -----------------------------------------------------------
>
>                 Key: BEAM-350
>                 URL: https://issues.apache.org/jira/browse/BEAM-350
>             Project: Beam
>          Issue Type: Bug
>          Components: testing
>            Reporter: Daniel Halperin
>            Assignee: Daniel Halperin
>
> These tests have been flaking for a little while with 500 server errors
> calling Google Cloud Dataflow APIs. There may be additional retry logic
> needed in the SDK around the calls to check the job status.
> {code}
> {
>   "code" : 500,
>   "errors" : [ {
>     "domain" : "global",
>     "message" : "Internal error encountered.",
>     "reason" : "backendError"
>   } ],
>   "message" : "Internal error encountered.",
>   "status" : "INTERNAL"
> }
>   at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
>   at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
>   at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
>   at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
>   at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1065)
>   at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
>   at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
>   at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
>   at org.apache.beam.runners.dataflow.testing.TestDataflowPipelineRunner.checkForSuccess(TestDataflowPipelineRunner.java:185)
>   at org.apache.beam.runners.dataflow.testing.TestDataflowPipelineRunner.run(TestDataflowPipelineRunner.java:141)
>   ... 25 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
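A minimal Java sketch of the behaviour discussed in this thread, under stated assumptions: retry a job-status poll when the service answers with a transient 5xx error (like the 500 "backendError" in the trace), back off exponentially between attempts, and log each retry only at a low level so routine retries do not clutter the output. This is not the Beam or Dataflow SDK code. RetryingStatusPoller, HttpStatusException, pollWithRetries, the attempt limit, the backoff values, and the "JOB_STATE_DONE" placeholder string are all hypothetical, and java.util.logging's FINE level stands in for DEBUG. The pull request itself gets this behaviour by switching to the existing retrying Dataflow client rather than hand-writing a loop like this one.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical sketch (not SDK code): retries a status poll on transient
 * HTTP 5xx errors with exponential backoff, logging retries only at FINE.
 */
public final class RetryingStatusPoller {
  private static final Logger LOG = Logger.getLogger(RetryingStatusPoller.class.getName());

  /** Hypothetical exception carrying the HTTP status code of a failed API call. */
  public static final class HttpStatusException extends Exception {
    private final int statusCode;

    public HttpStatusException(int statusCode) {
      super("HTTP " + statusCode);
      this.statusCode = statusCode;
    }

    public int statusCode() {
      return statusCode;
    }

    /** Server-side errors (5xx) are treated as transient; everything else is not. */
    boolean isTransient() {
      return statusCode >= 500 && statusCode < 600;
    }
  }

  private RetryingStatusPoller() {}

  /**
   * Calls {@code poll} up to {@code maxAttempts} times, sleeping
   * initialBackoffMillis, then 2x, 4x, ... between failed attempts.
   * Non-transient errors, and the last transient error, are rethrown.
   */
  public static <T> T pollWithRetries(Callable<T> poll, int maxAttempts, long initialBackoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoffMillis = initialBackoffMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return poll.call();
      } catch (HttpStatusException e) {
        if (!e.isTransient() || attempt >= maxAttempts) {
          throw e;
        }
        // Quiet retry: FINE is hidden by default, the closest JUL analogue of DEBUG.
        LOG.log(Level.FINE, "Attempt " + attempt + " failed with " + e.getMessage()
            + "; retrying in " + backoffMillis + " ms");
        Thread.sleep(backoffMillis);
        backoffMillis *= 2;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated status call: fails twice with HTTP 500, then succeeds.
    String state = pollWithRetries(() -> {
      if (calls[0]++ < 2) {
        throw new HttpStatusException(500);
      }
      return "JOB_STATE_DONE";
    }, 5, 10L);
    System.out.println(state + " after " + calls[0] + " calls");
  }
}
```

Running main prints "JOB_STATE_DONE after 3 calls". Only 5xx responses are retried: a client error such as 404 is rethrown on the first attempt, and a transient error that persists past maxAttempts is rethrown too, so a sustained outage still reaches the caller instead of being hidden.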