[ https://issues.apache.org/jira/browse/BEAM-11400?focusedWorklogId=521765&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-521765 ]

ASF GitHub Bot logged work on BEAM-11400:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Dec/20 17:00
            Start Date: 08/Dec/20 17:00
    Worklog Time Spent: 10m
      Work Description: reuvenlax commented on pull request #13486:
URL: https://github.com/apache/beam/pull/13486#issuecomment-740763268

   run dataflow validatesrunner

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 521765)
    Time Spent: 50m  (was: 40m)

> StreamingDataflowWorker stuck commits logic triggers exceptions if commits
> eventually complete
> ----------------------------------------------------------------------------------------------
>
>                 Key: BEAM-11400
>                 URL: https://issues.apache.org/jira/browse/BEAM-11400
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>            Reporter: Sam Whittle
>            Assignee: Sam Whittle
>            Priority: P2
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Commits that have not completed in a timeout are cancelled as stuck and lost,
> in logs showing up as:
> Detected key with sharding key -6893288510319386341 stuck in COMMITTING
> state, completing it with error.
> However if the commit was not lost but just very slow, when it eventually
> does complete the following error occurs:
> Exception while processing commit response {}
> "java.lang.NullPointerException
>     at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:877)
>     at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$ComputationState.completeWork(StreamingDataflowWorker.java:2246)
> This occurs on the commit stream which finishes processing the current batch
> of responses but then throws the error. This causes the stream to complete
> with an error, resending all of the other commits. So if there were a large
> number of commits on the stream, we make slow progress and only complete a
> couple before retrying everything again. This slowdown can cause further
> commits to exceed the timeout, entering a feedback loop.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
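
The failure mode in the issue description can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class `StuckCommitSketch` and all of its methods are hypothetical and are not taken from `StreamingDataflowWorker`, `Objects.requireNonNull` stands in for the Guava `Preconditions.checkNotNull` call seen in the stack trace, and the tolerant variant is one possible way to handle a late response, not necessarily what pull request #13486 does.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the behaviour described in BEAM-11400.
// None of these names come from the real StreamingDataflowWorker code.
class StuckCommitSketch {
    // Work currently in the COMMITTING state, keyed by sharding key.
    private final Map<Long, String> committing = new ConcurrentHashMap<>();

    void startCommit(long shardingKey, String work) {
        committing.put(shardingKey, work);
    }

    // Timeout path: the commit is treated as stuck and its entry is dropped.
    void failStuckCommit(long shardingKey) {
        committing.remove(shardingKey);
    }

    // Strict completion: assumes the entry is still present, so a late
    // response for a commit already failed as stuck throws
    // NullPointerException.
    String completeStrict(long shardingKey) {
        return Objects.requireNonNull(
            committing.remove(shardingKey),
            "no active work for sharding key " + shardingKey);
    }

    // Tolerant completion (an assumption, one possible approach): a late
    // response for an unknown key is reported to the caller, not thrown.
    boolean completeTolerant(long shardingKey) {
        return committing.remove(shardingKey) != null;
    }

    public static void main(String[] args) {
        StuckCommitSketch state = new StuckCommitSketch();
        long key = -6893288510319386341L;

        // The commit is slow, gets failed as stuck, then completes anyway.
        state.startCommit(key, "work item");
        state.failStuckCommit(key);
        try {
            state.completeStrict(key);
        } catch (NullPointerException e) {
            System.out.println("strict completion failed: " + e.getMessage());
        }

        // Same sequence with the tolerant variant: no exception escapes.
        state.startCommit(key, "work item");
        state.failStuckCommit(key);
        System.out.println(
            "tolerant completion found work: " + state.completeTolerant(key));
    }
}
```

In this model the strict variant corresponds to the exception that, per the description, ends the commit stream with an error and causes the other commits to be resent. The tolerant variant lets the caller log or ignore the late response so the stream can keep going.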