[ https://issues.apache.org/jira/browse/BEAM-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744625#comment-16744625 ]
Kenneth Knowles commented on BEAM-6451: --------------------------------------- Dug into the code. Should add a timeout to throwIfFailure, here: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L538 > Portability Pipeline eventually hangs on bundle registration > ------------------------------------------------------------ > > Key: BEAM-6451 > URL: https://issues.apache.org/jira/browse/BEAM-6451 > Project: Beam > Issue Type: Bug > Components: java-fn-execution, runner-dataflow, sdk-py-harness > Reporter: Scott Wegner > Priority: Minor > Labels: portability > > We've seen jobs using portability start off in a healthy state, but then > eventually get stuck and hang on bundle registration. We see error logs from > the worker harness: > {code} > Processing stuck in step s01 for at least 06h30m00s without outputting or > completing in state finish at > sun.misc.Unsafe.park(Native Method) at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693) > at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at > org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) at > org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:277) > at > org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85) > at > org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:119) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1226) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:141) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:965) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at > java.lang.Thread.run(Thread.java:745) > {code} > Looking at [the > code|https://github.com/apache/beam/blob/release-2.8.0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/RegisterAndProcessBundleOperation.java#L277], > it looks like there are no timeouts on the Bundle Registration calls over > the FnApi, which contributes to this hanging forever rather than giving a > better failure. > This bug report came from a customer running a python streaming pipeline > using the new portability framework on Dataflow. Hopefully we can repro on > our own in order to link to the job / logs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)