+Ken, did you happen to look into this test? I heard that you may have been investigating it.
On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <[email protected]> wrote:

> Hi Alex,
>
> Thanks for your help! I'm quite used to debugging concurrent/distributed
> problems. But this one is quite tricky, especially with regard to GRPC
> threads. I'll try to provide more information below.
>
> There are two observations:
>
> 1) The problem is specifically related to how the cleanup is performed
> for the EmbeddedEnvironmentFactory. The environment is shut down when the
> SDK Harness exits, but the GRPC threads continue to linger for some time
> and may stall state processing on the next test.
>
> If you do _not_ close DefaultJobBundleFactory, which happens during
> close() or dispose() in the FlinkExecutableStageFunction or
> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
> 1000 test runs without a single failure.
>
> The EmbeddedEnvironment uses direct channels, which are marked
> experimental in GRPC. We may have to convert them to regular socket
> communication.
>
> 2) Try setting a conditional breakpoint in GrpcStateService which will
> never break, e.g. "false". Set it here:
>
> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>
> The tests will never fail. The SDK harness is always shut down correctly
> at the end of the test.
>
> Thanks,
> Max
>
> On 26.11.18 19:15, Alex Amato wrote:
> > Thanks Maximilian, let me know if you need any help. Usually I debug
> > this sort of thing by pausing the IntelliJ debugger to see all the
> > different threads which are waiting on various conditions. If you find
> > any insights from that, please post them here and we can try to figure
> > out the source of the stuckness. Perhaps it may be some concurrency
> > issue leading to deadlock?
> >
> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <[email protected]> wrote:
> >
> >     I couldn't fix it thus far. The issue does not seem to be in the Flink
> >     Runner but in the way the tests utilize the EMBEDDED environment to run
> >     multiple portable jobs in a row.
> >
> >     When it gets stuck it is in RemoteBundle#close and it is independent of
> >     the test type (batch and streaming have different implementations).
> >
> >     Will give it another look tomorrow.
> >
> >     Thanks,
> >     Max
> >
> >     On 22.11.18 13:07, Maximilian Michels wrote:
> >      > Hi Alex,
> >      >
> >      > The test seems to have gotten flaky after we merged support for
> >      > portable timers in Flink's batch mode.
> >      >
> >      > Looking into this now.
> >      >
> >      > Thanks,
> >      > Max
> >      >
> >      > On 21.11.18 23:56, Alex Amato wrote:
> >      >> Hello, I have noticed
> >      >> that org.apache.beam.runners.flink.PortableTimersExecutionTest is
> >      >> very flaky, and repro'd this test timeout on the master branch in
> >      >> 40/50 runs.
> >      >>
> >      >> I filed a JIRA issue: BEAM-6111
> >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
> >      >> wondering if anyone knew why this may be occurring, and to check if
> >      >> anyone else has been experiencing this.
> >      >>
> >      >> Thanks,
> >      >> Alex
