I actually didn't look at this one. I filed a bunch more adjacent flake bugs. I didn't find your bug but I do see that test flaking at the same time as the others. FWIW here is the list of flakes and sickbayed tests: https://issues.apache.org/jira/issues/?filter=12343195

Kenn

On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <ajam...@google.com> wrote:

> +Ken,
>
> Did you happen to look into this test? I heard that you may have been
> looking into this.
>
> On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi Alex,
>>
>> Thanks for your help! I'm quite used to debugging concurrent/distributed
>> problems. But this one is quite tricky, especially with regard to GRPC
>> threads. I'll try to provide more information below.
>>
>> There are two observations:
>>
>> 1) The problem is specifically related to how the cleanup is performed
>> for the EmbeddedEnvironmentFactory. The environment is shut down when the
>> SDK Harness exits, but the GRPC threads continue to linger for some time
>> and may stall state processing on the next test.
>>
>> If you do _not_ close DefaultJobBundleFactory, which happens during
>> close() or dispose() in the FlinkExecutableStageFunction or
>> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
>> 1000 test runs without a single failure.
>>
>> The EmbeddedEnvironment uses direct channels, which are marked
>> experimental in GRPC. We may have to convert them to regular socket
>> communication.
>>
>> 2) Try setting a conditional breakpoint in GrpcStateService that will
>> never break, e.g. with the condition "false". Set it here:
>>
>> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>>
>> The tests will never fail. The SDK harness is always shut down correctly
>> at the end of the test.
>>
>> Thanks,
>> Max
>>
>> On 26.11.18 19:15, Alex Amato wrote:
>> > Thanks Maximilian, let me know if you need any help. Usually I debug
>> > this sort of thing by pausing the IntelliJ debugger to see all the
>> > different threads which are waiting on various conditions. If you find
>> > any insights from that, please post them here and we can try to figure
>> > out the source of the stuckness. Perhaps it may be some concurrency
>> > issue leading to deadlock?
>> >
>> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <m...@apache.org>
>> > wrote:
>> >
>> > I couldn't fix it thus far. The issue does not seem to be in the Flink
>> > Runner but in the way the tests utilize the EMBEDDED environment to
>> > run multiple portable jobs in a row.
>> >
>> > When it gets stuck it is in RemoteBundle#close, and it is independent
>> > of the test type (batch and streaming have different implementations).
>> >
>> > Will give it another look tomorrow.
>> >
>> > Thanks,
>> > Max
>> >
>> > On 22.11.18 13:07, Maximilian Michels wrote:
>> > > Hi Alex,
>> > >
>> > > The test seems to have gotten flaky after we merged support for
>> > > portable timers in Flink's batch mode.
>> > >
>> > > Looking into this now.
>> > >
>> > > Thanks,
>> > > Max
>> > >
>> > > On 21.11.18 23:56, Alex Amato wrote:
>> > >> Hello, I have noticed that
>> > >> org.apache.beam.runners.flink.PortableTimersExecutionTest is very
>> > >> flaky, and repro'd this test timeout on the master branch in
>> > >> 40/50 runs.
>> > >>
>> > >> I filed a JIRA issue: BEAM-6111
>> > >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
>> > >> wondering if anyone knew why this may be occurring, and to check if
>> > >> anyone else has been experiencing this.
>> > >>
>> > >> Thanks,
>> > >> Alex