Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

Alex Amato Tue, 27 Nov 2018 10:25:46 -0800

+Ken,

Did you happen to look into this test? I heard that you may have been
looking into this.


On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <[email protected]> wrote:

> Hi Alex,
>
> Thanks for your help! I'm quite used to debugging concurrent/distributed
> problems. But this one is quite tricky, especially with regard to gRPC
> threads. I'll try to provide more information below.
>
> There are two observations:
>
> 1) The problem is specifically related to how the cleanup is performed
> for the EmbeddedEnvironmentFactory. The environment is shut down when the
> SDK Harness exits, but the gRPC threads continue to linger for some time
> and may stall state processing in the next test.
>
> If you do _not_ close DefaultJobBundleFactory, which happens during
> close() or dispose() in the FlinkExecutableStageFunction or
> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
> 1000 test runs without a single failure.
>
> The EmbeddedEnvironment uses direct channels, which are marked
> experimental in gRPC. We may have to convert them to regular socket
> communication.
>
> 2) Try setting a conditional breakpoint in GrpcStateService whose
> condition is never true, e.g. "false". Set it here:
>
> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>
> With that breakpoint set, the tests never fail. The SDK harness is always
> shut down correctly at the end of the test.
>
> Thanks,
> Max
>
> On 26.11.18 19:15, Alex Amato wrote:
> > Thanks Maximilian, let me know if you need any help. Usually I debug
> > this sort of thing by pausing the IntelliJ debugger to see all the
> > different threads which are waiting on various conditions. If you find
> > any insights from that, please post them here and we can try to figure
> > out the source of the stuckness. Perhaps it may be some concurrency
> > issue leading to deadlock?
> >
> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     I couldn't fix it thus far. The issue does not seem to be in the Flink
> >     Runner but in the way the test utilizes the EMBEDDED environment to run
> >     multiple portable jobs in a row.
> >
> >     When it gets stuck it is in RemoteBundle#close and it is independent of
> >     the test type (batch and streaming have different implementations).
> >
> >     Will give it another look tomorrow.
> >
> >     Thanks,
> >     Max
> >
> >     On 22.11.18 13:07, Maximilian Michels wrote:
> >      > Hi Alex,
> >      >
> >      > The test seems to have gotten flaky after we merged support for
> >      > portable timers in Flink's batch mode.
> >      >
> >      > Looking into this now.
> >      >
> >      > Thanks,
> >      > Max
> >      >
> >      > On 21.11.18 23:56, Alex Amato wrote:
> >      >> Hello, I have noticed that
> >      >> org.apache.beam.runners.flink.PortableTimersExecutionTest is very
> >      >> flakey, and repro'd this test timeout on the master branch in
> >      >> 40/50 runs.
> >      >>
> >      >> I filed a JIRA issue: BEAM-6111
> >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
> >      >> wondering if anyone knew why this may be occurring, and to check if
> >      >> anyone else has been experiencing this.
> >      >>
> >      >> Thanks,
> >      >> Alex
> >
>
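
For anyone who wants to capture the thread state without attaching the
IntelliJ debugger (the approach suggested earlier in this thread), here is a
minimal sketch that uses only the JDK's ThreadMXBean. It is not part of Beam
or of the test itself. The class name ThreadDumpSketch and the nameFilter
parameter are made up for illustration; pass whatever substring matches the
threads of interest, or "" for all threads. Calling it from the test JVM
shortly before the timeout is one possible way to see where a stuck
RemoteBundle#close is waiting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not Beam code: dumps live threads and reports deadlocks.
class ThreadDumpSketch {

    /** Returns a text dump of all live threads whose name contains nameFilter. */
    public static String dumpThreads(String nameFilter) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM supports them; otherwise
        // dumpAllThreads would throw UnsupportedOperationException.
        ThreadInfo[] infos = bean.dumpAllThreads(
            bean.isObjectMonitorUsageSupported(),
            bean.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null || !info.getThreadName().contains(nameFilter)) {
                continue;
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                // The monitor or synchronizer this thread is blocked or waiting on.
                out.append(" waiting on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                out.append(" held by \"").append(info.getLockOwnerName()).append('"');
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Returns the ids of threads in a lock cycle, or an empty array if none. */
    public static long[] findDeadlockedThreadIds() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.isSynchronizerUsageSupported()
            ? bean.findDeadlockedThreads()
            : bean.findMonitorDeadlockedThreads();
        // Both methods return null when no deadlock is found.
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads(""));
        System.out.println("deadlocked threads: " + findDeadlockedThreadIds().length);
    }
}
```

Note that this only detects true lock cycles. A thread that is merely waiting
on a response that never arrives will show up in the dump as WAITING or
TIMED_WAITING but will not be reported as deadlocked, so the stack traces are
usually the more useful part.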
