This turned out to be a tricky bug. Robert and I had a joint debugging session and managed to find the culprit.

PR pending: https://github.com/apache/beam/pull/7171

On 27.11.18 19:35, Kenneth Knowles wrote:
I actually didn't look at this one. I filed a bunch more adjacent flake bugs. I didn't find your bug but I do see that test flaking at the same time as the others. FWIW here is the list of flakes and sickbayed tests: https://issues.apache.org/jira/issues/?filter=12343195

Kenn

On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <[email protected]> wrote:

    +Kenn,

    Did you happen to look into this test? I heard that you may have
    been looking into this.

    On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <[email protected]> wrote:

        Hi Alex,

        Thanks for your help! I'm quite used to debugging
        concurrent/distributed problems, but this one is quite tricky,
        especially with regard to the gRPC threads. I'll try to provide
        more information below.

        There are two observations:

        1) The problem is specifically related to how cleanup is
        performed for the EmbeddedEnvironmentFactory. The environment is
        shut down when the SDK Harness exits, but the gRPC threads
        continue to linger for some time and may stall state processing
        in the next test.

        If you do _not_ close DefaultJobBundleFactory, which happens during
        close() or dispose() in the FlinkExecutableStageFunction or
        ExecutableStageDoFnOperator respectively, the tests run just
        fine. I ran 1000 test runs without a single failure.

        The EmbeddedEnvironment uses direct channels, which are marked
        experimental in gRPC. We may have to convert them to regular socket
        communication.

        2) Try setting a conditional breakpoint in GrpcStateService
        that will never break, e.g. with the condition "false". Set it here:

        https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134

        With that breakpoint set, the tests never fail. The SDK harness is
        always shut down correctly at the end of the test.
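
To make observation 1 more concrete, here is a minimal sketch of the general pattern, using only java.util.concurrent. It is not Beam or gRPC code: FakeEnvironment, submitSlowTask and awaitQuiescence are hypothetical names made up for illustration. It shows that requesting shutdown returns immediately while worker threads may still be running, and that an explicit wait is needed before the next test can assume they are gone.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LingeringThreadsSketch {

  /** Hypothetical stand-in for an environment that owns worker threads. */
  public static final class FakeEnvironment implements AutoCloseable {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    /** Schedules a task that keeps a worker thread busy for the given time. */
    public void submitSlowTask(long millis) {
      workers.execute(() -> {
        try {
          Thread.sleep(millis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    /** Requests shutdown but returns immediately; tasks may still be running. */
    @Override
    public void close() {
      workers.shutdown();
    }

    /** Blocks until all workers have finished or the timeout elapses. */
    public boolean awaitQuiescence(long timeoutMillis) throws InterruptedException {
      return workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public boolean isTerminated() {
      return workers.isTerminated();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    FakeEnvironment env = new FakeEnvironment();
    env.submitSlowTask(200);
    env.close();
    // The slow task is still pending here, so the pool has not terminated yet.
    System.out.println("terminated right after close(): " + env.isTerminated());
    boolean quiesced = env.awaitQuiescence(5_000);
    System.out.println("terminated after waiting: " + quiesced);
  }
}
```

Whether the actual fix should wait for thread termination like this, or change how the channels are set up, is a separate question; the sketch only illustrates why a close() that returns early can leak activity into the following test.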

        Thanks,
        Max

        On 26.11.18 19:15, Alex Amato wrote:
         > Thanks Maximilian, let me know if you need any help. Usually I debug
         > this sort of thing by pausing the IntelliJ debugger to see all the
         > different threads which are waiting on various conditions. If you find
         > any insights from that, please post them here and we can try to figure
         > out the source of the stuckness. Perhaps it may be some concurrency
         > issue leading to deadlock?
         >
         > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <[email protected]> wrote:
         >
         >     I couldn't fix it thus far. The issue does not seem to be in the Flink
         >     Runner but in the way the tests utilize the EMBEDDED environment to
         >     run multiple portable jobs in a row.
         >
         >     When it gets stuck, it is in RemoteBundle#close, and it is independent of
         >     the test type (batch and streaming have different implementations).
         >
         >     Will give it another look tomorrow.
         >
         >     Thanks,
         >     Max
         >
         >     On 22.11.18 13:07, Maximilian Michels wrote:
         >      > Hi Alex,
         >      >
         >      > The test seems to have gotten flaky after we merged support for
         >      > portable timers in Flink's batch mode.
         >      >
         >      > Looking into this now.
         >      >
         >      > Thanks,
         >      > Max
         >      >
         >      > On 21.11.18 23:56, Alex Amato wrote:
         >      >> Hello, I have noticed that
         >      >> org.apache.beam.runners.flink.PortableTimersExecutionTest is very
         >      >> flaky, and repro'd this test timeout on the master branch in
         >      >> 40/50 runs.
         >      >>
         >      >> I filed a JIRA issue: BEAM-6111
         >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
         >      >> wondering if anyone knew why this may be occurring, and to check if
         >      >> anyone else has been experiencing this.
         >      >>
         >      >> Thanks,
         >      >> Alex
         >
