I actually didn't look at this one. I filed a bunch more adjacent flake
bugs. I didn't find your bug but I do see that test flaking at the same
time as the others. FWIW here is the list of flakes and sickbayed tests:
https://issues.apache.org/jira/issues/?filter=12343195

Kenn

On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <ajam...@google.com> wrote:

> +Kenn,
>
> Did you happen to look into this test? I heard that you may have been
> looking into this.
>
> On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi Alex,
>>
>> Thanks for your help! I'm quite used to debugging concurrent/distributed
>> problems. But this one is quite tricky, especially with regard to GRPC
>> threads. I'll try to provide more information below.
>>
>> There are two observations:
>>
>> 1) The problem is specifically related to how the cleanup is performed
>> for the EmbeddedEnvironmentFactory. The environment is shut down when the
>> SDK Harness exits but the GRPC threads continue to linger for some time
>> and may stall state processing on the next test.
>>
>> If you do _not_ close DefaultJobBundleFactory, which happens during
>> close() or dispose() in the FlinkExecutableStageFunction or
>> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
>> 1000 test runs without a single failure.
>>
>> The EmbeddedEnvironment uses direct channels which are marked
>> experimental in GRPC. We may have to convert them to regular socket
>> communication.
>>
>> 2) Try setting a conditional breakpoint in GrpcStateService which will
>> never break, e.g. "false". Set it here:
>>
>> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>>
>> The tests will never fail. The SDK harness is always shut down correctly
>> at the end of the test.
>>
>> Thanks,
>> Max
>>
>> On 26.11.18 19:15, Alex Amato wrote:
>> > Thanks Maximilian, let me know if you need any help. Usually I debug
>> > this sort of thing by pausing the IntelliJ debugger to see all the
>> > different threads which are waiting on various conditions. If you find
>> > any insights from that, please post them here and we can try to figure
>> > out the source of the stuckness. Perhaps it may be some concurrency
>> > issue leading to deadlock?
>> >
>> > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <m...@apache.org
>> > <mailto:m...@apache.org>> wrote:
>> >
>> >     I couldn't fix it thus far. The issue does not seem to be in the Flink
>> >     Runner but in the way the tests utilize the EMBEDDED environment to run
>> >     multiple portable jobs in a row.
>> >
>> >     When it gets stuck it is in RemoteBundle#close and it is independent of
>> >     the test type (batch and streaming have different implementations).
>> >
>> >     Will give it another look tomorrow.
>> >
>> >     Thanks,
>> >     Max
>> >
>> >     On 22.11.18 13:07, Maximilian Michels wrote:
>> >      > Hi Alex,
>> >      >
>> >      > The test seems to have gotten flaky after we merged support for portable
>> >      > timers in Flink's batch mode.
>> >      >
>> >      > Looking into this now.
>> >      >
>> >      > Thanks,
>> >      > Max
>> >      >
>> >      > On 21.11.18 23:56, Alex Amato wrote:
>> >      >> Hello, I have noticed
>> >      >> that org.apache.beam.runners.flink.PortableTimersExecutionTest is very
>> >      >> flaky, and repro'd this test timeout on the master branch in 40/50 runs.
>> >      >>
>> >      >> I filed a JIRA issue: BEAM-6111
>> >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
>> >      >> wondering if anyone knew why this may be occurring, and to check if
>> >      >> anyone else has been experiencing this.
>> >      >>
>> >      >> Thanks,
>> >      >> Alex
>> >
>>
>
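
The thread mentions pausing the debugger to see which threads are waiting on
which conditions. When attaching a debugger is inconvenient (e.g. on CI), a
similar snapshot can be taken programmatically. The following is only a rough
sketch using plain JDK calls; the class name ThreadDumpSketch, the method
stuckThreads, and the idea of filtering by thread-name prefix are illustrative
assumptions and are not part of Beam or of the test in question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Beam: lists live threads that are
// blocked or waiting, together with the top frame of their stack.
public class ThreadDumpSketch {

    /** One report line per live, non-runnable thread whose name starts with namePrefix. */
    static List<String> stuckThreads(String namePrefix) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            Thread.State s = t.getState();
            boolean stuck = s == Thread.State.BLOCKED
                    || s == Thread.State.WAITING
                    || s == Thread.State.TIMED_WAITING;
            if (!stuck || !t.getName().startsWith(namePrefix)) {
                continue;
            }
            StackTraceElement[] stack = e.getValue();
            // The top frame usually shows what the thread is parked on.
            String top = stack.length > 0 ? stack[0].toString() : "<no frames>";
            report.add(t.getName() + " [" + s + "] at " + top);
        }
        return report;
    }

    public static void main(String[] args) {
        // An empty prefix matches every thread name.
        stuckThreads("").forEach(System.out::println);
    }
}
```

Calling such a helper from a test timeout handler, with a prefix matching the
threads of interest, would show where they are parked at the moment the test
hangs. Which prefix to use is left open here, since the thread names involved
are not stated in this thread.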
