Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

Alex Amato Tue, 04 Dec 2018 18:19:38 -0800

Well, here is my hacky solution.
You can see the changes I make to PortableTimersExecutionTest
https://github.com/apache/beam/pull/6786/files


I don't really understand why the pipeline never starts running when I make
the results object transient in PortableTiemrsExecutionTest.

So I instead continue to access a static object, but key it with the test
parameter, to prevent tests from interfering with each other.

I am not too sure how to proceed. I don't really want to check in this
hacky solution. But I am not too sure of what else to do with solved the
problems. Please let me know if you have any suggestions.

On Tue, Dec 4, 2018 at 5:26 PM Alex Amato <ajam...@google.com> wrote:

> Thanks for letting me know Maximillian,
>
> Btw, I've been looking a this test the last few days as well. I have found
> a few other concurrency issues. That I hope to send a PR out for.
>
>
>    - The PortableTimersExecutionTest result variable is using a static
>    ArrayList, but can be writen to concurrently (by multiple thread AND
>    multiple parameterized test instnace) which causing flakeyness.
>    - But just using a ConcurrentLinkedQueue and a non static variable
>    isn't sufficient as that will cause a copy of the results object to be
>    copied during doFn serialization. So that makes all the assertions fail,
>    since nothing get written to the same result object the test is using/
>    - So it should be made private transient final. However, after trying
>       this I am seeing the test timeout, and I am not sure why. Continuing to
>       debug this.
>
>
> I think that my PR was increasing flakeyness, which is why I saw more of
> these issues.
> Just wanted to point these out in the meantime, hopefull it helps with
> debugging for you too.
>
> On Fri, Nov 30, 2018 at 7:49 AM Maximilian Michels <m...@apache.org> wrote:
>
>> This turned out to be a tricky bug. Robert and me had a joined debugging
>> session and managed to find the culprit.
>>
>> PR pending: https://github.com/apache/beam/pull/7171
>>
>> On 27.11.18 19:35, Kenneth Knowles wrote:
>> > I actually didn't look at this one. I filed a bunch more adjacent flake
>> > bugs. I didn't find your bug but I do see that test flaking at the same
>> > time as the others. FWIW here is the list of flakes and sickbayed
>> tests:
>> > https://issues.apache.org/jira/issues/?filter=12343195
>> >
>> > Kenn
>> >
>> > On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <ajam...@google.com
>> > <mailto:ajam...@google.com>> wrote:
>> >
>> >     +Ken,
>> >
>> >     Did you happen to look into this test? I heard that you may have
>> >     been looking into this.
>> >
>> >     On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <m...@apache.org
>> >     <mailto:m...@apache.org>> wrote:
>> >
>> >         Hi Alex,
>> >
>> >         Thanks for your help! I'm quite used to debugging
>> >         concurrent/distributed
>> >         problems. But this one is quite tricky, especially with regards
>> >         to GRPC
>> >         threads. I try to provide more information in the following.
>> >
>> >         There are two observations:
>> >
>> >         1) The problem is specifically related to how the cleanup is
>> >         performed
>> >         for the EmbeddedEnvironmentFactory. The environment is shutdown
>> >         when the
>> >         SDK Harness exists but the GRPC threads continue to linger for
>> >         some time
>> >         and may stall state processing on the next test.
>> >
>> >         If you do _not_ close DefaultJobBundleFactory, which happens
>> during
>> >         close() or dispose() in the FlinkExecutableStageFunction or
>> >         ExecutableStageDoFnOperator respectively, the tests run just
>> >         fine. I ran
>> >         1000 test runs without a single failure.
>> >
>> >         The EmbeddedEnvironment uses direct channels which are marked
>> >         experimental in GRPC. We may have to convert them to regular
>> socket
>> >         communication.
>> >
>> >         2) Try setting a conditional breakpoint in GrpcStateService
>> >         which will
>> >         never break, e.g. "false". Set it here:
>> >
>> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>> >
>> >         The tests will never fail. The SDK harness is always shutdown
>> >         correctly
>> >         at the end of the test.
>> >
>> >         Thanks,
>> >         Max
>> >
>> >         On 26.11.18 19:15, Alex Amato wrote:
>> >          > Thanks Maximilian, let me know if you need any help. Usually
>> >         I debug
>> >          > this sort of thing by pausing the IntelliJ debugger to see
>> >         all the
>> >          > different threads which are waiting on various conditions. If
>> >         you find
>> >          > any insights from that, please post them here and we can try
>> >         to figure
>> >          > out the source of the stuckness. Perhaps it may be some
>> >         concurrency
>> >          > issue leading to deadlock?
>> >          >
>> >          > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
>> >         <m...@apache.org <mailto:m...@apache.org>
>> >          > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
>> >          >
>> >          >     I couldn't fix it thus far. The issue does not seem to be
>> >         in the Flink
>> >          >     Runner but in the way the tests utilizes the EMBEDDED
>> >         environment to
>> >          >     run
>> >          >     multiple portable jobs in a row.
>> >          >
>> >          >     When it gets stuck it is in RemoteBundle#close and it is
>> >         independent of
>> >          >     the test type (batch and streaming have different
>> >         implementations).
>> >          >
>> >          >     Will give it another look tomorrow.
>> >          >
>> >          >     Thanks,
>> >          >     Max
>> >          >
>> >          >     On 22.11.18 13:07, Maximilian Michels wrote:
>> >          >      > Hi Alex,
>> >          >      >
>> >          >      > The test seems to have gotten flaky after we merged
>> >         support for
>> >          >     portable
>> >          >      > timers in Flink's batch mode.
>> >          >      >
>> >          >      > Looking into this now.
>> >          >      >
>> >          >      > Thanks,
>> >          >      > Max
>> >          >      >
>> >          >      > On 21.11.18 23:56, Alex Amato wrote:
>> >          >      >> Hello, I have noticed
>> >          >      >>
>> >         that org.apache.beam.runners.flink.PortableTimersExecutionTest
>> >          >     is very
>> >          >      >> flakey, and repro'd this test timeout on the master
>> >         branch in
>> >          >     40/50 runs.
>> >          >      >>
>> >          >      >> I filed a JIRA issue: BEAM-6111
>> >          >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I
>> >         was just
>> >          >      >> wondering if anyone knew why this may be occurring,
>> >         and to check if
>> >          >      >> anyone else has been experiencing this.
>> >          >      >>
>> >          >      >> Thanks,
>> >          >      >> Alex
>> >          >
>> >
>>
>

Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

Reply via email to