Well, here is my hacky solution; you can see the changes I made to PortableTimersExecutionTest here: https://github.com/apache/beam/pull/6786/files
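The gist of the change is roughly this (a simplified sketch with illustrative names and element types, not the exact PR code):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ConcurrentMap;

    public class KeyedResults {
      // Static, so every deserialized DoFn in the process sees the same
      // map; keyed by the test parameter, so concurrently running
      // parameterized test instances each write to their own queue.
      private static final ConcurrentMap<String, Queue<String>> RESULTS =
          new ConcurrentHashMap<>();

      public static Queue<String> resultsFor(String testParam) {
        return RESULTS.computeIfAbsent(
            testParam, k -> new ConcurrentLinkedQueue<>());
      }
    }

The DoFn writes to KeyedResults.resultsFor(testParam), and the test asserts on the same queue once the pipeline finishes.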
I don't really understand why the pipeline never starts running when I make the results object transient in PortableTimersExecutionTest. So instead I continue to access a static object, but key it by the test parameter to prevent tests from interfering with each other. I am not too sure how to proceed. I don't really want to check in this hacky solution, but I am not sure what else would solve these problems. Please let me know if you have any suggestions.
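To illustrate the serialization trap from my earlier mail quoted below: a non-static results field is captured when the DoFn is serialized, so the workers write to a deserialized copy rather than the instance the test asserts on. A minimal sketch of the failing pattern (not the actual test code):

    import java.io.Serializable;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.apache.beam.sdk.transforms.DoFn;

    public class ResultsSketch implements Serializable {
      // Instance field: the anonymous DoFn below holds a reference to the
      // enclosing object, so this queue is serialized along with the DoFn.
      private final Queue<String> results = new ConcurrentLinkedQueue<>();

      DoFn<String, Void> captureFn() {
        return new DoFn<String, Void>() {
          @ProcessElement
          public void process(ProcessContext ctx) {
            // On a worker this runs against a deserialized copy of the
            // queue, so the test's own instance never sees the elements.
            // Marking the field transient instead leaves the copy null
            // after deserialization.
            results.add(ctx.element());
          }
        };
      }
    }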
On Tue, Dec 4, 2018 at 5:26 PM Alex Amato <ajam...@google.com> wrote:

> Thanks for letting me know, Maximilian.
>
> Btw, I've been looking at this test the last few days as well, and I have
> found a few other concurrency issues that I hope to send a PR out for:
>
> - The PortableTimersExecutionTest result variable is a static ArrayList,
> but it can be written to concurrently (by multiple threads AND multiple
> parameterized test instances), which causes flakiness.
> - But just using a ConcurrentLinkedQueue and a non-static variable isn't
> sufficient, as the results object then gets copied during DoFn
> serialization. That makes all the assertions fail, since nothing gets
> written to the same results object the test is using.
> - So it should be made private transient final. However, after trying
> this I am seeing the test time out, and I am not sure why. Continuing to
> debug this.
>
> I think that my PR was increasing the flakiness, which is why I saw more
> of these issues. Just wanted to point these out in the meantime;
> hopefully it helps with debugging for you too.
>
> On Fri, Nov 30, 2018 at 7:49 AM Maximilian Michels <m...@apache.org> wrote:
>
>> This turned out to be a tricky bug. Robert and I had a joint debugging
>> session and managed to find the culprit.
>>
>> PR pending: https://github.com/apache/beam/pull/7171
>>
>> On 27.11.18 19:35, Kenneth Knowles wrote:
>>
>>> I actually didn't look at this one. I filed a bunch more adjacent flake
>>> bugs. I didn't find your bug, but I do see that test flaking at the same
>>> time as the others. FWIW, here is the list of flakes and sickbayed tests:
>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>
>>> Kenn
>>>
>>> On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <ajam...@google.com> wrote:
>>>
>>>> +Ken,
>>>>
>>>> Did you happen to look into this test? I heard that you may have been
>>>> looking into this.
>>>>
>>>> On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels <m...@apache.org> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Thanks for your help! I'm quite used to debugging concurrent/distributed
>>>>> problems, but this one is quite tricky, especially with regard to GRPC
>>>>> threads. I'll try to provide more information in the following.
>>>>>
>>>>> There are two observations:
>>>>>
>>>>> 1) The problem is specifically related to how the cleanup is performed
>>>>> for the EmbeddedEnvironmentFactory. The environment is shut down when
>>>>> the SDK Harness exits, but the GRPC threads continue to linger for some
>>>>> time and may stall state processing on the next test.
>>>>>
>>>>> If you do _not_ close DefaultJobBundleFactory, which happens during
>>>>> close() or dispose() in the FlinkExecutableStageFunction or
>>>>> ExecutableStageDoFnOperator respectively, the tests run just fine. I ran
>>>>> 1000 test runs without a single failure.
>>>>>
>>>>> The EmbeddedEnvironment uses direct channels, which are marked
>>>>> experimental in GRPC. We may have to convert them to regular socket
>>>>> communication.
>>>>>
>>>>> 2) Try setting a conditional breakpoint in GrpcStateService which will
>>>>> never break, e.g. "false". Set it here:
>>>>> https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134
>>>>>
>>>>> The tests will never fail. The SDK harness is always shut down correctly
>>>>> at the end of the test.
>>>>>
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>> On 26.11.18 19:15, Alex Amato wrote:
>>>>>
>>>>>> Thanks Maximilian, let me know if you need any help. Usually I debug
>>>>>> this sort of thing by pausing the IntelliJ debugger to see all the
>>>>>> different threads which are waiting on various conditions. If you find
>>>>>> any insights from that, please post them here and we can try to figure
>>>>>> out the source of the stuckness. Perhaps it may be some concurrency
>>>>>> issue leading to deadlock?
>>>>>>
>>>>>> On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <m...@apache.org> wrote:
>>>>>>
>>>>>>> I couldn't fix it thus far. The issue does not seem to be in the Flink
>>>>>>> Runner but in the way the test utilizes the EMBEDDED environment to
>>>>>>> run multiple portable jobs in a row.
>>>>>>>
>>>>>>> When it gets stuck, it is in RemoteBundle#close, and it is independent
>>>>>>> of the test type (batch and streaming have different implementations).
>>>>>>>
>>>>>>> Will give it another look tomorrow.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Max
>>>>>>>
>>>>>>> On 22.11.18 13:07, Maximilian Michels wrote:
>>>>>>>
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> The test seems to have gotten flaky after we merged support for
>>>>>>>> portable timers in Flink's batch mode.
>>>>>>>>
>>>>>>>> Looking into this now.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> On 21.11.18 23:56, Alex Amato wrote:
>>>>>>>>
>>>>>>>>> Hello, I have noticed that
>>>>>>>>> org.apache.beam.runners.flink.PortableTimersExecutionTest is very
>>>>>>>>> flaky, and repro'd this test timeout on the master branch in 40/50
>>>>>>>>> runs.
>>>>>>>>>
>>>>>>>>> I filed a JIRA issue: BEAM-6111
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
>>>>>>>>> wondering if anyone knew why this may be occurring, and to check if
>>>>>>>>> anyone else has been experiencing this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Alex