Thank you for the advice. Yes, the latch not being counted down is the problem. (my memo: https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll need to figure out why withOnError is not called.
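To make the hang concrete, here is a minimal sketch of the pattern as I understand it: tasks wait on a CountDownLatch that an error callback is supposed to release. It uses only java.util.concurrent, not Beam or gRPC. LatchTimeoutSketch, ErrorCallback, and runClients are hypothetical names for illustration, and the bounded await is just one possible way to avoid waiting forever, not what the actual test does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {

  /** Hypothetical stand-in for a stub client's error callback; not a Beam or gRPC type. */
  interface ErrorCallback {
    void onError(Throwable t);
  }

  /**
   * Starts clientCount tasks that each wait on one shared latch. When deliverErrors is
   * false the callbacks are never invoked, which models the suspected failure mode.
   * Returns how many tasks saw the latch reach zero before the timeout expired.
   */
  public static int runClients(int clientCount, boolean deliverErrors, long timeoutMillis)
      throws Exception {
    CountDownLatch latch = new CountDownLatch(clientCount);
    ExecutorService executor = Executors.newFixedThreadPool(clientCount);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < clientCount; i++) {
        results.add(
            executor.submit(
                () -> {
                  // The callback is the only thing that releases the latch.
                  ErrorCallback callback = t -> latch.countDown();
                  if (deliverErrors) {
                    callback.onError(new RuntimeException("simulated client failure"));
                  }
                  // Bounded wait: returns false instead of blocking forever
                  // when the latch is never fully released.
                  return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
                }));
      }
      int released = 0;
      for (Future<Boolean> result : results) {
        if (result.get()) {
          released++;
        }
      }
      return released;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("errors delivered: " + runClients(3, true, 2000) + " of 3 released");
    System.out.println("errors dropped:   " + runClients(3, false, 200) + " of 3 released");
  }
}
```

With an unbounded latch.await() in place of the timed one, the second call would never return, which matches the thread dump where all three tasks sit on the latch. The timed variant at least turns the hang into a countable failure.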
> Can you repro locally?

No, the task succeeds in my environment (./gradlew :runners:google-cloud-dataflow-java:worker:test).

On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles <k...@apache.org> wrote:

> I am not sure how much you read the code of the test. So apologies if I am
> saying things you already know. The test does something like:
>
> - start a logging service
> - set up some stub clients, each with onError wired up to release a
>   countdown latch
> - send error responses to all three of them (actually it sends the error
>   in the same task it creates the stub)
> - each task waits on the latch
>
> So if onError does not deliver or does not call to release the countdown
> latch, it will hang. I notice in the gist you provide that all three stub
> clients are hung awaiting the latch. That is suspicious to me. I would want
> to confirm if the flakiness always occurs in a way that hangs all three.
> Then there are gRPC workers waiting on empty queues, and the main test
> thread waiting for the hung tasks to complete.
>
> The problem could be something about the test set up. Personally I would
> add a ton of logs, or potentially use a debugger, to confirm exactly the
> state of things when it hangs. Can you repro locally? I think this same
> functionality could be tested in different ways that might remove some of
> the variables. For example starting up all the waiting tasks, then sending
> all the onError messages that should cause them to terminate.
>
> Since this is a unit test, adding a timeout to just that method should
> save time (but will make it harder to capture stack traces, etc). I've
> opened up https://github.com/apache/beam/pull/14781 for that. There may
> be a nice way to add a timeout to the executor to capture the hung stack,
> but I didn't look for it.
>
> Kenn
>
> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <suzt...@google.com> wrote:
>
>> gRPC 1.37.0 showed the same problem:
>> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>> waits on its tasks forever, causing a timeout in Java precommit.
>>
>> While I continue my investigation, I would appreciate it if someone knows
>> the cause of the problem. I pasted the thread dump of the Java process
>> when the test was frozen:
>> https://github.com/apache/beam/pull/14768
>>
>> If this mystery is never solved, vendoring the (slightly old) gRPC 1.32.2
>> without the jboss dependencies is an alternative option (suggested by Kenn;
>> memo
>> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
>> )
>>
>> Regards,
>> Tomo
>>
>>
>> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <suzt...@google.com> wrote:
>>
>>> I was investigating the strange timeout (
>>> https://github.com/apache/beam/pull/14474) but have been occupied with
>>> something else lately.
>>> Let me try the new version today to see if there are any improvements.
>>>
>>>
>>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>>>> python!), which made me wonder about this: what is the current status of
>>>> upgrading the vendored dependency, Tomo?
>>>>
>>>>
>>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <suzt...@google.com> wrote:
>>>>
>>>>> We observed that the cron job of Java Precommit for the master branch
>>>>> started timing out often (not always) since upgrading the gRPC version.
>>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>>
>>>>> After exchanging messages with Kenn, I reverted the change; now the
>>>>> master branch uses the vendored gRPC 1.26.
>>>>>
>>>>>
>>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <k...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Merged.
>>>>>> Let's keep an eye out for trouble, and I will incorporate it into the
>>>>>> release branch.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <suzt...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding troubleshooting the build timeout, it seems that the Docker
>>>>>>> cache on the Jenkins machines might be playing a role. As I run more
>>>>>>> "Java Presubmit", I no longer observe timeouts in the PR.
>>>>>>>
>>>>>>> Kenn, would you merge the PR?
>>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>>> including the new Java postcommit checks)
>>>>>>>
>>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <k...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, I agree this might be a good idea. This is not the only major
>>>>>>>> issue on the release-2.29.0 branch.
>>>>>>>>
>>>>>>>> The counter argument is that we will be pulling in all the bugs
>>>>>>>> introduced to `master` since the branch cut.
>>>>>>>>
>>>>>>>> As far as effort goes, I have been mostly focused on burning down
>>>>>>>> the bugs so I would not lose much work in the release process.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <ieme...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Precommit has been quite unstable in the last few days, so it is worth
>>>>>>>>> checking if something is wrong in the CI.
>>>>>>>>>
>>>>>>>>> I have a question, Kenn. Given that cherry-picking this might be a bit
>>>>>>>>> big as a change, can we just reconsider cutting the 2.29.0 branch again
>>>>>>>>> after the use of the updated gRPC version gets merged, and mark the
>>>>>>>>> issues already fixed for version 2.30.0 as fixed for version 2.29.0?
>>>>>>>>> Seems like an easier upgrade path (and we will get some nice
>>>>>>>>> fixes/improvements like official Spark 3 support for free on the
>>>>>>>>> release).
>>>>>>>>>
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <suzt...@google.com>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > Update: I observe that the Java precommit check is unstable in the
>>>>>>>>> > PR to upgrade vendored gRPC (compared with a PR with an empty
>>>>>>>>> > change). There are no constant failures; sometimes it succeeds and
>>>>>>>>> > other times it hits timeouts and flaky test failures.
>>>>>>>>> >
>>>>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki <suzt...@google.com>
>>>>>>>>> > wrote:
>>>>>>>>> >>
>>>>>>>>> >> Thank you for voting. I see the artifact available in Maven
>>>>>>>>> >> Central. I'll work on the PR to use the published artifact today.
>>>>>>>>> >>
>>>>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Update on this: there are some minor issues and then I'll send
>>>>>>>>> >>> out the RC.
>>>>>>>>> >>>
>>>>>>>>> >>> I think this is worth blocking the 2.29.0 release on, so I will do
>>>>>>>>> >>> this first. We are still eliminating other blockers from 2.29.0
>>>>>>>>> >>> anyhow.
>>>>>>>>> >>>
>>>>>>>>> >>> Kenn
>>>>>>>>> >>>
>>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki <suzt...@google.com> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi Beam developers,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I'm working on upgrading the vendored gRPC to 1.36.0
>>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>>>>> >>>> https://github.com/apache/beam/pull/14028)
>>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Background:
>>>>>>>>> >>>> After exchanging messages with Ismaël in BEAM-11227, it seems that
>>>>>>>>> >>>> the ticket created by some automation is a false positive, but it
>>>>>>>>> >>>> would be nice to use an artifact that is not marked with a CVE.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>>>>> >>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>>>>> >>>> vendored artifact.
>>>>>>>>> >>>>
>>>>>>>>> >>>> --
>>>>>>>>> >>>> Regards,
>>>>>>>>> >>>> Tomo
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Regards,
>>>>>>>>> >> Tomo
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Regards,
>>>>>>>>> > Tomo
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Tomo
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Tomo
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>

--
Regards,
Tomo