I think graceful shutdown has historically been overlooked. It would not surprise me if a few things needed to gracefully shut down the runner harness/SDK were accidentally left out.
IIRC there was also some discussion around incorrect startup (requiring a certain order of SDK process startup and runner harness startup), which may have had races as well.

On Fri, Feb 8, 2019 at 4:49 PM Brian Hulette <bhule...@google.com> wrote:

> I think I've finally got a handle on this flake, and a possible solution
> [1]. One thing that's still bothering me though is that the "CANCELLED:
> Multiplexer hanging up" errors seem to be unavoidable.
>
> They occur when the GrpcDataService is closed [2] and it closes all of
> its multiplexers, which just send an error to their outbound observers
> [3]. It seems to me that there should be a more graceful way to shut
> everything down, but I'm not seeing it. Am I missing something?
>
> grpc-java suggests using GrpcCleanupRule to gracefully shut down
> in-process servers and clients [4]. Should we be utilizing that somehow?
>
> Brian
>
> [1] https://github.com/apache/beam/pull/7794
> [2] https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/data/GrpcDataService.java#L117
> [3] https://github.com/apache/beam/tree/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataGrpcMultiplexer.java#L112
> [4] https://github.com/grpc/grpc-java/blob/master/examples/README.md#unit-test-examples
>
> On Thu, Feb 7, 2019 at 11:49 AM Brian Hulette <bhule...@google.com> wrote:
>
>> This was already reported in BEAM-6512 [1], which Scott gave me as a
>> starter bug. I haven't been able to reproduce locally, so I'm trying to see
>> if I can get it to fail on Jenkins again with some additional logging [2].
>>
>> Definitely interested in others' thoughts on this, I only vaguely
>> understand what's going on. So far the only headway I've made is noticing
>> that the "CANCELLED: Multiplexer hanging up" error seems to always occur
>> exactly three times in failing tests. Successful runs may have one or two
>> of these messages but never three.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6512
>> [2] https://github.com/apache/beam/pull/7767
>>
>> On Tue, Feb 5, 2019 at 9:50 AM Alex Amato <ajam...@google.com> wrote:
>>
>>> org.apache.beam.runners.fnexecution.data.GrpcDataServiceTest.testMessageReceivedBySingleClientWhenThereAreMultipleClients
>>>
>>> I keep seeing this test failing in my PRs
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/4018/
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/4018/testReport/junit/org.apache.beam.runners.fnexecution.data/GrpcDataServiceTest/testMessageReceivedBySingleClientWhenThereAreMultipleClients/
>>>
>>> I've seen this one come and go for a few weeks or so. I am unsure
>>> exactly when it first occurred.
>>>