[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259346#comment-16259346 ] ASF GitHub Bot commented on FLINK-7880: --- GitHub user kl0u opened a pull request: https://github.com/apache/flink/pull/5038 [FLINK-7880][FLINK-7975][FLINK-7974][QS] QS test instability fix. This is a follow-up on https://github.com/apache/flink/pull/4993. It contains one additional commit that makes the QS ITcases to wait for proper clean-up before exiting. R: @aljoscha @zentol You can merge this pull request into a Git repository by running: $ git pull https://github.com/kl0u/flink qs-shutdown-fin-fin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/5038.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5038 commit 15c39807f560f6ee4f34e4ad1ba245b66dd10b09 Author: kkloudas Date: 2017-11-09T18:21:43Z [FLINK-7975][QS] Wait for QS client to shutdown. commit 689502e70cced18a9cb5b2b6435ff500b7bf70dd Author: kkloudas Date: 2017-11-09T18:30:29Z [FLINK-7974][QS] Wait for QS abstract server to shutdown. commit 5c0e3fc82a88768caa750367fbf02e24848b3b53 Author: kkloudas Date: 2017-11-20T13:28:24Z [FLINK-7880][QS] Wait for proper resource cleanup after each ITCase. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259424#comment-16259424 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u closed the pull request at: https://github.com/apache/flink/pull/5038 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260866#comment-16260866 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152298240 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -312,32 +345,43 @@ private void handInChannel(Channel channel) { /** * Close the connecting channel with a ClosedChannelException. */ - private void close() { - close(new ClosedChannelException()); + private CompletableFuture close() { + return close(new ClosedChannelException()); } /** * Close the connecting channel with an Exception (can be {@code null}) * or forward to the established channel. */ - private void close(Throwable cause) { - synchronized (connectLock) { - if (!closed) { - if (failureCause == null) { - failureCause = cause; - } + private CompletableFuture close(Throwable cause) { + CompletableFuture future = new CompletableFuture<>(); + if (connectionShutdownFuture.compareAndSet(null, future)) { + synchronized (connectLock) { + if (!closed) { --- End diff -- this seems unnecessary, doesn't the check at L358 guarantee that the entire branch is only executed once? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260870#comment-16260870 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152298612 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -386,6 +430,9 @@ private PendingRequest(REQ request) { /** Reference to a failure that was reported by the channel. */ private final AtomicReference failureCause = new AtomicReference<>(); + /** Atomic shut down future. */ + private final AtomicReference> connectionShutdownFuture = new AtomicReference<>(null); --- End diff -- why does this one suddenly return a boolean? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260871#comment-16260871 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152297601 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -312,32 +345,43 @@ private void handInChannel(Channel channel) { /** * Close the connecting channel with a ClosedChannelException. */ - private void close() { - close(new ClosedChannelException()); + private CompletableFuture close() { + return close(new ClosedChannelException()); } /** * Close the connecting channel with an Exception (can be {@code null}) * or forward to the established channel. */ - private void close(Throwable cause) { - synchronized (connectLock) { - if (!closed) { - if (failureCause == null) { - failureCause = cause; - } + private CompletableFuture close(Throwable cause) { --- End diff -- same as above > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260867#comment-16260867 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152299398 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java --- @@ -95,15 +97,20 @@ private static final Logger LOG = LoggerFactory.getLogger(ClientTest.class); + private static final FiniteDuration TEST_TIMEOUT = new FiniteDuration(20L, TimeUnit.SECONDS); + // Thread pool for client bootstrap (shared between tests) - private static final NioEventLoopGroup NIO_GROUP = new NioEventLoopGroup(); + private NioEventLoopGroup nioGroup; - private static final FiniteDuration TEST_TIMEOUT = new FiniteDuration(10L, TimeUnit.SECONDS); + @Before + public void setUp() throws Exception { + nioGroup = new NioEventLoopGroup(); --- End diff -- you could just write `private NioEventLoopGroup nioGroup = new NioEventLoopGroup();` and remove the `@Before` method > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260868#comment-16260868 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152294984 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { --- End diff -- should be typed to `Void`. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260869#comment-16260869 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5038#discussion_r152300724 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { + + final List> connectionFutures = new ArrayList<>(); + for (Map.Entry conn : establishedConnections.entrySet()) { if (establishedConnections.remove(conn.getKey(), conn.getValue())) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } for (Map.Entry conn : pendingConnections.entrySet()) { if (pendingConnections.remove(conn.getKey()) != null) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + CompletableFuture.allOf( + connectionFutures.toArray(new CompletableFuture[connectionFutures.size()]) + ).whenComplete((result, throwable) -> { + if (throwable != null) { + newShutdownFuture.completeExceptionally(throwable); + } else if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + newShutdownFuture.complete(null); + } else { + newShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + newShutdownFuture.complete(null); + } + } else { + newShutdownFuture.complete(null); } + }); + + // check again if in the meantime another thread completed the future + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { --- End diff -- where in close() do we set the shutdown future to null? I only see that being done in sendRequest. (which seems fishy) > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/t
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264379#comment-16264379 ] ASF GitHub Bot commented on FLINK-7880: --- GitHub user kl0u opened a pull request: https://github.com/apache/flink/pull/5062 [FLINK-7880][QS] Wait for proper resource cleanup after each ITCase. R @aljoscha You can merge this pull request into a Git repository by running: $ git pull https://github.com/kl0u/flink qs-shutdown-fin-fin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/5062.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5062 commit fd8ea759321b83022497e25e2dfa56ac6d976e15 Author: kkloudas Date: 2017-11-09T18:21:43Z [FLINK-7975][QS] Wait for QS client to shutdown. commit a474071d5810562326930a147f0ecdcfe506f13d Author: kkloudas Date: 2017-11-09T18:30:29Z [FLINK-7974][QS] Wait for QS abstract server to shutdown. commit 2ffe223a8c8d89efbee0e08f8b90ea7b58fbf9df Author: kkloudas Date: 2017-11-20T13:28:24Z [FLINK-7880][QS] Wait for proper resource cleanup after each ITCase. commit f7f8d2a6ebbaf464200c44a902e5b829c88f9499 Author: kkloudas Date: 2017-11-23T13:20:32Z [hotfix][QS] ITCase refactoring. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265060#comment-16265060 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on the issue: https://github.com/apache/flink/pull/5062 This is a follow-up to https://github.com/apache/flink/pull/5062 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270715#comment-16270715 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/5062 Your follow-up reference points to this very PR :P > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270718#comment-16270718 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on the issue: https://github.com/apache/flink/pull/5062 Fixed @zentol ! > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270847#comment-16270847 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153804523 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java --- @@ -434,13 +432,16 @@ public Integer getKey(Tuple2 value) throws Exception { * contains a wrong jobId or wrong queryable state name. */ @Test + @Ignore --- End diff -- why? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270829#comment-16270829 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153790466 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java --- @@ -86,77 +83,89 @@ public void testServerInitializationFailure() throws Throwable { */ @Test public void testPortRangeSuccess() throws Throwable { - TestServer server1 = null; - TestServer server2 = null; - Client client = null; - try { - server1 = startServer("Test Server 1", , 7778, 7779); + // this is shared between the two servers. + AtomicKvStateRequestStats serverStats = new AtomicKvStateRequestStats(); + AtomicKvStateRequestStats clientStats = new AtomicKvStateRequestStats(); + + List portList = new ArrayList<>(); + portList.add(); + portList.add(7778); + portList.add(7779); + + try ( + TestServer server1 = new TestServer("Test Server 1", serverStats, portList.iterator()); + TestServer server2 = new TestServer("Test Server 2", serverStats, portList.iterator()); + TestClient client = new TestClient( + "Test Client", + 1, + new MessageSerializer<>(new TestMessage.TestMessageDeserializer(), new TestMessage.TestMessageDeserializer()), + clientStats + ) + ) { + server1.start(); Assert.assertEquals(L, server1.getServerAddress().getPort()); --- End diff -- these should be removed, or generalized to `portRangeStart <= server1.getServerAddress().getPort() <= portRangeEnd`. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270838#comment-16270838 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153793887 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/HAAbstractQueryableStateTestBase.java --- @@ -80,19 +80,18 @@ public static void setup(int proxyPortRangeStart, int serverPortRangeStart) { @AfterClass public static void tearDown() { - if (cluster != null) { + try { + client.shutdownAndWait(); + cluster.stop(); cluster.awaitTermination(); - } - try { zkServer.stop(); zkServer.close(); - } catch (Exception e) { + + } catch (Throwable e) { --- End diff -- try-catch block is unnecessary > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270842#comment-16270842 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153799484 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { --- End diff -- calling shutdown twice in a row will result in a future being returned that never completes. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270832#comment-16270832 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153794621 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/client/QueryableStateClient.java --- @@ -108,9 +108,33 @@ public QueryableStateClient(final InetAddress remoteAddress, final int remotePor new DisabledKvStateRequestStats()); } - /** Shuts down the client. */ - public void shutdown() { - client.shutdown(); + /** +* Shuts down the client and returns a {@link CompletableFuture} that +* will be completed when the shutdown process is completed. +* +* If an exception is thrown for any reason, then the returned future +* will be completed exceptionally with that exception. +* +* @return A {@link CompletableFuture} for further handling of the +* shutdown result. +*/ + public CompletableFuture shutdownAndHandle() { + return client.shutdown(); + } + + /** +* Shuts down the client and waits until shutdown is completed. +* +* If an exception is thrown, a warning is printed containing --- End diff -- the warning is logged, not printed (generally implies stdout). > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270823#comment-16270823 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153785916 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { + + final List> connectionFutures = new ArrayList<>(); + for (Map.Entry conn : establishedConnections.entrySet()) { if (establishedConnections.remove(conn.getKey(), conn.getValue())) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } for (Map.Entry conn : pendingConnections.entrySet()) { if (pendingConnections.remove(conn.getKey()) != null) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + CompletableFuture.allOf( + connectionFutures.toArray(new CompletableFuture[connectionFutures.size()]) + ).whenComplete((result, throwable) -> { + if (throwable != null) { + newShutdownFuture.completeExceptionally(throwable); + } else if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + newShutdownFuture.complete(null); + } else { + newShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + newShutdownFuture.complete(null); + } + } else { + newShutdownFuture.complete(null); } + }); + + // check again if in the meantime another thread completed the future + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { --- End diff -- This can never be true. We can only arrive here if `clientShutdownFuture.compareAndSet(null, newShutdownFuture)` succeeds, and there is no other code path that ever sets it to null again. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-jav
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270826#comment-16270826 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153789622 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java --- @@ -218,7 +225,18 @@ public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception assertEquals(expectedRequests, stats.getNumFailed()); } finally { if (client != null) { - client.shutdown(); + try { + + // todo here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + client.shutdown().get(10L, TimeUnit.SECONDS); + } catch (Exception e) { + e.printStackTrace(); + } + Assert.assertTrue(client.isEventGroupShutdown()); --- End diff -- I would prefer if we would throw `e` in this and similar cases. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270848#comment-16270848 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153799197 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { + + final List> connectionFutures = new ArrayList<>(); + for (Map.Entry conn : establishedConnections.entrySet()) { if (establishedConnections.remove(conn.getKey(), conn.getValue())) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } for (Map.Entry conn : pendingConnections.entrySet()) { if (pendingConnections.remove(conn.getKey()) != null) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + CompletableFuture.allOf( + connectionFutures.toArray(new CompletableFuture[connectionFutures.size()]) + ).whenComplete((result, throwable) -> { + if (throwable != null) { + newShutdownFuture.completeExceptionally(throwable); + } else if (bootstrap != null) { --- End diff -- this means that the bootstrap is not shutdown if a connection cannot be closed for whatever reason, which is rather odd. We should still at least try to shut it down. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270825#comment-16270825 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153790485 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java --- @@ -86,77 +83,89 @@ public void testServerInitializationFailure() throws Throwable { */ @Test public void testPortRangeSuccess() throws Throwable { - TestServer server1 = null; - TestServer server2 = null; - Client client = null; - try { - server1 = startServer("Test Server 1", , 7778, 7779); + // this is shared between the two servers. + AtomicKvStateRequestStats serverStats = new AtomicKvStateRequestStats(); + AtomicKvStateRequestStats clientStats = new AtomicKvStateRequestStats(); + + List portList = new ArrayList<>(); + portList.add(); + portList.add(7778); + portList.add(7779); + + try ( + TestServer server1 = new TestServer("Test Server 1", serverStats, portList.iterator()); + TestServer server2 = new TestServer("Test Server 2", serverStats, portList.iterator()); + TestClient client = new TestClient( + "Test Client", + 1, + new MessageSerializer<>(new TestMessage.TestMessageDeserializer(), new TestMessage.TestMessageDeserializer()), + clientStats + ) + ) { + server1.start(); Assert.assertEquals(L, server1.getServerAddress().getPort()); - server2 = startServer("Test Server 2", , 7778, 7779); + server2.start(); Assert.assertEquals(7778L, server2.getServerAddress().getPort()); --- End diff -- same as above > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270827#comment-16270827 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153790032 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java --- @@ -60,23 +63,17 @@ public void testServerInitializationFailure() throws Throwable { expectedEx.expect(FlinkRuntimeException.class); expectedEx.expectMessage("Unable to start Test Server 2. All ports in provided range are occupied."); - TestServer server1 = null; - TestServer server2 = null; - try { + List portList = new ArrayList<>(); + portList.add(); - server1 = startServer("Test Server 1", ); + try ( + TestServer server1 = new TestServer("Test Server 1", new DisabledKvStateRequestStats(), portList.iterator()); --- End diff -- a safer pattern is to give `server1` multiple ports to choose from and specifically start `server2` on the port `server1` eventually started with. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270840#comment-16270840 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153793789 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/NonHAAbstractQueryableStateTestBase.java --- @@ -68,11 +68,14 @@ public static void setup(int proxyPortRangeStart, int serverPortRangeStart) { @AfterClass public static void tearDown() { try { + client.shutdownAndWait(); + cluster.stop(); - } catch (Exception e) { + cluster.awaitTermination(); + + } catch (Throwable e) { --- End diff -- try-catch block is unnecessary > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270846#comment-16270846 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153802908 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -486,27 +542,25 @@ private boolean close(Throwable cause) { @Override public void onRequestResult(long requestId, RESP response) { TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.complete(response)) { + if (pending != null && !pending.isDone()) { long durationMillis = (System.nanoTime() - pending.getTimestamp()) / 1_000_000L; stats.reportSuccessfulRequest(durationMillis); + pending.complete(response); } } @Override public void onRequestFailure(long requestId, Throwable cause) { TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.completeExceptionally(cause)) { + if (pending != null && !pending.isDone()) { stats.reportFailedRequest(); + pending.completeExceptionally(cause); } } @Override public void onFailure(Throwable cause) { - if (close(cause)) { - // Remove from established channels, otherwise future - // requests will be handled by this failed channel. - establishedConnections.remove(serverAddress, this); - } + close(cause).thenAccept(cancelled -> establishedConnections.remove(serverAddress, this)); --- End diff -- shouldn't we remove the connection in any case, since if we can't close *something* is probably wrong with it anyway? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270843#comment-16270843 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153801762 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -422,20 +467,31 @@ void close() { * @param cause The cause to close the channel with. * @return Channel close future */ - private boolean close(Throwable cause) { - if (failureCause.compareAndSet(null, cause)) { - channel.close(); - stats.reportInactiveConnection(); + private CompletableFuture close(Throwable cause) { + final CompletableFuture shutdownFuture = new CompletableFuture<>(); - for (long requestId : pendingRequests.keySet()) { - TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.completeExceptionally(cause)) { - stats.reportFailedRequest(); + if (connectionShutdownFuture.compareAndSet(null, shutdownFuture) && --- End diff -- this can result in odd race conditions, where one thread has set the shutdownFuture, but another one the failureCause. I would suggest to create a container for the future and exception and update them atomically. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270831#comment-16270831 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153786058 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -300,7 +333,7 @@ private void handInChannel(Channel channel) { // Check shut down for possible race with shut down. We // don't want any lingering connections after shut down, // which can happen if we don't check this here. - if (shutDown.get()) { + if (!clientShutdownFuture.compareAndSet(null, null)) { --- End diff -- for clarity we should call `get()` instead of `compareAndSet` since the setting part is a no-op anyway. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270839#comment-16270839 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153798768 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); --- End diff -- calling shutdown twice in a row will result in a future being returned that never completes. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270833#comment-16270833 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153794159 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/main/java/org/apache/flink/queryablestate/server/KvStateServerImpl.java --- @@ -101,6 +103,12 @@ public InetSocketAddress getServerAddress() { @Override public void shutdown() { - super.shutdown(); + try { + Time timeout = Time.seconds(10L); + shutdownServer(timeout).get(timeout.toMilliseconds(), TimeUnit.MILLISECONDS); --- End diff -- why not just `get(10, TimeUnit.SECONDS)`? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270824#comment-16270824 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153788979 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java --- @@ -95,15 +97,20 @@ private static final Logger LOG = LoggerFactory.getLogger(ClientTest.class); + private static final FiniteDuration TEST_TIMEOUT = new FiniteDuration(20L, TimeUnit.SECONDS); + // Thread pool for client bootstrap (shared between tests) - private static final NioEventLoopGroup NIO_GROUP = new NioEventLoopGroup(); + private NioEventLoopGroup nioGroup; - private static final FiniteDuration TEST_TIMEOUT = new FiniteDuration(10L, TimeUnit.SECONDS); + @Before + public void setUp() throws Exception { + nioGroup = new NioEventLoopGroup(); --- End diff -- this method isn't strictly necessary; you can also do `private final NioEventLoopGroup nioGroup = new NioEventLoopGroup()`. Then you would also no longer need the null-check in `@After`. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270836#comment-16270836 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153798113 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { + log.info("Shutting down {} @ {}", serverName, serverAddress); + + final CompletableFuture groupShutdownFuture = new CompletableFuture<>(); + if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + groupShutdownFuture.complete(null); + } else { + groupShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + groupShutdownFuture.complete(null); + } + } else { + groupShutdownFuture.complete(null); + } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + final CompletableFuture handlerShutdownFuture = new CompletableFuture<>(); + if (handler == null) { + handlerShutdownFuture.complete(null); + } else { + handler.shutdown().whenComplete((result, throwable) -> { + if (throwable != null) { + handlerShutdownFuture.completeExceptionally(throwable); + } else { + handlerShutdownFuture.complete(null); + } + }); } + + final CompletableFuture queryExecShutdownFuture = Completa
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270830#comment-16270830 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153786954 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -312,32 +345,41 @@ private void handInChannel(Channel channel) { /** * Close the connecting channel with a ClosedChannelException. */ - private void close() { - close(new ClosedChannelException()); + private CompletableFuture close() { + return close(new ClosedChannelException()); } /** * Close the connecting channel with an Exception (can be {@code null}) * or forward to the established channel. */ - private void close(Throwable cause) { - synchronized (connectLock) { - if (!closed) { + private CompletableFuture close(Throwable cause) { + CompletableFuture future = new CompletableFuture<>(); + if (connectionShutdownFuture.compareAndSet(null, future)) { + synchronized (connectLock) { if (failureCause == null) { failureCause = cause; } + closed = true; --- End diff -- this field should be redundant, as `(connectionShutdownFuture.get() != null) == closed` should always hold. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270837#comment-16270837 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153797851 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { --- End diff -- i would remove this timeout argument and simply define large timeout for shutting down the `queryExecutor`. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270844#comment-16270844 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153801913 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -422,20 +467,31 @@ void close() { * @param cause The cause to close the channel with. * @return Channel close future */ - private boolean close(Throwable cause) { - if (failureCause.compareAndSet(null, cause)) { - channel.close(); - stats.reportInactiveConnection(); + private CompletableFuture close(Throwable cause) { + final CompletableFuture shutdownFuture = new CompletableFuture<>(); - for (long requestId : pendingRequests.keySet()) { - TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.completeExceptionally(cause)) { - stats.reportFailedRequest(); + if (connectionShutdownFuture.compareAndSet(null, shutdownFuture) && --- End diff -- Let's also log other shutdown attempts as DEBUG. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270845#comment-16270845 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153804310 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java --- @@ -359,12 +355,12 @@ public Integer getKey(Tuple2 value) throws Exception { } finally { // Free cluster resources if (jobId != null) { - scala.concurrent.Future cancellation = cluster - .getLeaderGateway(deadline.timeLeft()) + cluster.getLeaderGateway(deadline.timeLeft()) .ask(new JobManagerMessages.CancelJob(jobId), deadline.timeLeft()) - .mapTo(ClassTag$.MODULE$.apply(JobManagerMessages.CancellationSuccess.class)); + .mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class)); - Await.ready(cancellation, deadline.timeLeft()); + // we are not waiting for the cancellation to happen because the + // job has actually failed, as tested above. --- End diff -- you can't guarantee this in the `finally` block. (for example if submitJobDetached failed but the job is actually running) > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270841#comment-16270841 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153798592 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -133,7 +134,7 @@ public String getClientName() { } public CompletableFuture sendRequest(final InetSocketAddress serverAddress, final REQ request) { - if (shutDown.get()) { + if (!clientShutdownFuture.compareAndSet(null, null)) { --- End diff -- this should use ´clientShutdownFuture.get() != null` for clarity. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270828#comment-16270828 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153794265 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/main/java/org/apache/flink/queryablestate/client/proxy/KvStateClientProxyImpl.java --- @@ -96,7 +98,13 @@ public void start() throws Throwable { @Override public void shutdown() { - super.shutdown(); + try { + Time timeout = Time.seconds(10L); + shutdownServer(timeout).get(timeout.toMilliseconds(), TimeUnit.MILLISECONDS); --- End diff -- why not just get(10, TimeUnit.SECONDS)? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270835#comment-16270835 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153797256 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check --- End diff -- we don't check anything here. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270834#comment-16270834 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r153796142 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -205,6 +212,11 @@ public void start() throws Throwable { private boolean attemptToBind(final int port) throws Throwable { log.debug("Attempting to start {} on port {}.", serverName, port); + // here we reset the future every time because in case of failure + // to bind, we call shutdown here and this may interfere with future + // shutdown attempts. + this.serverShutdownFuture.getAndSet(null); --- End diff -- If we intend to only support a single start of a server this should use compareAndSet and fail immediately if a shutdown is in progress. If we intend to support multiple starts of a server we should guard start() and shutdown() with a lock since otherwise you easily end up in an inconsistent state. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278246#comment-16278246 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154884801 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { --- End diff -- Here the idea is that we atomically set the future the first time (`serverShutdownFuture.compareAndSet(null, shutdownFuture)`), and then, at each subsequent call we return that exact future (`serverShutdownFuture.get();`). So I cannot see how this can lead to returning a future that never completes. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278253#comment-16278253 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154885293 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { + log.info("Shutting down {} @ {}", serverName, serverAddress); + + final CompletableFuture groupShutdownFuture = new CompletableFuture<>(); + if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + groupShutdownFuture.complete(null); + } else { + groupShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + groupShutdownFuture.complete(null); + } + } else { + groupShutdownFuture.complete(null); + } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + final CompletableFuture handlerShutdownFuture = new CompletableFuture<>(); + if (handler == null) { + handlerShutdownFuture.complete(null); + } else { + handler.shutdown().whenComplete((result, throwable) -> { + if (throwable != null) { + handlerShutdownFuture.completeExceptionally(throwable); + } else { + handlerShutdownFuture.complete(null); + } + }); } + + final CompletableFuture queryExecShutdownFuture = Completabl
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278254#comment-16278254 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154885503 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -133,7 +134,7 @@ public String getClientName() { } public CompletableFuture sendRequest(final InetSocketAddress serverAddress, final REQ request) { - if (shutDown.get()) { + if (!clientShutdownFuture.compareAndSet(null, null)) { --- End diff -- Well, I agree that the code looks clearer, but the `compareAndSet` guarantees atomicity. While `get()` and then compare, does not. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278256#comment-16278256 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154885936 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); --- End diff -- This is the same pattern as in the case of the `AbstractServerBase` ;) > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278265#comment-16278265 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154886964 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -166,28 +167,57 @@ public String getClientName() { * Shuts down the client and closes all connections. * * After a call to this method, all returned futures will be failed. +* +* @return A {@link CompletableFuture} that will be completed when the shutdown process is done. */ - public void shutdown() { - if (shutDown.compareAndSet(false, true)) { + public CompletableFuture shutdown() { + final CompletableFuture newShutdownFuture = new CompletableFuture<>(); + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { + + final List> connectionFutures = new ArrayList<>(); + for (Map.Entry conn : establishedConnections.entrySet()) { if (establishedConnections.remove(conn.getKey(), conn.getValue())) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } for (Map.Entry conn : pendingConnections.entrySet()) { if (pendingConnections.remove(conn.getKey()) != null) { - conn.getValue().close(); + connectionFutures.add(conn.getValue().close()); } } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + CompletableFuture.allOf( + connectionFutures.toArray(new CompletableFuture[connectionFutures.size()]) + ).whenComplete((result, throwable) -> { + if (throwable != null) { + newShutdownFuture.completeExceptionally(throwable); + } else if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + newShutdownFuture.complete(null); + } else { + newShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + newShutdownFuture.complete(null); + } + } else { + newShutdownFuture.complete(null); } + }); + + // check again if in the meantime another thread completed the future + if (clientShutdownFuture.compareAndSet(null, newShutdownFuture)) { --- End diff -- You are right! > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278273#comment-16278273 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154888733 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -133,7 +134,7 @@ public String getClientName() { } public CompletableFuture sendRequest(final InetSocketAddress serverAddress, final REQ request) { - if (shutDown.get()) { + if (!clientShutdownFuture.compareAndSet(null, null)) { --- End diff -- But the atomicity doesn't get you anything. Which case wouldn't be covered? This check can only handle the case when sendRequest() is called after shutdown(), regardless of whether you use compareAndSet() or get(), so it shouldn't make a difference. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278285#comment-16278285 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154890227 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -312,32 +345,41 @@ private void handInChannel(Channel channel) { /** * Close the connecting channel with a ClosedChannelException. */ - private void close() { - close(new ClosedChannelException()); + private CompletableFuture close() { + return close(new ClosedChannelException()); } /** * Close the connecting channel with an Exception (can be {@code null}) * or forward to the established channel. */ - private void close(Throwable cause) { - synchronized (connectLock) { - if (!closed) { + private CompletableFuture close(Throwable cause) { + CompletableFuture future = new CompletableFuture<>(); + if (connectionShutdownFuture.compareAndSet(null, future)) { + synchronized (connectLock) { if (failureCause == null) { failureCause = cause; } + closed = true; --- End diff -- Yes. I will remove that. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278292#comment-16278292 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154890905 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -486,27 +542,25 @@ private boolean close(Throwable cause) { @Override public void onRequestResult(long requestId, RESP response) { TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.complete(response)) { + if (pending != null && !pending.isDone()) { long durationMillis = (System.nanoTime() - pending.getTimestamp()) / 1_000_000L; stats.reportSuccessfulRequest(durationMillis); + pending.complete(response); } } @Override public void onRequestFailure(long requestId, Throwable cause) { TimestampedCompletableFuture pending = pendingRequests.remove(requestId); - if (pending != null && pending.completeExceptionally(cause)) { + if (pending != null && !pending.isDone()) { stats.reportFailedRequest(); + pending.completeExceptionally(cause); } } @Override public void onFailure(Throwable cause) { - if (close(cause)) { - // Remove from established channels, otherwise future - // requests will be handled by this failed channel. - establishedConnections.remove(serverAddress, this); - } + close(cause).thenAccept(cancelled -> establishedConnections.remove(serverAddress, this)); --- End diff -- Yes. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278584#comment-16278584 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154950558 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { + log.info("Shutting down {} @ {}", serverName, serverAddress); + + final CompletableFuture groupShutdownFuture = new CompletableFuture<>(); + if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + groupShutdownFuture.complete(null); + } else { + groupShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + groupShutdownFuture.complete(null); + } + } else { + groupShutdownFuture.complete(null); + } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + final CompletableFuture handlerShutdownFuture = new CompletableFuture<>(); + if (handler == null) { + handlerShutdownFuture.complete(null); + } else { + handler.shutdown().whenComplete((result, throwable) -> { + if (throwable != null) { + handlerShutdownFuture.completeExceptionally(throwable); + } else { + handlerShutdownFuture.complete(null); + } + }); } + + final CompletableFuture queryExecShutdownFuture = Completa
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278593#comment-16278593 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154950988 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java --- @@ -191,9 +196,12 @@ public String getClientName() { CompletableFuture.allOf( connectionFutures.toArray(new CompletableFuture[connectionFutures.size()]) ).whenComplete((result, throwable) -> { + if (throwable != null) { - newShutdownFuture.completeExceptionally(throwable); - } else if (bootstrap != null) { + LOG.warn("Problem while shutting down the connections at the {}: {}", clientName, throwable); --- End diff -- `LOG.warn("Problem while shutting down the connections at the {}", clientName, throwable);` instead? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278594#comment-16278594 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154952648 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java --- @@ -66,14 +65,15 @@ public void testServerInitializationFailure() throws Throwable { List portList = new ArrayList<>(); portList.add(); - try ( - TestServer server1 = new TestServer("Test Server 1", new DisabledKvStateRequestStats(), portList.iterator()); - TestServer server2 = new TestServer("Test Server 2", new DisabledKvStateRequestStats(), portList.iterator()) - ) { + try (TestServer server1 = new TestServer("Test Server 1", new DisabledKvStateRequestStats(), portList.iterator())) { server1.start(); - Assert.assertEquals(L, server1.getServerAddress().getPort()); - server2.start(); + List occupiedPortList = new ArrayList<>(); + occupiedPortList.add(server1.getServerAddress().getPort()); --- End diff -- could use `Collections.singletonList()` here > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278791#comment-16278791 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r154955042 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java --- @@ -1480,7 +1259,52 @@ public String fold(String accumulator, Tuple2 value) throws Excep / General Utility Methods // - private CompletableFuture notifyWhenJobStatusIs( + /** +* A wrapper of the job graph that makes sure to cancell the job and wait for --- End diff -- typo: cancell -> cancel > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280005#comment-16280005 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155201806 --- Diff: flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java --- @@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws Throwable { throw future.cause(); } catch (BindException e) { log.debug("Failed to start {} on port {}: {}.", serverName, port, e.getMessage()); - shutdown(); + try { + shutdownServer(Time.seconds(10L)).get(); + } catch (Exception r) { + + // Here we were seeing this problem: + // https://github.com/netty/netty/issues/4357 if we do a get(). + // this is why we now simply wait a bit so that everything is + // shut down and then we check + + log.warn("Problem while shutting down {}: {}", serverName, r.getMessage()); + } } // any other type of exception we let it bubble up. return false; } /** * Shuts down the server and all related thread pools. +* @param timeout The time to wait for the shutdown process to complete. +* @return A {@link CompletableFuture} that will be completed upon termination of the shutdown process. */ - public void shutdown() { - log.info("Shutting down {} @ {}", serverName, serverAddress); - - if (handler != null) { - handler.shutdown(); - handler = null; - } - - if (queryExecutor != null) { - queryExecutor.shutdown(); - } + public CompletableFuture shutdownServer(Time timeout) throws InterruptedException { + CompletableFuture shutdownFuture = new CompletableFuture<>(); + if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) { + log.info("Shutting down {} @ {}", serverName, serverAddress); + + final CompletableFuture groupShutdownFuture = new CompletableFuture<>(); + if (bootstrap != null) { + EventLoopGroup group = bootstrap.group(); + if (group != null && !group.isShutdown()) { + group.shutdownGracefully(0L, 0L, TimeUnit.MILLISECONDS) + .addListener(finished -> { + if (finished.isSuccess()) { + groupShutdownFuture.complete(null); + } else { + groupShutdownFuture.completeExceptionally(finished.cause()); + } + }); + } else { + groupShutdownFuture.complete(null); + } + } else { + groupShutdownFuture.complete(null); + } - if (bootstrap != null) { - EventLoopGroup group = bootstrap.group(); - if (group != null) { - group.shutdownGracefully(0L, 10L, TimeUnit.SECONDS); + final CompletableFuture handlerShutdownFuture = new CompletableFuture<>(); + if (handler == null) { + handlerShutdownFuture.complete(null); + } else { + handler.shutdown().whenComplete((result, throwable) -> { + if (throwable != null) { + handlerShutdownFuture.completeExceptionally(throwable); + } else { + handlerShutdownFuture.complete(null); + } + }); } + + final CompletableFuture queryExecShutdownFuture = Completabl
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280057#comment-16280057 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155212313 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java --- @@ -359,12 +355,12 @@ public Integer getKey(Tuple2 value) throws Exception { } finally { // Free cluster resources if (jobId != null) { - scala.concurrent.Future cancellation = cluster - .getLeaderGateway(deadline.timeLeft()) + cluster.getLeaderGateway(deadline.timeLeft()) .ask(new JobManagerMessages.CancelJob(jobId), deadline.timeLeft()) - .mapTo(ClassTag$.MODULE$.apply(JobManagerMessages.CancellationSuccess.class)); + .mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class)); - Await.ready(cancellation, deadline.timeLeft()); + // we are not waiting for the cancellation to happen because the + // job has actually failed, as tested above. --- End diff -- You are right. So I will just remove the `finally` block and check if the status of the job is `FAILED`. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280061#comment-16280061 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155212885 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/NonHAAbstractQueryableStateTestBase.java --- @@ -68,11 +68,14 @@ public static void setup(int proxyPortRangeStart, int serverPortRangeStart) { @AfterClass public static void tearDown() { try { + client.shutdownAndWait(); + cluster.stop(); - } catch (Exception e) { + cluster.awaitTermination(); + + } catch (Throwable e) { --- End diff -- Done. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280060#comment-16280060 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155212865 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/HAAbstractQueryableStateTestBase.java --- @@ -80,19 +80,18 @@ public static void setup(int proxyPortRangeStart, int serverPortRangeStart) { @AfterClass public static void tearDown() { - if (cluster != null) { + try { + client.shutdownAndWait(); + cluster.stop(); cluster.awaitTermination(); - } - try { zkServer.stop(); zkServer.close(); - } catch (Exception e) { + + } catch (Throwable e) { --- End diff -- Done. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280125#comment-16280125 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155224199 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java --- @@ -224,6 +225,7 @@ public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception assertEquals(expectedRequests, stats.getNumSuccessful()); assertEquals(expectedRequests, stats.getNumFailed()); } finally { + Exception exc = null; --- End diff -- this could be moved into the if block. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280124#comment-16280124 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155224312 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java --- @@ -234,9 +236,14 @@ public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception client.shutdown().get(10L, TimeUnit.SECONDS); } catch (Exception e) { - e.printStackTrace(); + exc = e; + LOG.error(ExceptionUtils.stringifyException(e)); --- End diff -- use `LOG.error("An exception occurred while shutting down netty.", e)` instead > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280126#comment-16280126 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/5062#discussion_r155225199 --- Diff: flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java --- @@ -260,89 +260,90 @@ public void testDuplicateRegistrationFailsJob() throws Exception { final Deadline deadline = TEST_TIMEOUT.fromNow(); final int numKeys = 256; - JobID jobId = null; + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + env.setStateBackend(stateBackend); + env.setParallelism(maxParallelism); + // Very important, because cluster is shared between tests and we + // don't explicitly check that all slots are available before + // submitting. + env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 1000L)); - try { - // - // Test program - // - StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); - env.setStateBackend(stateBackend); - env.setParallelism(maxParallelism); - // Very important, because cluster is shared between tests and we - // don't explicitly check that all slots are available before - // submitting. - env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 1000L)); - - DataStream> source = env - .addSource(new TestKeyRangeSource(numKeys)); - - // Reducing state - ReducingStateDescriptor> reducingState = new ReducingStateDescriptor<>( - "any-name", - new SumReduce(), - source.getType()); - - final String queryName = "duplicate-me"; - - final QueryableStateStream> queryableState = - source.keyBy(new KeySelector, Integer>() { - private static final long serialVersionUID = -4126824763829132959L; - - @Override - public Integer getKey(Tuple2 value) throws Exception { - return value.f0; - } - }).asQueryableState(queryName, reducingState); + DataStream> source = env.addSource(new TestKeyRangeSource(numKeys)); - final QueryableStateStream> duplicate = - source.keyBy(new KeySelector, Integer>() { - private static final long serialVersionUID = -6265024000462809436L; + // Reducing state + ReducingStateDescriptor> reducingState = new ReducingStateDescriptor<>( + "any-name", + new SumReduce(), + source.getType()); - @Override - public Integer getKey(Tuple2 value) throws Exception { - return value.f0; - } - }).asQueryableState(queryName); + final String queryName = "duplicate-me"; - // Submit the job graph - JobGraph jobGraph = env.getStreamGraph().getJobGraph(); - jobId = jobGraph.getJobID(); + final QueryableStateStream> queryableState = + source.keyBy(new KeySelector, Integer>() { + private static final long serialVersionUID = -4126824763829132959L; - final CompletableFuture failedFuture = - notifyWhenJobStatusIs(jobId, JobStatus.FAILED, deadline); + @Override + public Integer getKey(Tuple2 value) { + return value.f0;
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280140#comment-16280140 ] ASF GitHub Bot commented on FLINK-7880: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/5062 👍 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280184#comment-16280184 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on the issue: https://github.com/apache/flink/pull/5062 Thanks @zentol ! I will let another travis run, and then merge. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280295#comment-16280295 ] ASF GitHub Bot commented on FLINK-7880: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/5062 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212278#comment-16212278 ] Till Rohrmann commented on FLINK-7880: -- One more: https://travis-ci.org/apache/flink/jobs/290015718 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Priority: Critical > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212301#comment-16212301 ] Kostas Kloudas commented on FLINK-7880: --- Thanks for opening this. I will check it out! > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214281#comment-16214281 ] Aljoscha Krettek commented on FLINK-7880: - I {{@Ignored}} the unstable tests in 137e61736b92a0e4c48e4f6ef6d8077c37584da2. They should be re-activated once the cause for the flakyness is found. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Critical > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218876#comment-16218876 ] Till Rohrmann commented on FLINK-7880: -- Apparently, there are also other QS tests affected by this. Can we either fix this or make sure that only tests are affected but not the real usage of the QS client? https://travis-ci.org/tillrohrmann/flink/jobs/292608461 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220281#comment-16220281 ] Kostas Kloudas commented on FLINK-7880: --- Yes I will look into it. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220824#comment-16220824 ] ASF GitHub Bot commented on FLINK-7880: --- GitHub user kl0u opened a pull request: https://github.com/apache/flink/pull/4909 [FLINK-7880][QS] Fix QS test instabilities. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kl0u/flink qs-test-instability Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/4909.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4909 commit 2b3bba7a2e3778cf3f9d580a32bd466d77fcdb39 Author: kkloudas Date: 2017-10-26T17:11:03Z [FLINK-7880][QS] Fix QS test instabilities. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220825#comment-16220825 ] Kostas Kloudas commented on FLINK-7880: --- I think this PR fixes it. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221816#comment-16221816 ] ASF GitHub Bot commented on FLINK-7880: --- Github user kl0u commented on the issue: https://github.com/apache/flink/pull/4909 It should call `dispose()`, you are correct. This was a mistake due to sloppy "manual rebasing". > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224043#comment-16224043 ] ASF GitHub Bot commented on FLINK-7880: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/4909 > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227479#comment-16227479 ] Chesnay Schepler commented on FLINK-7880: - [~till.rohrmann] yes it's still an issue, another occurrence can be seen here: https://travis-ci.org/zentol/flink/jobs/29519 The branch did include the assumed fix. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237584#comment-16237584 ] Kostas Kloudas commented on FLINK-7880: --- I will keep on investigating *BUT* I commented out the {{queryable state}} code from the tests and they still seem to fail even locally with: {code} libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument /bin/sh: line 1: 10553 Abort trap: 6 /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseSerialGC -jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefirebooter6530581495710923655.jar /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire7293759706604721607tmp /Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire_87379374358455308985tmp {code} You can find my code here https://github.com/kl0u/flink/tree/more-than-qs and to reproduce, go to the {{flink-queryable-state}} dir and run a couple of times {{mvn verify}}. At least on my machine this reproduces the problem. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237616#comment-16237616 ] Gary Yao commented on FLINK-7880: - I have further isolated the problem. The problem still appears even if you are running a single test method. I tried {{NonHAQueryableStateRocksDBBackendITCase#testValueState}} and even replaced the state query with a {{Thread.sleep(400)}}. My changes to the code are documented here: https://github.com/apache/flink/compare/master...GJL:FLINK-7880?expand=1 To run the tests, I use the command below. Note that multiple iterations may be required. Hence, the {{while}} loop. {noformat} while mvn -o clean verify -Dtest=NonHAQueryableStateRocksDBBackendITCase -DfailIfNoTests=false -Dcheckstyle.skip; do :; done {noformat} > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238936#comment-16238936 ] Aljoscha Krettek commented on FLINK-7880: - I think this is an issue with RocksDB cleanup (or lack thereof). Maybe [~srichter] has an idea? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239068#comment-16239068 ] Stefan Richter commented on FLINK-7880: --- My theory is that the `dispose()` is not properly executed before the test finishes. By adding printouts to the backend's constructor, dispose(), and the code that waits for cancelation like this: {code} } finally { System.out.println("shutdown a"); // Free cluster resources if (jobId != null) { CompletableFuture cancellation = FutureUtils.toJava(cluster .getLeaderGateway(deadline.timeLeft()) .ask(new JobManagerMessages.CancelJob(jobId), deadline.timeLeft()) .mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class))); cancellation.get(deadline.timeLeft().toMillis(), TimeUnit.MILLISECONDS); } System.out.println("shutdown b"); } {code} I obtain the following printing order: {code} created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@58482c7d created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@5278696f created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@11c418d8 created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@78424ca5 created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@79690844 created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@44855a3 created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@459dcdef created org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@54cef1d0 shutdown a shutdown b dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@11c418d8 dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@79690844 dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@58482c7d dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@44855a3 dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@78424ca5 dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@5278696f dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@459dcdef dispose org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@54cef1d0 {code} I conclude that the way in which the tests waits for cancelation does not work as expected. In particular, it does not ensure that `dispose()` was executed before the method ends and this can lead to problems with the native resources like what you observe. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240024#comment-16240024 ] Till Rohrmann commented on FLINK-7880: -- The {{CancellationSuccess}} message is sent after the {{JobManager}} has called {{cancel}} on the {{ExecutionGraph}}. But this does not necessarily mean that the {{ExecutionGraph}} is in a terminal state when the client receives the {{CancellationSuccess}} message. Thus, waiting on the {{CancelJob}} response won't work to ensure proper job termination/cleanup. > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240153#comment-16240153 ] Stefan Richter commented on FLINK-7880: --- Yes, but the test seems to expect that waiting for {{CancellationSuccess}} includes a successful cleanup or it was just not aware how important the proper cleanup is with native resources. In any case, I think the origin of this problem might be taking other IT cases a blueprints, and I have seen different patterns for this "wait until the job is gone" problem in different tests. Many of them might be similar to this one, but they will often look correct or cause no trouble if there are no native libraries involved (e.g. test only uses heap backend). I would suggest that there should be one and only one simple (maybe one helper class that does this), idiomatic way of waiting for a job to go away and release all resources that is used throughout all tests that actually want to have this behaviour. Otherwise, for example, extending an existing test to include a different backend can suddenly uncover the improper cleanup and make tests randomly fail with a JVM crash. Having a clear way to end IT cases could help to avoid chasing seriously looking, misleading test failures that seem to originate from the RocksDB backend code, but are actually tests problems from improper cleanup. What do you think? > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240252#comment-16240252 ] Aljoscha Krettek commented on FLINK-7880: - +2 to (to what [~srichter] said) > flink-queryable-state-java fails with core-dump > --- > > Key: FLINK-7880 > URL: https://issues.apache.org/jira/browse/FLINK-7880 > Project: Flink > Issue Type: Bug > Components: Queryable State, Tests >Affects Versions: 1.4.0 >Reporter: Till Rohrmann >Assignee: Kostas Kloudas >Priority: Blocker > Labels: test-stability > Fix For: 1.4.0 > > > The {{flink-queryable-state-java}} module fails on Travis with a core dump. > https://travis-ci.org/tillrohrmann/flink/jobs/289949829 -- This message was sent by Atlassian JIRA (v6.4.14#64029)