[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259346#comment-16259346
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

GitHub user kl0u opened a pull request:

https://github.com/apache/flink/pull/5038

[FLINK-7880][FLINK-7975][FLINK-7974][QS] QS test instability fix.

This is a follow-up on https://github.com/apache/flink/pull/4993. 
It contains one additional commit that makes the QS ITcases to wait for 
proper clean-up before exiting.

R: @aljoscha @zentol 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kl0u/flink qs-shutdown-fin-fin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5038


commit 15c39807f560f6ee4f34e4ad1ba245b66dd10b09
Author: kkloudas 
Date:   2017-11-09T18:21:43Z

[FLINK-7975][QS] Wait for QS client to shutdown.

commit 689502e70cced18a9cb5b2b6435ff500b7bf70dd
Author: kkloudas 
Date:   2017-11-09T18:30:29Z

[FLINK-7974][QS] Wait for QS abstract server to shutdown.

commit 5c0e3fc82a88768caa750367fbf02e24848b3b53
Author: kkloudas 
Date:   2017-11-20T13:28:24Z

[FLINK-7880][QS] Wait for proper resource cleanup after each ITCase.




> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259424#comment-16259424
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u closed the pull request at:

https://github.com/apache/flink/pull/5038


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260866#comment-16260866
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152298240
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -312,32 +345,43 @@ private void handInChannel(Channel channel) {
/**
 * Close the connecting channel with a ClosedChannelException.
 */
-   private void close() {
-   close(new ClosedChannelException());
+   private CompletableFuture close() {
+   return close(new ClosedChannelException());
}
 
/**
 * Close the connecting channel with an Exception (can be 
{@code null})
 * or forward to the established channel.
 */
-   private void close(Throwable cause) {
-   synchronized (connectLock) {
-   if (!closed) {
-   if (failureCause == null) {
-   failureCause = cause;
-   }
+   private CompletableFuture close(Throwable cause) {
+   CompletableFuture future = new CompletableFuture<>();
+   if (connectionShutdownFuture.compareAndSet(null, 
future)) {
+   synchronized (connectLock) {
+   if (!closed) {
--- End diff --

this seems unnecessary, doesn't the check at L358 guarantee that the entire 
branch is only executed once?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260870#comment-16260870
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152298612
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -386,6 +430,9 @@ private PendingRequest(REQ request) {
/** Reference to a failure that was reported by the channel. */
private final AtomicReference failureCause = new 
AtomicReference<>();
 
+   /** Atomic shut down future. */
+   private final AtomicReference> 
connectionShutdownFuture = new AtomicReference<>(null);
--- End diff --

why does this one suddenly return a boolean?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260871#comment-16260871
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152297601
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -312,32 +345,43 @@ private void handInChannel(Channel channel) {
/**
 * Close the connecting channel with a ClosedChannelException.
 */
-   private void close() {
-   close(new ClosedChannelException());
+   private CompletableFuture close() {
+   return close(new ClosedChannelException());
}
 
/**
 * Close the connecting channel with an Exception (can be 
{@code null})
 * or forward to the established channel.
 */
-   private void close(Throwable cause) {
-   synchronized (connectLock) {
-   if (!closed) {
-   if (failureCause == null) {
-   failureCause = cause;
-   }
+   private CompletableFuture close(Throwable cause) {
--- End diff --

same as above


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260867#comment-16260867
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152299398
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java
 ---
@@ -95,15 +97,20 @@
 
private static final Logger LOG = 
LoggerFactory.getLogger(ClientTest.class);
 
+   private static final FiniteDuration TEST_TIMEOUT = new 
FiniteDuration(20L, TimeUnit.SECONDS);
+
// Thread pool for client bootstrap (shared between tests)
-   private static final NioEventLoopGroup NIO_GROUP = new 
NioEventLoopGroup();
+   private NioEventLoopGroup nioGroup;
 
-   private static final FiniteDuration TEST_TIMEOUT = new 
FiniteDuration(10L, TimeUnit.SECONDS);
+   @Before
+   public void setUp() throws Exception {
+   nioGroup = new NioEventLoopGroup();
--- End diff --

you could just write `private NioEventLoopGroup nioGroup = new 
NioEventLoopGroup();` and remove the `@Before` method


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260868#comment-16260868
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152294984
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
--- End diff --

should be typed to `Void`.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260869#comment-16260869
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5038#discussion_r152300724
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
+
+   final List> connectionFutures = 
new ArrayList<>();
+
for (Map.Entry conn : establishedConnections.entrySet()) {
if 
(establishedConnections.remove(conn.getKey(), conn.getValue())) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
for (Map.Entry 
conn : pendingConnections.entrySet()) {
if (pendingConnections.remove(conn.getKey()) != 
null) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   CompletableFuture.allOf(
+   connectionFutures.toArray(new 
CompletableFuture[connectionFutures.size()])
+   ).whenComplete((result, throwable) -> {
+   if (throwable != null) {
+   
newShutdownFuture.completeExceptionally(throwable);
+   } else if (bootstrap != null) {
+   EventLoopGroup group = 
bootstrap.group();
+   if (group != null && 
!group.isShutdown()) {
+   group.shutdownGracefully(0L, 
0L, TimeUnit.MILLISECONDS)
+   
.addListener(finished -> {
+   if 
(finished.isSuccess()) {
+   
newShutdownFuture.complete(null);
+   } else {
+   
newShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   
newShutdownFuture.complete(null);
+   }
+   } else {
+   newShutdownFuture.complete(null);
}
+   });
+
+   // check again if in the meantime another thread 
completed the future
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
--- End diff --

where in close() do we set the shutdown future to null? I only see that 
being done in sendRequest. (which seems fishy)


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/t

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264379#comment-16264379
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

GitHub user kl0u opened a pull request:

https://github.com/apache/flink/pull/5062

[FLINK-7880][QS] Wait for proper resource cleanup after each ITCase.

R @aljoscha 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kl0u/flink qs-shutdown-fin-fin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5062.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5062


commit fd8ea759321b83022497e25e2dfa56ac6d976e15
Author: kkloudas 
Date:   2017-11-09T18:21:43Z

[FLINK-7975][QS] Wait for QS client to shutdown.

commit a474071d5810562326930a147f0ecdcfe506f13d
Author: kkloudas 
Date:   2017-11-09T18:30:29Z

[FLINK-7974][QS] Wait for QS abstract server to shutdown.

commit 2ffe223a8c8d89efbee0e08f8b90ea7b58fbf9df
Author: kkloudas 
Date:   2017-11-20T13:28:24Z

[FLINK-7880][QS] Wait for proper resource cleanup after each ITCase.

commit f7f8d2a6ebbaf464200c44a902e5b829c88f9499
Author: kkloudas 
Date:   2017-11-23T13:20:32Z

[hotfix][QS] ITCase refactoring.




> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265060#comment-16265060
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on the issue:

https://github.com/apache/flink/pull/5062
  
This is a follow-up to https://github.com/apache/flink/pull/5062


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270715#comment-16270715
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/5062
  
Your follow-up reference points to this very PR :P


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270718#comment-16270718
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on the issue:

https://github.com/apache/flink/pull/5062
  
Fixed @zentol !


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270847#comment-16270847
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153804523
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java
 ---
@@ -434,13 +432,16 @@ public Integer getKey(Tuple2 value) 
throws Exception {
 * contains a wrong jobId or wrong queryable state name.
 */
@Test
+   @Ignore
--- End diff --

why?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270829#comment-16270829
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153790466
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java
 ---
@@ -86,77 +83,89 @@ public void testServerInitializationFailure() throws 
Throwable {
 */
@Test
public void testPortRangeSuccess() throws Throwable {
-   TestServer server1 = null;
-   TestServer server2 = null;
-   Client client = null;
 
-   try {
-   server1 = startServer("Test Server 1", , 7778, 
7779);
+   // this is shared between the two servers.
+   AtomicKvStateRequestStats serverStats = new 
AtomicKvStateRequestStats();
+   AtomicKvStateRequestStats clientStats = new 
AtomicKvStateRequestStats();
+
+   List portList = new ArrayList<>();
+   portList.add();
+   portList.add(7778);
+   portList.add(7779);
+
+   try (
+   TestServer server1 = new TestServer("Test 
Server 1", serverStats, portList.iterator());
+   TestServer server2 = new TestServer("Test 
Server 2", serverStats, portList.iterator());
+   TestClient client = new TestClient(
+   "Test Client",
+   1,
+   new MessageSerializer<>(new 
TestMessage.TestMessageDeserializer(), new 
TestMessage.TestMessageDeserializer()),
+   clientStats
+   )
+   ) {
+   server1.start();
Assert.assertEquals(L, 
server1.getServerAddress().getPort());
--- End diff --

these should be removed, or generalized to `portRangeStart <= 
server1.getServerAddress().getPort() <= portRangeEnd`. 


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270838#comment-16270838
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153793887
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/HAAbstractQueryableStateTestBase.java
 ---
@@ -80,19 +80,18 @@ public static void setup(int proxyPortRangeStart, int 
serverPortRangeStart) {
 
@AfterClass
public static void tearDown() {
-   if (cluster != null) {
+   try {
+   client.shutdownAndWait();
+
cluster.stop();
cluster.awaitTermination();
-   }
 
-   try {
zkServer.stop();
zkServer.close();
-   } catch (Exception e) {
+
+   } catch (Throwable e) {
--- End diff --

try-catch block is unnecessary


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270842#comment-16270842
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153799484
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
--- End diff --

calling shutdown twice in a row will result in a future being returned that 
never completes.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270832#comment-16270832
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153794621
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/client/QueryableStateClient.java
 ---
@@ -108,9 +108,33 @@ public QueryableStateClient(final InetAddress 
remoteAddress, final int remotePor
new DisabledKvStateRequestStats());
}
 
-   /** Shuts down the client. */
-   public void shutdown() {
-   client.shutdown();
+   /**
+* Shuts down the client and returns a {@link CompletableFuture} that
+* will be completed when the shutdown process is completed.
+*
+* If an exception is thrown for any reason, then the returned future
+* will be completed exceptionally with that exception.
+*
+* @return A {@link CompletableFuture} for further handling of the
+* shutdown result.
+*/
+   public CompletableFuture shutdownAndHandle() {
+   return client.shutdown();
+   }
+
+   /**
+* Shuts down the client and waits until shutdown is completed.
+*
+* If an exception is thrown, a warning is printed containing
--- End diff --

the warning is logged, not printed (generally implies stdout).


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270823#comment-16270823
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153785916
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
+
+   final List> connectionFutures = 
new ArrayList<>();
+
for (Map.Entry conn : establishedConnections.entrySet()) {
if 
(establishedConnections.remove(conn.getKey(), conn.getValue())) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
for (Map.Entry 
conn : pendingConnections.entrySet()) {
if (pendingConnections.remove(conn.getKey()) != 
null) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   CompletableFuture.allOf(
+   connectionFutures.toArray(new 
CompletableFuture[connectionFutures.size()])
+   ).whenComplete((result, throwable) -> {
+   if (throwable != null) {
+   
newShutdownFuture.completeExceptionally(throwable);
+   } else if (bootstrap != null) {
+   EventLoopGroup group = 
bootstrap.group();
+   if (group != null && 
!group.isShutdown()) {
+   group.shutdownGracefully(0L, 
0L, TimeUnit.MILLISECONDS)
+   
.addListener(finished -> {
+   if 
(finished.isSuccess()) {
+   
newShutdownFuture.complete(null);
+   } else {
+   
newShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   
newShutdownFuture.complete(null);
+   }
+   } else {
+   newShutdownFuture.complete(null);
}
+   });
+
+   // check again if in the meantime another thread 
completed the future
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
--- End diff --

This can never be true. We can only arrive here if 
`clientShutdownFuture.compareAndSet(null, newShutdownFuture)` succeeds, and 
there is no other code path that ever sets it to null again.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-jav

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270826#comment-16270826
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153789622
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java
 ---
@@ -218,7 +225,18 @@ public void channelRead(ChannelHandlerContext ctx, 
Object msg) throws Exception
assertEquals(expectedRequests, stats.getNumFailed());
} finally {
if (client != null) {
-   client.shutdown();
+   try {
+
+   // todo here we were seeing this 
problem:
+   // 
https://github.com/netty/netty/issues/4357 if we do a get().
+   // this is why we now simply wait a bit 
so that everything is
+   // shut down and then we check
+
+   client.shutdown().get(10L, 
TimeUnit.SECONDS);
+   } catch (Exception e) {
+   e.printStackTrace();
+   }
+   
Assert.assertTrue(client.isEventGroupShutdown());
--- End diff --

I would prefer if we would throw `e` in this and similar cases.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270848#comment-16270848
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153799197
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
+
+   final List> connectionFutures = 
new ArrayList<>();
+
for (Map.Entry conn : establishedConnections.entrySet()) {
if 
(establishedConnections.remove(conn.getKey(), conn.getValue())) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
for (Map.Entry 
conn : pendingConnections.entrySet()) {
if (pendingConnections.remove(conn.getKey()) != 
null) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   CompletableFuture.allOf(
+   connectionFutures.toArray(new 
CompletableFuture[connectionFutures.size()])
+   ).whenComplete((result, throwable) -> {
+   if (throwable != null) {
+   
newShutdownFuture.completeExceptionally(throwable);
+   } else if (bootstrap != null) {
--- End diff --

this means that the bootstrap is not shutdown if a connection cannot be 
closed for whatever reason, which is rather odd. We should still at least try 
to shut it down.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270825#comment-16270825
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153790485
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java
 ---
@@ -86,77 +83,89 @@ public void testServerInitializationFailure() throws 
Throwable {
 */
@Test
public void testPortRangeSuccess() throws Throwable {
-   TestServer server1 = null;
-   TestServer server2 = null;
-   Client client = null;
 
-   try {
-   server1 = startServer("Test Server 1", , 7778, 
7779);
+   // this is shared between the two servers.
+   AtomicKvStateRequestStats serverStats = new 
AtomicKvStateRequestStats();
+   AtomicKvStateRequestStats clientStats = new 
AtomicKvStateRequestStats();
+
+   List portList = new ArrayList<>();
+   portList.add();
+   portList.add(7778);
+   portList.add(7779);
+
+   try (
+   TestServer server1 = new TestServer("Test 
Server 1", serverStats, portList.iterator());
+   TestServer server2 = new TestServer("Test 
Server 2", serverStats, portList.iterator());
+   TestClient client = new TestClient(
+   "Test Client",
+   1,
+   new MessageSerializer<>(new 
TestMessage.TestMessageDeserializer(), new 
TestMessage.TestMessageDeserializer()),
+   clientStats
+   )
+   ) {
+   server1.start();
Assert.assertEquals(L, 
server1.getServerAddress().getPort());
 
-   server2 = startServer("Test Server 2", , 7778, 
7779);
+   server2.start();
Assert.assertEquals(7778L, 
server2.getServerAddress().getPort());
--- End diff --

same as above


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270827#comment-16270827
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153790032
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java
 ---
@@ -60,23 +63,17 @@ public void testServerInitializationFailure() throws 
Throwable {
expectedEx.expect(FlinkRuntimeException.class);
expectedEx.expectMessage("Unable to start Test Server 2. All 
ports in provided range are occupied.");
 
-   TestServer server1 = null;
-   TestServer server2 = null;
-   try {
+   List portList = new ArrayList<>();
+   portList.add();
 
-   server1 = startServer("Test Server 1", );
+   try (
+   TestServer server1 = new TestServer("Test 
Server 1", new DisabledKvStateRequestStats(), portList.iterator());
--- End diff --

a safer pattern is to give `server1` multiple ports to choose from and 
specifically start `server2` on the port `server1` eventually started with.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270840#comment-16270840
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153793789
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/NonHAAbstractQueryableStateTestBase.java
 ---
@@ -68,11 +68,14 @@ public static void setup(int proxyPortRangeStart, int 
serverPortRangeStart) {
@AfterClass
public static void tearDown() {
try {
+   client.shutdownAndWait();
+
cluster.stop();
-   } catch (Exception e) {
+   cluster.awaitTermination();
+
+   } catch (Throwable e) {
--- End diff --

try-catch block is unnecessary


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270846#comment-16270846
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153802908
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -486,27 +542,25 @@ private boolean close(Throwable cause) {
@Override
public void onRequestResult(long requestId, RESP response) {
TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && pending.complete(response)) {
+   if (pending != null && !pending.isDone()) {
long durationMillis = (System.nanoTime() - 
pending.getTimestamp()) / 1_000_000L;
stats.reportSuccessfulRequest(durationMillis);
+   pending.complete(response);
}
}
 
@Override
public void onRequestFailure(long requestId, Throwable cause) {
TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && 
pending.completeExceptionally(cause)) {
+   if (pending != null && !pending.isDone()) {
stats.reportFailedRequest();
+   pending.completeExceptionally(cause);
}
}
 
@Override
public void onFailure(Throwable cause) {
-   if (close(cause)) {
-   // Remove from established channels, otherwise 
future
-   // requests will be handled by this failed 
channel.
-   establishedConnections.remove(serverAddress, 
this);
-   }
+   close(cause).thenAccept(cancelled -> 
establishedConnections.remove(serverAddress, this));
--- End diff --

shouldn't we remove the connection in any case, since if we can't close 
*something*  is probably wrong with it anyway?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270843#comment-16270843
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153801762
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -422,20 +467,31 @@ void close() {
 * @param cause The cause to close the channel with.
 * @return Channel close future
 */
-   private boolean close(Throwable cause) {
-   if (failureCause.compareAndSet(null, cause)) {
-   channel.close();
-   stats.reportInactiveConnection();
+   private CompletableFuture close(Throwable cause) {
+   final CompletableFuture shutdownFuture = new 
CompletableFuture<>();
 
-   for (long requestId : pendingRequests.keySet()) 
{
-   TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && 
pending.completeExceptionally(cause)) {
-   stats.reportFailedRequest();
+   if (connectionShutdownFuture.compareAndSet(null, 
shutdownFuture) &&
--- End diff --

this can result in odd race conditions, where one thread has set the 
shutdownFuture, but another one the failureCause. I would suggest to create a 
container for the future and exception and update them atomically.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270831#comment-16270831
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153786058
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -300,7 +333,7 @@ private void handInChannel(Channel channel) {
// Check shut down for possible race 
with shut down. We
// don't want any lingering connections 
after shut down,
// which can happen if we don't check 
this here.
-   if (shutDown.get()) {
+   if 
(!clientShutdownFuture.compareAndSet(null, null)) {
--- End diff --

for clarity we should call `get()` instead of `compareAndSet` since the 
setting part is a no-op anyway.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270839#comment-16270839
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153798768
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
--- End diff --

calling shutdown twice in a row will result in a future being returned that 
never completes.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270833#comment-16270833
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153794159
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/main/java/org/apache/flink/queryablestate/server/KvStateServerImpl.java
 ---
@@ -101,6 +103,12 @@ public InetSocketAddress getServerAddress() {
 
@Override
public void shutdown() {
-   super.shutdown();
+   try {
+   Time timeout = Time.seconds(10L);
+   shutdownServer(timeout).get(timeout.toMilliseconds(), 
TimeUnit.MILLISECONDS);
--- End diff --

why not just `get(10, TimeUnit.SECONDS)`?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270824#comment-16270824
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153788979
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java
 ---
@@ -95,15 +97,20 @@
 
private static final Logger LOG = 
LoggerFactory.getLogger(ClientTest.class);
 
+   private static final FiniteDuration TEST_TIMEOUT = new 
FiniteDuration(20L, TimeUnit.SECONDS);
+
// Thread pool for client bootstrap (shared between tests)
-   private static final NioEventLoopGroup NIO_GROUP = new 
NioEventLoopGroup();
+   private NioEventLoopGroup nioGroup;
 
-   private static final FiniteDuration TEST_TIMEOUT = new 
FiniteDuration(10L, TimeUnit.SECONDS);
+   @Before
+   public void setUp() throws Exception {
+   nioGroup = new NioEventLoopGroup();
--- End diff --

this method isn't strictly necessary; you can also do `private final 
NioEventLoopGroup nioGroup = new NioEventLoopGroup()`. Then you would also no 
longer need the null-check in `@After`.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270836#comment-16270836
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153798113
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
+   log.info("Shutting down {} @ {}", serverName, 
serverAddress);
+
+   final CompletableFuture groupShutdownFuture = new 
CompletableFuture<>();
+   if (bootstrap != null) {
+   EventLoopGroup group = bootstrap.group();
+   if (group != null && !group.isShutdown()) {
+   group.shutdownGracefully(0L, 0L, 
TimeUnit.MILLISECONDS)
+   .addListener(finished 
-> {
+   if 
(finished.isSuccess()) {
+   
groupShutdownFuture.complete(null);
+   } else {
+   
groupShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   groupShutdownFuture.complete(null);
+   }
+   } else {
+   groupShutdownFuture.complete(null);
+   }
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   final CompletableFuture handlerShutdownFuture = 
new CompletableFuture<>();
+   if (handler == null) {
+   handlerShutdownFuture.complete(null);
+   } else {
+   handler.shutdown().whenComplete((result, 
throwable) -> {
+   if (throwable != null) {
+   
handlerShutdownFuture.completeExceptionally(throwable);
+   } else {
+   
handlerShutdownFuture.complete(null);
+   }
+   });
}
+
+   final CompletableFuture queryExecShutdownFuture = 
Completa

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270830#comment-16270830
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153786954
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -312,32 +345,41 @@ private void handInChannel(Channel channel) {
/**
 * Close the connecting channel with a ClosedChannelException.
 */
-   private void close() {
-   close(new ClosedChannelException());
+   private CompletableFuture close() {
+   return close(new ClosedChannelException());
}
 
/**
 * Close the connecting channel with an Exception (can be 
{@code null})
 * or forward to the established channel.
 */
-   private void close(Throwable cause) {
-   synchronized (connectLock) {
-   if (!closed) {
+   private CompletableFuture close(Throwable cause) {
+   CompletableFuture future = new 
CompletableFuture<>();
+   if (connectionShutdownFuture.compareAndSet(null, 
future)) {
+   synchronized (connectLock) {
if (failureCause == null) {
failureCause = cause;
}
 
+   closed = true;
--- End diff --

this field should be redundant, as `(connectionShutdownFuture.get() != 
null) == closed` should always hold.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270837#comment-16270837
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153797851
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
--- End diff --

i would remove this timeout argument and simply define large timeout for 
shutting down the `queryExecutor`.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270844#comment-16270844
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153801913
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -422,20 +467,31 @@ void close() {
 * @param cause The cause to close the channel with.
 * @return Channel close future
 */
-   private boolean close(Throwable cause) {
-   if (failureCause.compareAndSet(null, cause)) {
-   channel.close();
-   stats.reportInactiveConnection();
+   private CompletableFuture close(Throwable cause) {
+   final CompletableFuture shutdownFuture = new 
CompletableFuture<>();
 
-   for (long requestId : pendingRequests.keySet()) 
{
-   TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && 
pending.completeExceptionally(cause)) {
-   stats.reportFailedRequest();
+   if (connectionShutdownFuture.compareAndSet(null, 
shutdownFuture) &&
--- End diff --

Let's also log other shutdown attempts as DEBUG.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270845#comment-16270845
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153804310
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java
 ---
@@ -359,12 +355,12 @@ public Integer getKey(Tuple2 value) 
throws Exception {
} finally {
// Free cluster resources
if (jobId != null) {
-   scala.concurrent.Future 
cancellation = cluster
-   
.getLeaderGateway(deadline.timeLeft())
+   cluster.getLeaderGateway(deadline.timeLeft())
.ask(new 
JobManagerMessages.CancelJob(jobId), deadline.timeLeft())
-   
.mapTo(ClassTag$.MODULE$.apply(JobManagerMessages.CancellationSuccess.class));
+   
.mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class));
 
-   Await.ready(cancellation, deadline.timeLeft());
+   // we are not waiting for the cancellation to 
happen because the
+   // job has actually failed, as tested above.
--- End diff --

you can't guarantee this in the `finally` block. (for example if 
submitJobDetached failed but the job is actually running)


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270841#comment-16270841
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153798592
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -133,7 +134,7 @@ public String getClientName() {
}
 
public CompletableFuture sendRequest(final InetSocketAddress 
serverAddress, final REQ request) {
-   if (shutDown.get()) {
+   if (!clientShutdownFuture.compareAndSet(null, null)) {
--- End diff --

this should use ´clientShutdownFuture.get() != null` for clarity.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270828#comment-16270828
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153794265
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/main/java/org/apache/flink/queryablestate/client/proxy/KvStateClientProxyImpl.java
 ---
@@ -96,7 +98,13 @@ public void start() throws Throwable {
 
@Override
public void shutdown() {
-   super.shutdown();
+   try {
+   Time timeout = Time.seconds(10L);
+   shutdownServer(timeout).get(timeout.toMilliseconds(), 
TimeUnit.MILLISECONDS);
--- End diff --

why not just get(10, TimeUnit.SECONDS)?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270835#comment-16270835
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153797256
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
--- End diff --

we don't check anything here.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270834#comment-16270834
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r153796142
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -205,6 +212,11 @@ public void start() throws Throwable {
private boolean attemptToBind(final int port) throws Throwable {
log.debug("Attempting to start {} on port {}.", serverName, 
port);
 
+   // here we reset the future every time because in case of 
failure
+   // to bind, we call shutdown here and this may interfere with 
future
+   // shutdown attempts.
+   this.serverShutdownFuture.getAndSet(null);
--- End diff --

If we intend to only support a single start of a server this should use 
compareAndSet and fail immediately if a shutdown is in progress.

If we intend to support multiple starts of a server we should guard start() 
and shutdown() with a lock since otherwise you easily end up in an inconsistent 
state.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278246#comment-16278246
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154884801
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
--- End diff --

Here the idea is that we atomically set the future the first time 
(`serverShutdownFuture.compareAndSet(null, shutdownFuture)`), and then, at each 
subsequent call we return that exact future (`serverShutdownFuture.get();`). So 
I cannot see how this can lead to returning a future that never completes.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278253#comment-16278253
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154885293
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
+   log.info("Shutting down {} @ {}", serverName, 
serverAddress);
+
+   final CompletableFuture groupShutdownFuture = new 
CompletableFuture<>();
+   if (bootstrap != null) {
+   EventLoopGroup group = bootstrap.group();
+   if (group != null && !group.isShutdown()) {
+   group.shutdownGracefully(0L, 0L, 
TimeUnit.MILLISECONDS)
+   .addListener(finished 
-> {
+   if 
(finished.isSuccess()) {
+   
groupShutdownFuture.complete(null);
+   } else {
+   
groupShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   groupShutdownFuture.complete(null);
+   }
+   } else {
+   groupShutdownFuture.complete(null);
+   }
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   final CompletableFuture handlerShutdownFuture = 
new CompletableFuture<>();
+   if (handler == null) {
+   handlerShutdownFuture.complete(null);
+   } else {
+   handler.shutdown().whenComplete((result, 
throwable) -> {
+   if (throwable != null) {
+   
handlerShutdownFuture.completeExceptionally(throwable);
+   } else {
+   
handlerShutdownFuture.complete(null);
+   }
+   });
}
+
+   final CompletableFuture queryExecShutdownFuture = 
Completabl

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278254#comment-16278254
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154885503
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -133,7 +134,7 @@ public String getClientName() {
}
 
public CompletableFuture sendRequest(final InetSocketAddress 
serverAddress, final REQ request) {
-   if (shutDown.get()) {
+   if (!clientShutdownFuture.compareAndSet(null, null)) {
--- End diff --

Well, I agree that the code looks clearer, but the `compareAndSet` 
guarantees atomicity. While `get()` and then compare, does not.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278256#comment-16278256
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154885936
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
--- End diff --

This is the same pattern as in the case of the `AbstractServerBase` ;)


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278265#comment-16278265
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154886964
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -166,28 +167,57 @@ public String getClientName() {
 * Shuts down the client and closes all connections.
 *
 * After a call to this method, all returned futures will be failed.
+*
+* @return A {@link CompletableFuture} that will be completed when the 
shutdown process is done.
 */
-   public void shutdown() {
-   if (shutDown.compareAndSet(false, true)) {
+   public CompletableFuture shutdown() {
+   final CompletableFuture newShutdownFuture = new 
CompletableFuture<>();
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
+
+   final List> connectionFutures = 
new ArrayList<>();
+
for (Map.Entry conn : establishedConnections.entrySet()) {
if 
(establishedConnections.remove(conn.getKey(), conn.getValue())) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
for (Map.Entry 
conn : pendingConnections.entrySet()) {
if (pendingConnections.remove(conn.getKey()) != 
null) {
-   conn.getValue().close();
+   
connectionFutures.add(conn.getValue().close());
}
}
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   CompletableFuture.allOf(
+   connectionFutures.toArray(new 
CompletableFuture[connectionFutures.size()])
+   ).whenComplete((result, throwable) -> {
+   if (throwable != null) {
+   
newShutdownFuture.completeExceptionally(throwable);
+   } else if (bootstrap != null) {
+   EventLoopGroup group = 
bootstrap.group();
+   if (group != null && 
!group.isShutdown()) {
+   group.shutdownGracefully(0L, 
0L, TimeUnit.MILLISECONDS)
+   
.addListener(finished -> {
+   if 
(finished.isSuccess()) {
+   
newShutdownFuture.complete(null);
+   } else {
+   
newShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   
newShutdownFuture.complete(null);
+   }
+   } else {
+   newShutdownFuture.complete(null);
}
+   });
+
+   // check again if in the meantime another thread 
completed the future
+   if (clientShutdownFuture.compareAndSet(null, 
newShutdownFuture)) {
--- End diff --

You are right!


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278273#comment-16278273
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154888733
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -133,7 +134,7 @@ public String getClientName() {
}
 
public CompletableFuture sendRequest(final InetSocketAddress 
serverAddress, final REQ request) {
-   if (shutDown.get()) {
+   if (!clientShutdownFuture.compareAndSet(null, null)) {
--- End diff --

But the atomicity doesn't get you anything. Which case wouldn't be covered?

This check can only handle the case when sendRequest() is called after 
shutdown(), regardless of whether you use compareAndSet() or get(), so it 
shouldn't make a difference.



> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278285#comment-16278285
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154890227
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -312,32 +345,41 @@ private void handInChannel(Channel channel) {
/**
 * Close the connecting channel with a ClosedChannelException.
 */
-   private void close() {
-   close(new ClosedChannelException());
+   private CompletableFuture close() {
+   return close(new ClosedChannelException());
}
 
/**
 * Close the connecting channel with an Exception (can be 
{@code null})
 * or forward to the established channel.
 */
-   private void close(Throwable cause) {
-   synchronized (connectLock) {
-   if (!closed) {
+   private CompletableFuture close(Throwable cause) {
+   CompletableFuture future = new 
CompletableFuture<>();
+   if (connectionShutdownFuture.compareAndSet(null, 
future)) {
+   synchronized (connectLock) {
if (failureCause == null) {
failureCause = cause;
}
 
+   closed = true;
--- End diff --

Yes. I will remove that.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278292#comment-16278292
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154890905
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -486,27 +542,25 @@ private boolean close(Throwable cause) {
@Override
public void onRequestResult(long requestId, RESP response) {
TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && pending.complete(response)) {
+   if (pending != null && !pending.isDone()) {
long durationMillis = (System.nanoTime() - 
pending.getTimestamp()) / 1_000_000L;
stats.reportSuccessfulRequest(durationMillis);
+   pending.complete(response);
}
}
 
@Override
public void onRequestFailure(long requestId, Throwable cause) {
TimestampedCompletableFuture pending = 
pendingRequests.remove(requestId);
-   if (pending != null && 
pending.completeExceptionally(cause)) {
+   if (pending != null && !pending.isDone()) {
stats.reportFailedRequest();
+   pending.completeExceptionally(cause);
}
}
 
@Override
public void onFailure(Throwable cause) {
-   if (close(cause)) {
-   // Remove from established channels, otherwise 
future
-   // requests will be handled by this failed 
channel.
-   establishedConnections.remove(serverAddress, 
this);
-   }
+   close(cause).thenAccept(cancelled -> 
establishedConnections.remove(serverAddress, this));
--- End diff --

Yes.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278584#comment-16278584
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154950558
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
+   log.info("Shutting down {} @ {}", serverName, 
serverAddress);
+
+   final CompletableFuture groupShutdownFuture = new 
CompletableFuture<>();
+   if (bootstrap != null) {
+   EventLoopGroup group = bootstrap.group();
+   if (group != null && !group.isShutdown()) {
+   group.shutdownGracefully(0L, 0L, 
TimeUnit.MILLISECONDS)
+   .addListener(finished 
-> {
+   if 
(finished.isSuccess()) {
+   
groupShutdownFuture.complete(null);
+   } else {
+   
groupShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   groupShutdownFuture.complete(null);
+   }
+   } else {
+   groupShutdownFuture.complete(null);
+   }
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   final CompletableFuture handlerShutdownFuture = 
new CompletableFuture<>();
+   if (handler == null) {
+   handlerShutdownFuture.complete(null);
+   } else {
+   handler.shutdown().whenComplete((result, 
throwable) -> {
+   if (throwable != null) {
+   
handlerShutdownFuture.completeExceptionally(throwable);
+   } else {
+   
handlerShutdownFuture.complete(null);
+   }
+   });
}
+
+   final CompletableFuture queryExecShutdownFuture = 
Completa

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278593#comment-16278593
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154950988
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/Client.java
 ---
@@ -191,9 +196,12 @@ public String getClientName() {
CompletableFuture.allOf(
connectionFutures.toArray(new 
CompletableFuture[connectionFutures.size()])
).whenComplete((result, throwable) -> {
+
if (throwable != null) {
-   
newShutdownFuture.completeExceptionally(throwable);
-   } else if (bootstrap != null) {
+   LOG.warn("Problem while shutting down 
the connections at the {}: {}", clientName, throwable);
--- End diff --

`LOG.warn("Problem while shutting down the connections at the {}", 
clientName, throwable);` instead?


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278594#comment-16278594
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154952648
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/AbstractServerTest.java
 ---
@@ -66,14 +65,15 @@ public void testServerInitializationFailure() throws 
Throwable {
List portList = new ArrayList<>();
portList.add();
 
-   try (
-   TestServer server1 = new TestServer("Test 
Server 1", new DisabledKvStateRequestStats(), portList.iterator());
-   TestServer server2 = new TestServer("Test 
Server 2", new DisabledKvStateRequestStats(), portList.iterator())
-   ) {
+   try (TestServer server1 = new TestServer("Test Server 1", new 
DisabledKvStateRequestStats(), portList.iterator())) {
server1.start();
-   Assert.assertEquals(L, 
server1.getServerAddress().getPort());
 
-   server2.start();
+   List occupiedPortList = new ArrayList<>();
+   
occupiedPortList.add(server1.getServerAddress().getPort());
--- End diff --

could use `Collections.singletonList()` here


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278791#comment-16278791
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r154955042
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java
 ---
@@ -1480,7 +1259,52 @@ public String fold(String accumulator, 
Tuple2 value) throws Excep
 
/   General Utility Methods 
//
 
-   private CompletableFuture 
notifyWhenJobStatusIs(
+   /**
+* A wrapper of the job graph that makes sure to cancell the job and 
wait for
--- End diff --

typo: cancell -> cancel


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280005#comment-16280005
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155201806
  
--- Diff: 
flink-queryable-state/flink-queryable-state-client-java/src/main/java/org/apache/flink/queryablestate/network/AbstractServerBase.java
 ---
@@ -251,34 +263,81 @@ private boolean attemptToBind(final int port) throws 
Throwable {
throw future.cause();
} catch (BindException e) {
log.debug("Failed to start {} on port {}: {}.", 
serverName, port, e.getMessage());
-   shutdown();
+   try {
+   shutdownServer(Time.seconds(10L)).get();
+   } catch (Exception r) {
+
+   // Here we were seeing this problem:
+   // https://github.com/netty/netty/issues/4357 
if we do a get().
+   // this is why we now simply wait a bit so that 
everything is
+   // shut down and then we check
+
+   log.warn("Problem while shutting down {}: {}", 
serverName, r.getMessage());
+   }
}
// any other type of exception we let it bubble up.
return false;
}
 
/**
 * Shuts down the server and all related thread pools.
+* @param timeout The time to wait for the shutdown process to complete.
+* @return A {@link CompletableFuture} that will be completed upon 
termination of the shutdown process.
 */
-   public void shutdown() {
-   log.info("Shutting down {} @ {}", serverName, serverAddress);
-
-   if (handler != null) {
-   handler.shutdown();
-   handler = null;
-   }
-
-   if (queryExecutor != null) {
-   queryExecutor.shutdown();
-   }
+   public CompletableFuture shutdownServer(Time timeout) throws 
InterruptedException {
+   CompletableFuture shutdownFuture = new 
CompletableFuture<>();
+   if (serverShutdownFuture.compareAndSet(null, shutdownFuture)) {
+   log.info("Shutting down {} @ {}", serverName, 
serverAddress);
+
+   final CompletableFuture groupShutdownFuture = new 
CompletableFuture<>();
+   if (bootstrap != null) {
+   EventLoopGroup group = bootstrap.group();
+   if (group != null && !group.isShutdown()) {
+   group.shutdownGracefully(0L, 0L, 
TimeUnit.MILLISECONDS)
+   .addListener(finished 
-> {
+   if 
(finished.isSuccess()) {
+   
groupShutdownFuture.complete(null);
+   } else {
+   
groupShutdownFuture.completeExceptionally(finished.cause());
+   }
+   });
+   } else {
+   groupShutdownFuture.complete(null);
+   }
+   } else {
+   groupShutdownFuture.complete(null);
+   }
 
-   if (bootstrap != null) {
-   EventLoopGroup group = bootstrap.group();
-   if (group != null) {
-   group.shutdownGracefully(0L, 10L, 
TimeUnit.SECONDS);
+   final CompletableFuture handlerShutdownFuture = 
new CompletableFuture<>();
+   if (handler == null) {
+   handlerShutdownFuture.complete(null);
+   } else {
+   handler.shutdown().whenComplete((result, 
throwable) -> {
+   if (throwable != null) {
+   
handlerShutdownFuture.completeExceptionally(throwable);
+   } else {
+   
handlerShutdownFuture.complete(null);
+   }
+   });
}
+
+   final CompletableFuture queryExecShutdownFuture = 
Completabl

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280057#comment-16280057
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155212313
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java
 ---
@@ -359,12 +355,12 @@ public Integer getKey(Tuple2 value) 
throws Exception {
} finally {
// Free cluster resources
if (jobId != null) {
-   scala.concurrent.Future 
cancellation = cluster
-   
.getLeaderGateway(deadline.timeLeft())
+   cluster.getLeaderGateway(deadline.timeLeft())
.ask(new 
JobManagerMessages.CancelJob(jobId), deadline.timeLeft())
-   
.mapTo(ClassTag$.MODULE$.apply(JobManagerMessages.CancellationSuccess.class));
+   
.mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class));
 
-   Await.ready(cancellation, deadline.timeLeft());
+   // we are not waiting for the cancellation to 
happen because the
+   // job has actually failed, as tested above.
--- End diff --

You are right. So I will just remove the `finally` block and check if the 
status of the job is `FAILED`.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280061#comment-16280061
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155212885
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/NonHAAbstractQueryableStateTestBase.java
 ---
@@ -68,11 +68,14 @@ public static void setup(int proxyPortRangeStart, int 
serverPortRangeStart) {
@AfterClass
public static void tearDown() {
try {
+   client.shutdownAndWait();
+
cluster.stop();
-   } catch (Exception e) {
+   cluster.awaitTermination();
+
+   } catch (Throwable e) {
--- End diff --

Done.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280060#comment-16280060
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155212865
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/HAAbstractQueryableStateTestBase.java
 ---
@@ -80,19 +80,18 @@ public static void setup(int proxyPortRangeStart, int 
serverPortRangeStart) {
 
@AfterClass
public static void tearDown() {
-   if (cluster != null) {
+   try {
+   client.shutdownAndWait();
+
cluster.stop();
cluster.awaitTermination();
-   }
 
-   try {
zkServer.stop();
zkServer.close();
-   } catch (Exception e) {
+
+   } catch (Throwable e) {
--- End diff --

Done.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280125#comment-16280125
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155224199
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java
 ---
@@ -224,6 +225,7 @@ public void channelRead(ChannelHandlerContext ctx, 
Object msg) throws Exception
assertEquals(expectedRequests, 
stats.getNumSuccessful());
assertEquals(expectedRequests, stats.getNumFailed());
} finally {
+   Exception exc = null;
--- End diff --

this could be moved into the if block.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280124#comment-16280124
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155224312
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/network/ClientTest.java
 ---
@@ -234,9 +236,14 @@ public void channelRead(ChannelHandlerContext ctx, 
Object msg) throws Exception
 
client.shutdown().get(10L, 
TimeUnit.SECONDS);
} catch (Exception e) {
-   e.printStackTrace();
+   exc = e;
+   
LOG.error(ExceptionUtils.stringifyException(e));
--- End diff --

use `LOG.error("An exception occurred while shutting down netty.", e)` 
instead


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280126#comment-16280126
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/5062#discussion_r155225199
  
--- Diff: 
flink-queryable-state/flink-queryable-state-runtime/src/test/java/org/apache/flink/queryablestate/itcases/AbstractQueryableStateTestBase.java
 ---
@@ -260,89 +260,90 @@ public void testDuplicateRegistrationFailsJob() 
throws Exception {
final Deadline deadline = TEST_TIMEOUT.fromNow();
final int numKeys = 256;
 
-   JobID jobId = null;
+   StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
+   env.setStateBackend(stateBackend);
+   env.setParallelism(maxParallelism);
+   // Very important, because cluster is shared between tests and 
we
+   // don't explicitly check that all slots are available before
+   // submitting.
+   
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 
1000L));
 
-   try {
-   //
-   // Test program
-   //
-   StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
-   env.setStateBackend(stateBackend);
-   env.setParallelism(maxParallelism);
-   // Very important, because cluster is shared between 
tests and we
-   // don't explicitly check that all slots are available 
before
-   // submitting.
-   
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, 
1000L));
-
-   DataStream> source = env
-   .addSource(new 
TestKeyRangeSource(numKeys));
-
-   // Reducing state
-   ReducingStateDescriptor> 
reducingState = new ReducingStateDescriptor<>(
-   "any-name",
-   new SumReduce(),
-   source.getType());
-
-   final String queryName = "duplicate-me";
-
-   final QueryableStateStream> queryableState =
-   source.keyBy(new 
KeySelector, Integer>() {
-   private static final long 
serialVersionUID = -4126824763829132959L;
-
-   @Override
-   public Integer 
getKey(Tuple2 value) throws Exception {
-   return value.f0;
-   }
-   }).asQueryableState(queryName, 
reducingState);
+   DataStream> source = env.addSource(new 
TestKeyRangeSource(numKeys));
 
-   final QueryableStateStream> duplicate =
-   source.keyBy(new 
KeySelector, Integer>() {
-   private static final long 
serialVersionUID = -6265024000462809436L;
+   // Reducing state
+   ReducingStateDescriptor> reducingState = 
new ReducingStateDescriptor<>(
+   "any-name",
+   new SumReduce(),
+   source.getType());
 
-   @Override
-   public Integer 
getKey(Tuple2 value) throws Exception {
-   return value.f0;
-   }
-   }).asQueryableState(queryName);
+   final String queryName = "duplicate-me";
 
-   // Submit the job graph
-   JobGraph jobGraph = env.getStreamGraph().getJobGraph();
-   jobId = jobGraph.getJobID();
+   final QueryableStateStream> 
queryableState =
+   source.keyBy(new KeySelector, Integer>() {
+   private static final long 
serialVersionUID = -4126824763829132959L;
 
-   final 
CompletableFuture failedFuture =
-   notifyWhenJobStatusIs(jobId, 
JobStatus.FAILED, deadline);
+   @Override
+   public Integer getKey(Tuple2 value) {
+   return value.f0;

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280140#comment-16280140
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/5062
  
👍 


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280184#comment-16280184
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on the issue:

https://github.com/apache/flink/pull/5062
  
Thanks @zentol ! I will let another travis run, and then merge.


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280295#comment-16280295
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/5062


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212278#comment-16212278
 ] 

Till Rohrmann commented on FLINK-7880:
--

One more: https://travis-ci.org/apache/flink/jobs/290015718

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-20 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212301#comment-16212301
 ] 

Kostas Kloudas commented on FLINK-7880:
---

Thanks for opening this. I will check it out!

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-22 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214281#comment-16214281
 ] 

Aljoscha Krettek commented on FLINK-7880:
-

I {{@Ignored}} the unstable tests in 137e61736b92a0e4c48e4f6ef6d8077c37584da2. 
They should be re-activated once the cause for the flakyness is found.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-25 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218876#comment-16218876
 ] 

Till Rohrmann commented on FLINK-7880:
--

Apparently, there are also other QS tests affected by this. Can we either fix 
this or make sure that only tests are affected but not the real usage of the QS 
client?

https://travis-ci.org/tillrohrmann/flink/jobs/292608461

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-26 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220281#comment-16220281
 ] 

Kostas Kloudas commented on FLINK-7880:
---

Yes I will look into it.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220824#comment-16220824
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

GitHub user kl0u opened a pull request:

https://github.com/apache/flink/pull/4909

[FLINK-7880][QS] Fix QS test instabilities.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kl0u/flink qs-test-instability

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4909


commit 2b3bba7a2e3778cf3f9d580a32bd466d77fcdb39
Author: kkloudas 
Date:   2017-10-26T17:11:03Z

[FLINK-7880][QS] Fix QS test instabilities.




> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-26 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220825#comment-16220825
 ] 

Kostas Kloudas commented on FLINK-7880:
---

I think this PR fixes it.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221816#comment-16221816
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user kl0u commented on the issue:

https://github.com/apache/flink/pull/4909
  
It should call `dispose()`, you are correct. This was a mistake  due to 
sloppy "manual rebasing".


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224043#comment-16224043
 ] 

ASF GitHub Bot commented on FLINK-7880:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/4909


> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-10-31 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227479#comment-16227479
 ] 

Chesnay Schepler commented on FLINK-7880:
-

[~till.rohrmann] yes it's still an issue, another occurrence can be seen here: 
https://travis-ci.org/zentol/flink/jobs/29519

The branch did include the assumed fix.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-03 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237584#comment-16237584
 ] 

Kostas Kloudas commented on FLINK-7880:
---

I will keep on investigating *BUT* I commented out the {{queryable state}} code 
from the tests and they still seem to fail even locally with:

{code}
libc++abi.dylib: terminating with uncaught exception of type 
std::__1::system_error: mutex lock failed: Invalid argument
/bin/sh: line 1: 10553 Abort trap: 6   
/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/bin/java 
-Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseSerialGC -jar 
/Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefirebooter6530581495710923655.jar
 
/Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire7293759706604721607tmp
 
/Users/kkloudas/repos/dataartisans/flink/flink-queryable-state/flink-queryable-state-runtime/target/surefire/surefire_87379374358455308985tmp
{code}

You can find my code here https://github.com/kl0u/flink/tree/more-than-qs and 
to reproduce, go to the {{flink-queryable-state}} dir and run a couple of times 
{{mvn verify}}. At least on my machine this reproduces the problem.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-03 Thread Gary Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237616#comment-16237616
 ] 

Gary Yao commented on FLINK-7880:
-

I have further isolated the problem. The problem still appears even if you are 
running a single test method. I tried 
{{NonHAQueryableStateRocksDBBackendITCase#testValueState}} and even replaced 
the state query with a {{Thread.sleep(400)}}. My changes to the code are 
documented here: 
https://github.com/apache/flink/compare/master...GJL:FLINK-7880?expand=1

To run the tests, I use the command below. Note that multiple iterations may be 
required. Hence, the {{while}} loop.
{noformat}
while mvn -o clean verify -Dtest=NonHAQueryableStateRocksDBBackendITCase 
-DfailIfNoTests=false -Dcheckstyle.skip; do :; done
{noformat}





> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-04 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238936#comment-16238936
 ] 

Aljoscha Krettek commented on FLINK-7880:
-

I think this is an issue with RocksDB cleanup (or lack thereof). Maybe 
[~srichter] has an idea?

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-04 Thread Stefan Richter (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239068#comment-16239068
 ] 

Stefan Richter commented on FLINK-7880:
---

My theory is that the `dispose()` is not properly executed before the test 
finishes. By adding printouts to the backend's constructor, dispose(), and the 
code that waits for cancelation like this:

{code}
} finally {
System.out.println("shutdown a");
// Free cluster resources
if (jobId != null) {
CompletableFuture 
cancellation = FutureUtils.toJava(cluster

.getLeaderGateway(deadline.timeLeft())
.ask(new 
JobManagerMessages.CancelJob(jobId), deadline.timeLeft())

.mapTo(ClassTag$.MODULE$.apply(CancellationSuccess.class)));


cancellation.get(deadline.timeLeft().toMillis(), TimeUnit.MILLISECONDS);
}
System.out.println("shutdown b");
}

{code}

I obtain the following printing order:


{code}
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@58482c7d
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@5278696f
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@11c418d8
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@78424ca5
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@79690844
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@44855a3
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@459dcdef
created 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@54cef1d0
shutdown a
shutdown b
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@11c418d8
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@79690844
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@58482c7d
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@44855a3
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@78424ca5
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@5278696f
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@459dcdef
dispose 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend@54cef1d0
{code}

I conclude that the way in which the tests waits for cancelation does not work 
as expected. In particular, it does not ensure that `dispose()` was executed 
before the method ends and this can lead to problems with the native resources 
like what you observe.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240024#comment-16240024
 ] 

Till Rohrmann commented on FLINK-7880:
--

The {{CancellationSuccess}} message is sent after the {{JobManager}} has called 
{{cancel}} on the {{ExecutionGraph}}. But this does not necessarily mean that 
the {{ExecutionGraph}} is in a terminal state when the client receives the 
{{CancellationSuccess}} message. Thus, waiting on the {{CancelJob}} response 
won't work to ensure proper job termination/cleanup.

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Stefan Richter (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240153#comment-16240153
 ] 

Stefan Richter commented on FLINK-7880:
---

Yes, but the test seems to expect that waiting for {{CancellationSuccess}} 
includes a successful cleanup or it was just not aware how important the proper 
cleanup is with native resources. In any case, I think the origin of this 
problem might be taking other IT cases a blueprints, and I have seen different 
patterns for this "wait until the job is gone" problem in different tests. Many 
of them might be similar to this one, but they will often look correct or cause 
no trouble if there are no native libraries involved (e.g. test only uses heap 
backend). I would suggest that there should be one and only one simple (maybe 
one helper class that does this), idiomatic way of waiting for a job to go away 
and release all resources that is used throughout all tests that actually want 
to have this behaviour. Otherwise, for example, extending an existing test to 
include a different backend can suddenly uncover the improper cleanup and make 
tests randomly fail with a JVM crash. 

Having a clear way to end IT cases could help to avoid chasing seriously 
looking, misleading test failures that seem to originate from the RocksDB 
backend code, but are actually tests problems from improper cleanup. What do 
you think?

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240252#comment-16240252
 ] 

Aljoscha Krettek commented on FLINK-7880:
-

+2 to (to what [~srichter] said)

> flink-queryable-state-java fails with core-dump
> ---
>
> Key: FLINK-7880
> URL: https://issues.apache.org/jira/browse/FLINK-7880
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.4.0
>
>
> The {{flink-queryable-state-java}} module fails on Travis with a core dump.
> https://travis-ci.org/tillrohrmann/flink/jobs/289949829



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)