[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-07-29 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561045#comment-16561045
 ] 

Congxian Qiu commented on FLINK-8163:
-

When the test got stuck, I captured the following jstack output:
{code:java}
"Flink Netty Client (0) Thread 0" #11277 daemon prio=5 os_prio=31 tid=0x7fe7237ed800 nid=0x323a3 runnable [0x70002c2d]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x000742003840> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
    - locked <0x000742003830> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0007420037e0> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)

...

"Flink Netty Client (0) Thread 0" #11275 daemon prio=5 os_prio=31 tid=0x7fe70900b000 nid=0x3fe37 runnable [0x70002b9b5000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x000741e60f58> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
    - locked <0x000741e60ea0> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000741e60dc0> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
{code}

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-07-20 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550288#comment-16550288
 ] 

Congxian Qiu commented on FLINK-8163:
-

I ran NonHAQueryableStateFsBackendITCase#testValueStateDefault locally and it 
hung some of the time. I will debug it to find the cause.



[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546468#comment-16546468
 ] 

ASF GitHub Bot commented on FLINK-8163:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/6352

[FLINK-8163][yarn][tests] Harden tests against slow job shutdowns

## What is the purpose of the change

This PR hardens the `YarnTestBase` against jobs that just don't want to 
shut down that quickly (i.e. within 500ms).
The maximum waiting time has been increased to 10 seconds, during which we 
periodically check the state of all applications.

Additionally, the failure condition from `@Before` was moved to the 
`@After` method.

This change will allow us to better differentiate between simple timing 
issues and unsuccessful job shutdowns.
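
The waiting behaviour described above can be sketched roughly as follows. This is not the actual `YarnTestBase` code: the class `ShutdownWait`, the method `waitUntil`, and the poll interval are hypothetical and chosen only for illustration. Only the 10 second upper bound comes from the PR description.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a condition until it holds or a timeout expires. */
final class ShutdownWait {

    private ShutdownWait() {}

    /**
     * Returns true as soon as the condition holds, or false if the timeout
     * expires (or the thread is interrupted) before that happens.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for one poll interval, but never past the deadline.
            long sleepMillis = Math.min(pollInterval.toMillis(), Math.max(1, remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test could call something like `waitUntil(check, Duration.ofSeconds(10), Duration.ofMillis(500))` from its `@After` method, where `check` is an assumed predicate that reports whether all applications have finished, and fail only if the call returns false.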

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink 8163

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6352


commit 0dd65378f0c9f477bb8f5712bbc0b1f31440f5f0
Author: zentol 
Date:   2018-07-17T11:29:16Z

[FLINK-8163][yarn][tests] Harden tests against slow job shutdowns






[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-07-17 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546433#comment-16546433
 ] 

Chesnay Schepler commented on FLINK-8163:
-

I run into the memory issue consistently after ~100 iterations.

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209





[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-07-17 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546392#comment-16546392
 ] 

Chesnay Schepler commented on FLINK-8163:
-

This test has some funky retrying logic that ignores most exceptions:
{code:java}
CompletableFuture expected = client.getKvState(jobId, queryName, key, keyTypeInfo, stateDescriptor);
expected.whenCompleteAsync((result, throwable) -> {
    if (throwable != null) {
        if (
            throwable.getCause() instanceof CancellationException ||
            throwable.getCause() instanceof AssertionError ||
            (failForUnknownKeyOrNamespace && throwable.getCause() instanceof UnknownKeyOrNamespaceException)
        ) {
            // Only these known failures are propagated to the caller.
            resultFuture.completeExceptionally(throwable.getCause());
        } else if (deadline.hasTimeLeft()) {
            // Any other failure is swallowed and the query is retried.
            getKvStateIgnoringCertainExceptions(
                deadline, resultFuture, client, jobId, queryName, key, keyTypeInfo,
                stateDescriptor, failForUnknownKeyOrNamespace, executor);
        }
        // Note: if the deadline has expired, resultFuture is not completed here.
    } else {
        resultFuture.complete(result);
    }
}, executor);
{code}

When running the test locally in a loop, this exception was logged on the server:
{code}
org.apache.flink.shaded.netty4.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2147483648, max: 2147483648)
{code}
The client ignores this error and keeps retrying the operation, which causes 
the timeout. Incidentally, the same exception is printed on every subsequent 
attempt.
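
For illustration, here is a minimal sketch of a retry loop that always completes the result future, either with a value, with a fatal error, or with a timeout once the deadline has passed. It is not the code from the test and not the fix that was applied: `BoundedRetry`, `retry`, and the `isFatal` predicate are hypothetical names, and deciding which exceptions count as fatal (for example any `Error`) is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper: retries an async operation without ever leaving the result future open. */
final class BoundedRetry {

    private BoundedRetry() {}

    /**
     * Runs op repeatedly until it succeeds, fails with a fatal error, or the
     * deadline (a System.nanoTime() value) has passed.
     */
    static <T> void retry(
            Supplier<CompletableFuture<T>> op,
            Predicate<Throwable> isFatal,
            long deadlineNanos,
            CompletableFuture<T> resultFuture,
            Executor executor) {

        op.get().whenCompleteAsync((result, throwable) -> {
            if (throwable == null) {
                resultFuture.complete(result);
                return;
            }
            // Failures from dependent stages arrive wrapped in a CompletionException.
            Throwable cause = (throwable instanceof CompletionException && throwable.getCause() != null)
                    ? throwable.getCause()
                    : throwable;
            if (isFatal.test(cause)) {
                // Fatal errors are reported immediately instead of being retried.
                resultFuture.completeExceptionally(cause);
            } else if (System.nanoTime() - deadlineNanos < 0) {
                // Time is left: try again.
                retry(op, isFatal, deadlineNanos, resultFuture, executor);
            } else {
                // Deadline expired: fail explicitly and keep the last error as the cause.
                TimeoutException timeout = new TimeoutException("deadline expired while retrying");
                timeout.initCause(cause);
                resultFuture.completeExceptionally(timeout);
            }
        }, executor);
    }
}
```

With this shape, a caller waiting on `resultFuture` sees the underlying error or a `TimeoutException` rather than hanging. In practice the executor should hand work to another thread, since a same-thread executor turns each retry into a nested call.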



[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-06-15 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514028#comment-16514028
 ] 

Till Rohrmann commented on FLINK-8163:
--

Any news [~kkl0u]?

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.1
>
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209





[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-03-29 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419066#comment-16419066
 ] 

Kostas Kloudas commented on FLINK-8163:
---

I already have it on my radar ;)

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.5.0
>
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209





[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-03-29 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419054#comment-16419054
 ] 

Till Rohrmann commented on FLINK-8163:
--

Alright, then I will unblock 1.5.0 from this issue, since it might also be a 
testing issue. Please keep an eye on it in case it reoccurs.

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0
>
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis, 
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209





[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-03-29 Thread Kostas Kloudas (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419036#comment-16419036
 ] 

Kostas Kloudas commented on FLINK-8163:
---

Not exactly. I want to merge this open PR 
[https://github.com/apache/flink/pull/5691] and see if the problem persists.



[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis

2018-03-29 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419030#comment-16419030
 ] 

Till Rohrmann commented on FLINK-8163:
--

What's the state [~kkl0u]? Do we know what's causing the problem?
