[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561045#comment-16561045 ]

Congxian Qiu commented on FLINK-8163:
-------------------------------------

When the test got stuck, I captured the following jstack output:
{code:java}
"Flink Netty Client (0) Thread 0" #11277 daemon prio=5 os_prio=31 tid=0x7fe7237ed800 nid=0x323a3 runnable [0x70002c2d]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x000742003840> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
    - locked <0x000742003830> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0007420037e0> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
...
"Flink Netty Client (0) Thread 0" #11275 daemon prio=5 os_prio=31 tid=0x7fe70900b000 nid=0x3fe37 runnable [0x70002b9b5000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x000741e60f58> (a org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
    - locked <0x000741e60ea0> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000741e60dc0> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
{code}

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---------------------------------------------------------------
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: pull-request-available, test-stability
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis,
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550288#comment-16550288 ]

Congxian Qiu commented on FLINK-8163:
-------------------------------------

I ran NonHAQueryableStateFsBackendITCase#testValueStateDefault locally and it sometimes hung. I will debug it to find out why.
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546468#comment-16546468 ]

ASF GitHub Bot commented on FLINK-8163:
---------------------------------------

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/6352

    [FLINK-8163][yarn][tests] Harden tests against slow job shutdowns

    ## What is the purpose of the change

    This PR hardens the `YarnTestBase` against jobs that do not shut down quickly (i.e. within 500ms).
    The maximum waiting time has been increased to 10 seconds, during which we periodically check the state of all applications.

    Additionally, the failure condition from `@Before` was moved to the `@After` method.

    This change will allow us to better differentiate between simple timing issues and unsuccessful job shutdowns.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 8163

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6352

commit 0dd65378f0c9f477bb8f5712bbc0b1f31440f5f0
Author: zentol
Date: 2018-07-17T11:29:16Z

    [FLINK-8163][yarn][tests] Harden tests against slow job shutdowns
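The waiting strategy described in the pull request (poll periodically up to a maximum wait, then report whether shutdown succeeded) could look roughly like the sketch below. This is not the actual `YarnTestBase` code: the class `ShutdownWaiter`, the method `waitUntil`, and its parameters are hypothetical names chosen for illustration, and the condition that checks the application states is left to the caller.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not the actual YarnTestBase code. */
final class ShutdownWaiter {

    /**
     * Polls the condition until it holds or maxWait has elapsed.
     *
     * @return true if the condition held before the deadline, false on
     *         timeout or if the waiting thread was interrupted
     */
    static boolean waitUntil(BooleanSupplier allStopped, Duration maxWait, Duration pollInterval) {
        long deadlineNanos = System.nanoTime() + maxWait.toNanos();
        while (!allStopped.getAsBoolean()) {
            long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // deadline passed, condition still not met
            }
            // Sleep for the poll interval, but never past the deadline.
            long sleepMillis = Math.min(pollInterval.toMillis(), Math.max(1L, remainingNanos / 1_000_000L));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

In a JUnit `@After` method, a test base could assert on the returned flag, for example with a 10 second `maxWait` and a condition that queries the state of all applications. Failing there rather than in `@Before` attributes the problem to the test that left the job running.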
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546433#comment-16546433 ]

Chesnay Schepler commented on FLINK-8163:
-----------------------------------------

I run into the memory issue consistently after ~100 iterations.

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---------------------------------------------------------------
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: test-stability
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis,
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546392#comment-16546392 ]

Chesnay Schepler commented on FLINK-8163:
-----------------------------------------

This test has some funky retrying logic that ignores most exceptions:
{code:java}
CompletableFuture expected = client.getKvState(jobId, queryName, key, keyTypeInfo, stateDescriptor);
expected.whenCompleteAsync((result, throwable) -> {
    if (throwable != null) {
        // Only these causes fail the result future.
        if (
            throwable.getCause() instanceof CancellationException ||
            throwable.getCause() instanceof AssertionError ||
            (failForUnknownKeyOrNamespace && throwable.getCause() instanceof UnknownKeyOrNamespaceException)
        ) {
            resultFuture.completeExceptionally(throwable.getCause());
        } else if (deadline.hasTimeLeft()) {
            // Every other exception triggers another attempt.
            getKvStateIgnoringCertainExceptions(
                deadline, resultFuture, client, jobId, queryName, key, keyTypeInfo,
                stateDescriptor, failForUnknownKeyOrNamespace, executor);
        }
    } else {
        resultFuture.complete(result);
    }
}, executor);
{code}
When running the test locally in a loop, this exception was logged on the server:
{code}
org.apache.flink.shaded.netty4.io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2147483648, max: 2147483648)
{code}
The client ignores this error and retries the operation indefinitely, which causes the timeout. Incidentally, the same exception is printed on every subsequent attempt.
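The retry problem described in this comment can be illustrated without Flink. The sketch below is not the fix that was applied to the test; it is a minimal example under stated assumptions, with hypothetical names (`BoundedRetry`, `retry`, `isFatal`). It shows one way to keep retrying transient failures while propagating fatal errors immediately, and to fail the result future once the deadline has passed instead of leaving it incomplete.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of Flink. */
final class BoundedRetry {

    /** Retries the async attempt until it succeeds, fails fatally, or the timeout elapses. */
    static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> attempt,
            Predicate<Throwable> isFatal,
            Duration timeout) {
        CompletableFuture<T> result = new CompletableFuture<>();
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        retryOnce(attempt, isFatal, deadlineNanos, result);
        return result;
    }

    private static <T> void retryOnce(
            Supplier<CompletableFuture<T>> attempt,
            Predicate<Throwable> isFatal,
            long deadlineNanos,
            CompletableFuture<T> result) {
        attempt.get().whenCompleteAsync((value, throwable) -> {
            if (throwable == null) {
                result.complete(value);
                return;
            }
            Throwable cause = unwrap(throwable);
            if (isFatal.test(cause)) {
                // Fatal errors (e.g. out-of-memory conditions) are never retried.
                result.completeExceptionally(cause);
            } else if (System.nanoTime() - deadlineNanos < 0) {
                // Transient failure with time left: schedule another attempt.
                retryOnce(attempt, isFatal, deadlineNanos, result);
            } else {
                // Deadline passed: fail the future so that callers do not hang.
                TimeoutException expired = new TimeoutException("deadline expired while retrying");
                expired.initCause(cause);
                result.completeExceptionally(expired);
            }
        });
    }

    /** Strips the CompletionException wrapper that dependent stages add. */
    private static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }
}
```

A caller could pass `t -> t instanceof Error` as `isFatal` so that errors of the kind logged above end the retry loop on the first attempt. The sketch retries without any backoff, which a real test would probably want to add.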
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514028#comment-16514028 ]

Till Rohrmann commented on FLINK-8163:
--------------------------------------

Any news [~kkl0u]?

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---------------------------------------------------------------
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Kostas Kloudas
> Priority: Critical
> Labels: test-stability
> Fix For: 1.5.1
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis,
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419066#comment-16419066 ]

Kostas Kloudas commented on FLINK-8163:
---------------------------------------

I already have it on my radar ;)

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---------------------------------------------------------------
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Kostas Kloudas
> Priority: Critical
> Labels: test-stability
> Fix For: 1.5.0
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis,
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419054#comment-16419054 ]

Till Rohrmann commented on FLINK-8163:
--------------------------------------

Alright, then I will unblock 1.5.0 from this issue since it might also be a testing issue. Please keep an eye on this issue in case it recurs.

> NonHAQueryableStateFsBackendITCase test getting stuck on Travis
> ---------------------------------------------------------------
>
> Key: FLINK-8163
> URL: https://issues.apache.org/jira/browse/FLINK-8163
> Project: Flink
> Issue Type: Bug
> Components: Queryable State, Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Kostas Kloudas
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.5.0
>
> The {{NonHAQueryableStateFsBackendITCase}} test seems to get stuck on Travis,
> producing no output for 300s.
> https://travis-ci.org/tillrohrmann/flink/jobs/307988209
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419036#comment-16419036 ]

Kostas Kloudas commented on FLINK-8163:
---------------------------------------

Not exactly. I want to merge this open PR [https://github.com/apache/flink/pull/5691] and see if the problem persists.
[jira] [Commented] (FLINK-8163) NonHAQueryableStateFsBackendITCase test getting stuck on Travis
[ https://issues.apache.org/jira/browse/FLINK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419030#comment-16419030 ]

Till Rohrmann commented on FLINK-8163:
--------------------------------------

What's the state [~kkl0u]? Do we know what's causing the problem?