[ https://issues.apache.org/jira/browse/FLINK-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889564#comment-16889564 ]
Shannon Carey commented on FLINK-12595:
---------------------------------------

Sorry about that! Had to draw out a big sequence diagram, but I have a hypothesis about the issue.

It looks like KinesisDataFetcher (on the FlinkKinesisConsumer thread) could by chance pause inside the while(running) loop, before Thread.sleep(). Then, the KinesisShardConsumer thread might interrupt that thread via KinesisDataFetcher#mainThread.interrupt() and then trigger shutdownWaiter, allowing the test's fetcher.waitUntilShutdown() to complete, after which the test would also interrupt the FlinkKinesisConsumer thread.

Then, when the KinesisDataFetcher code resumes, it calls Thread.sleep(), which throws InterruptedException immediately. The code catches and ignores it, which clears the single interrupted state left by both interrupts. It then exits the while(running) loop because running == false, and reaches our awaitTermination() code, which waits forever because the test's interrupt has already been absorbed.

This can be reproduced by adding code like this to the KinesisDataFetcher beneath the if(running && discoveryIntervalMillis !=0) line in order to force a longer delay in that thread:

{code:java}
boolean wasInterrupted = false;
int interruptionCount = 0;
for (int i = 0; i < 4; i++) {
    try {
        Thread.sleep(4000);
    } catch (InterruptedException ie) {
        wasInterrupted = true;
        interruptionCount++;
    }
}
if (wasInterrupted) {
    // Restore the interrupted state
    Thread.currentThread().interrupt();
}
System.out.println("KinesisDataFetcher was interrupted " + interruptionCount
    + " times during the " + "while(running) loop.");
System.out.flush();
{code}

You'll likely see that it gets interrupted twice, and the test deadlocks with the same stacks as the logs provided above.

I have 02a0cf3d4e checked out to reproduce the issue matching the provided logs. I assume it's best to write this patch against HEAD of master, and let you handle backporting it? Let me know if that's not the case.

I'll post again once I have a PR to address my hypothesis.
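The mechanism in the hypothesis above (two interrupts collapsing into one flag, which a swallowed InterruptedException then clears) can be shown outside of Flink. The following is only a minimal standalone sketch, not the KinesisDataFetcher code and not the eventual patch: the class AbsorbedInterruptDemo, the method finalWaitSeesInterrupt, and the restoreInterrupt switch are all hypothetical names made up for illustration, and a bounded CountDownLatch wait stands in for awaitTermination() so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone illustration only; none of these names come from Flink.
final class AbsorbedInterruptDemo {

    /**
     * Returns true if the final blocking wait observed an interrupt,
     * false if it timed out (the stand-in for "waits forever").
     */
    static boolean finalWaitSeesInterrupt(boolean restoreInterrupt) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean interruptsSent = new AtomicBoolean(false);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        CountDownLatch neverReleased = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            // Simulate the thread being paused just before Thread.sleep():
            // spin without any interruptible call until both interrupts are sent.
            while (!interruptsSent.get()) {
                // busy-wait; deliberately not interruptible
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // Both interrupts collapsed into one flag, which sleep() just cleared.
                if (restoreInterrupt) {
                    Thread.currentThread().interrupt();
                }
            }
            try {
                // Stand-in for an awaitTermination()-style wait; bounded so the demo ends.
                neverReleased.await(500, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });

        worker.start();
        started.await();          // make sure the worker is running before interrupting it
        worker.interrupt();       // first interrupt (e.g. from another worker thread)
        worker.interrupt();       // second interrupt (e.g. from the test)
        interruptsSent.set(true); // let the "paused" worker resume
        worker.join();
        return sawInterrupt.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("swallowed: final wait interrupted = " + finalWaitSeesInterrupt(false));
        System.out.println("restored:  final wait interrupted = " + finalWaitSeesInterrupt(true));
    }
}
```

With restoreInterrupt set to false, the final wait is expected to time out, because the only pending interrupt was consumed by Thread.sleep(). With it set to true, the re-asserted flag makes the final wait throw InterruptedException straight away. The flag is used here purely for contrast and is not meant to suggest what the actual fix will look like.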
> KinesisDataFetcherTest.testOriginalExceptionIsPreservedWhenInterruptedDuringShutdown deadlocks
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12595
>                 URL: https://issues.apache.org/jira/browse/FLINK-12595
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis, Tests
>    Affects Versions: 1.9.0
>            Reporter: Dawid Wysakowicz
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/535738122/log.txt

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)