[ 
https://issues.apache.org/jira/browse/FLINK-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889564#comment-16889564
 ] 

Shannon Carey commented on FLINK-12595:
---------------------------------------

Sorry about that! I had to draw out a big sequence diagram, but I have a 
hypothesis about the issue. It looks like KinesisDataFetcher (on the 
FlinkKinesisConsumer thread) could by chance pause inside the while(running) 
loop, before Thread.sleep(). Then, the KinesisShardConsumer thread might 
interrupt that thread via KinesisDataFetcher#mainThread.interrupt() and then 
trigger shutdownWaiter, allowing the test's fetcher.waitUntilShutdown() to 
complete, after which the test would also interrupt the FlinkKinesisConsumer 
thread. When the KinesisDataFetcher code resumes, it calls Thread.sleep(), 
which throws InterruptedException immediately; the catch block ignores it, 
which clears the interrupted state set by both interrupts. It then exits the 
while(running) loop because running == false and reaches our 
awaitTermination() code, which waits forever because the test's interrupt has 
already been absorbed.
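
To illustrate the mechanism in isolation, here is a minimal standalone sketch. 
It is not the Flink code: the class AbsorbedInterruptDemo, the method 
awaitAfterSleep and the timeouts are hypothetical and chosen only for the 
demo. It sets the interrupt flag twice before Thread.sleep(), swallows the 
resulting InterruptedException, and then calls awaitTermination() with a 
timeout so the demo cannot hang the way the real test does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names exist in Flink.
final class AbsorbedInterruptDemo {

    /**
     * Interrupts the current thread twice, sleeps, and then waits on a pool
     * whose single task is still blocked. Returns what awaitTermination() did.
     */
    static String awaitAfterSleep(boolean restoreFlag) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // The task blocks until released, so the pool cannot terminate yet.
        pool.submit(() -> { release.await(); return null; });
        pool.shutdown();

        // Two interrupts arrive before sleep(); the flag is a single boolean,
        // so the second call changes nothing.
        Thread.currentThread().interrupt();
        Thread.currentThread().interrupt();

        try {
            Thread.sleep(10); // throws at once and clears the interrupt flag
        } catch (InterruptedException ie) {
            if (restoreFlag) {
                Thread.currentThread().interrupt(); // put the flag back
            }
            // otherwise the interrupt is swallowed here
        }

        try {
            // A timeout is used only so this demo always returns.
            boolean done = pool.awaitTermination(200, TimeUnit.MILLISECONDS);
            return done ? "terminated" : "timed out";
        } catch (InterruptedException ie) {
            return "interrupted";
        } finally {
            release.countDown(); // let the worker finish so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("swallowed: " + awaitAfterSleep(false)); // timed out
        System.out.println("restored:  " + awaitAfterSleep(true));  // interrupted
    }
}
```

With the interrupt swallowed, awaitTermination() never sees it and only 
returns because of the demo's timeout; with the flag restored, it throws 
InterruptedException right away.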

This can be reproduced by adding code like this to the KinesisDataFetcher 
beneath the if(running && discoveryIntervalMillis !=0) line in order to force a 
longer delay in that thread:

 
{code:java}
boolean wasInterrupted = false;
int interruptionCount = 0;
for (int i = 0; i < 4; i++) {
   try {
      Thread.sleep(4000);
   } catch (InterruptedException ie) {
      wasInterrupted = true;
      interruptionCount++;
   }
}
if (wasInterrupted) {
   // Restore the interrupted state
   Thread.currentThread().interrupt();
}
System.out.println("KinesisDataFetcher was interrupted " + interruptionCount
   + " times during the while(running) loop.");
System.out.flush();
{code}
You'll likely see that it gets interrupted twice, and the test deadlocks with 
the same stack traces as in the logs provided above.

 

I have 02a0cf3d4e checked out to reproduce the issue matching the provided 
logs. I assume it's best to write this patch against HEAD of master, and let 
you handle backporting it? Let me know if that's not the case. I'll post again 
once I have a PR to address my hypothesis.

> KinesisDataFetcherTest.testOriginalExceptionIsPreservedWhenInterruptedDuringShutdown
>  deadlocks
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12595
>                 URL: https://issues.apache.org/jira/browse/FLINK-12595
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis, Tests
>    Affects Versions: 1.9.0
>            Reporter: Dawid Wysakowicz
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/535738122/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
