[ https://issues.apache.org/jira/browse/FLINK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913209#comment-16913209 ]

Andrey Zagrebin commented on FLINK-13769:
-----------------------------------------

The problem is that, once we merged the change that waits for all tasks to 
terminate in TaskExecutor.onStop() (FLINK-11630), the testing mappers in 
BatchFineGrainedRecoveryITCase were interrupted earlier than before, which 
caused a race with the next slot allocation.

The JM was notified about the task failure sooner and immediately requested a 
slot from the RM, which had not yet realised that the slot of the stopping TM 
could no longer be used. To fix this, we need to deregister the TM from the RM 
at the beginning of TaskExecutor.onStop().

Having deregistered itself, the TM should stop reconnecting to the RM. This 
requires a preliminary change (FLINK-13819) so that 
TaskExecutor.disconnectResourceManager can check the stopping state of the TM 
RpcEndpoint to decide whether to reconnect.
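
To make the intended ordering concrete, here is a minimal toy model of the two 
steps above. It is only a sketch and not Flink code: ToyResourceManager, 
ToyTaskExecutor and the callback parameter of onStop are hypothetical names 
invented for illustration, and the real TaskExecutor is asynchronous and 
considerably more involved.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these classes are Flink APIs.
class StopOrderingSketch {

    /** Hypothetical stand-in for the RM's bookkeeping of registered TMs. */
    static class ToyResourceManager {
        private final Set<String> registered = new HashSet<>();

        void register(String tmId) { registered.add(tmId); }

        void deregister(String tmId) { registered.remove(tmId); }

        /** The RM offers slots only on TMs it still considers registered. */
        boolean canOfferSlotOn(String tmId) { return registered.contains(tmId); }
    }

    /** Hypothetical stand-in for the TM; it mirrors the comment only loosely. */
    static class ToyTaskExecutor {
        private final String id;
        private final ToyResourceManager rm;
        private boolean stopping = false;

        ToyTaskExecutor(String id, ToyResourceManager rm) {
            this.id = id;
            this.rm = rm;
            rm.register(id);
        }

        /** Deregister first, then wait for tasks (modelled by the callback). */
        void onStop(Runnable waitForTaskTermination) {
            stopping = true;
            rm.deregister(id);
            waitForTaskTermination.run();
        }

        /** Reconnect only while not stopping; returns whether a reconnect happened. */
        boolean disconnectResourceManager() {
            rm.deregister(id);
            if (stopping) {
                return false;
            }
            rm.register(id);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        ToyTaskExecutor tm = new ToyTaskExecutor("tm-3", rm);

        // While tasks are still terminating, the RM must already refuse the slot.
        tm.onStop(() -> System.out.println(
                "slot offered during stop: " + rm.canOfferSlotOn("tm-3")));

        // A later disconnect must not re-register the stopping TM.
        System.out.println("reconnected: " + tm.disconnectResourceManager());
    }
}
```

In this model the RM stops offering the slot before task termination begins, 
and a disconnect that arrives afterwards does not re-register the TM. That is 
the behaviour the fix aims for.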

> BatchFineGrainedRecoveryITCase.testProgram failed on Travis
> -----------------------------------------------------------
>
>                 Key: FLINK-13769
>                 URL: https://issues.apache.org/jira/browse/FLINK-13769
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Andrey Zagrebin
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>              Labels: test-stability
>
> {{BatchFineGrainedRecoveryITCase.testProgram}} failed on Travis.
> {code}
> 23:14:26.860 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 50.007 s <<< FAILURE! - in org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase
> 23:14:26.868 [ERROR] testProgram(org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase)  Time elapsed: 49.469 s  <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>       at org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase.testProgram(BatchFineGrainedRecoveryITCase.java:225)
> Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@localhost:39333/user/taskmanager_3#-344551647]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@localhost:39333/user/taskmanager_3#-344551647]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
> {code}
> [https://travis-ci.org/apache/flink/jobs/573523669]



