[ 
https://issues.apache.org/jira/browse/FLINK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709453#comment-14709453
 ] 

ASF GitHub Bot commented on FLINK-2472:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/979#discussion_r37764196
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/client/JobClientActor.java ---
    @@ -144,11 +268,25 @@ else if (message instanceof Terminated) {
                                String msg = "Lost connection to JobManager " + jobManager.path();
                                logger.info(msg);
                                submitter.tell(decorateMessage(new Status.Failure(new Exception(msg))), getSelf());
    +                           resetContextAndActor();
                        } else {
                                logger.error("Received 'Terminated' for unknown actor " + target);
                        }
                }
     
    +           // ============= No messages received in the job manager timeout duration ========
    +           else if (message instanceof ReceiveTimeout){
    +                   double tolerance = 0.1 * JOB_CLIENT_JOB_MANAGER_TIMEOUT.toMillis();
    --- End diff --
    
    Why not set the tolerance to `JOB_CLIENT_JOB_MANAGER_TIMEOUT` itself?


> Make the JobClientActor check periodically if the submitted Job is still 
> running and if the JobManager is still alive
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2472
>                 URL: https://issues.apache.org/jira/browse/FLINK-2472
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Sachin Goel
>
> If the {{JobManager}} dies without notifying any connected 
> {{JobClientActors}}, or if the job execution finishes without sending the 
> {{SerializedJobExecutionResult}} back to the {{JobClientActor}}, a call to 
> {{JobClient.submitJobAndWait}} might never return.
> I propose to let the {{JobClientActor}} periodically check whether the 
> {{JobManager}} is still alive and whether the submitted job is still running. 
> If not, then the {{JobClientActor}} should return an exception to complete 
> the waiting future.
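
The proposal above can be illustrated with a minimal sketch in plain Java. It is not the code from the pull request and does not use Akka: the class `JobManagerLivenessCheck` and its methods are hypothetical names chosen for illustration, and timestamps are passed in explicitly so the logic stays deterministic. The idea is only that a periodic check fails the waiting future once no message has arrived within the timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Flink): fails a waiting future when the JobManager goes silent. */
class JobManagerLivenessCheck {
    private final long timeoutMillis;
    private long lastMessageMillis;
    // Stands in for the future that a blocking submit call would wait on.
    private final CompletableFuture<String> result = new CompletableFuture<>();

    JobManagerLivenessCheck(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastMessageMillis = nowMillis;
    }

    /** Call whenever any message from the JobManager arrives. */
    void onMessage(long nowMillis) {
        lastMessageMillis = nowMillis;
    }

    /**
     * Periodic check. Completes the future exceptionally if nothing was received
     * for longer than the timeout. Returns true only when this call failed the future.
     */
    boolean check(long nowMillis) {
        if (result.isDone()) {
            return false;
        }
        if (nowMillis - lastMessageMillis > timeoutMillis) {
            result.completeExceptionally(
                new TimeoutException("No message from JobManager for more than " + timeoutMillis + " ms"));
            return true;
        }
        return false;
    }

    CompletableFuture<String> result() {
        return result;
    }
}
```

In an actor-based client, `check` would be driven by a scheduled or receive-timeout message rather than called by hand; that wiring is left out here.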



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
