[ https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479151#comment-16479151 ]
ASF GitHub Bot commented on FLINK-6160:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6035

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor

    ## What is the purpose of the change

    If a timeout with the RM occurs on the JobMaster and TaskExecutor, then they will both try to reconnect to the last known RM address.

    Additionally, we now respect the TaskManagerOptions#REGISTRATION_TIMEOUT on the TaskExecutor. This means that if the TaskExecutor cannot register at an RM within the given registration timeout, it will fail with a fatal exception. This allows the TaskExecutor process to fail if it cannot establish a connection, which ultimately frees the occupied resources.

    The commit also changes the default value for TaskManagerOptions#REGISTRATION_TIMEOUT from "Inf" to "5 min".

    cc @GJL.

    ## Brief change log

    - Retry connection to RM in case of heartbeat timeout on `JobMaster` and `TaskExecutor`
    - Fail `TaskExecutor` if we could not connect to `RM` within `TaskManagerOptions#REGISTRATION_TIMEOUT`

    ## Verifying this change

    - Adapted `JobMasterTest#testHeartbeatTimeoutWithResourceManager`
    - Adapted `TaskExecutorTest#testHeartbeatTimeoutWithResourceManager`
    - Added `TaskExecutorTest#testMaximumRegistrationDuration` and `TaskExecutorTest#testMaximumRegistrationDurationAfterConnectionLoss`

    ## Does this pull request potentially affect one of the following parts:

      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)

    ## Documentation

      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixReconnection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6035

----
commit 6b45c84cf06688099e71c9e1809917653af43d31
Author: Till Rohrmann <trohrmann@...>
Date:   2018-05-17T12:44:14Z

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor

    If a timeout with the RM occurs on the JobMaster and TaskExecutor, then they will both try to reconnect
    to the last known RM address.

    Additionally, we now respect the TaskManagerOptions#REGISTRATION_TIMEOUT on the TaskExecutor. This means
    that if the TaskExecutor cannot register at an RM within the given registration timeout, it will fail
    with a fatal exception. This allows the TaskExecutor process to fail if it cannot establish a
    connection, which ultimately frees the occupied resources.

    The commit also changes the default value for TaskManagerOptions#REGISTRATION_TIMEOUT from "Inf" to
    "5 min".

----

> Retry JobManager/ResourceManager connection in case of timeout
> ---------------------------------------------------------------
>
>                 Key: FLINK-6160
>                 URL: https://issues.apache.org/jira/browse/FLINK-6160
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to
> the remote component. Furthermore, it assumes that the component has actually
> failed and, thus, it will only start trying to connect to the component if it
> is notified about a new leader address and leader session id. This is
> brittle, because the heartbeat could also time out without the component
> having crashed. Thus, we should add an automatic retry to the latest known
> leader address information in case of a timeout.
> *Acceptance criteria:*
> - The registration should be retried until a time limit expires after which
> the {{TaskExecutor}} terminates.
> - If the registration is declined ({{RegistrationResponse.Decline}}), the
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
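
The behaviour described above (retry registration at the last known address until a registration timeout expires, then give up so the process can terminate) can be illustrated with a small standalone sketch. This is not the code from the pull request and does not use Flink's classes: `RegistrationRetrySketch`, `Outcome`, `Clock`, `ManualClock`, `registerWithTimeout`, and the fixed 10-second retry delay are all hypothetical names and values chosen for illustration. The registration attempt is modelled as a plain boolean supplier, and a manual clock stands in for real waiting so the example runs instantly.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; none of these types exist in Flink. */
public class RegistrationRetrySketch {

    /** Result of the retry loop; the caller decides how to react to TIMED_OUT. */
    enum Outcome { REGISTERED, TIMED_OUT }

    /** Minimal clock abstraction so the loop can run without real waiting. */
    interface Clock {
        long nowMillis();
        void sleep(long millis) throws InterruptedException;
    }

    /** Test clock: "sleeping" simply advances the current time. */
    static final class ManualClock implements Clock {
        private long now;
        public long nowMillis() { return now; }
        public void sleep(long millis) { now += millis; }
    }

    /**
     * Retries {@code attempt} (a stand-in for registering at the last known
     * RM address) until it succeeds or the next retry would pass the deadline.
     */
    static Outcome registerWithTimeout(
            BooleanSupplier attempt, Duration timeout, Duration retryDelay, Clock clock)
            throws InterruptedException {
        final long deadline = clock.nowMillis() + timeout.toMillis();
        while (true) {
            if (attempt.getAsBoolean()) {
                return Outcome.REGISTERED;
            }
            if (clock.nowMillis() + retryDelay.toMillis() > deadline) {
                // In the scenario described above, this is where the
                // TaskExecutor would fail with a fatal exception.
                return Outcome.TIMED_OUT;
            }
            clock.sleep(retryDelay.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated RM that only becomes reachable on the third attempt.
        int[] calls = {0};
        Outcome reachable = registerWithTimeout(
                () -> ++calls[0] >= 3,
                Duration.ofMinutes(5), Duration.ofSeconds(10), new ManualClock());
        System.out.println(reachable + " after " + calls[0] + " attempts");

        // Simulated RM that never answers within the 5 minute timeout.
        Outcome unreachable = registerWithTimeout(
                () -> false,
                Duration.ofMinutes(5), Duration.ofSeconds(10), new ManualClock());
        System.out.println(unreachable);
    }
}
```

Returning an outcome instead of throwing keeps the sketch easy to test. A declined registration, which the acceptance criteria treat as a separate reason to terminate, is not modelled here.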