[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-06-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527223#comment-16527223
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user zhangminglei closed the pull request at:

https://github.com/apache/flink/pull/5440


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>  - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479855#comment-16479855
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/6035


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>  - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479309#comment-16479309
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189019837
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -947,6 +964,36 @@ private void closeResourceManagerConnection(Exception 
cause) {
resourceManagerConnection.close();
resourceManagerConnection = null;
}
+
+   startRegistrationTimeout();
+   }
+
+   private void startRegistrationTimeout() {
+   final Time maxRegistrationDuration = 
taskManagerConfiguration.getMaxRegistrationDuration();
+
+   if (maxRegistrationDuration != null) {
+   final UUID newRegistrationTimeoutId = UUID.randomUUID();
+   currentRegistrationTimeoutId = newRegistrationTimeoutId;
+   scheduleRunAsync(() -> 
registrationTimeout(newRegistrationTimeoutId), maxRegistrationDuration);
+   }
+   }
+
+   private void stopRegistrationTimeout() {
+   currentRegistrationTimeoutId = null;
+   }
+
+   private void registrationTimeout(@Nonnull UUID registrationTimeoutId) {
+   if (registrationTimeoutId.equals(currentRegistrationTimeoutId)) 
{
+   final Time maxRegistrationDuration = 
taskManagerConfiguration.getMaxRegistrationDuration();
+
+   onFatalError(
+   new RegistrationTimeoutException(
+   String.format("Could not register at 
the ResourceManager within the specified maximum " +
+   "registration duration %s. This 
indicates a problem with this instance. Terminating now.",
+   maxRegistrationDuration)));
+   } else {
+   log.debug("Ignoring outdated registration timeout.");
--- End diff --

True, will remove it.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479308#comment-16479308
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189019783
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java 
---
@@ -1605,9 +1609,16 @@ public void notifyHeartbeatTimeout(final ResourceID 
resourceId) {
runAsync(() -> {
log.info("The heartbeat of ResourceManager with 
id {} timed out.", resourceId);
 
-   closeResourceManagerConnection(
-   new TimeoutException(
-   "The heartbeat of 
ResourceManager with id " + resourceId + " timed out."));
+   if (establishedResourceManagerConnection != 
null && 
establishedResourceManagerConnection.getResourceManagerResourceID().equals(resourceId))
 {
+   final String resourceManagerAddress = 
establishedResourceManagerConnection.getResourceManagerGateway().getAddress();
--- End diff --

Actually not, because we set `establishedResourcemanagerConnection` to 
`null` in the close method.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479296#comment-16479296
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189017088
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -947,6 +964,36 @@ private void closeResourceManagerConnection(Exception 
cause) {
resourceManagerConnection.close();
resourceManagerConnection = null;
}
+
+   startRegistrationTimeout();
--- End diff --

The problem is that we want this timeout to start whenever the 
`TaskExecutor` loses its connection to the RM and that's when we close the RM 
connection. This also covers the case, where we don't know the RM address (e.g. 
if the RM loses leadership).


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479292#comment-16479292
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189016636
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerConfiguration.java
 ---
@@ -50,8 +50,11 @@
private final String[] tmpDirectories;
 
private final Time timeout;
+
// null indicates an infinite duration
+   @Nullable
--- End diff --

Will change it.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479294#comment-16479294
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189016685
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java 
---
@@ -1605,9 +1609,16 @@ public void notifyHeartbeatTimeout(final ResourceID 
resourceId) {
runAsync(() -> {
log.info("The heartbeat of ResourceManager with 
id {} timed out.", resourceId);
 
-   closeResourceManagerConnection(
-   new TimeoutException(
-   "The heartbeat of 
ResourceManager with id " + resourceId + " timed out."));
+   if (establishedResourceManagerConnection != 
null && 
establishedResourceManagerConnection.getResourceManagerResourceID().equals(resourceId))
 {
+   final String resourceManagerAddress = 
establishedResourceManagerConnection.getResourceManagerGateway().getAddress();
--- End diff --

True, will change it.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479291#comment-16479291
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189016603
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -232,8 +239,10 @@ public TaskExecutor(
rpcService.getScheduledExecutor(),
log);
 
-   hardwareDescription = HardwareDescription.extractFromSystem(
+   this.hardwareDescription = 
HardwareDescription.extractFromSystem(

taskExecutorServices.getMemoryManager().getMemorySize());
+
+   this.currentRegistrationTimeoutId = null;
--- End diff --

Will change it.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479272#comment-16479272
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r188992099
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -232,8 +239,10 @@ public TaskExecutor(
rpcService.getScheduledExecutor(),
log);
 
-   hardwareDescription = HardwareDescription.extractFromSystem(
+   this.hardwareDescription = 
HardwareDescription.extractFromSystem(

taskExecutorServices.getMemoryManager().getMemorySize());
+
+   this.currentRegistrationTimeoutId = null;
--- End diff --

It's already `null` by default.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479268#comment-16479268
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189004280
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerConfiguration.java
 ---
@@ -50,8 +50,11 @@
private final String[] tmpDirectories;
 
private final Time timeout;
+
// null indicates an infinite duration
+   @Nullable
--- End diff --

Should also be `@Nullable` on the constructor.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479270#comment-16479270
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189013436
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -947,6 +964,36 @@ private void closeResourceManagerConnection(Exception 
cause) {
resourceManagerConnection.close();
resourceManagerConnection = null;
}
+
+   startRegistrationTimeout();
--- End diff --

It looks weird that we call `startRegistrationTimeout();` in 
`closeResourceManagerConnection`. Can it be done in 
`createResourceManagerConnection`


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479271#comment-16479271
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189004536
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java 
---
@@ -1605,9 +1609,16 @@ public void notifyHeartbeatTimeout(final ResourceID 
resourceId) {
runAsync(() -> {
log.info("The heartbeat of ResourceManager with 
id {} timed out.", resourceId);
 
-   closeResourceManagerConnection(
-   new TimeoutException(
-   "The heartbeat of 
ResourceManager with id " + resourceId + " timed out."));
+   if (establishedResourceManagerConnection != 
null && 
establishedResourceManagerConnection.getResourceManagerResourceID().equals(resourceId))
 {
+   final String resourceManagerAddress = 
establishedResourceManagerConnection.getResourceManagerGateway().getAddress();
--- End diff --

Declaration and assignment can be moved closer to 
`createResourceManagerConnection`.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479269#comment-16479269
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/6035#discussion_r189012246
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ---
@@ -947,6 +964,36 @@ private void closeResourceManagerConnection(Exception 
cause) {
resourceManagerConnection.close();
resourceManagerConnection = null;
}
+
+   startRegistrationTimeout();
+   }
+
+   private void startRegistrationTimeout() {
+   final Time maxRegistrationDuration = 
taskManagerConfiguration.getMaxRegistrationDuration();
+
+   if (maxRegistrationDuration != null) {
+   final UUID newRegistrationTimeoutId = UUID.randomUUID();
+   currentRegistrationTimeoutId = newRegistrationTimeoutId;
+   scheduleRunAsync(() -> 
registrationTimeout(newRegistrationTimeoutId), maxRegistrationDuration);
+   }
+   }
+
+   private void stopRegistrationTimeout() {
+   currentRegistrationTimeoutId = null;
+   }
+
+   private void registrationTimeout(@Nonnull UUID registrationTimeoutId) {
+   if (registrationTimeoutId.equals(currentRegistrationTimeoutId)) 
{
+   final Time maxRegistrationDuration = 
taskManagerConfiguration.getMaxRegistrationDuration();
+
+   onFatalError(
+   new RegistrationTimeoutException(
+   String.format("Could not register at 
the ResourceManager within the specified maximum " +
+   "registration duration %s. This 
indicates a problem with this instance. Terminating now.",
+   maxRegistrationDuration)));
+   } else {
+   log.debug("Ignoring outdated registration timeout.");
--- End diff --

I think this will be logged even if the registration succeeded.


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the 
> {{TaskExecutor}} should terminate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479151#comment-16479151
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/6035

[FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to 
JobMaster and TaskExecutor

## What is the purpose of the change

If a timeout with the RM occurs on on the JobMaster and TaskExecutor, then 
they will both try to reconnect
to the last known RM address.

Additionally, we now respect the TaskManagerOption#REGISTRATION_TIMEOUT on 
the TaskExecutor. This means that
if the TaskExecutor could not register at a RM within the given 
registration timeout, it will fail with a
fatal exception. This allows to fail the TaskExecutor process in case that 
it cannot establish a connection
and ultimately frees the occupied resources.

The commit also changes the default value for 
TaskManagerOption#REGISTRATION_TIMEOUT from "Inf" to "5 min".

cc @GJL.

## Brief change log

- Retry connection to RM in case of heartbeat timeout on `JobMaster` and 
`TaskExecutor`
- Fail `TaskExecutor` if we could not connect to `RM` within 
`TaskManagerOptions#REGISTRATION_TIMEOUT`

## Verifying this change

- Adapted `JobMasterTest#testHeartbeatTimeoutWithResourceManager`
- Adapted `TaskExecutorTest#testHeartbeatTimeoutWithResourceManager`
- Added `TaskExecutorTest#testMaximumRegistrationDuration` and 
`TaskExecutorTest#testMaximumRegistrationDurationAfterConnectionLoss`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixReconnection

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6035


commit 6b45c84cf06688099e71c9e1809917653af43d31
Author: Till Rohrmann 
Date:   2018-05-17T12:44:14Z

[FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to 
JobMaster and TaskExecutor

If a timeout with the RM occurs on on the JobMaster and TaskExecutor, then 
they will both try to reconnect
to the last known RM address.

Additionally, we now respect the TaskManagerOption#REGISTRATION_TIMEOUT on 
the TaskExecutor. This means that
if the TaskExecutor could not register at a RM within the given 
registration timeout, it will fail with a
fatal exception. This allows to fail the TaskExecutor process in case that 
it cannot establish a connection
and ultimately frees the occupied resources.

The commit also changes the default value for 
TaskManagerOption#REGISTRATION_TIMEOUT from "Inf" to "5 min".




>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0, 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which 
> the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), 

[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387316#comment-16387316
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

Github user zhangminglei commented on the issue:

https://github.com/apache/flink/pull/5440
  
@tillrohrmann Could you please take a look on this when available ? Thanks 


>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-03-02 Thread mingleizhang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383541#comment-16383541
 ] 

mingleizhang commented on FLINK-6160:
-

Sup ?

>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358310#comment-16358310
 ] 

ASF GitHub Bot commented on FLINK-6160:
---

GitHub user zhangminglei opened a pull request:

https://github.com/apache/flink/pull/5440

[FLINK-6160] [flip-6] Retry JobManager/ResourceManager connection in …

## What is the purpose of the change

When timeout comes, retry JobManager/ResourceManager connection in case of 
timeout

## Brief change log
When timeout, invoke ```requestHeartbeat``` in HeartbeatMonitor thread. Not 
directly invoke ```notifyHeartbeatTimeout``` and close the connection.

## Verifying this change

This change is already covered by existing tests, but did minor changes. in 
the TaskExecutorTest.java, change ```testHeartbeatTimeoutWithResourceManager``` 
behavior to while timeout, does not invoke ```disconnectTaskManager```.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): ( no)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: ( no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): ( don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no )

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhangminglei/flink flink-6160

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5440


commit 5319abdf503c757baf7afde9913ab2fb6fb61b60
Author: zhangminglei 
Date:   2018-02-09T11:52:50Z

[FLINK-6160] [flip-6] Retry JobManager/ResourceManager connection in case 
of timeout




>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-01-31 Thread mingleizhang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347928#comment-16347928
 ] 

mingleizhang commented on FLINK-6160:
-

Thanks [~till.rohrmann] I will take a look on what you said.

>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-01-29 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343559#comment-16343559
 ] 

Till Rohrmann commented on FLINK-6160:
--

The {{TaskExecutor}} retries a timed out connection to the {{JobMaster}} but 
the other components don't yet retry their connections. We should fix this 
issue.

>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

2018-01-29 Thread mingleizhang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343323#comment-16343323
 ] 

mingleizhang commented on FLINK-6160:
-

I found all components such as {{TaskExecutor}}, {{JobMaster}} and 
{{ResourceManager}} all implements the {{HeartbeatListener}} interface, But 
none of them retry to each other while timeout comes. I think all of them 
should have a retry action indeed.

>  Retry JobManager/ResourceManager connection in case of timeout
> ---
>
> Key: FLINK-6160
> URL: https://issues.apache.org/jira/browse/FLINK-6160
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.3.0
>Reporter: Till Rohrmann
>Priority: Major
>  Labels: flip-6
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to 
> the remote component. Furthermore, it assumes that the component has actually 
> failed and, thus, it will only start trying to connect to the component if it 
> is notified about a new leader address and leader session id. This is 
> brittle, because the heartbeat could also time out without the component 
> having crashed. Thus, we should add an automatic retry to the latest known 
> leader address information in case of a timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)