[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-09-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599640#comment-17599640
 ] 

ASF GitHub Bot commented on YARN-6667:
--

goiri merged PR #4810:
URL: https://github.com/apache/hadoop/pull/4810




> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-09-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599253#comment-17599253
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1235059105

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  16m 13s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  39m  3s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 44s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 45s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m  1s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  21m 24s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 42s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 36s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 57s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 138m 18s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/10/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f017c225cfb2 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 6143a7e3fb5a4c4d19fd7382247100362fc06b21 |
   | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/10/testReport/ |
   | Max. process+thread count | 704 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
   | Console output | 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597917#comment-17597917
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1231789354

   @goiri Please help to review the code again, Thank you very much!




> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597519#comment-17597519
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1231126792

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 32s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 43s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m  2s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 52s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 15s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  21m 40s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 42s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 27s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 44s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  24m  1s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 122m 12s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 980b2cd3c46d 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 554719b7b2c672abcb099a589b288861b2e9de31 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/9/testReport/ |
   | Max. process+thread count | 675 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597472#comment-17597472
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r957943343


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1761,4 +1794,26 @@ public static  boolean isNullOrEmpty(Collection c) 
{
   public static  boolean isNullOrEmpty(Map c) {
 return (c == null || c.size() == 0);
   }
+
+  @VisibleForTesting
+  protected void cacheAllocatedContainersForSubClusterId(
+  List containers, SubClusterId subClusterId) {
+cacheAllocatedContainers(containers, subClusterId);
+  }
+
+  @VisibleForTesting
+  protected Map getContainerIdToSubClusterIdMap() {
+return containerIdToSubClusterIdMap;
+  }
+
+  private boolean isSCHealth(SubClusterId subClusterId) throws YarnException {
+boolean isSCHealth = true;
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo subClusterInfo = 
federationFacade.getSubCluster(subClusterId);
+if (timeOutScs.contains(subClusterId) ||
+ subClusterInfo == null || subClusterInfo.getState().isUnusable()) {
+  isSCHealth = false;

Review Comment:
   Thanks for your suggestion, I will modify the code.





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597471#comment-17597471
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r957943189


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1761,4 +1794,26 @@ public static  boolean isNullOrEmpty(Collection c) 
{
   public static  boolean isNullOrEmpty(Map c) {
 return (c == null || c.size() == 0);
   }
+
+  @VisibleForTesting
+  protected void cacheAllocatedContainersForSubClusterId(
+  List containers, SubClusterId subClusterId) {
+cacheAllocatedContainers(containers, subClusterId);
+  }
+
+  @VisibleForTesting
+  protected Map getContainerIdToSubClusterIdMap() {
+return containerIdToSubClusterIdMap;
+  }
+
+  private boolean isSCHealth(SubClusterId subClusterId) throws YarnException {
+boolean isSCHealth = true;
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo subClusterInfo = 
federationFacade.getSubCluster(subClusterId);
+if (timeOutScs.contains(subClusterId) ||
+ subClusterInfo == null || subClusterInfo.getState().isUnusable()) {

Review Comment:
   I will fix it.





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597280#comment-17597280
 ] 

ASF GitHub Bot commented on YARN-6667:
--

goiri commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r957544974


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1761,4 +1794,26 @@ public static  boolean isNullOrEmpty(Collection c) 
{
   public static  boolean isNullOrEmpty(Map c) {
 return (c == null || c.size() == 0);
   }
+
+  @VisibleForTesting
+  protected void cacheAllocatedContainersForSubClusterId(
+  List containers, SubClusterId subClusterId) {
+cacheAllocatedContainers(containers, subClusterId);
+  }
+
+  @VisibleForTesting
+  protected Map getContainerIdToSubClusterIdMap() {
+return containerIdToSubClusterIdMap;
+  }
+
+  private boolean isSCHealth(SubClusterId subClusterId) throws YarnException {
+boolean isSCHealth = true;
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo subClusterInfo = 
federationFacade.getSubCluster(subClusterId);
+if (timeOutScs.contains(subClusterId) ||
+ subClusterInfo == null || subClusterInfo.getState().isUnusable()) {
+  isSCHealth = false;

Review Comment:
   I would just do return true and return false at the end.



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1761,4 +1794,26 @@ public static  boolean isNullOrEmpty(Collection c) 
{
   public static  boolean isNullOrEmpty(Map c) {
 return (c == null || c.size() == 0);
   }
+
+  @VisibleForTesting
+  protected void cacheAllocatedContainersForSubClusterId(
+  List containers, SubClusterId subClusterId) {
+cacheAllocatedContainers(containers, subClusterId);
+  }
+
+  @VisibleForTesting
+  protected Map getContainerIdToSubClusterIdMap() {
+return containerIdToSubClusterIdMap;
+  }
+
+  private boolean isSCHealth(SubClusterId subClusterId) throws YarnException {
+boolean isSCHealth = true;
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo subClusterInfo = 
federationFacade.getSubCluster(subClusterId);
+if (timeOutScs.contains(subClusterId) ||
+ subClusterInfo == null || subClusterInfo.getState().isUnusable()) {

Review Comment:
   Indentation





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597238#comment-17597238
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1230413309

   @goiri Please help to review the code again, Thank you very much!




> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17593539#comment-17593539
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1229479256

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 35s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 38s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 40s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 57s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  3s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 44s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m  1s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  21m 24s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 35s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  24m 10s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 41s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 121m 37s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/8/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 7241a7730551 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 72bd079b22720f1ab29c168a9f8b752d95ff74cf |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/8/testReport/ |
   | Max. process+thread count | 556 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17587917#comment-17587917
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1229442703

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 36s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 47s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 38s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 44s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 53s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 54s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 44s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 58s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  22m 20s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 21s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/7/artifact/out/blanks-eol.txt)
 |  The patch has 2 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | -1 :x: |  spotbugs  |   1m 36s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/7/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  22m 49s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 58s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 124m 34s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
   |  |  Exception is caught when Exception is not thrown in 
org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor.cacheAllocatedContainers(List,
 SubClusterId)  At FederationInterceptor.java:is not thrown in 
org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor.cacheAllocatedContainers(List,
 SubClusterId)  At FederationInterceptor.java:[line 1537] |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17587406#comment-17587406
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1229440178

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 35s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 46s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 45s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 57s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 54s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 53s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 41s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  22m  1s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 24s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/6/artifact/out/blanks-eol.txt)
 |  The patch has 2 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 30s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m  8s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 47s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 122m 38s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux a8ddd9faf984 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / adeed0988de46d62e740cd483073ebfd7ee339d4 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/6/testReport/ |
   | Max. process+thread count | 738 (vs. ulimit of 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586176#comment-17586176
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r956701525


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1545,18 +1546,66 @@ private void cacheAllocatedContainers(List 
containers,
   + " from same sub-cluster: {}, so ignoring.",
   container.getId(), subClusterId);
 } else {
+
+  LOG.info("Duplicate containerID found in the allocated containers. " 
+
+  "try to re-pick the sub-cluster.");
+
   // The same container allocation from different sub-clusters,
   // something is wrong.
-  // TODO: YARN-6667 if some subcluster RM is configured wrong, we
-  // should not fail the entire heartbeat.
-  throw new YarnRuntimeException(
-  "Duplicate containerID found in the allocated containers. This"
-  + " can happen if the RM epoch is not configured properly."
-  + " ContainerId: " + container.getId().toString()
-  + " ApplicationId: " + this.attemptId + " From RM: "
-  + subClusterId
-  + " . Previous container was from sub-cluster: "
-  + existingSubClusterId);
+  try {
+
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo existingSubCluster =
+federationFacade.getSubCluster(existingSubClusterId);
+SubClusterInfo newSubCluster = 
federationFacade.getSubCluster(subClusterId);
+
+boolean existAllocatedScHealth = true;
+boolean newAllocatedScHealth = true;
+
+// Previous SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+existingSubCluster == null || 
existingSubCluster.getState().isUnusable()) {
+  existAllocatedScHealth = false;
+}
+
+// New SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+newSubCluster == null || 
newSubCluster.getState().isUnusable()) {
+  newAllocatedScHealth = false;
+}
+
+// If the previous RM which allocated Container is normal,
+// the previous RM will be used first
+if ((existAllocatedScHealth && !newAllocatedScHealth) ||

Review Comment:
   Thanks for your suggestion, your suggestion is correct, I will modify the 
code.





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586175#comment-17586175
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r956701424


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1545,18 +1546,66 @@ private void cacheAllocatedContainers(List 
containers,
   + " from same sub-cluster: {}, so ignoring.",
   container.getId(), subClusterId);
 } else {
+
+  LOG.info("Duplicate containerID found in the allocated containers. " 
+
+  "try to re-pick the sub-cluster.");
+
   // The same container allocation from different sub-clusters,
   // something is wrong.
-  // TODO: YARN-6667 if some subcluster RM is configured wrong, we
-  // should not fail the entire heartbeat.
-  throw new YarnRuntimeException(
-  "Duplicate containerID found in the allocated containers. This"
-  + " can happen if the RM epoch is not configured properly."
-  + " ContainerId: " + container.getId().toString()
-  + " ApplicationId: " + this.attemptId + " From RM: "
-  + subClusterId
-  + " . Previous container was from sub-cluster: "
-  + existingSubClusterId);
+  try {
+
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo existingSubCluster =
+federationFacade.getSubCluster(existingSubClusterId);
+SubClusterInfo newSubCluster = 
federationFacade.getSubCluster(subClusterId);
+
+boolean existAllocatedScHealth = true;
+boolean newAllocatedScHealth = true;
+
+// Previous SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+existingSubCluster == null || 
existingSubCluster.getState().isUnusable()) {
+  existAllocatedScHealth = false;
+}
+
+// New SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+newSubCluster == null || 
newSubCluster.getState().isUnusable()) {
+  newAllocatedScHealth = false;
+}
+
+// If the previous RM which allocated Container is normal,
+// the previous RM will be used first
+if ((existAllocatedScHealth && !newAllocatedScHealth) ||
+(existAllocatedScHealth && newAllocatedScHealth)) {
+  LOG.info("Use Previous Allocated Container's subCluster. " +
+  "ContainerId: {} ApplicationId: {} From RM: {}.", 
this.attemptId,
+  container.getId(), existingSubClusterId);
+  continue;

Review Comment:
   At the end of the for loop, there is a piece of code like this
   
   ```
   this.containerIdToSubClusterIdMap.put(container.getId(), subClusterId);
   ```
   If a container uses the previous subcluster, it does not need to execute 
this line, so use continue, jump out of the loop, and let the for loop continue 
to execute the next round.
   
   This code may not be easy to read, I will refactor this part of the code.





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586125#comment-17586125
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r956643212


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1545,18 +1546,66 @@ private void cacheAllocatedContainers(List 
containers,
   + " from same sub-cluster: {}, so ignoring.",
   container.getId(), subClusterId);
 } else {
+
+  LOG.info("Duplicate containerID found in the allocated containers. " 
+
+  "try to re-pick the sub-cluster.");
+
   // The same container allocation from different sub-clusters,
   // something is wrong.
-  // TODO: YARN-6667 if some subcluster RM is configured wrong, we
-  // should not fail the entire heartbeat.
-  throw new YarnRuntimeException(
-  "Duplicate containerID found in the allocated containers. This"
-  + " can happen if the RM epoch is not configured properly."
-  + " ContainerId: " + container.getId().toString()
-  + " ApplicationId: " + this.attemptId + " From RM: "
-  + subClusterId
-  + " . Previous container was from sub-cluster: "
-  + existingSubClusterId);
+  try {
+
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo existingSubCluster =
+federationFacade.getSubCluster(existingSubClusterId);
+SubClusterInfo newSubCluster = 
federationFacade.getSubCluster(subClusterId);
+
+boolean existAllocatedScHealth = true;
+boolean newAllocatedScHealth = true;
+
+// Previous SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||

Review Comment:
   Thank you very much for helping to review the code, I will modify the code.





> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586094#comment-17586094
 ] 

ASF GitHub Bot commented on YARN-6667:
--

goiri commented on code in PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#discussion_r956603434


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1545,18 +1546,66 @@ private void cacheAllocatedContainers(List 
containers,
   + " from same sub-cluster: {}, so ignoring.",
   container.getId(), subClusterId);
 } else {
+
+  LOG.info("Duplicate containerID found in the allocated containers. " 
+
+  "try to re-pick the sub-cluster.");
+
   // The same container allocation from different sub-clusters,
   // something is wrong.
-  // TODO: YARN-6667 if some subcluster RM is configured wrong, we
-  // should not fail the entire heartbeat.
-  throw new YarnRuntimeException(
-  "Duplicate containerID found in the allocated containers. This"
-  + " can happen if the RM epoch is not configured properly."
-  + " ContainerId: " + container.getId().toString()
-  + " ApplicationId: " + this.attemptId + " From RM: "
-  + subClusterId
-  + " . Previous container was from sub-cluster: "
-  + existingSubClusterId);
+  try {
+
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo existingSubCluster =
+federationFacade.getSubCluster(existingSubClusterId);
+SubClusterInfo newSubCluster = 
federationFacade.getSubCluster(subClusterId);
+
+boolean existAllocatedScHealth = true;
+boolean newAllocatedScHealth = true;
+
+// Previous SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+existingSubCluster == null || 
existingSubCluster.getState().isUnusable()) {
+  existAllocatedScHealth = false;
+}
+
+// New SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+newSubCluster == null || 
newSubCluster.getState().isUnusable()) {
+  newAllocatedScHealth = false;
+}
+
+// If the previous RM which allocated Container is normal,
+// the previous RM will be used first
+if ((existAllocatedScHealth && !newAllocatedScHealth) ||
+(existAllocatedScHealth && newAllocatedScHealth)) {
+  LOG.info("Use Previous Allocated Container's subCluster. " +
+  "ContainerId: {} ApplicationId: {} From RM: {}.", 
this.attemptId,
+  container.getId(), existingSubClusterId);
+  continue;

Review Comment:
   Why do we need a continue?



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
##
@@ -1545,18 +1546,66 @@ private void cacheAllocatedContainers(List 
containers,
   + " from same sub-cluster: {}, so ignoring.",
   container.getId(), subClusterId);
 } else {
+
+  LOG.info("Duplicate containerID found in the allocated containers. " 
+
+  "try to re-pick the sub-cluster.");
+
   // The same container allocation from different sub-clusters,
   // something is wrong.
-  // TODO: YARN-6667 if some subcluster RM is configured wrong, we
-  // should not fail the entire heartbeat.
-  throw new YarnRuntimeException(
-  "Duplicate containerID found in the allocated containers. This"
-  + " can happen if the RM epoch is not configured properly."
-  + " ContainerId: " + container.getId().toString()
-  + " ApplicationId: " + this.attemptId + " From RM: "
-  + subClusterId
-  + " . Previous container was from sub-cluster: "
-  + existingSubClusterId);
+  try {
+
+Set timeOutScs = getTimedOutSCs(true);
+SubClusterInfo existingSubCluster =
+federationFacade.getSubCluster(existingSubClusterId);
+SubClusterInfo newSubCluster = 
federationFacade.getSubCluster(subClusterId);
+
+boolean existAllocatedScHealth = true;
+boolean newAllocatedScHealth = true;
+
+// Previous SubCluster Time Out Or Unusable, Can't Continue to use.
+if (timeOutScs.contains(existingSubClusterId) ||
+existingSubCluster == null || 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17586076#comment-17586076
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1229211705

   @goiri Can you help review the code? Thank you very much!




> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585742#comment-17585742
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1229169962

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 50s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 56s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 27s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 53s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 121m 15s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 22ffe7e5a131 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a774a724ceeda2ee5b2fc4fe75f3bcfe12eabb1a |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/5/testReport/ |
   | Max. process+thread count | 728 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/5/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585572#comment-17585572
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1228763458

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 35s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 28s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 40s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m  3s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  4s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 51s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m  6s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 27s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 36s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 53s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 44s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 121m 58s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux efb97872a5ae 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 979a84ef8df9020419d9d07c210127e3f70ad236 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/4/testReport/ |
   | Max. process+thread count | 698 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/4/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585461#comment-17585461
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1228645254

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 51s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 44s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m  3s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  4s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 15s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 31s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 32s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/3/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 43s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 56s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 123m  7s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux eb8e29343e7c 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / fe48d3fac4fcb6d6b6d148ac15d2a2372ea6d812 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/3/testReport/ |
   | 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585460#comment-17585460
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1228645189

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  39m  0s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 39s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 40s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 49s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 58s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 52s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 59s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 41s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 31s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 45s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 38s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m  6s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 52s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 48s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 124m 27s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 3fd67a96ffda 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / fe48d3fac4fcb6d6b6d148ac15d2a2372ea6d812 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/2/testReport/ |
   | 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585395#comment-17585395
 ] 

ASF GitHub Bot commented on YARN-6667:
--

hadoop-yetus commented on PR #4810:
URL: https://github.com/apache/hadoop/pull/4810#issuecomment-1228531678

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 36s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 30s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 56s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 56s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m  8s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 42s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 28s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 42s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 35s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 35s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 15s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m 58s |  |  hadoop-yarn-server-nodemanager 
in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 122m 37s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4810 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 9ac942544aec 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 27cb24fde3702f641ee3d9cd6c58c14526396342 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4810/1/testReport/ |
   | 

[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585325#comment-17585325
 ] 

ASF GitHub Bot commented on YARN-6667:
--

slfan1989 opened a new pull request, #4810:
URL: https://github.com/apache/hadoop/pull/4810

   JIRA: YARN-6667. Handle containerId duplicate without failing the heartbeat 
in Federation Interceptor.




> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6667) Handle containerId duplicate without failing the heartbeat in Federation Interceptor

2022-08-25 Thread fanshilun (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585100#comment-17585100
 ] 

fanshilun commented on YARN-6667:
-

I will continue to follow up on this pr.

> Handle containerId duplicate without failing the heartbeat in Federation 
> Interceptor
> 
>
> Key: YARN-6667
> URL: https://issues.apache.org/jira/browse/YARN-6667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>
> From the actual situation, the probability of this happening is very low. 
> It can only be caused by the master-slave fail-hover of YARN and the wrong 
> Epoch parameter configuration.
> We will try to be compatible with this situation and let the Application run 
> as much as possible, using the following measures:
> 1. Select a node whose heartbeat does not time out for allocation, and at the 
> same time require the node to be in the RUNNING state.
> 2. If the heartbeat of both RMs does not time out, and both are in the 
> RUNNING state, select the previously allocated RM for Container processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org