[jira] [Comment Edited] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-28 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694391#comment-17694391
 ] 

Xiping Zhang edited comment on YARN-11447 at 2/28/23 9:00 AM:
--

After analyzing the two logs in the attachment, my conclusion is as follows. At
2023-02-20 14:03:26 the StandByTransitionThread was started because ZooKeeper was
faulty. During the switchover the active services were initialized, and the
active-service init process instantiated a new StandByTransitionRunnable. However,
a fatal error then occurred while the StandByTransitionThread-EventThread was
working: an RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused
by a failure to refresh the configuration settings, and the RM was switched to
standby. At that point the hasAlreadyRun AtomicBoolean of the just-instantiated
StandByTransitionRunnable had already been set to true (the two switchovers in a
row left the flag in the wrong state). At 2023-02-23 01:05:14 an RMFatalEvent of
type STATE_STORE_FENCED occurred: the state store went from ACTIVE to FENCED and
stopped serving requests. The StandByTransitionRunnable was invoked again, but
because hasAlreadyRun was already true it did nothing, so the RM stayed active the
whole time without switching to standby. With the state store out of service, all
applications were stuck in the phase of persisting to ZooKeeper.
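
To make the failure mode concrete, here is a minimal sketch of the single-shot
guard pattern described above. It is not the actual ResourceManager code: only the
names StandByTransitionRunnable and hasAlreadyRun are taken from the analysis, and
everything else is simplified for illustration. Once the flag has been flipped to
true by an earlier switchover, every later invocation returns immediately, so a
later fatal event such as STATE_STORE_FENCED can no longer drive the RM to standby.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified illustration of the single-shot guard described in the analysis above.
class StandByTransitionRunnable implements Runnable {
  // Ensures the transition to standby is attempted at most once per instance.
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

  @Override
  public void run() {
    // The first caller flips false -> true and proceeds; every later caller bails out,
    // which is only safe if a fresh instance is registered before the next fatal event.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }
    transitionToStandby();
  }

  private void transitionToStandby() {
    System.out.println("transitioning RM to standby");
  }
}
{code}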


was (Author: zhangxiping):
After analyzing the two logs in the attachment, my conclusion is as follows. On
2023-02-20 the StandByTransitionThread was started because ZooKeeper was faulty.
During the switchover the active services were initialized, and the active-service
init process instantiated a new StandByTransitionRunnable. However, a fatal error
then occurred while the StandByTransitionThread-EventThread was working: an
RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused by a failure
to refresh the configuration settings, and the RM was switched to standby. At that
point the hasAlreadyRun AtomicBoolean of the just-instantiated
StandByTransitionRunnable had already been set to true (the two switchovers in a
row left the flag in the wrong state). At 2023-02-23 01:05:14 an RMFatalEvent of
type STATE_STORE_FENCED occurred: the state store went from ACTIVE to FENCED and
stopped serving requests. The StandByTransitionRunnable was invoked again, but
because hasAlreadyRun was already true it did nothing, so the RM stayed active the
whole time without switching to standby. With the state store out of service, all
applications were stuck in the phase of persisting to ZooKeeper.

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: image-2023-02-28-15-58-56-377.png, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Comment Edited] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-28 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694396#comment-17694396
 ] 

Xiping Zhang edited comment on YARN-11447 at 2/28/23 7:59 AM:
--

!image-2023-02-28-15-58-56-377.png!


was (Author: zhangxiping):
!image-2023-02-28-15-56-57-552.png!

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: image-2023-02-28-15-56-57-552.png, 
> image-2023-02-28-15-58-56-377.png, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Updated] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-28 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11447:

Attachment: (was: image-2023-02-28-15-56-57-552.png)

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: image-2023-02-28-15-58-56-377.png, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Commented] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-27 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694396#comment-17694396
 ] 

Xiping Zhang commented on YARN-11447:
-

!image-2023-02-28-15-56-57-552.png!

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: image-2023-02-28-15-56-57-552.png, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Updated] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-27 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11447:

Attachment: image-2023-02-28-15-56-57-552.png

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: image-2023-02-28-15-56-57-552.png, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Commented] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-27 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694391#comment-17694391
 ] 

Xiping Zhang commented on YARN-11447:
-

After analyzing the two logs in the attachment, my conclusion is as follows. On
2023-02-20 the StandByTransitionThread was started because ZooKeeper was faulty.
During the switchover the active services were initialized, and the active-service
init process instantiated a new StandByTransitionRunnable. However, a fatal error
then occurred while the StandByTransitionThread-EventThread was working: an
RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused by a failure
to refresh the configuration settings, and the RM was switched to standby. At that
point the hasAlreadyRun AtomicBoolean of the just-instantiated
StandByTransitionRunnable had already been set to true (the two switchovers in a
row left the flag in the wrong state). At 2023-02-23 01:05:14 an RMFatalEvent of
type STATE_STORE_FENCED occurred: the state store went from ACTIVE to FENCED and
stopped serving requests. The StandByTransitionRunnable was invoked again, but
because hasAlreadyRun was already true it did nothing, so the RM stayed active the
whole time without switching to standby. With the state store out of service, all
applications were stuck in the phase of persisting to ZooKeeper.

> A bug occurs during the active/standby switchover of RM, causing RM to work 
> abnormally
> --
>
> Key: YARN-11447
> URL: https://issues.apache.org/jira/browse/YARN-11447
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: hadoop 2.9.2 
>Reporter: Xiping Zhang
>Priority: Critical
> Attachments: yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
> yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2
>
>
>  
>  
> 2023-02-23 01:05:14  All applications are in the NEW_SAVING state。
> !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!
>  






[jira] [Created] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally

2023-02-27 Thread Xiping Zhang (Jira)
Xiping Zhang created YARN-11447:
---

 Summary: A bug occurs during the active/standby switchover of RM, 
causing RM to work abnormally
 Key: YARN-11447
 URL: https://issues.apache.org/jira/browse/YARN-11447
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: hadoop 2.9.2 
Reporter: Xiping Zhang
 Attachments: yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, 
yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2

 

 

2023-02-23 01:05:14  All applications are in the NEW_SAVING state.

!http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png!

 






[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-14 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11107:

Attachment: (was: YARN-11107-branch-3.3.0.001.patch)

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-14 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11107:

Attachment: (was: YARN-11107-branch-2.9.2.001.patch)

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11107-branch-3.3.0.001.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-13 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521997#comment-17521997
 ] 

Xiping Zhang commented on YARN-11107:
-

[~bteke]  Yes, that also needs to be done.

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-11 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520382#comment-17520382
 ] 

Xiping Zhang commented on YARN-11107:
-

[~junping_du] [~hexiaoqiao]  Thank you for your help. OK, I will refer to this 
article and learn from it. :)

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-08 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519358#comment-17519358
 ] 

Xiping Zhang commented on YARN-11107:
-

BTW, I am new here and hope to contribute to Hadoop and improve myself. Is there 
any documentation about the workflow of the Hadoop community? Is there an offline 
WeChat or QQ exchange group you could invite me to? WeChat: zxp877758823. Thank 
you again!

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-07 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519349#comment-17519349
 ] 

Xiping Zhang commented on YARN-11107:
-

[~hexiaoqiao]  Thank you for your reply. I will submit a PR later.

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Reopened] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-07 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang reopened YARN-11107:
-

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-06 Thread Xiping Zhang (Jira)


[ https://issues.apache.org/jira/browse/YARN-11107 ]


Xiping Zhang deleted comment on YARN-11107:
-

was (Author: zhangxiping):
cc [~BilwaST]  [~tangzhankun]

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-06 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518555#comment-17518555
 ] 

Xiping Zhang commented on YARN-11107:
-

cc [~leosun08] [~linyiqun]  [~weichiu] [~hexiaoqiao] 

Could you help review this?

Thanks.

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-06 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518004#comment-17518004
 ] 

Xiping Zhang commented on YARN-11107:
-

cc [~BilwaST]  [~tangzhankun]

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Comment Edited] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-06 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517921#comment-17517921
 ] 

Xiping Zhang edited comment on YARN-11107 at 4/6/22 9:56 AM:
-

I think that when NodeLabel is enabled, the RM should take the application's label 
into account when it passes the number of NMs to the AM. When the number of 
blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM 
releases the NMs in the blacklist. In DefaultAMSProcessor.java:
{code:java}
final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor {
  ...
  public void allocate(ApplicationAttemptId appAttemptId,
      AllocateRequest request, AllocateResponse response) throws YarnException {
    ...
    // Consider here whether NodeLabel is enabled
    response.setNumClusterNodes(getScheduler().getNumClusterNodes());
    ...
  }
}
{code}
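
For illustration, the sketch below shows the kind of 33% blacklist-disable 
threshold this proposal is about. It is a self-contained, hypothetical example 
rather than the actual SimpleBlacklistManager or AM code (the class, method and 
variable names are made up): it only demonstrates why feeding the whole-cluster 
node count into the threshold keeps a small label partition blacklisted forever, 
while a label-aware count would release the blacklist.
{code:java}
// Hypothetical illustration of the blacklist-disable threshold discussed above.
final class BlacklistThresholdExample {
  // Fraction of nodes that may be blacklisted before the blacklist is ignored/released.
  private static final double DISABLE_THRESHOLD = 0.33;

  /** Returns true if the blacklist is still honoured for the given node count. */
  static boolean blacklistHonoured(int blacklistedNodes, int reportedNumClusterNodes) {
    return blacklistedNodes <= DISABLE_THRESHOLD * reportedNumClusterNodes;
  }

  public static void main(String[] args) {
    int blacklistedNodes = 20;   // every NM of a small label partition is blacklisted
    int labelNodes = 20;         // nodes carrying the application's label
    int clusterNodes = 1000;     // nodes in the whole cluster

    // Label-aware count: 20 > 0.33 * 20, so the blacklist would be released.
    System.out.println(blacklistHonoured(blacklistedNodes, labelNodes));   // false
    // Whole-cluster count: 20 <= 0.33 * 1000, so the 20 label NMs stay blacklisted
    // and the application can never be scheduled inside its partition.
    System.out.println(blacklistHonoured(blacklistedNodes, clusterNodes)); // true
  }
}
{code}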


was (Author: zhangxiping):
I think that when NodeLabel is enabled, the RM should take the application's label 
into account when it passes the number of NMs to the AM. When the number of 
blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM 
releases the NMs in the blacklist. In DefaultAMSProcessor.java:
{code:java}
final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor {
  ...
  public void allocate(ApplicationAttemptId appAttemptId,
      AllocateRequest request, AllocateResponse response) throws YarnException {
    ...
    // Consider here whether NodeLabel is enabled
    response.setNumClusterNodes(getScheduler().getNumClusterNodes());
    ...
  }
}
{code}

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly

2022-04-06 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11107:

Summary: When NodeLabel is enabled for a YARN cluster, AM blacklist program 
does not work properly  (was: When NodeLabel is enabled for a YARN cluster, the 
blacklist feature is abnormal)

> When NodeLabel is enabled for a YARN cluster, AM blacklist program does not 
> work properly
> -
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal

2022-04-06 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11107:

Attachment: YARN-11107-branch-3.3.0.001.patch

> When NodeLabel is enabled for a YARN cluster, the blacklist feature is 
> abnormal
> ---
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: YARN-11107-branch-2.9.2.001.patch, 
> YARN-11107-branch-3.3.0.001.patch
>
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal

2022-04-06 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517921#comment-17517921
 ] 

Xiping Zhang commented on YARN-11107:
-

I think that when NodeLabel is enabled, the RM should take the application's label 
into account when it passes the number of NMs to the AM. When the number of 
blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM 
releases the NMs in the blacklist. In DefaultAMSProcessor.java:
{code:java}
final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor {
  ...
  public void allocate(ApplicationAttemptId appAttemptId,
      AllocateRequest request, AllocateResponse response) throws YarnException {
    ...
    // Consider here whether NodeLabel is enabled
    response.setNumClusterNodes(getScheduler().getNumClusterNodes());
    ...
  }
}
{code}

> When NodeLabel is enabled for a YARN cluster, the blacklist feature is 
> abnormal
> ---
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal

2022-04-06 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-11107:

Description: YARN NodeLabel is enabled in the production environment. We 
encountered an application whose AM blacklisted all NMs corresponding to the label 
in the queue, so other applications in the queue could not obtain computing 
resources. We found that the RM printed a lot of logs: "Trying to fulfill 
reservation for application..."  (was: YARN NodeLabel is enabled in the 
production environment. During application running, an AM task blacklists all 
NMs corresponding to the label in the queue, and other applications in the queue 
cannot apply for computing resources. We found that the RM printed a lot of logs: 
"Trying to fulfill reservation for application...")

> When NodeLabel is enabled for a YARN cluster, the blacklist feature is 
> abnormal
> ---
>
> Key: YARN-11107
> URL: https://issues.apache.org/jira/browse/YARN-11107
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> Yarn NodeLabel is enabled in the production environment. We encountered a 
> application AM that blacklisted all NMS corresponding to the lable in the 
> queue, and other application in the queue cannot apply for computing 
> resources. We found that RM printed a lot of logs "Trying to fulfill 
> reservation for application..."






[jira] [Created] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal

2022-04-06 Thread Xiping Zhang (Jira)
Xiping Zhang created YARN-11107:
---

 Summary: When NodeLabel is enabled for a YARN cluster, the 
blacklist feature is abnormal
 Key: YARN-11107
 URL: https://issues.apache.org/jira/browse/YARN-11107
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.3.0, 2.9.2
Reporter: Xiping Zhang


YARN NodeLabel is enabled in the production environment. During application 
running, an AM blacklists all NMs corresponding to the label in the queue, and 
other applications in the queue cannot apply for computing resources. We found 
that the RM printed a lot of logs: "Trying to fulfill reservation for 
application..."






[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-25 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351459#comment-17351459
 ] 

Xiping Zhang commented on YARN-10781:
-

[~zhuqi]  

Yes, we have enabled rolling log aggregation, but that doesn't seem to be the 
problem. Because of the dynamic resource mechanism, a single long-running job may 
occupy one aggregation thread on every node of the cluster. If there are 100 such 
long-running jobs, all of the NM aggregation threads (100 per NM by default) 
across the cluster will be occupied. :(
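
As a self-contained illustration of that exhaustion (assuming, as above, a fixed 
pool of 100 aggregator threads per NM), the sketch below submits 100 never-ending 
tasks followed by one short task; the short task never runs because every pool 
thread is held by a long-running "application". All names are illustrative only.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: a fixed pool, like the NM log-aggregation pool, exhausted by
// long-running "applications" that never release their thread.
public class AggregatorPoolExhaustion {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(100);

    // 100 long-running streaming "applications": each one holds a pool thread forever.
    for (int i = 0; i < 100; i++) {
      pool.execute(() -> {
        try {
          Thread.sleep(Long.MAX_VALUE); // stand-in for "wait until the app finishes"
        } catch (InterruptedException ignored) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // A finished batch application whose logs now need uploading: it only ever queues.
    pool.execute(() -> System.out.println("uploading logs")); // never prints

    TimeUnit.SECONDS.sleep(5);
    pool.shutdownNow(); // interrupt the blocked tasks so the demo can exit
  }
}
{code}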

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: applications.png, containers.png, containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming applications, but these applications do not 
> have running Containers.When the offline application running on it finishes, 
> the log cannot be reported to HDFS. When we killed a large number of 
> SparkStreaming applications, we found that a large number of log files were 
> being created on the NN side, causing the read and write performance on the 
> NN side to degrade significantly.Causes the business application to time out。






[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-25 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-10781:

Description: We observed more than 100 applications running on one NM. Most of 
these applications are Spark Streaming applications, but they have no running 
Containers. When an offline application running on the node finishes, its log 
cannot be uploaded to HDFS. When we killed a large number of Spark Streaming 
applications, we found that a large number of log files were being created on the 
NN side, causing NN read and write performance to degrade significantly and the 
business applications to time out.  (was: We observed more than 100 applications 
running on one NM. Most of these applications are Spark Streaming tasks, but they 
have no running Containers. When an offline application running on the node 
finishes, its log cannot be uploaded to HDFS.)

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: applications.png, containers.png, containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming applications, but these applications do not 
> have running Containers.When the offline application running on it finishes, 
> the log cannot be reported to HDFS. When we killed a large number of 
> SparkStreaming applications, we found that a large number of log files were 
> being created on the NN side, causing the read and write performance on the 
> NN side to degrade significantly.Causes the business application to time out。






[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-24 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350276#comment-17350276
 ] 

Xiping Zhang commented on YARN-10781:
-

[~zhuqi] 

Thank you for your reply.

The NM instantiates LogAggregationService at startup. It owns a pool of threads; 
by default, 100 threads serve the log aggregation of all applications on that NM. 
When the AM first asks the NM to start a Container for an application, the 
LogAggregationService initApp method is called and a thread is assigned to handle 
that application's log aggregation.

Here is an attachment from an NM in our production environment, showing the 
Applications and Containers running on it. There are 49 applications running but 
only 14 Containers. Based on my reading of the NM code, at least 49 threads are 
busy with log aggregation, 35 of them for applications that have no Container.

 

!applications.png!

applications

!containers.png!
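
For reference, here is a rough sketch of how such a fixed-size aggregation pool 
could be created from configuration. The property name and the 100-thread default 
are taken from this discussion and should be treated as assumptions, not as the 
exact LogAggregationService code.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

// Sketch: size a fixed pool from configuration; one pool thread is then handed to
// each application's log aggregator when initApp()/initAppAggregator() runs.
public class AggregationPoolSetup {
  // Assumed property name and default, per the discussion above.
  static final String POOL_SIZE_KEY = "yarn.nodemanager.logaggregation.threadpool-size-max";
  static final int DEFAULT_POOL_SIZE = 100;

  static ExecutorService createAggregationPool(Configuration conf) {
    int size = conf.getInt(POOL_SIZE_KEY, DEFAULT_POOL_SIZE);
    return Executors.newFixedThreadPool(size);
  }
}
{code}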

 

 

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: applications.png, containers.png, containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers.When the offline application running on it finishes, the 
> log cannot be reported to HDFS.






[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-24 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-10781:

Attachment: containers.png

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: applications.png, containers.png, containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers.When the offline application running on it finishes, the 
> log cannot be reported to HDFS.






[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-24 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-10781:

Attachment: applications.png

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: applications.png, containers.png, containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers.When the offline application running on it finishes, the 
> log cannot be reported to HDFS.






[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-24 Thread Xiping Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiping Zhang updated YARN-10781:

Attachment: containers.png

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
> Attachments: containers.png
>
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers.When the offline application running on it finishes, the 
> log cannot be reported to HDFS.






[jira] [Comment Edited] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-24 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350226#comment-17350226
 ] 

Xiping Zhang edited comment on YARN-10781 at 5/24/21, 7:00 AM:
---

Aggregated logs are handled by the NM. When the NM initializes an Application, it 
allocates a thread to do the log aggregation for that application. Here is the 
code on the NM side.

 
{code:java}
// LogAggregationService.java

@SuppressWarnings("unchecked")
private void initApp(final ApplicationId appId, String user,
Credentials credentials, Map<ApplicationAccessType, String> appAcls,
LogAggregationContext logAggregationContext,
long recoveredLogInitedTime) {
  ApplicationEvent eventResponse;
  try {
initAppAggregator(appId, user, credentials, appAcls,
logAggregationContext, recoveredLogInitedTime);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_INITED);
  } catch (YarnRuntimeException e) {
LOG.warn("Application failed to init aggregation", e);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED);
  }
  this.dispatcher.getEventHandler().handle(eventResponse);
}
{code}
{code:java}
protected void initAppAggregator(final ApplicationId appId, String user,
Credentials credentials, Map<ApplicationAccessType, String> appAcls,
LogAggregationContext logAggregationContext,
long recoveredLogInitedTime) {
 ...

  final AppLogAggregator appLogAggregator =
  new AppLogAggregatorImpl(this.dispatcher, this.deletionService,
  getConfig(), appId, userUgi, this.nodeId, dirsHandler,
  logAggregationFileController.getRemoteNodeLogFileForApp(appId,
  user, nodeId), appAcls, logAggregationContext, this.context,
  getLocalFileContext(getConfig()), this.rollingMonitorInterval,
  recoveredLogInitedTime, logAggregationFileController);
  ...

  // Schedule the aggregator.
  Runnable aggregatorWrapper = new Runnable() {
public void run() {
  try {
appLogAggregator.run();
  } finally {
appLogAggregators.remove(appId);
closeFileSystems(userUgi);
  }
}
  };
  this.threadPool.execute(aggregatorWrapper);
  if (appDirException != null) {
throw appDirException;
  }
}
{code}
{code:java}
// AppLogAggregatorImpl.java

@Override
public void run() {
  try {
doAppLogAggregation();
  }
 ... 
}

{code}
 
{code:java}
//AppLogAggregatorImpl.java
private void doAppLogAggregation() throws LogAggregationDFSException {
  while (!this.appFinishing.get() && !this.aborted.get()) {
synchronized(this) {
  try {
waiting.set(true);
if (logControllerContext.isLogAggregationInRolling()) {
  wait(logControllerContext.getRollingMonitorInterval() * 1000);
  if (this.appFinishing.get() || this.aborted.get()) {
break;
  }
  uploadLogsForContainers(false);
} else {
  wait(THREAD_SLEEP_TIME);
}
  } catch (InterruptedException e) {
LOG.warn("PendingContainers queue is interrupted");
this.appFinishing.set(true);
  } catch (LogAggregationDFSException e) {
this.appFinishing.set(true);
throw e;
  }
}
  }

  if (this.aborted.get()) {
return;
  }

  try {
// App is finished, upload the container logs.
uploadLogsForContainers(true);

doAppLogAggregationPostCleanUp();
  } catch (LogAggregationDFSException e) {
LOG.error("Error during log aggregation", e);
  }

  this.dispatcher.getEventHandler().handle(
  new ApplicationEvent(this.appId,
  ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED));
  this.appAggregationFinished.set(true);
}

{code}
When handling the APPLICATION_STARTED event, the NM initializes the App and 
creates an AppLogAggregatorImpl to handle its log aggregation; a pool thread is 
allocated to run the aggregator's run method. Inside it is a loop that does not 
exit until the application is finished or aborted.

 

 

 

 

 

 


was (Author: zhangxiping):
Aggregated logs are handled by the NM. When the NM initializes an Application, it 
allocates a thread to do the log aggregation for that application. Here is the 
code on the NM side.

 

 

 
{code:java}
@SuppressWarnings("unchecked")
private void initApp(final ApplicationId appId, String user,
Credentials credentials, Map<ApplicationAccessType, String> appAcls,
LogAggregationContext logAggregationContext,
long recoveredLogInitedTime) {
  ApplicationEvent eventResponse;
  try {
initAppAggregator(appId, user, credentials, appAcls,
logAggregationContext, recoveredLogInitedTime);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_INITED);
  } catch (YarnRuntimeException e) {
LOG.warn("Application failed to init aggregation", e);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED);
  }
  this.dispatcher.

[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-23 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350226#comment-17350226
 ] 

Xiping Zhang commented on YARN-10781:
-

Aggregated logs are handled by the NM. When the NM initializes an Application, it 
allocates a thread to do the log aggregation for that application. Here is the 
code on the NM side.

 

 

 
{code:java}
@SuppressWarnings("unchecked")
private void initApp(final ApplicationId appId, String user,
Credentials credentials, Map<ApplicationAccessType, String> appAcls,
LogAggregationContext logAggregationContext,
long recoveredLogInitedTime) {
  ApplicationEvent eventResponse;
  try {
initAppAggregator(appId, user, credentials, appAcls,
logAggregationContext, recoveredLogInitedTime);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_INITED);
  } catch (YarnRuntimeException e) {
LOG.warn("Application failed to init aggregation", e);
eventResponse = new ApplicationEvent(appId,
ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED);
  }
  this.dispatcher.getEventHandler().handle(eventResponse);
}
{code}
{code:java}
protected void initAppAggregator(final ApplicationId appId, String user,
Credentials credentials, Map<ApplicationAccessType, String> appAcls,
LogAggregationContext logAggregationContext,
long recoveredLogInitedTime) {
 ...

  final AppLogAggregator appLogAggregator =
  new AppLogAggregatorImpl(this.dispatcher, this.deletionService,
  getConfig(), appId, userUgi, this.nodeId, dirsHandler,
  logAggregationFileController.getRemoteNodeLogFileForApp(appId,
  user, nodeId), appAcls, logAggregationContext, this.context,
  getLocalFileContext(getConfig()), this.rollingMonitorInterval,
  recoveredLogInitedTime, logAggregationFileController);
  ...

  // Schedule the aggregator.
  Runnable aggregatorWrapper = new Runnable() {
public void run() {
  try {
appLogAggregator.run();
  } finally {
appLogAggregators.remove(appId);
closeFileSystems(userUgi);
  }
}
  };
  this.threadPool.execute(aggregatorWrapper);
  if (appDirException != null) {
throw appDirException;
  }
}
{code}
{code:java}
// AppLogAggregatorImpl.java

@Override
public void run() {
  try {
doAppLogAggregation();
  }
 ... 
}

{code}
 
{code:java}
//
private void doAppLogAggregation() throws LogAggregationDFSException {
  while (!this.appFinishing.get() && !this.aborted.get()) {
synchronized(this) {
  try {
waiting.set(true);
if (logControllerContext.isLogAggregationInRolling()) {
  wait(logControllerContext.getRollingMonitorInterval() * 1000);
  if (this.appFinishing.get() || this.aborted.get()) {
break;
  }
  uploadLogsForContainers(false);
} else {
  wait(THREAD_SLEEP_TIME);
}
  } catch (InterruptedException e) {
LOG.warn("PendingContainers queue is interrupted");
this.appFinishing.set(true);
  } catch (LogAggregationDFSException e) {
this.appFinishing.set(true);
throw e;
  }
}
  }

  if (this.aborted.get()) {
return;
  }

  try {
// App is finished, upload the container logs.
uploadLogsForContainers(true);

doAppLogAggregationPostCleanUp();
  } catch (LogAggregationDFSException e) {
LOG.error("Error during log aggregation", e);
  }

  this.dispatcher.getEventHandler().handle(
  new ApplicationEvent(this.appId,
  ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED));
  this.appAggregationFinished.set(true);
}

{code}
When handling the APPLICATION_STARTED event, the NM initializes the App and 
creates an AppLogAggregatorImpl to handle its log aggregation; a pool thread is 
allocated to run the aggregator's run method. Inside it is a loop that does not 
exit until the application is finished or aborted.

 

 

 

 

 

 

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> We observed more than 100 applications running on one NM.Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers.When the offline application running on it finishes, the 
> log cannot be reported to HDFS.






[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-23 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350214#comment-17350214
 ] 

Xiping Zhang commented on YARN-10781:
-

Sorry for the late reply.
{code:java}
// ExecutorAllocationManager.scala

/**
 * Register for scheduler callbacks to decide when to add and remove executors, 
and start
 * the scheduling task.
 */
def start(): Unit = {
  listenerBus.addToManagementQueue(listener)

  val scheduleTask = new Runnable() {
override def run(): Unit = {
  try {
schedule()
  } catch {
case ct: ControlThrowable =>
  throw ct
case t: Throwable =>
  logWarning(s"Uncaught exception in thread 
${Thread.currentThread().getName}", t)
  }
}
  }
  executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, 
TimeUnit.MILLISECONDS)

  client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, 
hostToLocalTaskCount)
}
{code}
The schedule method is executed periodically.

 

 
{code:java}
/**
 * This is called at a fixed interval to regulate the number of pending 
executor requests
 * and number of executors running.
 *
 * First, adjust our requested executors based on the add time and our current 
needs.
 * Then, if the remove time for an existing executor has expired, kill the 
executor.
 *
 * This is factored out into its own method for testing.
 */
private def schedule(): Unit = synchronized {
  val now = clock.getTimeMillis

  val executorIdsToBeRemoved = ArrayBuffer[String]()
  removeTimes.retain { case (executorId, expireTime) =>
val expired = now >= expireTime
if (expired) {
  initializing = false
  executorIdsToBeRemoved += executorId
}
!expired
  }
  // Update executor target number only after initializing flag is unset
  updateAndSyncNumExecutorsTarget(now)
  if (executorIdsToBeRemoved.nonEmpty) {
removeExecutors(executorIdsToBeRemoved)
  }
}


{code}
This collects the executors whose remove time has expired into 
executorIdsToBeRemoved and then removes them via removeExecutors.

 
{code:java}
/**
 * Request the cluster manager to remove the given executors.
 * Returns the list of executors which are removed.
 */
private def removeExecutors(executors: Seq[String]): Seq[String] = synchronized {
  val executorIdsToBeRemoved = new ArrayBuffer[String]

  logInfo("Request to remove executorIds: " + executors.mkString(", "))
  val numExistingExecutors = allocationManager.executorIds.size - executorsPendingToRemove.size

  var newExecutorTotal = numExistingExecutors
  executors.foreach { executorIdToBeRemoved =>
    if (newExecutorTotal - 1 < minNumExecutors) {
      logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
        s"$newExecutorTotal executor(s) left (minimum number of executor limit $minNumExecutors)")
    } else if (newExecutorTotal - 1 < numExecutorsTarget) {
      logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
        s"$newExecutorTotal executor(s) left (number of executor target $numExecutorsTarget)")
    } else if (canBeKilled(executorIdToBeRemoved)) {
      executorIdsToBeRemoved += executorIdToBeRemoved
      newExecutorTotal -= 1
    }
  }

  if (executorIdsToBeRemoved.isEmpty) {
    return Seq.empty[String]
  }

  // Send a request to the backend to kill this executor(s)
  val executorsRemoved = if (testing) {
    executorIdsToBeRemoved
  } else {
    // We don't want to change our target number of executors, because we already did that
    // when the task backlog decreased.
    client.killExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false,
      countFailures = false, force = false)
  }
  // [SPARK-21834] killExecutors api reduces the target number of executors.
  // So we need to update the target with desired value.
  client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
  // reset the newExecutorTotal to the existing number of executors
  newExecutorTotal = numExistingExecutors
  if (testing || executorsRemoved.nonEmpty) {
    executorsRemoved.foreach { removedExecutorId =>
      // If it is a cached block, it uses cachedExecutorIdleTimeoutS for timeout
      val idleTimeout = if (blockManagerMaster.hasCachedBlocks(removedExecutorId)) {
        cachedExecutorIdleTimeoutS
      } else {
        executorIdleTimeoutS
      }
      newExecutorTotal -= 1
      logInfo(s"Removing executor $removedExecutorId because it has been idle for " +
        s"$idleTimeout seconds (new desired total will be $newExecutorTotal)")
      executorsPendingToRemove.add(removedExecutorId)
    }
    executorsRemoved
  } else {
    logWarning(s"Unable to reach the cluster manager to kill executor/s " +
      s"${executorIdsToBeRemoved.mkString(",")} or no executor eligible to kill!")
    Seq.empty[String]
  }
}
{code}
 
> The Thread of the NM aggregate log is exhausted and no other Application can 
> agg

[jira] [Comment Edited] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-20 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348991#comment-17348991
 ] 

Xiping Zhang edited comment on YARN-10781 at 5/21/21, 6:02 AM:
---

When the NM accepts an Application, it initializes an AppLogAggregatorImpl internally 
and submits it to a thread pool with a default size of 100. Each thread is responsible 
for uploading the logs of one Application until that Application finishes or is aborted.
 A maximum of 100 running applications can therefore have their logs aggregated 
simultaneously. Because of Spark Streaming's dynamic resource allocation, it is possible 
that such threads on the NM cannot exit even though no Container is running on the node. 
Increasing the number of core threads in the thread pool is possible, but it is not a 
good solution: as the number of Spark Streaming applications grows, more and more 
threads stay occupied. Please correct me if there is any problem with my understanding.
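
As a hypothetical illustration of that limit (again not the actual NM code; all names 
are made up), the demo below keeps a small fixed pool busy with long-lived tasks so that 
a later task never gets to run. On the NM the pool defaults to 100 threads, which I 
believe is controlled by yarn.nodemanager.logaggregation.threadpool-size-max.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LogAggregationPoolDemo {
  public static void main(String[] args) throws InterruptedException {
    // 3 instead of 100 just to keep the demo fast; the behaviour is the same.
    final int poolSize = 3;
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    CountDownLatch streamingAppsFinished = new CountDownLatch(1);

    // Long-running applications (e.g. Spark Streaming) occupy every worker thread.
    for (int i = 0; i < poolSize; i++) {
      final int appId = i;
      pool.submit(() -> {
        System.out.println("aggregator for streaming app " + appId + " started");
        try {
          streamingAppsFinished.await();   // never counted down in this demo
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // A finished offline application: its aggregator is queued but never runs.
    pool.submit(() -> System.out.println("uploading logs of the offline app"));

    TimeUnit.SECONDS.sleep(2);
    System.out.println("offline app logs still not uploaded (task queued behind busy workers)");
    pool.shutdownNow();
  }
}
{code}
With the real pool the symptom is the same: once 100 long-lived applications hold all 
the worker threads, the aggregator of a finished offline application stays queued and 
its logs never reach HDFS.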


was (Author: zhangxiping):
When the NM accepts an Application container, it initializes an AppLogAggregatorImpl 
internally and submits it to a thread pool with a default size of 100. Each thread is 
responsible for uploading the logs of one Application until that Application finishes or 
is aborted.
A maximum of 100 running applications can therefore have their logs aggregated 
simultaneously. Because of Spark Streaming's dynamic resource allocation, it is possible 
that such threads on the NM cannot exit even though no Container is running on the node. 
Increasing the number of core threads in the thread pool is possible, but it is not a 
good solution: as the number of Spark Streaming applications grows, more and more 
threads stay occupied. Please correct me if there is any problem with my understanding.

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> We observed more than 100 applications running on one NM. Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers. When the offline application running on it finishes, the 
> log cannot be reported to HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-20 Thread Xiping Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348991#comment-17348991
 ] 

Xiping Zhang commented on YARN-10781:
-

When the NM accepts an Application container, it initializes an AppLogAggregatorImpl 
internally and submits it to a thread pool with a default size of 100. Each thread is 
responsible for uploading the logs of one Application until that Application finishes or 
is aborted.
A maximum of 100 running applications can therefore have their logs aggregated 
simultaneously. Because of Spark Streaming's dynamic resource allocation, it is possible 
that such threads on the NM cannot exit even though no Container is running on the node. 
Increasing the number of core threads in the thread pool is possible, but it is not a 
good solution: as the number of Spark Streaming applications grows, more and more 
threads stay occupied. Please correct me if there is any problem with my understanding.

> The Thread of the NM aggregate log is exhausted and no other Application can 
> aggregate the log
> --
>
> Key: YARN-10781
> URL: https://issues.apache.org/jira/browse/YARN-10781
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.2, 3.3.0
>Reporter: Xiping Zhang
>Priority: Major
>
> We observed more than 100 applications running on one NM. Most of these 
> applications are SparkStreaming tasks, but these applications do not have 
> running Containers. When the offline application running on it finishes, the 
> log cannot be reported to HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log

2021-05-20 Thread Xiping Zhang (Jira)
Xiping Zhang created YARN-10781:
---

 Summary: The Thread of the NM aggregate log is exhausted and no 
other Application can aggregate the log
 Key: YARN-10781
 URL: https://issues.apache.org/jira/browse/YARN-10781
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.0, 2.9.2
Reporter: Xiping Zhang


We observed more than 100 applications running on one NM. Most of these 
applications are SparkStreaming tasks, but these applications do not have 
running Containers. When the offline application running on it finishes, the log 
cannot be reported to HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org