[jira] [Comment Edited] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694391#comment-17694391 ] Xiping Zhang edited comment on YARN-11447 at 2/28/23 9:00 AM: -- After analyzing the two logs in the attachment, my conclusion is: at 2023-02-20 14:03:26 the StandByTransitionThread was started because ZooKeeper was faulty. During the switchover the active services were initialized, and that init process instantiated a new StandByTransitionRunnable. However, a fatal error occurred while the StandByTransitionThread-EventThread was working: an RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused by a failure to refresh the configuration settings, and the StandByTransitionThread switched the RM to standby. By that point the newly instantiated StandByTransitionRunnable's AtomicBoolean hasAlreadyRun flag had already been set to true (the two back-to-back switchovers left the flag in the wrong state). At 2023-02-23 01:05:14, an RMFatalEvent of type STATE_STORE_FENCED occurred; the state-store service went from active to fenced and stopped serving requests. The StandByTransitionRunnable was invoked again, but because its hasAlreadyRun flag was already true, the RM stayed active and never switched to standby. With the state store out of service, all applications got stuck in the persist-to-ZooKeeper phase. was (Author: zhangxiping): After analyzing the two logs in the attachment, my conclusion is: on 2023-02-20 the StandByTransitionThread was started because ZooKeeper was faulty. During the switchover the active services were initialized, and that init process instantiated a new StandByTransitionRunnable. However, a fatal error occurred while the StandByTransitionThread-EventThread was working: an RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused by a failure to refresh the configuration settings, and the StandByTransitionThread switched the RM to standby. By that point the newly instantiated StandByTransitionRunnable's AtomicBoolean hasAlreadyRun flag had already been set to true (the two back-to-back switchovers left the flag in the wrong state). At 2023-02-23 01:05:14, an RMFatalEvent of type STATE_STORE_FENCED occurred; the state-store service went from active to fenced and stopped serving requests. The StandByTransitionRunnable was invoked again, but because its hasAlreadyRun flag was already true, the RM stayed active and never switched to standby. With the state store out of service, all applications got stuck in the persist-to-ZooKeeper phase. > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: image-2023-02-28-15-58-56-377.png, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
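To make the failure mode described in the comment above concrete, the sketch below models the guard pattern the analysis refers to: a StandByTransitionRunnable-style handler protected by an AtomicBoolean hasAlreadyRun flag. It is a minimal, self-contained illustration under the assumption stated in the comment (the registered runnable's flag is already true when the second fatal event arrives); it is not the actual Hadoop source.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the "run only once" guard described in the comment above.
// It only illustrates why a stale runnable whose flag is already true
// swallows later fatal events.
public class StaleStandbyGuardDemo {

  static class StandByTransitionRunnable implements Runnable {
    // Ensures the transition to standby runs only once per runnable instance.
    private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

    @Override
    public void run() {
      if (hasAlreadyRun.getAndSet(true)) {
        // Already ran once: every later fatal event handled by this instance
        // is silently ignored, so the RM never transitions to standby again.
        return;
      }
      System.out.println("Transitioning RM to standby");
    }
  }

  public static void main(String[] args) {
    StandByTransitionRunnable handler = new StandByTransitionRunnable();

    // 2023-02-20: TRANSITION_TO_ACTIVE_FAILED -> flag flips to true,
    // the RM transitions to standby.
    handler.run();

    // If re-initialization of the active services keeps this already-used
    // instance registered (the situation described in the comment), the next
    // fatal event does nothing:
    // 2023-02-23: STATE_STORE_FENCED -> no transition, the RM stays "active"
    // with a fenced state store and applications hang in NEW_SAVING.
    handler.run();
  }
}
{code}

Under this assumption, one fix direction would be to ensure a fresh runnable instance (with a fresh flag) is registered whenever the active services are re-initialized.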
[jira] [Comment Edited] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694396#comment-17694396 ] Xiping Zhang edited comment on YARN-11447 at 2/28/23 7:59 AM: -- !image-2023-02-28-15-58-56-377.png! was (Author: zhangxiping): !image-2023-02-28-15-56-57-552.png! > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: image-2023-02-28-15-56-57-552.png, > image-2023-02-28-15-58-56-377.png, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11447: Attachment: (was: image-2023-02-28-15-56-57-552.png) > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: image-2023-02-28-15-58-56-377.png, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694396#comment-17694396 ] Xiping Zhang commented on YARN-11447: - !image-2023-02-28-15-56-57-552.png! > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: image-2023-02-28-15-56-57-552.png, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11447: Attachment: image-2023-02-28-15-56-57-552.png > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: image-2023-02-28-15-56-57-552.png, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
[ https://issues.apache.org/jira/browse/YARN-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694391#comment-17694391 ] Xiping Zhang commented on YARN-11447: - After analyzing the two logs in the attachment, my conclusion is: on 2023-02-20 the StandByTransitionThread was started because ZooKeeper was faulty. During the switchover the active services were initialized, and that init process instantiated a new StandByTransitionRunnable. However, a fatal error occurred while the StandByTransitionThread-EventThread was working: an RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED was received, caused by a failure to refresh the configuration settings, and the StandByTransitionThread switched the RM to standby. By that point the newly instantiated StandByTransitionRunnable's AtomicBoolean hasAlreadyRun flag had already been set to true (the two back-to-back switchovers left the flag in the wrong state). At 2023-02-23 01:05:14, an RMFatalEvent of type STATE_STORE_FENCED occurred; the state-store service went from active to fenced and stopped serving requests. The StandByTransitionRunnable was invoked again, but because its hasAlreadyRun flag was already true, the RM stayed active and never switched to standby. With the state store out of service, all applications got stuck in the persist-to-ZooKeeper phase. > A bug occurs during the active/standby switchover of RM, causing RM to work > abnormally > -- > > Key: YARN-11447 > URL: https://issues.apache.org/jira/browse/YARN-11447 > Project: Hadoop YARN > Issue Type: Bug > Environment: hadoop 2.9.2 >Reporter: Xiping Zhang >Priority: Critical > Attachments: yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, > yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 > > > > > 2023-02-23 01:05:14 All applications are in the NEW_SAVING state。 > !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11447) A bug occurs during the active/standby switchover of RM, causing RM to work abnormally
Xiping Zhang created YARN-11447: --- Summary: A bug occurs during the active/standby switchover of RM, causing RM to work abnormally Key: YARN-11447 URL: https://issues.apache.org/jira/browse/YARN-11447 Project: Hadoop YARN Issue Type: Bug Environment: hadoop 2.9.2 Reporter: Xiping Zhang Attachments: yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log-1.1, yarn-yarn-resourcemanager-wxqcbd002.wxqc.cn.log.2 2023-02-23 01:05:14 All applications are in the NEW_SAVING state. !http://easyproject.nos-jd.163yun.com/d873ad4dd006421b8797f588bf466616.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11107: Attachment: (was: YARN-11107-branch-3.3.0.001.patch) > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Assignee: Xiping Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11107: Attachment: (was: YARN-11107-branch-2.9.2.001.patch) > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Assignee: Xiping Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11107-branch-3.3.0.001.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521997#comment-17521997 ] Xiping Zhang commented on YARN-11107: - [~bteke] Yes, that is also needed. > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520382#comment-17520382 ] Xiping Zhang commented on YARN-11107: - [~junping_du] [~hexiaoqiao] Thank you for your help. Ok, I will refer to this article to learn.:) > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Labels: pull-request-available > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > Time Spent: 20m > Remaining Estimate: 0h > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519358#comment-17519358 ] Xiping Zhang commented on YARN-11107: - BTW, I am new here and hope to contribute to Hadoop and improve myself. Is there any documentation about the workflow of the Hadoop community? Is there an offline WeChat or QQ exchange group that I could be invited to? WX: zxp877758823. Thank you again! > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519349#comment-17519349 ] Xiping Zhang commented on YARN-11107: - [~hexiaoqiao] Thank you for your reply, I will submit a PR later. > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang reopened YARN-11107: - > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107 ] Xiping Zhang deleted comment on YARN-11107: - was (Author: zhangxiping): cc [~BilwaST] [~tangzhankun] > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518555#comment-17518555 ] Xiping Zhang commented on YARN-11107: - cc [~leosun08] [~linyiqun] [~weichiu] [~hexiaoqiao] Could you help review this? Thanks. > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518004#comment-17518004 ] Xiping Zhang commented on YARN-11107: - cc [~BilwaST] [~tangzhankun] > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517921#comment-17517921 ] Xiping Zhang edited comment on YARN-11107 at 4/6/22 9:56 AM: - I think that when NodeLabel is enabled, the RM should take the application's label into account when passing the number of NMs to the AM. When the number of blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM should release the NMs in its blacklist. For DefaultAMSProcessor.java: {code:java} final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor { ... public void allocate(ApplicationAttemptId appAttemptId, AllocateRequest request, AllocateResponse response) throws YarnException { ... // Consider whether NodeLabel is enabled response.setNumClusterNodes(getScheduler().getNumClusterNodes()); ... } {code} was (Author: zhangxiping): I think that when NodeLabel is enabled, the RM should take the application's label into account when passing the number of NMs to the AM. When the number of blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM should release the NMs in its blacklist. For DefaultAMSProcessor.java: {code:java} final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor { ... public void allocate(ApplicationAttemptId appAttemptId, AllocateRequest request, AllocateResponse response) throws YarnException { ... // Consider whether NodeLabel is enabled response.setNumClusterNodes(getScheduler().getNumClusterNodes()); ... } {code} > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11107: Summary: When NodeLabel is enabled for a YARN cluster, AM blacklist program does not work properly (was: When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal) > When NodeLabel is enabled for a YARN cluster, AM blacklist program does not > work properly > - > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11107: Attachment: YARN-11107-branch-3.3.0.001.patch > When NodeLabel is enabled for a YARN cluster, the blacklist feature is > abnormal > --- > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: YARN-11107-branch-2.9.2.001.patch, > YARN-11107-branch-3.3.0.001.patch > > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517921#comment-17517921 ] Xiping Zhang commented on YARN-11107: - I think that when NodeLabel is enabled, the RM should take the application's label into account when passing the number of NMs to the AM. When the number of blacklisted nodes exceeds 33% of the total number of nodes with that label, the AM should release the NMs in its blacklist. For DefaultAMSProcessor.java: {code:java} final class DefaultAMSProcessor implements ApplicationMasterServiceProcessor { ... public void allocate(ApplicationAttemptId appAttemptId, AllocateRequest request, AllocateResponse response) throws YarnException { ... // Consider whether NodeLabel is enabled response.setNumClusterNodes(getScheduler().getNumClusterNodes()); ... } {code} > When NodeLabel is enabled for a YARN cluster, the blacklist feature is > abnormal > --- > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
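The snippet in the comment above shows where the RM reports the node count to the AM. Below is a hypothetical helper sketching the fix direction the comment proposes: count only the nodes carrying the application's label so the AM's 33% blacklist-disable threshold uses the right denominator. The accessor names are assumptions about the RM internals and should be verified against the actual Hadoop source; this is a sketch, not a tested patch.

{code:java}
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.YarnScheduler;

// Hypothetical sketch of a label-aware node count for the AM response.
final class LabelAwareNodeCount {

  private LabelAwareNodeCount() {
  }

  static int numNodesForApp(RMContext rmContext, YarnScheduler scheduler,
      ApplicationId appId) {
    RMApp app = rmContext.getRMApps().get(appId);
    String label = (app == null) ? null
        : app.getApplicationSubmissionContext().getNodeLabelExpression();
    if (label == null || label.isEmpty()) {
      // No label: keep the existing behaviour (whole cluster).
      return scheduler.getNumClusterNodes();
    }
    // Count the nodes carrying this label. An "active NM count per label"
    // accessor, if the label manager provides one, would be preferable.
    Map<String, Set<NodeId>> labelsToNodes =
        rmContext.getNodeLabelManager().getLabelsToNodes();
    Set<NodeId> labeledNodes = labelsToNodes.get(label);
    return (labeledNodes == null || labeledNodes.isEmpty())
        ? scheduler.getNumClusterNodes() : labeledNodes.size();
  }
}
{code}

DefaultAMSProcessor.allocate could then pass the result of numNodesForApp(...) to response.setNumClusterNodes(...) instead of the raw cluster node count, which is the behaviour the comment asks for.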
[jira] [Updated] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal
[ https://issues.apache.org/jira/browse/YARN-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-11107: Description: Yarn NodeLabel is enabled in the production environment. We encountered an application whose AM blacklisted all the NMs corresponding to the label in the queue, and other applications in the queue could not apply for computing resources. We found that RM printed a lot of logs "Trying to fulfill reservation for application..." (was: Yarn NodeLabel is enabled in the production environment. While an application was running, its AM blacklisted all the NMs corresponding to the label in the queue, and other applications in the queue could not apply for computing resources. We found that RM printed a lot of logs "Trying to fulfill reservation for application...") > When NodeLabel is enabled for a YARN cluster, the blacklist feature is > abnormal > --- > > Key: YARN-11107 > URL: https://issues.apache.org/jira/browse/YARN-11107 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > > Yarn NodeLabel is enabled in the production environment. We encountered a > application AM that blacklisted all NMS corresponding to the lable in the > queue, and other application in the queue cannot apply for computing > resources. We found that RM printed a lot of logs "Trying to fulfill > reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11107) When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal
Xiping Zhang created YARN-11107: --- Summary: When NodeLabel is enabled for a YARN cluster, the blacklist feature is abnormal Key: YARN-11107 URL: https://issues.apache.org/jira/browse/YARN-11107 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.3.0, 2.9.2 Reporter: Xiping Zhang Yarn NodeLabel is enabled in the production environment. While an application was running, its AM blacklisted all the NMs corresponding to the label in the queue, and other applications in the queue could not apply for computing resources. We found that RM printed a lot of logs "Trying to fulfill reservation for application..." -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351459#comment-17351459 ] Xiping Zhang commented on YARN-10781: - [~zhuqi] Yes, we have enabled rolling log aggregation, but that doesn't seem to be the problem. Because of the dynamic resource mechanism, a single long-running job may occupy one aggregation thread on every node of the cluster. If there are 100 such long-running jobs, all of the NM aggregation threads (100 by default) across the cluster will be occupied. :( > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: applications.png, containers.png, containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming applications, but these applications do not > have running Containers.When the offline application running on it finishes, > the log cannot be reported to HDFS. When we killed a large number of > SparkStreaming applications, we found that a large number of log files were > being created on the NN side, causing the read and write performance on the > NN side to degrade significantly.Causes the business application to time out。 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
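For reference, the size of the NM log-aggregation thread pool discussed here is configurable. The sketch below reads the setting the way the NM does in spirit; the property name and default (yarn.nodemanager.logaggregation.threadpool-size-max, 100) are given from memory and should be verified against your Hadoop version. As the comment notes, raising the pool size only moves the ceiling while long-running applications keep holding their threads.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: print the configured NM log-aggregation thread pool size and build
// a fixed pool of that size. Every running application holds one pool thread
// until it finishes or is aborted, which is why ~100 long-running apps on one
// NM exhaust the pool.
public class LogAggregationPoolSize {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Property name and default value are assumptions recalled from the NM
    // configuration; verify them before relying on this.
    int poolSize = conf.getInt(
        "yarn.nodemanager.logaggregation.threadpool-size-max", 100);
    System.out.println("NM log aggregation thread pool size: " + poolSize);

    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    pool.shutdown();
  }
}
{code}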
[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-10781: Description: We observed more than 100 applications running on one NM. Most of these applications are SparkStreaming applications, but they do not have running Containers. When an offline application running on that NM finishes, its log cannot be reported to HDFS. When we killed a large number of SparkStreaming applications, we found that a large number of log files were being created on the NN side, causing NN read and write performance to degrade significantly and business applications to time out. (was: We observed more than 100 applications running on one NM. Most of these applications are SparkStreaming tasks, but they do not have running Containers. When an offline application running on that NM finishes, its log cannot be reported to HDFS.) > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: applications.png, containers.png, containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming applications, but these applications do not > have running Containers.When the offline application running on it finishes, > the log cannot be reported to HDFS. When we killed a large number of > SparkStreaming applications, we found that a large number of log files were > being created on the NN side, causing the read and write performance on the > NN side to degrade significantly.Causes the business application to time out。 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350276#comment-17350276 ] Xiping Zhang commented on YARN-10781: - [~zhuqi] Thank you for your reply. The NM instantiates LogAggregationService at startup. It has a thread pool with 100 threads by default that serves log aggregation for all applications on the NM. When the AM first notifies the NM to start a Container for an application, the LogAggregationService initApp method is called and a thread is assigned to handle that application's log aggregation. Attached is a snapshot from one NM in our production environment, showing the Applications and Containers running on it. There are 49 applications running on it but only 14 Containers. As I understand the NM code, at least 49 threads are working on log aggregation, 35 of which belong to applications that no longer have a Container. !applications.png! applications !containers.png! > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: applications.png, containers.png, containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-10781: Attachment: containers.png > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: applications.png, containers.png, containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-10781: Attachment: applications.png > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: applications.png, containers.png, containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated YARN-10781: Attachment: containers.png > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > Attachments: containers.png > > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350226#comment-17350226 ] Xiping Zhang edited comment on YARN-10781 at 5/24/21, 7:00 AM: --- Aggregated logs are handled by NM.When NM initializes an Application, it allocates a thread to do the aggregate logging for that application.Here's the code on the NM side. {code:java} // LogAggregationService.java @SuppressWarnings("unchecked") private void initApp(final ApplicationId appId, String user, Credentials credentials, Map appAcls, LogAggregationContext logAggregationContext, long recoveredLogInitedTime) { ApplicationEvent eventResponse; try { initAppAggregator(appId, user, credentials, appAcls, logAggregationContext, recoveredLogInitedTime); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_INITED); } catch (YarnRuntimeException e) { LOG.warn("Application failed to init aggregation", e); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED); } this.dispatcher.getEventHandler().handle(eventResponse); } {code} {code:java} protected void initAppAggregator(final ApplicationId appId, String user, Credentials credentials, Map appAcls, LogAggregationContext logAggregationContext, long recoveredLogInitedTime) { ... final AppLogAggregator appLogAggregator = new AppLogAggregatorImpl(this.dispatcher, this.deletionService, getConfig(), appId, userUgi, this.nodeId, dirsHandler, logAggregationFileController.getRemoteNodeLogFileForApp(appId, user, nodeId), appAcls, logAggregationContext, this.context, getLocalFileContext(getConfig()), this.rollingMonitorInterval, recoveredLogInitedTime, logAggregationFileController); ... // Schedule the aggregator. Runnable aggregatorWrapper = new Runnable() { public void run() { try { appLogAggregator.run(); } finally { appLogAggregators.remove(appId); closeFileSystems(userUgi); } } }; this.threadPool.execute(aggregatorWrapper); if (appDirException != null) { throw appDirException; } } {code} {code:java} // AppLogAggregatorImpl.java @Override public void run() { try { doAppLogAggregation(); } ... } {code} {code:java} //AppLogAggregatorImpl.java private void doAppLogAggregation() throws LogAggregationDFSException { while (!this.appFinishing.get() && !this.aborted.get()) { synchronized(this) { try { waiting.set(true); if (logControllerContext.isLogAggregationInRolling()) { wait(logControllerContext.getRollingMonitorInterval() * 1000); if (this.appFinishing.get() || this.aborted.get()) { break; } uploadLogsForContainers(false); } else { wait(THREAD_SLEEP_TIME); } } catch (InterruptedException e) { LOG.warn("PendingContainers queue is interrupted"); this.appFinishing.set(true); } catch (LogAggregationDFSException e) { this.appFinishing.set(true); throw e; } } } if (this.aborted.get()) { return; } try { // App is finished, upload the container logs. uploadLogsForContainers(true); doAppLogAggregationPostCleanUp(); } catch (LogAggregationDFSException e) { LOG.error("Error during log aggregation", e); } this.dispatcher.getEventHandler().handle( new ApplicationEvent(this.appId, ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED)); this.appAggregationFinished.set(true); } {code} When handling the APPLICATION_STARTED event at NM, NM initializes the App and initializes an ApplogAggregatorImpl to handle the log aggregation, which allocates a thread to run its run method.Inside this is a loop until the task is Finished or aborted. 
was (Author: zhangxiping): Aggregated logs are handled by NM.When NM initializes an Application, it allocates a thread to do the aggregate logging for that application.Here's the code on the NM side. {code:java} @SuppressWarnings("unchecked") private void initApp(final ApplicationId appId, String user, Credentials credentials, Map appAcls, LogAggregationContext logAggregationContext, long recoveredLogInitedTime) { ApplicationEvent eventResponse; try { initAppAggregator(appId, user, credentials, appAcls, logAggregationContext, recoveredLogInitedTime); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_INITED); } catch (YarnRuntimeException e) { LOG.warn("Application failed to init aggregation", e); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED); } this.dispatcher.
[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350226#comment-17350226 ] Xiping Zhang commented on YARN-10781: - Aggregated logs are handled by NM.When NM initializes an Application, it allocates a thread to do the aggregate logging for that application.Here's the code on the NM side. {code:java} @SuppressWarnings("unchecked") private void initApp(final ApplicationId appId, String user, Credentials credentials, Map appAcls, LogAggregationContext logAggregationContext, long recoveredLogInitedTime) { ApplicationEvent eventResponse; try { initAppAggregator(appId, user, credentials, appAcls, logAggregationContext, recoveredLogInitedTime); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_INITED); } catch (YarnRuntimeException e) { LOG.warn("Application failed to init aggregation", e); eventResponse = new ApplicationEvent(appId, ApplicationEventType.APPLICATION_LOG_HANDLING_FAILED); } this.dispatcher.getEventHandler().handle(eventResponse); } {code} {code:java} protected void initAppAggregator(final ApplicationId appId, String user, Credentials credentials, Map appAcls, LogAggregationContext logAggregationContext, long recoveredLogInitedTime) { ... final AppLogAggregator appLogAggregator = new AppLogAggregatorImpl(this.dispatcher, this.deletionService, getConfig(), appId, userUgi, this.nodeId, dirsHandler, logAggregationFileController.getRemoteNodeLogFileForApp(appId, user, nodeId), appAcls, logAggregationContext, this.context, getLocalFileContext(getConfig()), this.rollingMonitorInterval, recoveredLogInitedTime, logAggregationFileController); ... // Schedule the aggregator. Runnable aggregatorWrapper = new Runnable() { public void run() { try { appLogAggregator.run(); } finally { appLogAggregators.remove(appId); closeFileSystems(userUgi); } } }; this.threadPool.execute(aggregatorWrapper); if (appDirException != null) { throw appDirException; } } {code} {code:java} // AppLogAggregatorImpl.java @Override public void run() { try { doAppLogAggregation(); } ... } {code} {code:java} // private void doAppLogAggregation() throws LogAggregationDFSException { while (!this.appFinishing.get() && !this.aborted.get()) { synchronized(this) { try { waiting.set(true); if (logControllerContext.isLogAggregationInRolling()) { wait(logControllerContext.getRollingMonitorInterval() * 1000); if (this.appFinishing.get() || this.aborted.get()) { break; } uploadLogsForContainers(false); } else { wait(THREAD_SLEEP_TIME); } } catch (InterruptedException e) { LOG.warn("PendingContainers queue is interrupted"); this.appFinishing.set(true); } catch (LogAggregationDFSException e) { this.appFinishing.set(true); throw e; } } } if (this.aborted.get()) { return; } try { // App is finished, upload the container logs. uploadLogsForContainers(true); doAppLogAggregationPostCleanUp(); } catch (LogAggregationDFSException e) { LOG.error("Error during log aggregation", e); } this.dispatcher.getEventHandler().handle( new ApplicationEvent(this.appId, ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED)); this.appAggregationFinished.set(true); } {code} When handling the APPLICATION_STARTED event at NM, NM initializes the App and initializes an ApplogAggregatorImpl to handle the log aggregation, which allocates a thread to run its run method.Inside this is a loop until the task is Finished or aborted. 
> The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350214#comment-17350214 ] Xiping Zhang commented on YARN-10781: - Sorry for the late reply. {code:java} // ExecutorAllocationManager.scala /** * Register for scheduler callbacks to decide when to add and remove executors, and start * the scheduling task. */ def start(): Unit = { listenerBus.addToManagementQueue(listener) val scheduleTask = new Runnable() { override def run(): Unit = { try { schedule() } catch { case ct: ControlThrowable => throw ct case t: Throwable => logWarning(s"Uncaught exception in thread ${Thread.currentThread().getName}", t) } } } executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS) client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount) } {code} The schedule method is periodically executed . {code:java} /** * This is called at a fixed interval to regulate the number of pending executor requests * and number of executors running. * * First, adjust our requested executors based on the add time and our current needs. * Then, if the remove time for an existing executor has expired, kill the executor. * * This is factored out into its own method for testing. */ private def schedule(): Unit = synchronized { val now = clock.getTimeMillis val executorIdsToBeRemoved = ArrayBuffer[String]() removeTimes.retain { case (executorId, expireTime) => val expired = now >= expireTime if (expired) { initializing = false executorIdsToBeRemoved += executorId } !expired } // Update executor target number only after initializing flag is unset updateAndSyncNumExecutorsTarget(now) if (executorIdsToBeRemoved.nonEmpty) { removeExecutors(executorIdsToBeRemoved) } } {code} This will remove executors from the executorIdsToBeRemoved set. {code:java} /** * Request the cluster manager to remove the given executors. * Returns the list of executors which are removed. */ private def removeExecutors(executors: Seq[String]): Seq[String] = synchronized { val executorIdsToBeRemoved = new ArrayBuffer[String] logInfo("Request to remove executorIds: " + executors.mkString(", ")) val numExistingExecutors = allocationManager.executorIds.size - executorsPendingToRemove.size var newExecutorTotal = numExistingExecutors executors.foreach { executorIdToBeRemoved => if (newExecutorTotal - 1 < minNumExecutors) { logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " + s"$newExecutorTotal executor(s) left (minimum number of executor limit $minNumExecutors)") } else if (newExecutorTotal - 1 < numExecutorsTarget) { logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " + s"$newExecutorTotal executor(s) left (number of executor target $numExecutorsTarget)") } else if (canBeKilled(executorIdToBeRemoved)) { executorIdsToBeRemoved += executorIdToBeRemoved newExecutorTotal -= 1 } } if (executorIdsToBeRemoved.isEmpty) { return Seq.empty[String] } // Send a request to the backend to kill this executor(s) val executorsRemoved = if (testing) { executorIdsToBeRemoved } else { // We don't want to change our target number of executors, because we already did that // when the task backlog decreased. client.killExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false, countFailures = false, force = false) } // [SPARK-21834] killExecutors api reduces the target number of executors. // So we need to update the target with desired value. 
client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount) // reset the newExecutorTotal to the existing number of executors newExecutorTotal = numExistingExecutors if (testing || executorsRemoved.nonEmpty) { executorsRemoved.foreach { removedExecutorId => // If it is a cached block, it uses cachedExecutorIdleTimeoutS for timeout val idleTimeout = if (blockManagerMaster.hasCachedBlocks(removedExecutorId)) { cachedExecutorIdleTimeoutS } else { executorIdleTimeoutS } newExecutorTotal -= 1 logInfo(s"Removing executor $removedExecutorId because it has been idle for " + s"$idleTimeout seconds (new desired total will be $newExecutorTotal)") executorsPendingToRemove.add(removedExecutorId) } executorsRemoved } else { logWarning(s"Unable to reach the cluster manager to kill executor/s " + s"${executorIdsToBeRemoved.mkString(",")} or no executor eligible to kill!") Seq.empty[String] } } {code} > The Thread of the NM aggregate log is exhausted and no other Application can > agg
[jira] [Comment Edited] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348991#comment-17348991 ] Xiping Zhang edited comment on YARN-10781 at 5/21/21, 6:02 AM: --- When the NM accepts an Application, it initializes an AppLogAggregatorImpl internally and submits it to a thread pool with a default size of 100. Each thread is responsible for uploading one Application's logs until that Application finishes or is aborted, so at most 100 running applications can be handled simultaneously. Due to SparkStreaming's dynamic resource mechanism, such threads on an NM may be unable to exit even though no Container is running on it. Increasing the number of core threads in the thread pool is possible, but it is not a good solution: as the number of SparkStreaming applications grows, more threads get occupied. Please correct me if there is any problem with my understanding. was (Author: zhangxiping): When the NM accepts an Application container, it initializes an AppLogAggregatorImpl internally and submits it to a thread pool with a default size of 100. Each thread is responsible for uploading one Application's logs until that Application finishes or is aborted, so at most 100 running applications can be handled simultaneously. Due to SparkStreaming's dynamic resource mechanism, such threads on an NM may be unable to exit even though no Container is running on it. Increasing the number of core threads in the thread pool is possible, but it is not a good solution: as the number of SparkStreaming applications grows, more threads get occupied. Please correct me if there is any problem with my understanding. > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
[ https://issues.apache.org/jira/browse/YARN-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348991#comment-17348991 ] Xiping Zhang commented on YARN-10781: - When the NM accepts an Application container, it initializes an AppLogAggregatorImpl internally and submits it to a thread pool with a default size of 100. Each thread is responsible for uploading one Application's logs until that Application finishes or is aborted, so at most 100 running applications can be handled simultaneously. Due to SparkStreaming's dynamic resource mechanism, such threads on an NM may be unable to exit even though no Container is running on it. Increasing the number of core threads in the thread pool is possible, but it is not a good solution: as the number of SparkStreaming applications grows, more threads get occupied. Please correct me if there is any problem with my understanding. > The Thread of the NM aggregate log is exhausted and no other Application can > aggregate the log > -- > > Key: YARN-10781 > URL: https://issues.apache.org/jira/browse/YARN-10781 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.2, 3.3.0 >Reporter: Xiping Zhang >Priority: Major > > We observed more than 100 applications running on one NM.Most of these > applications are SparkStreaming tasks, but these applications do not have > running Containers.When the offline application running on it finishes, the > log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10781) The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log
Xiping Zhang created YARN-10781: --- Summary: The Thread of the NM aggregate log is exhausted and no other Application can aggregate the log Key: YARN-10781 URL: https://issues.apache.org/jira/browse/YARN-10781 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.3.0, 2.9.2 Reporter: Xiping Zhang We observed more than 100 applications running on one NM. Most of these applications are SparkStreaming tasks, but they do not have running Containers. When an offline application running on that NM finishes, its log cannot be reported to HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org