[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app
[ https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JayceAu updated YARN-8493:
--------------------------
    Affects Version/s: 2.7.0, 2.8.0, 2.9.0, 3.0.0

> LogAggregation in NodeManager is put off because great amount of long running app
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-8493
>                 URL: https://issues.apache.org/jira/browse/YARN-8493
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0
>            Reporter: JayceAu
>            Priority: Major
>         Attachments: YARN-8493.001.patch
>
> h2. Issue summary
> In our YARN cluster it takes, on average, 30 minutes after an application finishes for its logs to appear on the web UI. The delay is caused by the limited thread-pool size in the NodeManager.
> The NodeManager sets aside one appLogAggregator for each application with containers running on it. Each appLogAggregator occupies one thread in the pool until the application finishes across the whole cluster. The NodeManager uses a FixedThreadPool (default size 100) instead of the CachedThreadPool used in older versions. At peak times in our production environment there are more than 350 AppLogAggregators running or queued in the pool, and the queued applications suffer long log-aggregation latency.
> h2. Possible solution
> We could raise yarn.nodemanager.logaggregation.threadpool-size-max to a higher value, but the problem recurs as the number of running applications grows, and it leaves many idle threads waiting for log aggregation.
> Our solution is not to put the appLogAggregator into the thread pool until its application is finished:
> # give each appLogAggregator a callback that submits it to the thread pool; the callback is not invoked until the aggregator is notified
> # if rollingMonitorInterval is greater than 0, the NodeManager sets aside one thread in LogAggregationService to periodically do log aggregation for all running applications

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
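The two-step scheme above (a callback per aggregator, fired only when the app finishes) can be sketched in a few lines. This is a hypothetical illustration, not the YARN-8493 patch: the class and method names (`AppLogAggregator`, `onAppFinished`) are assumptions, and the "aggregation" is a placeholder string instead of real log upload.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: instead of each aggregator blocking a pool thread for the
// lifetime of its app, the aggregator holds a callback and only enters
// the fixed pool once the app is reported finished.
public class CallbackPoolSketch {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // One per application; created at app start but idle until notified.
    static class AppLogAggregator {
        final String appId;
        AppLogAggregator(String appId) { this.appId = appId; }

        // Callback invoked when the app finishes; only now does the
        // aggregator occupy a pool thread.
        Future<String> onAppFinished() {
            return POOL.submit(() -> "aggregated logs for " + appId);
        }
    }

    public static void main(String[] args) throws Exception {
        AppLogAggregator agg = new AppLogAggregator("app_0001");
        // Simulate the "app finished" notification firing the callback.
        System.out.println(agg.onAppFinished().get());
        POOL.shutdown();
    }
}
```

With this shape, a long-running app costs no pool thread while it runs, so a pool of 100 threads bounds only the number of *concurrent* aggregations, not the number of live applications.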
[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app
[ https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JayceAu updated YARN-8493:
--------------------------
    Fix Version/s:     (was: 2.6.0)
[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app
[ https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JayceAu updated YARN-8493:
--------------------------
    Attachment: YARN-8493.001.patch
[jira] [Created] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app
JayceAu created YARN-8493:
--------------------------

             Summary: LogAggregation in NodeManager is put off because great amount of long running app
                 Key: YARN-8493
                 URL: https://issues.apache.org/jira/browse/YARN-8493
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: JayceAu
             Fix For: 2.6.0
[jira] [Commented] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted
[ https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403294#comment-16403294 ]

JayceAu commented on YARN-8031:
-------------------------------
@Miklos Szegedi, after reading the source code, and according to my test results, if *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to false, the NM will not create the hadoop-yarn hierarchy directory under the mounted cpu controller, which conflicts with what the doc says at [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]:
{code}
The cgroups hierarchy under which to place YARN proccesses (cannot contain commas).
If yarn.nodemanager.linux-container-executor.cgroups.mount is false (that is, if
cgroups have been pre-configured) and the YARN user has write access to the parent
directory, then the directory will be created. If the directory already exists, the
administrator has to give YARN write permissions to it recursively.
{code}

> NodeManager will fail to start if cpu subsystem is already mounted
> ------------------------------------------------------------------
>
>                 Key: YARN-8031
>                 URL: https://issues.apache.org/jira/browse/YARN-8031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: JayceAu
>            Priority: Major
>         Attachments: YARN-8031.001.patch
>
> If *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true and the cpu subsystem is not yet mounted, the NodeManager mounts the cpu subsystem and, if the mount step succeeds, creates the control group (default name *hadoop-yarn*). This procedure works well when the cpu subsystem is not yet mounted. However, in some situations the cpu subsystem is already mounted before the NodeManager starts, and the NodeManager then fails to start because it has no write permission to the *hadoop-yarn* path. For example:
> # OSes that use systemd, such as CentOS 7, mount the cpu subsystem by default at machine startup
> # daemons that start before the NodeManager may also rely on the cpu subsystem being mounted; in our production environment we limit the cpu usage of a monitoring and control agent that starts on reboot
> To solve this problem, container-executor must be able to create the control group *hadoop-yarn* both when it mounts the controller successfully and when the controller is already mounted. Besides, if the cpu subsystem is co-mounted with other subsystems and is already mounted, container-executor should use the actual mount point of the cpu subsystem instead of the one provided by the NodeManager.
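The "use the existing mount point" part of the fix amounts to scanning the mount table for a cgroup entry whose options include the cpu controller. The sketch below illustrates that idea against a /proc/mounts-style table; it is an assumption-laden illustration, not the YARN-8031 patch (which changes container-executor, written in C), and `findCpuMount` is a hypothetical helper name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch: find where the cpu cgroup controller is already mounted by
// parsing a mount table. Field layout follows /proc/mounts:
//   device mountpoint fstype options dump pass
public class CgroupMountSketch {
    static String findCpuMount(BufferedReader mounts) throws IOException {
        String line;
        while ((line = mounts.readLine()) != null) {
            String[] f = line.split("\\s+");
            if (f.length >= 4 && f[2].equals("cgroup")) {
                // Options are comma-separated; cpu may be co-mounted
                // with other controllers, e.g. "cpu,cpuacct".
                for (String opt : f[3].split(",")) {
                    if (opt.equals("cpu")) {
                        return f[1]; // existing mount point wins
                    }
                }
            }
        }
        return null; // not mounted; caller may mount it itself
    }

    public static void main(String[] args) throws Exception {
        String table =
            "sysfs /sys sysfs rw 0 0\n" +
            "cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,cpu,cpuacct 0 0\n";
        System.out.println(findCpuMount(new BufferedReader(new StringReader(table))));
    }
}
```

Returning the mount point found in the table (rather than the path the NodeManager was configured with) is exactly what the description above asks for when cpu is co-mounted with other subsystems.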
[jira] [Updated] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted
[ https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JayceAu updated YARN-8031:
--------------------------
    Attachment: YARN-8031.001.patch
[jira] [Updated] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted
[ https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JayceAu updated YARN-8031:
--------------------------
    Attachment:     (was: image-2018-03-15-14-47-30-583.png)
[jira] [Created] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted
JayceAu created YARN-8031:
--------------------------

             Summary: NodeManager will fail to start if cpu subsystem is already mounted
                 Key: YARN-8031
                 URL: https://issues.apache.org/jira/browse/YARN-8031
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.5.0
            Reporter: JayceAu
         Attachments: image-2018-03-15-14-47-30-583.png
[jira] [Created] (YARN-6583) Hadoop-sls failed to start because of premature state of RM
JayceAu created YARN-6583:
--------------------------

             Summary: Hadoop-sls failed to start because of premature state of RM
                 Key: YARN-6583
                 URL: https://issues.apache.org/jira/browse/YARN-6583
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler-load-simulator
    Affects Versions: 2.6.0
            Reporter: JayceAu

During SLS startup, after startRM() in SLSRunner.start(), the BaseContainerTokenSecretManager has not yet generated its own internal key (or the key is not yet visible to other threads), so NM registration fails with the following exception and the whole SLS process crashes:

{noformat}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:81)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:300)
        at org.apache.hadoop.yarn.sls.nodemanager.NMSimulator.init(NMSimulator.java:105)
        at org.apache.hadoop.yarn.sls.SLSRunner.startNM(SLSRunner.java:202)
        at org.apache.hadoop.yarn.sls.SLSRunner.start(SLSRunner.java:143)
        at org.apache.hadoop.yarn.sls.SLSRunner.main(SLSRunner.java:528)
17/05/11 10:21:06 INFO resourcemanager.ResourceManager: Recovery started
17/05/11 10:21:06 INFO recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
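The NPE above is a startup race: the NM simulator calls getCurrentKey() before the RM has rolled its first container-token master key. One generic way to avoid such a race is to gate registration on a readiness signal, sketched below with a CountDownLatch. This is a hypothetical illustration of the ordering fix, not the actual SLS code; the method names `rmStartup` and `registerNodeManager` are stand-ins.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: NM registration blocks until the RM side has signaled that its
// master key exists, instead of racing ahead and hitting an NPE.
public class StartupGateSketch {
    static final CountDownLatch KEY_READY = new CountDownLatch(1);

    static void rmStartup() {
        // ... RM init work would happen here, then the key roll ...
        KEY_READY.countDown(); // signal: getCurrentKey() is now safe
    }

    static String registerNodeManager() throws InterruptedException {
        KEY_READY.await(); // wait for the key instead of crashing
        return "registered";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        // NM registration starts first, as in the SLS race, but blocks.
        Future<String> nm = ex.submit(StartupGateSketch::registerNodeManager);
        rmStartup();
        System.out.println(nm.get());
        ex.shutdown();
    }
}
```

An alternative with the same effect is retrying registration with backoff until getCurrentKey() returns non-null; the latch version just makes the happens-before edge explicit.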