[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app

2018-07-08 Thread JayceAu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JayceAu updated YARN-8493:
--
Affects Version/s: 2.7.0
   2.8.0
   2.9.0
   3.0.0

> LogAggregation in NodeManager is put off because great amount of long running 
> app
> -
>
> Key: YARN-8493
> URL: https://issues.apache.org/jira/browse/YARN-8493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0
>Reporter: JayceAu
>Priority: Major
> Attachments: YARN-8493.001.patch
>
>
> h2. Issue summary
> In our YARN cluster it takes 30 minutes on average, after an app finishes, 
> for its logs to appear on the web UI. The delay is caused by the limited 
> size of a thread pool in the NodeManager.
> The NodeManager sets aside one appLogAggregator for each application running 
> on the node, and each appLogAggregator occupies a thread in that pool until 
> the application has finished across the whole cluster. The NodeManager uses 
> a FixedThreadPool (default size 100) instead of the CachedThreadPool used in 
> older versions. At peak moments in our production environment more than 350 
> AppLogAggregators are running or queued in the thread pool, and the queued 
> apps suffer long log aggregation latency.
> h2. Possible Solution
> We can raise yarn.nodemanager.logaggregation.threadpool-size-max to a 
> higher value, but the problem will recur once the number of running apps 
> grows again, and a larger pool leaves many threads sitting idle waiting for 
> log aggregation.
> Our solution is to defer submitting the appLogAggregator to the thread pool 
> until the application has finished:
>  # give each appLogAggregator a callback that submits it to the thread 
> pool; the callback is not invoked until the aggregator is notified that the 
> app has finished (sketched below)
>  # if rollingMonitorInterval is greater than 0, the NodeManager sets aside 
> one thread in LogAggregationService that periodically performs log 
> aggregation for all running apps
>  
>  
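To make item 1 concrete, here is a minimal Java sketch of the 
deferred-submission idea, assuming a per-app callback fired on app 
completion. It is not the attached patch; the class and method names 
(AppLogAggregator, onAppFinished) are illustrative only.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of solution item 1: the aggregator is NOT submitted to the
// pool when the app starts. It holds a callback that submits it, and the
// callback fires only when the app is reported finished, so the aggregator
// consumes no pool thread while the app is still running.
public class DeferredAggregatorDemo {

  // Stand-in for the real AppLogAggregatorImpl; names are hypothetical.
  static class AppLogAggregator implements Runnable {
    private final String appId;
    private final ExecutorService pool;

    AppLogAggregator(String appId, ExecutorService pool) {
      this.appId = appId;
      this.pool = pool;
    }

    // Callback invoked once the app has finished cluster-wide.
    void onAppFinished() {
      pool.submit(this); // only now does it occupy a pool thread
    }

    @Override
    public void run() {
      System.out.println("aggregating logs for " + appId);
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(100);
    AppLogAggregator agg = new AppLogAggregator("application_1_0001", pool);
    // ... app runs; no pool thread is held during this time ...
    agg.onAppFinished(); // app finished -> aggregation is queued
    pool.shutdown();
  }
}
{code}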






[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app

2018-07-08 Thread JayceAu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JayceAu updated YARN-8493:
--
Fix Version/s: (was: 2.6.0)

> LogAggregation in NodeManager is put off because great amount of long running 
> app
> -
>
> Key: YARN-8493
> URL: https://issues.apache.org/jira/browse/YARN-8493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: JayceAu
>Priority: Major
> Attachments: YARN-8493.001.patch
>
>
> h2. Issue summary
> In our YARN cluster it takes 30 minutes on average, after an app finishes, 
> for its logs to appear on the web UI. The delay is caused by the limited 
> size of a thread pool in the NodeManager.
> The NodeManager sets aside one appLogAggregator for each application running 
> on the node, and each appLogAggregator occupies a thread in that pool until 
> the application has finished across the whole cluster. The NodeManager uses 
> a FixedThreadPool (default size 100) instead of the CachedThreadPool used in 
> older versions. At peak moments in our production environment more than 350 
> AppLogAggregators are running or queued in the thread pool, and the queued 
> apps suffer long log aggregation latency.
> h2. Possible Solution
> We can raise yarn.nodemanager.logaggregation.threadpool-size-max to a 
> higher value, but the problem will recur once the number of running apps 
> grows again, and a larger pool leaves many threads sitting idle waiting for 
> log aggregation.
> Our solution is to defer submitting the appLogAggregator to the thread pool 
> until the application has finished:
>  # give each appLogAggregator a callback that submits it to the thread 
> pool; the callback is not invoked until the aggregator is notified that the 
> app has finished
>  # if rollingMonitorInterval is greater than 0, the NodeManager sets aside 
> one thread in LogAggregationService that periodically performs log 
> aggregation for all running apps
>  
>  






[jira] [Updated] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app

2018-07-03 Thread JayceAu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JayceAu updated YARN-8493:
--
Attachment: YARN-8493.001.patch

> LogAggregation in NodeManager is put off because great amount of long running 
> app
> -
>
> Key: YARN-8493
> URL: https://issues.apache.org/jira/browse/YARN-8493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: JayceAu
>Priority: Major
> Fix For: 2.6.0
>
> Attachments: YARN-8493.001.patch
>
>
> h2. Issue summary
> In our YARN cluster it takes 30 minutes on average, after an app finishes, 
> for its logs to appear on the web UI. The delay is caused by the limited 
> size of a thread pool in the NodeManager.
> The NodeManager sets aside one appLogAggregator for each application running 
> on the node, and each appLogAggregator occupies a thread in that pool until 
> the application has finished across the whole cluster. The NodeManager uses 
> a FixedThreadPool (default size 100) instead of the CachedThreadPool used in 
> older versions. At peak moments in our production environment more than 350 
> AppLogAggregators are running or queued in the thread pool, and the queued 
> apps suffer long log aggregation latency.
> h2. Possible Solution
> We can raise yarn.nodemanager.logaggregation.threadpool-size-max to a 
> higher value, but the problem will recur once the number of running apps 
> grows again, and a larger pool leaves many threads sitting idle waiting for 
> log aggregation.
> Our solution is to defer submitting the appLogAggregator to the thread pool 
> until the application has finished:
>  # give each appLogAggregator a callback that submits it to the thread 
> pool; the callback is not invoked until the aggregator is notified that the 
> app has finished
>  # if rollingMonitorInterval is greater than 0, the NodeManager sets aside 
> one thread in LogAggregationService that periodically performs log 
> aggregation for all running apps
>  
>  






[jira] [Created] (YARN-8493) LogAggregation in NodeManager is put off because great amount of long running app

2018-07-03 Thread JayceAu (JIRA)
JayceAu created YARN-8493:
-

 Summary: LogAggregation in NodeManager is put off because great 
amount of long running app
 Key: YARN-8493
 URL: https://issues.apache.org/jira/browse/YARN-8493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: JayceAu
 Fix For: 2.6.0


h2. Issue summary

In our YARN cluster it takes 30 minutes on average, after an app finishes, for 
its logs to appear on the web UI. The delay is caused by the limited size of a 
thread pool in the NodeManager.

The NodeManager sets aside one appLogAggregator for each application running 
on the node, and each appLogAggregator occupies a thread in that pool until 
the application has finished across the whole cluster. The NodeManager uses a 
FixedThreadPool (default size 100) instead of the CachedThreadPool used in 
older versions. At peak moments in our production environment more than 350 
AppLogAggregators are running or queued in the thread pool, and the queued 
apps suffer long log aggregation latency.
h2. Possible Solution

We can raise yarn.nodemanager.logaggregation.threadpool-size-max to a higher 
value, but the problem will recur once the number of running apps grows again, 
and a larger pool leaves many threads sitting idle waiting for log aggregation.

Our solution is to defer submitting the appLogAggregator to the thread pool 
until the application has finished:
 # give each appLogAggregator a callback that submits it to the thread pool; 
the callback is not invoked until the aggregator is notified that the app has 
finished
 # if rollingMonitorInterval is greater than 0, the NodeManager sets aside one 
thread in LogAggregationService that periodically performs log aggregation for 
all running apps (sketched below)
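
For item 2, here is a minimal sketch of the periodic path, assuming 
rollingMonitorInterval is expressed in seconds and that running apps are 
tracked in an in-memory map; the names and interval value are illustrative, 
not the attached patch.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of solution item 2: when rollingMonitorInterval > 0, one dedicated
// thread walks all running apps and triggers a rolling log upload, instead
// of each app pinning its own pool thread between intervals.
public class RollingAggregationDemo {
  public static void main(String[] args) {
    long rollingMonitorInterval = 3600; // seconds; illustrative value
    Map<String, Runnable> runningApps = new ConcurrentHashMap<>();
    runningApps.put("application_1_0001",
        () -> System.out.println("rolling upload for application_1_0001"));

    ScheduledExecutorService monitor =
        Executors.newSingleThreadScheduledExecutor();
    // One thread services every running app at each interval.
    monitor.scheduleAtFixedRate(
        () -> runningApps.values().forEach(Runnable::run),
        rollingMonitorInterval, rollingMonitorInterval, TimeUnit.SECONDS);
  }
}
{code}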

 

 






[jira] [Commented] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted

2018-03-17 Thread JayceAu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403294#comment-16403294
 ] 

JayceAu commented on YARN-8031:
---

@Miklos Szegedi, after reading the source code, and according to my test 
results, if *yarn.nodemanager.linux-container-executor.cgroups.mount* is set 
to false, the NM won't create the hierarchy directory hadoop-yarn under the 
mounted cpu controller, which conflicts with what the doc says:

[https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]
{quote}
The cgroups hierarchy under which to place YARN processes (cannot contain 
commas). If yarn.nodemanager.linux-container-executor.cgroups.mount is false 
(that is, if cgroups have been pre-configured) and the YARN user has write 
access to the parent directory, then the directory will be created. If the 
directory already exists, the administrator has to give YARN write permissions 
to it recursively.
{quote}

> NodeManager will fail to start if cpu subsystem is already mounted
> --
>
> Key: YARN-8031
> URL: https://issues.apache.org/jira/browse/YARN-8031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: JayceAu
>Priority: Major
> Attachments: YARN-8031.001.patch
>
>
> If *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true 
> and the cpu subsystem is not yet mounted, the NodeManager mounts the cpu 
> subsystem and, if the mount succeeds, creates the control group whose 
> default name is *hadoop-yarn*. This procedure works well when the cpu 
> subsystem is not yet mounted. However, in some situations the cpu subsystem 
> is already mounted before the NodeManager starts, and the NodeManager then 
> fails to start because it has no write permission to the *hadoop-yarn* 
> path. For example:
>  # OSes that use systemd, such as CentOS 7, mount the cpu subsystem by 
> default on machine startup
>  # a daemon that starts before the NodeManager may also rely on the cpu 
> subsystem being mounted; in our production environment we limit the cpu 
> usage of the monitoring and control agent, which starts on reboot
> To solve this problem, container-executor must be able to create the 
> control group *hadoop-yarn* either when it mounts the controller 
> successfully or when the controller is already mounted. Moreover, if the 
> cpu subsystem is co-mounted with other subsystems and is already mounted, 
> container-executor should use the latest mount point of the cpu subsystem 
> instead of the one provided by the NodeManager (a detection sketch follows 
> this message).
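
The real fix lives in the C container-executor; the following Java sketch 
only illustrates the detection step — scanning /proc/mounts and taking the 
latest mount point of the cpu controller. The class name is hypothetical.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Optional;

// Illustration only: find where the cpu cgroup controller is already
// mounted by scanning /proc/mounts, so the existing (latest) mount point
// can be reused instead of attempting a second mount and failing.
public class CpuCgroupMountFinder {

  static Optional<String> findLatestCpuMountPoint() throws IOException {
    // /proc/mounts fields: <device> <mountpoint> <fstype> <options> ...
    return Files.readAllLines(Paths.get("/proc/mounts")).stream()
        .map(line -> line.split("\\s+"))
        .filter(f -> f.length >= 4 && "cgroup".equals(f[2])
            && Arrays.asList(f[3].split(",")).contains("cpu"))
        .map(f -> f[1])
        .reduce((first, second) -> second); // keep the latest entry
  }

  public static void main(String[] args) throws IOException {
    System.out.println(
        findLatestCpuMountPoint().orElse("cpu controller not mounted"));
  }
}
{code}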






[jira] [Updated] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted

2018-03-17 Thread JayceAu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JayceAu updated YARN-8031:
--
Attachment: YARN-8031.001.patch

> NodeManager will fail to start if cpu subsystem is already mounted
> --
>
> Key: YARN-8031
> URL: https://issues.apache.org/jira/browse/YARN-8031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: JayceAu
>Priority: Major
> Attachments: YARN-8031.001.patch
>
>
> If *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true 
> and the cpu subsystem is not yet mounted, the NodeManager mounts the cpu 
> subsystem and, if the mount succeeds, creates the control group whose 
> default name is *hadoop-yarn*. This procedure works well when the cpu 
> subsystem is not yet mounted. However, in some situations the cpu subsystem 
> is already mounted before the NodeManager starts, and the NodeManager then 
> fails to start because it has no write permission to the *hadoop-yarn* 
> path. For example:
>  # OSes that use systemd, such as CentOS 7, mount the cpu subsystem by 
> default on machine startup
>  # a daemon that starts before the NodeManager may also rely on the cpu 
> subsystem being mounted; in our production environment we limit the cpu 
> usage of the monitoring and control agent, which starts on reboot
> To solve this problem, container-executor must be able to create the 
> control group *hadoop-yarn* either when it mounts the controller 
> successfully or when the controller is already mounted. Moreover, if the 
> cpu subsystem is co-mounted with other subsystems and is already mounted, 
> container-executor should use the latest mount point of the cpu subsystem 
> instead of the one provided by the NodeManager.






[jira] [Updated] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted

2018-03-16 Thread JayceAu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JayceAu updated YARN-8031:
--
Attachment: (was: image-2018-03-15-14-47-30-583.png)

> NodeManager will fail to start if cpu subsystem is already mounted
> --
>
> Key: YARN-8031
> URL: https://issues.apache.org/jira/browse/YARN-8031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: JayceAu
>Priority: Major
>
> If *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true 
> and the cpu subsystem is not yet mounted, the NodeManager mounts the cpu 
> subsystem and, if the mount succeeds, creates the control group whose 
> default name is *hadoop-yarn*. This procedure works well when the cpu 
> subsystem is not yet mounted. However, in some situations the cpu subsystem 
> is already mounted before the NodeManager starts, and the NodeManager then 
> fails to start because it has no write permission to the *hadoop-yarn* 
> path. For example:
>  # OSes that use systemd, such as CentOS 7, mount the cpu subsystem by 
> default on machine startup
>  # a daemon that starts before the NodeManager may also rely on the cpu 
> subsystem being mounted; in our production environment we limit the cpu 
> usage of the monitoring and control agent, which starts on reboot
> To solve this problem, container-executor must be able to create the 
> control group *hadoop-yarn* either when it mounts the controller 
> successfully or when the controller is already mounted. Moreover, if the 
> cpu subsystem is co-mounted with other subsystems and is already mounted, 
> container-executor should use the latest mount point of the cpu subsystem 
> instead of the one provided by the NodeManager.






[jira] [Created] (YARN-8031) NodeManager will fail to start if cpu subsystem is already mounted

2018-03-15 Thread JayceAu (JIRA)
JayceAu created YARN-8031:
-

 Summary: NodeManager will fail to start if cpu subsystem is 
already mounted
 Key: YARN-8031
 URL: https://issues.apache.org/jira/browse/YARN-8031
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: JayceAu
 Attachments: image-2018-03-15-14-47-30-583.png

If *yarn.nodemanager.linux-container-executor.cgroups.mount* is set to true 
and the cpu subsystem is not yet mounted, the NodeManager mounts the cpu 
subsystem and, if the mount succeeds, creates the control group whose default 
name is *hadoop-yarn*. This procedure works well when the cpu subsystem is 
not yet mounted. However, in some situations the cpu subsystem is already 
mounted before the NodeManager starts, and the NodeManager then fails to 
start because it has no write permission to the *hadoop-yarn* path. For 
example:
 # OSes that use systemd, such as CentOS 7, mount the cpu subsystem by 
default on machine startup
 # a daemon that starts before the NodeManager may also rely on the cpu 
subsystem being mounted; in our production environment we limit the cpu usage 
of the monitoring and control agent, which starts on reboot

To solve this problem, container-executor must be able to create the control 
group *hadoop-yarn* either when it mounts the controller successfully or when 
the controller is already mounted. Moreover, if the cpu subsystem is 
co-mounted with other subsystems and is already mounted, container-executor 
should use the latest mount point of the cpu subsystem instead of the one 
provided by the NodeManager.






[jira] [Created] (YARN-6583) Hadoop-sls failed to start because of premature state of RM

2017-05-10 Thread JayceAu (JIRA)
JayceAu created YARN-6583:
-

 Summary: Hadoop-sls failed to start because of premature state of 
RM
 Key: YARN-6583
 URL: https://issues.apache.org/jira/browse/YARN-6583
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler-load-simulator
Affects Versions: 2.6.0
Reporter: JayceAu


During SLS startup, after startRM() in SLSRunner.start(), the 
BaseContainerTokenSecretManager may not yet have generated its own internal 
key, or the key may not yet be visible to other threads, so NM registration 
fails with the exception below and the whole SLS process crashes.

{noformat}
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:81)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.registerNodeManager(ResourceTrackerService.java:300)
at 
org.apache.hadoop.yarn.sls.nodemanager.NMSimulator.init(NMSimulator.java:105)
at org.apache.hadoop.yarn.sls.SLSRunner.startNM(SLSRunner.java:202)
at org.apache.hadoop.yarn.sls.SLSRunner.start(SLSRunner.java:143)
at org.apache.hadoop.yarn.sls.SLSRunner.main(SLSRunner.java:528)
17/05/11 10:21:06 INFO resourcemanager.ResourceManager: Recovery started
17/05/11 10:21:06 INFO recovery.ZKRMStateStore: Watcher event type: None with 
state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
{noformat}
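
One possible guard (an illustration, not a committed fix) is to block NM 
registration until the RM's container token secret manager has rolled its 
first master key. Note that getCurrentKey() throws the NPE seen in the trace 
above while the key is missing, so the sketch retries on that exception:

{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;

// Sketch: poll until the RM's container token master key is visible
// before SLSRunner.startNM() registers the simulated NodeManagers.
public class SlsStartupGuard {
  public static void waitForMasterKey(ResourceManager rm)
      throws InterruptedException {
    while (true) {
      try {
        if (rm.getRMContext().getContainerTokenSecretManager()
              .getCurrentKey() != null) {
          return; // key rolled and published; safe to register NMs
        }
      } catch (NullPointerException e) {
        // master key not rolled yet (the NPE from the trace above)
      }
      Thread.sleep(100);
    }
  }
}
{code}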


