JayceAu created YARN-8493:
-----------------------------
Summary: LogAggregation in NodeManager is put off because great amount of long running app
Key: YARN-8493
URL: https://issues.apache.org/jira/browse/YARN-8493
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.6.0
Reporter: JayceAu
Fix For: 2.6.0
h2. Issue summary

In our YARN cluster, it takes 30 minutes on average for an app's logs to appear on the web UI after the app has finished. The cause is the limited size of the log aggregation thread pool in NodeManager.

NodeManager creates an appLogAggregator for each application that has containers running on that NodeManager. Each appLogAggregator occupies one thread in the thread pool until the application has finished across the whole cluster. NodeManager uses a FixedThreadPool (default size 100) instead of the CachedThreadPool used in older versions. At peak times in our production environment, more than 350 appLogAggregators are running or queued in the thread pool, and the queued apps suffer long log aggregation delays.

h2. Possible Solution

We could increase yarn.nodemanager.logaggregation.threadpool-size-max to a higher value. However, the problem will come back as the number of running apps grows, and a larger pool creates many idle threads that do nothing but wait for their applications to finish.

Our solution is to not submit an appLogAggregator to the thread pool until its application has finished:
# give each appLogAggregator a callback that submits it to the thread pool; the callback is not invoked until the aggregator is notified that its application has finished
# if rollingMonitorInterval is greater than 0, NodeManager sets aside a single thread in LogAggregationService that periodically performs log aggregation for all running apps
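The proposed design can be sketched in plain Java as follows. This is a minimal illustration under simplifying assumptions, not the actual NodeManager code: DeferredLogAggregationSketch, AppAggregator, onAppStarted, onAppFinished, uploadRolling and uploadFinal are hypothetical names standing in for LogAggregationService and appLogAggregator, and "uploads" just set counters and flags. It shows a fixed-size pool that only receives an aggregator once the application-finished callback fires, plus one scheduled thread that performs rolling aggregation for every running app when the interval is greater than 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of deferred log aggregation; names are illustrative, not the YARN API.
public class DeferredLogAggregationSketch {

    // Stand-in for a per-application aggregator; it does not own a thread while the app runs.
    static final class AppAggregator {
        final String appId;
        final AtomicInteger rollingUploads = new AtomicInteger();
        volatile boolean finalUploadDone = false;

        AppAggregator(String appId) { this.appId = appId; }

        void uploadRolling() { rollingUploads.incrementAndGet(); } // periodic partial upload
        void uploadFinal() { finalUploadDone = true; }             // final upload after the app finishes
    }

    private final ExecutorService finalPool;       // bounded pool, like threadpool-size-max
    private final ScheduledExecutorService roller; // single rolling thread, or null if disabled
    private final Map<String, AppAggregator> running = new ConcurrentHashMap<>();

    DeferredLogAggregationSketch(int poolSize, long rollingIntervalMs) {
        finalPool = Executors.newFixedThreadPool(poolSize);
        if (rollingIntervalMs > 0) {
            // One thread walks all running apps instead of one blocked thread per app.
            roller = Executors.newSingleThreadScheduledExecutor();
            roller.scheduleWithFixedDelay(
                    () -> running.values().forEach(AppAggregator::uploadRolling),
                    rollingIntervalMs, rollingIntervalMs, TimeUnit.MILLISECONDS);
        } else {
            roller = null;
        }
    }

    // App starts on this node: register its aggregator without occupying a pool thread.
    AppAggregator onAppStarted(String appId) {
        AppAggregator agg = new AppAggregator(appId);
        running.put(appId, agg);
        return agg;
    }

    // Callback fired when the app finishes: only now is the aggregator submitted to the pool.
    void onAppFinished(String appId) {
        AppAggregator agg = running.remove(appId);
        if (agg != null) {
            finalPool.execute(agg::uploadFinal);
        }
    }

    void shutdown() throws InterruptedException {
        if (roller != null) {
            roller.shutdownNow();
        }
        finalPool.shutdown();
        finalPool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // 350 concurrent apps with only 100 pool threads, as in the peak scenario above.
        DeferredLogAggregationSketch svc = new DeferredLogAggregationSketch(100, 50);
        List<AppAggregator> aggs = new ArrayList<>();
        for (int i = 0; i < 350; i++) {
            aggs.add(svc.onAppStarted("app_" + i));
        }
        Thread.sleep(200); // let the rolling thread run a few passes
        for (int i = 0; i < 350; i++) {
            svc.onAppFinished("app_" + i);
        }
        svc.shutdown();
        long done = aggs.stream().filter(a -> a.finalUploadDone).count();
        System.out.println("final uploads completed: " + done);
    }
}
```

In this sketch the pool size bounds the number of concurrent final uploads rather than the number of concurrently running apps, so 350 running apps never queue behind 100 blocked threads. scheduleWithFixedDelay is used so that a slow rolling pass delays the next one instead of triggering back-to-back catch-up runs.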