[ 
https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252164#comment-15252164
 ] 

Junping Du commented on YARN-4697:
----------------------------------

bq. My concern is that if we don't fix the root cause, though we've protected 
ourselves from crashes, we'd just be queueing a lot of aggregation processes 
and causing long waiting times.
Agreed. We do see the NM log aggregation service launch many active threads, 
which keep a large number of TCP connections open to DNs and use up the 
system's file limit. We can bound the number of threads in the shared pool 
here, but the TCP connection problem may not be solved by this patch.
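
To make the thread bound concrete, here is a minimal sketch of how a cached 
pool and a fixed-size pool behave when many blocking tasks are submitted at 
once. It is not the code from the patch: the class name PoolBoundSketch, the 
helper peakThreads, and the values 100 and 10 are all hypothetical, and the 
latch only stands in for a slow upload to HDFS.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolBoundSketch {

  // Submits `tasks` blocking jobs (stand-ins for per-application log
  // aggregators) and returns the largest number of threads the pool created.
  static int peakThreads(ThreadPoolExecutor pool, int tasks)
      throws InterruptedException {
    CountDownLatch release = new CountDownLatch(1);
    for (int i = 0; i < tasks; i++) {
      pool.execute(() -> {
        try {
          release.await(); // simulate an upload that has not finished yet
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    // All tasks are submitted and none can finish before the latch opens,
    // so the pool has already reached its maximum size for this workload.
    int peak = pool.getLargestPoolSize();
    release.countDown();
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return peak;
  }

  public static void main(String[] args) throws InterruptedException {
    int recoveredApps = 100; // hypothetical number of recovered applications
    int limit = 10;          // hypothetical pool size limit

    // Assumption: on OpenJDK both factory methods return a
    // ThreadPoolExecutor, so the casts below succeed.
    int cached = peakThreads(
        (ThreadPoolExecutor) Executors.newCachedThreadPool(), recoveredApps);
    int fixed = peakThreads(
        (ThreadPoolExecutor) Executors.newFixedThreadPool(limit),
        recoveredApps);

    System.out.println("cached pool peak threads: " + cached);
    System.out.println("fixed pool peak threads:  " + fixed);
  }
}
```

In this sketch the cached pool creates one thread per submitted task, while 
the fixed pool stops at its limit and queues the rest. That matches the 
concern quoted above: bounding the pool avoids the thread explosion, but the 
queued tasks still have to wait their turn.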

bq. Upon NM restart, NM will try to recover all applications and submit a log 
aggregation task to the thread pool for each application recovered. Therefore, 
a large number of recovered applications plus concurrent applications can cause 
the thread pool to increase without a bound.
Are all these applications active, or have they already finished? I suspect we 
are leaking finished applications in the NM state store during recovery. I 
noticed this issue when filing YARN-4325 but lost my progress because the 
long-running cluster I was using is gone. [~haibochen], could you check whether 
your case is the same?

In general, I think the fix on this JIRA is OK. But I agree with Vinod that we 
should dig further into the root cause, or there could be other holes (like the 
TCP connection leak mentioned above).

> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>
>                 Key: YARN-4697
>                 URL: https://issues.apache.org/jira/browse/YARN-4697
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>             Fix For: 2.9.0
>
>         Attachments: yarn4697.001.patch, yarn4697.002.patch, 
> yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a thread pool to upload logs from 
> the nodemanager to HDFS if log aggregation is turned on. This is a cached 
> thread pool, which according to the javadoc is an unlimited pool of threads.
> If we have had a problem with log aggregation, this could cause a problem on 
> restart. The number of threads created at that point could be huge and will 
> put a large load on the NameNode, and in the worst case could even bring it 
> down due to file descriptor issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
