[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375089#comment-15375089 ]

Jason Lowe commented on YARN-5370:
----------------------------------

It's expected behavior in the sense that the debug delay setting causes the NM 
to hold every deletion task in memory for up to the specified amount of time.  
100 days is a long time, so if there are many deletions within that period the 
NM has to buffer a lot of tasks, as you saw in the heap dump.
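
To make the buffering concrete, here is a minimal, self-contained Java sketch 
(not the actual DeletionService code; the class and path names are made up) 
showing how tasks scheduled far in the future simply accumulate in a 
ScheduledThreadPoolExecutor's queue, which appears to be the executor behind 
the DelServiceSchedThreadPoolExecutor entry in the heap dump:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Simplified sketch, not the actual DeletionService code: each "deletion"
// scheduled with a long delay stays queued inside the executor, together
// with everything its Runnable references, until the delay expires.
public class DelayedDeletionSketch {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor exec = new ScheduledThreadPoolExecutor(4);
        long delaySec = TimeUnit.DAYS.toSeconds(100);  // debug-delay-sec ~ 100 days

        for (int i = 0; i < 100_000; i++) {
            final String path = "/tmp/nm-local-dir/usercache/app_" + i;  // made-up path
            // The executor's internal delayed work queue retains this task
            // object for the full 100 days before it ever runs.
            exec.schedule(() -> System.out.println("deleting " + path),
                          delaySec, TimeUnit.SECONDS);
        }

        System.out.println("pending tasks: " + exec.getQueue().size());
        exec.shutdownNow();  // discard the queued tasks so the demo exits
    }
}

At cluster scale, with a deletion task or two per finished container, that 
queue keeps growing for the entire 100 days.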

The debug delay is, as the name implies, for debugging.  If you set it to a 
very large value then, depending upon the amount of container churn on the 
cluster, a correspondingly large heap will be required given the way it works 
today.  It's not typical to set this to a very large value since it only needs 
to be large enough to give someone a chance to examine/copy off the requisite 
files after reproducing the issue.  Normally it doesn't take someone 100 days 
to get around to examining the files after a problem occurs. ;-)
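
For comparison, a hedged sketch of a more typical setting; the property name 
is the one discussed in this issue, but the one-hour value and the 
programmatic style are only illustrative, since the setting normally goes 
into yarn-site.xml on the NodeManager hosts:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DebugDelaySetting {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // One hour is just an example: long enough to inspect or copy the
        // container directories after reproducing a problem, short enough
        // that pending deletion tasks never pile up for months.
        conf.setLong("yarn.nodemanager.delete.debug-delay-sec", 3600L);
        System.out.println(conf.getLong("yarn.nodemanager.delete.debug-delay-sec", 0L));
    }
}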

Theoretically we could extend the functionality to spill tasks to disk or do 
something more clever with how they are stored to reduce the memory pressure, 
but I question the cost/benefit tradeoff.  Again this is a feature intended 
just for debugging.  I'm also not a big fan of putting in an arbitrary limit on 
the value.  If someone wants to store files for a few years and has the heap 
size and disk space to hold all that, who are we to stop them from trying?


> Setting yarn.nodemanager.delete.debug-delay-sec to a high number crashes NM 
> because of OOM
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-5370
>                 URL: https://issues.apache.org/jira/browse/YARN-5370
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev 
> cluster for certain reasons about 3-4 weeks ago. Since then, the NM has 
> crashed at times because of OOM. As a temporary fix, I have gradually 
> increased the heap from 512 MB to 6 GB over the past few weeks whenever a 
> crash occurred. Sometimes the NM does not start smoothly and only begins 
> functioning after multiple tries. While analyzing the heap dump of the 
> corresponding JVM, I found that DeletionService.java occupies almost 99% of 
> the total allocated memory (-Xmx), something like this:
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor
>  @ 0x6c1d09068| 80 | 3,544,094,696 | 99.13%
> Basically, a huge number of the above-mentioned tasks are scheduled for 
> deletion. Usually I see NM memory requirements of 2-4 GB for large 
> clusters; in my case the cluster is very small, yet OOM still occurs.
> Is this expected behaviour? Or is there any limit we can enforce on 
> yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?
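
As a rough illustration of what that retained size implies (the per-task 
footprint below is an assumption, not a measurement from this dump):

// Back-of-envelope only: the per-task footprint is an assumed figure,
// not something measured from this heap dump.
public class PendingTaskEstimate {
    public static void main(String[] args) {
        long retainedBytes = 3_544_094_696L;  // retained size reported above
        long assumedBytesPerTask = 2_000L;    // guess: paths, contexts, futures
        System.out.printf("roughly %,d pending deletion tasks%n",
                retainedBytes / assumedBytesPerTask);
        // With a 100+ day delay, every deletion task created in that window
        // is still queued, so the count (and heap) grows with container churn.
    }
}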


