[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375089#comment-15375089 ]

Jason Lowe commented on YARN-5370:
----------------------------------

It's expected behavior in the sense that the debug delay setting causes the NM to buffer every deletion task for up to the specified amount of time. 100 days is a lot of time, so if there are many deletions within that period it has to buffer a lot of tasks, as you saw in the heap dump.

The debug delay is, as the name implies, for debugging. If you set it to a very large value then, depending upon the amount of container churn on the cluster, a correspondingly large heap will be required given the way it works today. It's not typical to set this to a very large value, since it only needs to be large enough to give someone a chance to examine or copy off the requisite files after reproducing the issue. Normally it doesn't take someone 100 days to get around to examining the files after a problem occurs. ;-)

Theoretically we could extend the functionality to spill tasks to disk, or do something more clever with how they are stored, to reduce the memory pressure, but I question the cost/benefit tradeoff. Again, this is a feature intended just for debugging. I'm also not a big fan of putting in an arbitrary limit on the value. If someone wants to store files for a few years and has the heap size and disk space to hold all that, who are we to stop them from trying?

> Setting yarn.nodemanager.delete.debug-delay-sec to a high number crashes the NM because of OOM
> -----------------------------------------------------------------------------------------------
>
>            Key: YARN-5370
>            URL: https://issues.apache.org/jira/browse/YARN-5370
>        Project: Hadoop YARN
>     Issue Type: Bug
>       Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster for various reasons. This was done 3-4 weeks ago. Since then, the NM has crashed at times because of OOM, so as a temporary fix I have gradually increased the heap from 512 MB to 6 GB over the past few weeks whenever a crash occurred. Sometimes the NM won't start smoothly and only begins functioning after multiple tries. While analyzing the heap dump of the corresponding JVM, I found that DeletionService.java occupies almost 99% of the total allocated memory (-Xmx), something like this:
>
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
>
> Basically, there is a huge number of the above-mentioned tasks scheduled for deletion. Usually I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very small and OOM still occurs.
>
> Is this expected behaviour? Or is there a limit we can impose on yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
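
[Editorial note] The heap dump class name (DeletionService$DelServiceSchedThreadPoolExecutor) suggests the deletion tasks sit in a scheduled executor's work queue until the debug delay expires, which is the buffering behavior described in the comment above. Below is a minimal standalone sketch, not YARN's actual code, showing how tasks scheduled with a long delay accumulate in a plain java.util.concurrent.ScheduledThreadPoolExecutor; the 100-day delay and the task count are illustrative values, and the class name is hypothetical.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Standalone illustration (not YARN code): tasks scheduled with a long delay
 * are held in the executor's internal queue until the delay expires, so with
 * a very large debug delay the queue, and therefore the heap, keeps growing
 * with container churn.
 */
public class DelayedDeletionSketch {
    public static void main(String[] args) {
        // Illustrative stand-ins: a 100-day delay and a burst of cleanups.
        long debugDelaySec = TimeUnit.DAYS.toSeconds(100);
        int deletionsScheduled = 100_000;

        ScheduledThreadPoolExecutor sched = new ScheduledThreadPoolExecutor(4);

        for (int i = 0; i < deletionsScheduled; i++) {
            final int taskId = i;
            // Each "deletion" stays queued until its delay elapses; with a
            // 100-day delay none of these will run for a long time.
            sched.schedule(
                () -> System.out.println("would delete container dir " + taskId),
                debugDelaySec, TimeUnit.SECONDS);
        }

        // Every scheduled task is still resident on the heap.
        System.out.println("tasks buffered in memory: " + sched.getQueue().size());
        sched.shutdownNow();
    }
}
```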