[ https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683946#comment-16683946 ]
Thomas Graves commented on YARN-8991:
-------------------------------------

If this happens while the application is still running, you should file this with Spark. It is very similar to https://issues.apache.org/jira/browse/SPARK-17233. The Spark external shuffle service doesn't support that at this point. The problem is that a Spark executor may run on one host, generate some map output data to shuffle, and then exit because it is no longer needed. When a reduce task starts, it just talks to the YARN NodeManager and the external shuffle service to get the map output. At that point there is no executor left on the node to clean up the shuffle output. Support would have to be added, for example for the driver to tell the Spark external shuffle service to clean up. If you don't use dynamic allocation and the external shuffle service, it should clean up properly.

> nodemanager not cleaning blockmgr directories inside appcache
> --------------------------------------------------------------
>
>                 Key: YARN-8991
>                 URL: https://issues.apache.org/jira/browse/YARN-8991
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Hidayat Teonadi
>            Priority: Major
>         Attachments: yarn-nm-log.txt
>
> Hi, I'm running Spark on YARN and have enabled the Spark Shuffle Service. I'm
> noticing that during the lifetime of my Spark Streaming application, the NM
> appcache folder is building up with blockmgr directories (filled with
> shuffle_*.data).
> Looking into the NM logs, it seems like the blockmgr directories are not part
> of the cleanup process of the application. Eventually the disk will fill up and
> the app will crash. I have both
> {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and
> {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think it's
> a configuration issue.
> What is stumping me is that the executor ID listed by Spark during the external
> shuffle block registration doesn't match the executor ID listed in YARN's NM
> log.
> Maybe this executor ID disconnect explains why the cleanup is not done?
> I'm assuming that blockmgr directories are supposed to be cleaned up?
>
> {noformat}
> 2018-11-05 15:01:21,349 INFO
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered
> executor AppExecId{appId=application_1541045942679_0193, execId=1299} with
> ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42],
> subDirsPerLocalDir=64,
> shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
> {noformat}
>
> This seems similar to https://issues.apache.org/jira/browse/YARN-7070, although
> I'm not sure if the behavior I'm seeing is Spark-use related.
> [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files]
> has a stop-gap solution of cleaning up via cron.
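The cron-based stop gap mentioned above can be sketched roughly as below. This is a hypothetical illustration, not the script from the linked Stack Overflow post: the NodeManager local-dir layout is modeled on the path in the log excerpt, but the root directory, the 1-day retention window, and matching only `blockmgr-*` directories are all assumptions you would adapt to your cluster. The demo runs against a throwaway temp directory so it is self-contained.

```shell
# Hypothetical stop-gap cleanup for leaked shuffle dirs (assumed layout,
# assumed 1-day retention). Demonstrated against a temp dir, not a real NM.
NM_LOCAL=$(mktemp -d)                       # stand-in for e.g. /mnt1/yarn/nm
APPDIR="$NM_LOCAL/usercache/u1/appcache/application_0000000000000_0001"
mkdir -p "$APPDIR/blockmgr-deadbeef"

# Age the directory past the assumed retention window (GNU touch -d).
touch -d '2 days ago' "$APPDIR/blockmgr-deadbeef"

# The cron job body: delete blockmgr-* dirs not modified for 1440+ minutes.
find "$NM_LOCAL"/usercache/*/appcache/application_*/blockmgr-* \
  -maxdepth 0 -type d -mmin +1440 -exec rm -rf {} +
```

Note this only mitigates the symptom for long-running (e.g. streaming) apps; per the comment above, running without dynamic allocation (`spark.dynamicAllocation.enabled=false`) avoids the leak entirely, at the cost of losing elastic executor scaling.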