[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-4927.
------------------------------
    Resolution: Cannot Reproduce

At the moment I've tried to reproduce this a few ways and wasn't able to. It may have been fixed somehow since. It can be reopened if there is a reproduction vs 1.3+.

> Spark does not clean up properly during long jobs.
> ---------------------------------------------------
>
>                 Key: SPARK-4927
>                 URL: https://issues.apache.org/jira/browse/SPARK-4927
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Ilya Ganelin
>
> On a long-running Spark job, Spark will eventually run out of memory on the
> driver node due to metadata overhead from the shuffle operation. Spark will
> continue to operate, but with drastically decreased performance (since
> swapping now occurs with every operation).
> The spark.cleaner.ttl parameter allows a user to configure when cleanup
> happens, but the issue with doing this is that it isn't done safely. For
> example, if this clears a cached RDD or active task in the middle of
> processing a stage, it ultimately causes a KeyNotFoundException when the
> next stage attempts to reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata
> that allows the program to continue running.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
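For context, the workaround the reporter refers to is the periodic-cleanup setting available in Spark 1.x. A minimal sketch of enabling it at submit time, assuming a hypothetical application class and jar (the TTL value is illustrative, in seconds; as the report notes, an aggressive TTL can evict RDDs that are still in use):

```shell
# Sketch only: spark.cleaner.ttl periodically removes old metadata
# (shuffle state, cached RDDs older than the TTL). Class name and jar
# are hypothetical placeholders, not from the original report.
spark-submit \
  --class com.example.LongRunningJob \
  --conf spark.cleaner.ttl=3600 \
  long-running-job.jar
```

Note that this setting existed in the Spark 1.x line; later releases moved to reference-based cleanup of unreachable RDDs and shuffle state, which is the kind of "sustainable mechanism" the report asks for.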