tiers of caching
I noticed that some algorithms, such as those in GraphX, liberally cache RDDs for efficiency, which makes sense. However, this can also leave a long trail of unused yet cached RDDs that may push other RDDs out of memory. In a long-lived Spark context, I would like to decide which RDDs stick around. Would it make sense to create tiers of caching, to distinguish RDDs explicitly cached by the application from RDDs temporarily cached by algorithms, so that these temporary caches don't push application RDDs out of memory?
Re: tiers of caching
I think tiers/priorities for caching are a very good idea, and I'd be interested to see what others think. In addition to letting libraries cache RDDs liberally, it could also unify memory management across other parts of Spark. For example, small shuffles benefit from explicitly keeping the shuffle outputs in memory rather than writing them to disk, possibly due to filesystem overhead. To prevent in-memory shuffle outputs from competing with application RDDs, Spark could mark them as lower-priority and specify that they should be dropped to disk when memory runs low. Ankur http://www.ankurdave.com/
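To make the proposal concrete, here is a minimal plain-Scala sketch of the eviction semantics being described. This is not Spark code and none of these names exist in Spark; it only models a block store where low-priority blocks (library-internal caches, in-memory shuffle outputs) are dropped to disk before any high-priority application blocks are touched.

```scala
// Hypothetical sketch, not actual Spark internals: a two-tier block store.
object TieredCache {
  sealed trait Priority
  case object High extends Priority // application-cached RDD blocks
  case object Low  extends Priority // temporary/library caches, shuffle outputs

  case class Block(id: String, size: Long, priority: Priority)

  final class Store(val capacity: Long) {
    private var blocks = Vector.empty[Block]  // insertion order approximates LRU
    var droppedToDisk = Vector.empty[Block]   // blocks evicted from memory

    private def used: Long = blocks.map(_.size).sum

    def put(b: Block): Unit = {
      // Evict low-priority blocks first (oldest first); only fall back to
      // high-priority blocks when no low-priority block remains.
      def evictOne(): Boolean = {
        val victim = blocks.find(_.priority == Low).orElse(blocks.headOption)
        victim match {
          case Some(v) =>
            blocks = blocks.filterNot(_ eq v)
            droppedToDisk :+= v
            true
          case None => false
        }
      }
      while (used + b.size > capacity && evictOne()) {}
      if (used + b.size <= capacity) blocks :+= b
    }

    def inMemory: Set[String] = blocks.map(_.id).toSet
  }
}
```

With a 100-unit store holding a 60-unit application block and a 30-unit shuffle block, caching another 40-unit application block evicts the shuffle block rather than either application block, which is exactly the behavior the tiering is meant to guarantee.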
Re: tiers of caching
Others have also asked for this on the mailing list, and hence there's a related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur raises a good point: any in-memory shuffle implementation would compete with application RDD blocks. I think we should definitely add this at some point. In terms of timeline, however, we already have many features lined up for 1.1, so it will likely land after that. 2014-07-07 10:13 GMT-07:00 Ankur Dave ankurd...@gmail.com: > I think tiers/priorities for caching are a very good idea and I'd be interested to see what others think. [...]