tiers of caching

2014-07-07 Thread Koert Kuipers
I noticed that some algorithms, such as GraphX, liberally cache RDDs for
efficiency, which makes sense. However, this can also leave a long trail of
unused yet still-cached RDDs that might push other RDDs out of memory.

In a long-lived Spark context, I would like to decide which RDDs stick
around. Would it make sense to create tiers of caching that distinguish
RDDs explicitly cached by the application from RDDs temporarily cached by
algorithms, so that these temporary caches can't push application RDDs out
of memory?
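
The distinction Koert describes can be sketched with a toy priority model (plain Python, not Spark code; the tier names, capacity unit, and eviction rule are illustrative assumptions, not anything Spark implements):

```python
# Toy model of a two-tier block cache: entries cached "temporarily" by a
# library are evicted before entries cached explicitly by the application.
# Capacity is counted in entries for simplicity.

APP, LIB = 0, 1  # application tier vs. library (temporary) tier

class TieredCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # list of (name, tier), oldest first

    def put(self, name, tier):
        self.entries.append((name, tier))
        while len(self.entries) > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Evict the oldest library-tier entry first; touch an
        # application-tier entry only if no library entries remain.
        for i, (name, tier) in enumerate(self.entries):
            if tier == LIB:
                del self.entries[i]
                return
        del self.entries[0]

    def cached(self):
        return [name for name, _ in self.entries]

cache = TieredCache(capacity=2)
cache.put("app_rdd", APP)      # explicitly cached by the application
cache.put("graphx_tmp1", LIB)  # temporarily cached by a library
cache.put("graphx_tmp2", LIB)  # evicts graphx_tmp1, not app_rdd
print(cache.cached())          # ['app_rdd', 'graphx_tmp2']
```

Under this policy a library can cache as aggressively as it likes without ever displacing blocks the application pinned itself.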


Re: tiers of caching

2014-07-07 Thread Ankur Dave
I think tiers/priorities for caching are a very good idea and I'd be
interested to see what others think. In addition to letting libraries cache
RDDs liberally, it could also unify memory management across other parts of
Spark. For example, small shuffles benefit from explicitly keeping the
shuffle outputs in memory rather than writing them to disk, possibly due to
filesystem overhead. To prevent in-memory shuffle outputs from competing
with application RDDs, Spark could mark them as lower-priority and specify
that they should be dropped to disk when memory runs low.

Ankur http://www.ankurdave.com/
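
Ankur's drop-to-disk idea can be modeled the same way (again plain Python, not Spark's actual memory manager; the priority labels and budget unit are assumptions for illustration): when the in-memory budget is exceeded, low-priority blocks such as in-memory shuffle outputs are spilled to disk instead of competing with application RDD blocks.

```python
# Toy model: a store with a fixed in-memory budget that spills
# low-priority blocks to disk when memory runs low, and touches
# high-priority blocks only as a last resort.

HIGH, LOW = 0, 1

class SpillingStore:
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory = {}   # name -> priority, insertion-ordered
        self.disk = set()  # names spilled to disk

    def put(self, name, priority):
        self.memory[name] = priority
        while len(self.memory) > self.memory_budget:
            self._spill_one()

    def _spill_one(self):
        # Spill the oldest low-priority block if any exist; otherwise
        # spill the oldest high-priority block.
        low = [n for n, p in self.memory.items() if p == LOW]
        victim = low[0] if low else next(iter(self.memory))
        del self.memory[victim]
        self.disk.add(victim)

store = SpillingStore(memory_budget=2)
store.put("app_rdd", HIGH)
store.put("shuffle_0", LOW)
store.put("shuffle_1", LOW)   # budget exceeded: shuffle_0 drops to disk
print(sorted(store.memory))   # ['app_rdd', 'shuffle_1']
print(store.disk)             # {'shuffle_0'}
```

The key design point is that spilled blocks are demoted rather than lost, so shuffle outputs stay correct while application RDDs keep their memory.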


Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a
related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur
brings up a good point in that any current implementation of in-memory
shuffles will compete with application RDD blocks. I think we should
definitely add this at some point. In terms of a timeline, however, we
already have many features lined up for 1.1, so it will likely come after that.

