My understanding of off-heap storage was that you'd still need to bring that data back on-heap as JVM objects in order to actually use it with map, filter, etc. Would we be trading CPU time for memory efficiency if we went down the off-heap storage route? I'm not sure what discussions have already happened here or what kind of implementation we're talking about.
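For comparison, serialized in-heap storage already makes a similar trade today: the blocks are stored compactly as bytes, but every pass over the RDD pays deserialization CPU. A minimal sketch of that trade-off (the app name and data sizes here are just illustrative, not from this thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object SerializedCacheSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "ser-cache-sketch")

        val data = sc.parallelize(1 to 10000000)

        // MEMORY_ONLY would keep deserialized JVM objects: fast to access,
        // but several times larger in memory than the serialized form.
        // MEMORY_ONLY_SER stores serialized bytes: much smaller, but every
        // map/filter pass pays deserialization cost -- the same kind of
        // CPU-for-memory trade-off an off-heap store would make.
        val cached = data.persist(StorageLevel.MEMORY_ONLY_SER)

        val evens = cached.filter(_ % 2 == 0).count()  // deserializes on access
        println(s"even count: $evens")

        sc.stop()
      }
    }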
On Mon, Feb 10, 2014 at 2:28 AM, Rafal Kwasny <m...@entropy.be> wrote:

> Hi Everyone,
> Maybe it's a good time to reevaluate off-heap storage for RDD's with
> custom allocator?
>
> On a few occasions recently I had to lower both
> spark.storage.memoryFraction and spark.shuffle.memoryFraction
> spark.shuffle.spill helps a bit with large scale reduces
>
> Also it could be you're hitting:
> https://github.com/apache/incubator-spark/pull/180
>
> /Rafal
>
>
> Andrew Ash wrote:
>
> I dropped down to 0.5 but still OOM'd, so sent it all the way to 0.1 and
> didn't get an OOM. I could tune this some more to find where the cliff is,
> but this is a one-off job so now that it's completed I don't want to spend
> any more time tuning it.
>
> Is there a reason that this value couldn't be dynamically adjusted in
> response to actual heap usage?
>
> I can imagine a scenario where spending too much time in GC (descending
> into GC hell) drops the value a little to keep from OOM, or directly
> measuring how much of the heap is spent on this scratch space and adjusting
> appropriately.
>
>
> On Sat, Feb 8, 2014 at 3:40 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> This probably means that there’s not enough free memory for the “scratch”
>> space used for computations, so we OOM before the Spark cache decides that
>> it’s full and starts to spill stuff. Try reducing
>> spark.storage.memoryFraction (default is 0.66, try 0.5).
>>
>> Matei
>>
>> On Feb 5, 2014, at 10:29 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> // version 0.9.0
>>
>> Hi Spark users,
>>
>> My understanding of the MEMORY_AND_DISK_SER persistence level was that if
>> an RDD could fit into memory then it would be left there (same as
>> MEMORY_ONLY), and only if it was too big for memory would it spill to disk.
>> Here's how the docs describe it:
>>
>> MEMORY_AND_DISK_SER  Similar to MEMORY_ONLY_SER, but spill partitions
>> that don't fit in memory to disk instead of recomputing them on the fly
>> each time they're needed.
>>
>> https://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
>>
>> What I'm observing though is that really large RDDs are actually causing
>> OOMs. I'm not sure if this is a regression in 0.9.0 or if it has been this
>> way for some time.
>>
>> While I look through the source code, has anyone actually observed the
>> correct spill to disk behavior rather than an OOM?
>>
>> Thanks!
>> Andrew
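For reference, the tuning discussed above (lowering the storage and shuffle fractions, keeping shuffle spill on, and persisting with the serialized spill-to-disk level) would look roughly like this. This is only a sketch: the fraction values are the ones mentioned in the thread, not a recommendation, and the input path is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object MemoryFractionSketch {
      def main(args: Array[String]): Unit = {
        // Leave more of the heap free for per-task "scratch" space, as Matei
        // suggests: shrink the cache fraction from its default and cap the
        // shuffle fraction, with shuffle spilling left enabled.
        val conf = new SparkConf()
          .setAppName("memory-fraction-sketch")
          .setMaster("local[4]")
          .set("spark.storage.memoryFraction", "0.5")
          .set("spark.shuffle.memoryFraction", "0.2")
          .set("spark.shuffle.spill", "true")

        val sc = new SparkContext(conf)

        // MEMORY_AND_DISK_SER: partitions that don't fit in the cache
        // fraction are supposed to spill to disk instead of being recomputed.
        val big = sc.textFile("hdfs:///path/to/large/input")  // hypothetical path
          .persist(StorageLevel.MEMORY_AND_DISK_SER)

        println(big.count())

        sc.stop()
      }
    }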