I was wondering if anybody had any thoughts on the best way to tackle SPARK-942 ( https://spark-project.atlassian.net/browse/SPARK-942 ).

Basically, Spark takes the iterator returned by a flatMap call and, because I've told it the RDD needs to be persisted, proceeds to push the whole thing into an array before deciding it doesn't have enough memory to keep it and trying to serialize it to disk; somewhere along the line it runs out of memory. In my particular case, the function returns an iterator that reads data out of a file, and the size of the files passed to that function varies greatly (from a few kilobytes to a few gigabytes).

The funny thing is that if I do a straight 'map' operation after the flatMap, everything works, because Spark just passes the iterator forward and never tries to expand the whole thing in memory. But I need to do a reduceByKey across all the records, so I'd like to persist to disk first, and that's where I hit this snag.

I've already set up a unit test that replicates the problem, and I know the area of the code that would need to be fixed. I'm just hoping for some tips on the best way to fix it.
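
For reference, here's a minimal sketch of the kind of pipeline I'm describing (the names fileNames and the word-count-style key are just placeholders, not my actual job):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source

    val sc = new SparkContext("local", "spark942-sketch")

    // Hypothetical list of input files; each one can be a few KB or a few GB.
    val fileNames = Seq("/data/part-0001", "/data/part-0002")

    val records = sc.parallelize(fileNames).flatMap { path =>
      // Returns an iterator that streams records lazily out of the file,
      // so nothing here forces the whole file into memory.
      Source.fromFile(path).getLines().map(line => (line, 1L))
    }

    // A plain map() after the flatMap is fine -- the iterator is just chained.
    // But asking Spark to persist (even DISK_ONLY) makes it unroll the whole
    // iterator into an in-memory array first, which is where the OOM hits.
    records.persist(StorageLevel.DISK_ONLY)

    val counts = records.reduceByKey(_ + _)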
Kyle
