I was wondering if anybody had any thoughts on the best way to tackle SPARK-942 ( https://spark-project.atlassian.net/browse/SPARK-942 ).

Basically, Spark takes the iterator returned by a flatMap call and, because I've told it the RDD needs to be persisted, proceeds to push the whole thing into an array before deciding it doesn't have enough memory to keep it and trying to serialize it to disk; somewhere along the line it runs out of memory. In my particular case, the function returns an iterator that reads data out of a file, and the size of the files passed to that function varies greatly (from a few kilobytes to a few gigabytes).

The funny thing is that if I do a straight 'map' operation after the flatMap, everything works, because Spark just passes the iterator forward and never tries to expand the whole thing in memory. But I need to do a reduceByKey across all the records, so I'd like to persist to disk first, and that's where I hit this snag.

I've already set up a unit test that replicates the problem, and I know the area of the code that would need to be fixed. I'm just hoping for some tips on the best way to fix it.
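
For reference, here's a minimal sketch of the kind of pipeline I'm describing (the names fileNames and the word-count-style key are just placeholders, not my actual job):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source

    val sc = new SparkContext("local", "spark942-sketch")

    // Hypothetical list of input files; each one can be a few KB or a few GB.
    val fileNames = Seq("/data/part-0001", "/data/part-0002")

    val records = sc.parallelize(fileNames).flatMap { path =>
      // Returns an iterator that streams records lazily out of the file,
      // so nothing here forces the whole file into memory.
      Source.fromFile(path).getLines().map(line => (line, 1L))
    }

    // A plain map() after the flatMap is fine -- the iterator is just chained.
    // But asking Spark to persist (even DISK_ONLY) makes it unroll the whole
    // iterator into an in-memory array first, which is where the OOM hits.
    records.persist(StorageLevel.DISK_ONLY)

    val counts = records.reduceByKey(_ + _)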
Kyle
