flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi! I have been using Spark a lot recently and it's been running really well and fast, but now when I increase the data size, it's starting to run into problems: I have an RDD in the form of (String, Iterable[String]) - the Iterable[String] was produced by a groupByKey() - and I perform a

Re: flatMap and spilling of output to disk

2014-12-10 Thread Sean Owen
You are rightly thinking that Spark should be able to just stream this massive collection of pairs you are creating, and never need to put it all in memory. That's true, but your function actually creates a huge collection of pairs in memory before Spark ever touches it. This is going to
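A hedged reconstruction of the pattern being described (the function name and sample data are illustrative, not from the original thread): a flatMap function that builds every pair of a key's grouped values using a for-comprehension over a strict Iterable materializes the whole cross product before returning.

```scala
// Hypothetical stand-in for the function passed to flatMap: for each key,
// emit every ordered pair of its grouped values with a count of 1.
// Because `values` is a strict Iterable, the for-comprehension builds the
// entire |values|^2-element collection in memory before Spark sees any of it.
def allPairs(values: Iterable[String]): Iterable[((String, String), Int)] =
  for (v1 <- values; v2 <- values) yield ((v1, v2), 1)

val pairs = allPairs(Seq("a", "b", "c"))
// 3 values produce all 9 pairs at once, held in one collection
assert(pairs.size == 9)
```

With millions of values per key, that single in-memory collection is what overwhelms the executor, independent of anything Spark itself does.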

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
`for (v1 <- values; v2 <- values) yield ((v1, v2), 1)` will generate all data at once and return all of them to flatMap. To solve your problem, you should use `for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)`, which will generate the data when it's necessary. Best Regards,
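The suggested fix can be sketched as follows (wrapped in a helper function here purely for illustration): desugared, the for-comprehension becomes flatMap/map over Iterators, which are lazy, so each pair exists only while the consumer is handling it.

```scala
// Same pair generation, but over iterators: each pair is produced only when
// the consumer asks for the next element, so the full cross product never
// exists in memory at once. Note the inner `values.iterator` expression is
// re-evaluated for every v1, giving a fresh pass over the values.
def lazyPairs(values: Iterable[String]): Iterator[((String, String), Int)] =
  for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)

val it = lazyPairs(Seq("a", "b"))
assert(it.next() == (("a", "a"), 1)) // pairs arrive one at a time
```

Since flatMap in Spark accepts a TraversableOnce, an Iterator is a valid return value, and Spark can stream it straight into the shuffle without ever holding all the pairs at once.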

Re: flatMap and spilling of output to disk

2014-12-10 Thread Johannes Simon
Hi! Using an iterator solved the problem! I've been chewing on this for days, so thanks a lot to both of you!! :) Since in an earlier version of my code, I used a self-join to perform the same thing, and ran into the same problems, I just looked at the implementation of PairRDDFunctions.join
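The same strict-vs-lazy distinction applies to the join implementation being discussed. A hedged sketch of the per-key pairing step (function names are illustrative; this is the shape of the pattern, not Spark's actual source): given the two value groups a cogroup produces for a key, the strict form builds every (v, w) pair up front, while the iterator form, the change proposed in SPARK-4824, defers each pair until it is requested.

```scala
// Strict form: the for-comprehension over Iterables materializes the
// entire cross product of the two value groups for a key at once.
def joinStrict[V, W](vs: Iterable[V], ws: Iterable[W]): Iterable[(V, W)] =
  for (v <- vs; w <- ws) yield (v, w)

// Iterator form: pairs are generated lazily, one per request, so memory
// use stays constant regardless of how large the groups are.
def joinLazy[V, W](vs: Iterable[V], ws: Iterable[W]): Iterator[(V, W)] =
  for (v <- vs.iterator; w <- ws.iterator) yield (v, w)

assert(joinLazy(Seq(1, 2), Seq("x")).toList == List((1, "x"), (2, "x")))
```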

Re: flatMap and spilling of output to disk

2014-12-10 Thread Shixiong Zhu
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here: https://issues.apache.org/jira/browse/SPARK-4824 Best Regards, Shixiong Zhu