Hi!
I have been using Spark a lot recently and it has been running really well and
fast, but now that I've increased the data size, it's starting to run into problems:
I have an RDD in the form of (String, Iterable[String]) - the Iterable[String]
was produced by a groupByKey() - and I perform a flatMap over it that yields a
pair ((v1, v2), 1) for every combination of values within a key.
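If it helps, the per-key step I'm describing looks roughly like this in plain Scala (collections standing in for the RDD, sample data and key names made up):

```scala
// Plain-Scala stand-in for the (String, Iterable[String]) RDD produced by
// groupByKey(); a real job would do this inside an RDD flatMap.
val grouped: Map[String, Iterable[String]] =
  Map("user1" -> Seq("a", "b", "c"))

// Emit one ((v1, v2), 1) pair for every ordered combination of values
// within a key -- quadratic in the group size, which is what hurts at scale.
val pairs = grouped.toSeq.flatMap { case (_, values) =>
  for (v1 <- values; v2 <- values) yield ((v1, v2), 1)
}

println(pairs.size) // 3 values for this key -> 9 pairs
```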
You are rightly thinking that Spark should be able to just stream
this massive collection of pairs you are creating, and never need to
put it all in memory. That's true, but your function actually creates
the entire huge collection of pairs in memory before Spark ever touches it.
This is going to run out of memory as soon as a single key's group gets large.
for (v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all the data
at once and return it all to flatMap.
To solve your problem, you should use for (v1 <- values.iterator; v2 <-
values.iterator) yield ((v1, v2), 1), which will generate the data only when it's
needed.
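To see the difference concretely, here is a small self-contained sketch (plain Scala, with a counter added only to observe when the yield body actually runs):

```scala
val values = Seq("a", "b", "c")

// Strict: an Iterable-based for-comprehension builds the whole collection
// of pairs before anything downstream sees it.
val strict = for (v1 <- values; v2 <- values) yield ((v1, v2), 1)

// Lazy: the same comprehension over .iterator desugars to
// values.iterator.flatMap(...), so pairs are produced one at a time as
// the consumer pulls them.
var produced = 0
val lazily =
  for (v1 <- values.iterator; v2 <- values.iterator)
    yield { produced += 1; ((v1, v2), 1) }

println(produced)    // 0 -- nothing generated yet
val first = lazily.next()
println(produced)    // 1 -- exactly one pair generated, on demand
println(strict.size) // 9 -- all pairs were built eagerly
```

Inside Spark's flatMap, that laziness is exactly what lets the pairs be streamed instead of held per-key in memory.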
Best Regards,
Hi!
Using an iterator solved the problem! I've been chewing on this for days, so
thanks a lot to both of you!! :)
Since an earlier version of my code used a self-join to do the same
thing and ran into the same problems, I just looked at the implementation of
PairRDDFunctions.join, and it appears to have the same issue.
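For reference, the per-key step of a cogroup-based join can be sketched like this (a hypothetical helper for illustration, not the actual Spark source):

```scala
// Per-key step of a cogroup-style join: pair every left value with every
// right value. joinValues is a made-up stand-in for what
// PairRDDFunctions.join does per key, not the real Spark code.
def joinValues[V, W](vs: Iterable[V], ws: Iterable[W]): Iterator[(V, W)] =
  // Iterators keep the |vs| * |ws| cross product from being materialized
  // all at once; a strict for (v <- vs; w <- ws) yield (v, w) would build
  // it in memory first -- the same blow-up as the flatMap above.
  for (v <- vs.iterator; w <- ws.iterator) yield (v, w)

val out = joinValues(Seq(1, 2), Seq("x", "y")).toList
println(out) // List((1,x), (1,y), (2,x), (2,y))
```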
Good catch. `join` should use `Iterator`, too. I opened a JIRA here:
https://issues.apache.org/jira/browse/SPARK-4824
Best Regards,
Shixiong Zhu
2014-12-10 21:35 GMT+08:00 Johannes Simon johannes.si...@mail.de: