Good catch. `join` should use `Iterator`, too. I've opened a JIRA here:
https://issues.apache.org/jira/browse/SPARK-4824
Best Regards,
Shixiong Zhu
2014-12-10 21:35 GMT+08:00 Johannes Simon:
Hi!
Using an iterator solved the problem! I've been chewing on this for days, so
thanks a lot to both of you!! :)
Since an earlier version of my code used a self-join to perform the same
thing and ran into the same problems, I just looked at the implementation of
PairRDDFunctions.join.
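For reference, the implementation in question looks roughly like this in the
Spark 1.x source (paraphrased from memory, so treat the exact signature as
approximate):

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    // Both sides of the cogroup are Iterables, so this for-comprehension
    // builds the full cross product per key strictly rather than lazily.
    for (v <- pair._1; w <- pair._2) yield (v, w)
  }
}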
for (v1 <- values; v2 <- values) yield ((v1, v2), 1) generates all of the
pairs at once and hands the entire collection to flatMap.

To solve your problem, use for (v1 <- values.iterator; v2 <-
values.iterator) yield ((v1, v2), 1), which generates each pair only when
it is needed.
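To illustrate the difference outside of Spark (my own sketch; the sample
values are made up):

object IteratorVsIterable {
  def main(args: Array[String]): Unit = {
    val values: Iterable[String] = Seq("a", "b", "c")

    // Strict: the for-comprehension over an Iterable builds every pair
    // into a concrete collection before anything can consume it.
    val strict = for (v1 <- values; v2 <- values) yield ((v1, v2), 1)

    // Lazy: values.iterator yields one pair at a time; the inner
    // iterator is re-created for each element of the outer one.
    val lazyPairs = for (v1 <- values.iterator; v2 <- values.iterator)
      yield ((v1, v2), 1)

    // Same elements either way; only the strict version is ever held
    // in memory as a whole.
    assert(lazyPairs.toSeq == strict.toSeq)
  }
}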
Best Regards,
Shixiong Zhu
You are right to think that Spark should be able to just "stream" this
massive collection of pairs you are creating, and never need to hold it
all in memory. That's true, but your function actually creates a huge
collection of pairs in memory before Spark ever touches it. This is going
to materialize the entire cross product for each key at once.
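A quick way to see this materialization locally (my illustration; the
group size is invented):

object Materialization {
  def main(args: Array[String]): Unit = {
    // One key's values after a groupByKey(), simulated locally.
    val values: Iterable[Int] = 1 to 1000

    // This single expression allocates all 1,000,000 tuples up front,
    // before any consumer (such as Spark's flatMap) sees the first one.
    // A group of 10^5 values would mean 10^10 tuples and an OOM.
    val pairs = for (v1 <- values; v2 <- values) yield ((v1, v2), 1)
    println(pairs.size) // 1000000
  }
}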
Hi!

I have been using Spark a lot recently, and it has been running really well
and fast, but now that I am increasing the data size it is starting to run
into problems: I have an RDD of the form (String, Iterable[String]) - the
Iterable[String] was produced by a groupByKey() - and I perform a flatMap
over the values that yields ((v1, v2), 1) for every pair of values.
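Pieced together from the replies above, the code in question presumably
looked something like the following (a hedged reconstruction with the
suggested iterator fix already applied; the variable names and sample data
are mine):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations on older Spark
import org.apache.spark.rdd.RDD

object PairCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pair-counts").setMaster("local[*]"))

    val records: RDD[(String, String)] = sc.parallelize(Seq(
      ("k1", "a"), ("k1", "b"), ("k1", "c"),
      ("k2", "x"), ("k2", "y")))

    // (String, Iterable[String]) per key, as described in the question.
    val grouped: RDD[(String, Iterable[String])] = records.groupByKey()

    // With plain `values` in both generators, the for-comprehension would
    // build every ((v1, v2), 1) tuple for a key in memory before flatMap
    // emits anything; using values.iterator makes the pairs stream.
    val cooccurrences = grouped.flatMap { case (_, values) =>
      for (v1 <- values.iterator; v2 <- values.iterator)
        yield ((v1, v2), 1)
    }

    cooccurrences.take(5).foreach(println)
    sc.stop()
  }
}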