It wasn't so much the cogroup that was optimized here, but what is
done to the result of the cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since flatMap will accept just an Iterator (actually, a
TraversableOnce).
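(For anyone following along, a minimal sketch of the shape of the
change being described; names are illustrative and this is not the
actual patch:)

    import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x
    import org.apache.spark.rdd.RDD

    // A join expressed via cogroup (illustrative; not the actual source).
    def joinLike(a: RDD[(String, Int)], b: RDD[(String, String)])
        : RDD[(String, (Int, String))] =
      a.cogroup(b).flatMapValues { case (vs, ws) =>
        // Eager version: for (v <- vs; w <- ws) yield (v, w) builds the
        // whole per-key cross product up front, since vs and ws are strict.
        // Lazy version: flatMapValues only needs a TraversableOnce, so an
        // Iterator works and the pairs are produced one at a time.
        for (v <- vs.iterator; w <- ws.iterator) yield (v, w)
      }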
> It wasn't so much the cogroup that was optimized here, but what is
> done to the result of the cogroup.

Right.

> Yes, it was a matter of not materializing the entire result of a
> flatMap-like function after the cogroup, since flatMap will accept
> just an Iterator (actually, a TraversableOnce).

Yeah...I was trying to poke around: are the Iterables that Spark passes
into cogroup already materialized (e.g. the bug was making a copy of an
already-in-memory list), or are the Iterables streaming?
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman
<stephen.haber...@gmail.com> wrote:

> Yeah...I was trying to poke around: are the Iterables that Spark
> passes into cogroup already materialized (e.g. the bug was making a
> copy of an already-in-memory list), or are the Iterables streaming?
The Iterable from cogroup is a CompactBuffer, which is already
materialized; it's not a lazy Iterable. So right now Spark cannot
handle skewed data where a single key has too many values to fit in
memory.
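(A quick way to confirm this for yourself; assumes a live SparkContext
named sc:)

    val a = sc.parallelize(Seq(("k", 1), ("k", 2)))
    val b = sc.parallelize(Seq(("k", "x")))
    a.cogroup(b).mapValues { case (vs, ws) =>
      // Both sides report org.apache.spark.util.collection.CompactBuffer:
      // every value for the key is in memory before user code runs.
      (vs.getClass.getName, ws.getClass.getName)
    }.collect().foreach(println)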
Hi Shixiong,

> The Iterable from cogroup is a CompactBuffer, which is already
> materialized; it's not a lazy Iterable. So right now Spark cannot
> handle skewed data where a single key has too many values to fit in
> memory.
Cool, thanks for the confirmation.
- Stephen
Hey,

I saw this commit go by, and find it fairly fascinating:

https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305

For two reasons: 1) we have a report that is bogging down in exactly
this sort of .join with lots of elements, so I'm glad to see the fix,
but, more interesting, I
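(A minimal, made-up reproduction of the kind of skewed .join being
described, again assuming a SparkContext sc:)

    // One hot key whose per-key cross product is 10,000 x 10,000 =
    // 100,000,000 pairs. Pre-fix, the entire result of the join's
    // flatMap-like step was materialized at once inside the task.
    val left  = sc.parallelize((1 to 10000).map(i => ("hot", i)))
    val right = sc.parallelize((1 to 10000).map(i => ("hot", i)))
    val joined = left.join(right)  // cogroup + flatMap over the two sides
    println(joined.count())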