Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
It wasn't so much the cogroup that was optimized here, but what is done to the result of cogroup. Yes, it was a matter of not materializing the entire result of a flatMap-like function after the cogroup, since this will accept just an Iterator (actually, TraversableOnce). I'd say that wherever
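As a rough sketch of the pattern being described (not the commit's actual code), a manual join over cogroup could be written like this; `users` and `orders` are hypothetical pair RDDs, and the first import is what Spark 1.x needs for pair-RDD operations:

    import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x
    import org.apache.spark.rdd.RDD

    def manualJoin(users: RDD[(Int, String)],
                   orders: RDD[(Int, Double)]): RDD[(Int, (String, Double))] =
      users.cogroup(orders).flatMapValues { case (names, amounts) =>
        // flatMapValues only needs a TraversableOnce, so yielding from the
        // .iterator views keeps the per-key cross product lazy.
        for (name <- names.iterator; amount <- amounts.iterator) yield (name, amount)
        // The same for-comprehension over the Iterables themselves would build
        // the entire cross product for a key in memory before returning it.
      }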

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
It wasn't so much the cogroup that was optimized here, but what is done to the result of cogroup. Right. Yes, it was a matter of not materializing the entire result of a flatMap-like function after the cogroup, since this will accept just an Iterator (actually, TraversableOnce). Yeah...I

Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Yeah...I was trying to poke around, are the Iterables that Spark passes into cogroup already materialized (e.g. the bug was making a copy of an already-in-memory list) or are the Iterables streaming? The result

Re: recent join/iterator fix

2014-12-29 Thread Shixiong Zhu
The Iterable from cogroup is a CompactBuffer, which is already materialized; it's not a lazy Iterable. So Spark still cannot handle skewed data where a single key has too many values to fit in memory.
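A quick way to see this on a small local example (reusing the hypothetical `users`/`orders` pair RDDs from the sketch earlier in the thread, with a local master so the println output is visible) is to print the runtime class of the grouped sides:

    users.cogroup(orders).foreach { case (key, (names, amounts)) =>
      // Both sides arrive as fully built, array-backed buffers (CompactBuffer
      // in Spark 1.x), not lazy streams, so a skewed key still has to hold all
      // of its values in memory before the flatMap-like step runs.
      println(s"$key -> ${names.getClass.getSimpleName}, ${amounts.getClass.getSimpleName}")
    }

Only the join's output is produced lazily after the fix; the grouped inputs themselves are still materialized.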

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
Hi Shixiong, The Iterable from cogroup is a CompactBuffer, which is already materialized; it's not a lazy Iterable. So Spark still cannot handle skewed data where a single key has too many values to fit in memory. Cool, thanks for the confirmation. - Stephen

recent join/iterator fix

2014-12-28 Thread Stephen Haberman
Hey, I saw this commit go by and found it fairly fascinating: https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305 For two reasons: 1) we have a report that is bogging down exactly in a .join with lots of elements, so, glad to see the fix, but, more interesting I
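For anyone hitting the same slowdown, a cheap diagnostic (just a hedged sketch with made-up names, reusing the hypothetical `orders` RDD and imports from the sketch near the top of the thread, not anything from the commit) is to count rows per key on the larger input and see whether a few hot keys dominate:

    // Take the ten hottest keys by row count on the larger input.
    val hotKeys = orders
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)
    hotKeys.foreach { case (key, count) => println(s"key $key has $count rows") }

A few very hot keys mean large per-key cross products, which is the case the iterator change helps with on the output side, though the cogroup buffers for those keys still have to fit in memory, as noted up-thread.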