I am working on a PR to leverage the HashJoin trait code to optimize the Left/Right outer join. It's already been tested locally and will send out the PR soon after some clean up.
Thanks, Liquan On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote: > I'm pretty sure inner joins on Spark SQL already build only one of the > sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. > Only outer joins do both, and it seems like we could optimize it for those > that are not full. > > Matei > > > > On Oct 7, 2014, at 11:04 PM, Haopu Wang <hw...@qilinsoft.com> wrote: > > Liquan, yes, for full outer join, one hash table on both sides is more > efficient. > > For the left/right outer join, it looks like one hash table should be > enought. > > ------------------------------ > *From:* Liquan Pei [mailto:liquan...@gmail.com <liquan...@gmail.com>] > *Sent:* 2014年9月30日 18:34 > *To:* Haopu Wang > *Cc:* dev@spark.apache.org; user > *Subject:* Re: Spark SQL question: why build hashtable for both sides in > HashOuterJoin? > > Hi Haopu, > > How about full outer join? One hash table may not be efficient for this > case. > > Liquan > > On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <hw...@qilinsoft.com> wrote: > Hi, Liquan, thanks for the response. > > In your example, I think the hash table should be built on the "right" > side, so Spark can iterate through the left side and find matches in the > right side from the hash table efficiently. Please comment and suggest, > thanks again! > > ------------------------------ > *From:* Liquan Pei [mailto:liquan...@gmail.com] > *Sent:* 2014年9月30日 12:31 > *To:* Haopu Wang > *Cc:* dev@spark.apache.org; user > *Subject:* Re: Spark SQL question: why build hashtable for both sides in > HashOuterJoin? > > Hi Haopu, > > My understanding is that the hashtable on both left and right side is used > for including null values in result in an efficient manner. If hash table > is only built on one side, let's say left side and we perform a left outer > join, for each row in left side, a scan over the right side is needed to > make sure that no matching tuples for that row on left side. > > Hope this helps! > Liquan > > On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <hw...@qilinsoft.com> wrote: > > I take a look at HashOuterJoin and it's building a Hashtable for both > sides. > > This consumes quite a lot of memory when the partition is big. And it > doesn't reduce the iteration on streamed relation, right? > > Thanks! > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > -- > Liquan Pei > Department of Physics > University of Massachusetts Amherst > > > > -- > Liquan Pei > Department of Physics > University of Massachusetts Amherst > > > -- Liquan Pei Department of Physics University of Massachusetts Amherst