Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Liquan Pei Wed, 08 Oct 2014 00:13:01 -0700

I am working on a PR that leverages the HashJoin trait code to optimize the
Left/Right outer join. It has already been tested locally, and I will send out
the PR soon after some cleanup.


Thanks,
Liquan

On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia <[email protected]>
wrote:

> I'm pretty sure inner joins in Spark SQL already build only one of the
> sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators.
> Only outer joins build both, and it seems like we could optimize those
> that are not full.
>
> Matei
>
>
>
> On Oct 7, 2014, at 11:04 PM, Haopu Wang <[email protected]> wrote:
>
> Liquan, yes, for full outer join, one hash table on each side is more
> efficient.
>
> For the left/right outer join, it looks like one hash table should be
> enough.
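
As a rough illustration of the full outer join case, here is a minimal Python sketch of the two-hash-table idea. It is not taken from Spark's source: the names build_hash_table and full_outer_hash_join are hypothetical, rows are plain tuples, and None stands in for SQL NULL (rows with a None key are assumed never to match).

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Group rows by join key (hypothetical helper, not a Spark API)."""
    table = defaultdict(list)
    for row in rows:
        table[key(row)].append(row)
    return table


def full_outer_hash_join(left_rows, right_rows, left_key, right_key):
    """Full outer join using one hash table per side."""
    left_table = build_hash_table(left_rows, left_key)
    right_table = build_hash_table(right_rows, right_key)

    # Visit every key seen on either side exactly once.
    keys = list(left_table) + [k for k in right_table if k not in left_table]

    result = []
    for k in keys:
        left_matches = left_table.get(k, [])
        right_matches = right_table.get(k, [])
        if left_matches and right_matches and k is not None:
            # Key present on both sides: emit every pairing.
            result.extend((l, r) for l in left_matches for r in right_matches)
        else:
            # Key present on one side only (or a None key): pad with None.
            result.extend((l, None) for l in left_matches)
            result.extend((None, r) for r in right_matches)
    return result
```

With both sides indexed, unmatched rows on either side fall out of the key lookup directly, which is why the second table helps in the full outer case.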
>
> ------------------------------
> *From:* Liquan Pei [mailto:[email protected] <[email protected]>]
> *Sent:* September 30, 2014 18:34
> *To:* Haopu Wang
> *Cc:* [email protected]; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> How about full outer join? One hash table may not be efficient for this
> case.
>
> Liquan
>
> On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <[email protected]> wrote:
> Hi, Liquan, thanks for the response.
>
> In your example, I think the hash table should be built on the "right"
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
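
The approach described above (build the hash table on the right side, stream the left side) can be sketched in Python as follows. This is only an illustration under stated assumptions, not Spark's implementation: left_outer_hash_join is a hypothetical name, rows are plain tuples, and None stands in for SQL NULL, so a None key never matches.

```python
from collections import defaultdict


def left_outer_hash_join(left_rows, right_rows, left_key, right_key):
    """Left outer join that builds a hash table on the right side only."""
    # Build phase: index the right side by join key, skipping None keys
    # because a NULL key cannot match anything.
    right_index = defaultdict(list)
    for r in right_rows:
        k = right_key(r)
        if k is not None:
            right_index[k].append(r)

    # Probe phase: stream the left side once; each row needs one lookup.
    result = []
    for l in left_rows:
        k = left_key(l)
        matches = right_index.get(k) if k is not None else None
        if matches:
            result.extend((l, r) for r in matches)
        else:
            # No match on the right: keep the left row, pad with None.
            result.append((l, None))
    return result
```

Every left row is emitted at least once during the probe, so no scan of the right side and no second hash table is needed to find the unmatched left rows.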
>
> ------------------------------
> *From:* Liquan Pei [mailto:[email protected]]
> *Sent:* September 30, 2014 12:31
> *To:* Haopu Wang
> *Cc:* [email protected]; user
> *Subject:* Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> My understanding is that the hash tables on both the left and right sides
> are used to include null values in the result efficiently. If a hash table
> is built on only one side, say the left side, and we perform a left outer
> join, then for each row on the left side, a scan over the right side is
> needed to confirm that there are no matching tuples for that row.
>
> Hope this helps!
> Liquan
>
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <[email protected]> wrote:
>
> I took a look at HashOuterJoin, and it builds a hash table for both
> sides.
>
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration over the streamed relation, right?
>
> Thanks!
>
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst


-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst
