Hi, Liquan, thanks for the response.
In your example, I think the hash table should be built on the "right" side, so that Spark can iterate through the left side and find matches on the right side efficiently via the hash table. Please comment and suggest, thanks again!

________________________________
From: Liquan Pei [mailto:liquan...@gmail.com]
Sent: September 30, 2014, 12:31
To: Haopu Wang
Cc: d...@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Hi Haopu,

My understanding is that the hash tables on both the left and right sides are used to include null values in the result efficiently. If a hash table were built on only one side, say the left side, and we performed a left outer join, then for each row on the left side a scan over the right side would be needed to confirm that there are no matching tuples for that row.

Hope this helps!
Liquan

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
> I took a look at HashOuterJoin, and it builds a hash table for both sides.
> This consumes quite a lot of memory when the partition is big. And it
> doesn't reduce the iteration on the streamed relation, right? Thanks!

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
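To make the point concrete, here is a minimal sketch (in plain Python, not Spark's actual implementation) of the scheme Haopu suggests: build a hash table on the right side only, then stream the left side. The row layouts, function names, and key extractors are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def hash_left_outer_join(left, right, key_left, key_right):
    """Sketch of a left outer join with a hash table on the right side only."""
    # Build phase: hash the right (build) side by join key.
    table = defaultdict(list)
    for row in right:
        table[key_right(row)].append(row)
    # Probe phase: stream the left side once; each row either joins
    # with its matches or is emitted with None (a null right side).
    out = []
    for row in left:
        matches = table.get(key_left(row))
        if matches:
            out.extend((row, m) for m in matches)
        else:
            out.append((row, None))
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (1, "y")]
result = hash_left_outer_join(left, right, lambda r: r[0], lambda r: r[0])
```

Note that this sketch handles a left outer join with a single pass over the streamed side. Per Liquan's explanation, the harder case is emitting the *unmatched rows of the build side* (as a full outer join requires), which is one reason an implementation might keep hash tables, or at least match bookkeeping, for both sides.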