RE: Efficient Equality Joins of Large Tables

java8964 Tue, 10 Jun 2014 06:39:11 -0700

I agree that the originally request is not very clear. 
>From my understanding, the reference_id is very unique in both Ad load and Ad 
>click tables, but both tables could contain huge amount of data. (But in 
>theory, click table should be much smaller than the load table, right? But 
>let's just assume that click table is also big enough to not fit into memory).
In this case, you have to use reducer side join. I don't know any other trick 
can make it faster. If you have millions (or even billions) of unique 
reference_ids in both tables, you just need a lot of reducers, as the 
reference_id will be your key.
Make sure you give the query all the reducers available in your cluster, and 
make sure MR job use them all. The default HashPartitioner should generate 
enough reducer groups, if your unique reference_id count is large enough.
If there are other optimize ways, I am happy to hear it.
Yong

From: furcy....@flaminem.com
Date: Tue, 10 Jun 2014 10:16:53 +0200
Subject: Re: Efficient Equality Joins of Large Tables
To: user@hive.apache.org

Hi Mark,
I still don't understand if you are trying to perform your join for one 
specific known reference_id or for all of them?In the first case, you should 
apply a pre-filter on each table first with subqueries, and this would leave 
you with only a few rows for the join.

In the second case, you could try applying a GROUP BY on each table first 
(again with subqueries) which would leave you with a few rows for the join 
too.This second method might work or not depending on what you are trying to 
compute. 

For complex GROUP BY operations, cleverly combining the UDFs collect_set with 
split might help too...
If it does't work for you, then I would need more details on what you are 
willing to compute precisely...

Regards,
Furcy

2014-06-09 20:30 GMT+02:00 Mark Desnoyer <desno...@neon-lab.com>:

Hi Furcy,
Thanks for the reply. I looked at MapJoin but it won't do what I need because 
all the tables will be large and actually, explicitly going through the entire 
table in an n^2 fashion is very inefficient.

I have large tables, but the intersection is very small. In the Ad Click case, 
I would have two streams of data: one for loads and one for clicks. When a page 
is rendered, a unique reference id is generated, which gets tagged to the load 
and any click that might happen. So, if I had two tables, one for the loads and 
one for the clicks, the reference id in each table could be used to link the 
click that happened to its associated load. In other words, for a given 
reference id, there is probably only one load and a small number of clicks. So, 
my join would look some like:

SELECT <some cols> FROM loads LEFT JOIN clicks ON loads.reference_id = 
clicks.reference_id;
In the MapReduce framework, I could use that reference id as the map output key 
in order to collect those events that are associated with each other and then 
the reducer would generate the rows by doing the join with the small number of 
rows that it received. Technically, you could take this approach for any 
equality join because you're using the partitioning step in order to enforce 
the join predicate instead of actually evaluating it. This could be much more 
efficient in some cases, so I was wondering if Hive can do it and if so, how 
can I trigger it?

Cheers,Mark

On Mon, Jun 9, 2014 at 10:34 AM, Furcy Pin <furcy....@flaminem.com> wrote:

Hi Mark,
I'm not sure if I understand what your trying to do correctly, do you know the 
reference id on which you want to do the join beforehand? 

Or is one of your tables small? 

Or are they all big with a small intersection?
I you haven't yet, I would suggest you to have a look at MapJoin: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Mapjoins allow hive to perform a join on the map side by putting the small 
table in cache.
(depending on your hive version, this might or might not be enabled by default)

Depending on your use case, you might have to pre-filter some tables, for that 
you can use sub-queries like:

FROM (   SELECT id FROM T WHERE id = myID ) T1

JOIN T2 ON T1.id=T2.idSELECT * 
Unlike a standard sql that would simplifies the query, I believe this will 
force Hive to perform the sub-select with a first mapreduce, and apply the join 
in the second

(as a mapjoin if the result of the subquery is small enough).
Hope this helps,
Furcy

2014-06-09 18:48 GMT+02:00 Mark Desnoyer <desno...@neon-lab.com>:

Hi,

I was wondering if there was a way in Hive to trigger it to perform an 
efficient equality join on large tables? Specifically, I have two or more 
tables where the joined key is relatively rare in each table. A good example 
would be an AdClick scenario where you would have two tables, one for ad loads 
and one for ad clicks, where there is a reference id to connect a click to a 
specific ad load and that's the key to join on. 

In the MapReduce framework, this join could be done efficiently by using the 
reference id as the key from the map output and then the join would be done in 
the reducer since there are a small number of rows associated with each 
reference id. However, hive implements join by buffering all but the last 
tables in the reducer, and then streams the last table through. This works 
great if the first tables are relatively small, but if the tables are large and 
the associated rows are rare, it's very inefficient.

So, can hive perform the join the first way I described? If so, any idea how I 
trigger it? Or do I have to write my own MR job?

Cheers,Mark

RE: Efficient Equality Joins of Large Tables

Reply via email to