Re: Best Performance on Large Scale Join

Brad Ruderman Mon, 29 Jul 2013 11:38:45 -0700

Hi Michael and Nitin-
Thanks for your response. Some things to note:

Michael-
I will definitely try this method, it looks interesting.


Nitin -
-Users Table and Product Tables are already unique.
-I cannot partition the data, since the data is coming from already
partitioned tables and I am doing a "monthly" overlap. Meaning I am looking
at users and products for a month's worth of data
-I am not sure what i means to be a "good" column to bucket, but I would
think user (as a bigint) would be a good column.

Others-
The other question is do you think I could improve performance by using PIG
LATIN instead of hive?

Thanks,
Brad



On Mon, Jul 29, 2013 at 10:59 AM, Nitin Pawar <[email protected]>wrote:

> Brad,
> whats the cluster capacity you have got?
> how many uniq values of a,b  and c  you have got individually in any of
> the one table?
>
>
> Is there any chance you can partition data? are there any columns you have
> on which you can create buckets?
>
> I have done joins having 10 billion records in one table but other table
> was significantly smaller. and I had a 1000 node cluster ad disposal
>
>
>
>
>
> On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman 
> <[email protected]>wrote:
>
>> Hi All-
>>
>> I have 2 tables:
>>
>> CREATE TABLE users (
>> a bigint,
>> b int
>> )
>>
>> CREATE TABLE products (
>> a bigint,
>> c int
>> )
>>
>> Each table has about 8 billion records (roughly 2k files total mappers).
>> I want to know the most performant way to do the following query:
>>
>> SELECT u.b,
>>               p.c,
>>               count(*) as count
>> FROM users u
>> INNER JOIN products p
>> ON u.a = p.a
>> GROUP BY u.b, p.c
>>
>> Right now the reducing is killing me. Any suggestions on improving
>> performance? Would a mapbucket join be optimal here?
>>
>> Thanks,
>> Brad
>>
>
>
>
> --
> Nitin Pawar
>

Re: Best Performance on Large Scale Join

Reply via email to