Re: Best Performance on Large Scale Join

Nitin Pawar Mon, 29 Jul 2013 10:59:59 -0700

Brad,
whats the cluster capacity you have got?
how many uniq values of a,b  and c  you have got individually in any of the
one table?

Is there any chance you can partition data? are there any columns you have
on which you can create buckets?

I have done joins having 10 billion records in one table but other table
was significantly smaller. and I had a 1000 node cluster ad disposal

On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman <bruder...@radiumone.com>wrote:

> Hi All-
>
> I have 2 tables:
>
> CREATE TABLE users (
> a bigint,
> b int
> )
>
> CREATE TABLE products (
> a bigint,
> c int
> )
>
> Each table has about 8 billion records (roughly 2k files total mappers). I
> want to know the most performant way to do the following query:
>
> SELECT u.b,
>               p.c,
>               count(*) as count
> FROM users u
> INNER JOIN products p
> ON u.a = p.a
> GROUP BY u.b, p.c
>
> Right now the reducing is killing me. Any suggestions on improving
> performance? Would a mapbucket join be optimal here?
>
> Thanks,
> Brad
>

-- 
Nitin Pawar

Re: Best Performance on Large Scale Join

Reply via email to