Best Performance on Large Scale Join

2013-07-29 Thread Brad Ruderman
Hi All-

I have 2 tables:

CREATE TABLE users (
a bigint,
b int
)

CREATE TABLE products (
a bigint,
c int
)

Each table has about 8 billion records (roughly 2k files total mappers). I
want to know the most performant way to do the following query:

SELECT u.b,
  p.c,
  count(*) as count
FROM users u
INNER JOIN products p
ON u.a = p.a
GROUP BY u.b, p.c

Right now the reducing is killing me. Any suggestions on improving
performance? Would a mapbucket join be optimal here?

Thanks,
Brad


Re: Best Performance on Large Scale Join

2013-07-29 Thread Nitin Pawar
Brad,
whats the cluster capacity you have got?
how many uniq values of a,b  and c  you have got individually in any of the
one table?


Is there any chance you can partition data? are there any columns you have
on which you can create buckets?

I have done joins having 10 billion records in one table but other table
was significantly smaller. and I had a 1000 node cluster ad disposal





On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman bruder...@radiumone.comwrote:

 Hi All-

 I have 2 tables:

 CREATE TABLE users (
 a bigint,
 b int
 )

 CREATE TABLE products (
 a bigint,
 c int
 )

 Each table has about 8 billion records (roughly 2k files total mappers). I
 want to know the most performant way to do the following query:

 SELECT u.b,
   p.c,
   count(*) as count
 FROM users u
 INNER JOIN products p
 ON u.a = p.a
 GROUP BY u.b, p.c

 Right now the reducing is killing me. Any suggestions on improving
 performance? Would a mapbucket join be optimal here?

 Thanks,
 Brad




-- 
Nitin Pawar


Re: Best Performance on Large Scale Join

2013-07-29 Thread Michael Malak
Perhaps you can first create a temp table that contains only the records that 
will match?  See the UNION ALL trick at
http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html




 From: Brad Ruderman bruder...@radiumone.com
To: user@hive.apache.org 
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
 


Hi All-

I have 2 tables:

CREATE TABLE users (
a bigint,
b int
)

CREATE TABLE products (
a bigint,
c int
)

Each table has about 8 billion records (roughly 2k files total mappers). I want 
to know the most performant way to do the following query:

SELECT u.b,
              p.c,
              count(*) as count
FROM users u
INNER JOIN products p
ON u.a = p.a
GROUP BY u.b, p.c

Right now the reducing is killing me. Any suggestions on improving performance? 
Would a mapbucket join be optimal here?

Thanks,
Brad

Re: Best Performance on Large Scale Join

2013-07-29 Thread Brad Ruderman
Hi Michael and Nitin-
Thanks for your response. Some things to note:

Michael-
I will definitely try this method, it looks interesting.

Nitin -
-Users Table and Product Tables are already unique.
-I cannot partition the data, since the data is coming from already
partitioned tables and I am doing a monthly overlap. Meaning I am looking
at users and products for a month's worth of data
-I am not sure what i means to be a good column to bucket, but I would
think user (as a bigint) would be a good column.

Others-
The other question is do you think I could improve performance by using PIG
LATIN instead of hive?

Thanks,
Brad



On Mon, Jul 29, 2013 at 10:59 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 Brad,
 whats the cluster capacity you have got?
 how many uniq values of a,b  and c  you have got individually in any of
 the one table?


 Is there any chance you can partition data? are there any columns you have
 on which you can create buckets?

 I have done joins having 10 billion records in one table but other table
 was significantly smaller. and I had a 1000 node cluster ad disposal





 On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman 
 bruder...@radiumone.comwrote:

 Hi All-

 I have 2 tables:

 CREATE TABLE users (
 a bigint,
 b int
 )

 CREATE TABLE products (
 a bigint,
 c int
 )

 Each table has about 8 billion records (roughly 2k files total mappers).
 I want to know the most performant way to do the following query:

 SELECT u.b,
   p.c,
   count(*) as count
 FROM users u
 INNER JOIN products p
 ON u.a = p.a
 GROUP BY u.b, p.c

 Right now the reducing is killing me. Any suggestions on improving
 performance? Would a mapbucket join be optimal here?

 Thanks,
 Brad




 --
 Nitin Pawar