Best Performance on Large Scale Join

Brad Ruderman Mon, 29 Jul 2013 10:42:43 -0700

Hi All-

I have 2 tables:


CREATE TABLE users (
  a bigint,
  b int
);

CREATE TABLE products (
  a bigint,
  c int
);

Each table has about 8 billion records (roughly 2k files, so about 2k
mappers in total). I want to know the most performant way to run the
following query:

SELECT u.b,
       p.c,
       count(*) as count
FROM users u
INNER JOIN products p
ON u.a = p.a
GROUP BY u.b, p.c;

Right now the reduce phase is killing me. Any suggestions on improving
performance? Would a bucket map join be optimal here?
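
To make the question concrete, here is a rough, untested sketch of what the
bucketed approach might look like. The table names users_bucketed and
products_bucketed and the bucket count of 256 are placeholders I made up for
illustration, and the settings assume a Hive version that still uses
hive.enforce.bucketing / hive.enforce.sorting. Since a plain bucket map join
loads each matching bucket of one side into memory, the sort-merge variant
is probably the more realistic option at this size.

```sql
-- Make inserts honor the bucketing and sort order declared on the tables.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Hypothetical bucketed copies of the two tables, clustered and sorted on
-- the join key. Both sides use the same bucket count (256 is a guess).
CREATE TABLE users_bucketed (
  a bigint,
  b int
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (
  a bigint,
  c int
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

INSERT OVERWRITE TABLE users_bucketed
SELECT a, b FROM users;

INSERT OVERWRITE TABLE products_bucketed
SELECT a, c FROM products;

-- Ask Hive to join bucket-to-bucket on the map side (sort-merge variant).
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Same query as above, run against the bucketed tables.
SELECT /*+ MAPJOIN(p) */
       u.b,
       p.c,
       count(*) AS cnt
FROM users_bucketed u
INNER JOIN products_bucketed p
ON u.a = p.a
GROUP BY u.b, p.c;
```

This only moves the join itself to the map side. The GROUP BY on (b, c)
would still need a reduce step, and the one-time cost of rewriting both
tables into buckets has to be paid up front.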

Thanks,
Brad
