Brad, whats the cluster capacity you have got? how many uniq values of a,b and c you have got individually in any of the one table?
Is there any chance you can partition data? are there any columns you have on which you can create buckets? I have done joins having 10 billion records in one table but other table was significantly smaller. and I had a 1000 node cluster ad disposal On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman <bruder...@radiumone.com>wrote: > Hi All- > > I have 2 tables: > > CREATE TABLE users ( > a bigint, > b int > ) > > CREATE TABLE products ( > a bigint, > c int > ) > > Each table has about 8 billion records (roughly 2k files total mappers). I > want to know the most performant way to do the following query: > > SELECT u.b, > p.c, > count(*) as count > FROM users u > INNER JOIN products p > ON u.a = p.a > GROUP BY u.b, p.c > > Right now the reducing is killing me. Any suggestions on improving > performance? Would a mapbucket join be optimal here? > > Thanks, > Brad > -- Nitin Pawar