Perhaps you can first create a temp table that contains only the records that will match? See the UNION ALL trick at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html
________________________________ From: Brad Ruderman <bruder...@radiumone.com> To: user@hive.apache.org Sent: Monday, July 29, 2013 11:38 AM Subject: Best Performance on Large Scale Join Hi All- I have 2 tables: CREATE TABLE users ( a bigint, b int ) CREATE TABLE products ( a bigint, c int ) Each table has about 8 billion records (roughly 2k files total mappers). I want to know the most performant way to do the following query: SELECT u.b, p.c, count(*) as count FROM users u INNER JOIN products p ON u.a = p.a GROUP BY u.b, p.c Right now the reducing is killing me. Any suggestions on improving performance? Would a mapbucket join be optimal here? Thanks, Brad