Best Performance on Large Scale Join

2013-07-29 Thread Brad Ruderman
Hi All,
I have 2 tables:
CREATE TABLE users ( a bigint, b int )
CREATE TABLE products ( a bigint, c int )
Each table has about 8 billion records (roughly 2k files, i.e. about 2k mappers in total). I want to know the most performant way to do the following query:
SELECT u.b, p.c,

Re: Best Performance on Large Scale Join

2013-07-29 Thread Nitin Pawar
Brad, what's the cluster capacity you have? How many unique values of a, b and c do you have individually in either table? Is there any chance you can partition the data? Are there any columns on which you can create buckets? I have done joins having 10 billion records in one
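To illustrate the bucketing question raised here, the following is a minimal HiveQL sketch, not taken from the thread. It assumes the join key is column a (the original query is truncated, so this is a guess), uses hypothetical table names users_bucketed and products_bucketed, picks an arbitrary bucket count of 256, and relies on settings as they existed in Hive releases of that era (2013).

```sql
-- Hypothetical bucketed, sorted copies of the two tables.
-- Both must use the same bucketing column and a compatible bucket count.
CREATE TABLE users_bucketed (a BIGINT, b INT)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (a BIGINT, c INT)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

-- Make Hive honor the bucketing and sorting declared above when loading.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;

-- Ask the optimizer for a sort-merge bucket map join.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Assumed join on a; the projection matches the visible part of the query.
SELECT u.b, p.c
FROM users_bucketed u
JOIN products_bucketed p ON u.a = p.a;
```

Rewriting both tables into buckets is itself a full pass over the data, so this would likely pay off only if the bucketed copies are reused across several joins.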

Re: Best Performance on Large Scale Join

2013-07-29 Thread Michael Malak
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
Hi All- I have 2 tables:
CREATE TABLE users ( a bigint, b int )
CREATE TABLE products ( a bigint, c int )
Each table has about 8 billion records (roughly 2k files total mappers). I want to know the most

Re: Best Performance on Large Scale Join

2013-07-29 Thread Brad Ruderman
Hi Michael and Nitin,
Thanks for your response. Some things to note:
Michael - I will definitely try this method; it looks interesting.
Nitin -
- The Users and Products tables are already unique.
- I cannot partition the data, since the data is coming from already partitioned tables and I am doing