Best Performance on Large Scale Join
Hi All-

I have 2 tables:

    CREATE TABLE users (
      a BIGINT,
      b INT
    );

    CREATE TABLE products (
      a BIGINT,
      c INT
    );

Each table has about 8 billion records (roughly 2k files, i.e. 2k total mappers). I want to know the most performant way to run the following query:

    SELECT u.b, p.c, COUNT(*) AS count
    FROM users u
    INNER JOIN products p ON u.a = p.a
    GROUP BY u.b, p.c;

Right now the reduce phase is killing me. Any suggestions on improving performance? Would a map/bucket join be optimal here?

Thanks,
Brad
Re: Best Performance on Large Scale Join
Brad,

What's the cluster capacity you have got? How many unique values of a, b, and c do you have individually in either table? Is there any chance you can partition the data? Are there any columns on which you can create buckets?

I have done joins with 10 billion records in one table, but the other table was significantly smaller, and I had a 1000-node cluster at my disposal.

On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman bruder...@radiumone.com wrote:

--
Nitin Pawar
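One way to act on the bucketing suggestion is to rewrite both tables bucketed and sorted on the join key `a`, then enable Hive's sort-merge-bucket (SMB) map join so matching buckets are joined map-side instead of shuffling 16 billion rows to reducers. A hedged sketch, assuming Hive 0.10-era settings; the table names with the `_b` suffix and the bucket count of 256 are illustrative choices, not anything from the thread:

    -- Rewrite both tables bucketed and sorted on the join key.
    SET hive.enforce.bucketing = true;
    SET hive.enforce.sorting = true;

    CREATE TABLE users_b (a BIGINT, b INT)
    CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

    CREATE TABLE products_b (a BIGINT, c INT)
    CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

    INSERT OVERWRITE TABLE users_b SELECT a, b FROM users;
    INSERT OVERWRITE TABLE products_b SELECT a, c FROM products;

    -- Enable the sort-merge-bucket map join for the actual query.
    SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    SET hive.optimize.bucketmapjoin = true;
    SET hive.optimize.bucketmapjoin.sortedmerge = true;

    SELECT u.b, p.c, COUNT(*) AS count
    FROM users_b u
    JOIN products_b p ON u.a = p.a
    GROUP BY u.b, p.c;

Both tables must use the same bucket count (or one a multiple of the other) and be sorted on the same key for the SMB join to kick in; the up-front rewrite costs one full pass over each table.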
Re: Best Performance on Large Scale Join
Perhaps you can first create a temp table that contains only the records that will match? See the UNION ALL trick at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html

From: Brad Ruderman bruder...@radiumone.com
To: user@hive.apache.org
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
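One reading of the UNION ALL trick is to avoid the shuffle join entirely: tag rows from both tables, UNION ALL them, and collapse by key with a GROUP BY, which replaces the join reducers with group-by reducers. This sketch assumes `a` is unique within each table (as Brad confirms later in the thread), so MAX() simply picks the single non-NULL value per key; the table name `user_product` is illustrative:

    -- Collapse the two tables into one row per key with a GROUP BY
    -- instead of a join. Assumes `a` is unique within each table.
    CREATE TABLE user_product AS
    SELECT a,
           MAX(b) AS b,
           MAX(c) AS c
    FROM (
      SELECT a, b, CAST(NULL AS INT) AS c FROM users
      UNION ALL
      SELECT a, CAST(NULL AS INT) AS b, c FROM products
    ) unioned
    GROUP BY a;

    -- Keys present in only one table have a NULL on the other side;
    -- keep the matches and aggregate.
    SELECT b, c, COUNT(*) AS count
    FROM user_product
    WHERE b IS NOT NULL AND c IS NOT NULL
    GROUP BY b, c;

If `a` is not unique, MAX() would silently drop values, so the uniqueness assumption matters here.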
Re: Best Performance on Large Scale Join
Hi Michael and Nitin-

Thanks for your responses. Some things to note:

Michael - I will definitely try this method; it looks interesting.

Nitin -
- The users and products tables are already unique.
- I cannot partition the data, since the data is coming from already-partitioned tables and I am doing a monthly overlap, meaning I am looking at users and products for a month's worth of data.
- I am not sure what it means to be a good column to bucket on, but I would think the user id (as a BIGINT) would be a good column.

Others - The other question is: do you think I could improve performance by using Pig Latin instead of Hive?

Thanks,
Brad

On Mon, Jul 29, 2013 at 10:59 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
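Since the pain is in the reduce phase, it is worth checking whether a few hot values of `a` are funneling most of the join output through a handful of reducers before switching engines. A hedged sketch of a quick skew check plus Hive's built-in skew-join mitigation; the threshold value is illustrative:

    -- Quick diagnostic: are a few join keys carrying most of the rows?
    SELECT a, COUNT(*) AS cnt
    FROM users
    GROUP BY a
    ORDER BY cnt DESC
    LIMIT 20;

    -- If so, let Hive split skewed keys into a follow-up map join.
    SET hive.optimize.skewjoin = true;
    SET hive.skewjoin.key = 100000;  -- rows per key before it counts as skewed (illustrative)

If the keys are evenly distributed, the bottleneck is raw shuffle volume rather than skew, and the bucketing or UNION ALL approaches above the join are the more promising direction than moving to Pig, which would run essentially the same MapReduce shuffle.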