Hi Michael and Nitin- Thanks for your response. Some things to note: Michael- I will definitely try this method, it looks interesting.
Nitin - -Users Table and Product Tables are already unique. -I cannot partition the data, since the data is coming from already partitioned tables and I am doing a "monthly" overlap. Meaning I am looking at users and products for a month's worth of data -I am not sure what i means to be a "good" column to bucket, but I would think user (as a bigint) would be a good column. Others- The other question is do you think I could improve performance by using PIG LATIN instead of hive? Thanks, Brad On Mon, Jul 29, 2013 at 10:59 AM, Nitin Pawar <[email protected]>wrote: > Brad, > whats the cluster capacity you have got? > how many uniq values of a,b and c you have got individually in any of > the one table? > > > Is there any chance you can partition data? are there any columns you have > on which you can create buckets? > > I have done joins having 10 billion records in one table but other table > was significantly smaller. and I had a 1000 node cluster ad disposal > > > > > > On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman > <[email protected]>wrote: > >> Hi All- >> >> I have 2 tables: >> >> CREATE TABLE users ( >> a bigint, >> b int >> ) >> >> CREATE TABLE products ( >> a bigint, >> c int >> ) >> >> Each table has about 8 billion records (roughly 2k files total mappers). >> I want to know the most performant way to do the following query: >> >> SELECT u.b, >> p.c, >> count(*) as count >> FROM users u >> INNER JOIN products p >> ON u.a = p.a >> GROUP BY u.b, p.c >> >> Right now the reducing is killing me. Any suggestions on improving >> performance? Would a mapbucket join be optimal here? >> >> Thanks, >> Brad >> > > > > -- > Nitin Pawar >
