Hi All-
I have 2 tables:
CREATE TABLE users (
  a bigint,
  b int
);
CREATE TABLE products (
  a bigint,
  c int
);
Each table has about 8 billion records (roughly 2k files, so about 2k
mappers in total). I want to know the most performant way to do the
following query:
SELECT u.b,
p.c,
Brad,
What cluster capacity do you have?
How many unique values of a, b, and c are there in each table?
Is there any chance you can partition the data? Are there any columns
on which you can create buckets?
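To make the bucketing question concrete, here is a minimal HiveQL sketch. It assumes the join key is column a (the original query is cut off, so that is a guess). The table names users_bucketed and products_bucketed and the bucket count of 256 are hypothetical. The SET properties are the ones Hive releases of this period use for bucketed and sort-merge-bucket joins; newer releases may handle some of them differently.

```sql
-- Hypothetical bucketed copies of the two tables, clustered and sorted on
-- the assumed join key a. Both tables use the same bucket count.
CREATE TABLE users_bucketed (
  a BIGINT,
  b INT
)
CLUSTERED BY (a) SORTED BY (a ASC) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (
  a BIGINT,
  c INT
)
CLUSTERED BY (a) SORTED BY (a ASC) INTO 256 BUCKETS;

-- Make the inserts respect the bucket and sort definitions.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE users_bucketed
SELECT a, b FROM users;

INSERT OVERWRITE TABLE products_bucketed
SELECT a, c FROM products;

-- Ask Hive to use a sort-merge-bucket map join where it can.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Assumed join shape; the real query may select more columns.
SELECT u.b, p.c
FROM users_bucketed u
JOIN products_bucketed p ON u.a = p.a;
```

Populating the bucketed copies costs one extra pass over each table, so this probably only pays off if the join is run more than once.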
I have done joins having 10 billion records in one
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
Hi Michael and Nitin-
Thanks for your response. Some things to note:
Michael-
I will definitely try this method, it looks interesting.
Nitin -
-The users and products tables are already unique.
-I cannot partition the data, since the data is coming from already
partitioned tables and I am doing