Re: Cross join/cartesian product explanation

2015-11-13 Thread Rory Sawyer
Hi Gopal,

Thanks for the detailed response.

It’s really a very simple query that I’m trying to run:
select
a.a_id,
b.b_id,
count(*) as c
from
table_a a, 
table_b b
where
bloom_contains(a_id, b_id_bloom)
group by
a.a_id,
b.b_id;

Where “bloom_contains” is a custom UDF. The only changes I made were renaming 
the tables and columns. The sizes of the tables I’m running against are small — 
roughly 50-100Mb — but this query would need to be expanded to run on a table 
that is >100Gb (table_b would likely max out around 100Mb).

Any suggestions on how to approach this would be greatly appreciated.

Best,
Rory


Re: Cross join/cartesian product explanation

2015-11-09 Thread Rory Sawyer
Hi Gopal,

Thanks for the speedy response! A follow-up question though: 10Mb input sounds 
like that would work for a map join. I’m having trouble doing a cross join 
between two tables that are too big for a map-side join. Trying to break down 
one table into small enough partitions and then unioning them together seems to 
give comparable performance to a cross join. I’m running Hive on Map Reduce 
right now. Short of moving to a different execution engine, are there any 
performance improvements that can be made to lessen the pain of a cross join? 
Also, could you please elaborate on your comment “The real trouble is that 
MapReduce cannot re-direct data at all (there’s only shuffle edges)"? Thanks!

Best,
Rory



On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan"  wrote:

>
>> Over the last few week I¹ve been trying to use cross joins/cartesian
>>products and was wondering why, exactly, this all gets sent to one
>>reducer. All I¹ve heard or read is that Hive can¹t/doesn¹t parallelize
>>the job. 
>
>The hashcode of the shuffle key is 0, since you need to process every row
>against every key - there's no possibility of dividing up the work.
>
>Tez will actually have a cross-product edge (TEZ-2104), which is a
>distributed cross-product proposal but wasn't picked up in the last Google
>Summer of Code.
>
>The real trouble is that MapReduce cannot re-direct data at all (there's
>only shuffle edges).
>
>> Does anyone have a workaround?
>
>I use a staged partitioned table as a workaround for this, hashed on a
>high nDV key - the goal of the Tez edge is to shuffle the data similarly
>at runtime.
>
>For instance, this python script makes a query with a 19x improvement in
>distribution for a cross-product which generates 50+Gb of data from a
>~10Mb input.
>
>https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8
>
>
>It is possible for Hive-Tez to actually generate UNION VertexGroups, but
>it's much more efficient to do this as a edge with a custom EdgeManager,
>since that opens up potentially implementing ThetaJoins in hive using that.
>
>Cheers,
>Gopal
>
>


Cross join/cartesian product explanation

2015-11-06 Thread Rory Sawyer
Hi all,

Over the last few week I’ve been trying to use cross joins/cartesian products 
and was wondering why, exactly, this all gets sent to one reducer. All I’ve 
heard or read is that Hive can’t/doesn’t parallelize the job. Is there some 
code people can point me to? Does anyone have a workaround? Thanks!

Best,
Rory Sawyer