> Over the last few week I¹ve been trying to use cross joins/cartesian
>products and was wondering why, exactly, this all gets sent to one
>reducer. All I¹ve heard or read is that Hive can¹t/doesn¹t parallelize
>the job. 

The hashcode of the shuffle key is 0, since you need to process every row
against every key - there's no possibility of dividing up the work.

Tez will actually have a cross-product edge (TEZ-2104), which is a
distributed cross-product proposal but wasn't picked up in the last Google
Summer of Code.

The real trouble is that MapReduce cannot re-direct data at all (there's
only shuffle edges).

> Does anyone have a workaround?

I use a staged partitioned table as a workaround for this, hashed on a
high nDV key - the goal of the Tez edge is to shuffle the data similarly
at runtime.

For instance, this python script makes a query with a 19x improvement in
distribution for a cross-product which generates 50+Gb of data from a
~10Mb input.

https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8


It is possible for Hive-Tez to actually generate UNION VertexGroups, but
it's much more efficient to do this as a edge with a custom EdgeManager,
since that opens up potentially implementing ThetaJoins in hive using that.

Cheers,
Gopal


Reply via email to