>I'm having trouble doing a cross join between two tables that are too big
>for a map-side join.

The actual query would help btw. Usually what is planned as a cross-join
can be optimized out into a binning query with a custom UDF.

This applies in particular to 2-D geo queries, which people tend to run as
cross-joins when they port PostGIS queries (ST_Contains) into Hive.
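
To illustrate the binning idea outside Hive, here is a minimal Python sketch,
not the actual UDF. It assumes axis-aligned rectangles instead of real
polygons, and the names (CELL, cell_of, binned_join) and the sample data are
made up for the example. Each point goes into a grid cell, each rectangle is
only compared against points in the cells it overlaps, and the exact
containment test still runs on every candidate pair.

```python
from collections import defaultdict
from itertools import product
from math import floor

CELL = 1.0  # hypothetical bin size; tune it to the data distribution


def cell_of(x, y, size=CELL):
    """Map a coordinate to its integer grid cell."""
    return (floor(x / size), floor(y / size))


def contains(rect, point):
    """Exact test. rect is (min_x, min_y, max_x, max_y), bounds inclusive."""
    min_x, min_y, max_x, max_y = rect
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def cross_join(rects, points):
    """Baseline: test every rectangle against every point."""
    return {
        (rid, pid)
        for (rid, rect), (pid, pt) in product(rects.items(), points.items())
        if contains(rect, pt)
    }


def binned_join(rects, points, size=CELL):
    """Same result, but candidates are limited to points in shared cells."""
    # Bucket each point under the single cell it falls into.
    buckets = defaultdict(list)
    for pid, (x, y) in points.items():
        buckets[cell_of(x, y, size)].append(pid)

    matches = set()
    for rid, rect in rects.items():
        # Enumerate every cell the rectangle overlaps.
        lo_x, lo_y = cell_of(rect[0], rect[1], size)
        hi_x, hi_y = cell_of(rect[2], rect[3], size)
        for cx in range(lo_x, hi_x + 1):
            for cy in range(lo_y, hi_y + 1):
                for pid in buckets.get((cx, cy), ()):
                    # The cell match is only a filter; confirm exactly.
                    if contains(rect, points[pid]):
                        matches.add((rid, pid))
    return matches


# Made-up sample data for the sketch.
rects = {"r1": (0.0, 0.0, 2.5, 1.5), "r2": (5.0, 5.0, 6.0, 6.0)}
points = {
    "p1": (1.0, 1.0),
    "p2": (2.4, 0.2),
    "p3": (5.5, 5.5),
    "p4": (9.0, 9.0),
    "p5": (2.6, 1.0),
}

print(sorted(binned_join(rects, points)))
```

In a query engine the cell id would play the role of an equi-join key, so the
work becomes a regular shuffle join plus a filter rather than a full cross
product.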

>Trying to break down one table into small enough partitions and then
>unioning them together seems to give comparable performance to a cross
>join.
...
>Short of moving to a different execution engine, are there any
>performance improvements that can be made to lessen the pain of a cross
>join? 

No, with MapReduce you're mostly stuck running each part of the union one
after the other.

Since this is a simple fan-out, you can try the simple parallelization setting:

set hive.exec.parallel=true;

Beware, this has known deadlocks as queries get more complex, for which
you need a real DAG (directed acyclic graph) scheduling engine like Tez.

Drop me an off-list mail if you want to run Tez on recent CDH, EMR or MapR
releases.

>Also, could you please elaborate on your comment "The real trouble is
>that MapReduce cannot re-direct data at all (there's only shuffle
>edges)"? Thanks!

MapReduce cannot do non-pairwise routing operations, since they violate
the direct assumption of map() partitioning (the same applies to Spark's
map() operators).

Tez goes a bit out of its way to separate the control plane from the data
plane, so that you can do non-pairwise operations like auto-reducer
parallelism, or split up pairwise operations as m:n matching.

Here are slides from last year describing how that rewiring works:

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/13

http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big-data/19


Cheers,
Gopal
