>I¹m having trouble doing a cross join between two tables that are too big >for a map-side join.
The actual query would help btw. Usually what is planned as a cross-join can be optimized out into a binning query with a custom UDF. In particular with 2-D geo queries with binning, which people tend to run as cross-joins when they port PostGIS queries into Hive (ST_Contains). >Trying to break down one table into small enough partitions and then >unioning them together seems to give comparable performance to a cross >join. ... >Short of moving to a different execution engine, are there any >performance improvements that can be made to lessen the pain of a cross >join? No, with MapReduce you're mostly stuck running each part of the union one after the other. Since this is a simple fan-out, you can try the simple parallelization set hive.exec.parallel=true; Beware, this has known deadlocks as queries get more complex - for which you need a real Acyclic Graph scheduler engine like Tez. Drop me an off-list mail if you want to run Tez on recent CDH, EMR or MapR releases. >Also, could you please elaborate on your comment ³The real trouble is >that MapReduce cannot re-direct data at all (there¹s only shuffle >edges)"? Thanks! Mapreduce cannot do non pair-wise routing operations, since it violates the direct assumption of map() partitioning (same applies to Spark's map() operators). Tez goes a bit out of the way to separate the control plane from the data plane, so that you can do non-pairwise operations like auto-reducer parallelism or splitting up pair-wise operations as m:n matching. Here's slides from last year describing how that rewiring works. http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/13 http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big -data/19 Cheers, Gopal