> We are using a query with union all and groupby and same table is read 
> multiple times in the union all subquery.
…
> When run with Mapreduce, the job is run in one stage consuming n mappers and 
> m reducers and all union all scans are done with the same job.

The logical plans are identical btw - MR effectively reads the same table again 
and again, unless the correlation optimizer is folding this.

I doubt that due to the unix_timestamp(). An explain would be useful.

> Hence if there are 50 union alls in a query, the 50n map vertex tasks are 
> launched which is huge.

Tez lets you scale the mappers up/down using split grouping parameters, so you 
can tweak it to scale down if you want to.

set tez.grouping.split-waves=0.1;

would try to shrink the width of the mappers.

An alternative is to use a CTE + materialization (HIVE-11752), but for that you 
need Hive2.

> http://pastebin.com/u7Rw6Hag

You can probably get a ~2x speedup by removing the UNIX_TIMESTAMP() and using 
CURRENT_TIMESTAMP instead.

Cheers,
Gopal


Reply via email to