Hi Gopal,

Please find the answers inline.

On Fri, Mar 17, 2017 at 9:01 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > We are using a query with union all and groupby and same table is read
> multiple times in the union all subquery.
> …
> > When run with Mapreduce, the job is run in one stage consuming n mappers
> and m reducers and all union all scans are done with the same job.
>
> The logical plans are identical btw - MR effectively reads the same table
> again and again, unless the correlation optimizer is folding this.
>
> I doubt that due to the unix_timestamp(). An explain would be useful.



Below are the explain plans for Tez and MR.

Query:
http://pastebin.com/t6n91u6a

Tez explain plan:
http://pastebin.com/aWwVxhii

MR explain plan:
http://pastebin.com/iDbWwtKR



>
> > Hence if there are 50 union alls in a query, the 50n map vertex tasks
> are launched which is huge.
>
> Tez lets you scale the mappers up/down using split grouping parameters, so
> you can tweak it to scale down if you want to.
>
> set tez.grouping.split-waves=0.1;
>
> would try to shrink the width of the mappers.
>


The split-waves setting let us reduce the number of mappers per vertex, but
each mapper is heavier now and the number of vertices doesn't change. The
total work is the same, and the same table is still read once for each
union all subquery.
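
For reference, a rough sketch of the split-grouping knobs in question.
tez.grouping.split-waves is the setting suggested above;
tez.grouping.min-size and tez.grouping.max-size are the related Tez grouping
parameters. The values are illustrative assumptions, not what we ran:

```
set tez.grouping.split-waves=0.1;
-- bounds on grouped split size, in bytes (illustrative values only)
set tez.grouping.min-size=268435456;
set tez.grouping.max-size=1073741824;
```

As far as we can tell, these only change how many tasks each map vertex
gets; they don't merge the per-branch scans into one.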


>
> An alternative is to use a CTE + materialization (HIVE-11752), but for
> that you need Hive2.
>
> > http://pastebin.com/u7Rw6Hag
>
>
We are using Hive 2.1.0 and Tez 0.8.4, and we tried enabling CTE
materialization with hive.optimize.cte.materialize.threshold=1, but the
problem still exists.
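
A minimal sketch of the shape of what we tried. The table and column names
below (source_table, key_col, metric_col, dt) are hypothetical placeholders;
the real query is the one in the pastebin link above:

```
set hive.optimize.cte.materialize.threshold=1;

-- hypothetical names; the shared scan is pulled into a CTE so that it
-- can be materialized once and reused by every union all branch
WITH base AS (
  SELECT key_col, metric_col
  FROM source_table
  WHERE dt = '2017-03-17'
)
SELECT 'a' AS branch, key_col, count(*) AS cnt
FROM base
GROUP BY key_col
UNION ALL
SELECT 'b' AS branch, key_col, count(*) AS cnt
FROM base
WHERE metric_col > 0
GROUP BY key_col;
```

The expectation was that base would be materialized once and each branch
would read the materialized result rather than the original table.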


> You can probably get a ~2x speedup by removing the UNIX_TIMESTAMP() and
> using CURRENT_TIMESTAMP instead.
>
>
Thanks for the tip Gopal, will change this.


> Cheers,
> Gopal
>
>
>
