Hello,

I'm trying to optimize some Hive queries that were recently switched to
Tez, and I have a general question about reducer parallelism.

A lot of our queries compute several window functions at once, in the
following style:


 SELECT
     row_number() OVER( PARTITION BY app, user, type ORDER BY ts ) as a_number,
     row_number() OVER( PARTITION BY day, app, user, type ORDER BY ts ) as type_rank,
     row_number() OVER( PARTITION BY day, app, user ORDER BY ts ) as dau_rank
 FROM messages
 WHERE ...


Since each OVER / PARTITION BY clause is independent, they could be put into
parallel reducer phases. But what I see is that they get serialized
into M1 -> R1 -> R2 -> R3 instead of M1 -> [ R1, R2, R3 ].
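
The vertex layout can be checked by prefixing the query with EXPLAIN. This is
a minimal sketch, assuming the same messages table and columns as above; the
WHERE filter is left out here because it is elided in the original query.

```sql
-- Make sure the query is planned for Tez rather than MapReduce.
SET hive.execution.engine=tez;

-- EXPLAIN prints the execution plan without running the query.
-- On Tez the plan should include the list of vertices and the edges
-- between them, which shows whether the reducers are chained
-- one after another or fan out from the same map vertex.
EXPLAIN
SELECT
    row_number() OVER( PARTITION BY app, user, type ORDER BY ts ) as a_number,
    row_number() OVER( PARTITION BY day, app, user, type ORDER BY ts ) as type_rank,
    row_number() OVER( PARTITION BY day, app, user ORDER BY ts ) as dau_rank
FROM messages;
```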

Is this something that Tez tries to do at all, or is there an optimization
that I can enable to get this behaviour?

Cheers,
-Gautam.
