> The windowing is not simultaneous unless they are all over the same window
> - the following query has 3 different windows applied over the same rows
> sequentially.
Ok. Just wanted to confirm. Maybe I could restructure my query to get more
parallelism ..
> They are all over the same rows so they're done in sequence, so the final
> row-set contains all values, causing multiple shuffles of the same rows.
So you'r saying, since these windows are part of a single SELECT
projection they need to be serial?
> Does your query only have row_numbers or does it have other columns in
them?
yes. Something like ...
{code}
SELECT
row_number() OVER( PARTITION BY app, user, type ORDER BY
ts ) as a_number,
row_number() OVER( PARTITION BY day, app, user, type ORDER
BY ts ) as type_rank,
row_number() OVER( PARTITION BY day, app, user ORDER BY
ts ) as dau_rank,
day,
user,
app,
type,
ts
FROM messages
{code}
On Tue, Mar 15, 2016 at 6:41 PM, Gopal Vijayaraghavan <[email protected]>
wrote:
> > A lot of our queries do the following style of simultaneous windowing ..
>
> The windowing is not simultaneous unless they are all over the same window
> - the following query has 3 different windows applied over the same rows
> sequentially.
>
> > SELECT
> > row_number() OVER( PARTITION BY app, user,
> > type ORDER BY ts )as a_number,
> > row_number() OVER( PARTITION BY day, app, user,
> > type ORDER BY ts )as type_rank,
>
> > Since each OVER / PARTITION-By clause is independent they can the put
> >into parallelized Reducer phases.
>
> They are all over the same rows so they're done in sequence, so the final
> row-set contains all values, causing multiple shuffles of the same rows.
>
> > Is this something that Tez tries to do at all or an optimization that I
> >can use to my benefit ?
>
> Does your query only have row_numbers or does it have other columns in
> them?
>
> Cheers,
> Gopal
>
>
>
--
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."