Jie Li <jay23j...@gmail.com> writes:
> I'm new to window functions. Recently I ran some simple queries and was
> surprised to find percent_rank is so much slower than rank. Could anybody
> tell me why?
Huh, interesting.  I can reproduce this with toy data, such as

    create table inventory1 (inv_date_sk int, inv_item_sk int);

    insert into inventory1
      select 1, random() * 100000 from generate_series(1, 189000);

    explain analyze
      select inv_date_sk, inv_item_sk,
             percent_rank() over (partition by inv_date_sk order by inv_item_sk)
      from inventory1;

The example is *not* particularly slow if you leave work_mem at the default.
But if you bump up work_mem enough that the WindowAgg's internal tuplestore
fits into memory, it slows down like crazy.  A bit of quality time with
oprofile shows that all the time is going into this memmove() in
tuplestore_trim():

    /*
     * Slide the array down and readjust pointers.  This may look pretty
     * stupid, but we expect that there will usually not be very many
     * tuple-pointers to move, so this isn't that expensive; and it keeps a
     * lot of other logic simple.
     *
     * In fact, in the current usage for merge joins, it's demonstrable that
     * there will always be exactly one non-removed tuple; so optimize that
     * case.
     */
    if (nremove + 1 == state->memtupcount)
        state->memtuples[0] = state->memtuples[nremove];
    else
        memmove(state->memtuples, state->memtuples + nremove,
                (state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through the
tuplestore, and moving 100000+ tuple pointers each time.  Ugh.

This code was all right when written, because (IIRC) the mergejoin case was
actually the only caller.  But it's not all right for WindowAgg's
less-predictable usage patterns.

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just as the comment
says, that complicates a lot of other logic.  And there isn't any easy place
to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply do nothing until nremove reaches some reasonably
large amount, perhaps 10% of the number of stored tuples.  This wastes up to
10% of the allotted memory, but that seems tolerable.

We could complicate things a bit more by remembering that so-and-so many
slots are authorized to be removed, and forcing a trim operation to discard
them if we find ourselves in memory trouble.  I'm not sure that extra
complication is worthwhile, though.

Comments?

            regards, tom lane
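[Editor's illustration, not part of the original message: a minimal
standalone C sketch, not PostgreSQL code, of why the deferred-trim heuristic
described above helps.  The array and counters are invented for the example;
it only models the memmove traffic: sliding the pointer array down after
every single removal does O(N^2) work, while waiting until roughly 10% of
the slots are removable keeps the total close to linear.]

    /*
     * Toy model of the trimming cost (not PostgreSQL code; names made up).
     * NTUPLES mirrors the ~100000 tuple pointers mentioned above; the
     * "eager" pass takes a few seconds because it moves ~5 billion pointers.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTUPLES 100000

    int
    main(void)
    {
        void      **tuples = calloc(NTUPLES, sizeof(void *));
        long long   moved_eager = 0;
        long long   moved_lazy = 0;
        int         count, pending;

        /* Eager strategy: slide the whole array down after every removal. */
        count = NTUPLES;
        while (count > 1)
        {
            memmove(tuples, tuples + 1, (count - 1) * sizeof(void *));
            moved_eager += count - 1;
            count--;
        }

        /* Lazy strategy: count removals, slide only once ~10% are removable. */
        count = NTUPLES;
        pending = 0;
        while (count - pending > 1)
        {
            pending++;              /* logically remove one more tuple */
            if (pending >= count / 10)
            {
                memmove(tuples, tuples + pending,
                        (count - pending) * sizeof(void *));
                moved_lazy += count - pending;
                count -= pending;
                pending = 0;
            }
        }

        printf("pointers moved, eager trim: %lld\n", moved_eager);
        printf("pointers moved, lazy trim:  %lld\n", moved_lazy);
        free(tuples);
        return 0;
    }

With NTUPLES = 100000 the eager pass moves about N*(N-1)/2, i.e. roughly
five billion pointers, while the lazy pass moves on the order of a million,
which is the same asymmetry the oprofile result above points at.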