Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-13 Thread Ron Mayer
Tom Lane wrote:
 argue that there was a regression.  It's certainly a performance bug
 though: nobody would expect that giving a query *more* work_mem would
 cause it to run many times slower.

I wouldn't be that surprised - otherwise it'd just be hard-coded to
something large.

Especially since earlier in the thread:
 The example is *not* particularly slow if you leave work_mem at default.
which makes me think it's arguably not quite a bug.





Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-10 Thread Hitoshi Harada
2010/12/10 Tom Lane <t...@sss.pgh.pa.us>:
 I wrote:
 We're throwing away one tuple at a time as we advance forward through
 the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
 This code was all right when written, because (IIRC) the mergejoin
 case was actually the only caller.  But it's not all right for
 WindowAgg's less-predictable usage patterns.

 I thought for a bit about changing things around so that the first-used
 tuple slot isn't necessarily state->memtuples[0], but just like the
 comment says, that complicates a lot of other logic.  And there isn't
 any easy place to reclaim the wasted slots later.

 What seems like the best bet is to put in a heuristic to make
 tuplestore_trim simply not do anything until nremove reaches some
 reasonably large amount, perhaps 10% of the number of stored tuples.
 This wastes up to 10% of the allotted memory, but that seems tolerable.

 On reflection I think just not doing anything isn't a very good idea.
 The problem with that is that a mis-coded caller could try to fetch
 tuples that it had already told the tuplestore could be trimmed away;
 and this would work, most of the time, until you got unlucky and the
 trim operation had actually deleted them.  I think it's pretty important
 for bug-catching purposes that the tuplestore enforce that those tuples
 are not available anymore.

I see it's too late now that you've committed it, but it seems there
was another way to avoid it by not trimming from percent_rank()
individually. Once the whole partition fits in memory, you don't
need to trim it, since it never grows. The trimming logic is for
something like moving aggregates and (simple) rank(), which grow
the tuplestore content as they advance. percent_rank() doesn't seem
to match the optimization.

Regards,

-- 
Hitoshi Harada



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-10 Thread Tom Lane
Hitoshi Harada <umi.tan...@gmail.com> writes:
 I see it's too late now that you've committed it,

Patches can always be reverted...

 but it seems there
 was another way to avoid it by not trimming from percent_rank()
 individually. Once the whole partition fits in memory, you don't
 need to trim it, since it never grows. The trimming logic is for
 something like moving aggregates and (simple) rank(), which grow
 the tuplestore content as they advance. percent_rank() doesn't seem
 to match the optimization.

I don't think this idea leads to a robust solution.  When you have a
combination of different window functions being used in the same scan,
you can't expect any one of them to know the global situation.  Having
percent_rank lie about its requirements in order to avoid bad behavior
in the tuplestore infrastructure is just going to create more problems
down the road.  We need to have the individual functions tell the truth
and then do any optimization hacking in the WindowAgg code or
infrastructure.

regards, tom lane



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-10 Thread Hitoshi Harada
2010/12/11 Tom Lane <t...@sss.pgh.pa.us>:
 Hitoshi Harada <umi.tan...@gmail.com> writes:
 I see it's too late now that you've committed it,

 Patches can always be reverted...

 but it seems there
 was another way to avoid it by not trimming from percent_rank()
 individually. Once the whole partition fits in memory, you don't
 need to trim it, since it never grows. The trimming logic is for
 something like moving aggregates and (simple) rank(), which grow
 the tuplestore content as they advance. percent_rank() doesn't seem
 to match the optimization.

 I don't think this idea leads to a robust solution.  When you have a
 combination of different window functions being used in the same scan,
 you can't expect any one of them to know the global situation.  Having
 percent_rank lie about its requirements in order to avoid bad behavior
 in the tuplestore infrastructure is just going to create more problems
 down the road.  We need to have the individual functions tell the truth
 and then do any optimization hacking in the WindowAgg code or
 infrastructure.

Hm? Once percent_rank() scans to the partition end, other window
functions that scan row by row don't need to care about the memory
reduction, do they? Or more generally, once the partition has been
scanned to the end, we don't need to trim the tuplestore anymore.
Am I misunderstanding?

Regards,

-- 
Hitoshi Harada



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-10 Thread Tom Lane
Hitoshi Harada <umi.tan...@gmail.com> writes:
 Hm? Once percent_rank() scans to the partition end, other window
 functions that scan row by row don't need to care about the memory
 reduction, do they? Or more generally, once the partition has been
 scanned to the end, we don't need to trim the tuplestore anymore.
 Am I misunderstanding?

Giving back the memory as we do the scan is still a good thing IMO;
there might be other uses for it.  In any case I don't see where you're
going to put such a heuristic without breaking potentially interesting
uses elsewhere.  The tuplestore doesn't know anything about partitions
being read to the end; and WindowAgg doesn't (or shouldn't) know about
whether the tuplestore is all in memory.

Furthermore, the performance problem would exist for any situation where
the window functions had read far beyond the frame start, whether that
was all the way to partition end or not.  Consider a frame like ROWS
BETWEEN 1 PRECEDING AND 1 FOLLOWING.

In the end this is a local problem inside tuplestore, and kluging its
callers to work around it is the wrong approach.

regards, tom lane



[HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Jie Li
Hi all,

I'm new to window functions. Recently I ran some simple queries and was
surprised to find that percent_rank is so much slower than rank; could
anybody tell me why?

The table schema:
test=# \d inventory1
              Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                       List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:
test=# explain analyze select inv_date_sk,inv_item_sk, rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:
test=# explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key (inv_date_sk)
are the same; that is, there is only one window partition. I find that
percent_rank needs to buffer all the tuples to get the total number of rows.
But why is it so expensive?

I'm using 8.4.4, and I only increased work_mem to 100MB, leaving the other
parameters untouched.

Thanks,
Li Jie


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
Jie Li <jay23j...@gmail.com> writes:
 I'm new to window functions. Recently I ran some simple queries and was
 surprised to find that percent_rank is so much slower than rank; could
 anybody tell me why?

Huh, interesting.  I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 10 from generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by inv_date_sk order by inv_item_sk) from inventory1;

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy.  A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():

/*
 * Slide the array down and readjust pointers.  This may look pretty
 * stupid, but we expect that there will usually not be very many
 * tuple-pointers to move, so this isn't that expensive; and it keeps a
 * lot of other logic simple.
 *
 * In fact, in the current usage for merge joins, it's demonstrable that
 * there will always be exactly one non-removed tuple; so optimize that
 * case.
 */
if (nremove + 1 == state->memtupcount)
	state->memtuples[0] = state->memtuples[nremove];
else
	memmove(state->memtuples, state->memtuples + nremove,
			(state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
This code was all right when written, because (IIRC) the mergejoin
case was actually the only caller.  But it's not all right for
WindowAgg's less-predictable usage patterns.

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just like the
comment says, that complicates a lot of other logic.  And there isn't
any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.
We could complicate things a bit more by remembering that so-and-so
many slots are authorized to be removed, and forcing a trim operation
to discard them if we find ourselves in memory trouble.  I'm not sure
that extra complication is worthwhile though.  Comments?

regards, tom lane



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
I wrote:
 We're throwing away one tuple at a time as we advance forward through
 the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
 This code was all right when written, because (IIRC) the mergejoin
 case was actually the only caller.  But it's not all right for
 WindowAgg's less-predictable usage patterns.

 I thought for a bit about changing things around so that the first-used
 tuple slot isn't necessarily state->memtuples[0], but just like the
 comment says, that complicates a lot of other logic.  And there isn't
 any easy place to reclaim the wasted slots later.

 What seems like the best bet is to put in a heuristic to make
 tuplestore_trim simply not do anything until nremove reaches some
 reasonably large amount, perhaps 10% of the number of stored tuples.
 This wastes up to 10% of the allotted memory, but that seems tolerable.

On reflection I think just not doing anything isn't a very good idea.
The problem with that is that a mis-coded caller could try to fetch
tuples that it had already told the tuplestore could be trimmed away;
and this would work, most of the time, until you got unlucky and the
trim operation had actually deleted them.  I think it's pretty important
for bug-catching purposes that the tuplestore enforce that those tuples
are not available anymore.

Hence the attached patch, which combines the two ideas by recycling
tuples immediately but not sliding the pointer array until a reasonable
amount of movement has occurred.  This fixes the complained-of
performance problem AFAICT.

I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
code in tuplestore.c hasn't changed at all since 8.4, so there's not
much risk of cross-version bugs, but if I did miss anything we could
be shipping a buggy version next week.  Thoughts?

regards, tom lane

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 9bbaba43771f495fdf24e9f2afd545b69a22ecbd..8c8139c897679892e0d4ad13e69ae8d814484206 100644
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
*** struct Tuplestorestate
*** 145,152 
--- 145,159 
  	/*
  	 * This array holds pointers to tuples in memory if we are in state INMEM.
  	 * In states WRITEFILE and READFILE it's not used.
+ 	 *
+ 	 * When memtupdeleted > 0, the first memtupdeleted pointers are already
+ 	 * released due to a tuplestore_trim() operation, but we haven't expended
+ 	 * the effort to slide the remaining pointers down.  These unused pointers
+ 	 * are set to NULL to catch any invalid accesses.  Note that memtupcount
+ 	 * includes the deleted pointers.
  	 */
  	void	  **memtuples;		/* array of pointers to palloc'd tuples */
+ 	int			memtupdeleted;	/* the first N slots are currently unused */
  	int			memtupcount;	/* number of tuples currently present */
  	int			memtupsize;		/* allocated length of memtuples array */
  
*** tuplestore_begin_common(int eflags, bool
*** 252,257 
--- 259,265 
  	state->context = CurrentMemoryContext;
  	state->resowner = CurrentResourceOwner;
  
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	state->memtupsize = 1024;	/* initial guess */
  	state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *));
*** tuplestore_clear(Tuplestorestate *state)
*** 401,407 ****
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
--- 409,415 ----
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
*** tuplestore_clear(Tuplestorestate *state)
*** 409,414 
--- 417,423 
  	}
  	state->status = TSS_INMEM;
  	state->truncated = false;
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	readptr = state->readptrs;
  	for (i = 0; i < state->readptrcount; readptr++, i++)
*** tuplestore_end(Tuplestorestate *state)
*** 432,438 ****
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
--- 441,447 ----
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
*** tuplestore_gettuple(Tuplestorestate *sta
*** 774,787 ****
  }
  else
  {
! 	if (readptr->current <= 0)
  	{
  		Assert(!state->truncated);
  		return NULL;
  	}
  	readptr->current--; /* last returned tuple */
  }
! if (readptr->current <= 0)
  {
  	Assert(!state->truncated);
  	return NULL;
--- 783,796 ----
  }
  else
  {
! 	if (readptr->current <= state->memtupdeleted)
  	{
  
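In short, trimmed tuples are freed (and their slots set to NULL) right away,
but the pointer array is only slid down once a sizable fraction of its leading
slots are dead; with a fixed-fraction threshold the total pointer movement is
linear rather than quadratic.  A self-contained toy sketch of that idea, with
made-up names and an arbitrary threshold (not the committed tuplestore.c code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyStore
{
	void	  **memtuples;		/* pointers to stored items */
	int			memtupdeleted;	/* leading slots already freed and NULLed */
	int			memtupcount;	/* slots in use, including deleted ones */
} ToyStore;

static void
toy_trim(ToyStore *ts, int ntuples)
{
	/* release the tuples right away so stale reads can't see them */
	for (int i = ts->memtupdeleted; i < ts->memtupdeleted + ntuples; i++)
	{
		free(ts->memtuples[i]);
		ts->memtuples[i] = NULL;
	}
	ts->memtupdeleted += ntuples;

	/* defer the memmove until enough leading slots are dead */
	if (ts->memtupdeleted < ts->memtupcount / 8)
		return;

	memmove(ts->memtuples,
			ts->memtuples + ts->memtupdeleted,
			(ts->memtupcount - ts->memtupdeleted) * sizeof(void *));
	ts->memtupcount -= ts->memtupdeleted;
	ts->memtupdeleted = 0;
}

int
main(void)
{
	ToyStore	ts;
	int			n = 189000;

	ts.memtuples = malloc(n * sizeof(void *));
	ts.memtupdeleted = 0;
	ts.memtupcount = n;
	for (int i = 0; i < n; i++)
		ts.memtuples[i] = malloc(8);

	/* trim one tuple per call, as in the reported WindowAgg case */
	while (ts.memtupcount - ts.memtupdeleted > 1)
		toy_trim(&ts, 1);

	printf("%d live tuple(s) remain\n", ts.memtupcount - ts.memtupdeleted);

	for (int i = ts.memtupdeleted; i < ts.memtupcount; i++)
		free(ts.memtuples[i]);
	free(ts.memtuples);
	return 0;
}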

Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kevin Grittner
Tom Lane <t...@sss.pgh.pa.us> wrote:
 
 I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
 The code in tuplestore.c hasn't changed at all since 8.4, so
 there's not much risk of cross-version bugs, but if I did miss
 anything we could be shipping a buggy version next week. 
 Thoughts?
 
Is there a performance regression involved, or is it a new feature
which hasn't performed well on this type of query until your patch? 
If the latter, I'd be inclined to give it some time on HEAD and
release it in the *following* minor release unless you're *very*
confident it couldn't break anything.
 
It's an uphill battle to convince managers that it's safe to apply
minor upgrades with minimal testing.  It doesn't take too many slips
for the boulder to roll all the way back to the bottom of that hill.
 
-Kevin



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
Kevin Grittner <kevin.gritt...@wicourts.gov> writes:
 Tom Lane <t...@sss.pgh.pa.us> wrote:
 I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
 The code in tuplestore.c hasn't changed at all since 8.4, so
 there's not much risk of cross-version bugs, but if I did miss
 anything we could be shipping a buggy version next week. 
 Thoughts?
 
 Is there a performance regression involved, or is it a new feature
 which hasn't performed well on this type of query until your patch? 

Well, since window functions didn't exist before 8.4, it's difficult to
argue that there was a regression.  It's certainly a performance bug
though: nobody would expect that giving a query *more* work_mem would
cause it to run many times slower.

 If the latter, I'd be inclined to give it some time on HEAD and
 release it in the *following* minor release unless you're *very*
 confident it couldn't break anything.

Well, I'm reasonably confident in the patch, and it does pass regression
tests.  But I've been wrong before.

I'm not terribly thrilled with that suggestion though.  Do you have
reason to think that anybody is likely to exercise window functions in
HEAD, beyond what the regression tests do, in the next couple of months?
Slipping the application of the patch to back branches by a little bit
doesn't make a lot of management sense to me.  I think either we trust
it and put it into back branches, or we don't trust it and put it only
in HEAD, so it goes through a beta cycle before hitting production.

regards, tom lane



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kevin Grittner
Tom Lane <t...@sss.pgh.pa.us> wrote:
 
 Do you have reason to think that anybody is likely to exercise
 window functions in HEAD, beyond what the regression tests do, in
 the next couple of months?
 
Not specifically, no.  From the description (not having read the
patch) I was somewhat concerned that it might affect something
outside that narrow use case.  If that's not possible, then I'm more
comfortable putting it in a release that soon after it hits the
repository.
 
It's a judgment call, and you're clearly in the best position to
make it.  You asked for thoughts, so I shared my concerns.  :-)
 
-Kevin



Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kenneth Marshall
On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote:
 I wrote:
  We're throwing away one tuple at a time as we advance forward through
  the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
  This code was all right when written, because (IIRC) the mergejoin
  case was actually the only caller.  But it's not all right for
  WindowAgg's less-predictable usage patterns.
 
  I thought for a bit about changing things around so that the first-used
  tuple slot isn't necessarily state->memtuples[0], but just like the
  comment says, that complicates a lot of other logic.  And there isn't
  any easy place to reclaim the wasted slots later.
 
  What seems like the best bet is to put in a heuristic to make
  tuplestore_trim simply not do anything until nremove reaches some
  reasonably large amount, perhaps 10% of the number of stored tuples.
  This wastes up to 10% of the allotted memory, but that seems tolerable.
 
 On reflection I think just not doing anything isn't a very good idea.
 The problem with that is that a mis-coded caller could try to fetch
 tuples that it had already told the tuplestore could be trimmed away;
 and this would work, most of the time, until you got unlucky and the
 trim operation had actually deleted them.  I think it's pretty important
 for bug-catching purposes that the tuplestore enforce that those tuples
 are not available anymore.
 
 Hence the attached patch, which combines the two ideas by recycling
 tuples immediately but not sliding the pointer array until a reasonable
 amount of movement has occurred.  This fixes the complained-of
 performance problem AFAICT.
 
 I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
 code in tuplestore.c hasn't changed at all since 8.4, so there's not
 much risk of cross-version bugs, but if I did miss anything we could
 be shipping a buggy version next week.  Thoughts?
 
   regards, tom lane
 

+1 for back patching.

Ken

