Re: [HACKERS] Sort Refinement

2008-03-22 Thread Simon Riggs
On Thu, 2008-03-20 at 22:35 +, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  If we assume we use heap sort, then if we *know* that the data is
  presorted on (a) then we should be able to emit tuples directly that the
  value of (a) changes and keep emitting them until the heap is empty,
  since they will exit the heap in (a,b) order.
 
 Actually, I would think the way to do this would be to do a quicksort if you
 find you've accumulated all the records in a subgroup in memory. One easy way
 to do it would be to have nodeSort build a new tuplesort for each subgroup if
 it has a level break key parameter set (memories of RPG III are coming
 bubbling to the surface).

Yes, its essentially the same thing as running a series of otherwise
unconnected sorts. However, that seems to introduce its own overheads if
we did that literally.

We needn't fix this to either a heapsort or a quicksort. We can let the
existing code decide which and let the mode change naturally from one to
the other as is needed.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sort Refinement

2008-03-22 Thread Simon Riggs
On Thu, 2008-03-20 at 21:34 +, Sam Mason wrote:
 On Thu, Mar 20, 2008 at 05:17:22PM +, Simon Riggs wrote:
  Currently, our sort algorithm assumes that its input is unsorted. So if
  your data is sorted on (a) and you would like it to be sorted on (a,b)
  then we need to perform the full sort of (a,b).
  
  For small sorts this doesn't matter much. For larger sorts the heap sort
  algorithm will typically result in just a single run being written to
  disk which must then be read back in. Number of I/Os required is twice
  the total volume of data to be sorted.
  
  If we assume we use heap sort, then if we *know* that the data is
  presorted on (a) then we should be able to emit tuples directly that the
  value of (a) changes and keep emitting them until the heap is empty,
  since they will exit the heap in (a,b) order.
 
 We also have stats to help decide when this will be a win.  For example
 if a has a small range (i.e. a boolean) and b has a large range
 (i.e. some sequence) then this probably isn't going to be a win and
 you're better off using the existing infrastructure.  If it's the other
 way around then this is going to be a big win.

Yep, sounds sensible.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Simon Riggs
On Thu, 2008-03-20 at 23:07 +, Simon Riggs wrote:

  Simon, would it be too much to ask that you concentrate on reviewing
  existing patches during commit fest?  Trying to get people to think
  about random new ideas is about the most direct undermining of the
  commit-fest concept that I can think of.  Save 'em for later.
 
 That's a fair reminder, thanks, I will do that.

I'm a bit in the dark about this Commit Fest, to be honest.

Is this the list of patches to be reviewed?
http://wiki.postgresql.org/wiki/Todo:CommitFest

I was suspicious of that because it mentions Minor changes to Recovery
related code by Simon Riggs, which I'd already mentioned had been
committed more than 6 months ago.

ISTM that nobody has reviewed anything except you, Tom, from the list.
Is that true, or are there others working on reviews I can't see?

or maybe the patch list is this?
http://wiki.postgresql.org/wiki/Todo:PatchStatus


I'll review Tom Doran's and Dany DeBontridder's work. 

Incidentally, I'm in favour of letting Heikki review his own work
because there's a backlog on index changes that appears to be months
long and he has a good chance of tackling that.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2008-03-20 at 23:07 +, Simon Riggs wrote:


Simon, would it be too much to ask that you concentrate on reviewing
existing patches during commit fest?  Trying to get people to think
about random new ideas is about the most direct undermining of the
commit-fest concept that I can think of.  Save 'em for later.

That's a fair reminder, thanks, I will do that.


I'm a bit in the dark about this Commit Fest, to be honest.

Is this the list of patches to be reviewed?
http://wiki.postgresql.org/wiki/Todo:CommitFest

I was suspicious of that because it mentions Minor changes to Recovery
related code by Simon Riggs, which I'd already mentioned had been
committed more than 6 months ago.

ISTM that nobody has reviewed anything except you, Tom, from the list.
Is that true, or are there others working on reviews I can't see?

or maybe the patch list is this?
http://wiki.postgresql.org/wiki/Todo:PatchStatus


I'll review Tom Doran's and Dany DeBontridder's work. 


Incidentally, I'm in favour of letting Heikki review his own work
because there's a backlog on index changes that appears to be months
long and he has a good chance of tackling that.


Umm, I don't think there's any patches from me in the queue that need 
review. There's some discussion threads related to bitmap indexes, but 
that's all. We're definitely not going to get bitmap indexes in this 
commit fest.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Bruce Momjian
Simon Riggs wrote:
 On Thu, 2008-03-20 at 23:07 +, Simon Riggs wrote:
 
   Simon, would it be too much to ask that you concentrate on reviewing
   existing patches during commit fest?  Trying to get people to think
   about random new ideas is about the most direct undermining of the
   commit-fest concept that I can think of.  Save 'em for later.
  
  That's a fair reminder, thanks, I will do that.
 
 I'm a bit in the dark about this Commit Fest, to be honest.
 
 Is this the list of patches to be reviewed?
 http://wiki.postgresql.org/wiki/Todo:CommitFest

I don't think that list is complete.  The full archive is:

http://momjian.us/cgi-bin/pgpatches

Sorry, there is no summary.

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Simon Riggs
On Fri, 2008-03-21 at 08:48 -0400, Bruce Momjian wrote:

 I don't think that list is complete.  The full archive is:
 
   http://momjian.us/cgi-bin/pgpatches
 
 Sorry, there is no summary.

I've reviewed Nikhil's partitioning patch for now.

I have some time to contribute, but not much. I don't want to review
things that will be rejected for other reasons, so unless there is
clearer information I don't see how I can contribute further.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Andrew Dunstan



Heikki Linnakangas wrote:

Simon Riggs wrote:


Incidentally, I'm in favour of letting Heikki review his own work
because there's a backlog on index changes that appears to be months
long and he has a good chance of tackling that.


Umm, I don't think there's any patches from me in the queue that need 
review. There's some discussion threads related to bitmap indexes, but 
that's all. We're definitely not going to get bitmap indexes in this 
commit fest.




There is your CopyReadLineText speedup, but I think there are too many 
open questions on it, e.g.:


   * should we change the line-end detection mode in text (non-CSV)
 mode by looking for an LF preceded by an even number of
 backslashes, or some similar logic?
   * how do we decide when to use the memchr tests rather than char by
 char tests?
   * is there a more economical way to code the searcher? (although I
 could live with it for now)


So I suggest we take it out of the queue for now and kick it back to you.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Umm, I don't think there's any patches from me in the queue that need 
 review. There's some discussion threads related to bitmap indexes, but 
 that's all. We're definitely not going to get bitmap indexes in this 
 commit fest.

I think there are basically three types of work represented in the
current patch queue:

1. Actual patches that have some hope of being applied now, and if not
we are supposed to provide feedback about what's needed to fix them.

2. Design proposals that require further feedback.  I think the idea
of the commit fest is that we should provide such feedback now, so
that whoever is going to work on it can proceed.

3. Discussions that don't really need any further feedback right now,
but should be summarized as TODO entries.

The reason category 3 is represented is that this is after all
Bruce's personal work queue (you'll remember that I pushed him to
open it up before he'd finished cleaning out that type of entry).

Personally I've been trying to knock off items in category 1.
It'd be useful for people to go through some of the longer discussion
threads and try to categorize them as needing further discussion now
or being just TODO items.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 There is your CopyReadLineText speedup, but I think there are too many 
 open questions on it, e.g.:
 ...
 So I suggest we take it out of the queue for now and kick it back to you.

Per my comments just now, the question is whether it's been adequately
reviewed or still needs some attention from the community.  If we think
the ball's entirely in Heikki's court on it, then we're done with it
until he comes back with a new version (or evidence showing it's good
as-is).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  
There is your CopyReadLineText speedup, but I think there are too many 
open questions on it, e.g.:

...
So I suggest we take it out of the queue for now and kick it back to you.



Per my comments just now, the question is whether it's been adequately
reviewed or still needs some attention from the community.  If we think
the ball's entirely in Heikki's court on it, then we're done with it
until he comes back with a new version (or evidence showing it's good
as-is).


  


My comments were intended to say I think the latter is the case (since I 
had previously undertaken to review this patch).


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Heikki Linnakangas

Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
There is your CopyReadLineText speedup, but I think there are too many 
open questions on it, e.g.:

...
So I suggest we take it out of the queue for now and kick it back to you.


Per my comments just now, the question is whether it's been adequately
reviewed or still needs some attention from the community.  If we think
the ball's entirely in Heikki's court on it, then we're done with it
until he comes back with a new version (or evidence showing it's good
as-is).


I'm not expecting any more review in this commit fest.

My plan is to try special-casing the usual case of text-mode in a non 
ASCII-embedding encoding (one that can be used as server encoding), by 
using memchr() to find end of line first, and then scanning back from 
there to count preceding backslashes. That requires some refactoring, 
but should avoid the performance penalty when there's plenty of backslashes.


Of course, if anyeone has better ideas, please speak up!

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Commit Fest (was Re: [HACKERS] Sort Refinement)

2008-03-21 Thread Andrew Dunstan



Heikki Linnakangas wrote:

Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
There is your CopyReadLineText speedup, but I think there are too 
many open questions on it, e.g.:

...
So I suggest we take it out of the queue for now and kick it back to 
you.


Per my comments just now, the question is whether it's been adequately
reviewed or still needs some attention from the community.  If we think
the ball's entirely in Heikki's court on it, then we're done with it
until he comes back with a new version (or evidence showing it's good
as-is).


I'm not expecting any more review in this commit fest.

My plan is to try special-casing the usual case of text-mode in a non 
ASCII-embedding encoding (one that can be used as server encoding), by 
using memchr() to find end of line first, and then scanning back from 
there to count preceding backslashes. That requires some refactoring, 
but should avoid the performance penalty when there's plenty of 
backslashes.


Of course, if anyeone has better ideas, please speak up!



That won't work for CSV mode, of course, which leaves us with trying to 
refine your previous solution in that case.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Sort Refinement

2008-03-20 Thread Simon Riggs
Currently, our sort algorithm assumes that its input is unsorted. So if
your data is sorted on (a) and you would like it to be sorted on (a,b)
then we need to perform the full sort of (a,b).

For small sorts this doesn't matter much. For larger sorts the heap sort
algorithm will typically result in just a single run being written to
disk which must then be read back in. Number of I/Os required is twice
the total volume of data to be sorted.

If we assume we use heap sort, then if we *know* that the data is
presorted on (a) then we should be able to emit tuples directly that the
value of (a) changes and keep emitting them until the heap is empty,
since they will exit the heap in (a,b) order.

The normal pattern of a sort node is to read all of its input and then
begin emitting rows, so the sort is completed before the first row is
returned (OK, lets gloss over the final merge bit for now). So what I'm
suggesting is a sort node that will switch back and forth between
consuming and emitting rows. In some cases it *may* still need to scroll
to disk, but we already have that code.

In this way we can completely avoid writing data to disk for some types
of large sort, even when we have a data set much larger than available
memory. Of course it only applies when we are refining an existing
well-known sort order.

This is a similar concept to the way we handled dynamic buffering of
data for the merge join in 8.3, so I expect it to have an equally warm
reception ;-)

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sort Refinement

2008-03-20 Thread Sam Mason
On Thu, Mar 20, 2008 at 05:17:22PM +, Simon Riggs wrote:
 Currently, our sort algorithm assumes that its input is unsorted. So if
 your data is sorted on (a) and you would like it to be sorted on (a,b)
 then we need to perform the full sort of (a,b).
 
 For small sorts this doesn't matter much. For larger sorts the heap sort
 algorithm will typically result in just a single run being written to
 disk which must then be read back in. Number of I/Os required is twice
 the total volume of data to be sorted.
 
 If we assume we use heap sort, then if we *know* that the data is
 presorted on (a) then we should be able to emit tuples directly that the
 value of (a) changes and keep emitting them until the heap is empty,
 since they will exit the heap in (a,b) order.

We also have stats to help decide when this will be a win.  For example
if a has a small range (i.e. a boolean) and b has a large range
(i.e. some sequence) then this probably isn't going to be a win and
you're better off using the existing infrastructure.  If it's the other
way around then this is going to be a big win.


  Sam

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sort Refinement

2008-03-20 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Currently, our sort algorithm assumes that its input is unsorted. So if
 your data is sorted on (a) and you would like it to be sorted on (a,b)
 then we need to perform the full sort of (a,b).

Simon, would it be too much to ask that you concentrate on reviewing
existing patches during commit fest?  Trying to get people to think
about random new ideas is about the most direct undermining of the
commit-fest concept that I can think of.  Save 'em for later.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sort Refinement

2008-03-20 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 If we assume we use heap sort, then if we *know* that the data is
 presorted on (a) then we should be able to emit tuples directly that the
 value of (a) changes and keep emitting them until the heap is empty,
 since they will exit the heap in (a,b) order.

Actually, I would think the way to do this would be to do a quicksort if you
find you've accumulated all the records in a subgroup in memory. One easy way
to do it would be to have nodeSort build a new tuplesort for each subgroup if
it has a level break key parameter set (memories of RPG III are coming
bubbling to the surface).

What I wonder is what the optimal thing to do really is if a level doesn't fit
in memory. Is it best to do a disk sort just of that level and then return to
accumulating levels one by one in memory? Or is it best to fail over to a
single disk sort of all the remaining tuples?

Also, I wonder how expensive checking the level break key on every tuple will
be. I don't think it invalidates the approach but it has to be taken into
account.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sort Refinement

2008-03-20 Thread Simon Riggs
On Thu, 2008-03-20 at 18:08 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Currently, our sort algorithm assumes that its input is unsorted. So if
  your data is sorted on (a) and you would like it to be sorted on (a,b)
  then we need to perform the full sort of (a,b).
 
 Simon, would it be too much to ask that you concentrate on reviewing
 existing patches during commit fest?  Trying to get people to think
 about random new ideas is about the most direct undermining of the
 commit-fest concept that I can think of.  Save 'em for later.

That's a fair reminder, thanks, I will do that.

I have no wish at all to undermine things.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers