Last month I wrote:
It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I re-ran Gary's test script using the just-committed improvements to
qsort.c, and got pretty nice numbers (attached --- compare to
-Original Message-
From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Wednesday, February 15, 2006 5:22 PM
To: Ron
Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index behaviour)
Ron [EMAIL PROTECTED] writes: How are we choosing our pivots? See
qsort.c: it looks like median of nine
--On Thursday, February 16, 2006 10:39:45 -0800 Dann Corbit
[EMAIL PROTECTED] wrote:
He refers to counting sort and radix sort (which comes in
most-significant-digit and least-significant-digit variants). These are
also called distribution (as opposed to comparison) sorts.
These sorts are
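As a concrete sketch of the distribution-sort idea (my own minimal illustration, not code from the thread): a least-significant-digit radix sort over fixed-width 32-bit keys runs in O(N) using only counting passes, no comparisons at all.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on unsigned 32-bit keys, one byte per pass.
   Each pass is a stable counting sort, so the whole run is O(N)
   with no comparisons at all. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL || n == 0) {
        free(tmp);
        return;
    }
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* histogram of the current byte */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xffu) + 1]++;
        /* prefix sums turn counts into bucket start offsets */
        for (int b = 0; b < 256; b++)
            count[b + 1] += count[b];
        /* stable scatter into tmp, then copy back */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xffu]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

The catch for a database, as the follow-ups note, is that this only works when keys reduce to fixed-width digits, which locales and variable-length types make hard.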
On Fri, Feb 17, 2006 at 09:18:39AM +0100, Jens-Wolfhard Schicke wrote:
What I see as the biggest problem is the digit representation necessary
for Radix-Sort in cases of locales which sort without looking at spaces. I
assume that would be hard to implement. The same goes for the proposed
Hi, David,
David Lang schrieb:
In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested). The sorting key doesn't need to be a
one-to-one mapping.
that would violate your second constraint ( f(a)==f(b) iff (a==b) )
no, it doesn't.
When both
On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
...and with some work, floats (I think just the exponent would work, if
nothing else). bytea. Probably just
On Fri, Feb 17, 2006 at 08:18:41AM -0800, Scott Lamb wrote:
On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
Data types which could probably provide a useful function for f
would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
...and with some work, floats (I think
On Thu, 2006-02-16 at 21:33 -0800, David Lang wrote:
In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested). The sorting key doesn't need to be a
one-to-one mapping.
that would violate your second constraint ( f(a)==f(b) iff (a==b) )
if
Mark Lewis [EMAIL PROTECTED] writes:
I think we're actually on the same page here; you're right that the
constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types
with more than 32 bits of value space. But the constraint I listed was
actually:
if a==b then f(a)==f(b)
I
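A minimal sketch of such an f for SQL_ASCII text, under the first-four-characters scheme discussed above (the function name and packing are my own illustration, not thread code). Note it satisfies only Mark Lewis's weak constraints, so equal keys mean "fall back to a full comparison", not "the values are equal":

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit sortKey for SQL_ASCII text: pack the first four
   bytes big-endian.  Satisfies the weak constraints:
       a == b  implies  f(a) == f(b)
       a <  b  implies  f(a) <= f(b)
   Strings sharing a 4-byte prefix collide, so a tie on the key must be
   resolved by the ordinary comparator. */
uint32_t text_sortkey(const char *s)
{
    unsigned char buf[4] = {0, 0, 0, 0};
    size_t len = strlen(s);
    memcpy(buf, s, len < 4 ? len : 4);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```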
At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote:
On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
Even better (and more easily scaled as the number of GPR's in the CPU
changes) is to use
the set {L; L+1; L+2; t1; R-2; R-1; R}
This means that instead of 7 random memory accesses, we have
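Ron's point is that three of the samples sit at each end of the range, in adjacent (likely already-cached) memory, so most of the seven probes are not scattered reads. A sketch of picking a pivot from that set (taking t1 as the midpoint, which is my assumption; the function is illustrative, not the proposed patch):

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa, b = *(const int *)pb;
    return (a > b) - (a < b);
}

/* Pivot from the cache-friendlier sample set
   {L, L+1, L+2, mid, R-2, R-1, R}: three contiguous elements at each
   end plus one middle probe, then the median of the seven samples.
   Assumes n >= 7. */
int pivot_from_edges(const int *a, size_t n)
{
    int s[7] = { a[0], a[1], a[2], a[n / 2], a[n - 3], a[n - 2], a[n - 1] };
    qsort(s, 7, sizeof s[0], cmp_int);  /* tiny fixed-size sort */
    return s[3];                        /* median of the samples */
}
```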
Hi, Ron,
Ron wrote:
...and of course if you know enough about the data to be sorted so as to
constrain it appropriately, one should use a non comparison based O(N)
sorting algorithm rather than any of the general comparison based
O(NlgN) methods.
Sounds interesting, could you give us some
Last night I implemented a non-recursive introsort in C... let me test it
a bit more and then I'll post it here for everyone else to try out.
On 2/16/06, Markus Schaber
[EMAIL PROTECTED] wrote:Hi, Ron,
Ron wrote: ...and of course if you know enough about the data to be sorted so as to constrain it
On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
3= Especially in modern systems where the gap between internal CPU
bandwidth and memory bandwidth is so great, the overhead of memory
accesses for comparisons and moves is the majority of the overhead
for both the pivot choosing and the
Markus Schaber wrote:
Ron wrote:
...and of course if you know enough about the data to be sorted so as to
constrain it appropriately, one should use a non comparison based O(N)
sorting algorithm rather than any of the general comparison based
O(NlgN) methods.
Sounds interesting, could you
Craig A. James [EMAIL PROTECTED] writes:
You can also use this trick when the optimizer is asked for fastest first
result. Say you have a cursor on a column of numbers with good
distribution. If you do a bucket sort on the first two or three digits only,
you know the first page of results
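The fastest-first trick can be sketched as follows (my own toy illustration of the idea, with the bucket count and key range as assumptions): scatter values into coarse buckets, then fully sort only the lowest nonempty bucket, so the first page of an ascending cursor is ready before the rest is ever sorted.

```c
#include <stdlib.h>

static int cmp_double(const void *pa, const void *pb)
{
    double a = *(const double *)pa, b = *(const double *)pb;
    return (a > b) - (a < b);
}

/* Return the first "page" of an ascending scan over values in [0,1):
   find the lowest nonempty decile, copy just its members into out,
   and sort only those -- O(k log k) for k << n. */
size_t first_page(const double *in, size_t n, double *out)
{
    int best = 10;
    for (size_t i = 0; i < n; i++) {
        int d = (int)(in[i] * 10.0);    /* coarse bucket number */
        if (d < best)
            best = d;
    }
    if (best == 10)
        return 0;                        /* empty input */
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if ((int)(in[i] * 10.0) == best)
            out[m++] = in[i];
    qsort(out, m, sizeof *out, cmp_double);
    return m;
}
```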
-Original Message-
From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
[EMAIL PROTECTED] On Behalf Of Markus Schaber
Sent: Thursday, February 16, 2006 5:45 AM
To: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index behaviour)
On Thu, 2006-02-16 at 12:35 +0100, Steinar H. Gunderson wrote:
glibc-2.3.5/stdlib/qsort.c:
/* Order size using quicksort. This implementation incorporates
four optimizations discussed in Sedgewick:
I can't see any references to merge sort in there at all.
stdlib/qsort.c defines
On Thu, 2006-02-16 at 12:15 -0500, Tom Lane wrote:
Once or twice we've kicked around the idea of having some
datatype-specific sorting code paths alongside the general-purpose one,
but I can't honestly see this as being workable from a code maintenance
standpoint.
Hi, Mark,
Mark Lewis schrieb:
It seems that instead of maintaining a different sorting code path for
each data type, you could get away with one generic path and one
(hopefully faster) path if you allowed data types to optionally support
a 'sortKey' interface by providing a function f which
On Thu, Feb 16, 2006 at 02:17:36PM -0800, Mark Lewis wrote:
It seems that instead of maintaining a different sorting code path for
each data type, you could get away with one generic path and one
(hopefully faster) path if you allowed data types to optionally support
a 'sortKey' interface by
Markus Schaber [EMAIL PROTECTED] writes:
Hmm, to remove redundancy, I'd change the == to a <= and define:
if a==b then f(a)==f(b)
if a<b then f(a)<=f(b)
Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for
On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
How exactly do you imagine doing this for text?
I could see doing it for
On Thu, 16 Feb 2006, Mark Lewis wrote:
On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
How exactly do you imagine doing this for text?
I could
Gary Doades [EMAIL PROTECTED] writes:
If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.
For some reason I hadn't immediately twigged to the fact that your test
script is just N
Tom Lane wrote:
For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.
It seems clear that our
Gary Doades [EMAIL PROTECTED] writes:
Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?
Yup, anytime you're passing data with that kind of distribution
through a sort.
So the options are:
1) Fix the included qsort.c code and use that
Gary Doades [EMAIL PROTECTED] writes:
Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.
This behavior is consistent with the pivot choosing algorithm
assuming certain distribution(s) for the data. For instance,
median-of-three partitioning is known to be pessimal when the data is
geometrically or hyper-geometrically distributed. Also, care must be
taken that sometimes is not
I wrote:
Gary Doades [EMAIL PROTECTED] writes:
Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on
Ron [EMAIL PROTECTED] writes:
How are we choosing our pivots?
See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices. With half of the
data inputs zero, it's not
-Original Message-
From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Wednesday, February 15, 2006 5:22 PM
To: Ron
Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index behaviour)
Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.
I have to say that I restored a few gigabyte
On Wed, 2006-02-15 at 19:59 -0500, Tom Lane wrote:
I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.
The
On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.
glibc qsort is
Tom Lane [EMAIL PROTECTED] wrote
I did this 100 times and sorted the reported runtimes.
I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
100
Qingqing Zhou [EMAIL PROTECTED] writes:
By "did this 100 times" do you mean generate a sequence of at most
20*100 numbers, and for every 20 numbers, the first half are all
zeros and the other half are uniform random numbers?
No, I mean I ran the bit of SQL script I gave 100 separate
Tom Lane [EMAIL PROTECTED] wrote
Qingqing Zhou [EMAIL PROTECTED] writes:
By "did this 100 times" do you mean generate a sequence of at most
20*100 numbers, and for every 20 numbers, the first half are all
zeros and the other half are uniform random numbers?
No, I mean I ran the bit
Qingqing Zhou [EMAIL PROTECTED] wrote
I must misunderstand something here -- I can't figure out why the cost
of the same procedure keeps climbing?
Ooops, I misinterpreted the sentence -- you sorted the results ...
Regards,
Qingqing
Qingqing Zhou [EMAIL PROTECTED] writes:
Tom Lane [EMAIL PROTECTED] wrote
No, I mean I ran the bit of SQL script I gave 100 separate times.
I must misunderstand something here -- I can't figure out why the cost
of the same procedure keeps climbing?
No, the run cost varies randomly
At 08:21 PM 2/15/2006, Tom Lane wrote:
Ron [EMAIL PROTECTED] writes:
How are we choosing our pivots?
See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.
41 matches