Re: [HACKERS] Re: Which qsort is used
On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout
kleptog@svana.org wrote:
> But where are you including the cost to check how many cells are
> already sorted?  That would be O(H), right?

Yes.  I didn't mention it, because H <= N.

> This is where we come back to the issue that comparisons in PostgreSQL
> are expensive.

So we agree that we should try to reduce the number of comparisons.
How many comparisons does it take to sort 10^5 items?  1.5 million?

> Hmm, what are the chances you have 10^5 unordered items to sort and
> that the first 8% will already be in order.  ISTM that that
> probability will be close enough to zero to not matter...

If the items are totally unordered, the check is so cheap you won't
even notice.  OTOH in Tom's example ...

|What I think is much more probable in the Postgres environment
|is almost-but-not-quite-ordered inputs --- eg, a table that was
|perfectly ordered by key when filled, but some of the tuples have since
|been moved by UPDATEs.

... I'd not be surprised if H is 90% of N.

Servus
 Manfred

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Re: Which qsort is used
An interesting article on sorting and comparison count:
http://www.acm.org/jea/ARTICLES/Vol7Nbr5.pdf

Here is the article, the code, and an implementation that I have been
toying with:
http://cap.connx.com/chess-engines/new-approach/algos.zip

Algorithm quickheap is especially interesting because it does not
require much additional space (just an array of integers up to size
log(element_count)) and, in addition, it has very few data movements.

-----Original Message-----
From: Manfred Koizar [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 22, 2005 1:59 PM
To: Martijn van Oosterhout
Cc: Tom Lane; Dann Corbit; Qingqing Zhou; Bruce Momjian; Luke Lonergan;
Neil Conway; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Re: Which qsort is used
Re: [HACKERS] Re: Which qsort is used
On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> I've still got a problem with these checks; I think they are a net
> waste of cycles on average.  [...] and when they fail, those cycles
> are entirely wasted; you have not advanced the state of the sort at
> all.

How can we make the initial check advance the state of the sort?  One
answer might be to exclude the sorted sequence at the start of the
array from the qsort, and merge the two sorted lists as the final
stage of the sort.

Qsorting N elements costs O(N*lnN), so excluding H elements from the
sort reduces the cost by at least O(H*lnN).  The merge step costs O(N)
plus some (<=50%) more memory, unless someone knows a fast in-place
merge.  So depending on the constant factors involved there might be a
usable solution.  I've been playing with some numbers and assuming the
constant factors to be equal for all the O()'s this method starts to
pay off at

  H      for N
  20     100
  130    1000
  8000   10^5

Servus
 Manfred
Re: [HACKERS] Re: Which qsort is used
On Thu, Dec 22, 2005 at 01:43:34AM +0100, Manfred Koizar wrote:
> Qsorting N elements costs O(N*lnN), so excluding H elements from the
> sort reduces the cost by at least O(H*lnN).  The merge step costs
> O(N) plus some (<=50%) more memory, unless someone knows a fast
> in-place merge.  So depending on the constant factors involved there
> might be a usable solution.

But where are you including the cost to check how many cells are
already sorted?  That would be O(H), right?

This is where we come back to the issue that comparisons in PostgreSQL
are expensive.  The cpu_cost in the tests I saw so far is
unrealistically low.

> I've been playing with some numbers and assuming the constant factors
> to be equal for all the O()'s this method starts to pay off at
>
>   H      for N
>   20     100      (20%)
>   130    1000     (13%)
>   8000   10^5     (8%)

Hmm, what are the chances you have 10^5 unordered items to sort and
that the first 8% will already be in order.  ISTM that that probability
will be close enough to zero to not matter...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
  Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
  a tool for doing 5% of the work and then sitting around waiting for
  someone else to do the other 95% so you can sue them.
Re: [HACKERS] Re: Which qsort is used
On Fri, Dec 16, 2005 at 10:43:58PM -0800, Dann Corbit wrote:
> I am actually quite impressed with the excellence of Bentley's sort
> out of the box.  It's definitely the best library implementation of a
> sort I have seen.

I'm not sure whether we have a conclusion here, but I do have one
question: is there a significant difference in the number of times the
comparison routines are called?  Comparisons in PostgreSQL are fairly
expensive given the fmgr overhead, and when comparing tuples it's even
worse.  We don't want to accidentally pick a routine that saves data
shuffling by adding extra comparisons.

The stats at [1] don't say.  They try to factor in CPU cost but they
seem to use unrealistically small values.  I would think a number
around 50 (or higher) would be more representative.

[1] http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Re: [HACKERS] Re: Which qsort is used
Martijn,

On 12/19/05 3:37 AM, Martijn van Oosterhout kleptog@svana.org wrote:
> I'm not sure whether we have a conclusion here, but I do have one
> question: is there a significant difference in the number of times
> the comparison routines are called?  Comparisons in PostgreSQL are
> fairly expensive given the fmgr overhead and when comparing tuples
> it's even worse.

It would be interesting to note the comparison count of the different
routines.

Something that really grabbed me about the results, though, is that the
relative performance of the routines shifted dramatically when the
indirect references in the comparators went in.  The first test I did
sorted an array of int4; these tests that Qingqing did sorted arrays
using an indirect pointer list, at which point the same distributions
performed very differently.  I suspect that it is the number of
comparisons that caused this, and further that the indirection has
disabled the compiler optimizations for memory prefetch and other
things that it could normally recognize.  Given the usage pattern in
Postgres, where sorted things are a mix of strings and intrinsic types,
I'm not sure those optimizations could be done by one routine.

I haven't verified this, but it certainly seems that the NetBSD routine
is the overall winner for the type of use that Postgres has (sorting
using a pointer list).

- Luke
Re: [HACKERS] Re: Which qsort is used
Luke Lonergan wrote:
> Tom,
>
> On 12/12/05 2:47 PM, Tom Lane [EMAIL PROTECTED] wrote:
>> As those results suggest, there can be huge differences in sort
>> performance depending on whether the input is random, nearly sorted,
>> nearly reverse sorted, possesses many equal keys, etc.  It would be
>> very dangerous to say implementation A is better than implementation
>> B without having tested all those scenarios.
>
> Yes.  The Linux glibc qsort is proven terrible in the general case by
> these examples though.  Bruce's point on that thread was that we
> shouldn't be replacing the OS routine in the general case.  On the
> other hand, there is the precedent of replacing Solaris' routine with
> the NetBSD version.

At this point, I think we have done enough testing on enough platforms
to just use port/qsort on all platforms in 8.2.  It seems whenever
someone tries to improve the BSD qsort, they make it worse.  Were the
BSD programmers geniuses and we are all idiots now, or what?

-- 
  Bruce Momjian                 |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us        |  (610) 359-1001
  + If your life is a hard drive, |  13 Roberts Road
  + Christ can be your backup.    |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Re: Which qsort is used
On Fri, 16 Dec 2005, Bruce Momjian wrote:
> At this point, I think we have done enough testing on enough platforms
> to just use port/qsort on all platforms in 8.2.  It seems whenever
> someone tries to improve the BSD qsort, they make it worse.

Not necessarily true.  Dann Corbit sent me an implementation a while
ago (you can see it on the same site).  BSD qsort is improved, though
not by much, by two methods.  But since Dann wrote the program from
scratch, I am not sure whether we can afford the effort needed to adopt
the improvement.

Regards,
Qingqing
Re: [HACKERS] Re: Which qsort is used
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Qingqing Zhou
Sent: Friday, December 16, 2005 5:14 PM
To: Bruce Momjian
Cc: Luke Lonergan; Tom Lane; Neil Conway; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Re: Which qsort is used

> Dann Corbit sent me an implementation a while ago (you can see it on
> the same site).  BSD qsort is improved, though not by much, by two
> methods.

Both of them are modified versions of Bentley's sort algorithm.  I just
added the in-order check to both of them, and the reversed-partition
check for the second method (in the case of the subfiles, because
insertion sort is allergic to reversed partitions).

At any rate, neither is much of an improvement on Bentley's version.
For the rare cases of completely ordered or completely reversed input,
it will be a monumental improvement.  But that is a pretty rare case.

If I could use C++, I could do much better.  It is very difficult for
me to write an ADT in C instead of in C++.
Re: [HACKERS] Re: Which qsort is used
Dann Corbit [EMAIL PROTECTED] writes:
> Both of them are modified versions of Bentley's sort algorithm.

Jon Bentley of Bell Labs?  Small world ... he was my thesis adviser for
awhile when he was at CMU.  He's a good algorithms man, for sure.

> I just added the in-order check to both of them, and the reversed
> partition check for the second method (in the case of the subfiles
> because insertion sort is allergic to reversed partitions).

I've still got a problem with these checks; I think they are a net
waste of cycles on average.  They would only be a win if you expected a
nontrivial percentage of perfectly-in-order or perfectly-reverse-order
inputs.  What I think is much more probable in the Postgres environment
is almost-but-not-quite-ordered inputs --- eg, a table that was
perfectly ordered by key when filled, but some of the tuples have since
been moved by UPDATEs.  This is the worst possible case for the
in-order checks, because they can then grovel over large percentages of
the file before failing ... and when they fail, those cycles are
entirely wasted; you have not advanced the state of the sort at all.

For the usual case of randomly ordered input, of course it doesn't
matter much at all because the in-order checks will fail after
examining not too many items.  But to argue that the checks are worth
making, you have to assume that perfectly-ordered inputs are more
common than almost-ordered inputs, and I think that is exactly the
wrong assumption for the Postgres environment.  I sure haven't seen any
evidence that it's a good assumption.

			regards, tom lane
Re: [HACKERS] Re: Which qsort is used
-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 16, 2005 9:03 PM
To: Dann Corbit
Subject: Re: [HACKERS] Re: Which qsort is used

> I've still got a problem with these checks; I think they are a net
> waste of cycles on average.  They would only be a win if you expected
> a nontrivial percentage of perfectly-in-order or
> perfectly-reverse-order inputs.

The benchmarks say that they (order checks) are a good idea on average
for ordered data, random data, and partly ordered data.

It does not require perfectly ordered data for the checks to be useful.
On mostly ordered data, it is likely that some partitions are perfectly
ordered.  If you trace the algorithms in a debugger you will be
surprised at how often the partitions are ordered, even with random
sets as input.
Re: [HACKERS] Re: Which qsort is used
On Sat, 17 Dec 2005, Dann Corbit wrote:
> The benchmarks say that they (order checks) are a good idea on average
> for ordered data, random data, and partly ordered data.

My reading is that on Linux, N = 500 seems to be the dividing line for
qsortpdq: below that it wins; above that, BSD qsort wins more often.
On SunOS, qsortpdq keeps the lead till the last second -- I suspect
this is due to the rand() function:

Linux - #define RAND_MAX 2147483647
SunOS - #define RAND_MAX 32767

So on SunOS the data is actually not that scattered -- and therefore
more favorable for the sorted() or reversed() checks?

Regards,
Qingqing
Re: [HACKERS] Re: Which qsort is used
-----Original Message-----
From: Qingqing Zhou [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 16, 2005 10:13 PM
To: Dann Corbit
Subject: RE: [HACKERS] Re: Which qsort is used

> So on SunOS the data is actually not that scattered - so more
> favorable for the sorted() or reversed() check?

There is a lot of variability from system to system, even for the same
tests.  I see different results depending on whether I use the GCC,
Intel, or Microsoft compilers.
Re: [HACKERS] Re: Which qsort is used
Dann Corbit [EMAIL PROTECTED] writes:
>> I've still got a problem with these checks; I think they are a net
>> waste of cycles on average.
>
> The benchmarks say that they (order checks) are a good idea on average
> for ordered data, random data, and partly ordered data.

There are lies, damn lies, and benchmarks ;-)  The problem with citing
a benchmark for this discussion is that a benchmark can't tell you
anything about real-world probabilities; it only tells you about the
probabilities occurring in the benchmark case.  You need to make the
case that the benchmark reflects the real world, which you didn't.

> If you trace the algorithms in a debugger you will be surprised at
> how often the partitions are ordered, even with random sets as input.

Well, I do agree that checking for orderedness on small partitions
would succeed more often than on larger partitions or the whole file
--- but the code-as-given checks all the way down.

Moreover, the argument given for spending these cycles is that
insertion sort sucks on reverse-order input ... where sucks means
that it spends O(N^2) time.  But it spends O(N^2) in the average case,
too.

			regards, tom lane
Re: [HACKERS] Re: Which qsort is used
-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 16, 2005 10:41 PM
To: Dann Corbit
Subject: Re: [HACKERS] Re: Which qsort is used

> Well, I do agree that checking for orderedness on small partitions
> would succeed more often than on larger partitions or the whole file
> --- but the code-as-given checks all the way down.

I agree that in general these checks are not important, and they
complicate what is a simple and elegant algorithm.  The cases where the
checks are important (highly ordered data sets) are rare, and so the
value added is minimal.

I am actually quite impressed with the excellence of Bentley's sort out
of the box.  It's definitely the best library implementation of a sort
I have seen.