Re: Insertion Sort Improvements
> That's worth trying out. It might also then be worth trying to push both
> unordered values -- the big one up / the small one down. I've seen other
> implementations do that, but don't remember where, or what it's called.

It is important that we do not do 2 compares to avoid one copy (assignment to temporary) as you did in your patch earlier in this thread, because compares are usually pretty costly (also, two compares are then inlined, bloating the binary). Assigning a sort tuple to a temporary translates to pretty simple assembly code, so my suggested modification should outperform it. It cuts the cost of the inner loop by ca. 40%, comparing the assembly. And it avoids having to re-read memory on each comparison, as the temporary can be held in registers.

> "Unbounded" means no bounds check on the array. That's not possible in its
> current form, so I think you misunderstood something.

Sorry for the confusion. I didn't mean unbounded in the "array bounds checking" sense, but in the "unrestricted number of loops" sense.

> I only remember implementations tracking loop iterations, not swaps. You'd
> need evidence that this is better.

Well, the idea was to include the presorted check somehow. Stopping after a certain number of iterations is surely safer than stopping after a number of swaps, but we would then implicitly also stop our presort check. We could change that, though: count loop iterations and, on bailout, continue with a pure presort check -- not all over again, but from the last position of the insertion sort, by comparing against the maximum value recorded during the insertion sort. Thoughts?

> An important part not mentioned yet: This might only be worth doing if the
> previous recursion level performed no swaps during partitioning and the
> current pivot candidates are ordered.

Agreed.

> It's currently 7, but should really be something like 10. A script that
> repeats tests for, say, 7 through 18 should show a concave-up shape in the
> times. The bottom of the bowl should shift to higher values, and that minimum
> is what should be compared.

Yeah, as alluded to before, it should be closer to 10 nowadays. In any case it should be changed at least from 7 to 8, because then we could at least discard the additional check for n > 7 in the quicksort code path (see /src/include/lib/sort_template.h#L322). Currently we check n < 7, and a few lines down we check for n > 7; if we check n < 8 for insertion sort, then the second check becomes obsolete.

Benjamin Coutu
http://www.zeyos.com
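The "push both unordered values" technique quoted above is sometimes called pair insertion sort (a variant of it appears, e.g., in the JDK's dual-pivot quicksort). A minimal sketch for plain ints, to make the idea concrete -- names are illustrative, not proposed PostgreSQL code: each outer pass picks up two adjacent elements, sifts the larger one down first, and lets the smaller one continue from where the larger stopped, saving roughly half the inner-loop traversals.

```c
void
pair_insertion_sort(int *a, int n)
{
	int			i = 1;

	for (; i + 1 < n; i += 2)
	{
		int			hi = a[i];
		int			lo = a[i + 1];
		int			j;

		if (hi < lo)			/* ensure hi is the larger of the pair */
		{
			int			t = hi;

			hi = lo;
			lo = t;
		}

		/* sift the larger element down, shifting by two positions */
		for (j = i - 1; j >= 0 && a[j] > hi; j--)
			a[j + 2] = a[j];
		a[j + 2] = hi;

		/* the smaller element continues from where the larger stopped */
		for (; j >= 0 && a[j] > lo; j--)
			a[j + 1] = a[j];
		a[j + 1] = lo;
	}

	/* odd length: insert the last element conventionally */
	if (i < n)
	{
		int			tmp = a[i];
		int			j;

		for (j = i - 1; j >= 0 && a[j] > tmp; j--)
			a[j + 1] = a[j];
		a[j + 1] = tmp;
	}
}
```

Whether the saved traversals outweigh the extra bookkeeping would of course need the same kind of benchmarking as the single-assignment variant.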
Re: Insertion Sort Improvements
On Tue, May 23, 2023 at 4:10 PM Benjamin Coutu wrote:
>
> Greetings,
>
> I would like to revisit the discussion and concur with John's perspective
> that incremental progress through small, successive modifications is the
> appropriate approach to move forward. Therefore, I would like to propose two
> distinct ideas aimed at enhancing the functionality of insertion sort:
>
> 1. Implementation of a more efficient variant (as described in the
> introductory part of this thread):

That's worth trying out. It might also then be worth trying to push both unordered values -- the big one up / the small one down. I've seen other implementations do that, but don't remember where, or what it's called.

> 2. It appears that there is a consensus against disregarding the presorted
> check, despite its questionable value. In light of this, an alternative
> suggestion is to integrate the presorted check into the insertion sort path
> by utilizing an unbounded insertion sort.

"Unbounded" means no bounds check on the array. That's not possible in its current form, so I think you misunderstood something.

> Only after a maximum number of swaps have occurred should we abandon the
> sorting process.

I only remember implementations tracking loop iterations, not swaps. You'd need evidence that this is better.

> If insertion sort is executed on the entire array without any swaps, we can
> simply return. If not, and we continue with quicksort after the swap limit
> has been reached, we at least have left the array in a more sorted state,
> which may likely be advantageous for subsequent iterations.

An important part not mentioned yet: This might only be worth doing if the previous recursion level performed no swaps during partitioning and the current pivot candidates are ordered. That's a bit of work and might not be convenient now -- it'd be trivial with dual-pivot, but I've not looked at that in a while. (Fittingly, dual-pivot requires a higher insertion sort threshold, so it goes both ways.)

> I would greatly appreciate assistance in benchmarking these proposed
> changes. Your collaboration in this matter would be invaluable.

I advise looking in the archives for scripts from previous benchmarks. No need to reinvent the wheel -- it's enough work as it is.

A key thing here for #1 is that improving insertion sort requires increasing the threshold to show the true improvement. It's currently 7, but should really be something like 10. A script that repeats tests for, say, 7 through 18 should show a concave-up shape in the times. The bottom of the bowl should shift to higher values, and that minimum is what should be compared.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Insertion Sort Improvements
Greetings,

I would like to revisit the discussion and concur with John's perspective that incremental progress through small, successive modifications is the appropriate approach to move forward. Therefore, I would like to propose two distinct ideas aimed at enhancing the functionality of insertion sort:

1. Implementation of a more efficient variant (as described in the introductory part of this thread):

OLD:

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
		for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
			 pl -= ST_POINTER_STEP)
			DO_SWAP(pl, pl - ST_POINTER_STEP);

NEW:

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
	{
		ST_POINTER_TYPE tmp = *pm;

		for (pl = pm - ST_POINTER_STEP;
			 pl >= a && DO_COMPARE(pl, &tmp) > 0;
			 pl -= ST_POINTER_STEP)
			*(pl + ST_POINTER_STEP) = *pl;
		*(pl + ST_POINTER_STEP) = tmp;
	}

2. It appears that there is a consensus against disregarding the presorted check, despite its questionable value. In light of this, an alternative suggestion is to integrate the presorted check into the insertion sort path by utilizing an unbounded insertion sort. Only after a maximum number of swaps have occurred should we abandon the sorting process. If insertion sort is executed on the entire array without any swaps, we can simply return. If not, and we continue with quicksort after the swap limit has been reached, we at least have left the array in a more sorted state, which may likely be advantageous for subsequent iterations.
OLD:

	if (n < 7)
	{
		for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
			 pm += ST_POINTER_STEP)
			for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
				 pl -= ST_POINTER_STEP)
				DO_SWAP(pl, pl - ST_POINTER_STEP);
		return;
	}

	presorted = 1;
	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
	{
		DO_CHECK_FOR_INTERRUPTS();
		if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
		{
			presorted = 0;
			break;
		}
	}
	if (presorted)
		return;

NEW:

#define LIMIT_SWAPS 30			/* to be determined empirically */

	int			swaps = 0;

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
		for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
			 pl -= ST_POINTER_STEP)
		{
			DO_SWAP(pl, pl - ST_POINTER_STEP);
			if (++swaps == LIMIT_SWAPS)
				goto quicksort;
		}
	if (swaps == 0)
		return;

quicksort:

Naturally, both modifications (with point 2 being highly speculative) are currently independent of each other, and it is also crucial to benchmark the combined variant as well as different values for LIMIT_SWAPS.

I would greatly appreciate assistance in benchmarking these proposed changes. Your collaboration in this matter would be invaluable.

Cheers,

Ben
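For what the "combined variant" could look like, here is a hypothetical sketch for plain ints that merges the two ideas: the single-assignment inner loop from point 1 plus a bailout from point 2, here counting element moves rather than full swaps. It returns true if the array was fully sorted before the limit was hit, so the caller can skip quicksort. LIMIT_MOVES and all names are illustrative, not proposed code.

```c
#include <stdbool.h>

#define LIMIT_MOVES 30			/* to be determined empirically */

static bool
insertion_sort_limited(int *a, int n)
{
	int			moves = 0;

	for (int i = 1; i < n; i++)
	{
		int			tmp = a[i];
		int			j;

		for (j = i - 1; j >= 0 && a[j] > tmp; j--)
		{
			a[j + 1] = a[j];	/* single assignment, not a full swap */
			if (++moves == LIMIT_MOVES)
			{
				a[j] = tmp;		/* fill the hole; array stays consistent */
				return false;	/* caller falls back to quicksort */
			}
		}
		a[j + 1] = tmp;
	}
	return true;				/* fully sorted, caller can return */
}
```

On bailout the array is left a valid (and more sorted) permutation of the input, matching the rationale above that partial work should still help subsequent quicksort passes.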
Re: Insertion Sort Improvements
> "Looks like"?

I cannot find the reference, but I've read a while back that a well-known company from Redwood uses it for their in-memory columnar storage. That might have just been a rumor, or might have been research only -- not sure. It does not really matter anyway.

> SortTuples are currently 24 bytes, and supported vector registers are 16
> bytes, so not sure how you think that would work.

The thought was to logically group multiple sort tuples together and then create a vectorized version of that group with just the primitive-type sort key as well as a small index/offset into that sort group, to later swap the corresponding sort tuple referenced by that index/offset. The sorting network would allow us to do branchless, register-based sorting for a particular sort run. I guess this idea is moot, given ...

> More broadly, the more invasive a change is, the more risk is involved, and
> the more effort to test and evaluate. If you're serious about trying to
> improve insertion sort performance, the simple idea we discussed at the start
> of the thread is a much more modest step that has a good chance of justifying
> the time put into it. That is not to say it's easy, however, because testing
> is a non-trivial amount of work.

I absolutely agree. Let's concentrate on improving things incrementally.

Please excuse my wondering -- given that you have contributed some of the recent vectorization work and have obviously experimented a lot with the sort code -- whether you might already have tried something along those lines or researched the subject. It is definitely a very interesting topic.

Cheers,

Ben
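The grouping idea can be illustrated in scalar form (all names hypothetical; a real implementation would do the compare-exchanges with vector min/max instructions rather than a scalar branch): pack each tuple's primitive sort key together with its offset in the group, run a 5-comparator sorting network on those small pairs, then permute the full-size tuples once at the end.

```c
#include <stdint.h>
#include <string.h>

typedef struct
{
	int64_t		key;			/* primitive sort key */
	uint8_t		idx;			/* offset of the full tuple in the group */
} KeyRef;

/* compare-exchange; a vector version would use min/max, branch-free */
static inline void
cmp_exchange(KeyRef *a, KeyRef *b)
{
	if (a->key > b->key)
	{
		KeyRef		t = *a;

		*a = *b;
		*b = t;
	}
}

typedef struct
{
	int64_t		key;
	char		payload[16];	/* stand-in for the rest of a sort tuple */
} BigTuple;

static void
sort4_network(BigTuple *tuples)
{
	KeyRef		r[4];
	BigTuple	out[4];

	for (int i = 0; i < 4; i++)
	{
		r[i].key = tuples[i].key;
		r[i].idx = (uint8_t) i;
	}

	/* optimal 5-comparator sorting network for 4 elements */
	cmp_exchange(&r[0], &r[1]);
	cmp_exchange(&r[2], &r[3]);
	cmp_exchange(&r[0], &r[2]);
	cmp_exchange(&r[1], &r[3]);
	cmp_exchange(&r[1], &r[2]);

	/* apply the resulting permutation to the full tuples in one pass */
	for (int i = 0; i < 4; i++)
		out[i] = tuples[r[i].idx];
	memcpy(tuples, out, sizeof(out));
}
```

The point of the indirection is that only the small key/index pairs move through the network; the 24-byte tuples are touched exactly once.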
Re: Insertion Sort Improvements
On Tue, Sep 27, 2022 at 4:39 PM Benjamin Coutu wrote:
>
> Getting back to improvements for small sort runs, it might make sense to
> consider using in-register based sorting via sorting networks for very small
> runs.
> It looks like some of the commercial DBMSs do this very successfully.

"Looks like"?

> They use 4 512bit registers (AVX-512) in this example, which could
> technically store 4 * 4 sort-elements (given that the sorting key is 64 bit
> and the tuple pointer is 64 bit). I wonder whether this could also be done
> with just SSE (instead of AVX), which the project now readily supports thanks
> to your recent efforts?

SortTuples are currently 24 bytes, and supported vector registers are 16 bytes, so not sure how you think that would work.

More broadly, the more invasive a change is, the more risk is involved, and the more effort to test and evaluate. If you're serious about trying to improve insertion sort performance, the simple idea we discussed at the start of the thread is a much more modest step that has a good chance of justifying the time put into it. That is not to say it's easy, however, because testing is a non-trivial amount of work.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Insertion Sort Improvements
Getting back to improvements for small sort runs, it might make sense to consider using in-register sorting via sorting networks for very small runs. This talk is specific to database sorting and illustrates how such an approach can be vectorized: https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090

It looks like some of the commercial DBMSs do this very successfully. They use 4 512-bit registers (AVX-512) in this example, which could technically store 4 * 4 sort elements (given that the sorting key is 64 bit and the tuple pointer is 64 bit). I wonder whether this could also be done with just SSE (instead of AVX), which the project now readily supports thanks to your recent efforts?
Re: Insertion Sort Improvements
On Fri, Aug 26, 2022 at 9:06 PM Benjamin Coutu wrote:
>
> Another idea could be to run a binary insertion sort and use a much higher
> threshold. This could significantly cut down on comparisons (especially with
> presorted runs, which are quite common in real workloads).

Comparisons that must go to the full tuple are expensive enough that this idea might have merit in some cases, but that would be a research project.

> If full binary search turned out to be an issue regarding cache locality, we
> could do it in smaller chunks,

The main issue with binary search is poor branch prediction. Also, if large chunks are bad for cache locality, isn't that a strike against using a "much higher threshold"?

> With fewer comparisons we should start keeping track of swaps and use that
> as an efficient way to determine presortedness. We could change the insertion
> sort threshold to a swap threshold and do insertion sort until we hit the
> swap threshold. I suspect that would make the current presorted check
> obsolete (as binary insertion sort without, or even with a few, swaps should
> be faster than the current presorted check).

The thread you linked to discusses partial insertion sort as a replacement for the pre-sorted check, along with benchmark results and graphs, IIRC. I think it's possibly worth doing, but needs more investigation to make sure the (few) regressions I saw either: 1. were just noise, or 2. can be ameliorated.

As I said in the dual-pivot thread, this would be great for dual pivot, since we could reuse partial insertion sort for choosing the pivots, reducing binary space.

--
John Naylor
EDB: http://www.enterprisedb.com
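The branch-prediction concern mentioned above is the reason "branchless" binary searches exist: the span is halved with a data-independent update that compilers typically lower to a conditional move instead of an unpredictable branch. A generic sketch for ints (illustrative only, not PostgreSQL code):

```c
#include <stddef.h>

/* returns the index of the first element >= key in sorted a[0..n-1] */
static size_t
lower_bound_branchless(const int *a, size_t n, int key)
{
	const int  *base = a;

	while (n > 1)
	{
		size_t		half = n / 2;

		/* data-independent update; typically compiled to a cmov */
		base = (base[half - 1] < key) ? base + half : base;
		n -= half;
	}
	if (n == 1 && base[0] < key)
		base++;
	return (size_t) (base - a);
}
```

Even in this form, each probe still depends on the previous one's load, so the memory-latency and cache-locality questions raised above remain.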
Re: Insertion Sort Improvements
> convenient if you know the type at compile time. See the attached, which I
> had laying around when I was looking at PDQuicksort. I believe it's similar
> to what you have in mind.

That looks very promising. I also love your recent proposal of partitioning into null and non-null. I suspect that to be a clear winner.

> I think it belongs around 10 now anyway.

Yeah, I think that change is overdue given modern hardware characteristics (even with the current implementation).

Another idea could be to run a binary insertion sort and use a much higher threshold. This could significantly cut down on comparisons (especially with presorted runs, which are quite common in real workloads).

If full binary search turned out to be an issue regarding cache locality, we could do it in smaller chunks, e.g. do a micro binary search between the current position (I) and the current position minus a chunk size (say 6-12 or so, whatever fits in 1 or 2 cache lines) whenever A[I] < A[I-1], and if we don't find the position within that chunk, we continue with the previous chunk, thereby preserving cache locality.

With fewer comparisons, we should start keeping track of swaps and use that as an efficient way to determine presortedness. We could change the insertion sort threshold to a swap threshold and do insertion sort until we hit the swap threshold. I suspect that would make the current presorted check obsolete (as binary insertion sort without, or even with a few, swaps should be faster than the current presorted check).

Cheers,

Ben
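For reference, the binary insertion sort idea in its plain (unchunked) form looks like this for ints -- a hedged sketch, not proposed template code: the insertion point is found with an upper-bound binary search (keeping the sort stable), then the run is shifted in one block move, so the comparison count drops to O(log i) per element while the data movement stays the same.

```c
#include <string.h>

static void
binary_insertion_sort(int *a, int n)
{
	for (int i = 1; i < n; i++)
	{
		int			tmp = a[i];
		int			lo = 0;
		int			hi = i;

		/* only search when the element is actually out of place */
		if (a[i - 1] <= tmp)
			continue;

		/* upper-bound search: leftmost position with a[pos] > tmp */
		while (lo < hi)
		{
			int			mid = lo + (hi - lo) / 2;

			if (a[mid] <= tmp)
				lo = mid + 1;
			else
				hi = mid;
		}

		/* shift a[lo..i-1] up by one and place tmp */
		memmove(&a[lo + 1], &a[lo], (i - lo) * sizeof(int));
		a[lo] = tmp;
	}
}
```

The chunked variant described above would simply restrict lo to max(0, i - chunk_size) per round before widening.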
Re: Insertion Sort Improvements
On Thu, Aug 25, 2022 at 5:55 PM Benjamin Coutu wrote:
>
> Hello,
>
> Inspired by the recent discussions[1][2] around sort improvements, I took a
> look around the code and noticed the use of a somewhat naive version of
> insertion sort within the broader quicksort code.
>
> The current implementation (see sort_template.h) is practically the textbook
> version of insertion sort:

Right. I think it's worth looking into. When I tested dual-pivot quicksort, a threshold of 18 seemed best for my inputs, so making insertion sort more efficient could tilt the balance more in favor of dual-pivot. (It already seems slightly ahead, but as I mentioned in the thread you linked, removing branches for null handling would make it more compelling.)

> DO_ASSIGN and DO_COPY macros would have to be declared analogously to
> DO_SWAP via the template.

I don't think you need these macros, since this optimization is only convenient if you know the type at compile time. See the attached, which I had lying around when I was looking at PDQuicksort. I believe it's similar to what you have in mind. (Ignore the part about "unguarded"; it's irrelevant out of context.) Notice that it avoids unnecessary copies, but has two calls to DO_COMPARE, which is *not* small for tuples.

> Anyways, there might be some low-hanging fruit here. If it turns out to be
> significantly faster one might even consider increasing the insertion sort
> threshold from < 7 to < 10 sort elements. But that is a whole other
> discussion for another day.

I think it belongs around 10 now anyway. I also don't think you'll see much benefit with your proposal at a threshold of 7 -- I suspect it'd be more enlightening to test a range of thresholds with and without the patch, to see how the inflection point shifts. That worked pretty well when testing dual-pivot.
--
John Naylor
EDB: http://www.enterprisedb.com

From fd4bfdbee88478fac32838712c112d91f73c5db5 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Wed, 4 May 2022 23:35:06 +0700
Subject: [PATCH v3 8/8] Use unguarded insertion sort where possible

Since we recurse to the partitions in order, guarding is only needed
at the very beginning of the input. It's not yet clear if this is a
performance win for sort keys with complex comparators.
---
 src/include/lib/sort_template.h        | 58 +++++++++++++++++++++++++-
 src/test/regress/expected/geometry.out |  2 +-
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
index 3b3c4bc128..ae76fd39e8 100644
--- a/src/include/lib/sort_template.h
+++ b/src/include/lib/sort_template.h
@@ -79,7 +79,6 @@
  * https://github.com/orlp/pdqsort
  *
  * WIP: Not yet implemented:
- * - use unguarded insertion sort where possible
  * - fall back to heap sort to guard the stack
  *
  * Other techniques used in PDQuicksort, but likely not worth adopting:
@@ -174,6 +173,7 @@ ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE * first, size_t n
 /* sort private helper functions */
 #define ST_INSERTION_SORT ST_MAKE_NAME(ST_SORT, insertion_sort)
 #define ST_INSERTION_SORT_PARTIAL ST_MAKE_NAME(ST_SORT, insertion_sort_partial)
+#define ST_INSERTION_SORT_UNGUARDED ST_MAKE_NAME(ST_SORT, insertion_sort_unguarded)
 #define ST_PARTITION_LEFT ST_MAKE_NAME(ST_SORT, partition_left)
 #define ST_PARTITION_RIGHT ST_MAKE_NAME(ST_SORT, partition_right)
 #define ST_SORT_INTERNAL ST_MAKE_NAME(ST_SORT, internal)
@@ -213,6 +213,12 @@ ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE * first, size_t n
 			ST_SORT_INVOKE_COMPARE \
 			ST_SORT_INVOKE_ARG)
 
+#define DO_INSERTION_SORT_UNGUARDED(begin_, end_) \
+	ST_INSERTION_SORT_UNGUARDED((begin_), (end_) \
+			ST_SORT_INVOKE_ELEMENT_SIZE \
+			ST_SORT_INVOKE_COMPARE \
+			ST_SORT_INVOKE_ARG)
+
 #define DO_PARTITION_LEFT(begin_, end_) \
 	ST_PARTITION_LEFT((begin_), (end_) \
 			ST_SORT_INVOKE_ELEMENT_SIZE \
@@ -402,6 +408,48 @@ ST_INSERTION_SORT_PARTIAL(ST_POINTER_TYPE * begin,
 	return true;
 }
 
+static inline void
+ST_INSERTION_SORT_UNGUARDED(ST_POINTER_TYPE * begin,
+							ST_POINTER_TYPE * end
+							ST_SORT_PROTO_ELEMENT_SIZE
+							ST_SORT_PROTO_COMPARE
+							ST_SORT_PROTO_ARG)
+{
+	ST_POINTER_TYPE *cur;
+	ST_POINTER_TYPE *sift;
+	ST_POINTER_TYPE *sift_1;
+
+	if (begin == end)
+		return;
+
+	for (cur = begin + ST_POINTER_STEP; cur < end; cur += ST_POINTER_STEP)
+	{
+#ifndef ST_ELEMENT_TYPE_VOID
+		sift = cur;
+		sift_1 = cur - 1;
+
+		if (DO_COMPARE(sift_1, sift) > 0)
+		{
+			ST_ELEMENT_TYPE tmp;
+
+			tmp = *sift;
+
+			do
+			{
+				*sift-- = *sift_1;
+				sift_1--;		/* Avoid multiple evaluation in DO_COMPARE */
+			} while (DO_COMPARE(sift_1, &tmp) > 0);
+
+			*sift = tmp;
+		}
+#else
+		for (sift = cur, sift_1 = cur - ST_POINTER_STEP;
+			 DO_COMPARE(sift_1, sift) > 0;
+			 sift -= ST_POINTER_STEP, sift_1 -= ST_POINTER_STEP)
+			DO_SWAP(sift, sift_1);
+#endif
+	}
+}
+
 /*
  * Partitions [begin, end) around pivot *begin. Elements equal
Insertion Sort Improvements
Hello,

Inspired by the recent discussions[1][2] around sort improvements, I took a look around the code and noticed the use of a somewhat naive version of insertion sort within the broader quicksort code.

The current implementation (see sort_template.h) is practically the textbook version of insertion sort:

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
		for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
			 pl -= ST_POINTER_STEP)
			DO_SWAP(pl, pl - ST_POINTER_STEP);

I propose instead the slightly more efficient variant of insertion sort where only a single assignment, instead of a fully-fledged swap, is performed in the inner loop:

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
	{
		DO_COPY(pm_temp, pm);	/* pm_temp <- copy of pm */
		pl = pm - ST_POINTER_STEP;
		for (; pl >= a && DO_COMPARE(pl, pm_temp) > 0; pl -= ST_POINTER_STEP)
			DO_ASSIGN(pl + ST_POINTER_STEP, pl);	/* pl + 1 <- pl */
		DO_COPY(pl + ST_POINTER_STEP, pm_temp);		/* pl + 1 <- copy of pm_temp */
	}

DO_ASSIGN and DO_COPY macros would have to be declared analogously to DO_SWAP via the template.

There is obviously a trade-off involved here, as O(1) extra memory is required to hold the temporary variable, and DO_COPY might be expensive if the sort element is large. In the case of single-datum sort with trivial data types this would not be a big issue. For large tuples, on the other hand, it could mean a significant overhead that might not be made up for by the improved inner loop. One might want to limit this algorithm to a certain maximum tuple size and use the original insertion sort version for larger elements (this could be decided at compile time via sort_template.h).

Anyways, there might be some low-hanging fruit here. If it turns out to be significantly faster, one might even consider increasing the insertion sort threshold from < 7 to < 10 sort elements. But that is a whole other discussion for another day.
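For illustration, here is what the proposed variant boils down to when instantiated for plain ints (a self-contained sketch, not the template code itself): the inner loop performs one assignment per step instead of a three-assignment swap, and the temporary can live in a register.

```c
static void
insertion_sort_single_assign(int *a, int n)
{
	for (int i = 1; i < n; i++)
	{
		int			tmp = a[i];	/* DO_COPY(pm_temp, pm) */
		int			j;

		for (j = i - 1; j >= 0 && a[j] > tmp; j--)
			a[j + 1] = a[j];	/* DO_ASSIGN(pl + 1, pl) */
		a[j + 1] = tmp;			/* DO_COPY(pl + 1, pm_temp) */
	}
}
```

Note that the comparison is against the register-held temporary rather than re-reading the element being inserted from memory on every iteration.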
Has anyone tested such an approach before? Please let me know your thoughts.

Cheers,

Benjamin

[1] https://www.postgresql.org/message-id/flat/CAFBsxsHanJTsX9DNJppXJxwg3bU%2BYQ6pnmSfPM0uvYUaFdwZdQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw%40mail.gmail.com

--
Benjamin Coutu
http://www.zeyos.com