Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
> That's worth trying out. It might also then be worth trying to push both 
> unordered values -- the big one up / the small one down. I've seen other 
> implementations do that, but don't remember where, or what it's called.

It is important that we do not do two compares to avoid one copy (assignment to 
a temporary), as you did in your patch earlier in this thread, because compares 
are usually pretty costly (also, two compares are then inlined, bloating the binary).
Assigning a sort tuple to a temporary translates to pretty simple assembly 
code, so my suggested modification should outperform it. Comparing the assembly, 
it cuts the cost of the inner loop by roughly 40%. It also avoids having to 
re-read memory on each comparison, as the temporary can be held in registers.

> "Unbounded" means no bounds check on the array. That's not possible in its 
> current form, so I think you misunderstood something.

Sorry for the confusion. I didn't mean unbounded in the "array bound checking" 
sense, but in the "unrestricted number of loops" sense.

> I only remember implementations tracking loop iterations, not swaps. You'd 
> need evidence that this is better.

Well, the idea was to include the presorted check somehow. Stopping after a 
certain number of iterations is surely safer than stopping after a number 
of swaps, but we would then implicitly also stop our presort check. We could 
change that, though: count loop iterations, and on bailout continue with a pure 
presort check -- but from the last position of the insertion sort, not all over 
again -- by comparing against the maximum value recorded during the insertion 
sort. Thoughts?
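In plain C on an int array, one way to sketch this idea (function name and the LIMIT_ITERS value are mine, and the real template would go through the DO_* macros; counting completed insertions rather than inner-loop shifts is one interpretation of "loop iterations" that keeps the bailed-out prefix fully sorted):

```c
#include <stdbool.h>
#include <stddef.h>

#define LIMIT_ITERS 4           /* hypothetical bailout threshold, to be tuned */

/*
 * Run insertion sort, counting completed insertions.  On bailout the
 * prefix a[0..i] is already sorted, so a[i] is its maximum and a pure
 * presorted check can resume from there using adjacent comparisons
 * only, instead of starting over from the beginning.
 * Returns true if the whole array is known sorted on exit; on false
 * the caller proceeds with quicksort on a partially sorted array.
 */
static bool
insertion_sort_resumable_check(int *a, size_t n)
{
    size_t iters = 0;
    size_t i;

    for (i = 1; i < n; i++)
    {
        int    tmp = a[i];
        size_t j = i;

        while (j > 0 && a[j - 1] > tmp)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;

        if (++iters == LIMIT_ITERS)
            break;              /* bail out of the full insertion sort */
    }

    if (i >= n)
        return true;            /* insertion sort ran to completion */

    /* resumed presort check from the last insertion-sort position */
    for (size_t k = i + 1; k < n; k++)
        if (a[k - 1] > a[k])
            return false;

    return true;
}
```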

> An important part not mentioned yet: This might only be worth doing if the 
> previous recursion level performed no swaps during partitioning and the 
> current pivot candidates are ordered.

Agreed.

> It's currently 7, but should really be something like 10. A script that 
> repeats tests for, say, 7 through 18 should show a concave-up shape in the 
> times. The bottom of the bowl should shift to higher values, and that minimum 
> is what should be compared.

Yeah, as alluded to before, it should be closer to 10 nowadays.
In any case it should be changed at least from 7 to 8, because then we could at 
least discard the additional check for n > 7 in the quicksort code path (see 
/src/include/lib/sort_template.h#L322). Currently we check n < 7, and a few 
lines down we check for n > 7; if we check n < 8 for insertion sort, the 
second check becomes obsolete.

Benjamin Coutu
http://www.zeyos.com




Re: Insertion Sort Improvements

2023-05-23 Thread John Naylor
On Tue, May 23, 2023 at 4:10 PM Benjamin Coutu  wrote:
>
> Greetings,
>
> I would like to revisit the discussion and concur with John's perspective
> that incremental progress through small, successive modifications is the
> appropriate approach to move forward. Therefore, I would like to propose
> two distinct ideas aimed at enhancing the functionality of insertion sort:
>
> 1. Implementation of a more efficient variant (as described in the
> introductory part of this thread):

That's worth trying out. It might also then be worth trying to push both
unordered values -- the big one up / the small one down. I've seen other
implementations do that, but don't remember where, or what it's called.

> 2. It appears that there is a consensus against disregarding the
> presorted check, despite its questionable value. In light of this, an
> alternative suggestion is to integrate the presorted check into the
> insertion sort path by utilizing an unbounded insertion sort.

"Unbounded" means no bounds check on the array. That's not possible in its
current form, so I think you misunderstood something.

> Only after a maximum number of swaps have occurred should we abandon the
> sorting process.

I only remember implementations tracking loop iterations, not swaps. You'd
need evidence that this is better.

> If insertion sort is executed on the entire array without any swaps, we
> can simply return. If not, and we continue with quicksort after the swap
> limit has been reached, we at least have left the array in a more sorted
> state, which may likely be advantageous for subsequent iterations.

An important part not mentioned yet: This might only be worth doing if the
previous recursion level performed no swaps during partitioning and the
current pivot candidates are ordered. That's a bit of work and might not be
convenient now -- it'd be trivial with dual-pivot, but I've not looked at
that in a while. (Fittingly, dual-pivot requires a higher insertion sort
threshold so it goes both ways.)

> I would greatly appreciate assistance in benchmarking these proposed
> changes. Your collaboration in this matter would be invaluable.

I advise looking in the archives for scripts from previous benchmarks. No
need to reinvent the wheel -- it's enough work as it is. A key thing here
for #1 is that improving insertion sort requires increasing the threshold
to show the true improvement. It's currently 7, but should really be
something like 10. A script that repeats tests for, say, 7 through 18
should show a concave-up shape in the times. The bottom of the bowl should
shift to higher values, and that minimum is what should be compared.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
Greetings,

I would like to revisit the discussion and concur with John's perspective that 
incremental progress through small, successive modifications is the appropriate 
approach to move forward. Therefore, I would like to propose two distinct ideas 
aimed at enhancing the functionality of insertion sort:

1. Implementation of a more efficient variant (as described in the introductory 
part of this thread):

 OLD 

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

 NEW 

for (
pm = a + ST_POINTER_STEP;
pm < a + n * ST_POINTER_STEP;
pm += ST_POINTER_STEP
) {
ST_POINTER_TYPE tmp = *pm;
 
for (
pl = pm - ST_POINTER_STEP;
pl >= a && DO_COMPARE(pl, &tmp) > 0;
pl -= ST_POINTER_STEP
)
*(pl + ST_POINTER_STEP) = *pl;

*(pl + ST_POINTER_STEP) = tmp;
}



2. It appears that there is a consensus against disregarding the presorted 
check, despite its questionable value. In light of this, an alternative 
suggestion is to integrate the presorted check into the insertion sort path by 
utilizing an unbounded insertion sort. Only after a maximum number of swaps 
have occurred should we abandon the sorting process. If insertion sort is 
executed on the entire array without any swaps, we can simply return. If not, 
and we continue with quicksort after the swap limit has been reached, we at 
least have left the array in a more sorted state, which may likely be 
advantageous for subsequent iterations.

 OLD 

if (n < 7)
{
    for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
         pm += ST_POINTER_STEP)
        for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
             pl -= ST_POINTER_STEP)
            DO_SWAP(pl, pl - ST_POINTER_STEP);
    return;
}
presorted = 1;
for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_CHECK_FOR_INTERRUPTS();
    if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
    {
        presorted = 0;
        break;
    }
}
if (presorted)
    return;

 NEW 

#define LIMIT_SWAPS 30 /* to be determined empirically */

int swaps = 0;

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
    {
        DO_SWAP(pl, pl - ST_POINTER_STEP);

        if (++swaps == LIMIT_SWAPS)
            goto quicksort;
    }

if (swaps == 0)
    return;

quicksort:



Naturally, both modifications (with point 2 being highly speculative) are 
currently independent of each other, and it is also crucial to benchmark the 
combined variant as well as different values for LIMIT_SWAPS.
I would greatly appreciate assistance in benchmarking these proposed changes. 
Your collaboration in this matter would be invaluable.

Cheers, Ben




Re: Insertion Sort Improvements

2022-09-28 Thread Benjamin Coutu
> "Looks like"?

I cannot find the reference, but I read a while back that a well-known 
company from Redwood uses it for their in-memory columnar storage. That might 
have just been a rumor, or it might have been research only; I'm not sure. It 
does not really matter anyway.

> SortTuples are currently 24 bytes, and supported vector registers are 16 
> bytes, so not sure how you think that would work.

The thought was to logically group multiple sort tuples together and then 
create a vectorized version of that group containing just the primitive-type 
sort key plus a small index/offset into the group, used afterwards to swap the 
corresponding sort tuple referenced by that index/offset. The sorting network 
would allow us to do branch-less, register-based sorting for a particular sort 
run. I guess this idea is moot, given ...

> More broadly, the more invasive a change is, the more risk is involved, and 
> the more effort to test and evaluate. If you're serious about trying to 
> improve insertion sort performance, the simple idea we discussed at the start 
> of the thread is a much more modest step that has a good chance of justifying 
> the time put into it. That is not to say it's easy, however, because testing 
> is a non-trivial amount of work.

I absolutely agree. Let's concentrate on improving things incrementally.
Please excuse my wondering: given that you have contributed some of the recent 
vectorization work and have obviously experimented a lot with the sort code, I 
thought you might already have tried something along those lines or researched 
the subject. It is definitely a very interesting topic.

Cheers, Ben




Re: Insertion Sort Improvements

2022-09-27 Thread John Naylor
On Tue, Sep 27, 2022 at 4:39 PM Benjamin Coutu  wrote:
>
> Getting back to improvements for small sort runs, it might make sense to
> consider using in-register based sorting via sorting networks for very
> small runs.

> It looks like some of the commercial DBMSs do this very successfully.

"Looks like"?

> They use four 512-bit registers (AVX-512) in this example, which could
> technically store 4 * 4 sort elements (given that the sorting key is 64 bits
> and the tuple pointer is 64 bits). I wonder whether this could also be done
> with just SSE (instead of AVX), which the project now readily supports
> thanks to your recent efforts?

SortTuples are currently 24 bytes, and supported vector registers are 16
bytes, so not sure how you think that would work.

More broadly, the more invasive a change is, the more risk is involved, and
the more effort to test and evaluate. If you're serious about trying to
improve insertion sort performance, the simple idea we discussed at the
start of the thread is a much more modest step that has a good chance of
justifying the time put into it. That is not to say it's easy, however,
because testing is a non-trivial amount of work.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Insertion Sort Improvements

2022-09-27 Thread Benjamin Coutu
Getting back to improvements for small sort runs, it might make sense to 
consider using in-register based sorting via sorting networks for very small 
runs.

This talk is specific to database sorting and illustrates how such an approach 
can be vectorized: 
https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090

It looks like some of the commercial DBMSs do this very successfully. They use 
four 512-bit registers (AVX-512) in this example, which could technically store 
4 * 4 sort elements (given that the sorting key is 64 bits and the tuple 
pointer is 64 bits). I wonder whether this could also be done with just SSE 
(instead of AVX), which the project now readily supports thanks to your recent 
efforts?
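As a scalar illustration of the general idea (a sketch only, not the vectorized version from the talk; all names here are mine), a sorting network is a fixed sequence of compare-exchanges with no data-dependent branches, which compilers typically lower to conditional moves:

```c
/* Branch-free compare-exchange: afterwards *x <= *y. */
static inline void
cmp_swap(long *x, long *y)
{
    long a = *x;
    long b = *y;

    *x = (a < b) ? a : b;   /* usually compiled to cmov, not a branch */
    *y = (a < b) ? b : a;
}

/* Optimal 5-comparator sorting network for 4 elements. */
static void
sort4_network(long v[4])
{
    cmp_swap(&v[0], &v[1]);
    cmp_swap(&v[2], &v[3]);
    cmp_swap(&v[0], &v[2]);
    cmp_swap(&v[1], &v[3]);
    cmp_swap(&v[1], &v[2]);
}
```

A vectorized version would run several such networks in parallel, one element per SIMD lane; with 16-byte SSE registers and 24-byte SortTuples that would only work on extracted keys plus indices, along the lines discussed above.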




Re: Insertion Sort Improvements

2022-08-28 Thread John Naylor
On Fri, Aug 26, 2022 at 9:06 PM Benjamin Coutu  wrote:
>
> Another idea could be to run a binary insertion sort and use a much higher 
> threshold. This could significantly cut down on comparisons (especially with 
> presorted runs, which are quite common in real workloads).

Comparisons that must go to the full tuple are expensive enough that
this idea might have merit in some cases, but that would be a research
project.

> If full binary search turned out to be an issue regarding cache locality, we 
> could do it in smaller chunks,

The main issue with binary search is poor branch prediction. Also, if
large chunks are bad for cache locality, isn't that a strike against
using a "much higher threshold"?

> With fewer comparisons we should start keeping track of swaps and use that as 
> an efficient way to determine presortedness. We could change the insertion 
> sort threshold to a swap threshold and do insertion sort until we hit the 
> swap threshold. I suspect that would make the current presorted check 
> obsolete (as binary insertion sort without or even with a few swaps should be 
> faster than the current presorted check).

The thread you linked to discusses partial insertion sort as a
replacement for the pre-sorted check, along with benchmark results and
graphs IIRC. I think it's possibly worth doing, but needs more
investigation to make sure the (few) regressions I saw either: 1. were
just noise or 2. can be ameliorated. As I said in the dual pivot
thread, this would be great for dual pivot since we could reuse
partial insertion sort for choosing the pivots, reducing binary space.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Insertion Sort Improvements

2022-08-26 Thread Benjamin Coutu
> convenient if you know the type at compile time. See the attached,
> which I had laying around when I was looking at PDQuicksort. I believe
> it's similar to what you have in mind.

That looks very promising.
I also love your recent proposal of partitioning into null and non-null. I 
suspect that to be a clear winner.

> I think it belongs around 10 now anyway.

Yeah, I think that change is overdue given modern hardware characteristics 
(even with the current implementation).

Another idea could be to run a binary insertion sort and use a much higher 
threshold. This could significantly cut down on comparisons (especially with 
presorted runs, which are quite common in real workloads).

If a full binary search turned out to be an issue regarding cache locality, we 
could do it in smaller chunks, e.g. do a micro binary search between the 
current position (I) and the position minus a chunk size (say 6-12 or so, 
whatever fits in 1 or 2 cachelines) whenever A[I] < A[I-1]. If we don't find 
the position within that chunk, we continue with the previous chunk, thereby 
preserving cache locality.
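For illustration, a plain (unchunked) binary insertion sort on ints could look like the following sketch (naming is mine; the chunked variant would restrict each search to a cacheline-sized window and fall back to the previous window on a miss):

```c
#include <stddef.h>
#include <string.h>

/*
 * Binary insertion sort: O(log i) comparisons per element to find the
 * insertion point, then one memmove to shift the gap open.
 */
static void
binary_insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int    key = a[i];
        size_t lo = 0;
        size_t hi = i;

        /* rightmost valid position keeps the sort stable */
        while (lo < hi)
        {
            size_t mid = lo + (hi - lo) / 2;

            if (a[mid] <= key)
                lo = mid + 1;
            else
                hi = mid;
        }
        memmove(&a[lo + 1], &a[lo], (i - lo) * sizeof(int));
        a[lo] = key;
    }
}
```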

With fewer comparisons we should start keeping track of swaps and use that as an 
efficient way to determine presortedness. We could change the insertion sort 
threshold to a swap threshold and do insertion sort until we hit the swap 
threshold. I suspect that would make the current presorted check obsolete (as 
binary insertion sort without or even with a few swaps should be faster than 
the current presorted check).

Cheers, Ben




Re: Insertion Sort Improvements

2022-08-26 Thread John Naylor
On Thu, Aug 25, 2022 at 5:55 PM Benjamin Coutu  wrote:
>
> Hello,
>
> Inspired by the recent discussions[1][2] around sort improvements, I took a 
> look around the code and noticed the use of a somewhat naive version of 
> insertion sort within the broader quicksort code.
>
> The current implementation (see sort_template.h) is practically the textbook 
> version of insertion sort:

Right. I think it's worth looking into. When I tested dual-pivot
quicksort, a threshold of 18 seemed best for my inputs, so making
insertion sort more efficient could tilt the balance more in favor of
dual-pivot. (It already seems slightly ahead, but as I mentioned in
the thread you linked, removing branches for null handling would make
it more compelling).

> DO_ASSIGN and DO_COPY macros would have to be declared analogously to DO_SWAP 
> via the template.

I don't think you need these macros, since this optimization is only
convenient if you know the type at compile time. See the attached,
which I had laying around when I was looking at PDQuicksort. I believe
it's similar to what you have in mind. (Ignore the part about
"unguarded", it's irrelevant out of context). Notice that it avoids
unnecessary copies, but has two calls to DO_COMPARE, which is *not*
small for tuples.

> Anyways, there might be some low hanging fruit here. If it turns out to be 
> significantly faster one might even consider increasing the insertion sort 
> threshold from < 7 to < 10 sort elements. But that is a whole other 
> discussion for another day.

I think it belongs around 10 now anyway. I also don't think you'll see
much benefit with your proposal at a threshold of 7 -- I suspect it'd
be more enlightening to test a range of thresholds with and without
the patch, to see how the inflection point shifts. That worked pretty
well when testing dual-pivot.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From fd4bfdbee88478fac32838712c112d91f73c5db5 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 4 May 2022 23:35:06 +0700
Subject: [PATCH v3 8/8] Use unguarded insertion sort where possible

Since we recurse to the partitions in order, guarding is only
needed at the very beginning of the input. It's not yet clear
if this is a performance win for sort keys with complex
comparators.
---
 src/include/lib/sort_template.h| 58 +-
 src/test/regress/expected/geometry.out |  2 +-
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
index 3b3c4bc128..ae76fd39e8 100644
--- a/src/include/lib/sort_template.h
+++ b/src/include/lib/sort_template.h
@@ -79,7 +79,6 @@
  *		https://github.com/orlp/pdqsort
  *
  * WIP: Not yet implemented:
- *  - use unguarded insertion sort where possible
  *  - fall back to heap sort to guard the stack
  *
  * Other techniques used in PDQuicksort, but likely not worth adopting:
@@ -174,6 +173,7 @@ ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE * first, size_t n
 /* sort private helper functions */
 #define ST_INSERTION_SORT ST_MAKE_NAME(ST_SORT, insertion_sort)
 #define ST_INSERTION_SORT_PARTIAL ST_MAKE_NAME(ST_SORT, insertion_sort_partial)
+#define ST_INSERTION_SORT_UNGUARDED ST_MAKE_NAME(ST_SORT, insertion_sort_unguarded)
 #define ST_PARTITION_LEFT ST_MAKE_NAME(ST_SORT, partition_left)
 #define ST_PARTITION_RIGHT ST_MAKE_NAME(ST_SORT, partition_right)
 #define ST_SORT_INTERNAL ST_MAKE_NAME(ST_SORT, internal)
@@ -213,6 +213,12 @@ ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE * first, size_t n
 			ST_SORT_INVOKE_COMPARE		\
 			ST_SORT_INVOKE_ARG)
 
+#define DO_INSERTION_SORT_UNGUARDED(begin_, end_)		\
+	ST_INSERTION_SORT_UNGUARDED((begin_), (end_)		\
+			ST_SORT_INVOKE_ELEMENT_SIZE	\
+			ST_SORT_INVOKE_COMPARE		\
+			ST_SORT_INVOKE_ARG)
+
 #define DO_PARTITION_LEFT(begin_, end_)	\
 	ST_PARTITION_LEFT((begin_), (end_)	\
 			ST_SORT_INVOKE_ELEMENT_SIZE	\
@@ -402,6 +408,48 @@ ST_INSERTION_SORT_PARTIAL(ST_POINTER_TYPE * begin,
 	return true;
 }
 
+static inline void
+ST_INSERTION_SORT_UNGUARDED(ST_POINTER_TYPE * begin,
+  ST_POINTER_TYPE * end
+  ST_SORT_PROTO_ELEMENT_SIZE
+  ST_SORT_PROTO_COMPARE
+  ST_SORT_PROTO_ARG)
+{
+	ST_POINTER_TYPE *cur;
+	ST_POINTER_TYPE *sift;
+	ST_POINTER_TYPE *sift_1;
+
+	if (begin == end)
+		return;
+
+	for (cur = begin + ST_POINTER_STEP; cur < end; cur += ST_POINTER_STEP)
+	{
+#ifndef ST_ELEMENT_TYPE_VOID
+		sift = cur;
+		sift_1 = cur - 1;
+
+		if (DO_COMPARE(sift_1, sift) > 0)
+		{
+			ST_ELEMENT_TYPE tmp;
+			tmp = *sift;
+
+			do
+			{
+*sift-- = *sift_1;
+sift_1--; /* Avoid multiple evaluation in DO_COMPARE */
+			} while (DO_COMPARE(sift_1, &tmp) > 0);
+
+			*sift = tmp;
+		}
+#else
+		for (sift = cur, sift_1 = cur - ST_POINTER_STEP;
+			 DO_COMPARE(sift_1, sift) > 0;
+			 sift -= ST_POINTER_STEP, sift_1 -= ST_POINTER_STEP)
+			DO_SWAP(sift, sift_1);
+#endif
+	}
+}
+
 // Partitions [begin, end) around pivot *begin. Elements equal
 //

Insertion Sort Improvements

2022-08-25 Thread Benjamin Coutu
Hello,

Inspired by the recent discussions[1][2] around sort improvements, I took a 
look around the code and noticed the use of a somewhat naive version of 
insertion sort within the broader quicksort code.

The current implementation (see sort_template.h) is practically the textbook 
version of insertion sort:

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

I propose to rather use the slightly more efficient variant of insertion sort 
where only a single assignment instead of a fully-fledged swap is performed in 
the inner loop:

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_COPY(pm_temp, pm);       /* pm_temp <- copy of pm */

    pl = pm - ST_POINTER_STEP;

    for (; pl >= a && DO_COMPARE(pl, pm_temp) > 0; pl -= ST_POINTER_STEP)
        DO_ASSIGN(pl + ST_POINTER_STEP, pl);    /* pl + 1 <- pl */

    DO_COPY(pl + ST_POINTER_STEP, pm_temp);     /* pl + 1 <- copy of pm_temp */
}

DO_ASSIGN and DO_COPY macros would have to be declared analogously to DO_SWAP 
via the template.

There is obviously a trade-off involved here, as O(1) extra memory is required 
to hold the temporary variable, and DO_COPY might be expensive if the sort 
element is large. In the case of a single-datum sort with trivial data types 
this would not be a big issue. For large tuples, on the other hand, it could 
mean significant overhead that might not be made up for by the improved inner 
loop. One might want to limit this algorithm to a certain maximum tuple size 
and use the original insertion sort version for larger elements (this could be 
decided at compile time via sort_template.h).
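On a primitive element type, the proposed variant reduces to the following plain-C sketch (function name is mine; the template version would go through DO_COPY/DO_ASSIGN as described above):

```c
#include <stddef.h>

/*
 * Single-assignment insertion sort: the inner loop shifts elements
 * right instead of swapping, and the saved element is written back
 * exactly once at the end.
 */
static void
insertion_sort_shift(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int    tmp = a[i];      /* DO_COPY(pm_temp, pm) */
        size_t j = i;

        while (j > 0 && a[j - 1] > tmp)
        {
            a[j] = a[j - 1];    /* DO_ASSIGN(pl + 1, pl) */
            j--;
        }
        a[j] = tmp;             /* DO_COPY(pl + 1, pm_temp) */
    }
}
```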

Anyways, there might be some low hanging fruit here. If it turns out to be 
significantly faster one might even consider increasing the insertion sort 
threshold from < 7 to < 10 sort elements. But that is a whole other discussion 
for another day.

Has anyone tested such an approach before? Please let me know your thoughts.

Cheers,

Benjamin

[1] 
https://www.postgresql.org/message-id/flat/CAFBsxsHanJTsX9DNJppXJxwg3bU%2BYQ6pnmSfPM0uvYUaFdwZdQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw%40mail.gmail.com

-- 

Benjamin Coutu
http://www.zeyos.com