Insertion Sort Improvements

2022-08-25 Thread Benjamin Coutu
Hello,

Inspired by the recent discussions[1][2] around sort improvements, I took a 
look around the code and noticed the use of a somewhat naive version of 
insertion sort within the broader quicksort code.

The current implementation (see sort_template.h) is practically the textbook 
version of insertion sort:

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

I propose instead using the slightly more efficient variant of insertion sort
in which the inner loop performs only a single assignment rather than a
fully-fledged swap:

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_COPY(pm_temp, pm); /* pm_temp <- copy of pm */

    pl = pm - ST_POINTER_STEP;

    for (; pl >= a && DO_COMPARE(pl, pm_temp) > 0; pl -= ST_POINTER_STEP)
        DO_ASSIGN(pl + ST_POINTER_STEP, pl); /* pl + 1 <- pl */

    DO_COPY(pl + ST_POINTER_STEP, pm_temp); /* pl + 1 <- copy of pm_temp */
}

DO_ASSIGN and DO_COPY macros would have to be declared analogously to DO_SWAP
via the template.
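
For illustration, here is a minimal standalone sketch of that variant on plain
ints, outside the template (the macros are replaced by direct assignments;
nothing here is actual sort_template.h code):

#include <stddef.h>

/* Single-assignment insertion sort: save the element once, shift larger
 * elements one slot to the right, then drop the saved element into place. */
static void
insertion_sort_int(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int     tmp = a[i];     /* copy of the element being inserted */
        size_t  j = i;

        while (j > 0 && a[j - 1] > tmp)
        {
            a[j] = a[j - 1];    /* single assignment instead of a swap */
            j--;
        }
        a[j] = tmp;
    }
}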

There is obviously a trade-off involved here, as O(1) extra memory is required
to hold the temporary element and DO_COPY might be expensive if the sort
element is large. In the case of single-datum sorts with trivial data types
this would not be a big issue. For large tuples, on the other hand, it could
mean significant overhead that might not be made up for by the improved inner
loop. One might want to limit this algorithm to a certain maximum tuple size
and use the original insertion sort version for larger elements (this could be
decided at compile time via sort_template.h).

Anyway, there might be some low-hanging fruit here. If it turns out to be
significantly faster, one might even consider increasing the insertion sort
threshold from < 7 to < 10 sort elements. But that is a whole other discussion
for another day.

Has anyone tested such an approach before? Please let me know your thoughts.

Cheers,

Benjamin

[1] 
https://www.postgresql.org/message-id/flat/CAFBsxsHanJTsX9DNJppXJxwg3bU%2BYQ6pnmSfPM0uvYUaFdwZdQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw%40mail.gmail.com

-- 

Benjamin Coutu
http://www.zeyos.com




Re: Insertion Sort Improvements

2022-08-26 Thread Benjamin Coutu
> convenient if you know the type at compile time. See the attached,
> which I had laying around when I was looking at PDQuicksort. I believe
> it's similar to what you have in mind.

That looks very promising.
I also love your recent proposal of partitioning into null and non-null. I 
suspect that to be a clear winner.

> I think it belongs around 10 now anyway.

Yeah, I think that change is overdue given modern hardware characteristics 
(even with the current implementation).

Another idea could be to run a binary insertion sort and use a much higher 
threshold. This could significantly cut down on comparisons (especially with 
presorted runs, which are quite common in real workloads).

If a full binary search turned out to be an issue for cache locality, we could
do it in smaller chunks: whenever A[I] < A[I-1], do a micro binary search
between the current position I and position I minus a chunk size (say 6-12 or
so, whatever fits in 1 or 2 cache lines), and if we don't find the position
within that chunk, continue with the previous chunk, thereby preserving cache
locality.
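
For what it's worth, here is a minimal standalone sketch of a plain binary
insertion sort on ints (without the chunked search refinement), just to make
the comparison-saving idea concrete; it is not meant as actual sort_template.h
code:

#include <stddef.h>

/* Binary insertion sort: find the insertion point with binary search
 * (O(log i) comparisons), then shift the gap open with assignments. */
static void
binary_insertion_sort_int(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int     tmp = a[i];
        size_t  lo = 0,
                hi = i;

        /* Find the leftmost position where tmp can go while keeping
         * equal elements in their original order (stable). */
        while (lo < hi)
        {
            size_t mid = lo + (hi - lo) / 2;

            if (a[mid] <= tmp)
                lo = mid + 1;
            else
                hi = mid;
        }

        /* Shift a[lo..i-1] one slot to the right and place tmp. */
        for (size_t j = i; j > lo; j--)
            a[j] = a[j - 1];
        a[lo] = tmp;
    }
}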

With fewer comparisons, we should start keeping track of swaps and use that as
an efficient way to determine presortedness. We could change the insertion sort
threshold to a swap threshold and do insertion sort until we hit the swap
threshold. I suspect that would make the current presorted check obsolete (as a
binary insertion sort with no, or even a few, swaps should be faster than the
current presorted check).

Cheers, Ben




Re: Insertion Sort Improvements

2022-09-27 Thread Benjamin Coutu
Getting back to improvements for small sort runs, it might make sense to
consider in-register sorting via sorting networks for very small runs.

This talk is specific to database sorting and illustrates how such an approach 
can be vectorized: 
https://youtu.be/HeFwVNHsDzc?list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&t=1090

It looks like some of the commercial DBMSs do this very successfully. They use
4 512-bit registers (AVX-512) in this example, which could technically store
4 * 4 sort elements (given that the sorting key is 64 bits and the tuple
pointer is 64 bits). I wonder whether this could also be done with just SSE
(instead of AVX), which the project now readily supports thanks to your recent
efforts?
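
As a point of reference, here is a scalar sketch of a branchless 4-element
sorting network (the classic 5-comparator layout); a vectorized version would
apply the same fixed compare-exchange pattern across SIMD lanes. This is purely
illustrative and not tied to the tuplesort code:

#include <stdint.h>

/* Branchless compare-exchange: after the call, *lo <= *hi. */
static inline void
cmp_exchange(int64_t *lo, int64_t *hi)
{
    int64_t a = *lo;
    int64_t b = *hi;
    int64_t swap = -(int64_t) (a > b);  /* all-ones mask if out of order */
    int64_t diff = (a ^ b) & swap;

    *lo = a ^ diff;
    *hi = b ^ diff;
}

/* Optimal 5-comparator sorting network for 4 elements. */
static void
sort4_network(int64_t v[4])
{
    cmp_exchange(&v[0], &v[1]);
    cmp_exchange(&v[2], &v[3]);
    cmp_exchange(&v[0], &v[2]);
    cmp_exchange(&v[1], &v[3]);
    cmp_exchange(&v[1], &v[2]);
}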




Re: Insertion Sort Improvements

2022-09-28 Thread Benjamin Coutu
> "Looks like"?

I cannot find the reference, but I read a while back that a well-known company
from Redwood uses it for their in-memory columnar storage. That might have just
been a rumor or might have been research only - not sure. It does not really
matter anyway.

> SortTuples are currently 24 bytes, and supported vector registers are 16 
> bytes, so not sure how you think that would work.

The thought was to logically group multiple sort tuples together and then
create a vectorized version of that group containing just the primitive-type
sort key plus a small index/offset into the group, which would later be used to
swap the corresponding sort tuple it references. The sorting network would
allow us to do branchless, register-based sorting for a particular sort run. I
guess this idea is moot, given ...

> More broadly, the more invasive a change is, the more risk is involved, and 
> the more effort to test and evaluate. If you're serious about trying to 
> improve insertion sort performance, the simple idea we discussed at the start 
> of the thread is a much more modest step that has a good chance of justifying 
> the time put into it. That is not to say it's easy, however, because testing 
> is a non-trivial amount of work.

I absolutely agree. Let's concentrate on improving things incrementally.
Please excuse my wondering whether you might already have tried something along
those lines or researched the subject, given that you have contributed some of
the recent vectorization work and have obviously experimented a lot with the
sort code - it is definitely a very interesting topic.

Cheers, Ben




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
I'd like to revive this important discussion.

As is well described in this fairly recent paper (which also looks at
Postgres), https://www.vldb.org/pvldb/vol9/p204-leis.pdf, "estimation errors
quickly grow as the number of joins increases, and that these errors are
usually the reason for bad plans" - I think we can all get behind that
statement.

While nested loop joins work great when cardinality estimates are correct,
they are notoriously bad when the optimizer underestimates, and they degrade
very fast in such cases - the good vs. bad here is very asymmetric. On the
other hand, hash joins degrade much more gracefully - they are considered very
robust against underestimation. The above-mentioned paper illustrates that all
major DBMSs (including Postgres) tend to underestimate, and that the
underestimation usually increases drastically with the number of joins (see
Figures 3+4 of the paper).

Now, a simple approach to guarding against bad plans that arise from
underestimation could be to use what I would call a
nested-loop-conviction-multiplier based on the current depth of the join tree,
e.g. for a base table that multiplier would obviously be 1, but it would then
grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to
skew the cardinality estimates themselves, but rather be applied to the overall
nested loop join cost at each particular stage of the plan when comparing it to
other, more robust join strategies like hash or sort-merge joins. That way,
when we can be sure to have a good estimate at the bottom of the join tree, we
treat all strategies equally, but favor nested loops less and less as we move
up the join tree, for the sake of robustness.
Also, we could increase the multiplier whenever we fall back to using the
default cardinality constant, as surely all bets are off at that point - we
should definitely treat nested loop joins as out of favor in this instance, and
that could easily be incorporated by simply increasing the
conviction-multiplier.
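
To make the idea a bit more concrete, here is a toy sketch (all names and the
quadratic formula are made up for illustration; this is in no way actual
planner code):

#include <stdbool.h>

/* Grows quadratically with the depth of the join tree (depth 0 = base
 * relation), reflecting that underestimation tends to compound as we
 * move up the tree. */
static double
nestloop_conviction_multiplier(int join_depth, bool used_default_selectivity)
{
    double mult = 1.0 + (double) join_depth * (double) join_depth;

    /* If we had to fall back to a default selectivity constant, all bets
     * are off: penalize nested loops even more (factor is arbitrary). */
    if (used_default_selectivity)
        mult *= 10.0;

    return mult;
}

/* The multiplier skews only the comparison against more robust join
 * strategies, not the cardinality estimates themselves. */
static bool
prefer_hashjoin(double nestloop_cost, double hashjoin_cost,
                int join_depth, bool used_default_selectivity)
{
    return nestloop_cost *
        nestloop_conviction_multiplier(join_depth, used_default_selectivity)
        > hashjoin_cost;
}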

What are your thoughts on this simple idea - is it perhaps too simple?

Cheers, Ben




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu
> Right. But that seems fraught with difficulty. I suspect that the
> costs that the planner attributes to each plan often aren't very
> reliable in any absolute sense, even when everything is working very
> well by every available metric. Even a very noisy cost model with
> somewhat inaccurate selectivity estimates will often pick the cheapest
> plan, or close enough.

Sure, the absolute cost of a complex plan will always be inaccurate at best.
My point is that we can be very confident in the cardinalities of base tables. 
As the paper states in "3.1. Estimates for Base Tables":

"The median q-error is close to the optimal value of 1 for all systems,
indicating that the majority of all selections are estimated correctly."

Thanks to the statistics, we will practically never be off by an order of
magnitude when estimating base table cardinalities.

The paper also clearly shows (and that certainly coincides with my experience) 
that those cardinality underestimations grow exponentially as they propagate up 
the join tree.

Given the research, I'd posit that at any given level of the join tree, the
current depth is a reasonable indicator of underestimation. Taking that into
account (even if only to mitigate nested loops on higher levels) is IMV a
principled approach, and not necessarily a hack.

Obviously having something like error bars as proposed by Tom would be even 
better and perhaps more general, but that is on a whole different level in 
terms of complexity and I certainly have no idea how we would easily get there.




Re: disfavoring unparameterized nested loops

2022-09-29 Thread Benjamin Coutu


> In general I suspect that we'd be better off focussing on mitigating
> the impact at execution time. There are at least a few things that we
> could do there, at least in theory. Mostly very ambitious, long term
> things.

I think these things are orthogonal. No matter how good the cost model ever 
gets, we will always have degenerate cases.
Having some smarts about that in the executor is surely a good thing, but it 
shouldn't distract us from improving on the planner front.

> 
> I like the idea of just avoiding unparameterized nested loop joins
> altogether when an "equivalent" hash join plan is available because
> it's akin to an execution-time mitigation, despite the fact that it
> happens during planning. While it doesn't actually change anything in
> the executor, it is built on the observation that we have virtually
> everything to gain and nothing to lose during execution, no matter
> what happens.

I agree with you that those plans are too risky. But let's maybe find a more
general way of dealing with this.

> Right. Though I am actually sympathetic to the idea that users might
> gladly pay a cost for performance stability -- even a fairly large
> cost. That part doesn't seem like the problem.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Agreed, but dealing with uncertainty in those numbers is an enormous
> task if you want to do it right.  "Doing it right", IMV, would start
> out by extending all the selectivity estimation functions to include
> error bars; then we could have error bars on rowcount estimates and
> then costs; then we could start adding policies about avoiding plans
> with too large a possible upper-bound cost.  Trying to add such
> policy with no data to go on is not going to work well.

Error bars would be fantastic, no question. But that would make things very 
complex.
A lot of judgment calls would be necessary for the policy behind upper-bound 
pruning, picking up on Peter's comment about "conviction multiplier of 
conviction multiplier" ;)
Also, the math in deriving those bounds based on the stats and how they 
propagate up the join tree doesn't seem trivial either.

> I think Peter's point is that a quick-n-dirty patch is likely to make
> as many cases worse as it makes better.  That's certainly my opinion
> about the topic.

As in my reply to Peter, I think the join level/depth metric is a simple but 
principled way of dealing with it, given the referenced research.
In a first step, we'd use this merely to be more risk-averse towards nested
loop joins as we climb up the join tree - we are not fiddling with the cost
model itself, nor with the join ordering, only with how that particular join
algorithm is weighed against the alternatives. Later this could be expanded to
be more broadly scoped.

Please don't give up on a simple way to reap most of the fruit just yet.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Actually, if we wanted to improve things in this area, we should have a
> set of queries that don't chose optimal plans we can test with.  We used
> to see them a lot before we had extended statistics, but I don't
> remember seeing many recently, let alone a collection of them.  I guess
> that is good.

In the VLDB paper they actually created their own "Join Order Benchmark",
which is publicly available at https://github.com/gregrahn/join-order-benchmark
It would probably be better suited for this kind of testing than, e.g., the TPC
benchmarks.

If there is interest, I could also compile a set of relevant cases based on the 
message history of the performance mailing list.




Re: disfavoring unparameterized nested loops

2022-09-30 Thread Benjamin Coutu
> Sure, but the model isn't the problem here, really -- not to me. The
> problem is that the planner can in some cases choose a plan that is
> inherently unreasonable, at least in practical terms. You're talking
> about uncertainties. But I'm actually talking about the opposite thing
> -- certainty (albeit a limited kind of certainty that applies only to
> one narrow set of conditions).

I absolutely agree and support your proposal to simply not generate those paths 
at all unless necessary.

> For all I know you might be onto something. But it really seems
> independent to me.

Yeah, I'm sorry if I hijacked this thread for something related but
technically different. I just wanted to expand on your proposal by taking into
account the join depth and by not talking only about unparameterized nested
loop joins. The research is very clear that the uncertainty is proportional to
the join level, and that is the point I am trying to focus the discussion on.

I really encourage everyone to read the VLDB paper. BTW, the unnamed 
proprietary DBMSs in that paper are the 3 big ones from Washington, California 
and NY, in that order.




Summary of Sort Improvement Proposals

2024-05-13 Thread Benjamin Coutu
Hello,

In light of multiple threads [1-6] discussing sorting improvements, I'd like to 
consolidate the old (+some new) ideas as a starting point.
It might make sense to brainstorm on a few of these ideas and maybe even
identify some that are worth implementing and testing.

1. Simple algorithmic ideas:
- Use single-assignment insertion-sort instead of swapping
- Increase insertion-sort threshold to at least 8 (possibly 10+), to be 
determined empirically based on current hardware
- Make insertion-sort threshold customizable via template based on sort 
element size

2. More complex/speculative algorithmic ideas:
- Try counting insertion-sort loop iterations and bail after a certain 
limit (include presorted check in insertion-sort loop and continue presorted 
check from last position in separate loop after bailout)
- Try binary search for presorted check (outside of insertion-sort-code)
- Try binary insertion sort (if comparison costs are high)
- Try partial insertion sort (include presorted check)
- Try presorted check only at top-level, not on every recursive step, 
or, if on every level, then at least only for n > some threshold
- Try asymmetric quick-sort partitioning
- Try dual pivot quick-sort
- Try switching to heap-sort dependent on recursion depth (might allow 
ripping out median-of-medians)

3. TupleSort ideas:
        - Use separate sort partition for NULL values to avoid null check on
every comparison and to make nulls first/last trivial (see the sketch after
this list)
- Pass down non-nullness info to avoid null check and/or null-partition 
creation (should ideally be determined by planner)
- Skip comparison of first sort key on subsequent full tuple 
tie-breaker comparison (unless abbreviated key)
- Encode NULL directly in abbreviated key (only if no null-partitioning)

4. Planner ideas:
- Use pg_stats.correlation to inform sort algorithm selection for sort 
keys that come from sequential-scans/bitmap-heap-scans
- Use n_distinct to inform sort algorithm selection (many tie-breaker 
comparisons necessary on multi-key sort)
- Improve costing of sorts in planner considering tuple size, 
distribution and n_distinct
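
To illustrate the NULL-partitioning idea from point 3, here is a minimal
standalone sketch on a made-up, simplified element type (deliberately not the
real SortTuple):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified sort element. */
typedef struct
{
    long    key;
    bool    isnull;
} Elem;

/* Stable partition: move all NULL elements to the end of the array and
 * return the number of non-NULL elements.  nulls_buf must have room for
 * at least n elements.  The non-NULL prefix can then be sorted with a
 * comparator that never checks for NULLs, and NULLS FIRST/LAST becomes a
 * question of where the NULL block is placed. */
static size_t
partition_nulls_last(Elem *a, Elem *nulls_buf, size_t n)
{
    size_t  n_notnull = 0;
    size_t  n_null = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (a[i].isnull)
            nulls_buf[n_null++] = a[i];
        else
            a[n_notnull++] = a[i];
    }
    for (size_t i = 0; i < n_null; i++)
        a[n_notnull + i] = nulls_buf[i];

    return n_notnull;
}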

[1] https://www.postgresql.org/message-id/flat/ddc4e498740a8e411c59%40zeyos.com
[2] 
https://www.postgresql.org/message-id/flat/CAFBsxsHanJTsX9DNJppXJxwg3bU%2BYQ6pnmSfPM0uvYUaFdwZdQ%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAApHDvoTTtoQYfp3d0kTPF6y1pjexgLwquzKmjzvjC9NCw4RGw%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/flat/CAEYLb_Xn4-6f1ofsf2qduf24dDCVHbQidt7JPpdL_RiT1zBJ6A%40mail.gmail.com
[5] 
https://www.postgresql.org/message-id/flat/CAEYLb_W%2B%2BUhrcWprzG9TyBVF7Sn-c1s9oLbABvAvPGdeP2DFSQ%40mail.gmail.com
[6] 
https://www.postgresql.org/message-id/flat/683635b8-381b-5b08-6069-d6a45de19a12%40enterprisedb.com#12683b7a6c566eb5b926369b5fd2d41f

-- 

Benjamin Coutu
http://www.zeyos.com




Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
Greetings,

I would like to revisit the discussion and concur with John's perspective that 
incremental progress through small, successive modifications is the appropriate 
approach to move forward. Therefore, I would like to propose two distinct ideas 
aimed at enhancing the functionality of insertion sort:

1. Implementation of a more efficient variant (as described in the introductory 
part of this thread):

 OLD 

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

 NEW 

for (
pm = a + ST_POINTER_STEP;
pm < a + n * ST_POINTER_STEP;
pm += ST_POINTER_STEP
) {
ST_POINTER_TYPE tmp = *pm;
 
for (
pl = pm - ST_POINTER_STEP;
pl >= a && DO_COMPARE(pl, &tmp) > 0;
pl -= ST_POINTER_STEP
)
*(pl + ST_POINTER_STEP) = *pl;

*(pl + ST_POINTER_STEP) = tmp;
}



2. It appears that there is a consensus against disregarding the presorted 
check, despite its questionable value. In light of this, an alternative 
suggestion is to integrate the presorted check into the insertion sort path by 
utilizing an unbounded insertion sort. Only after a maximum number of swaps 
have occurred should we abandon the sorting process. If insertion sort is 
executed on the entire array without any swaps, we can simply return. If not, 
and we continue with quicksort after the swap limit has been reached, we will
at least have left the array in a more sorted state, which is likely to be
advantageous for subsequent iterations.

 OLD 

if (n < 7)
{
    for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
         pm += ST_POINTER_STEP)
        for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
             pl -= ST_POINTER_STEP)
            DO_SWAP(pl, pl - ST_POINTER_STEP);
    return;
}
presorted = 1;
for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_CHECK_FOR_INTERRUPTS();
    if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
    {
        presorted = 0;
        break;
    }
}
if (presorted)
    return;

 NEW 

#define LIMIT_SWAPS 30 /* to be determined empirically */

int swaps = 0;

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
    {
        DO_SWAP(pl, pl - ST_POINTER_STEP);

        if (++swaps == LIMIT_SWAPS)
            goto quicksort;
    }

if (swaps == 0)
    return;

quicksort:



Naturally, both modifications (with point 2 being highly speculative) are 
currently independent of each other, and it is also crucial to benchmark the 
combined variant as well as different values for LIMIT_SWAPS.
I would greatly appreciate assistance in benchmarking these proposed changes. 
Your collaboration in this matter would be invaluable.

Cheers, Ben




Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
> That's worth trying out. It might also then be worth trying to push both 
> unordered values -- the big one up / the small one down. I've seen other 
> implementations do that, but don't remember where, or what it's called.

It is important that we do not do 2 compares to avoid one copy (assignment to
a temporary) as you did in your patch earlier in this thread, because compares
are usually pretty costly (also, two compares are then inlined, bloating the
binary).
Assigning a sort tuple to a temporary translates to pretty simple assembly
code, so my suggested modification should outperform. It cuts the cost of the
inner loop by ca. 40%, judging by the assembly. And it avoids having to re-read
memory on each comparison, as the temporary can be held in registers.

> "Unbounded" means no bounds check on the array. That's not possible in its 
> current form, so I think you misunderstood something.

Sorry for the confusion. I didn't mean unbounded in the "array bound checking" 
sense, but in the "unrestricted number of loops" sense.

> I only remember implementations tracking loop iterations, not swaps. You'd 
> need evidence that this is better.

Well, the idea was to include the presorted check somehow. Stopping after a
certain number of iterations is surely safer than stopping after a number of
swaps, but we would then implicitly also stop our presort check. We could
change that though: count loop iterations, and on bailout continue with a pure
presort check - not all over again, but from the last position of the insertion
sort - by comparing against the maximum value recorded during the insertion
sort. Thoughts?
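
To sketch what I mean on plain ints (assuming, for simplicity, that we only
check the limit between insertions, so the prefix maximum is simply its last
element rather than a separately tracked value; ITER_LIMIT is a made-up
constant):

#include <stdbool.h>
#include <stddef.h>

#define ITER_LIMIT 32           /* made-up bail-out threshold */

/* Returns true if a[0..n-1] is known to be fully sorted on return.
 * Insertion sort runs while the cumulative number of inner-loop
 * iterations stays below ITER_LIMIT; after that we stop sorting and only
 * verify that the remaining suffix is in order relative to the sorted
 * prefix. */
static bool
insertion_sort_with_bailout(int *a, size_t n)
{
    size_t  iters = 0;
    size_t  i;

    for (i = 1; i < n; i++)
    {
        int     tmp = a[i];
        size_t  j = i;

        while (j > 0 && a[j - 1] > tmp)
        {
            a[j] = a[j - 1];
            j--;
            iters++;
        }
        a[j] = tmp;

        if (iters >= ITER_LIMIT)
        {
            i++;                /* a[0..i-1] is now sorted */
            break;
        }
    }

    /* Pure presorted check over the remaining suffix, continuing from the
     * last position instead of starting over. */
    for (; i < n; i++)
        if (a[i - 1] > a[i])
            return false;       /* caller falls back to quicksort */

    return true;
}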

> An important part not mentioned yet: This might only be worth doing if the 
> previous recursion level performed no swaps during partitioning and the 
> current pivot candidates are ordered.

Agreed.

> It's currently 7, but should really be something like 10. A script that 
> repeats tests for, say, 7 through 18 should show a concave-up shape in the 
> times. The bottom of the bowl should shift to higher values, and that minimum 
> is what should be compared.

Yeah, as alluded to before, it should be closer to 10 nowadays.
In any case it should be changed at least from 7 to 8, because then we could
discard the additional check for n > 7 in the quicksort code path (see
/src/include/lib/sort_template.h#L322). Currently we check n < 7 and a few
lines down we check for n > 7; if we check n < 8 for insertion sort, the second
check becomes obsolete.

Benjamin Coutu
http://www.zeyos.com