While reviewing Thomas Munro's patchset exploring expanded use of specialized qsort [1], I wondered about some aspects of the current qsort implementation. For background, ours is based on Bentley & McIlroy's "Engineering a Sort Function" [2], which is a classic paper worth studying.
1) The check for pre-sorted input at the start of each and every recursion.

This has been discussed before [3]. The check is great if the input is sorted, and mostly harmless if the current partition is badly sorted enough that the check fails quickly, but it's not hard to imagine that it is mostly a waste of cycles. There have been proposals to base the pre-sorted check on input size [4] or to do it only once at the beginning, but that strikes me as looking for the keys under the lamppost because that's where the light is.

The logical criterion for proceeding to check whether the input is sorted is whether we *think* the input could be sorted. That may sound like a tautology, but consider the case where the partitioning phase didn't perform any swaps: then it clearly makes sense to check. But we can go further. What if we do the check, and towards the end the check fails? If just a couple of elements are out of place, does it really make sense to give up, partition, and recurse? If a few iterations of insertion sort are all that is needed to finish, why not just do that? This way, we *dynamically* decide to optimistically start an insertion sort, in the hope that it will occasionally prevent a recursion; in the worst case, the input is slightly more sorted for the next recursion. (A rough sketch of such a bail-out insertion sort is below.) All we need is a lightweight way to detect that the partitioning phase didn't do any swaps. More on this below.

2) Cardinality issues can cancel abbreviated keys, but our qsort is not optimized for that case.

Since comparisons are very expensive in this case, it stands to reason that qsort could profitably be optimized for a low number of unique keys. The Bentley & McIlroy scheme takes great pains to prevent quadratic behavior on inputs with many duplicates, but I've found something that might have stronger guarantees: pattern-defeating quicksort (paper: [5], C++ implementation: [6]) claims a worst-case runtime of O(nk) for inputs with k distinct elements. This is achieved via an asymmetric partitioning scheme. It's not complex, but it is subtle, so I won't try to describe it here; I recommend reading section 3 of the paper for details. Rust and Boost use pdqsort in some form, so it has seen production use.

The partitioning scheme just mentioned also provides an easy way to detect when no swaps have been done, solving #1 above (also sketched below).

There is one requirement that is a double-edged sword: recursion into the partitions must happen in order. This has the additional advantage that, in every case but one, insertion sort doesn't need to guard against running off the beginning of the array. The downside for us is that we currently choose which partition to recurse into based on its size, as a stack-guarding measure, so that guard would have to be replaced. The C++ implementation of pdqsort falls back to heapsort as a last resort to bound runtime, and that would be just as effective at guarding the stack. That's a bit of a snag for making it production-ready, but not enough to prevent testing the actual qsort part.

Does anyone see a reason not to put in the necessary work to try it out?
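To make the idea in #1 concrete, here is a rough sketch of a bail-out insertion sort, modeled on the partial_insertion_sort in the pdqsort implementation [6]. This is not our qsort template: the function name, the bare int element type, and the move budget (pdqsort also uses 8) are placeholders for illustration.

#include <stdbool.h>
#include <stddef.h>

/* arbitrary budget; the pdqsort implementation [6] also uses 8 */
#define PARTIAL_INSERTION_SORT_LIMIT 8

/*
 * Attempt to finish sorting a[0..n) with insertion sort, but bail out
 * if it takes more than a small number of element moves.  Returns true
 * if the range is now fully sorted, false if we gave up (in which case
 * the range is no less sorted than before, so nothing is wasted).
 */
static bool
partial_insertion_sort(int *a, size_t n)
{
	size_t		moves = 0;
	size_t		i;

	for (i = 1; i < n; i++)
	{
		int			tmp = a[i];
		size_t		j = i;

		/* shift greater elements right to make room for a[i] */
		while (j > 0 && a[j - 1] > tmp)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = tmp;

		moves += i - j;
		if (moves > PARTIAL_INSERTION_SORT_LIMIT)
			return false;	/* too unsorted; let quicksort proceed */
	}
	return true;
}

Note the failure mode is cheap by construction: a false return means we spent at most a handful of moves, and every move made the partition strictly more sorted.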
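And here is a sketch of the no-swap detection, in the spirit of the already_partitioned flag that pdqsort's partition_right returns [6]. To keep it short this uses a plain Hoare-style scan rather than pdqsort's actual asymmetric scheme, so again the names and int element type are placeholders, not a description of pdqsort or of our template.

/*
 * Hoare-style partition around the pivot stored in a[0] (assume a
 * median candidate was already moved there).  Returns the pivot's
 * final index and sets *no_swaps to true if the scan crossed without
 * swapping anything (other than placing the pivot), i.e. the range
 * was already partitioned.
 */
static size_t
partition_report_swaps(int *a, size_t n, bool *no_swaps)
{
	int			pivot = a[0];
	size_t		i = 0;
	size_t		j = n;

	*no_swaps = true;

	for (;;)
	{
		do { i++; } while (i < n && a[i] < pivot);
		do { j--; } while (a[j] > pivot);

		if (i >= j)
			break;

		/* out-of-order pair: swap it and remember that we had to */
		{
			int			tmp = a[i];

			a[i] = a[j];
			a[j] = tmp;
		}
		*no_swaps = false;
	}

	/* move the pivot to its final position */
	a[0] = a[j];
	a[j] = pivot;

	return j;
}

A caller that sees *no_swaps come back true could then try partial_insertion_sort() on each side before deciding to recurse, which is the dynamic check-then-finish behavior described in #1.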
[1] https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
[2] https://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[3] https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055
[4] https://www.postgresql.org/message-id/CA%2BTgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA%40mail.gmail.com
[5] https://arxiv.org/pdf/2106.05123.pdf
[6] https://github.com/orlp/pdqsort

--
John Naylor
EDB: http://www.enterprisedb.com