Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 14/03/2019 01.03, George Spelvin wrote:
> On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote:
>
>> Nice!
>
> Thank you. May I translate that into Acked-by?
>

Sort-of. I prefer first seeing the full rerolled series for context etc.,
even if (the important parts of) this patch would be unchanged.
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 09.21, George Spelvin wrote:
>> +/**
>> + * parent - given the offset of the child, find the offset of the parent.
>> + * @i: the offset of the heap element whose parent is sought. Non-zero.
>> + * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
>> + * @size: size of each element
>> + *
>> + * In terms of array indexes, the parent of element j = i/size is simply
>> + * (j-1)/2. But when working in byte offsets, we can't use implicit
>> + * truncation of integer divides.
>> + *
>> + * Fortunately, we only need one bit of the quotient, not the full divide.
>> + * size has a least significant bit. That bit will be clear if i is
>> + * an even multiple of size, and set if it's an odd multiple.
>> + *
>> + * Logically, we're doing "if (i & lsbit) i -= size;", but since the
>> + * branch is unpredictable, it's done with a bit of clever branch-free
>> + * code instead.
>> + */
>> +__attribute_const__ __always_inline
>> +static size_t parent(size_t i, unsigned int lsbit, size_t size)
>> +{
>> +	i -= size;
>> +	i -= size & -(i & lsbit);
>> +	return i / 2;
>> +}
>> +
>
> Really nice :) I had to work through this by hand, but it's solid.

Thank you! Yes, the way the mask doesn't include the low-order bits that
don't matter anyway is a bit subtle. When the code is subtle, use lots of
comments. The entire reason for making this a separate helper function is
to leave room for the large comment.

>> +	unsigned const lsbit = size & -size;	/* Used to find parent */
>
> Nit: qualifier before type, "const unsigned". And this sets ZF, so a
> paranoid check for zero size (cf. the other mail) by doing "if (!lsbit)
> return;" is practically free. Though it's probably a bit obscure doing
> it that way...

Actually, this is a personal style thing which I can ignore for the sake
of the kernel, but I believe that it's better to put the qualifier *after*
the type. This is due to C's pointer declaration syntax. The standard
example of the issue is:

	typedef char *pointer;
	const char *a;
	char const *b;
	char * const c;
	const pointer d;
	pointer const e;

Now, which variables are the same types? The answer is that a & b are the
same (mutable pointer to const char), and c, d & e are the same (const
pointer to mutable char).

If you make a habit of putting the qualifier *after* the type, then a
simple "textual substitution" mental model for the typedef works, and it's
clear that c and e are the same. It's also clear that b cannot be
represented by the typedef, because the const is between "char" and "*",
and you obviously can't do that with the typedef. But if you put the
qualifier first, it's annoying to remember why a and d are not the same
type.

So I've deliberately cultivated the style of putting the qualifier after
the type. But if the kernel prefers it before...

>> +	if (!n)
>> +		return;
>
> I'd make that n <= 1. Shouldn't be much more costly.

(Actually, it's "num <= 1"; n is the pre-multiplied form, so n <= 1 can
only happen when sorting one one-byte value.)

I actually thought about this and decided not to bother. I did it this way
during development to stress the general-case code. But should I change it?

=== NEVER MIND ===

I had written a long reply justifying leaving it alone to save one
instruction when the light dawned: I can do *both* tests in one step with

	size_t n = num * size, a = (num/2) * size;
	unsigned const lsbit = size & -size;	/* Used to find parent */

	if (!a)		/* num < 2 || size == 0 */
		return;

So now everyone's happy.

> Nice!
Thank you. May I translate that into Acked-by?
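
As an aside for readers following the parent() discussion above, the
arithmetic can be checked in isolation. The userspace harness below is
made up purely for illustration (it is not part of the patch); it compares
the branch-free form against the plain (j-1)/2 parent computation for a
range of element sizes:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same arithmetic as the patch's parent() helper. */
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
	i -= size;
	i -= size & -(i & lsbit);
	return i / 2;
}

int main(void)
{
	size_t size, j;

	for (size = 1; size <= 64; size++) {
		unsigned int lsbit = size & -size;

		for (j = 1; j <= 1000; j++) {
			/* Reference: parent of array index j is (j-1)/2 */
			size_t want = ((j - 1) / 2) * size;

			assert(parent(j * size, lsbit, size) == want);
		}
	}
	printf("branch-free parent() matches (j-1)/2 for all tested sizes\n");
	return 0;
}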
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 21/02/2019 09.21, George Spelvin wrote:
>
> +/**
> + * parent - given the offset of the child, find the offset of the parent.
> + * @i: the offset of the heap element whose parent is sought. Non-zero.
> + * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
> + * @size: size of each element
> + *
> + * In terms of array indexes, the parent of element j = i/size is simply
> + * (j-1)/2. But when working in byte offsets, we can't use implicit
> + * truncation of integer divides.
> + *
> + * Fortunately, we only need one bit of the quotient, not the full divide.
> + * size has a least significant bit. That bit will be clear if i is
> + * an even multiple of size, and set if it's an odd multiple.
> + *
> + * Logically, we're doing "if (i & lsbit) i -= size;", but since the
> + * branch is unpredictable, it's done with a bit of clever branch-free
> + * code instead.
> + */
> +__attribute_const__ __always_inline
> +static size_t parent(size_t i, unsigned int lsbit, size_t size)
> +{
> +	i -= size;
> +	i -= size & -(i & lsbit);
> +	return i / 2;
> +}
> +

Really nice :) I had to work through this by hand, but it's solid.

>  /**
>   * sort - sort an array of elements
>   * @base: pointer to data to sort
> @@ -125,21 +151,26 @@ static void generic_swap(void *a, void *b, int size)
>   * @cmp_func: pointer to comparison function
>   * @swap_func: pointer to swap function or NULL
>   *
> - * This function does a heapsort on the given array. You may provide a
> - * swap_func function optimized to your element type.
> + * This function does a heapsort on the given array. You may provide a
> + * swap_func function if you need to do something more than a memory copy
> + * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't
> + * usually a bottleneck.
>   *
>   * Sorting time is O(n log n) both on average and worst-case. While
>   * qsort is about 20% faster on average, it suffers from exploitable
>   * O(n*n) worst-case behavior and extra memory requirements that make
>   * it less suitable for kernel use.
>   */
> -
>  void sort(void *base, size_t num, size_t size,
> 	  int (*cmp_func)(const void *, const void *),
> 	  void (*swap_func)(void *, void *, int size))
>  {
>  	/* pre-scale counters for performance */
> -	int i = (num/2 - 1) * size, n = num * size, c, r;
> +	size_t n = num * size, a = (num/2) * size;
> +	unsigned const lsbit = size & -size;	/* Used to find parent */
> +

Nit: qualifier before type, "const unsigned". And this sets ZF, so a
paranoid check for zero size (cf. the other mail) by doing "if (!lsbit)
return;" is practically free. Though it's probably a bit obscure doing
it that way...

> +	if (!n)
> +		return;

I'd make that n <= 1. Shouldn't be much more costly.

> -		}
> -	}
> +	/*
> +	 * Loop invariants:
> +	 * 1. elements [a,n) satisfy the heap property (compare greater than
> +	 *    all of their children),
> +	 * 2. elements [n,num*size) are sorted, and
> +	 * 3. a <= b <= c <= d <= n (whenever they are valid).
> +	 */
> +	for (;;) {
> +		size_t b, c, d;
>
> +		if (a)			/* Building heap: sift down --a */
> +			a -= size;
> +		else if (n -= size)	/* Sorting: Extract root to --n */
> +			swap_func(base, base + n, size);
> +		else			/* Sort complete */
> +			break;
> +
> +		/*
> +		 * Sift element at "a" down into heap. This is the
> +		 * "bottom-up" variant, which significantly reduces
> +		 * calls to cmp_func(): we find the sift-down path all
> +		 * the way to the leaves (one compare per level), then
> +		 * backtrack to find where to insert the target element.
> +		 *
> +		 * Because elements tend to sift down close to the leaves,
> +		 * this uses fewer compares than doing two per level
> +		 * on the way down. (A bit more than half as many on
> +		 * average, 3/4 worst-case.)
> +		 */
> +		for (b = a; c = 2*b + size, (d = c + size) < n;)
> +			b = cmp_func(base + c, base + d) >= 0 ? c : d;
> +		if (d == n)	/* Special case last leaf with no sibling */
> +			b = c;
> +
> +		/* Now backtrack from "b" to the correct location for "a" */
> +		while (b != a && cmp_func(base + a, base + b) >= 0)
> +			b = parent(b, lsbit, size);
> +		c = b;			/* Where "a" belongs */
> +		while (b != a) {	/* Shift it into place */
> +			b = parent(b, lsbit, size);
> +			swap_func(base + b, base + c, size);
>  		}
>  	}
>  }
>

Nice!

Rasmus
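
To see the shape of that inner loop without the byte-offset bookkeeping,
here is a rough sketch of one bottom-up sift-down on a max-heap of ints
(illustration only; the function name is made up, and plain int compares
and swaps stand in for the patch's cmp_func/swap_func and byte offsets):

/* Sift heap[idx] down in a max-heap of n ints, bottom-up style. */
static void sift_down_bottom_up(int *heap, size_t n, size_t idx)
{
	size_t b = idx, c, d;

	/* Descend along the path of greater children: one compare per level. */
	for (;;) {
		c = 2 * b + 1;
		d = c + 1;
		if (d < n) {
			b = heap[c] >= heap[d] ? c : d;
		} else {
			if (c < n)	/* last node with a single child */
				b = c;
			break;
		}
	}

	/* Backtrack up from the leaf until heap[idx] dominates heap[b]. */
	while (b != idx && heap[idx] >= heap[b])
		b = (b - 1) / 2;

	/*
	 * Rotate the elements on the path into place: c stays fixed at the
	 * destination while each swap moves one displaced element up a level,
	 * which is exactly what the patch's final while loop does.
	 */
	c = b;
	while (b != idx) {
		int tmp;

		b = (b - 1) / 2;
		tmp = heap[b];
		heap[b] = heap[c];
		heap[c] = tmp;
	}
}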
[PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
This uses fewer comparisons than the previous code (61% as many for
large random inputs), but produces identical results; it actually
performs the exact same series of swap operations.

Standard heapsort, when sifting down, performs two comparisons per
level: one to find the greater child, and a second to see if the
current node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent, then
search back *up* for the current node's destination. Since sifting down
usually proceeds to the leaf level (that's where half the nodes are),
this does many fewer second comparisons. That saves a lot of (expensive
since Spectre) indirect function calls.

The one time it's worse than the previous code is if there are large
numbers of duplicate keys, when the top-down algorithm is O(n) and
bottom-up is O(n log n). For distinct keys, it's provably always better.

(The code is not significantly more complex. This patch also merges the
heap-building and -extracting sift-down loops, resulting in a net code
size savings.)

x86-64 code size 885 -> 770 bytes (-115)

(I see the checkpatch complaint about "else if (n -= size)". The
alternative is significantly uglier.)

Signed-off-by: George Spelvin
---
 lib/sort.c | 102 +++--
 1 file changed, 75 insertions(+), 27 deletions(-)

diff --git a/lib/sort.c b/lib/sort.c
index dff2ab2e196e..2aef4631e7d3 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -117,6 +117,32 @@ static void generic_swap(void *a, void *b, int size)
 	} while (n);
 }
 
+/**
+ * parent - given the offset of the child, find the offset of the parent.
+ * @i: the offset of the heap element whose parent is sought. Non-zero.
+ * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
+ * @size: size of each element
+ *
+ * In terms of array indexes, the parent of element j = i/size is simply
+ * (j-1)/2. But when working in byte offsets, we can't use implicit
+ * truncation of integer divides.
+ *
+ * Fortunately, we only need one bit of the quotient, not the full divide.
+ * size has a least significant bit. That bit will be clear if i is
+ * an even multiple of size, and set if it's an odd multiple.
+ *
+ * Logically, we're doing "if (i & lsbit) i -= size;", but since the
+ * branch is unpredictable, it's done with a bit of clever branch-free
+ * code instead.
+ */
+__attribute_const__ __always_inline
+static size_t parent(size_t i, unsigned int lsbit, size_t size)
+{
+	i -= size;
+	i -= size & -(i & lsbit);
+	return i / 2;
+}
+
 /**
  * sort - sort an array of elements
  * @base: pointer to data to sort
@@ -125,21 +151,26 @@ static void generic_swap(void *a, void *b, int size)
  * @cmp_func: pointer to comparison function
  * @swap_func: pointer to swap function or NULL
  *
- * This function does a heapsort on the given array. You may provide a
- * swap_func function optimized to your element type.
+ * This function does a heapsort on the given array. You may provide a
+ * swap_func function if you need to do something more than a memory copy
+ * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't
+ * usually a bottleneck.
 *
 * Sorting time is O(n log n) both on average and worst-case. While
 * qsort is about 20% faster on average, it suffers from exploitable
 * O(n*n) worst-case behavior and extra memory requirements that make
 * it less suitable for kernel use.
  */
-
 void sort(void *base, size_t num, size_t size,
 	  int (*cmp_func)(const void *, const void *),
 	  void (*swap_func)(void *, void *, int size))
 {
 	/* pre-scale counters for performance */
-	int i = (num/2 - 1) * size, n = num * size, c, r;
+	size_t n = num * size, a = (num/2) * size;
+	unsigned const lsbit = size & -size;	/* Used to find parent */
+
+	if (!n)
+		return;
 
 	if (!swap_func) {
 		if (alignment_ok(base, size, 8))
@@ -150,30 +181,47 @@ void sort(void *base, size_t num, size_t size,
 			swap_func = generic_swap;
 	}
 
-	/* heapify */
-	for ( ; i >= 0; i -= size) {
-		for (r = i; r * 2 + size < n; r = c) {
-			c = r * 2 + size;
-			if (c < n - size &&
-			    cmp_func(base + c, base + c + size) < 0)
-				c += size;
-			if (cmp_func(base + r, base + c) >= 0)
-				break;
-			swap_func(base + r, base + c, size);
-		}
-	}
+	/*
+	 * Loop invariants:
+	 * 1. elements [a,n) satisfy the heap property (compare greater than
+	 *    all of their children),
+	 * 2. elements [n,num*size) are sorted, and
+	 * 3. a <= b <= c <= d <= n (whenever they are valid).
+	 */
+	for (;;) {
+
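
For reference, the public calling convention is unchanged by this patch.
A minimal, made-up example of a caller that relies on the built-in swap
(illustration only, not taken from the patch or any existing caller):

#include <linux/sort.h>

struct item {
	int key;
	void *priv;
};

/* Ascending by key; returns <0, 0, >0 as cmp_func expects. */
static int item_cmp(const void *a, const void *b)
{
	const struct item *ia = a, *ib = b;

	if (ia->key < ib->key)
		return -1;
	return ia->key > ib->key;
}

static void sort_items(struct item *items, size_t count)
{
	/* NULL swap_func: the built-in memory-copy swap is used. */
	sort(items, count, sizeof(*items), item_cmp, NULL);
}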