Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson wrote:
>
> On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> > But why just split it into two pieces? That will just lead
> > to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> > why not split the longer input into N roughly equal sized pieces, each
> > around the same length as the shorter input, multiplying and adding
> > them at the appropriate offsets?
>
> The approach you're describing is implemented by e.g. CPython
> and is called "lopsided" in their code base. It has some different
> performance characteristics, compared to the recursive Half-Karatsuba
> approach.
>
> What I didn't like about lopsided is the degenerate case where the
> last chunk is much shorter than var1. For example, if we pretend
> we would be doing Karatsuba all the way down to ndigits 2,
> and think about the example var1ndigits = 3 and var2ndigits = 10,
> then lopsided would do
>
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=1
>
> whereas Half-Karatsuba would do
>
> var1ndigits=3 var2ndigits=5
> var1ndigits=3 var2ndigits=5
>
> You can find contrary examples too of course where lopsided
> is better than Half-Karatsuba; none of them seem substantially better
> than the other.

Actually, that wasn't quite what I was thinking of. The approach I've
seen (I can't remember where) is to split var2 into roughly equal-sized
chunks as follows:

If var2ndigits >= 2 * var1ndigits, split it into a number of chunks where

    nchunks = var2ndigits / var1ndigits

using integer division with truncation. Then call mul_var_chunks()
(say), passing it nchunks, to perform the multiplication using that
many chunks.

mul_var_chunks() would use nchunks to decide on the chunk size
according to

    chunk_size = var2ndigits / nchunks

again using integer division with truncation. That has a remainder in
the range [0, nchunks), which is divided up between the chunks so that
some of them end up being chunk_size + 1 digits. I.e., something like:

    chunk_remainder = var2ndigits - chunk_size * nchunks

    for (int start = 0, end = chunk_size;
         start < var2ndigits;
         start = end, end = start + chunk_size)
    {
        /* Distribute remainder over the first few chunks */
        if (chunk_remainder > 0)
        {
            end++;
            chunk_remainder--;
        }

        /* Process chunk between "start" and "end" */
    }

What this does is adapt the shape of the chunks to the region where the
Karatsuba algorithm is most efficient. For example, suppose
var1ndigits = 1000. Then:

For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and
chunk_size will be at most (2999/2)+1 = 1500.

For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and
chunk_size will be at most (3999/3)+1 = 1334.

And so on. The result is that all the chunks will fall into that region
where var2ndigits / var1ndigits is between 1 and 1.5, for which
KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and
Karatsuba will operate efficiently all the way down to
KARATSUBA_BASE_LIMIT.

Regards,
Dean
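As a concrete illustration of the chunk-boundary arithmetic Dean
sketches above, here is a small self-contained C program
(mul_var_chunks() itself is hypothetical; this is not code from the
patch). It only computes the chunk boundaries and checks that they tile
var2 exactly:

```
#include <stdio.h>
#include <assert.h>

/*
 * Compute and print the chunk boundaries for splitting a var2ndigits-digit
 * input into nchunks = var2ndigits / var1ndigits roughly equal chunks,
 * spreading the remainder over the first few chunks, as described above.
 */
static void
show_chunks(int var1ndigits, int var2ndigits)
{
	int		nchunks = var2ndigits / var1ndigits;	/* truncating division */
	int		chunk_size = var2ndigits / nchunks;		/* truncating division */
	int		chunk_remainder = var2ndigits - chunk_size * nchunks;
	int		covered = 0;

	for (int start = 0, end = chunk_size;
		 start < var2ndigits;
		 start = end, end = start + chunk_size)
	{
		/* Distribute remainder over the first few chunks */
		if (chunk_remainder > 0)
		{
			end++;
			chunk_remainder--;
		}
		printf("chunk [%d, %d), length %d\n", start, end, end - start);
		covered += end - start;
	}
	assert(covered == var2ndigits);		/* the chunks tile var2 exactly */
}

int
main(void)
{
	show_chunks(1000, 2999);	/* two chunks: 1500 and 1499 digits */
	show_chunks(1000, 3999);	/* three chunks of 1333 digits each */
	return 0;
}
```

Every chunk ends up either chunk_size or chunk_size + 1 digits long, so
each sub-product stays in the region where var2ndigits / var1ndigits is
between 1 and roughly 1.5, matching the examples in Dean's message.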
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson wrote:
>
> The surprising realization here is that there are actually
> (var1ndigits, var2ndigits) combinations where *only* doing
> mul_var_karatsuba_half() recursively all the way down to schoolbook
> *is* a performance win, even though we don't do any
> mul_var_karatsuba_full().
>
> Indeed only mul_var_karatsuba_half() will be called with the inputs:
> var1ndigits=1000 var2ndigits=10000
> var1ndigits=1000 var2ndigits=5000
> It will never call mul_var_karatsuba_full().
>
> Surprisingly, this still gives a 13% speed-up on an Intel Core
> i9-14900K.
>

Hmm, I don't see any gains in that example. Logically, it is doing
slightly more work splitting up var2 and adding partial products.

However, it's possible that what you're seeing is a memory access
problem where, if var2 is too large, it won't fit in cache close to the
CPU, and since the schoolbook algorithm traverses var2 multiple times,
it ends up being quicker to split up var2. Since I have a slower CPU,
I'm more likely to be CPU-limited, while you might be memory-limited.

This makes me think that this is always going to be very
hardware-dependent, and we shouldn't presume what will work best on the
user's hardware. However, if a 1000x1000 ndigit product is known to be
faster using Karatsuba on some particular hardware (possibly nearly all
hardware), then why wouldn't it make sense to do the above as 10
invocations of Karatsuba?

Regards,
Dean
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sun, Jun 30, 2024, at 17:44, Tom Lane wrote:
> "Joel Jacobson" writes:
>> On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
>>> (In general I find this patch seriously undercommented.)
>
>> However, I think the comments above split_var_at(),
>> mul_var_karatsuba_full() and mul_var_karatsuba_half()
>> are quite good already, what do you think?
>
> Not remarkably so. For starters, heaven help the reader who has
> no idea what "the Karatsuba algorithm" refers to. Nor is there
> any mention of why (or when) it's better than the traditional
> algorithm. You could at least do people the courtesy of providing
> a link to the wikipedia article that you're assuming they've
> memorized.

Thanks for guidance. New patch attached.

I've added as an introduction to Karatsuba:

+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The Karatsuba algorithm is a divide-and-conquer algorithm that reduces
+ * the complexity of large number multiplication. It splits each number
+ * into two halves and performs three multiplications on the parts,
+ * rather than four as in the traditional method. This results in
+ * a significant performance improvement for sufficiently large numbers.
...
+ * For more details on the Karatsuba algorithm, see the Wikipedia article:
+ * https://en.wikipedia.org/wiki/Karatsuba_algorithm

> There's also a discussion to be had about why Karatsuba is
> a better choice than other divide-and-conquer multiplication
> methods. Why not Toom-Cook, for example, which the aforesaid
> wikipedia page says is faster yet? I suppose you concluded
> that the extra complexity is unwarranted, but this is the
> sort of thing I'd expect to see explained in the comments.

I've added this to the end of the comment on mul_var_karatsuba_full():

+ * The Karatsuba algorithm is preferred over other divide-and-conquer methods
+ * like Toom-Cook for this implementation due to its balance of complexity and
+ * performance gains given Numeric's constraints.
+ *
+ * For Toom-Cook to be worth the added complexity, the factors would need to
+ * be much larger than supported by Numeric, making Karatsuba a more
+ * appropriate choice.

Also, I added this comment on the #define's at the beginning:

+/*
+ * Constants used to determine when the Karatsuba algorithm should be used
+ * for multiplication. These thresholds were determined empirically through
+ * benchmarking across various architectures, aiming to avoid performance
+ * regressions while capturing potential gains. The choice of these values
+ * involves trade-offs and balances simplicity and performance.
+ */

As well as this comment on KARATSUBA_CONDITION:

+/*
+ * KARATSUBA_CONDITION() -
+ *
+ * This macro determines if the Karatsuba algorithm should be applied
+ * based on the number of digits in the multiplicands. It checks if
+ * the number of digits in the larger multiplicand exceeds a base limit
+ * and if the sizes of the multiplicands fall within specific ranges
+ * where Karatsuba multiplication is usually beneficial.
+ *
+ * The conditions encapsulated by KARATSUBA_CONDITION are:
+ * 1. The larger multiplicand has more digits than the base limit.
+ * 2. The sizes of the multiplicands fall within low, middle, or high range
+ *    conditions which were identified as performance beneficial regions
+ *    during benchmarks.
+ *
+ * The macro ensures that the algorithm is applied only when it is likely
+ * to provide performance benefits, considering the size and ratio of the
+ * factors.
+ */

What do you think?
Regards,
Joel

[Attachment: v2-0001-numeric-mul_var-karatsuba.patch]
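As a toy illustration of the three-multiplication split described in
the comment hunk above (plain C integers and base B = 10000, numeric's
NBASE, are used here; this is not the patch's NumericVar code):

```
#include <assert.h>

int
main(void)
{
	const long long B = 10000;				/* NBASE: one "digit" = 4 decimals */
	long long	x = 12345678, y = 87654321; /* two 2-digit (base-B) operands */

	long long	x1 = x / B, x0 = x % B;		/* split x = x1*B + x0 */
	long long	y1 = y / B, y0 = y % B;		/* split y = y1*B + y0 */

	long long	z2 = x1 * y1;				/* high half * high half */
	long long	z0 = x0 * y0;				/* low half * low half */
	/* the single extra multiplication replaces the two cross products */
	long long	z1 = (x1 + x0) * (y1 + y0) - z2 - z0;

	/* recombine: z2*B^2 + z1*B + z0 equals the schoolbook product */
	assert(z2 * B * B + z1 * B + z0 == x * y);
	return 0;
}
```

Applied recursively, the three sub-products replace the four that an
operand-halving schoolbook step would need, which is where the
asymptotic gain comes from; per the comment above, the patch's
mul_var_karatsuba_full() does the same thing on NumericVar digit
arrays, with the added care of carries, weights and signs.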
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
"Joel Jacobson" writes: > On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote: >> (In general I find this patch seriously undercommented.) > However, I think the comments above split_var_at(), > mul_var_karatsuba_full() and mul_var_karatsuba_half() > are quite good already, what do you think? Not remarkably so. For starters, heaven help the reader who has no idea what "the Karatsuba algorithm" refers to. Nor is there any mention of why (or when) it's better than the traditional algorithm. You could at least do people the courtesy of providing a link to the wikipedia article that you're assuming they've memorized. There's also a discussion to be had about why Karatsuba is a better choice than other divide-and-conquer multiplication methods. Why not Toom-Cook, for example, which the aforesaid wikipedia page says is faster yet? I suppose you concluded that the extra complexity is unwarranted, but this is the sort of thing I'd expect to see explained in the comments. regards, tom lane
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> However, I really don't like having these magic constants at all,
> because in practice the threshold above which the Karatsuba algorithm
> is a win can vary depending on a number of factors, such as whether
> it's running on 32-bit or 64-bit, whether or not SIMD instructions are
> available, the relative timings of CPU instructions, the compiler
> options used, and probably a bunch of other things.
...
> Doing a quick test on my machine, using random equal-length inputs of
> various sizes, I got the following performance results:
>
>  digits | rate (HEAD)   | rate (patch)  | change
> --------+---------------+---------------+--------
>      10 | 6.060014e+06  | 6.0189365e+06 | -0.7%
>     100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%

Does the PostgreSQL community these days have access to some kind of
performance farm, covering some/all of the supported hardware
architectures?

Personally, I have three machines:

- MacBook Pro M3 Max
- Intel Core i9-14900K
- AMD Ryzen 9 7950X3D

In addition I usually spin up a few AWS instances of different types,
but this is scary, because one time I forgot to turn them off for a
week, which was quite costly. Would be much nicer with a performance
farm!

If one exists, please let me know and no need to read the rest of this
email. Otherwise:

Imagine if we could send a patch to a separate mailing list, and the
system would auto-detect what catalog functions are affected, and
automatically generate a performance report, showing the delta per
platform.

Binary functions, like numeric_mul(), should generate an image where
the two axes would be the size of the inputs, and the color of each
pixel should show the performance gain/loss, whereas unary functions
like sqrt() should have the size of the input as the x-axis and
performance gain/loss as the y-axis.

How to test each catalog function would of course need to be designed
manually, but maybe the detection of affected functions could be
automated, if we accept some false positives/negatives, i.e.
benchmarking too many or too few catalog functions, given a certain
patch.

Catalog functions are just a tiny part of PostgreSQL, so there should
of course be other tests as well to cover other things, but since they
are simple to test predictably, maybe it could be a good start for the
project, even if it's far from the most important thing to benchmark.

I found an old performance farm topic from 2012 but it seems the
discussion just stopped for some reason not clear to me.

Regards,
Joel
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
> Dean Rasheed writes:
>> There's another complication though (if the threshold is made
>> configurable): the various numeric functions that use mul_var() are
>> immutable, which means that the results from the Karatsuba algorithm
>> must match those from the schoolbook algorithm exactly, for all
>> inputs.
>
> That seems like an impossible standard to meet. What we'd probably
> have to do is enable Karatsuba only when mul_var is being asked
> for an exact (full-precision) result. This'd complicate the check
> condition further, possibly reaching the point where there is a
> visible drag on performance in the non-Karatsuba case.

OK, noted that we should only do Karatsuba when mul_var is being asked
for an exact (full-precision) result.

> Another possible source of drag: if mul_var is now recursive,
> does it need a stack depth check? If we can prove that the
> number of recursion levels is no more than O(log(N)) in the
> number of input digits, it's probably safe to skip that ...
> but I see no such proof here.
> (In general I find this patch seriously undercommented.)

Yes, the #define's around KARATSUBA_CONDITION need to be documented,
but I felt this region of the patch might evolve and need some
discussion first, if this level of complexity is acceptable.

However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already, what do you think?

>> There's a wider question as to how many people use such big numeric
>> values -- i.e., how many people are actually going to benefit from
>> this? I don't have a good feel for that.
>
> I have heard of people doing calculations on bignum integers in
> Postgres, but they are very few and far between --- usually that
> sort of task is better done in another language. (No matter how
> fast we make mul_var, the general overhead of SQL expressions in
> general and type numeric in particular means it's probably not
> the right tool for heavy-duty bignum arithmetic.)
>
> There is definitely an argument to be made that this proposal is
> not worth the development effort and ongoing maintenance effort
> we'd have to sink into it.

It's a good question, and I'm not sure what I think; maybe the status
quo is best, but it feels like something could be done at least.

The patch basically consists of three parts:

- Threshold function KARATSUBA_CONDITION
  This is a bit unorthodox, since it's more ambitious than many popular
  numeric libraries. I've only found GMP to be even more complex.

- mul_var_karatsuba_half()
  Since it's provably correct using simple school-grade substitutions,
  I don't think this function is unorthodox, and it shouldn't need to
  change, unless we change the definition of NumericVar in the future.

- mul_var_karatsuba()
  This follows the canonical pseudo-code at Wikipedia for the Karatsuba
  algorithm precisely, so nothing unorthodox here either.

So, I think the KARATSUBA_CONDITION requires more development and
maintenance effort than the rest of the patch, since it's based on
measurements on numerous architectures, which will be different in the
future.

The rest of the patch -- split_var_at(), mul_var_karatsuba_full(),
mul_var_karatsuba_half() -- should require much less effort to
maintain, since they should remain the same even when we need to
support new architectures.

I'm eager to hear your thoughts after you've also read my other reply
moments ago to Dean.

Regards,
Joel
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On 2024-Jun-30, Joel Jacobson wrote:

> However, I can imagine other crypto algorithms might require larger factors,
> as well as other scientific research use cases, such as astronomy and physics,
> that could desire storage of numeric values of very high precision.

FWIW I was talking to some people with astronomical databases, and for
reasons that seem pretty relatable they prefer to do all numerical
analysis outside of the database, and keep the database server only for
the data storage and retrieval parts. It seems a really hard sell that
they would try to do that analysis in the database, because the
parallelizability aspect is completely different for the analysis part
than for the database part -- I mean, they can easily add more servers
to do photography analysis, but trying to add database servers (to cope
with additional load from doing the analysis there) is a much harder
sell.

--
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you!
(West Side Story, "Gee, Officer Krupke")
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
>> Attached, rebased version of the patch that implements the Karatsuba
>> algorithm in numeric.c's mul_var().
>
> Something to watch out for is that not all callers of mul_var() want
> an exact result. Several internal callers request an approximate
> result by passing it an rscale value less than the sum of the input
> dscales. The schoolbook algorithm handles that by computing up to
> rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
> the new Karatsuba code always computes the full result and then
> rounds. That impacts the performance of various functions, for
> example:
>
> select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;
>
> Time: 1790.825 ms (00:01.791) [HEAD]
> Time: 2161.244 ms (00:02.161) [with patch]

Oops. Thanks for spotting this and clarifying. I read Tom's reply and
note that we should only do Karatsuba when an exact (full-precision)
result is requested.

> Looking at mul_var_karatsuba_half(), I don't really like the approach
> it takes. The whole correctness proof using the Karatsuba formula
> seems to somewhat miss the point that this function isn't actually
> implementing the Karatsuba algorithm, it is implementing the
> schoolbook algorithm in two steps, by splitting the longer input into
> two pieces.

The surprising realization here is that there are actually
(var1ndigits, var2ndigits) combinations where *only* doing
mul_var_karatsuba_half() recursively all the way down to schoolbook
*is* a performance win, even though we don't do any
mul_var_karatsuba_full().

mul_var_karatsuba_half() *is* actually implementing the exact Karatsuba
formula; it's just taking a shortcut, exploiting the pre-known fact
that splitting `var1` at `m2` would result in `high1` being zero. This
allows the provably correct substitutions to be made, which avoids the
meaningless computations.

> But why just split it into two pieces? That will just lead
> to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> why not split the longer input into N roughly equal sized pieces, each
> around the same length as the shorter input, multiplying and adding
> them at the appropriate offsets?

The approach you're describing is implemented by e.g. CPython and is
called "lopsided" in their code base. It has some different performance
characteristics, compared to the recursive Half-Karatsuba approach.

What I didn't like about lopsided is the degenerate case where the last
chunk is much shorter than var1. For example, if we pretend we would be
doing Karatsuba all the way down to ndigits 2, and think about the
example var1ndigits = 3 and var2ndigits = 10, then lopsided would do

var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=1

whereas Half-Karatsuba would do

var1ndigits=3 var2ndigits=5
var1ndigits=3 var2ndigits=5

You can find contrary examples too, of course, where lopsided is better
than Half-Karatsuba; none of them seem substantially better than the
other.

My measurements indicated that, overall, Half-Karatsuba seemed like the
marginal winner on the architectures I tested, but they were all very
similar, i.e. almost the same number of "wins" and "losses", for
different (var1ndigits, var2ndigits) combinations.
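To spell out the substitution Joel refers to (the notation below is
mine, not taken from the patch): write the split at digit position m as
var2 = h2*B^m + l2, and note that var1, having at most m digits, has
high half h1 = 0 (so l1 = var1). Substituting into the Karatsuba
recombination gives

$$
z_2 = h_1 h_2 = 0, \qquad z_0 = l_1 l_2, \qquad
z_1 = (h_1 + l_1)(h_2 + l_2) - z_2 - z_0 = l_1 h_2,
$$

$$
\mathit{var1} \cdot \mathit{var2} = z_2 B^{2m} + z_1 B^m + z_0
  = l_1 h_2 \, B^m + l_1 l_2.
$$

So with high1 known to be zero, the three Karatsuba sub-products
collapse into two ordinary multiplications of var1 against the halves
of var2 at shifted offsets, which is exactly why Dean reads the
function as the schoolbook algorithm done in two steps rather than
Karatsuba proper.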
Note that even with lopsided, there will still be recursion, since
Karatsuba is a recursive algorithm, so to satisfy Tom Lane's request
about proving the recursion is limited, we will still need to prove the
same thing for lopsided+Karatsuba.

Here is some old code from my experiments, if we want to evaluate
lopsided:

```
static void slice_var(const NumericVar *var, int start, int length,
					  NumericVar *slice);
static void mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
							 NumericVar *result);

/*
 * slice_var() -
 *
 *	Extract a slice of a NumericVar starting at a specified position
 *	and with a specified length.
 */
static void
slice_var(const NumericVar *var, int start, int length, NumericVar *slice)
{
	Assert(start >= 0);
	Assert(start + length <= var->ndigits);

	init_var(slice);

	slice->ndigits = length;
	slice->digits = var->digits + start;
	slice->buf = NULL;
	slice->weight = var->weight - var->ndigits + length;
	slice->sign = var->sign;
	slice->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
}

/*
 * mul_var_lopsided() -
 *
 *	Lopsided Multiplication for unequal-length factors.
 *
 *	This function handles the case where var1 has significantly fewer digits
 *	than var2. In such a scenario, splitting var1 for a balanced
 *	multiplication algorithm would be inefficient, as the high part would be
 *	zero.
 *
 *	To overcome this inefficiency, the function divides factor2 into a series
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sat, 29 Jun 2024 at 16:25, Tom Lane wrote: > > Dean Rasheed writes: > > There's another complication though (if the threshold is made > > configurable): the various numeric functions that use mul_var() are > > immutable, which means that the results from the Karatsuba algorithm > > must match those from the schoolbook algorithm exactly, for all > > inputs. > > That seems like an impossible standard to meet. What we'd probably > have to do is enable Karatsuba only when mul_var is being asked > for an exact (full-precision) result. Yeah, using Karatsuba for approximate products is probably a bit too ambitious. I think it'd be reasonably straightforward to have it produce the same results for all rscale values. You'd just have to decide on the required rscale for each sub-product, based on where it's being added, truncating at various points, and being sure to only round once at the very end. The problem is that it'd end up computing a larger fraction of the full product than the schoolbook algorithm would have done, so the threshold for using Karatsuba would have to be higher (probably quite a lot higher) and figuring out how that varied with the requested rscale would be hard. So using Karatsuba only in the full-precision case seems like a reasonable restriction. That'd still benefit some other functions like sqrt() and therefore ln() and pow() to some extent. However,... > There is definitely an argument to be made that this proposal is > not worth the development effort and ongoing maintenance effort > we'd have to sink into it. I'm leaning more towards this opinion, especially since I think the patch needs a lot more work to be committable. Regards, Dean
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Dean Rasheed writes: > There's another complication though (if the threshold is made > configurable): the various numeric functions that use mul_var() are > immutable, which means that the results from the Karatsuba algorithm > must match those from the schoolbook algorithm exactly, for all > inputs. That seems like an impossible standard to meet. What we'd probably have to do is enable Karatsuba only when mul_var is being asked for an exact (full-precision) result. This'd complicate the check condition further, possibly reaching the point where there is a visible drag on performance in the non-Karatsuba case. Another possible source of drag: if mul_var is now recursive, does it need a stack depth check? If we can prove that the number of recursion levels is no more than O(log(N)) in the number of input digits, it's probably safe to skip that ... but I see no such proof here. (In general I find this patch seriously undercommented.) > There's a wider question as to how many people use such big numeric > values -- i.e., how many people are actually going to benefit from > this? I don't have a good feel for that. I have heard of people doing calculations on bignum integers in Postgres, but they are very few and far between --- usually that sort of task is better done in another language. (No matter how fast we make mul_var, the general overhead of SQL expressions in general and type numeric in particular means it's probably not the right tool for heavy-duty bignum arithmetic.) There is definitely an argument to be made that this proposal is not worth the development effort and ongoing maintenance effort we'd have to sink into it. regards, tom lane
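A back-of-the-envelope check of the recursion-depth question raised
above, assuming (and this is precisely the part that would need an
actual proof) that every recursive call at least halves the larger
operand until it drops below the patch's KARATSUBA_BASE_LIMIT of 384,
and using numeric's cap of 32768 base-NBASE digits mentioned elsewhere
in the thread:

```
#include <stdio.h>

int
main(void)
{
	int		ndigits = 32768;	/* numeric's maximum number of NBASE digits */
	int		base_limit = 384;	/* KARATSUBA_BASE_LIMIT from the patch */
	int		depth = 0;

	/* count how many halvings it takes to fall below the base limit */
	while (ndigits > base_limit)
	{
		ndigits = (ndigits + 1) / 2;	/* size of the larger half */
		depth++;
	}
	printf("worst-case recursion depth: %d\n", depth);	/* prints 7 */
	return 0;
}
```

Under that assumption the depth is bounded by a handful of levels,
which would make skipping check_stack_depth() defensible; absent a
proof that the splits really do halve the larger operand, the question
stands.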
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba
> algorithm in numeric.c's mul_var().
>

Something to watch out for is that not all callers of mul_var() want an
exact result. Several internal callers request an approximate result by
passing it an rscale value less than the sum of the input dscales. The
schoolbook algorithm handles that by computing up to rscale plus
MUL_GUARD_DIGITS extra digits and then rounding, whereas the new
Karatsuba code always computes the full result and then rounds. That
impacts the performance of various functions, for example:

select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;

Time: 1790.825 ms (00:01.791) [HEAD]
Time: 2161.244 ms (00:02.161) [with patch]

Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula seems
to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the schoolbook
algorithm in two steps, by splitting the longer input into two pieces.

But why just split it into two pieces? That will just lead to a lot of
unnecessary recursion for very unbalanced inputs. Instead, why not
split the longer input into N roughly equal sized pieces, each around
the same length as the shorter input, multiplying and adding them at
the appropriate offsets?

As an example, given inputs with var1ndigits = 1000 and
var2ndigits = 10000, mul_var() will invoke mul_var_karatsuba_half(),
which will then recursively invoke mul_var() twice with
var1ndigits = 1000 and var2ndigits = 5000, which no longer satisfies
KARATSUBA_CONDITION(), so it will just invoke the schoolbook algorithm
on each half, which stands no chance of being any faster. On the other
hand, if it divided var2 into 10 chunks of length 1000, it would invoke
the Karatsuba algorithm on each chunk, which would at least stand a
chance of being faster.

Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the other
conditions in KARATSUBA_CONDITION() will pass either, so it'll just
invoke the schoolbook algorithm on each piece, which is bound to be
slower once all the overheads are taken into account.

For example, given var1ndigits = 200 and var2ndigits = 30000,
KARATSUBA_CONDITION() will pass due to
KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse with
var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.

Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant.
If var1ndigits > 2000 and var2ndigits < 2500, then
KARATSUBA_LOW_RANGE_CONDITION() is satisfied, so these tests could be
simplified, eliminating some of those magic constants.

However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort.

I think a better approach would be to have a configurable threshold.
Ideally, this would be just one number, with all other numbers being
derived from it, possibly using some simple heuristic to reduce the
effective threshold for more balanced inputs, for which the Karatsuba
algorithm is more efficient. Having a configurable threshold would
allow people to tune for best performance on their own platforms, and
also it would make it easier to write tests that hit the new code. As
it stands, it's not obvious how much of the new code is being hit by
the existing tests.

Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:

digits
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba
> algorithm in numeric.c's mul_var().

It's one of these areas where Dean Rasheed would be a good match for a
review, so adding him in CC. He has been doing a lot of stuff in this
area over the years.

+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737

These numbers feel magic, and there are no explanations behind these
choices, so it is hard to know whether they are good or not, or if
there are potentially "better" choices. I'd suggest explaining why
these variables are here as well as the basics of the method in this
area of the code, with the function doing the operation pointing at
that, so that all the explanations are in a single place. Okay, these
are thresholds based on the number of digits to decide if the normal or
Karatsuba's method should be used, but grouping all the explanations in
a single place is simpler.

I may have missed something, but did you do some benchmark when the
thresholds are at their limit, where we would fall back to the
calculation method of HEAD? I guess that the answer to my question of
"Is HEAD performing better across these thresholds?" is clearly "no"
based on what I read at [1] and the threshold numbers chosen, still
asking.

[1]: https://en.wikipedia.org/wiki/Karatsuba_algorithm
--
Michael
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Fri, Jun 14, 2024, at 03:07, Aaron Altman wrote:
> Thanks for the detailed background and comments, Joel!
>
> The new status of this patch is: Ready for Committer

Thanks for reviewing.

Attached, rebased version of the patch that implements the Karatsuba
algorithm in numeric.c's mul_var().

/Joel

[Attachment: 0001-numeric-mul_var-karatsuba.patch]
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
The following review has been posted through the commitfest application:

make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

This applies cleanly to master, builds and passes regression tests in
my Windows/Cygwin environment. I also read through comments and
confirmed for myself that the assumption about the caller ensuring var1
is shorter is done already in unchanged code from mul_var. Frees
correspond to inits.

The "Karatsuba condition" reasoning for deciding whether a number is
big enough to use this algorithm appears to match what Joel has stated
in this thread. The arithmetic appears to match what's been described
in the comments. I have *not* confirmed that with any detailed review
of the Karatsuba algorithm from outside sources, other implementations
like the Rust one referenced here, or anything similar. I'm hoping that
the regression tests give sufficient coverage that if the arithmetic
was incorrect there would be obvious failures.

If additional coverage was needed, cases falling immediately on either
side of the limiting conditions used in the patch would probably be
useful. From the limited precedent I've exposed myself to, that doesn't
seem to be required here, but I'm open to contrary input from other
reviewers. In the meantime, I'm marking this approved.

Thanks for the detailed background and comments, Joel!

The new status of this patch is: Ready for Committer
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
On Tue, Jun 11, 2024, at 19:16, Aaron Altman wrote:
> Hi Joel, thanks for posting this. Although I have only a cursory
> familiarity with fast multiplication algorithms, I'd like to try and
> give it a review. To start with, can you help me understand the choice
> of this algorithm versus others like Toom? If this looks correct on a
> closer view I'll propose it for inclusion. Along the way though I'd
> like to have it explicitly called out whether this is superior in
> general to other choices, better for more realistic use cases,
> simpler, clearer to license or something similar. It would be nice for
> future discussions to have some context around whether it would make
> sense to have conditions to choose other algorithms as well, or if
> this one is generally the best for what Postgres users are usually
> doing.
>
> Continuing with code review in any case. Interested to hear more.

Hi Aaron, thanks for looking at this!

The choice of best algorithm depends on the factor sizes. The larger
the factor sizes, the more complicated an algorithm can be "afforded".

List of fast multiplication algorithms, ordered by the factor sizes
they are suitable for:

- Long multiplication, aka "schoolbook" multiplication
- Karatsuba
- Toom-3
- Schönhage–Strassen algorithm (Fast Fourier Transform)

The Toom-3 algorithm can be modified to split the smaller and larger
factors into different numbers of parts. The notation used at Wikipedia
is e.g. Toom-2.5, which I think means splitting the larger factor into
three parts and the smaller into two, while GMP uses Toom32 to mean the
same thing. Personally, I think GMP's notation is easier to understand,
since the number of parts can be directly derived from the name.

I experimented with implementing Toom-3 as well, but there was only a
marginal win, at very large factor sizes, and since numeric's max
ndigits (number of base-digits) is capped at 32768, I didn't think it
was worth it, since it adds quite a lot of complexity.

The Karatsuba algorithm is the next step in the hierarchy of fast
multiplication algorithms, and all other bigint libs I've looked at
implement Karatsuba, even if they also implement Toom-3, since
Karatsuba is faster than Toom-3 for factors that are sufficiently
small, but at the same time sufficiently large for Karatsuba to be
faster than schoolbook.

I was initially surprised by the quite large threshold where Karatsuba
started to be a win over schoolbook. I think the explanation why
mul_var() stays fast up to quite remarkably high factor sizes could be
a combination of several things, such as:

- mul_var() is already heavily optimized, with clever tricks, such as
  deferred carry propagation.
- numeric uses NBASE=10000, while other bigint libs usually use a power
  of two.

In the Karatsuba implementation, I tried to keep KARATSUBA_CONDITION()
quite simple, but it's way more complex than what most bigint libs use;
they usually just check if the smaller factor is smaller than some
threshold, and if so, use schoolbook. For instance, this is what Rust's
num-bigint does:

    if x.len() <= 32 {
        // Long multiplication
    } else if x.len() * 2 <= y.len() {
        // Half-Karatsuba, for factors with significant length disparity
    } else if x.len() <= 256 {
        // Karatsuba multiplication
    } else {
        // Toom-3 multiplication
    }

Source:
https://github.com/rust-num/num-bigint/blob/master/src/biguint/multiplication.rs#L101

Side note: When working on Karatsuba in mul_var(), I looked at some
other bigint implementations, to try to understand their threshold
functions.
I noticed that Rust's num-bigint didn't optimise for factors with
significant length disparity, so I contributed a patch based on my
"Half-Karatsuba" idea, which I got when working with mul_var(), and it
has now been merged:
https://github.com/rust-num/num-bigint/commit/06b61c8138ad8a9959ac54d9773d0a9ebe25b346

In mul_var(), if we don't like the complexity of KARATSUBA_CONDITION(),
we could go for a more traditional threshold approach, i.e. just
checking the smaller factor size. However, I believe that would be at
the expense of missing out on some performance gains.

I've tried quite hard to find the best KARATSUBA_CONDITION(), but I
found it to be a really hard problem: the differences between CPU
architectures, in combination with wanting a simple expression, mean
there is no obvious perfect threshold function; there will always be a
trade-off. I eventually stopped trying to improve it, and just settled
on the version in the patch, thinking I'd leave it up to the community
to give feedback on what complexity for the threshold function is
warranted.

If we absolutely just want to check the smallest factor size, like
Rust, then it's super simple, and the threshold can easily be found
just by testing different values. It's when both factor sizes are
inputs to the threshold function that it gets complicated.

/Joel
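For comparison, here is what a minimal dispatch in the style of the
quoted Rust code might look like in C; the function and the shape of
the conditions are hypothetical, shown only to contrast the "check the
smaller factor only" approach with the patch's KARATSUBA_CONDITION():

```
#include <stdio.h>

#define KARATSUBA_BASE_LIMIT 384	/* value taken from the patch */

typedef enum
{
	MUL_SCHOOLBOOK,			/* small inputs: the O(n^2) loop wins */
	MUL_HALF_KARATSUBA,		/* strongly unbalanced: split the longer input */
	MUL_KARATSUBA			/* balanced and large enough: full Karatsuba */
} mul_choice;

/* var1 is assumed to be the shorter input, as mul_var()'s callers ensure */
static mul_choice
choose_mul(int var1ndigits, int var2ndigits)
{
	if (var1ndigits < KARATSUBA_BASE_LIMIT)
		return MUL_SCHOOLBOOK;
	if (var2ndigits >= 2 * var1ndigits)
		return MUL_HALF_KARATSUBA;
	return MUL_KARATSUBA;
}

int
main(void)
{
	printf("%d\n", choose_mul(100, 100));		/* schoolbook */
	printf("%d\n", choose_mul(1000, 10000));	/* half-Karatsuba */
	printf("%d\n", choose_mul(1000, 1200));		/* full Karatsuba */
	return 0;
}
```

The appeal of this shape is that a single per-platform number is easy
to tune, which is essentially Dean's configurable-threshold suggestion
upthread; the cost, as Joel notes, is giving up the gains that a
condition on both factor sizes can capture.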
Re: Optimize numeric.c mul_var() using the Karatsuba algorithm
Hi Joel, thanks for posting this. Although I have only a cursory
familiarity with fast multiplication algorithms, I'd like to try and
give it a review. To start with, can you help me understand the choice
of this algorithm versus others like Toom? If this looks correct on a
closer view I'll propose it for inclusion. Along the way though I'd
like to have it explicitly called out whether this is superior in
general to other choices, better for more realistic use cases, simpler,
clearer to license or something similar. It would be nice for future
discussions to have some context around whether it would make sense to
have conditions to choose other algorithms as well, or if this one is
generally the best for what Postgres users are usually doing.

Continuing with code review in any case. Interested to hear more.

Regards,
Aaron Altman
Optimize numeric.c mul_var() using the Karatsuba algorithm
Hi,

This patch introduces the Karatsuba algorithm to speed up
multiplication operations in numeric.c, where the operands have many
digits. It is implemented via a new conditional in mul_var() that
determines whether the sizes of the factors are sufficiently large to
justify its use. This decision is non-trivial due to its recursive
nature, depending on the size and ratio of the factors. Moreover, the
optimal threshold varies across different architectures.

This benefits all users of mul_var() in numeric.c, such as:

numeric_mul()
numeric_lcm()
numeric_fac()
int64_div_fast_to_numeric()
mod_var()
div_mod_var()
sqrt_var()
exp_var()
ln_var()
power_var()

The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible
for this decision. It is deliberately conservative, to prevent
performance regressions on the tested architectures while maximizing
potential gains and maintaining simplicity.

Patches:

1. mul_var-karatsuba.patch
   Modifies mul_var() to use the Karatsuba functions for multiplying
   larger numerical factors.

2. mul_var-karatsuba-benchmark.patch
   Introduces numeric_mul_karatsuba() and mul_var_karatsuba() alongside
   the existing numeric_mul() and mul_var() functions. This enables
   benchmark comparisons between the original multiplication method and
   the Karatsuba-optimized version.

Some benchmark numbers, tested on Intel Core i9-14900K:

Helper function to generate a numeric of a given ndigits, using the new
random(min numeric, max numeric):

CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$
SELECT random(
    ('1000'||repeat('0000',ndigits-1))::numeric,
    (repeat('9999',ndigits))::numeric
)
$$ LANGUAGE sql;

Benchmark equal factor sizes, 16384 x 16384 ndigits:

SELECT random_ndigits(16384) * random_ndigits(16384) > 0;
Time: 33.990 ms
Time: 33.961 ms
Time: 34.183 ms

SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0;
Time: 17.621 ms
Time: 17.209 ms
Time: 16.444 ms

Benchmark equal factor sizes, 8192 x 8192 ndigits:

SELECT random_ndigits(8192) * random_ndigits(8192) > 0;
Time: 12.568 ms
Time: 12.563 ms
Time: 12.701 ms

SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0;
Time: 9.919 ms
Time: 9.929 ms
Time: 9.659 ms

To measure smaller factor sizes, \timing doesn't provide enough
precision.
Below measurements are made using my pg-timeit extension:

Benchmark equal factor sizes, 1024 x 1024 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
100 µs

SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
73 µs

Benchmark equal factor sizes, 512 x 512 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
27 µs

SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
23 µs

Benchmark unequal factor sizes, 2048 x 16384 ndigits:

SELECT timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
3.6 ms

SELECT timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
2.7 ms

The KARATSUBA_CONDITION was determined through benchmarking on the
following architectures:

- Intel Core i9-14900K (desktop)
- AMD Ryzen 9 7950X3D (desktop)
- Apple M1 Max (laptop)
- AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU)
- AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids)

The images depicting the benchmark plots are rather large, so I've
refrained from including them as attachments. Instead, I've provided
URLs to the benchmarks for direct access:

https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png

This image shows the best possible performance ratio per architecture,
derived using Dynamic Programming. The black line segment shows the
manually crafted threshold function, which aims to avoid performance
regressions while capturing the beneficial regions, as a relatively
simple threshold function, which has been implemented in both patches
as the KARATSUBA_CONDITION macro.

https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png

This plot displays the actual performance ratio per architecture,
measured after applying the mul_var-karatsuba-benchmark.patch.

The performance_ratio scale in both plots uses a rainbow scale, where
blue is at 1.0 and means no change. The maximum at 4.0 means that the
Karatsuba version was four times faster than the traditional mul_var()
at that architecture. To make it easier to distinguish performance
regressions from, a magenta color scale that goes from pure magenta
just below 1.0, to dar