Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-07-02 Thread Dean Rasheed
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson  wrote:
>
> On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
>
> > But why just split it into two pieces? That will just lead
> > to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> > why not split the longer input into N roughly equal sized pieces, each
> > around the same length as the shorter input, multiplying and adding
> > them at the appropriate offsets?
>
> The approach you're describing is implemented by e.g. CPython
> and is called "lopsided" in their code base. It has some different
> performance characteristics, compared to the recursive Half-Karatsuba
> approach.
>
> What I didn't like about lopsided is the degenerate case where the
> last chunk is much shorter than var1, for example, if we pretend
> we would be doing Karatsuba all the way down to ndigits 2,
> and think about the example var1ndigits = 3 and var2ndigits = 10,
> then lopsided would do
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=3
> var1ndigits=3 var2ndigits=1
>
> whereas Half-Karatsuba would do
> var1ndigits=3 var2ndigits=5
> var1ndigits=3 var2ndigits=5
>
> You can of course find contrary examples where lopsided is better
> than Half-Karatsuba; neither approach seems substantially better
> than the other.

Actually, that wasn't quite what I was thinking of. The approach I've
seen (I can't remember where) is to split var2 into roughly
equal-sized chunks as follows:

If var2ndigits >= 2 * var1ndigits, split it into a number of chunks, where

  nchunks = var2ndigits / var1ndigits

using integer division with truncation. Then call mul_var_chunks()
(say), passing it nchunks, to perform the multiplication using that
many chunks.

mul_var_chunks() would use nchunks to decide on the chunk size according to

  chunk_size = var2ndigits / nchunks

again using integer division with truncation. That has a remainder in
the range [0, nchunks), which is divided up between the chunks so that
some of them end up being chunk_size + 1 digits. I.e., something like:

  chunk_remainder = var2ndigits - chunk_size * nchunks

  for (int start = 0, end = chunk_size;
       start < var2ndigits;
       start = end, end = start + chunk_size)
  {
      /* Distribute the remainder over the first few chunks */
      if (chunk_remainder > 0)
      {
          end++;
          chunk_remainder--;
      }

      /* Process the chunk between "start" and "end" */
  }

What this does is adapt the shape of the chunks to the region where
the Karatsuba algorithm is most efficient. For example, suppose
var1ndigits = 1000. Then:

For 2000x1000 to 2999x1000 digit products, nchunks will be 2 and
chunk_size will be at most (2999/2)+1 = 1500.

For 3000x1000 to 3999x1000 digit products, nchunks will be 3 and
chunk_size will be at most (3999/3)+1 = 1334.

And so on.
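
To make the arithmetic concrete, here's a minimal standalone sketch
(illustration only, not the proposed mul_var_chunks() itself) that prints
the chunk boundaries for a hypothetical 3500x1000-digit product:

#include <stdio.h>

int
main(void)
{
    int     var1ndigits = 1000;
    int     var2ndigits = 3500;

    /* Split var2 into chunks, each roughly the size of var1 */
    int     nchunks = var2ndigits / var1ndigits;                  /* 3 */
    int     chunk_size = var2ndigits / nchunks;                   /* 1166 */
    int     chunk_remainder = var2ndigits - chunk_size * nchunks; /* 2 */

    for (int start = 0, end = chunk_size;
         start < var2ndigits;
         start = end, end = start + chunk_size)
    {
        /* Distribute the remainder over the first few chunks */
        if (chunk_remainder > 0)
        {
            end++;
            chunk_remainder--;
        }
        printf("chunk [%d, %d), length %d\n", start, end, end - start);
    }
    return 0;
}

That prints chunks of 1167, 1167 and 1166 digits, all close to var1's length.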

The result is that all the chunks will fall into that region where
var2ndigits / var1ndigits is between 1 and 1.5, for which
KARATSUBA_LOW_RANGE_CONDITION() will almost certainly pass, and
Karatsuba will operate efficiently all the way down to
KARATSUBA_BASE_LIMIT.

Regards,
Dean




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-07-02 Thread Dean Rasheed
On Sun, 30 Jun 2024 at 11:22, Joel Jacobson  wrote:
>
> The surprising realization here is that there are actually
> (var1ndigits, var2ndigits) combinations where *only* doing
> mul_var_karatsuba_half() recursively all the way down to schoolbook
> *is* a performance win, even though we don't do any
> mul_var_karatsuba_full().
>
> Indeed only mul_var_karatsuba_half() will be called with the inputs:
> var1ndigits=1000 var2ndigits=10000
> var1ndigits=1000 var2ndigits=5000
> It will never call mul_var_karatsuba_full().
>
> Surprisingly, this still gives a 13% speed-up on an Intel Core i9-14900K.
>

Hmm, I don't see any gains in that example. Logically, it is doing
slightly more work splitting up var2 and adding partial products.
However, it's possible that what you're seeing is a memory access
problem where, if var2 is too large, it won't fit in cache close to
the CPU, and since the schoolbook algorithm traverses var2 multiple
times, it ends up being quicker to split up var2. Since I have a
slower CPU, I'm more likely to be CPU-limited, while you might be
memory-limited.

This makes me think that this is always going to be very
hardware-dependent, and we shouldn't presume what will work best on
the user's hardware.

However, if a 1000x1000 ndigit product is known to be faster using
Karatsuba on some particular hardware (possibly nearly all hardware),
then why wouldn't it make sense to do the above as 10 invocations of
Karatsuba?

Regards,
Dean




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Joel Jacobson
On Sun, Jun 30, 2024, at 17:44, Tom Lane wrote:
> "Joel Jacobson"  writes:
>> On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
>>> (In general I find this patch seriously undercommented.)
>
>> However, I think the comments above split_var_at(),
>> mul_var_karatsuba_full() and mul_var_karatsuba_half()
>> are quite good already, what do you think?
>
> Not remarkably so.  For starters, heaven help the reader who has
> no idea what "the Karatsuba algorithm" refers to.  Nor is there
> any mention of why (or when) it's better than the traditional
> algorithm.  You could at least do people the courtesy of providing
> a link to the wikipedia article that you're assuming they've
> memorized.

Thanks for the guidance. New patch attached.

I've added as an introduction to Karatsuba:

+/*
+ * mul_var_karatsuba_full() -
+ *
+ * Multiplication using the Karatsuba algorithm.
+ *
+ * The Karatsuba algorithm is a divide-and-conquer algorithm that reduces
+ * the complexity of large number multiplication. It splits each number
+ * into two halves and performs three multiplications on the parts,
+ * rather than four as in the traditional method. This results in
+ * a significant performance improvement for sufficiently large numbers.
...
+ * For more details on the Karatsuba algorithm, see the Wikipedia article:
+ * https://en.wikipedia.org/wiki/Karatsuba_algorithm
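
As a side note for readers new to the algorithm, the identity the comment
refers to can be checked with a tiny standalone C program (illustration
only, operating on 64-bit integers rather than NumericVar):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* Split both factors at base B: x = x1*B + x0, y = y1*B + y0 */
    const int64_t B = 10000;
    int64_t     x = 12345678, y = 87654321;
    int64_t     x1 = x / B, x0 = x % B;
    int64_t     y1 = y / B, y0 = y % B;

    /* Three multiplications instead of four */
    int64_t     z2 = x1 * y1;
    int64_t     z0 = x0 * y0;
    int64_t     z1 = (x1 + x0) * (y1 + y0) - z2 - z0;

    /* Recombine and compare with the ordinary product */
    int64_t     karatsuba = z2 * B * B + z1 * B + z0;

    assert(karatsuba == x * y);
    printf("%lld\n", (long long) karatsuba);
    return 0;
}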

> There's also a discussion to be had about why Karatsuba is
> a better choice than other divide-and-conquer multiplication
> methods.  Why not Toom-Cook, for example, which the aforesaid
> wikipedia page says is faster yet?  I suppose you concluded
> that the extra complexity is unwarranted, but this is the
> sort of thing I'd expect to see explained in the comments.

I've added this to the end of the comment on mul_var_karatsuba_full():

+ * The Karatsuba algorithm is preferred over other divide-and-conquer methods
+ * like Toom-Cook for this implementation due to its balance of complexity and
+ * performance gains given Numeric's constraints.
+ *
+ * For Toom-Cook to be worth the added complexity, the factors would need to
+ * be much larger than supported by Numeric, making Karatsuba a more
+ * appropriate choice.

Also, I added this comment on the #define's at the beginning:

+/*
+ * Constants used to determine when the Karatsuba algorithm should be used
+ * for multiplication. These thresholds were determined empirically through
+ * benchmarking across various architectures, aiming to avoid performance
+ * regressions while capturing potential gains. The choice of these values
+ * involves trade-offs and balances simplicity and performance.
+ */

As well as this comment on KARATSUBA_CONDITION:

+/*
+ * KARATSUBA_CONDITION() -
+ *
+ * This macro determines if the Karatsuba algorithm should be applied
+ * based on the number of digits in the multiplicands. It checks if
+ * the number of digits in the larger multiplicand exceeds a base limit
+ * and if the sizes of the multiplicands fall within specific ranges
+ * where Karatsuba multiplication is usually beneficial.
+ *
+ * The conditions encapsulated by KARATSUBA_CONDITION are:
+ * 1. The larger multiplicand has more digits than the base limit.
+ * 2. The sizes of the multiplicands fall within low, middle, or high range
+ *    conditions which were identified as performance beneficial regions
+ *    during benchmarks.
+ *
+ * The macro ensures that the algorithm is applied only when it is likely
+ * to provide performance benefits, considering the size and ratio of the
+ * factors.
+ */

What do you think?

Regards,
Joel

v2-0001-numeric-mul_var-karatsuba.patch
Description: Binary data


Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Tom Lane
"Joel Jacobson"  writes:
> On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
>> (In general I find this patch seriously undercommented.)

> However, I think the comments above split_var_at(),
> mul_var_karatsuba_full() and mul_var_karatsuba_half()
> are quite good already, what do you think?

Not remarkably so.  For starters, heaven help the reader who has
no idea what "the Karatsuba algorithm" refers to.  Nor is there
any mention of why (or when) it's better than the traditional
algorithm.  You could at least do people the courtesy of providing
a link to the wikipedia article that you're assuming they've
memorized.

There's also a discussion to be had about why Karatsuba is
a better choice than other divide-and-conquer multiplication
methods.  Why not Toom-Cook, for example, which the aforesaid
wikipedia page says is faster yet?  I suppose you concluded
that the extra complexity is unwarranted, but this is the
sort of thing I'd expect to see explained in the comments.

regards, tom lane




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Joel Jacobson
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> However, I really don't like having these magic constants at all,
> because in practice the threshold above which the Karatsuba algorithm
> is a win can vary depending on a number of factors, such as whether
> it's running on 32-bit or 64-bit, whether or not SIMD instructions are
> available, the relative timings of CPU instructions, the compiler
> options used, and probably a bunch of other things. 
...
> Doing a quick test on my machine, using random equal-length inputs of
> various sizes, I got the following performance results:
>
>  digits | rate (HEAD)   | rate (patch)  | change
> --------+---------------+---------------+--------
>      10 | 6.060014e+06  | 6.0189365e+06 | -0.7%
>     100 | 2.7038752e+06 | 2.7287925e+06 | +0.9%

Does the PostgreSQL community these days have access to some kind
of performance farm, covering some/all of the supported hardware architectures?

Personally, I have three machines:
MacBook Pro M3 Max
Intel Core i9-14900K
AMD Ryzen 9 7950X3D

In addition I usually spin up a few AWS instances of different types,
but this is scary, because one time I forgot to turn them off for a week,
which was quite costly.

Would be much nicer with a performance farm!

If one exists, please let me know and no need to read the rest of this email.
Otherwise:

Imagine if we could send a patch to a separate mailing list,
and the system would auto-detect what catalog functions are affected,
and automatically generate a performance report, showing the delta per platform.

Binary functions, like numeric_mul(), should generate an image where the two
axes would be the size of the inputs, and the color of each pixel should show
the performance gain/loss, whereas unary functions like sqrt() should have
the size of the input as the x-axis and performance gain/loss as the y-axis.

How to test each catalog function would of course need to be designed
manually, but maybe the detection of affected functions could be
automated, if we accept some false positives/negatives, i.e. benchmarking
too many or too few catalog functions for a given patch.

Catalog functions are just a tiny part of PostgreSQL, so there should
of course be other tests as well to cover other things, but since they are
simple to test predictably, maybe it could be a good start for the project,
even if it's far from the most important thing to benchmark.

I found an old performance farm topic from 2012, but it seems the discussion
just stopped, for reasons that aren't clear to me.

Regards,
Joel




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Joel Jacobson
On Sat, Jun 29, 2024, at 17:25, Tom Lane wrote:
> Dean Rasheed  writes:
>> There's another complication though (if the threshold is made
>> configurable): the various numeric functions that use mul_var() are
>> immutable, which means that the results from the Karatsuba algorithm
>> must match those from the schoolbook algorithm exactly, for all
>> inputs.
>
> That seems like an impossible standard to meet.  What we'd probably
> have to do is enable Karatsuba only when mul_var is being asked
> for an exact (full-precision) result.  This'd complicate the check
> condition further, possibly reaching the point where there is a
> visible drag on performance in the non-Karatsuba case.

OK, noted that we should only do Karatsuba when mul_var is
being asked for an exact (full-precision) result.

> Another possible source of drag: if mul_var is now recursive,
> does it need a stack depth check?  If we can prove that the
> number of recursion levels is no more than O(log(N)) in the
> number of input digits, it's probably safe to skip that ...
> but I see no such proof here.

> (In general I find this patch seriously undercommented.)

Yes, the #define's around KARATSUBA_CONDITION need
to be documented, but I felt this region of the patch might evolve
and needs some discussion first about whether this level of complexity
is acceptable.

However, I think the comments above split_var_at(),
mul_var_karatsuba_full() and mul_var_karatsuba_half()
are quite good already, what do you think?

>> There's a wider question as to how many people use such big numeric
>> values -- i.e., how many people are actually going to benefit from
>> this? I don't have a good feel for that.
>
> I have heard of people doing calculations on bignum integers in
> Postgres, but they are very few and far between --- usually that
> sort of task is better done in another language.  (No matter how
> fast we make mul_var, the general overhead of SQL expressions in
> general and type numeric in particular means it's probably not
> the right tool for heavy-duty bignum arithmetic.)
>
> There is definitely an argument to be made that this proposal is
> not worth the development effort and ongoing maintenance effort
> we'd have to sink into it.

It's a good question. I'm not sure what I think; maybe the status quo is best,
but it feels like something could be done at least.

The patch basically consists of three parts:

- Threshold function KARATSUBA_CONDITION
This is a bit unorthodox, since it's more ambitious than the thresholds in
many popular numeric libraries; of those I've looked at, only GMP's is more
complex.

- mul_var_karatsuba_half()
Since it's provably correct using simple school-grade substitutions,
I don't think this function is unorthodox, and it shouldn't need to change
unless we change the definition of NumericVar in the future.

- mul_var_karatsuba()
This follows the canonical pseudo-code at Wikipedia for the Karatsuba
algorithm precisely, so nothing unorthodox here either.

So, I think the KARATSUBA_CONDITION requires more development and maintenance
effort than the rest of the patch, since it's based on measurements
on numerous architectures, which will be different in the future.

The rest of the patch, split_var_at(), mul_var_karatsuba_full() and
mul_var_karatsuba_half(), should require much less effort to maintain,
since they should remain the same even when we need to support new
architectures.

I'm eager to hear your thoughts after you've also read my other reply
to Dean from moments ago.

Regards,
Joel




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Alvaro Herrera
On 2024-Jun-30, Joel Jacobson wrote:

> However, I can imagine other crypto algorithms might require larger factors,
> as well as other scientific research use cases, such as astronomy and physics,
> that could desire storage of numeric values of very high precision.

FWIW I was talking to some people with astronomical databases, and for
reasons that seem pretty relatable they prefer to do all numerical
analysis outside of the database, and keep the database server only for
the data storage and retrieval parts.  It seems a really hard sell that
they would try to do that analysis in the database, because the
parallelizability aspect is completely different for the analysis part
than for the database part -- I mean, they can easily add more servers
to do photography analysis, but trying to add database servers (to cope
with additional load from doing the analysis there) is a much harder
sell.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Joel Jacobson
On Sat, Jun 29, 2024, at 14:22, Dean Rasheed wrote:
> On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
>> Attached, rebased version of the patch that implements the Karatsuba 
>> algorithm in numeric.c's mul_var().
>>
>
> Something to watch out for is that not all callers of mul_var() want
> an exact result. Several internal callers request an approximate
> result by passing it an rscale value less than the sum of the input
> dscales. The schoolbook algorithm handles that by computing up to
> rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
> the new Karatsuba code always computes the full result and then
> rounds. That impacts the performance of various functions, for
> example:
>
> select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;
>
> Time: 1790.825 ms (00:01.791)  [HEAD]
> Time: 2161.244 ms (00:02.161)  [with patch]

Oops. Thanks for spotting this and clarifying.

I read Tom's reply and note that we should only do Karatsuba when
an exact (full-precision) result is requested.

> Looking at mul_var_karatsuba_half(), I don't really like the approach
> it takes. The whole correctness proof using the Karatsuba formula
> seems to somewhat miss the point that this function isn't actually
> implementing the Karatsuba algorithm, it is implementing the
> schoolbook algorithm in two steps, by splitting the longer input into
> two pieces.

The surprising realization here is that there are actually
(var1ndigits, var2ndigits) combinations where *only* doing
mul_var_karatsuba_half() recursively all the way down to schoolbook
*is* a performance win, even though we don't do any
mul_var_karatsuba_full().

mul_var_karatsuba_half() *is* actually implementing the exact
Karatsuba formula; it's just taking a shortcut by exploiting the
pre-known fact that splitting `var1` at `m2` would result in `high1`
being zero. This allows the provably correct substitutions to be made,
which avoids the otherwise meaningless computations.
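
Spelled out, the substitution looks like this (a sketch using generic
z0/z1/z2 names, which may differ from the identifiers in the patch):

var1 = high1*B + low1, with high1 = 0, so var1 = low1
var2 = high2*B + low2

z2 = high1*high2 = 0
z0 = low1*low2
z1 = (high1 + low1)*(high2 + low2) - z2 - z0
   = low1*high2 + low1*low2 - low1*low2
   = low1*high2

var1*var2 = z2*B^2 + z1*B + z0 = low1*high2*B + low1*low2

i.e. the two remaining sub-products are exactly what the schoolbook
method would compute for the two halves.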

> But why just split it into two pieces? That will just lead
> to a lot of unnecessary recursion for very unbalanced inputs. Instead,
> why not split the longer input into N roughly equal sized pieces, each
> around the same length as the shorter input, multiplying and adding
> them at the appropriate offsets?

The approach you're describing is implemented by e.g. CPython
and is called "lopsided" in their code base. It has some different
performance characteristics, compared to the recursive Half-Karatsuba
approach.

What I didn't like about lopsided is the degenerate case where the
last chunk is much shorter than var1, for example, if we pretend
we would be doing Karatsuba all the way down to ndigits 2,
and think about the example var1ndigits = 3 and var2ndigits = 10,
then lopsided would do
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=3
var1ndigits=3 var2ndigits=1

whereas Half-Karatsuba would do
var1ndigits=3 var2ndigits=5
var1ndigits=3 var2ndigits=5

You can of course find contrary examples where lopsided is better
than Half-Karatsuba; neither approach seems substantially better
than the other.
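
For reference, here is a tiny standalone sketch (illustration only, not
patch code) that prints both split sequences for the var1ndigits=3,
var2ndigits=10 example above:

#include <stdio.h>

int
main(void)
{
    int     var1ndigits = 3;
    int     var2ndigits = 10;

    /* Lopsided: walk var2 in chunks of var1's length; last chunk may be short */
    printf("lopsided:\n");
    for (int start = 0; start < var2ndigits; start += var1ndigits)
    {
        int     len = var2ndigits - start;

        if (len > var1ndigits)
            len = var1ndigits;
        printf("  var1ndigits=%d var2ndigits=%d\n", var1ndigits, len);
    }

    /* Half-Karatsuba: split var2 once into two halves (one recursion level) */
    printf("half-karatsuba:\n");
    printf("  var1ndigits=%d var2ndigits=%d\n", var1ndigits, var2ndigits / 2);
    printf("  var1ndigits=%d var2ndigits=%d\n",
           var1ndigits, var2ndigits - var2ndigits / 2);
    return 0;
}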

My measurements indicated that Half-Karatsuba was the marginal winner
overall on the architectures I tested, but the two approaches were very
similar, i.e. almost the same number of "wins" and "losses" for
different (var1ndigits, var2ndigits) combinations.

Note that even with lopsided, there will still be recursion, since Karatsuba
is a recursive algorithm, so to satisfy Tom Lane's request about
proving the recursion is limited, we will still need to prove the same
thing for lopsided+Karatsuba.

Here is some old code from my experiments, if we want to evaluate lopsided:

```
static void slice_var(const NumericVar *var, int start, int length,
                      NumericVar *slice);

static void mul_var_lopsided(const NumericVar *var1, const NumericVar *var2,
                             NumericVar *result);

/*
 * slice_var() -
 *
 * Extract a slice of a NumericVar starting at a specified position
 * and with a specified length.
 */
static void
slice_var(const NumericVar *var, int start, int length,
          NumericVar *slice)
{
    Assert(start >= 0);
    Assert(start + length <= var->ndigits);

    init_var(slice);

    slice->ndigits = length;
    slice->digits = var->digits + start;
    slice->buf = NULL;
    slice->weight = var->weight - var->ndigits + length;
    slice->sign = var->sign;
    slice->dscale = (var->ndigits - var->weight - 1) * DEC_DIGITS;
}

/*
 * mul_var_lopsided() -
 *
 * Lopsided Multiplication for unequal-length factors.
 *
 * This function handles the case where var1 has significantly fewer digits
 * than var2. In such a scenario, splitting var1 for a balanced multiplication
 * algorithm would be inefficient, as the high part would be zero.
 *
 * To overcome this inefficiency, the function divides factor2 into a series

Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-30 Thread Dean Rasheed
On Sat, 29 Jun 2024 at 16:25, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > There's another complication though (if the threshold is made
> > configurable): the various numeric functions that use mul_var() are
> > immutable, which means that the results from the Karatsuba algorithm
> > must match those from the schoolbook algorithm exactly, for all
> > inputs.
>
> That seems like an impossible standard to meet.  What we'd probably
> have to do is enable Karatsuba only when mul_var is being asked
> for an exact (full-precision) result.

Yeah, using Karatsuba for approximate products is probably a bit too
ambitious. I think it'd be reasonably straightforward to have it
produce the same results for all rscale values. You'd just have to
decide on the required rscale for each sub-product, based on where
it's being added, truncating at various points, and being sure to only
round once at the very end. The problem is that it'd end up computing
a larger fraction of the full product than the schoolbook algorithm
would have done, so the threshold for using Karatsuba would have to be
higher (probably quite a lot higher) and figuring out how that varied
with the requested rscale would be hard.

So using Karatsuba only in the full-precision case seems like a
reasonable restriction. That'd still benefit some other functions like
sqrt() and therefore ln() and pow() to some extent. However,...

> There is definitely an argument to be made that this proposal is
> not worth the development effort and ongoing maintenance effort
> we'd have to sink into it.

I'm leaning more towards this opinion, especially since I think the
patch needs a lot more work to be committable.

Regards,
Dean




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-29 Thread Tom Lane
Dean Rasheed  writes:
> There's another complication though (if the threshold is made
> configurable): the various numeric functions that use mul_var() are
> immutable, which means that the results from the Karatsuba algorithm
> must match those from the schoolbook algorithm exactly, for all
> inputs.

That seems like an impossible standard to meet.  What we'd probably
have to do is enable Karatsuba only when mul_var is being asked
for an exact (full-precision) result.  This'd complicate the check
condition further, possibly reaching the point where there is a
visible drag on performance in the non-Karatsuba case.

Another possible source of drag: if mul_var is now recursive,
does it need a stack depth check?  If we can prove that the
number of recursion levels is no more than O(log(N)) in the
number of input digits, it's probably safe to skip that ...
but I see no such proof here.  (In general I find this patch
seriously undercommented.)

> There's a wider question as to how many people use such big numeric
> values -- i.e., how many people are actually going to benefit from
> this? I don't have a good feel for that.

I have heard of people doing calculations on bignum integers in
Postgres, but they are very few and far between --- usually that
sort of task is better done in another language.  (No matter how
fast we make mul_var, the general overhead of SQL expressions in
general and type numeric in particular means it's probably not
the right tool for heavy-duty bignum arithmetic.)

There is definitely an argument to be made that this proposal is
not worth the development effort and ongoing maintenance effort
we'd have to sink into it.

regards, tom lane




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-29 Thread Dean Rasheed
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba 
> algorithm in numeric.c's mul_var().
>

Something to watch out for is that not all callers of mul_var() want
an exact result. Several internal callers request an approximate
result by passing it an rscale value less than the sum of the input
dscales. The schoolbook algorithm handles that by computing up to
rscale plus MUL_GUARD_DIGITS extra digits and then rounding, whereas
the new Karatsuba code always computes the full result and then
rounds. That impacts the performance of various functions, for
example:

select sum(exp(x)) from generate_series(5999.9, 5950.0, -0.1) x;

Time: 1790.825 ms (00:01.791)  [HEAD]
Time: 2161.244 ms (00:02.161)  [with patch]

Looking at mul_var_karatsuba_half(), I don't really like the approach
it takes. The whole correctness proof using the Karatsuba formula
seems to somewhat miss the point that this function isn't actually
implementing the Karatsuba algorithm, it is implementing the
schoolbook algorithm in two steps, by splitting the longer input into
two pieces. But why just split it into two pieces? That will just lead
to a lot of unnecessary recursion for very unbalanced inputs. Instead,
why not split the longer input into N roughly equal sized pieces, each
around the same length as the shorter input, multiplying and adding
them at the appropriate offsets? As an example, given inputs with
var1ndigits = 1000 and var2ndigits = 10000, mul_var() will invoke
mul_var_karatsuba_half(), which will then recursively invoke mul_var()
twice with var1ndigits = 1000 and var2ndigits = 5000, which no longer
satisfies KARATSUBA_CONDITION(), so it will just invoke the schoolbook
algorithm on each half, which stands no chance of being any faster. On
the other hand, if it divided var2 into 10 chunks of length 1000, it
would invoke the Karatsuba algorithm on each chunk, which would at
least stand a chance of being faster.

Related to that, KARATSUBA_HIGH_RANGE_CONDITION() doesn't appear to
make a lot of sense. For inputs with var1ndigits between 128 and 2000,
and var2ndigits > 9000, this condition will pass and it will
recursively break up the longer input into smaller and smaller pieces
until eventually that condition no longer passes, but none of the
other conditions in KARATSUBA_CONDITION() will pass either, so it'll
just invoke the schoolbook algorithm on each piece, which is bound to
be slower once all the overheads are taken into account. For example,
given var1ndigits = 200 and var2ndigits = 30000, KARATSUBA_CONDITION()
will pass due to KARATSUBA_HIGH_RANGE_CONDITION(), and it will recurse
with var1ndigits = 200 and var2ndigits = 15000, and then again with
var1ndigits = 200 and var2ndigits = 7500, at which point
KARATSUBA_CONDITION() no longer passes. With mul_var_karatsuba_half()
implemented as it is, that is bound to happen, because each half will
end up having var2ndigits between 4500 and 9000, which fails
KARATSUBA_CONDITION() if var1ndigits < 2000. If
mul_var_karatsuba_half() was replaced by something that recursed with
more balanced chunks, then it might make more sense, though allowing
values of var1ndigits down to 128 doesn't make sense, since the
Karatsuba algorithm will never be invoked for inputs shorter than 384.

Looking at KARATSUBA_MIDDLE_RANGE_CONDITION(), the test that
var2ndigits > 2500 seems to be redundant. If var1ndigits > 2000 and
var2ndigits < 2500, then KARATSUBA_LOW_RANGE_CONDITION() is satisfied,
so these tests could be simplified, eliminating some of those magic
constants.

However, I really don't like having these magic constants at all,
because in practice the threshold above which the Karatsuba algorithm
is a win can vary depending on a number of factors, such as whether
it's running on 32-bit or 64-bit, whether or not SIMD instructions are
available, the relative timings of CPU instructions, the compiler
options used, and probably a bunch of other things. The last time I
looked at the Java source code, for example, they had separate
thresholds for 32-bit and 64-bit platforms, and even that's probably
too crude. Some numeric libraries tune the thresholds for a large
number of different platforms, but that takes a lot of effort. I think
a better approach would be to have a configurable threshold. Ideally,
this would be just one number, with all other numbers being derived
from it, possibly using some simple heuristic to reduce the effective
threshold for more balanced inputs, for which the Karatsuba algorithm
is more efficient.

Having a configurable threshold would allow people to tune for best
performance on their own platforms, and also it would make it easier
to write tests that hit the new code. As it stands, it's not obvious
how much of the new code is being hit by the existing tests.

Doing a quick test on my machine, using random equal-length inputs of
various sizes, I got the following performance results:

 digits 

Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-23 Thread Michael Paquier
On Sun, Jun 23, 2024 at 09:00:29AM +0200, Joel Jacobson wrote:
> Attached, rebased version of the patch that implements the Karatsuba 
> algorithm in numeric.c's mul_var().

It's one of these areas where Dean Rasheed would be a good match for a
review, so adding him in CC.  He has been doing a lot of stuff in this
area over the years.

+#define KARATSUBA_BASE_LIMIT 384
+#define KARATSUBA_VAR1_MIN1 128
+#define KARATSUBA_VAR1_MIN2 2000
+#define KARATSUBA_VAR2_MIN1 2500
+#define KARATSUBA_VAR2_MIN2 9000
+#define KARATSUBA_SLOPE 0.764
+#define KARATSUBA_INTERCEPT 90.737

These numbers feel magic, and there are no explanations behind these
choices so it is hard to know whether these are good or not, or if
there are potentially "better" choices.  I'd suggest to explain why
these variables are here as well as the basics of the method in this
area of the code, with the function doing the operation pointing at
that so as all the explanations are in a single place.  Okay, these
are thresholds based on the number of digits to decide if the normal
or Karatsuba's method should be used, but grouping all the
explanations in a single place is simpler.

I may have missed something, but did you run some benchmarks when the
thresholds are at their limit, where we would fall back to the
calculation method of HEAD?  I guess that the answer to my question of
"Is HEAD performing better across these thresholds?" is clearly "no"
based on what I read at [1] and the threshold numbers chosen, but I'm
still asking.

[1]: https://en.wikipedia.org/wiki/Karatsuba_algorithm
--
Michael




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-23 Thread Joel Jacobson
On Fri, Jun 14, 2024, at 03:07, Aaron Altman wrote:
> Thanks for the detailed background and comments, Joel!
>
> The new status of this patch is: Ready for Committer

Thanks for reviewing.

Attached, rebased version of the patch that implements the Karatsuba algorithm 
in numeric.c's mul_var().

/Joel


0001-numeric-mul_var-karatsuba.patch
Description: Binary data


Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-14 Thread Aaron Altman
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

This applies cleanly to master, builds and passes regression tests in my 
Windows/Cygwin environment.  

I also read through the comments and confirmed for myself that the assumption 
that the caller ensures var1 is shorter is already handled in unchanged code from 
mul_var.  Frees correspond to inits.  The "Karatsuba condition" reasoning for 
deciding whether a number is big enough to use this algorithm appears to match 
what Joel has stated in this thread.

The arithmetic appears to match what's been described in the comments.  I have 
*not* confirmed that with any detailed review of the Karatsuba algorithm from 
outside sources, other implementations like the Rust one referenced here, or 
anything similar.  I'm hoping that the regression tests give sufficient 
coverage that if the arithmetic was incorrect there would be obvious failures.  
If additional coverage was needed, cases falling immediately on either side of 
the limiting conditions used in the patch would probably be useful.  From the 
limited precedent I've exposed myself to, that doesn't seem to be required 
here, but I'm open to contrary input from other reviewers.  In the meantime, 
I'm marking this approved.  

Thanks for the detailed background and comments, Joel!

The new status of this patch is: Ready for Committer


Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-13 Thread Joel Jacobson
On Tue, Jun 11, 2024, at 19:16, Aaron Altman wrote:
> Hi Joel, thanks for posting this.  Although I have only a cursory 
> familiarity with fast multiplication algorithms, I'd like to try and 
> give it a review.  To start with, can you help me understand the choice 
> of this algorithm versus others like Toom?  If this looks correct on a 
> closer view I'll propose it for inclusion. Along the way though I'd like 
> to have it explicitly called out whether this is superior in general to 
> other choices, better for more realistic use cases, simpler, clearer to 
> license or something similar.  It would be nice for future discussions to 
> have some context around whether it would make sense to have conditions 
> to choose other algorithms as well, or if this one is generally the best 
> for what Postgres users are usually doing.
>
> Continuing with code review in any case.  Interested to hear more.

Hi Aaron, thanks for looking at this!

The choice of the best algorithm depends on the factor sizes.

The larger the factor sizes, the more complicated an algorithm can be "afforded".

List of fast multiplication algorithms,
ordered by factor sizes they are suitable for:

- Long multiplication, aka "schoolbook" multiplication.
- Karatsuba
- Toom-3
- Schönhage–Strassen algorithm (Fast Fourier Transform)

The Toom-3 algorithm can be modified to split the smaller and larger factors
into different numbers of parts. The notation used at Wikipedia is e.g. Toom-2.5,
which I think means splitting the larger into three parts and the smaller into
two parts, while GMP uses Toom32 to mean the same thing.
Personally, I think GMP's notation is easier to understand, as the number of parts
can be directly derived from the name.

I experimented with implementing Toom-3 as well, but there was only a marginal
win, at very large factor sizes, and since numeric's max ndigits
(number of base-digits) is capped at 32768, I didn't think it was worth it,
since it adds quite a lot of complexity.

The Karatsuba algorithm is the next step in the hierarchy of fast multiplication
algorithms, and all other bigint libs I've looked at implement Karatsuba,
even if they also implement Toom-3, since Karatsuba is faster than Toom-3 for
sufficiently small factors, but that are at the same time sufficiently large for
Karatsuba to be faster than schoolbook.

I was initially surprised by the quite large threshold, where Karatsuba started
to be a win over schoolbook.

I think the explanation for why mul_var() stays fast up to remarkably high
factor sizes could be a combination of several things, such as:

- mul_var() is already heavily optimized, with clever tricks,
  such as deferred carry propagation.

- numeric uses NBASE=10000, while other bigint libs usually use a power of two
  (see the small example below).
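
A minimal sketch of what an NBASE=10000 representation looks like
(illustration only, ignoring NumericVar's weight/sign/dscale bookkeeping):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* Decompose a value into base-10000 digits, least significant first */
    uint64_t    value = 1234567890123;
    int16_t     digits[8];
    int         ndigits = 0;

    while (value > 0)
    {
        digits[ndigits++] = (int16_t) (value % 10000);
        value /= 10000;
    }

    /* Prints "0001 2345 6789 0123": four NBASE digits for a 13-digit number */
    for (int i = ndigits - 1; i >= 0; i--)
        printf("%04d ", digits[i]);
    printf("\n");
    return 0;
}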

In the Karatsuba implementation, I tried to keep the KARATSUBA_CONDITION()
quite simple, but it's still way more complex than what most bigint libs use,
which usually just check whether the smaller factor is below some threshold
and, if so, use schoolbook. For instance, this is what Rust's num-bigint does:

if x.len() <= 32 {
    // Long multiplication
} else if x.len() * 2 <= y.len() {
    // Half-Karatsuba, for factors with significant length disparity
} else if x.len() <= 256 {
    // Karatsuba multiplication
} else {
    // Toom-3 multiplication
}

Source: 
https://github.com/rust-num/num-bigint/blob/master/src/biguint/multiplication.rs#L101

Side note: When working on Karatsuba in mul_var(), I looked at some other bigint
implementations to try to understand their threshold functions.
I noticed that Rust's num-bigint didn't optimise for factors with significant
length disparity, so I contributed a patch based on my "Half-Karatsuba" idea,
which I got when working on mul_var(); it has now been merged:
https://github.com/rust-num/num-bigint/commit/06b61c8138ad8a9959ac54d9773d0a9ebe25b346

In mul_var(), if we don't like the complexity of KARATSUBA_CONDITION(),
we could go for a more traditional threshold approach, i.e. just checking
the smaller factor size. However, I believe that would come at the expense
of missing out on some performance gains.

I've tried quite hard to find the best KARATSUBA_CONDITION(), but I found it to
be a really hard problem: the differences between CPU architectures,
in combination with wanting a simple expression, mean there is no obvious
perfect threshold function; there will always be a trade-off.

I eventually stopped trying to improve it and just settled on the version in
the patch, thinking I'd leave it up to the community to give feedback
on what level of complexity for the threshold function is warranted. If we
just want to check the smaller factor size, like Rust, then it's super simple,
and the threshold can easily be found just by testing different values.
It's when both factor sizes are inputs to the threshold function that it
gets complicated.

/Joel




Re: Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-06-11 Thread Aaron Altman
Hi Joel, thanks for posting this.  Although I have only a cursory 
familiarity with fast multiplication algorithms, I'd like to try and 
give it a review.  To start with, can you help me understand the choice 
of this algorithm versus others like Toom?  If this looks correct on a 
closer view I'll propose it for inclusion. Along the way though I'd like 
to have it explicitly called out whether this is superior in general to 
other choices, better for more realistic use cases, simpler, clearer to 
license or something similar.  It would be nice for future discussions to 
have some context around whether it would make sense to have conditions 
to choose other algorithms as well, or if this one is generally the best 
for what Postgres users are usually doing.


Continuing with code review in any case.  Interested to hear more.

Regards,

Aaron Altman





Optimize numeric.c mul_var() using the Karatsuba algorithm

2024-04-15 Thread Joel Jacobson
Hi,

This patch introduces the Karatsuba algorithm to speed up multiplication
operations in numeric.c, where the operands have many digits.

It is implemented via a new conditional in mul_var() that determines whether
the sizes of the factors are sufficiently large to justify its use. This 
decision
is non-trivial due to its recursive nature, depending on the size and ratio of
the factors. Moreover, the optimal threshold varies across different
architectures.

This benefits all users of mul_var() in numeric.c, such as:
numeric_mul()
numeric_lcm()
numeric_fac()
int64_div_fast_to_numeric()
mod_var()
div_mod_var()
sqrt_var()
exp_var()
ln_var()
power_var()

The macro KARATSUBA_CONDITION(var1ndigits, var2ndigits) is responsible for this
decision. It is deliberately conservative to prevent performance regressions on
the tested architectures while maximizing potential gains and maintaining
simplicity.

Patches:

1. mul_var-karatsuba.patch
Modifies mul_var() to use the Karatsuba-functions
for multiplying larger numerical factors.

2. mul_var-karatsuba-benchmark.patch
Introduces numeric_mul_karatsuba() and mul_var_karatsuba()
alongside the existing numeric_mul() and mul_var() functions.
This enables benchmark comparisons between the original multiplication method
and the Karatsuba-optimized version.

Some benchmark numbers, tested on an Intel Core i9-14900K:

Helper function to generate a numeric with a given ndigits,
using the new random(min numeric, max numeric):

CREATE OR REPLACE FUNCTION random_ndigits(ndigits INT) RETURNS NUMERIC AS $$
SELECT random(
    ('1000'||repeat('0000',ndigits-1))::numeric,
    (repeat('9999',ndigits))::numeric
)
$$ LANGUAGE sql;

Benchmark equal factor sizes, 16384 x 16384 ndigits:

SELECT random_ndigits(16384) * random_ndigits(16384) > 0;
Time: 33.990 ms
Time: 33.961 ms
Time: 34.183 ms

SELECT numeric_mul_karatsuba(random_ndigits(16384), random_ndigits(16384)) > 0;
Time: 17.621 ms
Time: 17.209 ms
Time: 16.444 ms

Benchmark equal factor sizes, 8192 x 8192 ndigits:

SELECT random_ndigits(8192) * random_ndigits(8192) > 0;
Time: 12.568 ms
Time: 12.563 ms
Time: 12.701 ms

SELECT numeric_mul_karatsuba(random_ndigits(8192), random_ndigits(8192)) > 0;
Time: 9.919 ms
Time: 9.929 ms
Time: 9.659 ms

To measure smaller factor sizes, \timing doesn't provide enough precision.
Below measurements are made using my pg-timeit extension:

Benchmark equal factor sizes, 1024 x 1024 ndigits:

SELECT 
timeit.h('numeric_mul',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 100 µs
SELECT 
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(1024)::TEXT,random_ndigits(1024)::TEXT],significant_figures:=2);
 73 µs

Benchmark equal factor sizes, 512 x 512 ndigits:

SELECT 
timeit.h('numeric_mul',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 27 µs
SELECT 
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(512)::TEXT,random_ndigits(512)::TEXT],significant_figures:=2);
 23 µs

Benchmark unequal factor sizes, 2048 x 16384 ndigits:

SELECT 
timeit.h('numeric_mul',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
3.6 ms
SELECT 
timeit.h('numeric_mul_karatsuba',ARRAY[random_ndigits(2048)::TEXT,random_ndigits(16384)::TEXT],significant_figures:=2);
2.7 ms

The KARATSUBA_CONDITION was determined through benchmarking on the following 
architectures:

- Intel Core i9-14900K (desktop)
- AMD Ryzen 9 7950X3D (desktop)
- Apple M1 Max (laptop)
- AWS EC2 m7g.4xlarge (cloud server, AWS Graviton3 CPU)
- AWS EC2 m7i.4xlarge (cloud server, Intel Xeon 4th Gen, Sapphire Rapids)

The images depicting the benchmark plots are rather large, so I've refrained
from including them as attachments. Instead, I've provided URLs to
the benchmarks for direct access:

https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/derive_threshold_function_using_dynamic_programming.png
This image shows the best possible performance ratio per architecture,
derived using Dynamic Programming. The black line segment shows the manually
crafted threshold function, which aims to avoid performance regressions while
capturing the beneficial regions with a relatively simple expression; it has
been implemented in both patches as the KARATSUBA_CONDITION macro.

https://gist.githubusercontent.com/joelonsql/e9d06cdbcdf56cd8ffa673f499880b0d/raw/69df06e95bc254090f8397765079e1a8145eb5ac/benchmark.png
This plot displays the actual performance ratio per architecture,
measured after applying the mul_var-karatsuba-benchmark.patch.

The performance_ratio scale in both plots uses a rainbow scale,
where blue is at 1.0 and means no change. The maximum at 4.0
means that the Karatsuba version was four times faster
than the traditional mul_var() on that architecture.

To make it easier to distinguish performance regressions from gains,
a magenta color scale is used that goes from pure magenta just below 1.0,
to dar