Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-23 Thread Dean Rasheed
On Sun, 22 Jan 2023 at 22:49, Joel Jacobson  wrote:
>
> Many thanks for feedback. Nice catch! New patch attached.
>

Cool, that resolves the performance issues I was seeing for smaller
divisors (which also had a noticeable impact on the numeric_big
regression test).

After some more testing, the gains look good to me, and I wasn't able
to find any cases where it made things slower, so I've gone ahead and
pushed it.

Regards,
Dean




Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-23 Thread Dean Rasheed
On Mon, 23 Jan 2023 at 05:06, John Naylor  wrote:
>
> According to Agner's instruction tables [1], integer division on Skylake (for 
> example) has a latency of 26 cycles for 32-bit operands, and 42-95 cycles for 
> 64-bit.
>
> [1] https://www.agner.org/optimize/instruction_tables.pdf
>

Thanks, that's a very useful reference.

(And I do indeed have one of those CPUs, which explains what I was seeing.)

Regards,
Dean




Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread John Naylor
On Sun, Jan 22, 2023 at 10:42 PM Joel Jacobson  wrote:

> I did write the code like you suggest first, but changed it,
> since I realised the extra "else if" needed could be eliminated,
> and thought div_var_int64() wouldn't be slower than div_var_int() since
> I thought 64-bit instructions in general are as fast as 32-bit
instructions,
> on 64-bit platforms.

According to Agner's instruction tables [1], integer division on Skylake
(for example) has a latency of 26 cycles for 32-bit operands, and 42-95
cycles for 64-bit.

[1] https://www.agner.org/optimize/instruction_tables.pdf

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread Joel Jacobson
On Sun, Jan 22, 2023, at 14:25, Dean Rasheed wrote:
> I just modified the previous test you posted:
>
> \timing on
> SELECT count(numeric_div_volatile(1e131071,123456)) FROM 
> generate_series(1,1e4);
>
> Time: 2048.060 ms (00:02.048)-- HEAD
> Time: 2422.720 ms (00:02.423)-- With patch
>
...
>
> Apparently it can make a difference. Probably something to do with
> having less data to move around. I remember noticing that when I wrote
> div_var_int(), which is why I split it into 2 branches in that way.

Many thanks for feedback. Nice catch! New patch attached.

Interesting, I'm not able to reproduce this on my MacBook Pro M1 Max:

SELECT version;
PostgreSQL 16devel on aarch64-apple-darwin22.2.0, compiled by Apple clang 
version 14.0.0 (clang-1400.0.29.202), 64-bit

SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e 
4);
Time: 1569.730 ms (00:01.570) - HEAD
Time: 1569.918 ms (00:01.570) -- div_var_int64.patch
Time: 1569.038 ms (00:01.569) -- div_var_int64-2.patch

Just curious, what platform are you on?

/Joel

div_var_int64-2.patch
Description: Binary data


Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread Dean Rasheed
On Sun, 22 Jan 2023 at 15:41, Joel Jacobson  wrote:
>
> On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote:
> > Seems like a reasonable idea, with some pretty decent gains.
> >
> > Note, however, that for a divisor having fewer than 5 or 6 digits,
> > it's now significantly slower because it's forced to go through
> > div_var_int64() instead of div_var_int() for all small divisors. So
> > the var2ndigits <= 2 case needs to come first.
>
> Can you give a measurable example of when the patch
> the way it's written is significantly slower for a divisor having
> fewer than 5 or 6 digits, on some platform?
>

I just modified the previous test you posted:

\timing on
SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e4);

Time: 2048.060 ms (00:02.048)-- HEAD
Time: 2422.720 ms (00:02.423)-- With patch

> I did write the code like you suggest first, but changed it,
> since I realised the extra "else if" needed could be eliminated,
> and thought div_var_int64() wouldn't be slower than div_var_int() since
> I thought 64-bit instructions in general are as fast as 32-bit instructions,
> on 64-bit platforms.
>

Apparently it can make a difference. Probably something to do with
having less data to move around. I remember noticing that when I wrote
div_var_int(), which is why I split it into 2 branches in that way.

Regards,
Dean




Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread Joel Jacobson
On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote:
> Seems like a reasonable idea, with some pretty decent gains.
>
> Note, however, that for a divisor having fewer than 5 or 6 digits,
> it's now significantly slower because it's forced to go through
> div_var_int64() instead of div_var_int() for all small divisors. So
> the var2ndigits <= 2 case needs to come first.

Can you give a measurable example of when the patch
the way it's written is significantly slower for a divisor having
fewer than 5 or 6 digits, on some platform?

I can't detect any difference at all at my MacBook Pro M1 Max for this example:
EXPLAIN ANALYZE SELECT count(numeric_div_volatile(1,)) FROM 
generate_series(1,1e8);

I did write the code like you suggest first, but changed it,
since I realised the extra "else if" needed could be eliminated,
and thought div_var_int64() wouldn't be slower than div_var_int() since
I thought 64-bit instructions in general are as fast as 32-bit instructions,
on 64-bit platforms.

I'm not suggesting your claim is incorrect, I'm just trying to understand
and verify it experimentally.

> The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 
> block.
>
> In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/

OK, thanks, I'll fix, but I'll await your feedback first on the above.

/Joel




Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

2023-01-22 Thread Dean Rasheed
On Sun, 22 Jan 2023 at 13:42, Joel Jacobson  wrote:
>
> Hi,
>
> On platforms where we support 128bit integers, we could accelerate division
> when the number of digits in the divisor is larger than 8 and less than or
> equal to 16 digits, i.e. when the divisor that fits in a 64-bit integer but 
> would
> not fit in a 32-bit integer.
>

Seems like a reasonable idea, with some pretty decent gains.

Note, however, that for a divisor having fewer than 5 or 6 digits,
it's now significantly slower because it's forced to go through
div_var_int64() instead of div_var_int() for all small divisors. So
the var2ndigits <= 2 case needs to come first.

The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 block.

In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/

Regards,
Dean