Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, 22 Jan 2023 at 22:49, Joel Jacobson wrote: > > Many thanks for feedback. Nice catch! New patch attached. > Cool, that resolves the performance issues I was seeing for smaller divisors (which also had a noticeable impact on the numeric_big regression test). After some more testing, the gains look good to me, and I wasn't able to find any cases where it made things slower, so I've gone ahead and pushed it. Regards, Dean
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Mon, 23 Jan 2023 at 05:06, John Naylor wrote: > > According to Agner's instruction tables [1], integer division on Skylake (for > example) has a latency of 26 cycles for 32-bit operands, and 42-95 cycles for > 64-bit. > > [1] https://www.agner.org/optimize/instruction_tables.pdf > Thanks, that's a very useful reference. (And I do indeed have one of those CPUs, which explains what I was seeing.) Regards, Dean
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, Jan 22, 2023 at 10:42 PM Joel Jacobson wrote: > I did write the code like you suggest first, but changed it, > since I realised the extra "else if" needed could be eliminated, > and thought div_var_int64() wouldn't be slower than div_var_int() since > I thought 64-bit instructions in general are as fast as 32-bit instructions, > on 64-bit platforms. According to Agner's instruction tables [1], integer division on Skylake (for example) has a latency of 26 cycles for 32-bit operands, and 42-95 cycles for 64-bit. [1] https://www.agner.org/optimize/instruction_tables.pdf -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, Jan 22, 2023, at 14:25, Dean Rasheed wrote: > I just modified the previous test you posted: > > \timing on > SELECT count(numeric_div_volatile(1e131071,123456)) FROM > generate_series(1,1e4); > > Time: 2048.060 ms (00:02.048)-- HEAD > Time: 2422.720 ms (00:02.423)-- With patch > ... > > Apparently it can make a difference. Probably something to do with > having less data to move around. I remember noticing that when I wrote > div_var_int(), which is why I split it into 2 branches in that way. Many thanks for feedback. Nice catch! New patch attached. Interesting, I'm not able to reproduce this on my MacBook Pro M1 Max: SELECT version; PostgreSQL 16devel on aarch64-apple-darwin22.2.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e 4); Time: 1569.730 ms (00:01.570) - HEAD Time: 1569.918 ms (00:01.570) -- div_var_int64.patch Time: 1569.038 ms (00:01.569) -- div_var_int64-2.patch Just curious, what platform are you on? /Joel div_var_int64-2.patch Description: Binary data
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, 22 Jan 2023 at 15:41, Joel Jacobson wrote: > > On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote: > > Seems like a reasonable idea, with some pretty decent gains. > > > > Note, however, that for a divisor having fewer than 5 or 6 digits, > > it's now significantly slower because it's forced to go through > > div_var_int64() instead of div_var_int() for all small divisors. So > > the var2ndigits <= 2 case needs to come first. > > Can you give a measurable example of when the patch > the way it's written is significantly slower for a divisor having > fewer than 5 or 6 digits, on some platform? > I just modified the previous test you posted: \timing on SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e4); Time: 2048.060 ms (00:02.048)-- HEAD Time: 2422.720 ms (00:02.423)-- With patch > I did write the code like you suggest first, but changed it, > since I realised the extra "else if" needed could be eliminated, > and thought div_var_int64() wouldn't be slower than div_var_int() since > I thought 64-bit instructions in general are as fast as 32-bit instructions, > on 64-bit platforms. > Apparently it can make a difference. Probably something to do with having less data to move around. I remember noticing that when I wrote div_var_int(), which is why I split it into 2 branches in that way. Regards, Dean
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote: > Seems like a reasonable idea, with some pretty decent gains. > > Note, however, that for a divisor having fewer than 5 or 6 digits, > it's now significantly slower because it's forced to go through > div_var_int64() instead of div_var_int() for all small divisors. So > the var2ndigits <= 2 case needs to come first. Can you give a measurable example of when the patch the way it's written is significantly slower for a divisor having fewer than 5 or 6 digits, on some platform? I can't detect any difference at all at my MacBook Pro M1 Max for this example: EXPLAIN ANALYZE SELECT count(numeric_div_volatile(1,)) FROM generate_series(1,1e8); I did write the code like you suggest first, but changed it, since I realised the extra "else if" needed could be eliminated, and thought div_var_int64() wouldn't be slower than div_var_int() since I thought 64-bit instructions in general are as fast as 32-bit instructions, on 64-bit platforms. I'm not suggesting your claim is incorrect, I'm just trying to understand and verify it experimentally. > The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 > block. > > In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/ OK, thanks, I'll fix, but I'll await your feedback first on the above. /Joel
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16
On Sun, 22 Jan 2023 at 13:42, Joel Jacobson wrote: > > Hi, > > On platforms where we support 128bit integers, we could accelerate division > when the number of digits in the divisor is larger than 8 and less than or > equal to 16 digits, i.e. when the divisor that fits in a 64-bit integer but > would > not fit in a 32-bit integer. > Seems like a reasonable idea, with some pretty decent gains. Note, however, that for a divisor having fewer than 5 or 6 digits, it's now significantly slower because it's forced to go through div_var_int64() instead of div_var_int() for all small divisors. So the var2ndigits <= 2 case needs to come first. The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 block. In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/ Regards, Dean