Inlined the hex encode/decode functions in "src/include/utils/builtins.h",
similar to pg_popcount() in pg_bitutils.h.
---
Chiranmoy
v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
> The meson configure check seems to fail on my machine
> This test looks quite different than the autoconf one. Why is that? I
> would expect them to be the same. And I think ideally the test would check
> that all the intrinsics functions we need are available.
Fixed, both meson and autoconf have
> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?
It seems that the double load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove it to simpl
Thank you for the suggestion; we have removed the `xsave` flag.
We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(1000, 16);"
We ran it 20 times and took the average to smooth out any CPU fluctuations. The
results observed on `m7g.4xl
Hello Nathan,
We tried auto-vectorization and observed no performance improvement.
The instructions in src/include/port/simd.h are based on older SIMD
architectures like NEON, whereas the patch uses the newer SVE, so some of the
instructions used in the patch may not have direct equivalents in N
Hi all,
Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to
compile the arch-specific function instead of using compiler flags.
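For readers unfamiliar with the approach, here is a minimal sketch of a per-function target attribute, written with the plain GCC/Clang spelling rather than the pg_attribute_target macro. It is not the code from the patch: the names demo_popcount, demo_popcount_scalar, demo_popcount_sve and the DEMO_USE_SVE guard are hypothetical, and the runtime check through getauxval() is an assumption about how the SVE path would be selected on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * DEMO_USE_SVE is a hypothetical guard for this sketch only: AArch64 Linux
 * with a compiler that ships <arm_sve.h>.  Elsewhere only the scalar path
 * is compiled.
 */
#if defined(__aarch64__) && defined(__linux__) && defined(__has_include)
#if __has_include(<arm_sve.h>) && __has_include(<sys/auxv.h>)
#include <arm_sve.h>
#include <sys/auxv.h>
#ifdef HWCAP_SVE
#define DEMO_USE_SVE 1
#endif
#endif
#endif

/* Portable fallback: count set bits one byte at a time. */
static uint64_t
demo_popcount_scalar(const uint8_t *buf, size_t len)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < len; i++)
		for (uint8_t b = buf[i]; b != 0; b &= (uint8_t) (b - 1))
			count++;			/* each pass clears the lowest set bit */
	return count;
}

#ifdef DEMO_USE_SVE
/*
 * Only this function is compiled with SVE enabled; the rest of the file
 * keeps the default target, so no extra compiler flags are needed.
 */
__attribute__((target("arch=armv8-a+sve")))
static uint64_t
demo_popcount_sve(const uint8_t *buf, size_t len)
{
	uint64_t	count = 0;
	uint64_t	n = (uint64_t) len;

	for (uint64_t i = 0; i < n; i += svcntb())
	{
		/* The predicate masks off lanes past the end of the buffer. */
		svbool_t	pred = svwhilelt_b8(i, n);
		svuint8_t	vec = svld1_u8(pred, buf + i);

		count += svaddv_u8(pred, svcnt_u8_x(pred, vec));
	}
	return count;
}
#endif

/* Dispatcher: use the SVE version only when the CPU reports support. */
uint64_t
demo_popcount(const uint8_t *buf, size_t len)
{
	assert(buf != NULL || len == 0);
#ifdef DEMO_USE_SVE
	if (getauxval(AT_HWCAP) & HWCAP_SVE)
		return demo_popcount_sve(buf, len);
#endif
	return demo_popcount_scalar(buf, len);
}
```

Calling demo_popcount() on the four bytes 0xFF, 0x0F, 0x01, 0x00 should give 13 on either path. The point of the attribute is that the build system no longer has to pass -march flags for one object file.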
---
Chiranmoy
v3-0001-SVE-support-for-popcount-and-popcount-masked.patch
> The approach looks generally reasonable to me, but IMHO the code needs
> much more commentary to explain how it works.
Added comments to explain the SVE implementation.
> I would be interested to see how your bytea test compares with the
> improvements added in commit e24d770 and with sending the
> This looks good. Thanks Chiranmoy and team. Can you address any other
> feedback from Nathan or others here? Then we can pursue further reviews and
> merging of the patch.
Thank you for the review.
If there is no further feedback from the community, may we submit the patch for
the next commit
I realized I didn't attach the patch.
v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?
Auto-vectorization is working now with the following addition on Graviton
On Wed, Mar 13, 2025 at 12:02:07AM +, nathandboss...@gmail.com wrote:
> Those are nice results. I'm a little worried about the Neon implementation
> for smaller inputs since it uses a per-byte loop for the remaining bytes,
> though. If we can ensure there's no regression there, I think this p
It seems that the patch doesn't compile on macOS: the compiler is unable to map
'i' and 'len', which are of type 'size_t', to 'uint64'. This appears to be a
macOS-specific issue. The latest patch should resolve this by casting 'size_t'
to 'uint64' before passing it to 'svwhilelt_b8'.
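One plausible reading of the failure, which this message does not confirm, is that 'size_t' and 'uint64' are distinct integer types on that platform, so a call that is overloaded on exact operand types finds no match. The sketch below illustrates the idea with C11 _Generic; demo_overload_for is a hypothetical stand-in and not the real svwhilelt_b8, and the other demo_ names are made up for this example as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for an intrinsic that is overloaded on exact operand
 * types.  It only reports which branch a given argument type selects.
 */
#define demo_overload_for(x) _Generic((x), \
	int32_t: "int32", \
	int64_t: "int64", \
	uint32_t: "uint32", \
	uint64_t: "uint64", \
	default: "no exact match")

/* Passes size_t straight through; the result depends on the platform. */
const char *
demo_overload_without_cast(size_t len)
{
	return demo_overload_for(len);
}

/* Casts first, so the uint64 branch is selected on every platform. */
const char *
demo_overload_with_cast(size_t len)
{
	return demo_overload_for((uint64_t) len);
}

/* Returns 1 when the cast argument selects the uint64 branch. */
int
demo_cast_selects_uint64(size_t len)
{
	const char *name = demo_overload_with_cast(len);

	assert(name != NULL);
	return strcmp(name, "uint64") == 0;
}
```

Where 'size_t' and 'uint64_t' are the same underlying type, both functions return "uint64". Where they differ, demo_overload_without_cast() may fall through to "no exact match", while the cast version still selects "uint64", which is the effect the explicit cast in the patch is after.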
[11:04:07.478] ../src/back
> Hm. Any idea why that is? I wonder if the compiler isn't using as many
> SVE registers as it could for this.
Not sure. We tried forcing loop unrolling using the line below in the Makefile,
but the results are the same.
pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native
> I've
> Interesting. I do see different assembly with the 2 and 4 register
> versions, but I didn't get to testing it on a machine with SVE support
> today.
> Besides some additional benchmarking, I might make some small adjustments
> to the patch. But overall, it seems to be in decent shape.
Sounds
On Wed, Mar 12, 2025 at 02:41:18AM +, nathandboss...@gmail.com wrote:
> v5-no-sve is the result of using a function pointer, but pointing to the
> "slow" versions instead of the SVE version. v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg i
Looks good, the code is more readable now.
> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.
There was no regression on Graviton 3 when using the 4-register
Attaching the rebased patch, some regression tests for SIMD hex-coding,
and a script to test bytea performance (usage info in the script).
The results obtained using the script on an m7g.4xlarge are shown below.
Read Operation
table (MB) | HEAD (ms) | SVE (ms) | improvement (%)
--
Here's the rebased patch with a few modifications.
The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge. No improvement on m7g.4xlarge.
Added line-by-line comments explaining the changes with an example.
Below are the results. Input size is in bytes, and exec
Hi all,
Since the CommitFest is underway, could we get some feedback to improve the
patch?
___
Chiranmoy