On 28/11/2024 19:59, Sam Russell wrote:
I've ported the PCLMUL code to ARMv8. It looks to be roughly an 80% reduction in user CPU time on an EC2 T4g instance:

$ lscpu
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Vendor ID:           ARM
Model name:          Neoverse-N1
Model:               1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
Stepping:            r3p1
BogoMIPS:            243.75
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m2.044s
sys 0m1.691s

# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.217s
user 0m2.022s
sys 0m1.770s

# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.135s
user 0m0.353s
sys 0m1.819s

# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m0.346s
sys 0m1.819s

Benchmark scripts (I used the crc_sum_stream() function, so the hash output is different, but I have verified it against the pclmul script functions locally):

$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536
real 0m19.300s
user 0m19.299s
sys 0m0.001s

$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536
real 0m5.001s
user 0m4.997s
sys 0m0.003s

For hash validation:

$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576
real 0m7.540s
user 0m7.538s
sys 0m0.001s

$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576
real 0m3.018s
user 0m3.018s
sys 0m0.000s

-O3 does most of the optimisation work for us. There may be more savings to find, but this is still a good improvement.

Some questions:

- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the hwcaps interface seems to be the way to test this [1] [2]
- ARM is a much more diverse ecosystem than x86_64, and it's possible that some platforms (e.g. phones) would see a slowdown. Is this something we want to give maintainers a flag to disable?
- ARMv8 also has a CRC32 instruction. A quick test showed it wasn't super efficient, but it's possible that interleaving it with the folding approach might add extra speedups. This is an exercise for the reader.
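
The hwcaps check mentioned in the first question might look roughly like the following on Linux. This is only a minimal sketch, not code from the patch: the helper name pmull_supported is hypothetical, and it assumes glibc's <sys/auxv.h> exposes HWCAP_PMULL on aarch64. On any other platform it simply reports no support.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
# include <sys/auxv.h>   /* getauxval, AT_HWCAP; assumed to define HWCAP_PMULL */
#endif

/* Hypothetical helper (name not taken from the patch): return true if
   the kernel reports the PMULL polynomial-multiply instructions.  */
static bool
pmull_supported (void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
  /* AT_HWCAP is a bitmask of CPU features exported by the kernel.  */
  return (getauxval (AT_HWCAP) & HWCAP_PMULL) != 0;
#else
  /* Not Linux/aarch64, or the macro is unavailable: take the generic path.  */
  return false;
#endif
}
```

A caller would test pmull_supported () once at startup and pick the folding implementation only when it returns true, falling back to the table-driven code otherwise.
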

Cool. I'll try this out on some of the arm64 machines at:
https://portal.cfarm.net/machines/list/

Note builders can disable this already with:
  ./configure utils_cv_vmull_intrinsic_exists=no

thanks!
Pádraig
