This patch optimizes LongVector multiplication by inferring the VPMULUDQ
instruction for the following IR patterns:
  MulL (And SRC1, 0xFFFFFFFF) (And SRC2, 0xFFFFFFFF)
  MulL (URShift SRC1, 32) (URShift SRC2, 32)
  MulL (URShift SRC1, 32) (And SRC2, 0xFFFFFFFF)
  MulL (And SRC1, 0xFFFFFFFF) (URShift SRC2, 32)
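
In scalar Java terms, each operand shape in the patterns above selects one 32-bit half of a 64-bit lane, so both inputs to the multiply have their upper 32 bits cleared. The sketch below only illustrates that per-lane meaning; the class and method names (`MulPatterns`, `lowLow`, and so on) are hypothetical and are not taken from the patch.

```java
public class MulPatterns {
    static final long MASK = 0xFFFFFFFFL;

    // MulL (And SRC1, 0xFFFFFFFF) (And SRC2, 0xFFFFFFFF)
    static long lowLow(long a, long b) {
        return (a & MASK) * (b & MASK);
    }

    // MulL (URShift SRC1, 32) (URShift SRC2, 32)
    static long highHigh(long a, long b) {
        return (a >>> 32) * (b >>> 32);
    }

    // MulL (URShift SRC1, 32) (And SRC2, 0xFFFFFFFF)
    static long highLow(long a, long b) {
        return (a >>> 32) * (b & MASK);
    }

    // MulL (And SRC1, 0xFFFFFFFF) (URShift SRC2, 32)
    static long lowHigh(long a, long b) {
        return (a & MASK) * (b >>> 32);
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFF00000002L;
        long b = 0x00000003FFFFFFFFL;
        // Print each product as an unsigned 64-bit value.
        System.out.println(Long.toUnsignedString(lowLow(a, b)));
        System.out.println(Long.toUnsignedString(highHigh(a, b)));
        System.out.println(Long.toUnsignedString(highLow(a, b)));
        System.out.println(Long.toUnsignedString(lowHigh(a, b)));
    }
}
```

In every case the product of two values below 2^32 fits in 64 bits, so no information is lost by computing only a 32x32-bit unsigned multiply per lane.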
A 64x64-bit multiplication produces a 128-bit result. It can be performed by
multiplying the upper and lower doublewords of the multiplier with the
multiplicand individually and assembling the partial products into the
full-width result. Targets that support vector quadword multiplication have
separate instructions to compute the upper and lower quadwords of the 128-bit
result. Therefore, the existing VectorAPI multiplication operator expects shape
conformance between source and result vectors.
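
The partial-product decomposition described above can be sketched in scalar Java as follows. This is a minimal illustration, not code from the patch; `WideMul` and `mul64x64` are hypothetical names, and the operands are treated as unsigned 64-bit values.

```java
public class WideMul {
    static final long MASK = 0xFFFFFFFFL;

    /** Returns {high, low} quadwords of the unsigned 128-bit product a * b. */
    static long[] mul64x64(long a, long b) {
        long aLo = a & MASK, aHi = a >>> 32;
        long bLo = b & MASK, bHi = b >>> 32;

        // Four 32x32 -> 64-bit partial products (each fits in 64 bits).
        long ll = aLo * bLo;
        long lh = aLo * bHi;
        long hl = aHi * bLo;
        long hh = aHi * bHi;

        // Sum the terms that land in bits 32..63; the sum is below 3 * 2^32,
        // so it cannot overflow a long, and its upper bits are the carry.
        long mid = (ll >>> 32) + (lh & MASK) + (hl & MASK);

        long low = (mid << 32) | (ll & MASK);
        long high = hh + (lh >>> 32) + (hl >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }

    public static void main(String[] args) {
        long[] r = mul64x64(0xFEDCBA9876543210L, 0x0123456789ABCDEFL);
        System.out.println(Long.toHexString(r[0]) + " : " + Long.toHexString(r[1]));
    }
}
```

When `aHi` and `bHi` are both zero, the `lh`, `hl`, and `hh` terms vanish and only `ll` contributes to the result.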
If the upper 32 bits of both the quadword multiplier and the multiplicand are
always zero, the result of the multiplication depends only on the partial
product of their lower doublewords, and it can be computed with an unsigned
32-bit multiplication instruction that produces a quadword result. The patch
matches this pattern in a target-dependent manner without introducing a new IR
node.
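
The equivalence that this matching relies on can be checked in scalar Java. The sketch below is an illustration under the stated assumption that both operands have zero upper halves; `ZeroUpperMul`, `fullMultiply`, and `unsignedMul32` are hypothetical names.

```java
public class ZeroUpperMul {
    // Ordinary 64-bit multiply, keeping the low 64 bits of the product.
    static long fullMultiply(long a, long b) {
        return a * b;
    }

    // Unsigned 32x32 -> 64-bit multiply of the lower doublewords only.
    static long unsignedMul32(int aLo, int bLo) {
        return Integer.toUnsignedLong(aLo) * Integer.toUnsignedLong(bLo);
    }

    public static void main(String[] args) {
        // Both operands have their upper 32 bits cleared.
        long a = 0x00000000FFFFFFFFL;
        long b = 0x0000000089ABCDEFL;
        boolean same = fullMultiply(a, b) == unsignedMul32((int) a, (int) b);
        System.out.println(same);
    }
}
```

The two results agree only under the zero-upper-half assumption; for arbitrary 64-bit operands the cross terms of the upper doublewords also contribute to the low quadword.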
The VPMULUDQ instruction performs unsigned multiplication between the
even-numbered doubleword lanes of two long vectors and produces 64-bit results.
It has much lower latency than the full 64-bit multiplication instruction
VPMULLQ. In addition, non-AVX512DQ targets do not support direct quadword
multiplication, so using VPMULUDQ avoids computing redundant partial products
for the zeroed-out upper 32 bits. This results in throughput improvements on
both P-core and E-core Xeons.
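
A scalar model of the lane behavior described above may help when reading the patch. It is a sketch only: vectors are modeled as `int[]` arrays of doubleword lanes, and `VpmuludqModel` and `vpmuludq` are hypothetical names rather than JDK or HotSpot APIs.

```java
public class VpmuludqModel {
    /**
     * Models VPMULUDQ: for each quadword lane i, multiplies the even-numbered
     * doubleword lanes a[2*i] and b[2*i] as unsigned 32-bit values and stores
     * the full 64-bit product. Odd-numbered doubleword lanes are ignored.
     */
    static long[] vpmuludq(int[] a, int[] b) {
        if (a.length != b.length || (a.length & 1) != 0) {
            throw new IllegalArgumentException("expected equal, even lane counts");
        }
        long[] dst = new long[a.length / 2];
        for (int i = 0; i < dst.length; i++) {
            dst[i] = Integer.toUnsignedLong(a[2 * i]) * Integer.toUnsignedLong(b[2 * i]);
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = { -1, 7, 2, 9 };   // -1 is 0xFFFFFFFF when read as unsigned
        int[] b = { -1, 5, 3, 11 };
        for (long v : vpmuludq(a, b)) {
            System.out.println(Long.toUnsignedString(v));
        }
    }
}
```

Because the odd-numbered doubleword lanes are ignored, the masking or shifting in the matched patterns only has to place the wanted 32-bit half in the even lane.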
Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)
included with the patch:

Sierra Forest:
==============
Baseline:
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2   806.228          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2   403.044          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   200.641          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   100.664          ops/ms

With Optimization:
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  1299.407          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2   504.995          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   327.544          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   160.963          ops/ms

Granite Rapids:
===============
Baseline:
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  2279.099          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  1148.609          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   570.848          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   268.872          ops/ms

With Optimization:
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  2612.484          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  1308.187          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   653.375          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   316.182          ops/ms

Kindly review and share your feedback.
Best Regards,
Jatin
-------------
Commit messages:
- 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction
Changes: https://git.openjdk.org/jdk/pull/21244/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21244&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8341137
Stats: 355 lines in 12 files changed: 343 ins; 0 del; 12 mod
Patch: https://git.openjdk.org/jdk/pull/21244.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/21244/head:pull/21244
PR: https://git.openjdk.org/jdk/pull/21244