RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

Jatin Bhateja Sat, 28 Sep 2024 21:26:43 -0700

This patch optimizes LongVector multiplication by inferring the VPMULUDQ 
instruction for the following IR patterns.


       MulL   ( And  SRC1,  0xFFFFFFFF)   ( And  SRC2,  0xFFFFFFFF) 
       MulL   (URShift SRC1 , 32) (URShift SRC2, 32)
       MulL   (URShift SRC1 , 32)  ( And  SRC2,  0xFFFFFFFF)
       MulL   ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
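
For illustration, the four shapes above correspond to source-level expressions like the ones below. This is a minimal scalar sketch: the class and method names are hypothetical and not taken from the patch. The vector IR itself would come from the equivalent Vector API or auto-vectorized code, and whether VPMULUDQ is actually selected depends on the target.

```java
public class MulPatterns {
    static final long LOW_MASK = 0xFFFFFFFFL;

    // MulL (And SRC1, 0xFFFFFFFF) (And SRC2, 0xFFFFFFFF)
    static long lowLow(long a, long b) {
        return (a & LOW_MASK) * (b & LOW_MASK);
    }

    // MulL (URShift SRC1, 32) (URShift SRC2, 32)
    static long highHigh(long a, long b) {
        return (a >>> 32) * (b >>> 32);
    }

    // MulL (URShift SRC1, 32) (And SRC2, 0xFFFFFFFF)
    static long highLow(long a, long b) {
        return (a >>> 32) * (b & LOW_MASK);
    }

    // MulL (And SRC1, 0xFFFFFFFF) (URShift SRC2, 32)
    static long lowHigh(long a, long b) {
        return (a & LOW_MASK) * (b >>> 32);
    }
}
```

In every case both operands have their upper 32 bits cleared before the multiply, so the product always fits in 64 bits.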



A 64x64-bit multiplication produces a 128-bit result. It can be performed by 
individually multiplying the upper and lower doublewords of the multiplier with 
the multiplicand and assembling the partial products into the full-width result. 
Targets that support vector quadword multiplication have separate instructions 
to compute the upper and lower quadwords of the 128-bit result. The existing 
VectorAPI multiplication operator therefore expects shape conformance between 
the source and result vectors.
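
The partial-product decomposition described above can be sketched in scalar Java. This is an illustrative model only, not code from the patch; `WideMul` and `mulU64` are hypothetical names, and the operands are treated as unsigned.

```java
public class WideMul {
    static final long LOW_MASK = 0xFFFFFFFFL;

    // Returns {high, low}: the two quadwords of the unsigned 128-bit product a * b.
    static long[] mulU64(long a, long b) {
        long aLo = a & LOW_MASK, aHi = a >>> 32;
        long bLo = b & LOW_MASK, bHi = b >>> 32;

        // Four 32x32 -> 64 bit partial products.
        long ll = aLo * bLo;
        long lh = aLo * bHi;
        long hl = aHi * bLo;
        long hh = aHi * bHi;

        // Middle column: carry out of the low product plus the low halves
        // of the two cross products (cannot overflow 64 bits).
        long mid = (ll >>> 32) + (lh & LOW_MASK) + (hl & LOW_MASK);

        long low = (mid << 32) | (ll & LOW_MASK);
        long high = hh + (lh >>> 32) + (hl >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }
}
```

When `aHi` and `bHi` are both zero, three of the four partial products vanish and the result reduces to `ll` alone.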

If the upper 32 bits of the quadword multiplier and multiplicand are always 
zero, the result of the multiplication depends only on the partial product of 
their lower doublewords and can be computed with an unsigned 32-bit 
multiplication instruction that produces a full quadword result. The patch 
matches this pattern in a target-dependent manner without introducing a new IR 
node.
 
The VPMULUDQ instruction performs unsigned multiplication between the 
even-numbered doubleword lanes of two long vectors and produces a 64-bit result 
per quadword lane. It has much lower latency than the full 64-bit 
multiplication instruction VPMULLQ. In addition, non-AVX512DQ targets do not 
support direct quadword multiplication, so on those targets we avoid computing 
the redundant partial products for the zeroed-out upper 32 bits. This results 
in throughput improvements on both P-core and E-core Xeons.
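
The lane behaviour of VPMULUDQ can be modelled in scalar Java as follows. This is only an illustrative emulation with hypothetical names; each source vector is represented as an `int[]` of doubleword lanes, and the result as a `long[]` of quadword lanes.

```java
public class VpmuludqModel {
    // For every quadword lane i, multiply the unsigned even-numbered
    // doubleword lanes (index 2*i) of both sources and keep the full
    // 64-bit product. Odd-numbered doubleword lanes are ignored.
    static long[] vpmuludq(int[] src1, int[] src2) {
        long[] dst = new long[src1.length / 2];
        for (int i = 0; i < dst.length; i++) {
            long a = Integer.toUnsignedLong(src1[2 * i]);
            long b = Integer.toUnsignedLong(src2[2 * i]);
            dst[i] = a * b;
        }
        return dst;
    }
}
```

Because the odd doubleword lanes never contribute, the instruction gives the same answer as a full quadword multiply whenever those upper halves are known to be zero.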

Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) 
included with the patch:-
 

Sierra Forest :-
============
Baseline:-
Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms

With Optimization:-
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  1299.407          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2   504.995          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   327.544          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   160.963          ops/ms

Granite Rapids:-
=============
Baseline:-
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  2279.099          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  1148.609          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   570.848          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   268.872          ops/ms

With Optimization:-
Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  2612.484          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  1308.187          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2   653.375          ops/ms
VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2   316.182          ops/ms


Kindly review and share your feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

Changes: https://git.openjdk.org/jdk/pull/21244/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21244&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8341137
  Stats: 355 lines in 12 files changed: 343 ins; 0 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/21244.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21244/head:pull/21244

PR: https://git.openjdk.org/jdk/pull/21244
