[PR] Improve performance of panama int4{DotProduct,SquareDistance}SinglePacked [lucene]

via GitHub Thu, 19 Feb 2026 21:04:45 -0800


mccullocht opened a new pull request, #15736:
URL: https://github.com/apache/lucene/pull/15736


   The conversions to int showed up as very slow in async-profiler. These
   operations widen vectors that are already at the maximum native length, so
   it is preferable to do fewer of them by summing the two short accumulators
   before widening. Summing before widening can overflow a signed short, but
   switching the integer widening to zero-extend is sufficient, since these
   dot products are implicitly unsigned.
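
The overflow argument can be illustrated with a small scalar sketch. This is not the PR's code: the class and method names are hypothetical, and it assumes each 16-bit accumulator lane holds a non-negative value, so the sum of two lanes always fits in 16 unsigned bits. In the Panama Vector API the corresponding choice would presumably be between `VectorOperators.S2I` and `VectorOperators.ZERO_EXTEND_S2I`.

```java
// Hypothetical scalar model of one 16-bit accumulator lane (not Lucene code).
class ZeroExtendSketch {
  // Adds two 16-bit lanes with wraparound, as a short vector add would.
  static short addLanes(short a, short b) {
    return (short) (a + b);
  }

  // Sign-extending widening: treats the top bit as a sign bit.
  static int widenSigned(short s) {
    return s;
  }

  // Zero-extending widening: treats the 16 bits as an unsigned value.
  static int widenUnsigned(short s) {
    return Short.toUnsignedInt(s);
  }

  public static void main(String[] args) {
    // Assumed values: two non-negative partial sums whose total exceeds Short.MAX_VALUE.
    short acc0 = 30000;
    short acc1 = 30000;
    short sum = addLanes(acc0, acc1);
    System.out.println(widenSigned(sum));   // prints -5536 (wrong: overflowed as signed)
    System.out.println(widenUnsigned(sum)); // prints 60000 (correct: same bits read as unsigned)
  }
}
```

Under that assumption a single zero-extending widen of the summed lanes gives the same result as widening each accumulator separately and then adding, while needing half as many widening operations.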
   
   M4 Before
   ```
   Benchmark                                                       (size)   Mode  Cnt  Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15  4.111 ± 0.028  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  3.413 ± 0.022  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15  4.597 ± 0.055  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  3.418 ± 0.017  ops/us
   ```
   
   M4 After
   ```
   Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   4.016 ± 0.032  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  26.741 ± 0.180  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   4.592 ± 0.041  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  26.597 ± 0.071  ops/us
   ```
   
   AMD Ryzen AI 395 (AVX 512) Before
   ```
   Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.471 ± 0.059  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  11.226 ± 0.333  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.281 ± 0.028  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  12.019 ± 0.379  ops/us
   ```
   
   AMD Ryzen AI 395 After
   ```
   Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.477 ± 0.053  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  27.912 ± 0.162  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.266 ± 0.028  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  40.499 ± 0.578  ops/us
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


