[PR] Optimize int4 vector computations by avoiding conversions [lucene]

via GitHub Fri, 20 Feb 2026 13:52:36 -0800


kaivalnp opened a new pull request, #15742:
URL: https://github.com/apache/lucene/pull/15742


   Spinoff from #15736, where @mccullocht identified vector conversions as a potentially slow area (thanks!).
   This PR loads and operates on SIMD registers of the preferred bit size to avoid intermediate conversions.
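
For readers unfamiliar with the operations being benchmarked, here is a minimal scalar sketch of int4 dot products over unpacked, single-packed, and both-packed operands. It is not the SIMD code from this PR. The class name, method names, and the nibble layout (value `i` in the high nibble, value `i + n` in the low nibble) are assumptions made for illustration only.

```java
/** Scalar sketch of int4 dot products. The nibble layout is an assumption, not taken from the PR. */
final class Int4DotProductSketch {

  /** Packs 2n values in [0, 15] into n bytes: value i in the high nibble, value i + n in the low nibble. */
  static byte[] pack(byte[] unpacked) {
    if ((unpacked.length & 1) != 0) {
      throw new IllegalArgumentException("length must be even");
    }
    int n = unpacked.length / 2;
    byte[] packed = new byte[n];
    for (int i = 0; i < n; i++) {
      packed[i] = (byte) ((unpacked[i] << 4) | unpacked[i + n]);
    }
    return packed;
  }

  /** Dot product of two unpacked vectors (one 4-bit value per byte). */
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  /** Dot product of an unpacked vector with a packed one (two 4-bit values per byte). */
  static int dotProductSinglePacked(byte[] unpacked, byte[] packed) {
    int n = packed.length;
    int total = 0;
    for (int i = 0; i < n; i++) {
      int high = (packed[i] & 0xFF) >>> 4; // mask first so the shift never sees a sign-extended byte
      int low = packed[i] & 0x0F;
      total += high * unpacked[i] + low * unpacked[i + n];
    }
    return total;
  }

  /** Dot product of two packed vectors that share the same layout. */
  static int dotProductBothPacked(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      int highProduct = ((a[i] & 0xFF) >>> 4) * ((b[i] & 0xFF) >>> 4);
      int lowProduct = (a[i] & 0x0F) * (b[i] & 0x0F);
      total += highProduct + lowProduct;
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, 3, 4};
    byte[] b = {5, 6, 7, 8};
    System.out.println(dotProduct(a, b));
    System.out.println(dotProductSinglePacked(a, pack(b)));
    System.out.println(dotProductBothPacked(pack(a), pack(b)));
  }
}
```

All three methods return the same value for the same logical vectors. Only the storage format differs, which is why unpacking and widening can dominate the cost of a vectorized version.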
   
   This was sparked by #15697, where we observed a performance drop in JMH benchmarks of some 4-bit vector computations after initial warmup.
   It's possible that this issue only affects ARM machines.
   
   Ran JMH benchmarks on an AWS Graviton3 host using:
   
   ```sh
   java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte.*Vector" -p size=1024
   ```
   
   Baseline:
   
   ```
   Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  11.846 ± 0.034  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   2.618 ± 0.009  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.733 ± 0.063  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  12.599 ± 0.022  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   2.603 ± 0.008  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.492 ± 0.033  ops/us
   ```
   
   This PR:
   
   ```
   Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  17.356 ± 0.052  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  19.157 ± 0.055  ops/us
   VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.575 ± 0.049  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  16.030 ± 0.077  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.247 ± 0.120  ops/us
   VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.952 ± 0.113  ops/us
   ```
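
To make the two tables easier to compare, the relative throughput (this PR divided by baseline) can be computed from the reported scores. The sketch below is only arithmetic over the numbers above. The class and method names are hypothetical, and the error margins are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Computes throughput ratios from the JMH scores reported above (ops/us, size=1024). */
final class SpeedupSketch {

  /** Ratio above 1.0 means the candidate is faster than the baseline. */
  static double speedup(double baseline, double candidate) {
    return candidate / baseline;
  }

  /** Maps each benchmark suffix to its {baseline, this PR} scores, copied from the tables above. */
  static Map<String, double[]> scores() {
    Map<String, double[]> scores = new LinkedHashMap<>();
    scores.put("DotProductBothPackedVector", new double[] {11.846, 17.356});
    scores.put("DotProductSinglePackedVector", new double[] {2.618, 19.157});
    scores.put("DotProductVector", new double[] {20.733, 20.575});
    scores.put("SquareBothPackedVector", new double[] {12.599, 16.030});
    scores.put("SquareSinglePackedVector", new double[] {2.603, 16.247});
    scores.put("SquareVector", new double[] {18.492, 18.952});
    return scores;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, double[]> e : scores().entrySet()) {
      double[] s = e.getValue();
      System.out.printf("binaryHalfByte%-30s %.2fx%n", e.getKey(), speedup(s[0], s[1]));
    }
  }
}
```

The single-packed variants show the largest change, the both-packed variants a smaller one, and the unpacked variants stay within a few percent of baseline.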



