[PR] Improve int4 compressed comparisons performance [lucene]

via GitHub Tue, 23 Apr 2024 07:22:35 -0700


benwtrent opened a new pull request, #13321:
URL: https://github.com/apache/lucene/pull/13321


   This updates the int4 dot-product comparison with an optimized path for 
when one of the vectors is compressed (the most common search case). This 
change actually makes compressed search on ARM faster than uncompressed. 
On AVX512/256 it is still slightly slower than uncompressed, but it is much 
faster with this optimization than before (eagerly decompressing).
   
   This optimization is tightly coupled to how the vectors are actually 
compressed and stored; consequently, I added a new scorer within the 
lucene99 codec.
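   
   To illustrate the idea of scoring against a compressed vector without decompressing it first, here is a minimal sketch. It is not the code from this PR: the class and method names are hypothetical, and the nibble layout (value `i` in the high nibble, value `i + n` in the low nibble of byte `i`) is an assumption made for the example.
   
   ```java
   // Hypothetical sketch; names and packing layout are assumptions, not the PR's code.
   final class Int4PackedDotSketch {

     // Packs 2*n int4 values (each 0..15) into n bytes.
     // Assumed layout: raw[i] in the high nibble, raw[i + n] in the low nibble.
     static byte[] pack(byte[] raw) {
       if ((raw.length & 1) != 0) {
         throw new IllegalArgumentException("even length required");
       }
       int half = raw.length / 2;
       byte[] packed = new byte[half];
       for (int i = 0; i < half; i++) {
         packed[i] = (byte) ((raw[i] << 4) | raw[half + i]);
       }
       return packed;
     }

     // Dot product of an unpacked int4 vector against a packed one,
     // extracting both nibbles on the fly instead of materializing a copy.
     static int dotPacked(byte[] unpacked, byte[] packed) {
       if (unpacked.length != packed.length * 2) {
         throw new IllegalArgumentException("dimension mismatch");
       }
       int total = 0;
       for (int i = 0; i < packed.length; i++) {
         int p = packed[i] & 0xFF; // treat the packed byte as unsigned
         total += (p >>> 4) * unpacked[i];
         total += (p & 0x0F) * unpacked[i + packed.length];
       }
       return total;
     }

     public static void main(String[] args) {
       byte[] doc = {1, 2, 3, 4};
       byte[] query = {5, 6, 7, 8};
       System.out.println(dotPacked(query, pack(doc))); // prints 70
     }
   }
   ```
   
   Because the comparison has to know exactly which nibble holds which dimension, a routine like this only works for one specific storage layout, which is why the scorer is tied to the codec.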
   
   So, this gives us an 8x size reduction over float32, queries that are 
well over 2x faster than float32, and no need to rerank, as the recall and 
accuracy are excellent.
   
   Here are some lucene-util numbers over CohereV3 at 1024 dimensions:
   
   New compressed numbers on ARM
   ```
   recall       latency nDoc    fanout  maxConn beamWidth       visited 
   0.891         1.02   500000  0       64      250             4182
   0.910         1.05   500000  0       64      250             4348
   0.925         1.06   500000  0       64      250             4511
   0.974         1.29   500000  0       64      250             5782
   0.986         1.72   500000  0       64      250             7285
   ```
   
   Compared with uncompressed on ARM:
   ```
   recall       latency nDoc    fanout  maxConn beamWidth       visited
   0.891         1.18   500000  0       64      250             4182
   0.910         1.24   500000  0       64      250             4348
   0.925         1.25   500000  0       64      250             4511
   0.974         1.57   500000  0       64      250             5782
   0.986         2.15   500000  0       64      250             7285
   ```
   
   Here are some JMH numbers as well (note: I am excluding odd vector sizes 
for the packed variants, as these don't support compression).
   
   NOTE: `PackedUnpacked` eagerly decompresses the vectors and then uses the 
regular dot-product, which is what occurs today.
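   
   For reference, the eager-decompression baseline described in the note can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual code: the names are hypothetical, and the layout (high nibble holds dimension `i`, low nibble holds dimension `i + n`) is an assumption.
   
   ```java
   // Hypothetical sketch of the "decompress first, then dot-product" baseline.
   final class Int4EagerDecompressSketch {

     // Expands n packed bytes into 2*n int4 values written to 'out'.
     // Assumed layout: high nibble -> out[i], low nibble -> out[i + n].
     static void decompress(byte[] packed, byte[] out) {
       if (out.length != packed.length * 2) {
         throw new IllegalArgumentException("dimension mismatch");
       }
       for (int i = 0; i < packed.length; i++) {
         int p = packed[i] & 0xFF; // treat the packed byte as unsigned
         out[i] = (byte) (p >>> 4);
         out[i + packed.length] = (byte) (p & 0x0F);
       }
     }

     // Plain dot product over two unpacked int4 vectors.
     static int dot(byte[] a, byte[] b) {
       int total = 0;
       for (int i = 0; i < a.length; i++) {
         total += a[i] * b[i];
       }
       return total;
     }

     // Baseline: an extra pass over the data and a scratch buffer
     // before the actual comparison runs.
     static int dotEager(byte[] unpacked, byte[] packed, byte[] scratch) {
       decompress(packed, scratch);
       return dot(unpacked, scratch);
     }

     public static void main(String[] args) {
       byte[] packed = {0x13, 0x24}; // encodes {1, 2, 3, 4} under the assumed layout
       byte[] query = {5, 6, 7, 8};
       System.out.println(dotEager(query, packed, new byte[4])); // prints 70
     }
   }
   ```
   
   The extra write and re-read of the scratch buffer is the overhead that the `Packed` variants avoid.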
   
   ARM:
   ```
   VectorUtilBenchmark.binaryHalfByteScalar                   128  thrpt    5   25.072 ±  0.364  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   256  thrpt    5   12.534 ±  0.152  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   300  thrpt    5   10.715 ±  0.116  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   512  thrpt    5    6.275 ±  0.019  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   702  thrpt    5    4.577 ±  0.019  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                  1024  thrpt    5    3.113 ±  0.010  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             128  thrpt    5   24.161 ±  0.183  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             256  thrpt    5   12.261 ±  0.356  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             300  thrpt    5   10.535 ±  0.264  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             512  thrpt    5    6.157 ±  0.062  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             702  thrpt    5    4.505 ±  0.022  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked            1024  thrpt    5    3.104 ±  0.013  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     128  thrpt    5   15.179 ±  0.307  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     256  thrpt    5    7.883 ±  0.126  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     300  thrpt    5    6.826 ±  0.014  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     512  thrpt    5    3.996 ±  0.013  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     702  thrpt    5    2.934 ±  0.010  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked    1024  thrpt    5    2.008 ±  0.026  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   128  thrpt    5   69.386 ±  0.371  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   256  thrpt    5   51.016 ±  0.180  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   300  thrpt    5   40.186 ±  0.117  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   512  thrpt    5   33.453 ±  0.096  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   702  thrpt    5   23.627 ±  0.429  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                  1024  thrpt    5   19.833 ±  0.065  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             128  thrpt    5   66.502 ±  0.335  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             256  thrpt    5   47.178 ±  0.546  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             300  thrpt    5   36.942 ±  0.122  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             512  thrpt    5   29.735 ±  0.328  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             702  thrpt    5   21.145 ±  0.085  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked            1024  thrpt    5   17.103 ±  0.050  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     128  thrpt    5   25.077 ±  0.459  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     256  thrpt    5   15.033 ±  0.041  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     300  thrpt    5   12.681 ±  0.222  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     512  thrpt    5    8.240 ±  0.461  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     702  thrpt    5    6.034 ±  0.022  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked    1024  thrpt    5    4.320 ±  0.509  ops/us
   ```
   
   AVX512:
   ```
   VectorUtilBenchmark.binaryHalfByteScalar                   128  thrpt   15   17.767 ± 0.123  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   256  thrpt   15    9.248 ± 0.112  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   300  thrpt   15    8.095 ± 0.102  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   512  thrpt   15    4.723 ± 0.054  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                   702  thrpt   15    3.580 ± 0.030  ops/us
   VectorUtilBenchmark.binaryHalfByteScalar                  1024  thrpt   15    2.346 ± 0.047  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             128  thrpt   15   14.119 ± 0.069  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             256  thrpt   15    6.478 ± 0.037  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             300  thrpt   15    4.157 ± 0.048  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             512  thrpt   15    2.490 ± 0.017  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked             702  thrpt   15    1.817 ± 0.011  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPacked            1024  thrpt   15    1.240 ± 0.009  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     128  thrpt   15   10.022 ± 0.068  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     256  thrpt   15    5.583 ± 0.048  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     300  thrpt   15    4.667 ± 0.083  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     512  thrpt   15    2.698 ± 0.034  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked     702  thrpt   15    1.931 ± 0.019  ops/us
   VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked    1024  thrpt   15    1.294 ± 0.019  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   128  thrpt   15   84.577 ± 2.424  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   207  thrpt   15   44.973 ± 0.448  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   256  thrpt   15   51.049 ± 0.379  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   300  thrpt   15   39.401 ± 0.527  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   512  thrpt   15   27.654 ± 0.145  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                   702  thrpt   15   20.007 ± 0.120  ops/us
   VectorUtilBenchmark.binaryHalfByteVector                  1024  thrpt   15   14.378 ± 0.070  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             128  thrpt   15   58.249 ± 0.375  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             256  thrpt   15   30.865 ± 0.164  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             300  thrpt   15   22.795 ± 0.280  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             512  thrpt   15   16.406 ± 0.506  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked             702  thrpt   15    9.555 ± 0.167  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPacked            1024  thrpt   15    8.638 ± 0.095  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     128  thrpt   15   15.507 ± 0.122  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     256  thrpt   15    9.079 ± 0.068  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     300  thrpt   15    7.788 ± 0.083  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     512  thrpt   15    4.992 ± 0.064  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked     702  thrpt   15    3.622 ± 0.033  ops/us
   VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked    1024  thrpt   15    2.488 ± 0.019  ops/us
   ```


