benwtrent opened a new pull request, #13321: URL: https://github.com/apache/lucene/pull/13321
This updates the int4 dot-product comparison with an optimized path for when one of the vectors is compressed (the most common search case). This change actually makes compressed search on ARM faster than uncompressed. On AVX512/256 it is still slightly slower than uncompressed, but it is much faster with this optimization than before (eagerly decompressing).

This optimization is tightly coupled to how the vectors are actually compressed and stored; consequently, I added a new scorer within the lucene99 codec.

So, this gives us an 8x size reduction over float32, queries well over 2x faster than float32, and no need to rerank, as the recall and accuracy are excellent.

Here are some lucene-util numbers over CohereV3 at 1024 dimensions:

New compressed numbers on ARM

```
recall  latency    nDoc  fanout  maxConn  beamWidth  visited
 0.891     1.02  500000       0       64        250     4182
 0.910     1.05  500000       0       64        250     4348
 0.925     1.06  500000       0       64        250     4511
 0.974     1.29  500000       0       64        250     5782
 0.986     1.72  500000       0       64        250     7285
```

Compared with uncompressed on ARM:

```
recall  latency    nDoc  fanout  maxConn  beamWidth  visited
 0.891     1.18  500000       0       64        250     4182
 0.910     1.24  500000       0       64        250     4348
 0.925     1.25  500000       0       64        250     4511
 0.974     1.57  500000       0       64        250     5782
 0.986     2.15  500000       0       64        250     7285
```

Here are some JMH numbers as well (note: I am excluding odd numbers of indices, as these don't support compression).

NOTE: `PackedUnpacked` eagerly decompresses the vectors and then uses dot-product, which is what happens currently.
ARM:

```
VectorUtilBenchmark.binaryHalfByteScalar                 128  thrpt   5  25.072 ± 0.364  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 256  thrpt   5  12.534 ± 0.152  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 300  thrpt   5  10.715 ± 0.116  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 512  thrpt   5   6.275 ± 0.019  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 702  thrpt   5   4.577 ± 0.019  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                1024  thrpt   5   3.113 ± 0.010  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           128  thrpt   5  24.161 ± 0.183  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           256  thrpt   5  12.261 ± 0.356  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           300  thrpt   5  10.535 ± 0.264  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           512  thrpt   5   6.157 ± 0.062  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           702  thrpt   5   4.505 ± 0.022  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked          1024  thrpt   5   3.104 ± 0.013  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   128  thrpt   5  15.179 ± 0.307  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   256  thrpt   5   7.883 ± 0.126  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   300  thrpt   5   6.826 ± 0.014  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   512  thrpt   5   3.996 ± 0.013  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   702  thrpt   5   2.934 ± 0.010  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked  1024  thrpt   5   2.008 ± 0.026  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 128  thrpt   5  69.386 ± 0.371  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 256  thrpt   5  51.016 ± 0.180  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 300  thrpt   5  40.186 ± 0.117  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 512  thrpt   5  33.453 ± 0.096  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 702  thrpt   5  23.627 ± 0.429  ops/us
VectorUtilBenchmark.binaryHalfByteVector                1024  thrpt   5  19.833 ± 0.065  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           128  thrpt   5  66.502 ± 0.335  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           256  thrpt   5  47.178 ± 0.546  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           300  thrpt   5  36.942 ± 0.122  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           512  thrpt   5  29.735 ± 0.328  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           702  thrpt   5  21.145 ± 0.085  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked          1024  thrpt   5  17.103 ± 0.050  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   128  thrpt   5  25.077 ± 0.459  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   256  thrpt   5  15.033 ± 0.041  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   300  thrpt   5  12.681 ± 0.222  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   512  thrpt   5   8.240 ± 0.461  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   702  thrpt   5   6.034 ± 0.022  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked  1024  thrpt   5   4.320 ± 0.509  ops/us
```

AVX512:

```
VectorUtilBenchmark.binaryHalfByteScalar                 128  thrpt  15  17.767 ± 0.123  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 256  thrpt  15   9.248 ± 0.112  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 300  thrpt  15   8.095 ± 0.102  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 512  thrpt  15   4.723 ± 0.054  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                 702  thrpt  15   3.580 ± 0.030  ops/us
VectorUtilBenchmark.binaryHalfByteScalar                1024  thrpt  15   2.346 ± 0.047  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           128  thrpt  15  14.119 ± 0.069  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           256  thrpt  15   6.478 ± 0.037  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           300  thrpt  15   4.157 ± 0.048  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           512  thrpt  15   2.490 ± 0.017  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked           702  thrpt  15   1.817 ± 0.011  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPacked          1024  thrpt  15   1.240 ± 0.009  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   128  thrpt  15  10.022 ± 0.068  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   256  thrpt  15   5.583 ± 0.048  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   300  thrpt  15   4.667 ± 0.083  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   512  thrpt  15   2.698 ± 0.034  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked   702  thrpt  15   1.931 ± 0.019  ops/us
VectorUtilBenchmark.binaryHalfByteScalarPackedUnpacked  1024  thrpt  15   1.294 ± 0.019  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 128  thrpt  15  84.577 ± 2.424  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 207  thrpt  15  44.973 ± 0.448  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 256  thrpt  15  51.049 ± 0.379  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 300  thrpt  15  39.401 ± 0.527  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 512  thrpt  15  27.654 ± 0.145  ops/us
VectorUtilBenchmark.binaryHalfByteVector                 702  thrpt  15  20.007 ± 0.120  ops/us
VectorUtilBenchmark.binaryHalfByteVector                1024  thrpt  15  14.378 ± 0.070  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           128  thrpt  15  58.249 ± 0.375  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           256  thrpt  15  30.865 ± 0.164  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           300  thrpt  15  22.795 ± 0.280  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           512  thrpt  15  16.406 ± 0.506  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked           702  thrpt  15   9.555 ± 0.167  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked          1024  thrpt  15   8.638 ± 0.095  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   128  thrpt  15  15.507 ± 0.122  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   256  thrpt  15   9.079 ± 0.068  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   300  thrpt  15   7.788 ± 0.083  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   512  thrpt  15   4.992 ± 0.064  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked   702  thrpt  15   3.622 ± 0.033  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPackedUnpacked  1024  thrpt  15   2.488 ± 0.019  ops/us
```
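For readers unfamiliar with the packed comparison, here is a minimal scalar sketch of the idea: compute the dot product directly against the packed bytes instead of eagerly decompressing first. The class name `Int4PackedDotSketch`, its method names, and the nibble layout (element `i` in the high nibble, element `i + half` in the low nibble) are assumptions made for illustration; this is not the scorer or the SIMD code added in this PR.

```java
// Illustrative sketch only. Names and the nibble layout are hypothetical assumptions.
final class Int4PackedDotSketch {

  // Packs an even-length vector of 4-bit values (0..15) into half as many bytes.
  // Assumed layout: raw[i] goes in the high nibble, raw[half + i] in the low nibble.
  static byte[] pack(byte[] raw) {
    if ((raw.length & 1) != 0) {
      throw new IllegalArgumentException("even length required");
    }
    int half = raw.length / 2;
    byte[] packed = new byte[half];
    for (int i = 0; i < half; i++) {
      packed[i] = (byte) ((raw[i] << 4) | raw[half + i]);
    }
    return packed;
  }

  // Eager decompression: rebuilds the full-length vector from the packed bytes.
  static byte[] unpack(byte[] packed) {
    int half = packed.length;
    byte[] raw = new byte[half * 2];
    for (int i = 0; i < half; i++) {
      raw[i] = (byte) ((packed[i] & 0xFF) >> 4);
      raw[half + i] = (byte) (packed[i] & 0x0F);
    }
    return raw;
  }

  // Plain dot product over two unpacked vectors of equal length.
  static int dot(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  // Dot product of an unpacked query against a packed vector,
  // reading both nibbles in place without allocating an unpacked copy.
  static int dotPacked(byte[] query, byte[] packed) {
    int half = packed.length;
    if (query.length != half * 2) {
      throw new IllegalArgumentException("query must be twice the packed length");
    }
    int total = 0;
    for (int i = 0; i < half; i++) {
      int p = packed[i] & 0xFF;
      total += (p >> 4) * query[i];
      total += (p & 0x0F) * query[half + i];
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] stored = pack(new byte[] {1, 2, 3, 4});
    byte[] query = {5, 6, 7, 8};
    System.out.println(dotPacked(query, stored));    // prints 70
    System.out.println(dot(query, unpack(stored)));  // prints 70
  }
}
```

In this sketch, `dot(query, unpack(stored))` corresponds to the `PackedUnpacked` style of benchmark (decompress, then dot-product), while `dotPacked` corresponds to the `Packed` style that avoids the intermediate copy.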