benwtrent commented on PR #15564: URL: https://github.com/apache/lucene/pull/15564#issuecomment-3744384663
> Would it be worth adding benchmarks for int4DibitDotProduct? I think it would be interesting to compare this to both 1 and 4 bit representations. I think we're right on the edge where it might be worth comparing a 2 bit doc and an 8 bit query. Using more bits doesn't help much at 1 bit, but I feel like there's a chance it might be different at 2 bits.

Here is a quick JMH run. The 8 bit query is also transposed. Honestly, it seems the main reason this transposition gives us benefits in Panama Vector land is that masking and summing have been very slow on Panama in the past. Maybe we should revisit this for queries with larger bit sizes. But for now, I kept parity.

I am looking into running an end-to-end, recall-focused test to see how it stacks up against dibit-nibble.

```
Benchmark                                                         (size)   Mode  Cnt   Score   Error   Units
ScalarQuantizationDotProductBenchmark.int4BitDotProductScalar       1024  thrpt   15  13.507 ± 0.564  ops/us
ScalarQuantizationDotProductBenchmark.int4BitDotProductVector       1024  thrpt   15  51.669 ± 1.192  ops/us
ScalarQuantizationDotProductBenchmark.int4DibitDotProductScalar     1024  thrpt   15   6.949 ± 0.113  ops/us
ScalarQuantizationDotProductBenchmark.int4DibitDotProductVector     1024  thrpt   15  25.942 ± 1.250  ops/us
ScalarQuantizationDotProductBenchmark.int4DotProductPackedScalar    1024  thrpt   15   2.875 ± 0.070  ops/us
ScalarQuantizationDotProductBenchmark.int4DotProductPackedVector    1024  thrpt   15   2.440 ± 0.075  ops/us
ScalarQuantizationDotProductBenchmark.int7DotProductScalar          1024  thrpt   15   2.875 ± 0.035  ops/us
ScalarQuantizationDotProductBenchmark.int7DotProductVector          1024  thrpt   15   6.224 ± 0.137  ops/us
ScalarQuantizationDotProductBenchmark.int8DibitDotProductScalar     1024  thrpt   15   3.490 ± 0.060  ops/us
ScalarQuantizationDotProductBenchmark.int8DibitDotProductVector     1024  thrpt   15  12.860 ± 0.531  ops/us
ScalarQuantizationDotProductBenchmark.uint8DotProductScalar         1024  thrpt   15   2.879 ± 0.060  ops/us
ScalarQuantizationDotProductBenchmark.uint8DotProductVector         1024  thrpt   15   6.105 ± 0.299  ops/us
```

I really feel like we are leaving perf on the table here. But maybe the administration and scaling of HNSW buys us enough (e.g. dibit-byte might explore less as graph quality and scores are better...).
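
For readers unfamiliar with the transposed layout discussed in this comment, here is a minimal scalar sketch of a bit-plane dot product between a 4 bit query and a 2 bit (dibit) doc. It is not the code from the PR: the class name `TransposedDibitSketch` and the methods `transpose`, `dotProduct`, and `naiveDotProduct` are hypothetical, and the packing order (one `long` word per 64 dimensions, plane `b` holding bit `b` of every value) is an assumption made for illustration. The idea is that the dot product equals the sum, over every pair of query plane `i` and doc plane `j`, of `popcount(q_i & d_j) << (i + j)`, so no per-element masking or unpacking is needed.

```java
final class TransposedDibitSketch {

  // Hypothetical helper: split values (each < 2^bits) into `bits` bit planes.
  // Plane b holds bit b of every value, packed 64 values per long.
  static long[][] transpose(int[] values, int bits) {
    int words = (values.length + 63) / 64;
    long[][] planes = new long[bits][words];
    for (int i = 0; i < values.length; i++) {
      for (int b = 0; b < bits; b++) {
        if (((values[i] >>> b) & 1) != 0) {
          planes[b][i >>> 6] |= 1L << (i & 63);
        }
      }
    }
    return planes;
  }

  // Dot product of two transposed vectors with the same number of words per plane:
  // each (query plane i, doc plane j) pair contributes popcount(q_i & d_j) * 2^(i + j).
  static long dotProduct(long[][] queryPlanes, long[][] docPlanes) {
    long sum = 0;
    for (int i = 0; i < queryPlanes.length; i++) {
      for (int j = 0; j < docPlanes.length; j++) {
        long pop = 0;
        for (int w = 0; w < queryPlanes[i].length; w++) {
          pop += Long.bitCount(queryPlanes[i][w] & docPlanes[j][w]);
        }
        sum += pop << (i + j);
      }
    }
    return sum;
  }

  // Reference implementation on the untransposed values, for checking only.
  static long naiveDotProduct(int[] query, int[] doc) {
    long sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += (long) query[i] * doc[i];
    }
    return sum;
  }
}
```

Under these assumptions, an 8 bit query against a dibit doc would use the same loop with `transpose(query, 8)`, doubling the number of plane pairs from 8 to 16. That is roughly in line with the int8Dibit throughput above being about half of int4Dibit, although the real kernels are vectorized and may differ in other ways.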
