mccullocht opened a new pull request, #15736: URL: https://github.com/apache/lucene/pull/15736
The conversion operations to int were observed to be very slow in the async profiler. These operations widen vectors that are already at the maximum native length, so it's preferable to do fewer of them by summing the two short accumulators before widening. Summing before widening could potentially overflow, but it is sufficient to switch the integer widening to zero extend since these dot products are implicitly unsigned.

M4 Before

```
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   4.111 ± 0.028  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   3.413 ± 0.022  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   4.597 ± 0.055  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   3.418 ± 0.017  ops/us
```

M4 After

```
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   4.016 ± 0.032  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  26.741 ± 0.180  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   4.592 ± 0.041  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  26.597 ± 0.071  ops/us
```

AMD Ryzen AI 395 (AVX 512) Before

```
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.471 ± 0.059  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  11.226 ± 0.333  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.281 ± 0.028  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  12.019 ± 0.379  ops/us
```

AMD Ryzen AI 395 After

```
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.477 ± 0.053  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  27.912 ± 0.162  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.266 ± 0.028  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  40.499 ± 0.578  ops/us
```
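
To illustrate why zero extension makes it safe to sum the short accumulators before widening, here is a minimal scalar sketch of the arithmetic in a single 16-bit lane. It is not the code from this PR: the class and method names (`ZeroExtendSketch`, `addLanes`, `widenSigned`, `widenUnsigned`) are hypothetical, and it assumes the true combined sum fits in 16 unsigned bits (at most 65535), which is what "implicitly unsigned" is taken to imply here.

```java
// Hypothetical scalar model of one 16-bit SIMD lane; not Lucene code.
final class ZeroExtendSketch {

    // Adds two accumulators in 16-bit arithmetic, wrapping like a short lane would.
    static short addLanes(short a, short b) {
        return (short) (a + b);
    }

    // Sign-extending widen: the default short -> int conversion in Java.
    static int widenSigned(short lane) {
        return lane;
    }

    // Zero-extending widen: reinterprets the same 16 bits as an unsigned value.
    static int widenUnsigned(short lane) {
        return Short.toUnsignedInt(lane);
    }

    public static void main(String[] args) {
        // Two non-negative partial sums whose total exceeds Short.MAX_VALUE
        // but still fits in 16 unsigned bits (assumption stated above).
        short acc0 = 30000;
        short acc1 = 30000;

        short summed = addLanes(acc0, acc1);

        System.out.println(widenSigned(summed));   // -5536 (wrapped, wrong)
        System.out.println(widenUnsigned(summed)); // 60000 (correct)
    }
}
```

The signed view of the wrapped sum is wrong, but the bit pattern still holds the right value modulo 65536, so a single zero-extending widen of the combined accumulator recovers it. That is one widening conversion instead of two, which is the saving described above.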
