gf2121 commented on PR #14176:
URL: https://github.com/apache/lucene/pull/14176#issuecomment-2639090793
Thanks @iverase !

For the vectorized decoding, I benchmarked the decoding methods with JMH. The result on my M2 Mac:
```
Benchmark                            Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.readInts16ForUtil  thrpt   5   94.529 ± 2.886  ops/ms
BKDCodecBenchmark.readInts16Vector   thrpt   5  194.320 ± 7.082  ops/ms
BKDCodecBenchmark.readInts24ForUtil  thrpt   5   93.435 ± 5.063  ops/ms
BKDCodecBenchmark.readInts24Legacy   thrpt   5   81.779 ± 1.390  ops/ms
BKDCodecBenchmark.readInts24Vector   thrpt   5  151.203 ± 0.460  ops/ms
```
It suggests that `readInts24ForUtil` and `readInts24Legacy` do not differ much, which is consistent with the previous luceneutil result:
> The previous result was obtained with taskRepeatCount=20. I find that the speedup disappeared when taskRepeatCount was increased to 50:
> ```
>             TaskQPS  baseline  StdDevQPS  my_modified_version   StdDev             Pct diff  p-value
>   TermDayOfYearSort    196.21     (8.7%)               194.85  (11.2%)  -0.7% ( -18% - 21%)    0.871
> CountFilteredIntNRQ     84.92    (13.1%)                84.84  (12.1%)  -0.1% ( -22% - 28%)    0.987
>              IntNRQ    137.14    (20.2%)               137.30  (18.4%)   0.1% ( -31% - 48%)    0.989
>      FilteredIntNRQ    134.41    (20.0%)               135.05  (18.1%)   0.5% ( -31% - 48%)    0.954
>          TermDTSort    196.18     (9.0%)               201.19   (9.0%)   2.6% ( -14% - 22%)    0.506
> ```
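
For reference, a decode microbenchmark like the one above is typically structured as follows. This is a minimal sketch only; the class name, block size, and decode body are simplified placeholders, not the actual `BKDCodecBenchmark`:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Minimal JMH sketch of a decode microbenchmark (placeholder code, not the
// real BKDCodecBenchmark): decode one 512-value block per invocation and
// report throughput in ops/ms, matching the units in the table above.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class DecodeBenchmarkSketch {

  private static final int BLOCK_SIZE = 512; // one BKD leaf block

  private long[] encoded;
  private int[] decoded;

  @Setup
  public void setup() {
    encoded = new long[BLOCK_SIZE / 2];
    decoded = new int[BLOCK_SIZE];
    Random r = new Random(42);
    for (int i = 0; i < encoded.length; i++) {
      encoded[i] = r.nextLong();
    }
  }

  @Benchmark
  public int[] readIntsScalar() {
    // Placeholder scalar decode: unpack two ints per long via shift/mask.
    for (int i = 0; i < encoded.length; i++) {
      long l = encoded[i];
      decoded[2 * i] = (int) (l >>> 32);
      decoded[2 * i + 1] = (int) l;
    }
    // Return the array so JMH consumes it and the loop is not dead-code
    // eliminated.
    return decoded;
  }
}
```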
The vectorized decoding method using the Vector API seems to perform much better. I'll try to run luceneutil to confirm the end-to-end result. I'll keep this PR simple and leave the vectorized decoding optimization to another PR:
https://github.com/apache/lucene/pull/14203
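
For context, the general shape of Vector API decoding is a shift-and-mask over whole lanes at once. A rough sketch under my own assumptions (`shiftLongs` and its arguments are illustrative, not the code from the linked PR; needs `--add-modules jdk.incubator.vector`):

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative Vector API sketch (not the actual code in the PR above):
// compute dst[i] = (src[i] >>> shift) & mask, one vector of lanes at a time.
public class VectorDecodeSketch {

  private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

  static void shiftLongs(long[] src, long[] dst, int count, int shift, long mask) {
    int i = 0;
    for (int bound = SPECIES.loopBound(count); i < bound; i += SPECIES.length()) {
      LongVector.fromArray(SPECIES, src, i)
          .lanewise(VectorOperators.LSHR, shift)
          .and(mask)
          .intoArray(dst, i);
    }
    for (; i < count; i++) { // scalar tail for the remainder
      dst[i] = (src[i] >>> shift) & mask;
    }
  }
}
```

The win comes from doing one shift and one mask per vector of lanes instead of per value; the scalar tail handles counts that are not a multiple of the species length.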