HuaHuaY commented on PR #50281: URL: https://github.com/apache/arrow/pull/50281#issuecomment-4845192292
I've compiled the test data into three tables. `BinaryBitBlockCounterSum` is faster than `BinaryVisitTwoSetBitRunsSum` when there are many null values, which I think is expected. In this case, merging the two bitmaps results in too many consecutive intervals, making it slower than judging multiple bits at once. However, I was somewhat surprised that in x86 tests, `BinaryBitBlockCounterSum` achieved better performance when there were almost no null values. I don't understand why. There were also some unexpected findings. Currently, `BinaryVisitTwoSetBitRunsSum` performs better than `BinaryVisitTwoBitRunsSum`. This is partly due to function implementation, and partly because `SetBitRunReader` performs better than `BitRunReader`. I will later adjust `BinaryVisitTwoBitRunsSum` to use `SetBitRunReader` (`BinaryVisitTwoBitRunsSum` can directly wrap `BinaryVisitTwoSetBitRunsSum`). I haven't yet figured out why `SetBitRunReader` is faster. Based on these test results, I plan to revert the changes to `cpp/src/arrow/compute/kernels/hash_aggregate_pivot.cc`. However, I think that the real value of `BinaryVisitTwoSetBitRunsSum` lies in its ability to handle data in batches rather than row-by-row, allowing the use of vector-processing functions such as `::arrow::bit_util::SetBitsTo` or `::arrow::internal::CountSetBits`. Since I plan to use it in a future PR, I'd like to add benchmark scenarios that will showcase the advantages of `BinaryVisitTwoSetBitRunsSum`. ## 4ab7d672 / amd64-c6a-4xlarge-linux / c0d6bf7 | Benchmark | 2 | 8 | 64 | 512 | 4096 | 32768 | 65536 | |---|---:|---:|---:|---:|---:|---:|---:| | `BitBlockCounterSum` | 2.59e+8 | 6.46e+8 | 1.39e+9 | 3.58e+9 | 4.90e+9 | 5.15e+9 | 5.17e+9 | | `VisitBitRunsSum` | 3.10e+8 | 5.70e+8 | 2.07e+9 | 4.30e+9 | 4.68e+9 | 4.73e+9 | 4.73e+9 | | `VisitSetBitRunsSum` | 3.59e+8 | 6.40e+8 | 2.20e+9 | 4.58e+9 | 4.88e+9 | 4.93e+9 | 4.94e+9 | | `BinaryBitBlockCounterSum` | 1.79e+8 | 3.68e+8 | 7.40e+8 | 1.99e+9 | 3.69e+9 | 4.18e+9 | 4.22e+9 | | `BinaryVisitTwoBitRunsSum` | 1.40e+8 | 2.42e+8 | 9.38e+8 | 2.61e+9 | 3.36e+9 | 3.49e+9 | 3.52e+9 | | `BinaryVisitTwoSetBitRunsSum` | 1.78e+8 | 2.77e+8 | 1.03e+9 | 3.05e+9 | 3.94e+9 | 4.06e+9 | 4.08e+9 | ## fd5a984c / amd64-m5-4xlarge-linux / c0d6bf7 | Benchmark | 2 | 8 | 64 | 512 | 4096 | 32768 | 65536 | |---|---:|---:|---:|---:|---:|---:|---:| | `BitBlockCounterSum` | 2.00e+8 | 4.55e+8 | 9.33e+8 | 1.88e+9 | 2.26e+9 | 2.32e+9 | 2.32e+9 | | `VisitBitRunsSum` | 2.07e+8 | 3.73e+8 | 1.15e+9 | 1.98e+9 | 2.17e+9 | 2.19e+9 | 2.19e+9 | | `VisitSetBitRunsSum` | 2.34e+8 | 4.01e+8 | 1.15e+9 | 2.03e+9 | 2.24e+9 | 2.26e+9 | 2.26e+9 | | `BinaryBitBlockCounterSum` | 1.20e+8 | 2.13e+8 | 4.06e+8 | 1.02e+9 | 1.79e+9 | 2.00e+9 | 2.01e+9 | | `BinaryVisitTwoBitRunsSum` | 1.05e+8 | 1.77e+8 | 5.93e+8 | 1.34e+9 | 1.63e+9 | 1.67e+9 | 1.68e+9 | | `BinaryVisitTwoSetBitRunsSum` | 1.16e+8 | 1.83e+8 | 6.26e+8 | 1.51e+9 | 1.87e+9 | 1.91e+9 | 1.91e+9 | ## 457fe991 / test-mac-arm / c0d6bf7 | Benchmark | 2 | 8 | 64 | 512 | 4096 | 32768 | 65536 | |---|---:|---:|---:|---:|---:|---:|---:| | `BitBlockCounterSum` | 2.60e+8 | 7.22e+8 | 1.98e+9 | 5.40e+9 | 7.09e+9 | 7.29e+9 | 7.30e+9 | | `VisitBitRunsSum` | 3.41e+8 | 6.44e+8 | 1.57e+9 | 2.33e+9 | 2.57e+9 | 2.59e+9 | 2.59e+9 | | `VisitSetBitRunsSum` | 4.00e+8 | 6.36e+8 | 2.64e+9 | 9.07e+9 | 1.22e+10 | 1.29e+10 | 1.29e+10 | | `BinaryBitBlockCounterSum` | 1.79e+8 | 4.07e+8 | 1.00e+9 | 2.52e+9 | 4.76e+9 | 5.38e+9 | 5.44e+9 | | `BinaryVisitTwoBitRunsSum` | 1.75e+8 | 2.99e+8 | 1.12e+9 | 3.91e+9 | 5.64e+9 | 5.90e+9 | 5.93e+9 | | `BinaryVisitTwoSetBitRunsSum` | 2.28e+8 | 3.17e+8 | 1.21e+9 | 4.14e+9 | 6.09e+9 | 6.40e+9 | 6.42e+9 | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
