HuaHuaY commented on PR #47294:
URL: https://github.com/apache/arrow/pull/47294#issuecomment-4004037226
We found that the fewer distinct values a column has, the larger the
performance regression.
`parquet-scan-main` is compiled from the latest commit of the main branch.
`parquet-scan-old` is compiled from
`64f2055ffb68e5077420f4253e76d78952438cab`, the commit immediately preceding
this PR on the main branch.
```sh
❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu \
    ARROW_RUNTIME_SIMD_LEVEL=AVX2 hyperfine -w 5 -r 20 --sort mean-time \
    "cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0" \
    "cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0"
Benchmark 1: cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
  Time (mean ± σ):     31.1 ms ± 0.3 ms    [User: 27.2 ms, System: 3.7 ms]
  Range (min … max):   30.7 ms … 31.8 ms    20 runs

Benchmark 2: cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0
  Time (mean ± σ):     22.8 ms ± 0.4 ms    [User: 19.3 ms, System: 3.3 ms]
  Range (min … max):   22.2 ms … 23.3 ms    20 runs

Summary
  cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0 ran
    1.37 ± 0.02 times faster than cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0

❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu \
    cpp/out/build/ninja-release/release/parquet-reader --only-metadata /dev/shm/TableSink0
File Name: /dev/shm/TableSink0
Version: 2.6
Created By: cz-cpp version BuildInfo:GitBranch:release/20240820_rc8,GitVersion:73a4383,BuildTime:1725298762,CloudEnv:ALIYUN
Total rows: 6250733
Number of RowGroups: 1
Number of Real Columns: 40
Number of Columns: 40
Number of Selected Columns: 40
......
Column 6: lo_orderpriority (BYTE_ARRAY / String / UTF8)
......
--- Row Group: 0 ---
--- Total Bytes: 1016215218 ---
--- Total Compressed Bytes: 552911018 ---
--- Sort Columns:
column_idx: 5, descending: 0, nulls_first: 1
column_idx: 0, descending: 0, nulls_first: 1
--- Rows: 6250733 ---
......
Column 6
Values: 6250733, Null Values: 0, Distinct Values: 5
Max (exact: unknown): 5-LOW, Min (exact: unknown): 1-URGENT
Compression: LZ4_RAW, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
Uncompressed Size: 2267694, Compressed Size: 2092132
......
```
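As a rough cross-check of the figures above, the following Python sketch recomputes the slowdown from the rounded hyperfine means and estimates how many bits each value of column 6 occupies. The variable names are illustrative only, and the bits-per-value figure is an average over the whole uncompressed column chunk, so treat it as an approximation rather than a statement about the actual page layout.

```python
# Figures copied from the hyperfine and parquet-reader output above.
mean_main_ms = 31.1          # parquet-scan-main, mean wall time
mean_old_ms = 22.8           # parquet-scan-old, mean wall time
num_values = 6_250_733       # "Values" for column 6
distinct_values = 5          # "Distinct Values" for column 6
uncompressed_bytes = 2_267_694  # "Uncompressed Size" for column 6

# Relative slowdown of main versus old, computed from the rounded means.
slowdown = mean_main_ms / mean_old_ms

# Minimum bit width needed to store a dictionary index in [0, distinct_values).
index_bit_width = max(1, (distinct_values - 1).bit_length())

# Average uncompressed bits per value across the whole column chunk.
bits_per_value = uncompressed_bytes * 8 / num_values

print(f"slowdown: {slowdown:.2f}x")
print(f"minimum index bit width: {index_bit_width}")
print(f"average bits per value: {bits_per_value:.2f}")
```

From the rounded means this gives a slowdown of about 1.36x, consistent with the 1.37 ± 0.02 that hyperfine reports from the unrounded timings. With 5 distinct values a dictionary index needs 3 bits, and the column averages roughly 2.9 bits per value, so the scan of this column appears to consist almost entirely of decoding dictionary indices.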