HuaHuaY commented on PR #47294:
URL: https://github.com/apache/arrow/pull/47294#issuecomment-4004037226

   We found that the fewer distinct values a column has, the larger the 
performance regression.
   
   `parquet-scan-main` is built from the latest commit on the main branch.
   `parquet-scan-old` is built from 
`64f2055ffb68e5077420f4253e76d78952438cab`, the commit immediately before this 
PR on the main branch.
   
   ```sh
   ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu ARROW_RUNTIME_SIMD_LEVEL=AVX2 hyperfine -w 5 -r 20 --sort mean-time "cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0" "cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0"
   Benchmark 1: cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
     Time (mean ± σ):      31.1 ms ±   0.3 ms    [User: 27.2 ms, System: 3.7 ms]
     Range (min … max):    30.7 ms …  31.8 ms    20 runs
    
   Benchmark 2: cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0
     Time (mean ± σ):      22.8 ms ±   0.4 ms    [User: 19.3 ms, System: 3.3 ms]
     Range (min … max):    22.2 ms …  23.3 ms    20 runs
    
   Summary
     cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0 ran
       1.37 ± 0.02 times faster than cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
   
   ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu cpp/out/build/ninja-release/release/parquet-reader --only-metadata /dev/shm/TableSink0
   File Name: /dev/shm/TableSink0
   Version: 2.6
   Created By: cz-cpp version BuildInfo:GitBranch:release/20240820_rc8,GitVersion:73a4383,BuildTime:1725298762,CloudEnv:ALIYUN
   Total rows: 6250733
   Number of RowGroups: 1
   Number of Real Columns: 40
   Number of Columns: 40
   Number of Selected Columns: 40
   ......
   Column 6: lo_orderpriority (BYTE_ARRAY / String / UTF8)
   ......
   --- Row Group: 0 ---
   --- Total Bytes: 1016215218 ---
   --- Total Compressed Bytes: 552911018 ---
   --- Sort Columns:
   column_idx: 5, descending: 0, nulls_first: 1
   column_idx: 0, descending: 0, nulls_first: 1
   --- Rows: 6250733 ---
   ......
   Column 6
     Values: 6250733, Null Values: 0, Distinct Values: 5
     Max (exact: unknown): 5-LOW, Min (exact: unknown): 1-URGENT
     Compression: LZ4_RAW, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
     Uncompressed Size: 2267694, Compressed Size: 2092132
   ......
   ```
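
   As a rough cross-check of the numbers above, the following Python sketch recomputes the slowdown ratio with a first-order error propagation, the wall time per value, and the average encoded size per value for column 6. All inputs are copied from the log. The helper name `ratio_with_uncertainty` is made up for illustration, and the propagation formula assumes the two timings are independent, which is not necessarily how hyperfine derives its own `±` figure. The per-value times include process startup and file open, so they are upper bounds on decode cost.

   ```python
   import math

   # Figures copied from the hyperfine and parquet-reader output above.
   MAIN_MEAN_MS, MAIN_STD_MS = 31.1, 0.3
   OLD_MEAN_MS, OLD_STD_MS = 22.8, 0.4
   NUM_VALUES = 6_250_733
   DISTINCT_VALUES = 5
   UNCOMPRESSED_BYTES = 2_267_694


   def ratio_with_uncertainty(a, sa, b, sb):
       """Return a/b and its standard deviation, assuming independent errors."""
       ratio = a / b
       sigma = ratio * math.sqrt((sa / a) ** 2 + (sb / b) ** 2)
       return ratio, sigma


   # Slowdown of main relative to the commit before the PR.
   ratio, sigma = ratio_with_uncertainty(
       MAIN_MEAN_MS, MAIN_STD_MS, OLD_MEAN_MS, OLD_STD_MS
   )

   # Whole-process wall time divided by the number of values scanned.
   ns_per_value_main = MAIN_MEAN_MS * 1e6 / NUM_VALUES
   ns_per_value_old = OLD_MEAN_MS * 1e6 / NUM_VALUES

   # Minimum bit width needed for a dictionary index over 5 distinct values,
   # compared with the average uncompressed bits actually stored per value.
   index_bit_width = max(1, math.ceil(math.log2(DISTINCT_VALUES)))
   bits_per_value = UNCOMPRESSED_BYTES * 8 / NUM_VALUES

   print(f"slowdown: {ratio:.3f} +/- {sigma:.3f}")
   print(f"ns/value: main={ns_per_value_main:.2f} old={ns_per_value_old:.2f}")
   print(f"index bit width: {index_bit_width}, stored bits/value: {bits_per_value:.2f}")
   ```

   The stored size works out to slightly under the minimum index bit width, so the column data is almost entirely small dictionary indices, which suggests (but does not prove) that the extra time is spent in the dictionary-index decoding path.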

