HuaHuaY commented on PR #47294:
URL: https://github.com/apache/arrow/pull/47294#issuecomment-4052114256

   > > We tested using a Parquet file with data generated from SSB Flat. We found that the fewer the distinct values, the larger the performance regression.
   > > 
   > > * `parquet-scan-main` is compiled from the latest commit of the main branch.
   > > * `parquet-scan-old` is compiled from `64f2055ffb68e5077420f4253e76d78952438cab`, the commit on the main branch immediately before this PR.
   > > 
   > > Both are compiled in release mode with the `-DARROW_RUNTIME_SIMD_LEVEL=AVX2` cmake flag.
   > > ```shell
   > > ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu ARROW_RUNTIME_SIMD_LEVEL=AVX2 hyperfine -w 5 -r 20 --sort mean-time "cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0" "cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0"
   > > Benchmark 1: cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
   > >   Time (mean ± σ):      31.1 ms ±   0.3 ms    [User: 27.2 ms, System: 3.7 ms]
   > >   Range (min … max):    30.7 ms …  31.8 ms    20 runs
   > > 
   > > Benchmark 2: cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0
   > >   Time (mean ± σ):      22.8 ms ±   0.4 ms    [User: 19.3 ms, System: 3.3 ms]
   > >   Range (min … max):    22.2 ms …  23.3 ms    20 runs
   > > 
   > > Summary
   > >   cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0 ran
   > >     1.37 ± 0.02 times faster than cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
   > > 
   > > ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu cpp/out/build/ninja-release/release/parquet-reader --only-metadata /dev/shm/TableSink0
   > > File Name: /dev/shm/TableSink0
   > > Version: 2.6
   > > Created By: cz-cpp version BuildInfo:GitBranch:release/20240820_rc8,GitVersion:73a4383,BuildTime:1725298762,CloudEnv:ALIYUN
   > > Total rows: 6250733
   > > Number of RowGroups: 1
   > > Number of Real Columns: 40
   > > Number of Columns: 40
   > > Number of Selected Columns: 40
   > > ......
   > > Column 6: lo_orderpriority (BYTE_ARRAY / String / UTF8)
   > > ......
   > > --- Row Group: 0 ---
   > > --- Total Bytes: 1016215218 ---
   > > --- Total Compressed Bytes: 552911018 ---
   > > --- Sort Columns:
   > > column_idx: 5, descending: 0, nulls_first: 1
   > > column_idx: 0, descending: 0, nulls_first: 1
   > > --- Rows: 6250733 ---
   > > ......
   > > Column 6
   > >   Values: 6250733, Null Values: 0, Distinct Values: 5
   > >   Max (exact: unknown): 5-LOW, Min (exact: unknown): 1-URGENT
   > >   Compression: LZ4_RAW, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
   > >   Uncompressed Size: 2267694, Compressed Size: 2092132
   > > ......
   > > ```
   > 
   > Update from @HuaHuaY's comment: we have located the performance regression. It comes from the API change to the internal `unpack` function introduced by #47994. The quick fix is to explicitly set `UnpackOptions::max_read_bytes` instead of relying on its default value of -1. @AntoinePrv @pitrou
   
   I tested with arrow's `parquet-scan` target, and `cpp/src/arrow/util/rle_encoding_internal.h` already sets `max_read_bytes_ - bytes_fully_read`.
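
   For reference, the call pattern the discussion describes can be sketched as below. This is a minimal illustration, not Arrow's actual code: `UnpackOptions::max_read_bytes`, `max_read_bytes_`, and `bytes_fully_read` are the names mentioned in this thread, while `CanUseFastPath`, `RleDecoder::MakeOptions`, and all concrete numbers are hypothetical scaffolding.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <cstdio>
   
   // Hypothetical mirror of the option discussed above.
   struct UnpackOptions {
     // The default of -1 means "unbounded": the decoder cannot assume a byte
     // budget and must fall back to a conservative path, matching the
     // reported regression.
     int64_t max_read_bytes = -1;
   };
   
   // Stand-in for the decoder's dispatch: only an explicit byte budget
   // allows the batched (SIMD-friendly) fast path.
   bool CanUseFastPath(const UnpackOptions& options) {
     return options.max_read_bytes >= 0;
   }
   
   struct RleDecoder {
     int64_t max_read_bytes_ = 4096;   // bytes available in the current page
     int64_t bytes_fully_read = 1024;  // bytes consumed so far
   
     UnpackOptions MakeOptions() const {
       UnpackOptions options;
       // The quick fix: pass the remaining budget explicitly rather than
       // leaving the -1 default in place.
       options.max_read_bytes = max_read_bytes_ - bytes_fully_read;
       return options;
     }
   };
   
   int main() {
     RleDecoder decoder;
     assert(!CanUseFastPath(UnpackOptions{}));      // default -1: slow path
     assert(CanUseFastPath(decoder.MakeOptions())); // explicit budget: fast path
     std::printf("remaining budget: %lld bytes\n",
                 static_cast<long long>(decoder.MakeOptions().max_read_bytes));
     return 0;
   }
   ```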


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
