wgtmac commented on PR #47294: URL: https://github.com/apache/arrow/pull/47294#issuecomment-4047727372
> We tested using a Parquet file with data generated from SSB Flat. We found that the fewer the distinct values in a column, the larger the performance regression.
>
> * `parquet-scan-main` is compiled from the latest commit of the main branch.
> * `parquet-scan-old` is compiled from `64f2055ffb68e5077420f4253e76d78952438cab`, the commit on the main branch immediately before this PR.
>
> Both are compiled in release mode with the `-DARROW_RUNTIME_SIMD_LEVEL=AVX2` cmake flag.
>
> ```shell
> ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu ARROW_RUNTIME_SIMD_LEVEL=AVX2 hyperfine -w 5 -r 20 --sort mean-time "cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0" "cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0"
> Benchmark 1: cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
>   Time (mean ± σ):      31.1 ms ±   0.3 ms    [User: 27.2 ms, System: 3.7 ms]
>   Range (min … max):    30.7 ms …  31.8 ms    20 runs
>
> Benchmark 2: cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0
>   Time (mean ± σ):      22.8 ms ±   0.4 ms    [User: 19.3 ms, System: 3.3 ms]
>   Range (min … max):    22.2 ms …  23.3 ms    20 runs
>
> Summary
>   cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0 ran
>     1.37 ± 0.02 times faster than cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
>
> ❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu cpp/out/build/ninja-release/release/parquet-reader --only-metadata /dev/shm/TableSink0
> File Name: /dev/shm/TableSink0
> Version: 2.6
> Created By: cz-cpp version BuildInfo:GitBranch:release/20240820_rc8,GitVersion:73a4383,BuildTime:1725298762,CloudEnv:ALIYUN
> Total rows: 6250733
> Number of RowGroups: 1
> Number of Real Columns: 40
> Number of Columns: 40
> Number of Selected Columns: 40
> ......
> Column 6: lo_orderpriority (BYTE_ARRAY / String / UTF8)
> ......
> --- Row Group: 0 ---
> --- Total Bytes: 1016215218 ---
> --- Total Compressed Bytes: 552911018 ---
> --- Sort Columns:
>   column_idx: 5, descending: 0, nulls_first: 1
>   column_idx: 0, descending: 0, nulls_first: 1
> --- Rows: 6250733 ---
> ......
> Column 6
>   Values: 6250733, Null Values: 0, Distinct Values: 5
>   Max (exact: unknown): 5-LOW, Min (exact: unknown): 1-URGENT
>   Compression: LZ4_RAW, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
>   Uncompressed Size: 2267694, Compressed Size: 2092132
> ......
> ```

Update from @HuaHuaY's comment: we have located the source of the performance regression. It comes from the API change to the internal `unpack` function introduced by https://github.com/apache/arrow/pull/47994. The quick fix is to explicitly set `UnpackOptions::max_read_bytes` so that it does not fall back to its default value of -1.

@AntoinePrv @pitrou

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
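For illustration, a minimal sketch of what the described quick fix could look like at a call site. Only the names `UnpackOptions` and `max_read_bytes` come from the comment above; the struct definition here is a hypothetical stand-in, not the actual Arrow internal header, which may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Arrow's internal UnpackOptions (the real
// definition lives in the bit-packing internals changed by PR #47994).
struct UnpackOptions {
  // Default of -1 means "number of readable input bytes unknown",
  // which the comment identifies as the slow path.
  int64_t max_read_bytes = -1;
};

// Sketch of the quick fix: pass the actual number of bytes available
// in the input buffer instead of relying on the -1 default.
inline UnpackOptions BoundedUnpackOptions(int64_t available_bytes) {
  UnpackOptions options;
  options.max_read_bytes = available_bytes;  // avoid the -1 default
  return options;
}
```

The call site would then forward `BoundedUnpackOptions(buffer_size)` to `unpack` rather than a default-constructed `UnpackOptions`.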
