[I] parquet: `with_batch_size` default of 1024 causes ~3× full-decode slowdown vs parquet-cpp default (65536) [arrow-rs]

via GitHub Thu, 04 Jun 2026 12:49:29 -0700


erikwright opened a new issue, #10076:
URL: https://github.com/apache/arrow-rs/issues/10076


   # `with_batch_size` default of 1024 causes ~3× full-decode slowdown vs parquet-cpp default
   
   **Repository:** apache/arrow-rs
   **Version tested:** parquet 54.x (sync `ParquetRecordBatchReaderBuilder`, no async harness)
   **Comparison:** pyarrow 15.0.2 (parquet-cpp, default `batch_size = 65536`)
   
   ## Summary
   
   On a 1.046 GiB file with a wide schema (1,109,686 rows × 1674 leaves,
   mostly StringArray columns), full-schema synchronous decode in parquet-rs
   with the default `with_batch_size` (1024) is ~3.6× slower than pyarrow's
   default (`batch_size = 65536`). In the sweep below, raising parquet-rs's
   batch size to 32768 cuts decode time by roughly 60% (51.65 s to 19.77 s);
   65536 (matching pyarrow's default) performs about the same (20.07 s).
   
   This is reproducible and specific to wide schemas: the cost is per-batch
   Arrow RecordBatch construction overhead, which a larger batch size
   amortises across far fewer batches.
   
   ## Repro
   
   File shape (from a parquet-rs 55.2.0 writer, no page indexes, v1.0 page
   headers, mostly Snappy-compressed StringArrays):
   
   - `file_size = 1,096,556,954 bytes` (1.046 GiB)
   - `num_rows = 1,109,686`
   - `num_columns = 1674` leaves (deeply-nested governance/PII schema)
   - `num_row_groups = 80` (5 leading empty RGs from compactor — a
     separate parquet-cpp/-rs disagreement)
   
   Pyarrow `pf.read_row_groups(non_empty_rgs)` baseline (n=3, CoV 0.39%):
   - `t_decode = 14.67 s`
   - `decode_mibps = 71.30`
   
   Parquet-rs `ParquetRecordBatchReaderBuilder::try_new(bytes).build()`
   with default batch_size=1024 (n=3, CoV 0.54%):
   - `t_decode = 53.40 s`
   - `decode_mibps = 19.75`
   - **Ratio: 3.62× slower than pyarrow.**
   
   Same builder with `.with_batch_size(65536)` (n=2, from the sweep below):
   - `t_decode = 20.07 s`
   - `decode_mibps = 52.11`
   - **Ratio: 1.37× slower than pyarrow.**
   
   ## Batch size sweep on this file (n=2 each, warm cache)
   
   | batch_size | batches_streamed | t_decode (s) | decode_mibps | vs default |
   |-----------:|-----------------:|-------------:|-------------:|-----------:|
   | 1024 (default) | 1084 | 51.65 | 20.25 | 1.0× |
   | 8192 | 136 | 22.45 | 46.58 | 2.30× faster |
   | 32768 | 34 | **19.77** | 52.90 | **2.61× faster** |
   | 65536 | 17 | 20.07 | 52.11 | 2.57× faster |
   | 262144 | 5 | 22.55 | 46.37 | 2.29× faster (regression vs 32768) |
   
   The optimum is around 32768; beyond that, larger batch allocations
   appear to cause a mild regression (likely cache-miss pressure).
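
The derived columns in the table can be recomputed from the timings. The following is a minimal sketch, assuming `decode_mibps` means on-disk file size in MiB (2**20 bytes) divided by `t_decode`; the function names are illustrative and not part of any library.

```python
FILE_BYTES = 1_096_556_954  # file_size from the Repro section

# batch_size -> t_decode in seconds, copied from the sweep table
SWEEP = {1024: 51.65, 8192: 22.45, 32768: 19.77, 65536: 20.07, 262144: 22.55}

def decode_mibps(file_bytes: int, seconds: float) -> float:
    """Throughput in MiB/s, assuming 1 MiB = 2**20 bytes of file size."""
    return file_bytes / 2**20 / seconds

def speedup(baseline_s: float, seconds: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_s / seconds

baseline = SWEEP[1024]  # default batch size is the reference row
for batch_size, seconds in SWEEP.items():
    print(f"{batch_size:>7}  {decode_mibps(FILE_BYTES, seconds):6.2f} MiB/s  "
          f"{speedup(baseline, seconds):.2f}x")
```

Because the table's timings are rounded to two decimals, recomputed throughput may differ from the table in the last digit.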
   
   ## Theory
   
   Per
   [issue #5356](https://github.com/apache/arrow-rs/issues/5356) and
   [the C++ vs Rust default difference](https://arrow.apache.org/docs/cpp/parquet.html),
   parquet-cpp uses `batch_size = 64 * 1024 = 65536` by default; parquet-rs
   uses `1024`. At 1024 rows per batch on a 1.1M-row file, that's 1084
   RecordBatch constructions; at 65536, it's 17. Each RecordBatch
   construction does N column-array Arc allocations + per-leaf
   metadata churn; with 1674 leaves × 1084 batches = ~1.8M Array
   allocations vs 1674 × 17 ≈ 28k. This allocation amplification is
   consistent with the 380k page faults observed during a 1 GiB
   warm-cache decode.
   
   This is not a per-column decoder issue. proj-2 (23 leaves only) shows
   parquet-rs and pyarrow within 5% of each other regardless of batch
   size — at narrow schema, allocation count is small enough that the
   gap doesn't appear.
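
The batch and allocation counts in this section can be reproduced with a few lines of arithmetic. This is a rough sketch, not a model of parquet-rs internals: it assumes batches span row-group boundaries and counts exactly one array per leaf per batch, which is a lower bound (nested types also build parent arrays and buffers). The helper names are hypothetical.

```python
NUM_ROWS = 1_109_686   # num_rows from the Repro section
WIDE_LEAVES = 1674     # full schema
NARROW_LEAVES = 23     # the proj-2 case

def batch_count(num_rows: int, batch_size: int) -> int:
    """Batches yielded for num_rows rows: ceiling division, integer-only."""
    return -(-num_rows // batch_size)

def array_count(leaves: int, num_rows: int, batch_size: int) -> int:
    """Assumed lower bound on arrays built: one per leaf per batch."""
    return leaves * batch_count(num_rows, batch_size)

for batch_size in (1024, 65536):
    print(batch_size,
          batch_count(NUM_ROWS, batch_size),
          array_count(WIDE_LEAVES, NUM_ROWS, batch_size),
          array_count(NARROW_LEAVES, NUM_ROWS, batch_size))
# prints:
# 1024 1084 1814616 24932
# 65536 17 28458 391
```

Under these assumptions the 23-leaf case builds about 25k arrays even at the default batch size, roughly the same as the wide schema at 65536, which fits the observation that the gap does not appear for narrow schemas.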
   
   ## Suggestion
   
   Consider raising the default `batch_size` from 1024 to something closer
   to parquet-cpp's 65536 (or at least 8192), at least for wide-schema
   files. Based on the proj-2 result above, users with narrow schemas should
   not see a regression; users with wide schemas like this one would see
   roughly a 2.3× to 2.6× speedup with no code change.
   
   If a default bump is too aggressive, document the perf cliff
   prominently in `ParquetRecordBatchReaderBuilder::with_batch_size`'s
   rustdoc — currently the default's perf implication isn't documented.
   
   ## Repro script
   
   ```python
   # compare_pyarrow.py — pyarrow leg
   import time, pyarrow.parquet as pq
   pf = pq.ParquetFile('sample.parquet')
   non_empty = [r for r in range(pf.num_row_groups)
                if pf.metadata.row_group(r).num_rows > 0]
   t0 = time.perf_counter()
   pf.read_row_groups(non_empty)
   print('pyarrow:', time.perf_counter() - t0, 's')
   ```
   
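   ```rust
   // roofline-decode.rs: parquet-rs leg
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use std::time::Instant;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Pre-read the whole file so the timed window excludes disk I/O.
       let bytes = Bytes::from(std::fs::read("sample.parquet")?);
       let reader = ParquetRecordBatchReaderBuilder::try_new(bytes)?
           .with_batch_size(65536) // matches parquet-cpp default
           .build()?;
       let t0 = Instant::now();
       for batch in reader { let _ = batch?; }
       println!("parquet-rs: {:?}", t0.elapsed());
       Ok(())
   }
   ```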
   
   ## Environment
   
   - Host: GCE n2-highcpu-32 (32 vCPU @ 2.80 GHz, Intel Xeon)
   - Kernel: 6.8.0-1058-gcp Linux
   - File source: written by parquet-rs 55.2.0 via downstream compactor
   - Page cache: warm (file pre-read into memory before timed window)
   


