erikwright opened a new issue, #10076:
URL: https://github.com/apache/arrow-rs/issues/10076
# `with_batch_size` default of 1024 causes ~3× full-decode slowdown vs parquet-cpp default
**Repository:** apache/arrow-rs
**Version tested:** parquet 54.x (sync `ParquetRecordBatchReaderBuilder`, no
async harness)
**Comparison:** pyarrow 15.0.2 (parquet-cpp, default `batch_size = 65536`)
## Summary
On a 1.021 GiB (1,045.76 MiB) file with a wide schema (1,109,686 rows × 1674
leaves, StringArray-heavy), full-schema synchronous decode in parquet-rs with
the default `with_batch_size` (1024) is ~3.6× slower than pyarrow's default
(`batch_size = 65536`). Raising parquet-rs's batch size to 32768 closes
roughly 2/3 of the throughput gap; raising it to 65536 (matching pyarrow's
default) gives no further gain in the sweep below.
This is reproducible and depends on schema width: the per-batch Arrow
RecordBatch construction overhead grows with the number of leaves, and larger
batches amortise it across far fewer batches.
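The "roughly 2/3" figure is easiest to check in throughput terms. A minimal sketch, using only the MiB/s values reported in this issue; the helper name `gap_closed` is made up for illustration, and it mixes the n=3 baseline with the n=2 sweep numbers, so treat the result as approximate:

```python
# Fraction of the pyarrow-vs-default throughput gap recovered by a larger
# batch size. All MiB/s values are copied from the measurements in this issue.
PYARROW_MIBPS = 71.30      # pyarrow baseline
RS_DEFAULT_MIBPS = 19.75   # parquet-rs, batch_size=1024
RS_32768_MIBPS = 52.90     # parquet-rs, batch_size=32768 (sweep)
RS_65536_MIBPS = 52.11     # parquet-rs, batch_size=65536 (sweep)


def gap_closed(tuned_mibps: float) -> float:
    """Share of the (pyarrow - default) throughput gap that was recovered."""
    return (tuned_mibps - RS_DEFAULT_MIBPS) / (PYARROW_MIBPS - RS_DEFAULT_MIBPS)


frac_32768 = gap_closed(RS_32768_MIBPS)
frac_65536 = gap_closed(RS_65536_MIBPS)
print(f"batch_size=32768: {frac_32768:.0%} of the gap closed")
print(f"batch_size=65536: {frac_65536:.0%} of the gap closed")
```

Measured by wall-clock time instead of throughput, the recovered share comes out larger, so the metric should be stated whenever this fraction is quoted.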
## Repro
File shape (from a parquet-rs 55.2.0 writer, no page indexes, v1.0 page
headers, mostly Snappy-compressed StringArrays):
- `file_size = 1,096,556,954 bytes` (1,045.76 MiB, 1.021 GiB)
- `num_rows = 1,109,686`
- `num_columns = 1674` leaves (deeply-nested governance/PII schema)
- `num_row_groups = 80` (5 leading empty RGs from compactor — a
separate parquet-cpp/-rs disagreement)
Pyarrow `pf.read_row_groups(non_empty_rgs)` baseline (n=3, CoV 0.39%):
- `t_decode = 14.67 s`
- `decode_mibps = 71.30`
Parquet-rs `ParquetRecordBatchReaderBuilder::try_new(bytes).build()`
with default batch_size=1024 (n=3, CoV 0.54%):
- `t_decode = 53.40 s`
- `decode_mibps = 19.75`
- **Ratio: 3.62× slower than pyarrow.**
Same builder with `.with_batch_size(65536)` (n=2, taken from the batch size sweep):
- `t_decode = 20.07 s`
- **Ratio: ~1.37× slower than pyarrow. Most of the gap closes, but not to parity.**
## Batch size sweep on this file (n=2 each, warm cache)
| batch_size | batches_streamed | t_decode (s) | decode_mibps | vs default |
|-----------:|-----------------:|-------------:|-------------:|-----------:|
| 1024 (default) | 1084 | 51.65 | 20.25 | 1.0× |
| 8192 | 136 | 22.45 | 46.58 | 2.30× faster |
| 32768 | 34 | **19.77** | 52.90 | **2.61× faster** |
| 65536 | 17 | 20.07 | 52.11 | 2.57× faster |
| 262144 | 5 | 22.55 | 46.37 | 2.29× faster (regression vs 32768) |
The optimum is around 32768; beyond that, larger batch allocations
appear to cause a mild regression (likely cache-miss pressure).
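The derived columns of the sweep table can be recomputed from the file size, the row count, and the measured `t_decode`. The sketch below assumes that `batches_streamed` is simply `ceil(num_rows / batch_size)` (that is, batches are filled across row-group boundaries and the empty row groups contribute nothing), which matches the reported counts; `derive_row` is a hypothetical helper, not part of any harness:

```python
import math

FILE_SIZE_BYTES = 1_096_556_954
NUM_ROWS = 1_109_686

# batch_size -> measured t_decode in seconds (n=2 each, from the table above)
SWEEP_SECONDS = {1024: 51.65, 8192: 22.45, 32768: 19.77, 65536: 20.07, 262144: 22.55}


def derive_row(batch_size: int, t_decode_s: float, baseline_s: float):
    """Return (batches_streamed, decode_mibps, speedup_vs_default)."""
    batches = math.ceil(NUM_ROWS / batch_size)          # assumed batching model
    mibps = FILE_SIZE_BYTES / 2**20 / t_decode_s        # MiB decoded per second
    speedup = baseline_s / t_decode_s                   # relative to batch_size=1024
    return batches, mibps, speedup


baseline = SWEEP_SECONDS[1024]
rows = {bs: derive_row(bs, t, baseline) for bs, t in SWEEP_SECONDS.items()}

for bs, (batches, mibps, speedup) in rows.items():
    print(f"{bs:>7} {batches:>5} {mibps:6.2f} MiB/s {speedup:4.2f}x")

# The batch size with the shortest decode time in this sweep.
best_batch_size = min(SWEEP_SECONDS, key=SWEEP_SECONDS.get)
```

With only two runs per setting, the difference between 32768 and 65536 is small enough that it may be within noise.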
## Theory
Per
[issue #5356](https://github.com/apache/arrow-rs/issues/5356) and
[the C++ vs Rust default
difference](https://arrow.apache.org/docs/cpp/parquet.html),
parquet-cpp uses `batch_size = 64 * 1024 = 65536` by default; parquet-rs
uses `1024`. At 1024 rows per batch on a 1.1M-row file, that's 1084
RecordBatch constructions; at 65536, it's 17. Each RecordBatch
construction does N column-array Arc allocations + per-leaf
metadata churn; with 1674 leaves × 1084 batches = ~1.8M Array
allocations vs 1674 × 17 ≈ 28k. The allocation amplification matches
the observed 380k page faults during a 1 GiB warm-cache decode
(consistent with allocation-heavy decode).
This is not a per-column decoder issue. A narrow case, proj-2 (23 leaves
only), shows parquet-rs and pyarrow within 5% of each other regardless of
batch size: with a narrow schema, the allocation count is small enough that
the gap doesn't appear.
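The allocation argument above can be made concrete with a rough count. This is a simplified model, assuming exactly one array construction per leaf per batch; a nested schema also builds parent struct/list arrays, so real counts would be higher. `array_constructions` is a hypothetical helper used only for this estimate:

```python
import math

NUM_ROWS = 1_109_686
WIDE_LEAVES = 1674    # full schema of the file in this issue
NARROW_LEAVES = 23    # the narrow proj-2 case


def array_constructions(batch_size: int, leaves: int, rows: int = NUM_ROWS) -> int:
    """Leaf arrays built over a full decode: one per leaf per batch (assumed)."""
    batches = math.ceil(rows / batch_size)
    return batches * leaves


wide_default = array_constructions(1024, WIDE_LEAVES)
wide_cpp_default = array_constructions(65536, WIDE_LEAVES)
narrow_default = array_constructions(1024, NARROW_LEAVES)

print(f"wide,   batch_size=1024:  {wide_default:,}")
print(f"wide,   batch_size=65536: {wide_cpp_default:,}")
print(f"narrow, batch_size=1024:  {narrow_default:,}")
print(f"amplification (wide): {wide_default / wide_cpp_default:.1f}x")
```

Under this model the narrow schema at the default batch size builds fewer arrays than the wide schema does at 65536, which is consistent with the gap not showing up there.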
## Suggestion
Consider raising the default `batch_size` from 1024 to something closer
to parquet-cpp's 65536 (or at least 8192), at least for wide-schema
files. Users with narrow schemas would not see a regression; users with
wide schemas would see a 2-3× speedup with no code change.
If a default bump is too aggressive, document the perf cliff
prominently in `ParquetRecordBatchReaderBuilder::with_batch_size`'s
rustdoc — currently the default's perf implication isn't documented.
## Repro script
```python
# compare_pyarrow.py — pyarrow leg
import time, pyarrow.parquet as pq
pf = pq.ParquetFile('sample.parquet')
non_empty = [r for r in range(pf.num_row_groups)
             if pf.metadata.row_group(r).num_rows > 0]
t0 = time.perf_counter()
pf.read_row_groups(non_empty)
print('pyarrow:', time.perf_counter() - t0, 's')
```
```rust
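// roofline-decode.rs: parquet-rs leg
use std::time::Instant;

use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = Bytes::from(std::fs::read("sample.parquet")?);
    let builder = ParquetRecordBatchReaderBuilder::try_new(bytes)?
        .with_batch_size(65536); // matches parquet-cpp default
    let reader = builder.build()?;
    let t0 = Instant::now();
    for batch in reader { let _ = batch?; } // decode every batch, discard the result
    println!("parquet-rs: {:?}", t0.elapsed());
    Ok(())
}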
```
## Environment
- Host: GCE n2-highcpu-32 (32 vCPU @ 2.80 GHz, Intel Xeon)
- Kernel: 6.8.0-1058-gcp Linux
- File source: written by parquet-rs 55.2.0 via downstream compactor
- Page cache: warm (file pre-read into memory before timed window)