asuresh8 opened a new pull request, #9521: URL: https://github.com/apache/arrow-rs/pull/9521
## Which issue does this PR close?

N/A - Performance optimization

## Rationale

`GenericColumnWriter::add_data_page()` allocates fresh `Vec<u8>` buffers on every page flush: one for page assembly (`buffer`) and one for compression output (`compressed_buf`). For workloads that produce many pages (e.g., TPC-H SF1 lineitem: 17 columns x 49 row groups x ~70 pages/column = ~58K pages), this creates significant allocator overhead and cache pollution from touching fresh memory on every page.

Additionally, the V1 compression path calls `compressed_buf.shrink_to_fit()` immediately before consuming the buffer, which triggers an unnecessary realloc+copy whose result is immediately discarded when the buffer is converted to `Bytes`.

## What changes are included in this PR?

Add two persistent `Vec<u8>` fields (`page_buf` and `compressed_buf`) to `GenericColumnWriter` that are cleared and reused across data pages instead of allocating fresh buffers per page.

**Changes:**

1. Add `page_buf: Vec<u8>` and `compressed_buf: Vec<u8>` fields to `GenericColumnWriter`
2. In `add_data_page()`, replace `let mut buffer = vec![]` with `self.page_buf.clear()` and reuse the buffer
3. Replace `let mut compressed_buf = Vec::with_capacity(...)` with `self.compressed_buf.clear()` and reuse
4. Use `Bytes::copy_from_slice()` to create the final `Bytes` from the reused buffer (since `Bytes::from(Vec<u8>)` would consume the allocation)
5. Remove `shrink_to_fit()` after compression (counterproductive with buffer reuse)
6. Update `memory_size()` to include the capacity of the reusable buffers (a partial improvement to an already-approximate metric)
7. Add 3 multi-page roundtrip tests with compression (V1, V2, nullable) that validate the `.clear()` invariant across page boundaries

**Not changed:**

- Dictionary page code (only ~17 allocations total per workload, not worth the diff)
- `LevelEncoder` buffer allocation (small buffers; reuse would require API changes)
- Any public API

**Trade-off:** `Bytes::copy_from_slice` adds one memcpy per page versus the previous zero-copy `Bytes::from(Vec<u8>)`. However, the eliminated allocations (2-3 per page), the removed `shrink_to_fit` realloc, and the cache warming from reused buffers provide a net benefit. When compression is enabled, the copy operates on the smaller compressed data.

## Are these changes tested?

Yes. All 88 existing tests pass unchanged, and 3 new tests are added:

- `test_multi_page_roundtrip_with_compression` - V1 path, 5 pages, Snappy
- `test_multi_page_roundtrip_with_compression_v2` - V2 path, 5 pages, Snappy
- `test_multi_page_roundtrip_nullable_with_compression` - V1 path with nullable columns, 5 pages, Snappy

These tests write enough data to trigger multiple data pages (via `set_data_page_row_count_limit(1000)` with 5000 values), then read back and verify all values match. This validates that `.clear()` properly prevents stale data from leaking between pages.

## Are there any user-facing changes?

No. All changes are to private fields and internal methods.
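For reference, the clear-and-reuse pattern from changes 1-3 and 6 can be sketched as below. This is a simplified illustration, not the actual parquet-rs code: `PageWriter` and its methods are hypothetical stand-ins for the `GenericColumnWriter` internals.

```rust
// Simplified sketch of the clear-and-reuse pattern: the writer owns the
// buffers, and each page flush clears them instead of allocating fresh ones.
struct PageWriter {
    page_buf: Vec<u8>,       // reused page-assembly buffer
    compressed_buf: Vec<u8>, // reused compression-output buffer
}

impl PageWriter {
    fn new() -> Self {
        Self {
            page_buf: Vec::new(),
            compressed_buf: Vec::new(),
        }
    }

    /// Assemble a page into the persistent buffer. `clear()` resets the
    /// length to 0 but keeps the capacity, so after the first page no
    /// further allocation happens for pages of similar size.
    fn add_data_page(&mut self, values: &[u8]) -> usize {
        self.page_buf.clear(); // instead of `let mut buffer = vec![]`
        self.page_buf.extend_from_slice(values);
        self.page_buf.len()
    }

    /// Approximate heap footprint; the retained buffers' capacity now
    /// counts toward it, since it persists between pages.
    fn memory_size(&self) -> usize {
        self.page_buf.capacity() + self.compressed_buf.capacity()
    }
}
```

The key property is that the second and subsequent pages hit a warm, already-sized buffer, so the per-page allocator traffic disappears.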
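On change 5: `Vec::shrink_to_fit()` may reallocate the buffer down toward its length, work that is thrown away when the bytes are then converted to `Bytes`, and counterproductive once the buffer is reused, since the next page must grow it again. A minimal demonstration (the `demo_shrink` helper is illustrative, not code from the PR):

```rust
// Observe the capacity change caused by `shrink_to_fit`: a mostly-empty
// buffer may be reallocated and copied just to drop its spare capacity.
fn demo_shrink(mut v: Vec<u8>) -> (usize, usize) {
    let before = v.capacity();
    v.shrink_to_fit(); // potential realloc + copy
    (before, v.capacity())
}
```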
## Benchmark Results

`cargo bench -p parquet --bench arrow_writer` on Apple M-series:

| Category | Improved (>2%) | Regressed (>2%) | Neutral |
|----------|----------------|-----------------|---------|
| Count    | 13             | 0               | 42      |

Aggregate: -0.76% (excluding `float_with_nans` noise, confirmed by re-run)

Notable improvements:

- `primitive_non_null/zstd_parquet_2`: -4.8%
- `primitive_non_null/bloom_filter`: -4.0%
- `bool_non_null/zstd_parquet_2`: -3.4%
- `string/zstd_parquet_2`: -3.4%
- `string/parquet_2`: -3.0%
- `string_dictionary/default`: -2.7%

No confirmed regressions (the initial `float_with_nans` regression was measurement noise, confirmed by a subsequent run with p > 0.05).
