adriangb opened a new pull request, #10020:
URL: https://github.com/apache/arrow-rs/pull/10020

   # Pluggable page spilling for the Parquet `ArrowWriter`
   
   ## Which issue does this close?
   
   Draft / design proposal — opening early for feedback. No tracking issue yet.
   
   ## Rationale for this change
   
   When `ArrowWriter` writes a row group, record batches arrive with all columns
   interleaved, but each Parquet column chunk must be contiguous on disk. So 
every
   column's *compressed* pages are buffered for the whole row group and only
   spliced into the output at flush. Peak `ArrowWriter` memory is therefore
   ≈ Σ(compressed bytes of every column chunk) for one row group, and it grows 
with
   the row group size.
   
   Today the only lever against this is `ArrowWriter::memory_size()` + flushing
   smaller row groups — which trades away compression and read-time 
(page/row-group)
   pruning. This PR makes the page buffer **pluggable** so completed pages can 
be
   spilled off the heap (temp file, object storage, …), decoupling *row-group 
size*
   (a read-time tuning knob) from *peak write memory*, while keeping large,
   read-optimal row groups.
   
   ## What changes are included in this PR?
   
   A small, intentionally "dumb" key/value store trait and its wiring, in three
   stacked commits:
   
   1. **`PageStore` + `PageKey` + `PageStoreFactory` + `InMemoryPageStore`, 
wired
      into `ArrowWriter`.** The store maps an opaque, store-allocated `PageKey` 
to a
      blob of bytes and knows nothing about pages, dictionaries, ordering, or
      offsets — the caller keeps the handles and decides what they mean. The
      default `InMemoryPageStore` (a `Vec<Bytes>`) is byte-for-byte equivalent 
to
      the previous buffering with zero overhead. A `PageStoreFactory` is 
threaded
      through `ArrowWriterOptions::with_page_store_factory` →
      `ArrowRowGroupWriterFactory` → `ArrowColumnWriterFactory`.
   2. **Stream column chunks out of the store at splice.** Replaces the
      materialize-then-copy splice with a `Read` that takes each page blob back 
out
      of the store in write order *as it is consumed* and releases it 
immediately,
      so the splice never holds more than one page in memory at a time 
(essential
      for a spilling backend on skewed schemas). `append_column` is unchanged 
for
      external `ChunkReader` callers.
   3. **Public `PageKey::new`/`get`** (so external backends can mint their own
      handles) **+ a dhat memory regression test** with a temp-file backend.
   
   ## Are these changes tested?
   
   Yes:
   
   - A **byte-identical round-trip test** using a custom `PageStore` with 
sparse,
     non-contiguous, `HashMap`-backed handles, proving the writer relies only on
     the opaque-handle contract (not on handles being dense `Vec` indices) 
across
     dictionary and non-dictionary columns and multiple row groups.
   - Unit tests for the in-memory backend contract.
   - An **always-on `dhat` integration test** 
(`parquet/tests/page_spill_memory.rs`)
     that measures peak heap while writing a skewed ~16 MiB single row group:
   
     | page store        | peak heap |
     |-------------------|-----------|
     | in-memory default | ~18.3 MiB |
     | temp-file spill   | ~3.2 MiB  |
   
     i.e. the spilling backend bounds peak write memory by the in-flight
     encoder/dictionary buffers rather than the row group size.
   
   ## Are there any user-facing changes?
   
   New, additive public API (default behavior unchanged):
   
   - `ArrowWriterOptions::with_page_store_factory`
   - `PageStore`, `PageKey`, `PageStoreFactory`, `InMemoryPageStore`,
     `InMemoryPageStoreFactory` (re-exported from 
`parquet::arrow::arrow_writer`,
     defined in `parquet::column::page_store`).
   
   ## Follow-up (deliberately not in this PR)
   
   This PR bounds the **dominant** accumulation point — the per-column-chunk 
page
   buffer (`ArrowPageWriter`), which is the *only* thing in the way for
   non-dictionary columns.
   
   It does **not** yet bound a second accumulation point: **dictionary-encoded
   columns**. Their completed data pages are buffered in
   `GenericColumnWriter.data_pages` until `close()` (the dictionary page must be
   written first, and isn't final until all values are seen), so they never 
reach
   the page writer/store during encoding. Verified with the same dhat harness: a
   low-cardinality 4.2M-row column peaks at ~2.50 MiB with *both* the in-memory 
and
   the temp-file store — spilling gives no benefit there, and it grows with row
   count.
   
   Closing this requires routing dict-column data pages through the store during
   encoding, holding the (bounded) dictionary page in memory, and resolving the
   dict-first ordering + page offsets at splice. It touches the core
   `GenericColumnWriter` shared with the column-at-a-time `SerializedFileWriter`
   path, so it's tracked as a follow-up rather than bundled here.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to