Re: [Discuss] Column Update File Representation

Matt Butrovich Wed, 20 May 2026 12:50:57 -0700

Thanks for writing this up, Anurag, and everyone else for all the analysis
and discussion so far!


Re: "I did not find scatter in arrow-rs." I think this could be overcome
with take() and materialization, but it could be expensive. Based on what
we've seen with joins these scatter codepaths are often the most expensive
part because they can be extremely cache-unfriendly, especially with larger
data types.

Discussion seems to be leaning toward Option 1, so I went through a thought
experiment of what this might look like in iceberg-rust, assuming
continuing to use arrow-rs over Parquet data files:

1. Plan-time: FileScanTask carries the column_files list for the base file.
Each entry is opaque to scan planning beyond its file_path and field_ids.
2. File open: for each task, open the base file plus one reader per
referenced column-update file, in parallel, reusing the existing
ArrowFileReader/AsyncFileReader plumbing.
3. Per-file projection: partition the requested field_ids by their source
file. Each reader's ProjectionMask covers only the field_ids it owns.
Field_ids overridden by an update file are excluded from the base reader's
mask, so Parquet does not fetch or decode their column chunks. The base
file is read as if those columns were not projected.
4. Coordinated row selection: the deletion vector and any predicate-driven
row selection are translated to a single file-global RowSelection and
applied identically to every reader. The same set of rows is therefore
selected from every file.
5. Stitch stage: assemble each output batch by selecting columns from the
per-file batches according to field_id ownership. With aligned row counts
this is an Arc-level column swap; near row group boundaries readers diverge
and re-alignment requires occasional buffer copies (see Risks).
6. Downstream: the existing RecordBatchTransformer operates on the stitched
batch unchanged.

Some care needs to be done with predicate handling:
- Predicates over field_ids owned by exactly one file are pushed into that
file's RowFilter.
- Predicates spanning multiple files cannot be pushed; they evaluate
post-stitch via arrow::compute::filter.
- Row group statistics pruning is per-file but the decision is global. Each
file evaluates its own predicates against its own row group stats; the
resulting "could match" row ranges are intersected with each other and with
the DV to produce a single file-global RowSelection shared by all readers.
Each reader then derives its own with_row_groups list from that selection.
Row group boundaries differ between files but row counts stay aligned
because every reader applies the same selection.

It seems like every primitive we would need is in arrow-rs and the
iceberg-rust reader pipeline already.

I have one concern wrt row groups, but this mostly comes down to
implementation. Just documenting for posterity/if I have to implement this:

Parquet row groups are byte-sized and differ across files, and arrow-rs's
batch boundaries follow row group boundaries, so stitching across
misaligned files requires periodic buffer copies (concat-refill at each row
group boundary) to keep readers in sync. An upstream arrow-rs API that
allows the caller to request a short read at a chosen row count would let
the stitcher force re-alignment without copies; absent that, the concat
cost is the practical floor.

All of that said, neither option 1 or option 2 seem impossible in
iceberg-rust/arrow-rs, so I don't think there are blockers from that
standpoint.

-Matt

On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]>
wrote:

> Minor addition to the storage overhead with writing NULLs as filling
> values:
> If we assume Parquet V2 and we write the _pos to the update file, with 20%
> of the rows deleted will totally amortize the storage overhead of writing
> NULLs. The reason is that with the "positional aligned" updates we have a
> complete sequence of the _pos that is efficiently compressed with delta
> encoding, while with the "applied deletes" approach there are holes in the
> _pos sequence, hence delta encoding is slightly less efficient.
>
> Just some number to visualize with 20% deletion ratio:
> 1 int col _pos + 1 int
> Omit deleted rows 9735745 10111232
> Null deleted rows 10189522 10205497
> Overhead 4,66% 0,93%
>
> Gabor
>
>
> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. 20.,
> Sze, 16:50):
>
>> Hey Iceberg Community,
>>
>> Thanks Anurag for starting a dedicated thread on this! Just a couple of
>> thoughts from my side:
>>
>> *Storage overhead:*
>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
>> issue.
>>
>> *Details:*
>> I made a couple of measurements recently on the storage size of the
>> update files. For numbers, see the "Update file size measurements" tab in 
>> this
>> doc
>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
>> .
>>
>> 1) Writing _pos column:
>>
>>    - File formats could hide the overhead for the _pos column, e.g.
>>    Parquet V2 has negligible (<0,5%) overhead
>>    - I recall there was an argument that having _pos even for
>>    pos-aligned updates might help us debug writer issues to see what is
>>    missing/icorrect in the update file
>>    - On the write path we already read and sort by _pos so there should
>>    be no overhead either
>>
>> In summary, I slightly lean towards always having _pos in the updates
>> regardless what representation we choose.
>> If we can agree on this, the relevant pros and cons around _pos are
>> obsolete.
>>
>> 2) Filling rows/values for deleted rows:
>>
>>    - I measured noticeable overhead when filling fields for deleted rows
>>    with NULLs. As always the overhead depends on many stuff
>>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>>    - Since the consensus is constantly eliminating deletes, I think this
>>    overhead is acceptable
>>
>> *Factors to consider to choose representation*
>> As described above, the storage size shouldn't be a factor here, I'd not
>> consider the presence of _pos as pro or con.
>>
>> If we scratch the storage size related ones, here is what we have left:
>>
>> *1) Accuracy of stats*
>> By filling missing rows with auxiliary values the stats can go off
>>
>>    - I'm not that worried about Parquet footer stats, they represent the
>>    file itself, not the logical deletes on top
>>    - If we went for NULLs as filling values, we could set each field in
>>    a deleted row to NULL (not the entire row). As a result the column-level
>>    null count in the table metadata can be off because of the filling values.
>>    - Technically, if we want to, we can correct these stats if we
>>    collected the number of deleted rows and then adjust the null count by 
>> this
>>    number
>>    - In general, deletes make stats off anyway regardless of column
>>    updates (probably not the null count but the avg size, though)
>>
>> I don't think that this should be a decisive factor when choosing
>> representation.
>>
>> *2) Complexity in general*
>>     a) Read - stitching
>>
>>    - Positional alignment approach is straightforward for stitching both
>>    in vectorized and row readers
>>       - Stitch before applying deletes
>>    - Applied deletes approach seems more complicated initially
>>       - Scattering of update rows is required
>>       - Might not be supported for all the language implementations ATM.
>>       Thanks Anurag for taking a look!
>>
>>     b) Write
>>
>>    - Applied deletes approach is straightforward
>>    - Positional alignment approach has one major complexity:
>>       - when trailing rows are deleted, the writer currently has no
>>       information how many rows to fill
>>       - In the PoC we broadcast a "file path to row count" so that the
>>       writers can now if trailing rows have to be filled with nulls, and 
>> with how
>>       many (comparing to the _pos column)
>>       - In theory we could simply not fill trailing deleted rows, but
>>       then we have a hybrid approach between positional alignment and applied
>>       deletes. Probably, we don't want this complexity in the spec
>>
>> *3) Read and write performance*
>>
>>    - I'm not expecting any difference in write perf
>>    - Read could have a toll on the "applied deletes" approach due to
>>    scattering. *@pvary* might have some more insights here.
>>
>> *Summary*
>> I hope this summarizes all we have discussed from a different angle and
>> might narrow down the areas to look for. I think the two main questions to
>> sort out to make a decision are:
>>     1) Can we find a way to take care of trailing deletes in "positional
>> aligned" approach (or are we fine not filling trailing deletes)
>>     2) What is the cost of scattering the update rows in the "applied
>> deletes" approach
>>         2/b) Is scattering feasible on all language implementations
>>
>> Best Regards,
>> Gabor
>>
>>
>> Anurag Mantripragada <[email protected]> ezt írta (időpont:
>> 2026. máj. 20., Sze, 2:37):
>>
>>> Hi all,
>>>
>>> Following up on the column updates design
>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>>  and
>>> the original discussion thread
>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
>>> like to start a focused discussion on how column update files should
>>> represent rows when deletion vectors (DVs) are present.
>>>
>>> *Context*
>>>
>>> We've reached consensus on using a dense representation for column
>>> update files. When a column is updated, the column file contains values for
>>> all rows including unchanged rows. This avoids complex merge logic on the
>>> write path when successive updates target overlapping fields.
>>>
>>> The open question is: what should the column file contain at positions
>>> where the base file has deleted rows? There are two options.
>>>
>>> *Option 1*: Positional Alignment (row count matches base file)
>>>
>>> The column file has exactly base_file.record_count rows. Row N in the
>>> column file corresponds to row N in the base file. Deleted positions
>>> contain filler values (e.g., NULLs).
>>>
>>> Pros*:*
>>>
>>>    - Stitching is a zero-copy column swap in Arrow
>>>    - Works identically in every Arrow implementation (Java, Rust,
>>>    Python, C++)
>>>    - No _pos column needed
>>>    - Engines apply their existing DV filter to both base and column file
>>>
>>> Cons*:*
>>>
>>>    - Filler values at deleted positions skew Parquet footer statistics
>>>    (null_count, avg_length)
>>>    - Writes slightly more data than necessary (filler values for
>>>    deleted rows)
>>>    - Writer must know base_file.record_count to pad trailing deletions
>>>    (base file metadata already available during write planning)
>>>
>>> *Option 2*: Applied Deletes (row count = live rows only)
>>>
>>> The column file contains only live rows (after applying DVs). A _pos column
>>> maps each row back to its ordinal position in the base file.
>>>
>>> Pros*:*
>>>
>>>    - Only stores valid rows in column update files.
>>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>>    deleted positions)
>>>    - Slightly smaller file (no filler bytes)
>>>
>>> Cons*:*
>>>
>>>    - _pos adds storage overhead (Encoding must be left to the file
>>>    format)
>>>    - Stitching requires a scatter operation to allocate a new array and
>>>    place values at the correct positions
>>>    - It's not zero-copy in Arrow and requires manipulation.
>>>    - As it stands today this might be  harder for non-Java engines (see
>>>    section below)
>>>
>>> I investigated how three Iceberg implementations handle vectorized
>>> reading and what column stitching would require in each. The key
>>> architectural difference is how they expose Arrow memory:
>>>
>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>>> override getInt(rowId)to redirect reads without copying data. This
>>> makes scatter operations appear "free" via virtual dispatch. My POC uses
>>> this approach.
>>>
>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>>> a scatter() primitive (in the  version we use).
>>>
>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>>> materialize new arrays via ArrayBuilder for any non-trivial column
>>> manipulation.
>>>
>>> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
>>> method to stitch column files. I found the following implementations in
>>> Arrow which can be used to stitch.
>>>
>>>
>>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>>    20.0.0)
>>>
>>>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>>>    23.0.0)
>>>    - I did not find scatter in arrow-rs.
>>>
>>> I'm still researching these options and would love to hear from
>>> everyone.
>>>
>>> Thanks,
>>> Anurag
>>>
>>>

Re: [Discuss] Column Update File Representation

Reply via email to