Hi Anurag,

Thanks for the write-up!

Just to chime in with a long-term perspective: If we want to support
Column Families (permanently isolating sets of columns into separate
files) in the future, column files without _pos (option 1) are more
naturally compatible. They allow zero-copy stitching based on implicit
row alignment, avoiding the complexity of managing divergent _pos
columns across multiple column families.

Best,
Gang

On Wed, May 20, 2026 at 8:37 AM Anurag Mantripragada
<[email protected]> wrote:
>
> Hi all,
>
> Following up on the column updates design and the original discussion thread, 
> I'd like to start a focused discussion on how column update files should 
> represent rows when deletion vectors (DVs) are present.
>
> Context
>
> We've reached consensus on using a dense representation for column update 
> files. When a column is updated, the column file contains values for all rows 
> including unchanged rows. This avoids complex merge logic on the write path 
> when successive updates target overlapping fields.
>
> The open question is: what should the column file contain at positions where 
> the base file has deleted rows? There are two options.
>
> Option 1: Positional Alignment (row count matches base file)
>
> The column file has exactly base_file.record_count rows. Row N in the column 
> file corresponds to row N in the base file. Deleted positions contain filler 
> values (e.g., NULLs).
>
> Pros:
>
> Stitching is a zero-copy column swap in Arrow
> Works identically in every Arrow implementation (Java, Rust, Python, C++)
> No _pos column needed
> Engines apply their existing DV filter to both base and column file
>
> Cons:
>
> Filler values at deleted positions skew Parquet footer statistics 
> (null_count, avg_length)
> Writes slightly more data than necessary (filler values for deleted rows)
> Writer must know base_file.record_count to pad trailing deletions (base file 
> metadata already available during write planning)
>
> Option 2: Applied Deletes (row count = live rows only)
>
> The column file contains only live rows (after applying DVs). A _pos column 
> maps each row back to its ordinal position in the base file.
>
> Pros:
>
> Only stores valid rows in column update files.
> Parquet footer statistics are accurate (no skew from NULLs at deleted 
> positions)
> Slightly smaller file (no filler bytes)
>
> Cons:
>
> _pos adds storage overhead (Encoding must be left to the file format)
> Stitching requires a scatter operation to allocate a new array and place 
> values at the correct positions
> It's not zero-copy in Arrow and requires manipulation.
> As it stands today this might be  harder for non-Java engines (see section 
> below)
>
> I investigated how three Iceberg implementations handle vectorized reading 
> and what column stitching would require in each. The key architectural 
> difference is how they expose Arrow memory:
>
>  Java/Spark: Spark's ColumnVector is an abstract class. We can override 
> getInt(rowId)to redirect reads without copying data. This makes scatter 
> operations appear "free" via virtual dispatch. My POC uses this approach.
>
> PyIceberg: Uses PyArrow's native arrays. I could not find a way to override 
> what array[i] returns. PyArrow has take() (gather) but lacks a scatter() 
> primitive (in the  version we use).
>
> iceberg-rust: Uses arrow-rs arrays, which are concrete structs (not trait 
> objects). Int32Array::value(i) is a direct memory offset. Must materialize 
> new arrays via ArrayBuilder for any non-trivial column manipulation.
>
> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter method 
> to stitch column files. I found the following implementations in Arrow which 
> can be used to stitch.
>
> C++ (Since Arrow 20.0.0)
> Python (Since Arrow 23.0.0)
> I did not find scatter in arrow-rs.
>
> I'm still researching these options and would love to hear from everyone.
>
> Thanks,
> Anurag

Reply via email to