Hi Anurag, Thanks for the write-up!
Just to chime in with a long-term perspective: If we want to support Column Families (permanently isolating sets of columns into separate files) in the future, column files without _pos (option 1) are more naturally compatible. They allow zero-copy stitching based on implicit row alignment, avoiding the complexity of managing divergent _pos columns across multiple column families. Best, Gang On Wed, May 20, 2026 at 8:37 AM Anurag Mantripragada <[email protected]> wrote: > > Hi all, > > Following up on the column updates design and the original discussion thread, > I'd like to start a focused discussion on how column update files should > represent rows when deletion vectors (DVs) are present. > > Context > > We've reached consensus on using a dense representation for column update > files. When a column is updated, the column file contains values for all rows > including unchanged rows. This avoids complex merge logic on the write path > when successive updates target overlapping fields. > > The open question is: what should the column file contain at positions where > the base file has deleted rows? There are two options. > > Option 1: Positional Alignment (row count matches base file) > > The column file has exactly base_file.record_count rows. Row N in the column > file corresponds to row N in the base file. Deleted positions contain filler > values (e.g., NULLs). > > Pros: > > Stitching is a zero-copy column swap in Arrow > Works identically in every Arrow implementation (Java, Rust, Python, C++) > No _pos column needed > Engines apply their existing DV filter to both base and column file > > Cons: > > Filler values at deleted positions skew Parquet footer statistics > (null_count, avg_length) > Writes slightly more data than necessary (filler values for deleted rows) > Writer must know base_file.record_count to pad trailing deletions (base file > metadata already available during write planning) > > Option 2: Applied Deletes (row count = live rows only) > > The column file contains only live rows (after applying DVs). A _pos column > maps each row back to its ordinal position in the base file. > > Pros: > > Only stores valid rows in column update files. > Parquet footer statistics are accurate (no skew from NULLs at deleted > positions) > Slightly smaller file (no filler bytes) > > Cons: > > _pos adds storage overhead (Encoding must be left to the file format) > Stitching requires a scatter operation to allocate a new array and place > values at the correct positions > It's not zero-copy in Arrow and requires manipulation. > As it stands today this might be harder for non-Java engines (see section > below) > > I investigated how three Iceberg implementations handle vectorized reading > and what column stitching would require in each. The key architectural > difference is how they expose Arrow memory: > > Java/Spark: Spark's ColumnVector is an abstract class. We can override > getInt(rowId)to redirect reads without copying data. This makes scatter > operations appear "free" via virtual dispatch. My POC uses this approach. > > PyIceberg: Uses PyArrow's native arrays. I could not find a way to override > what array[i] returns. PyArrow has take() (gather) but lacks a scatter() > primitive (in the version we use). > > iceberg-rust: Uses arrow-rs arrays, which are concrete structs (not trait > objects). Int32Array::value(i) is a direct memory offset. Must materialize > new arrays via ArrayBuilder for any non-trivial column manipulation. > > TL;DR: If we choose Option 2 (applied deletes), engines need a scatter method > to stitch column files. I found the following implementations in Arrow which > can be used to stitch. > > C++ (Since Arrow 20.0.0) > Python (Since Arrow 23.0.0) > I did not find scatter in arrow-rs. > > I'm still researching these options and would love to hear from everyone. > > Thanks, > Anurag
