One thing to add here: We're scoping internal work on our side to validate the dense- column-updates direction against Delta -- design alignment plus re-verifying the older measurements under current builds. Marco Kroll is driving that effort and is the right person from our side to engage here on implementation realities going forward.
Marco -- want to drop into the discussion? Your read on the representation choice from a Delta writer-implementation perspective would be useful as the validation work progresses. (The crossover numbers Marco shared on the parent thread came from the same older measurement; treat as recollection pending re-verification.) On iceberg-go, echoing Matt's read on iceberg-rust: neither option is a blocker, but Option 2 is materially more work given where arrow-go's compute surface sits today. Not a hard constraint, but rather one more thumb on the Option 1 scale. Andrei On Thu, May 21, 2026 at 6:07 PM Andrei Tserakhau < [email protected]> wrote: > Hi Peter, all, > > Four things from the PoC, bigger ones first. > Nothing blocking TBH, but some stuff need attention. > > == Row-group alignment within a file == > > Echoing Matt's read -- this is the real practical floor. Our PoC > forced 1:1 base->column-file at the file level (disabling > adaptive file sizing on the writer side), but within a file the > row group boundaries drifted between base and column file. We > took the concat-refill cost at every drifting boundary and moved > on; never pursued the deeper fix. The explicit "force short read > at row count N" API Matt described in arrow-rs would let > stitchers re-align without copies, which feels like the right > upstream ask if iceberg-rust pushes Option 1 to production. > > For Java/Spark the same problem exists but vectorized batch sizes > hide most of it because Spark already does the equivalent of > concat-refill at batch boundaries. The Rust path is genuinely > harder here, not just due to scatter. > > == Stats correction needs explicit ownership in the spec == > > Gabor's null_count adjustment is right, but the spec should also > state WHICH stats the writer must adjust before publishing to > manifests vs which engines can leave raw and reconcile at read > time. The set we saw matter: > > - null_count: must be adjusted (filler NULLs inflate it) > - value_count: if left raw, CBO cardinality estimates overcount > by the delete count -- affects join strategy choices > - min / max: stays correct, no adjustment needed > - IS NULL fast paths: engines that prune via stats over-match; > correct results post-DV-filter but wasted I/O > > Leaving any of these undefined is a recipe for engine divergence > at predicate pushdown time -- two implementations will make > different join-plan decisions on the same table > > == Compaction timing == > > We didn't reach compaction in the PoC. The mechanics (merge > column files into the base) are clear enough, but the timing > policy -- every snapshot? read-pressure-driven? manual via > RewriteDataFiles with a column-file-aware strategy? -- was never > settled. Probably fine to leave the timing to engines, but worth > describing the merge mechanics in the spec so engines don't > diverge there. > > == Schema evolution == > > Worth calling out so this doesn't get relitigated later: Iceberg's > stable field_ids handle column rename and add cleanly because > column files reference field_ids, not names. Type widening > reduces to the same promotion logic the reader already applies > to the base file. We had to work around this on Delta (names not > IDs), which limited what we could test, but the Iceberg design > gets it for free. > > Andrei > > On Thu, May 21, 2026 at 3:47 PM Péter Váry <[email protected]> > wrote: > >> Thanks everyone for taking a look! >> I could echo the concerns above, so I won’t duplicate them. >> @Andrei: given your prior experience, were there any blockers we haven’t >> identified yet? >> >> >> Matt Butrovich <[email protected]> ezt írta (időpont: 2026. máj. 20., >> Sze, 21:51): >> >>> Thanks for writing this up, Anurag, and everyone else for all the >>> analysis and discussion so far! >>> >>> Re: "I did not find scatter in arrow-rs." I think this could be overcome >>> with take() and materialization, but it could be expensive. Based on what >>> we've seen with joins these scatter codepaths are often the most expensive >>> part because they can be extremely cache-unfriendly, especially with larger >>> data types. >>> >>> Discussion seems to be leaning toward Option 1, so I went through a >>> thought experiment of what this might look like in iceberg-rust, assuming >>> continuing to use arrow-rs over Parquet data files: >>> >>> 1. Plan-time: FileScanTask carries the column_files list for the base >>> file. Each entry is opaque to scan planning beyond its file_path and >>> field_ids. >>> 2. File open: for each task, open the base file plus one reader per >>> referenced column-update file, in parallel, reusing the existing >>> ArrowFileReader/AsyncFileReader plumbing. >>> 3. Per-file projection: partition the requested field_ids by their >>> source file. Each reader's ProjectionMask covers only the field_ids it >>> owns. Field_ids overridden by an update file are excluded from the base >>> reader's mask, so Parquet does not fetch or decode their column chunks. The >>> base file is read as if those columns were not projected. >>> 4. Coordinated row selection: the deletion vector and any >>> predicate-driven row selection are translated to a single file-global >>> RowSelection and applied identically to every reader. The same set of rows >>> is therefore selected from every file. >>> 5. Stitch stage: assemble each output batch by selecting columns from >>> the per-file batches according to field_id ownership. With aligned row >>> counts this is an Arc-level column swap; near row group boundaries readers >>> diverge and re-alignment requires occasional buffer copies (see Risks). >>> 6. Downstream: the existing RecordBatchTransformer operates on the >>> stitched batch unchanged. >>> >>> Some care needs to be done with predicate handling: >>> - Predicates over field_ids owned by exactly one file are pushed into >>> that file's RowFilter. >>> - Predicates spanning multiple files cannot be pushed; they evaluate >>> post-stitch via arrow::compute::filter. >>> - Row group statistics pruning is per-file but the decision is global. >>> Each file evaluates its own predicates against its own row group stats; the >>> resulting "could match" row ranges are intersected with each other and with >>> the DV to produce a single file-global RowSelection shared by all readers. >>> Each reader then derives its own with_row_groups list from that selection. >>> Row group boundaries differ between files but row counts stay aligned >>> because every reader applies the same selection. >>> >>> It seems like every primitive we would need is in arrow-rs and the >>> iceberg-rust reader pipeline already. >>> >>> I have one concern wrt row groups, but this mostly comes down to >>> implementation. Just documenting for posterity/if I have to implement this: >>> >>> Parquet row groups are byte-sized and differ across files, and >>> arrow-rs's batch boundaries follow row group boundaries, so stitching >>> across misaligned files requires periodic buffer copies (concat-refill at >>> each row group boundary) to keep readers in sync. An upstream arrow-rs API >>> that allows the caller to request a short read at a chosen row count would >>> let the stitcher force re-alignment without copies; absent that, the concat >>> cost is the practical floor. >>> >>> All of that said, neither option 1 or option 2 seem impossible in >>> iceberg-rust/arrow-rs, so I don't think there are blockers from that >>> standpoint. >>> >>> -Matt >>> >>> On Wed, May 20, 2026 at 11:38 AM Gábor Kaszab <[email protected]> >>> wrote: >>> >>>> Minor addition to the storage overhead with writing NULLs as filling >>>> values: >>>> If we assume Parquet V2 and we write the _pos to the update file, with >>>> 20% of the rows deleted will totally amortize the storage overhead of >>>> writing NULLs. The reason is that with the "positional aligned" updates we >>>> have a complete sequence of the _pos that is efficiently compressed with >>>> delta encoding, while with the "applied deletes" approach there are holes >>>> in the _pos sequence, hence delta encoding is slightly less efficient. >>>> >>>> Just some number to visualize with 20% deletion ratio: >>>> 1 int col _pos + 1 int >>>> Omit deleted rows 9735745 10111232 >>>> Null deleted rows 10189522 10205497 >>>> Overhead 4,66% 0,93% >>>> >>>> Gabor >>>> >>>> >>>> Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. >>>> 20., Sze, 16:50): >>>> >>>>> Hey Iceberg Community, >>>>> >>>>> Thanks Anurag for starting a dedicated thread on this! Just a couple >>>>> of thoughts from my side: >>>>> >>>>> *Storage overhead:* >>>>> *TLDR:* Regardless which way we go, storage overhead shouldn't be an >>>>> issue. >>>>> >>>>> *Details:* >>>>> I made a couple of measurements recently on the storage size of the >>>>> update files. For numbers, see the "Update file size measurements" tab in >>>>> this >>>>> doc >>>>> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing> >>>>> . >>>>> >>>>> 1) Writing _pos column: >>>>> >>>>> - File formats could hide the overhead for the _pos column, e.g. >>>>> Parquet V2 has negligible (<0,5%) overhead >>>>> - I recall there was an argument that having _pos even for >>>>> pos-aligned updates might help us debug writer issues to see what is >>>>> missing/icorrect in the update file >>>>> - On the write path we already read and sort by _pos so there >>>>> should be no overhead either >>>>> >>>>> In summary, I slightly lean towards always having _pos in the updates >>>>> regardless what representation we choose. >>>>> If we can agree on this, the relevant pros and cons around _pos are >>>>> obsolete. >>>>> >>>>> 2) Filling rows/values for deleted rows: >>>>> >>>>> - I measured noticeable overhead when filling fields for deleted >>>>> rows with NULLs. As always the overhead depends on many stuff >>>>> - Deleted 5% of rows => writing NULLs has 2-3% storage overhead >>>>> - Deleted 20% of rows => writing NULLs has 4-4.6% overhead >>>>> - Since the consensus is constantly eliminating deletes, I think >>>>> this overhead is acceptable >>>>> >>>>> *Factors to consider to choose representation* >>>>> As described above, the storage size shouldn't be a factor here, I'd >>>>> not consider the presence of _pos as pro or con. >>>>> >>>>> If we scratch the storage size related ones, here is what we have left: >>>>> >>>>> *1) Accuracy of stats* >>>>> By filling missing rows with auxiliary values the stats can go off >>>>> >>>>> - I'm not that worried about Parquet footer stats, they represent >>>>> the file itself, not the logical deletes on top >>>>> - If we went for NULLs as filling values, we could set each field >>>>> in a deleted row to NULL (not the entire row). As a result the >>>>> column-level >>>>> null count in the table metadata can be off because of the filling >>>>> values. >>>>> - Technically, if we want to, we can correct these stats if we >>>>> collected the number of deleted rows and then adjust the null count by >>>>> this >>>>> number >>>>> - In general, deletes make stats off anyway regardless of column >>>>> updates (probably not the null count but the avg size, though) >>>>> >>>>> I don't think that this should be a decisive factor when choosing >>>>> representation. >>>>> >>>>> *2) Complexity in general* >>>>> a) Read - stitching >>>>> >>>>> - Positional alignment approach is straightforward for stitching >>>>> both in vectorized and row readers >>>>> - Stitch before applying deletes >>>>> - Applied deletes approach seems more complicated initially >>>>> - Scattering of update rows is required >>>>> - Might not be supported for all the language implementations >>>>> ATM. Thanks Anurag for taking a look! >>>>> >>>>> b) Write >>>>> >>>>> - Applied deletes approach is straightforward >>>>> - Positional alignment approach has one major complexity: >>>>> - when trailing rows are deleted, the writer currently has no >>>>> information how many rows to fill >>>>> - In the PoC we broadcast a "file path to row count" so that >>>>> the writers can now if trailing rows have to be filled with nulls, >>>>> and with >>>>> how many (comparing to the _pos column) >>>>> - In theory we could simply not fill trailing deleted rows, but >>>>> then we have a hybrid approach between positional alignment and >>>>> applied >>>>> deletes. Probably, we don't want this complexity in the spec >>>>> >>>>> *3) Read and write performance* >>>>> >>>>> - I'm not expecting any difference in write perf >>>>> - Read could have a toll on the "applied deletes" approach due to >>>>> scattering. *@pvary* might have some more insights here. >>>>> >>>>> *Summary* >>>>> I hope this summarizes all we have discussed from a different angle >>>>> and might narrow down the areas to look for. I think the two main >>>>> questions >>>>> to sort out to make a decision are: >>>>> 1) Can we find a way to take care of trailing deletes in >>>>> "positional aligned" approach (or are we fine not filling trailing >>>>> deletes) >>>>> 2) What is the cost of scattering the update rows in the "applied >>>>> deletes" approach >>>>> 2/b) Is scattering feasible on all language implementations >>>>> >>>>> Best Regards, >>>>> Gabor >>>>> >>>>> >>>>> Anurag Mantripragada <[email protected]> ezt írta >>>>> (időpont: 2026. máj. 20., Sze, 2:37): >>>>> >>>>>> Hi all, >>>>>> >>>>>> Following up on the column updates design >>>>>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> >>>>>> and >>>>>> the original discussion thread >>>>>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, >>>>>> I'd like to start a focused discussion on how column update files should >>>>>> represent rows when deletion vectors (DVs) are present. >>>>>> >>>>>> *Context* >>>>>> >>>>>> We've reached consensus on using a dense representation for column >>>>>> update files. When a column is updated, the column file contains values >>>>>> for >>>>>> all rows including unchanged rows. This avoids complex merge logic on the >>>>>> write path when successive updates target overlapping fields. >>>>>> >>>>>> The open question is: what should the column file contain at >>>>>> positions where the base file has deleted rows? There are two options. >>>>>> >>>>>> *Option 1*: Positional Alignment (row count matches base file) >>>>>> >>>>>> The column file has exactly base_file.record_count rows. Row N in >>>>>> the column file corresponds to row N in the base file. Deleted positions >>>>>> contain filler values (e.g., NULLs). >>>>>> >>>>>> Pros*:* >>>>>> >>>>>> - Stitching is a zero-copy column swap in Arrow >>>>>> - Works identically in every Arrow implementation (Java, Rust, >>>>>> Python, C++) >>>>>> - No _pos column needed >>>>>> - Engines apply their existing DV filter to both base and column >>>>>> file >>>>>> >>>>>> Cons*:* >>>>>> >>>>>> - Filler values at deleted positions skew Parquet footer >>>>>> statistics (null_count, avg_length) >>>>>> - Writes slightly more data than necessary (filler values for >>>>>> deleted rows) >>>>>> - Writer must know base_file.record_count to pad trailing >>>>>> deletions (base file metadata already available during write planning) >>>>>> >>>>>> *Option 2*: Applied Deletes (row count = live rows only) >>>>>> >>>>>> The column file contains only live rows (after applying DVs). A _pos >>>>>> column >>>>>> maps each row back to its ordinal position in the base file. >>>>>> >>>>>> Pros*:* >>>>>> >>>>>> - Only stores valid rows in column update files. >>>>>> - Parquet footer statistics are accurate (no skew from NULLs at >>>>>> deleted positions) >>>>>> - Slightly smaller file (no filler bytes) >>>>>> >>>>>> Cons*:* >>>>>> >>>>>> - _pos adds storage overhead (Encoding must be left to the file >>>>>> format) >>>>>> - Stitching requires a scatter operation to allocate a new array >>>>>> and place values at the correct positions >>>>>> - It's not zero-copy in Arrow and requires manipulation. >>>>>> - As it stands today this might be harder for non-Java engines >>>>>> (see section below) >>>>>> >>>>>> I investigated how three Iceberg implementations handle vectorized >>>>>> reading and what column stitching would require in each. The key >>>>>> architectural difference is how they expose Arrow memory: >>>>>> >>>>>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can >>>>>> override getInt(rowId)to redirect reads without copying data. This >>>>>> makes scatter operations appear "free" via virtual dispatch. My POC uses >>>>>> this approach. >>>>>> >>>>>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way >>>>>> to override what array[i] returns. PyArrow has take() (gather) but lacks >>>>>> a scatter() primitive (in the version we use). >>>>>> >>>>>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not >>>>>> trait objects). Int32Array::value(i) is a direct memory offset. Must >>>>>> materialize new arrays via ArrayBuilder for any non-trivial column >>>>>> manipulation. >>>>>> >>>>>> TL;DR: If we choose Option 2 (applied deletes), engines need a >>>>>> scatter method to stitch column files. I found the following >>>>>> implementations in Arrow which can be used to stitch. >>>>>> >>>>>> >>>>>> - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow >>>>>> 20.0.0) >>>>>> >>>>>> - Python <https://github.com/apache/arrow/pull/48267> (Since >>>>>> Arrow 23.0.0) >>>>>> - I did not find scatter in arrow-rs. >>>>>> >>>>>> I'm still researching these options and would love to hear from >>>>>> everyone. >>>>>> >>>>>> Thanks, >>>>>> Anurag >>>>>> >>>>>>
