I personally think these changes are not required for the following reasons:
- *Consistency:* We can apply this same logic to positional deletes, not just *DVs* (Peter also mentioned this).
- *Spec Adherence:* The total_record_count should reflect an accurate value in all cases. The spec defines it as the *"Accurate count of records in a partition after applying deletes if any,"* which implies engines should account for equality deletes as well.
- *Redundancy:* The current PR simply calculates total_record = data record - dv record. This is easily derived by the user from other existing stats, so specific handling isn't necessary.
- *Existing Logic:* We don't force total_record = data record when no delete files exist. Since those stats are already easily inferred by the user, we should maintain that same approach here.

On Thu, Apr 23, 2026 at 7:48 PM hemanth boyina <[email protected]> wrote:

> Thanks Peter for the feedback. This aligns with the current implementation
> in the PR - core computes total_record_count when it can be done
> unambiguously(no eq deletes, no V2 position delete files) and leaves it
> null otherwise. Engines remain free to override with a more precise value
> when they have additional context.
>
> Thanks
> Hemanth Boyina
>
> On Thu, 23 Apr 2026 at 3:59 PM, Péter Váry <[email protected]>
> wrote:
>
>> Hi Hemanth,
>>
>> My take is that keeping `total_record_count` in the spec and populating
>> it automatically in core when it’s reliably computable is still valuable,
>> even if the presence of equality deletes (which remain in v3 as well)
>> limits when the value is strictly derivable.
>> In cases where there are no equality deletes (and the relevant position
>> delete constraints you listed), core can compute it cheaply and
>> consistently, which:
>>
>> - avoids re-implementing/forking the same logic across engines,
>> - makes the value immediately available from the persisted stats file
>> (without requiring an engine-side pass),
>> - and matches the spirit of the spec encouraging richer partition
>> stats when possible.
>>
>> At the same time, because equality deletes can invalidate the derivation,
>> it also seems reasonable that engines retain the option to
>> recompute/override total_record_count with a “more correct” value when they
>> have additional context or are already scanning delete metadata.
>> So I’d lean toward: core computes and persists it when it can do so
>> unambiguously; otherwise leave it null, and engines are free to
>> fill/override in their own pipelines if they want.
>> This gives us a best-effort baseline in the common cases, without forcing
>> complexity or correctness guarantees in the hard cases.
>>
>> Thanks!
>> Peter
>>
>> On Wed, Apr 22, 2026 at 7:43, hemanth boyina <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have raised a PR [1] to populate the total_record_count field in
>>> partition statistics when computable from metadata( no equality deletes, no
>>> V2 position delete files). This follows the discussion in #12098 about
>>> using DV cardinalities for this.
>>>
>>> During review, a question came up : since total_record_count is
>>> derivable from existing fields , should the iceberg core library compute
>>> and persist it, or should this be left to engines ?
>>>
>>> For computing in core: the spec encourages it, it avoids duplicating
>>> logic across engines, and it’s immediately available from the stats file
>>> For leaving to engines: it’s a derived value, implementation adds
>>> complexity around null handling in incremental computation and it can only
>>> be populated for partitions without eq deletes.
>>>
>>> Would appreciate community inputs on the preferred approach.
>>> [1]
>>> https://github.com/apache/iceberg/pull/15979
>>>
>>> Thanks
>>> Hemanth Boyina
>>>
>>
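
For reference, the derivation discussed in this thread (total_record = data record - dv record, left null when equality deletes or V2 position delete files are present) can be sketched roughly as below. This is only an illustration, not the code in the PR and not the Iceberg API: the `PartitionStats` class, its field names, and `derive_total_record_count` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionStats:
    """Hypothetical stand-in for one partition's stats (not the Iceberg API)."""
    data_record_count: int
    dv_record_count: int                # rows marked deleted by deletion vectors
    v2_position_delete_file_count: int  # V2 position delete files in the partition
    equality_delete_file_count: int     # equality delete files in the partition


def derive_total_record_count(stats: PartitionStats) -> Optional[int]:
    """Return data records minus DV records, or None when not derivable.

    Assumption: the value is treated as unambiguous only when the partition
    has no equality delete files and no V2 position delete files.
    """
    if stats.equality_delete_file_count > 0:
        return None  # equality deletes cannot be resolved from metadata alone
    if stats.v2_position_delete_file_count > 0:
        return None  # the thread treats V2 position delete files as not derivable
    return stats.data_record_count - stats.dv_record_count
```

Under these assumptions, a partition with 100 data records and 10 DV-deleted records gives 90, and any partition with an equality delete file or a V2 position delete file gives None, which an engine could later replace with its own value.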
