Re: [DISCUSS] Populating total_record_count in partition statistics

Péter Váry Thu, 23 Apr 2026 03:29:34 -0700

Hi Hemanth,

My take is that keeping `total_record_count` in the spec and populating it
automatically in core when it’s reliably computable is still valuable, even
if the presence of equality deletes (which remain in v3 as well) limits
when the value is strictly derivable.
In cases where there are no equality deletes (and the position delete
constraints you listed are met), core can compute it cheaply and
consistently, which:


   - avoids re-implementing/forking the same logic across engines,
   - makes the value immediately available from the persisted stats file
   (without requiring an engine-side pass),
   - and matches the spirit of the spec encouraging richer partition stats
   when possible.

At the same time, because equality deletes can invalidate the derivation,
it also seems reasonable that engines retain the option to
recompute/override `total_record_count` with a more accurate value when they
have additional context or are already scanning delete metadata.
So I’d lean toward: core computes and persists it when it can do so
unambiguously; otherwise leave it null, and engines are free to
fill/override in their own pipelines if they want.
This gives us a best-effort baseline in the common cases, without forcing
complexity or correctness guarantees in the hard cases.
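
To make the proposal concrete, here is a rough sketch of the rule in Python.
It is not the code from the PR. The `PartitionStats` class, its field names
(in particular `dv_deleted_record_count` and `v2_position_delete_file_count`),
and both helper functions are hypothetical names chosen for illustration,
and the conditions are simply the ones listed in the quoted mail.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical, simplified view of one partition's stats row."""
    data_record_count: int
    dv_deleted_record_count: int        # assumed: sum of DV cardinalities
    v2_position_delete_file_count: int  # assumed: count of V2 position delete files
    equality_delete_file_count: int
    total_record_count: Optional[int] = None


def core_total_record_count(stats: PartitionStats) -> Optional[int]:
    """Return the count only when metadata alone determines it, else None."""
    if stats.equality_delete_file_count > 0:
        return None  # equality deletes: not derivable from metadata
    if stats.v2_position_delete_file_count > 0:
        return None  # V2 position delete files: excluded by the proposal
    return stats.data_record_count - stats.dv_deleted_record_count


def with_total(stats: PartitionStats,
               engine_value: Optional[int] = None) -> PartitionStats:
    """Persist the engine's value if it supplies one, else core's best effort."""
    value = engine_value if engine_value is not None else core_total_record_count(stats)
    return replace(stats, total_record_count=value)
```

With this shape, a partition with equality deletes keeps a null
`total_record_count` unless an engine passes its own value to `with_total`.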

Thanks!
Peter

hemanth boyina <[email protected]> wrote (on Wed, Apr 22, 2026,
7:43):

> Hi all,
>
> I have raised a PR [1] to populate the total_record_count field in
> partition statistics when it is computable from metadata (no equality
> deletes, no V2 position delete files). This follows the discussion in
> #12098 about using DV cardinalities for this.
>
> During review, a question came up: since total_record_count is derivable
> from existing fields, should the Iceberg core library compute and persist
> it, or should this be left to engines?
>
> For computing in core: the spec encourages it, it avoids duplicating logic
> across engines, and it’s immediately available from the stats file.
> For leaving to engines: it’s a derived value, the implementation adds
> complexity around null handling in incremental computation, and it can only
> be populated for partitions without equality deletes.
>
> Would appreciate community input on the preferred approach.
>
> [1] https://github.com/apache/iceberg/pull/15979
>
> Thanks
> Hemanth Boyina
>
