Re: [DISCUSS] Populating total_record_count in partition statistics

hemanth boyina Thu, 23 Apr 2026 07:18:15 -0700

Thanks Peter for the feedback. This aligns with the current implementation
in the PR: core computes total_record_count when it can be done
unambiguously (no equality deletes, no V2 position delete files) and leaves it
null otherwise. Engines remain free to override with a more precise value
when they have additional context.
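
As a rough illustration of that rule (a sketch only, not the code in the
PR), the decision can be written as below. The function and parameter names
are hypothetical, and the sketch assumes the only deletes left to account
for are deletion vectors whose cardinalities have already been summed per
partition.

```python
from typing import Optional


def total_record_count(
    data_record_count: int,
    dv_deleted_record_count: int,
    equality_delete_file_count: int,
    v2_position_delete_file_count: int,
) -> Optional[int]:
    """Return the live row count for one partition, or None when it
    cannot be derived unambiguously from metadata alone.

    All names here are illustrative and are not Iceberg API names.
    """
    # Equality deletes: the number of rows they remove is not known
    # from metadata, so leave the value unset.
    if equality_delete_file_count > 0:
        return None
    # V2 position delete files: treated as ambiguous, as in the thread.
    if v2_position_delete_file_count > 0:
        return None
    # Only deletion vectors (or no deletes at all) remain, so subtract
    # the summed DV cardinalities from the data record count.
    return data_record_count - dv_deleted_record_count
```

For example, a partition with 1000 data records and DVs covering 40 rows
would get 960, while any partition with an equality delete file stays null
for an engine to fill in if it wants.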


Thanks
Hemanth Boyina

On Thu, 23 Apr 2026 at 3:59 PM, Péter Váry <[email protected]>
wrote:

> Hi Hemanth,
>
> My take is that keeping `total_record_count` in the spec and populating it
> automatically in core when it’s reliably computable is still valuable, even
> if the presence of equality deletes (which remain in v3 as well) limits
> when the value is strictly derivable.
> In cases where there are no equality deletes (and the relevant position
> delete constraints you listed), core can compute it cheaply and
> consistently, which:
>
>    - avoids re-implementing/forking the same logic across engines,
>    - makes the value immediately available from the persisted stats file
>    (without requiring an engine-side pass),
>    - and matches the spirit of the spec encouraging richer partition
>    stats when possible.
>
> At the same time, because equality deletes can invalidate the derivation,
> it also seems reasonable that engines retain the option to
> recompute/override total_record_count with a “more correct” value when they
> have additional context or are already scanning delete metadata.
> So I’d lean toward: core computes and persists it when it can do so
> unambiguously; otherwise leave it null, and engines are free to
> fill/override in their own pipelines if they want.
> This gives us a best-effort baseline in the common cases, without forcing
> complexity or correctness guarantees in the hard cases.
>
> Thanks!
> Peter
>
> hemanth boyina <[email protected]> wrote on Wed, 22 Apr 2026
> at 7:43:
>
>> Hi all,
>>
>> I have raised a PR [1] to populate the total_record_count field in
>> partition statistics when computable from metadata (no equality deletes, no
>> V2 position delete files). This follows the discussion in #12098 about
>> using DV cardinalities for this.
>>
>> During review, a question came up: since total_record_count is derivable
>> from existing fields, should the Iceberg core library compute and persist
>> it, or should this be left to engines?
>>
>> For computing in core: the spec encourages it, it avoids duplicating
>> logic across engines, and the value is immediately available from the
>> stats file.
>>
>> For leaving to engines: it is a derived value, the implementation adds
>> complexity around null handling in incremental computation, and it can
>> only be populated for partitions without equality deletes.
>>
>> Would appreciate community input on the preferred approach.
>> [1]
>> https://github.com/apache/iceberg/pull/15979
>>
>> Thanks
>> Hemanth Boyina
>>
>
