Re: Proposal to extend standardized statistics

Gábor Kaszab Wed, 05 Nov 2025 06:00:18 -0800

Hey Iceberg Community,

Thank you for taking a look at the proposal
<https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM>
and also for the feedback! First of all, I'd like to apologise for the long
delay in my response. I went through the feedback; here is a summary
and possible next steps:


*Partition-level column stats*
  - As a starting point, a scan API (with filtering, projection, etc.) could
come in handy even for the existing partition stats. I've published a PR
<https://github.com/apache/iceberg/pull/14508> to introduce such an API.
  - There was an ask for this recently on Slack
<https://apache-iceberg.slack.com/archives/C03LG1D563F/p1760925647880099>,
and also there is a GH issue
<https://github.com/apache/iceberg/issues/11083> opened earlier.
  - It would make sense for the partition-level column stats to follow the
new design of column stats coming with the V4 column stats
<https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I>
proposal. Doing that would allow us to project partition-level column stats
by field and by individual stat too. I should follow up and coordinate with
that proposal.
  - If I missed anything else here, let me know.

*Table-level KLL sketch*
So far no feedback on this. There is a PR
<https://github.com/apache/iceberg/pull/8202> for the spec changes for this
already. This could be a nice addition; I can cover the implementation if
there are no objections.

*Table-level column stats (like min/max etc.)*
So far not much feedback on this. There are open questions about how to
implement this. I will wait for further feedback and put it on hold for now
in favor of the two items above.

*File-level avg length and max length*
These will be included in the V4 stats improvements.

*Partition-level Theta sketches for NDV*
These seem to consume too much space even with low precision and seem to
have limited benefits. In case there is a particular use-case for this, let
me know! Putting it on hold for now.

Any further feedback is appreciated! Thanks!
Gabor

Jacky Lee <[email protected]> wrote (on Thu, Aug 28, 2025,
15:54):

> Excellent proposal!
>
> We’ve internally augmented both table-level and partition-level
> ColumnStatistics, and observed a 30%+ performance gain in Spark and
> Trino query execution—largely due to improved Cost-Based Optimization
> (CBO) effectiveness.
> However, leveraging the v3 format presented numerous challenges (such
> as column-type evolution and how min/max values are stored). We
> believe adopting the v4 format would be a more robust solution.
>
>
> I’ve researched this extensively and applied it in production. I’d be
> glad to collaborate on implementing this feature if needed.
>
>
> Best wishes.
>
> Gábor Kaszab <[email protected]> wrote on Thu, Aug 28, 2025 at 21:23:
> >
> > Hey Iceberg Community,
> >
> > I've been working on a proposal to extend the currently standardized
> > statistics in Iceberg, by looking into what statistics are used by some
> > query engines and trying to fill the gaps (credit also goes to Denys K for
> > laying the groundwork). The motivation is to use Iceberg as the source of
> > truth when it comes to statistics across all the engines.
> > Meanwhile, there have been movements on other proposals (Restructuring
> > col-stats, Restructuring metadata) that might overlap with mine. Let’s see
> > how much of my proposal still holds up in light of these developments.
> >
> > Any feedback is appreciated!
> > Gabor
>
