Re: [DISCUSS] v4 - Improved column statistics

Péter Váry Mon, 28 Jul 2025 06:19:55 -0700

If we focus strictly on file-level column statistics, then partition level
column statistics is not a concern. However, looking ahead, we likely want
to support column statistics at the table or partition level as well. It
would be beneficial to adopt a consistent approach to ID generation and
handling for partition statistics too.


Micah Kornfield <emkornfi...@gmail.com> wrote (on Thu, Jul 24, 2025 at
23:50):

> Hi Dan,
>
> I largely agree that expressions will be useful and would limit the need
> for "custom stats".  I just wanted to probe some of the points you made,
> since I think there might be some important distinctions that are getting
> glossed over.
>
>
>> The requirement would just be that you need to project all the stats for
>> an expression/sort order when copying metadata entries (though I may be
>> trivializing this and it's harder than I expect).  I think the issue with a
>> low-bar to add new stats is basically the same effort as saying you need to
>> support arbitrary/unknown stats carry-over since older clients would either
>> have to handle the unknown cases or would end up dropping values.
>
>
> I think we should disambiguate two cases:
> 1.  Custom stats.  In this case I assume whoever is using them has a
> custom writer that will do any carry-over necessary, and won't let
> reference writers touch their table.  We shouldn't require reference
> implementations to carry over these stats.
> 2.  Official non-required stats.  In this case I think the projection is
> entirely known because all possible stats would be enumerated for any given
> version of the spec (i.e. it is different than unknown/arbitrary stats).
> Older clients should never be writing to newer versions of the table if
> they don't understand the version of the spec that is currently used for
> the table.  Manifest compaction could still occur fairly easily without
> data loss (i.e. it seems like in this scenario carrying over less used
> stats is the same effort as carrying over stats for expressions)?
>
> What couldn't occur is file compaction/adding new files, but I think we
> have the same problem with custom expressions in this regard.
>
>> I'm still open to debate on this but if we need to support
>> expressions/sort-orders it feels like a good path to both handling
>> customization as well as providing a path to standardization if we find
>> specific cases that are commonly reused as expressions.
>
>
> We should probably distinguish between two types of expressions:
> 1.  Scalar expressions -  i.e. transform a value in a specific way (I
> thought this was the main use case of expressions).  Examples:
> Timestamp->Date.  String normalization/collation.
> 2.  Aggregate expressions - We are transforming N values to 1 value.  Note
> that stats are all aggregate expressions.
>
> If aggregates aren't in scope for expressions then I'm not sure they would
> satisfy all custom stats requirements.  If they are in scope, this brings
> up the  question: do we actually need a specific concept of "stats"? It
> seems all stats could just be modelled as expressions?
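To make the scalar/aggregate distinction concrete, here is a minimal sketch. It assumes a stat can be modelled as an aggregate applied to the output of a scalar expression; the names `to_date` and `stat` are hypothetical and do not come from the proposal or from any Iceberg API.

```python
from datetime import date, datetime
from typing import Callable, Iterable


def to_date(ts: datetime) -> date:
    """Scalar expression: maps one value to one value (Timestamp -> Date)."""
    return ts.date()


def stat(aggregate: Callable, scalar: Callable, column: Iterable):
    """Aggregate expression: reduces N transformed values to a single value."""
    return aggregate(scalar(v) for v in column)


# Hypothetical column of timestamps from one data file.
column = [datetime(2025, 7, 24, 23, 50), datetime(2025, 7, 28, 6, 19)]

lower = stat(min, to_date, column)  # lower bound of the Date expression
upper = stat(max, to_date, column)  # upper bound of the Date expression
```

Under this reading, a lower or upper bound is just `min` or `max` composed with a scalar expression, which is the sense in which all stats could be modelled as expressions.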
>
> Cheers,
> Micah
>
>
>
>
> On Thu, Jul 24, 2025 at 1:38 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>>> Also off topic, but doesn't this just shift the burden of
>>> standardization to expressions?  This might be controversial but maybe
>>> the bar for adding a new stat type should be relatively low?  They are
>>> optional anyways, we can maybe define some stats as core (implementations
>>> are incomplete if they can't produce them) and others as non-core (not
>>> required for implementations, there can be optional configuration to either
>>> block writes that require producing the stats or just drop them).
>>
>>
>> If we need to support stats for expressions/sort-orders, then we've
>> pretty much done the hard work already.  The requirement would just be that
>> you need to project all the stats for an expression/sort order when copying
>> metadata entries (though I may be trivializing this and it's harder than I
>> expect).  I think the issue with a low-bar to add new stats is basically
>> the same effort as saying you need to support arbitrary/unknown stats
>> carry-over since older clients would either have to handle the unknown
>> cases or would end up dropping values.  I think expressions are a better
>> way to handle customization because it wouldn't require the same
>> consistency of representation/interpretation as a formally adopted stat.
>> The expression would then really fall on whatever standard we set for
>> portability which provides more flexibility (yes, it shifts the burden, but
>> we're going to have to figure that out for expressions/udfs/etc anyway).
>>
>> I'm still open to debate on this but if we need to support
>> expressions/sort-orders it feels like a good path to both handling
>> customization as well as providing a path to standardization if we find
>> specific cases that are commonly reused as expressions.
>>
>> -Dan
>>
>> On Thu, Jul 24, 2025 at 12:03 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> After having thought about it some more, my current point of view is
>>> proceeding with something as simple as possible for V4 (I tried to
>>> formalize what I think the proposed algorithm is in the original proposal
>>> doc
>>> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0>
>>>  [1]).
>>> If in the course of V4 development we find some flaw with the simple
>>> approach we can revise it (e.g. we run out of space).  If something comes
>>> up after V4, we are not talking about a lot of code either way, so having a
>>> new scheme for V5+ would not be a major burden (all manifests are now
>>> written with a spec version, so detection is easy).
>>>
>>> Given the potential complications with custom stats, I think it is
>>> reasonable to allow implementations that want custom stats to use the upper
>>> bound of reserved offset range (e.g. we have 6 reserved out of 200 today,
>>> if implementations really need custom stats then they can start using
>>> offset 199, and then 198, etc).  This poses a low risk of overlap in the
>>> short term, and I assume those using custom stats would have tight control
>>> over their environment anyways, so they have the ability to manage
>>> conflicts, compactions, in a way that fits them.
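A minimal sketch of the "count down from the top" idea, assuming 200 offsets per field (0 to 199) and that official stats occupy the lowest offsets contiguously (6 in use today, per the message above). The function name `custom_offset` is hypothetical.

```python
SLOTS_PER_FIELD = 200  # offsets 0..199 per field, as described in the thread
OFFICIAL_IN_USE = 6    # assumption: official stats occupy offsets 0..5


def custom_offset(n: int, official_in_use: int = OFFICIAL_IN_USE) -> int:
    """Return the offset for the n-th custom stat (0-based), counting down from 199."""
    if n < 0:
        raise ValueError("n must be non-negative")
    offset = SLOTS_PER_FIELD - 1 - n
    if offset < official_in_use:
        # The custom range has grown down into the officially assigned offsets.
        raise ValueError("custom stat would collide with an official offset")
    return offset
```

The collision check is the part that matters: every official stat added later shrinks the space left for custom stats, so an implementation relying on this scheme would have to re-check on each spec revision.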
>>>
>>>
>>>> Another thing that both Russell and Ryan brought up is being able to
>>>> track stats for sort orders or expressions, but they don't share an id
>>>> space with field ids.
>>>
>>>
>>> Slightly off topic, but is there a reason we can't unify the field ID
>>> range for V4?
>>>
>>>> I feel like it would be better to work to formalize the stats so that
>>>> they are known and easier to project, but it's also hard to get agreement
>>>> for more complicated stats (like collations that have very
>>>> specific character set handling), but I think using expressions in lieu of
>>>> custom stats might address all of these cases and would be more
>>>> straightforward for the copy-forward requirement.
>>>
>>>
>>> Also off topic, but doesn't this just shift the burden of
>>> standardization to expressions?  This might be controversial but maybe
>>> the bar for adding a new stat type should be relatively low?  They are
>>> optional anyways, we can maybe define some stats as core (implementations
>>> are incomplete if they can't produce them) and others as non-core (not
>>> required for implementations, there can be optional configuration to either
>>> block writes that require producing the stats or just drop them).
>>>
>>> [1]
>>> https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0
>>>
>>>
>>>
>>> On Thu, Jul 24, 2025 at 11:19 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>>> The current proposal only leaves 10000+200 ids for other columns than
>>>>> stats. If in the future, we find some other feature which would require a
>>>>> manifest file column for every data column in the table, then we would
>>>>> need to change the spec.
>>>>
>>>>
>>>> I do think we might want to put an upper bound on the column stats.
>>>> Ryan calculated the upper bound of what can be represented, but I don't
>>>> think we need to accommodate 10m+ field ids and that would block the entire
>>>> id range.  It might make more sense to simply put an upper bound on the
>>>> stats space (e.g. 100k or 1m fields?).  This would leave plenty of space
>>>> for future evolution of the spec without having to redefine the stats
>>>> range.
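As a rough check on the sizes being discussed, the sketch below works through the arithmetic. It assumes signed 32-bit IDs, 200 stat slots per field, and a stats range starting at 10,000 (the figure quoted above); the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1  # assumption: IDs are signed 32-bit integers


def max_field_id(base: int, slots: int, id_max: int = INT32_MAX) -> int:
    """Largest field ID whose last stat slot (base + slots*f + slots-1) still fits."""
    return (id_max - base - (slots - 1)) // slots


def stats_range_end(base: int, slots: int, max_fields: int) -> int:
    """Exclusive end of the stats ID range if only max_fields field IDs are allowed."""
    return base + slots * max_fields


uncapped = max_field_id(10_000, 200)                # roughly 10.7 million field IDs
capped_1m = stats_range_end(10_000, 200, 1_000_000)
capped_100k = stats_range_end(10_000, 200, 100_000)
```

Under these assumptions, capping at 1m fields ends the stats range near 200 million, under a tenth of the 32-bit ID space, and capping at 100k fields ends it near 20 million.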
>>>>
>>>> Another thing that both Russell and Ryan brought up is being able to
>>>> track stats for sort orders or expressions, but they don't share an id
>>>> space with field ids.  We might want to decide what the full stats space
>>>> should look like.  For example:
>>>>
>>>> 8k+ sort orders
>>>> 9k+ expressions
>>>> 10k+ field ids
>>>> 1m+ <unreserved>
>>>> MAX_VALUE - 200 <reserved per spec>
>>>>
>>>> Since sort orders and expressions have much lower cardinality than
>>>> field ids, we can probably have a more constrained range.
>>>>
>>>> I'm leaning against custom stats because it does increase complexity
>>>> for all writers as Micah mentioned and introduces the potential for id
>>>> space collision.  It would also easily compromise the performance of
>>>> engines if other writers drop them (via compaction or just any metadata
>>>> rewrite operation).  I feel like it would be better to work to formalize
>>>> the stats so that they are known and easier to project, but it's also hard
>>>> to get agreement for more complicated stats (like collations that have very
>>>> specific character set handling), but I think using expressions in lieu of
>>>> custom stats might address all of these cases and would be more
>>>> straightforward for the copy-forward requirement.
>>>>
>>>> -Dan
>>>>
>>>>
>>>>
>>>> On Thu, Jul 24, 2025 at 4:03 AM Eduard Tudenhöfner
>>>> <eduard.tudenhoef...@databricks.com.invalid> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>>    1. The current proposal only leaves 10000+200 ids for other
>>>>>>    columns than stats. If in the future, we find some other feature which
>>>>>>    would require a manifest file column for every data column in the
>>>>>>    table, then we would need to change the spec.
>>>>>>
>>>>> For this I think we could start at *100,000* so that we use *100,000 +
>>>>> 200 * <fieldID>* to calculate the field ID of a given statistic.
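A minimal sketch of this ID arithmetic, assuming a base of 100,000, 200 stat slots per field, and field IDs stored as signed 32-bit integers. Adding a per-stat `offset` inside each block of 200 is an assumption about how individual stats would be addressed; the names `STATS_BASE`, `SLOTS_PER_FIELD`, and `stat_field_id` are hypothetical.

```python
INT32_MAX = 2**31 - 1    # assumption: field IDs are signed 32-bit integers
STATS_BASE = 100_000     # start of the stats ID range suggested above
SLOTS_PER_FIELD = 200    # stat slots set aside per field ID


def stat_field_id(field_id: int, offset: int) -> int:
    """Map a (field ID, stat offset) pair to the field ID of that statistic."""
    if field_id < 0:
        raise ValueError("field_id must be non-negative")
    if not 0 <= offset < SLOTS_PER_FIELD:
        raise ValueError(f"offset must be in [0, {SLOTS_PER_FIELD})")
    result = STATS_BASE + SLOTS_PER_FIELD * field_id + offset
    if result > INT32_MAX:
        # Field IDs near the top of the 32-bit range cannot be mapped this way.
        raise OverflowError(f"stats ID {result} does not fit in 32 bits")
    return result
```

For example, `stat_field_id(5, 3)` gives 101,003, while a field ID close to the 32-bit maximum raises `OverflowError` instead of silently wrapping.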
>>>>>
>>>>>
>>>>>>
>>>>>>    1. The current proposal expects every engine to share the same
>>>>>>    stats, and not store any "non-standard" stat in the metadata.
>>>>>>
>>>>> We haven't explicitly stated it in the proposal, but there were
>>>>> discussions on how to potentially support this and what implications it
>>>>> brings for readers/writers.
>>>>>
>>>>>
>>>>>> I'm still not clear on what the proposal is to handle stats for reserved
>>>>>> columns <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I
>>>>>> think there was some mention in the notes but it was light on details).
>>>>>> It seems like it would be potentially useful to have stats for things
>>>>>> like _row_id, and the multiplication would overflow for these column IDs
>>>>>> (maybe this still yields unique column IDs though?)
>>>>>>
>>>>>
>>>>> To handle stats for reserved columns we could start at *2,417,000,000*
>>>>> which should give us enough room to store 200 stats per metadata ID. We
>>>>> would also ensure that those ID ranges for table columns and reserved
>>>>> columns wouldn't overlap.
>>>>>
>>>>>
>>>>>> I assume we could put whatever these columns are under stats? Maybe we
>>>>>> just need a more generic name for the top level struct?
>>>>>
>>>>>
>>>>> I haven't updated the proposal yet, but I think renaming
>>>>> *column_stats* to *content_stats* would make sense.
>>>>>
>>>>>
>>>>>
