Re: [DISCUSS] v4 - Improved column statistics

Daniel Weeks Thu, 24 Jul 2025 11:19:48 -0700

>
> The current proposal only leaves 10000+200 ids for other columns than
> stats. If in the future, we find some other feature which would require a
> manifest file column for every data column in the table, then we would need
> to change the spec.

I do think we might want to put an upper bound on the column stats.  Ryan
calculated the upper bound of what can be represented, but I don't think we
need to accommodate 10m+ field ids, and doing so would block the entire id
range.  It might make more sense to simply put an upper bound on the stats
space (e.g. 100k or 1m fields?).  This would leave plenty of space for
future evolution of the spec without having to redefine the stats range.
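
To make the arithmetic concrete, here is a rough sketch.  It assumes the
numbers mentioned elsewhere in this thread: stats ids start at 10,000, each
field gets a block of 200 ids, and the top 200 ids of the 32-bit range are
reserved per spec.  The function names and the zero-based field id check are
hypothetical and only for illustration; none of this is settled in the
proposal.

```python
# Assumed constants, taken from figures quoted in this thread.
INT32_MAX = 2**31 - 1      # largest 32-bit signed field id
RESERVED_TOP = 200         # ids at the top of the range reserved per spec
STATS_BASE = 10_000        # first id of the stats space
SLOTS_PER_FIELD = 200      # stats ids set aside per field


def max_trackable_fields(base=STATS_BASE, slots=SLOTS_PER_FIELD):
    """Number of fields that fit if stats may use the whole id range."""
    usable = INT32_MAX - RESERVED_TOP - base
    return usable // slots


def stats_range_end(max_fields, base=STATS_BASE, slots=SLOTS_PER_FIELD):
    """First id after the stats space when it is capped at max_fields."""
    return base + slots * max_fields


def stat_id(field_id, stat_offset, max_fields,
            base=STATS_BASE, slots=SLOTS_PER_FIELD):
    """Id of one statistic for one field, rejecting ids outside the cap."""
    if not 0 <= field_id < max_fields:
        raise ValueError(f"field id {field_id} is outside the capped stats space")
    if not 0 <= stat_offset < slots:
        raise ValueError(f"stat offset {stat_offset} does not fit in {slots} slots")
    return base + slots * field_id + stat_offset


if __name__ == "__main__":
    print("uncapped field count:", max_trackable_fields())
    for cap in (100_000, 1_000_000):
        print(f"cap of {cap} fields ends the stats space at id",
              stats_range_end(cap))
```

Under these assumptions the uncapped space holds a bit over 10m fields,
while a cap of 1m fields would end the stats space around id 200m and leave
the rest of the range free.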

Another thing that both Russell and Ryan brought up is being able to track
stats for sort orders or expressions, which don't share an id space with
field ids.  We might want to decide what the full stats space should look
like.  For example:

8k+ sort orders
9k+ expressions
10k+ field ids
1m+ <unreserved>
MAX_VALUE - 200 <reserved per spec>

Since sort orders and expressions have much lower cardinality than field
ids, we can probably have a more constrained range.

I'm leaning against custom stats because, as Micah mentioned, they increase
complexity for all writers and introduce the potential for id space
collisions.  They could also easily compromise engine performance if other
writers drop them (via compaction or any other metadata rewrite operation).
I feel it would be better to formalize the stats so that they are known and
easier to project.  It's hard to get agreement on more complicated stats
(like collations that have very specific character set handling), but I
think using expressions in lieu of custom stats might address all of these
cases and would be more straightforward for the copy-forward requirement.

-Dan

On Thu, Jul 24, 2025 at 4:03 AM Eduard Tudenhöfner
<[email protected]> wrote:

>
>
>
>>    1. The current proposal only leaves 10000+200 ids for other columns
>>    than stats. If in the future, we find some other feature which would
>>    require a manifest file column for every data column in the table, then we
>>    would need to change the spec.
>>
>> For this I think we could start at *100,000* so that we use *100,000 +
> 200 * <fieldID>* to calculate the field ID of a given statistic.
>
>
>>
>>    1. The current proposal expects every engine to share the same stats,
>>    and not store any "non-standard" stat in the metadata.
>>
>> We haven't explicitly stated it in the proposal but there were
> discussions on how to potentially support this and what implications it
> brings for readers/writers
>
>
> I'm still not clear on what the proposal is to handle stats for reserved
>> columns <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I
>> think there was some mention in the notes but it was light on details). It
>> seems like it would be potentially useful to have stats for things like
>> _row_id, and the multiplication would overflow for these column IDs (maybe
>> this still yields unique column IDs though?)
>>
>
> To handle stats for reserved columns we could start at *2,417,000,000*
> which should give us enough room to store 200 stats per metadata ID. We
> would also ensure that those ID ranges for table columns and reserved
> columns wouldn't overlap.
>
>
> I assume we could put whatever these columns are under stats? Maybe we
>> just need a more generic name for the top level struct?
>
>
> I haven't updated the proposal yet, but I think renaming *column_stats*
> to *content_stats* would make sense.
>
>
>
