> > The current proposal only leaves 10000+200 ids for other columns than stats. If in the future, we find some other feature which would require a manifest file column for every data column in the table, then we would need to change the spec.

I do think we might want to put an upper bound on the column stats. Ryan calculated the upper bound of what can be represented, but I don't think we need to accommodate 10m+ field ids, and doing so would block the entire id range. It might make more sense to simply cap the stats space (e.g. at 100k or 1m fields?). That would leave plenty of room for future evolution of the spec without having to redefine the stats range.

Another thing that both Russel and Ryan brought up is being able to track stats for sort orders or expressions, which don't share an id space with field ids. We might want to decide what the full stats space should look like. For example:

8k+ sort orders
9k+ expressions
10k+ field ids
1m+ <unreserved>
MAX_VALUE - 200 <reserved per spec>

Since sort orders and expressions have much lower cardinality than field ids, we can probably give them a more constrained range.

I'm leaning against custom stats because, as Micah mentioned, they increase complexity for all writers and introduce the potential for id space collisions. They could also easily compromise engine performance if other writers drop them (via compaction or any other metadata rewrite operation).

I feel like it would be better to work on formalizing the stats so that they are known and easier to project. It is hard to get agreement on more complicated stats (like collations that have very specific character set handling), but I think using expressions in lieu of custom stats might address all of these cases and would be more straightforward for the copy-forward requirement.

-Dan

On Thu, Jul 24, 2025 at 4:03 AM Eduard Tudenhöfner <eduard.tudenhoef...@databricks.com.invalid> wrote:

>> 1. The current proposal only leaves 10000+200 ids for other columns than stats. If in the future, we find some other feature which would require a manifest file column for every data column in the table, then we would need to change the spec.
>
> For this I think we could start at *100,000* so that we use *100,000 + 200 * <fieldID>* to calculate the field ID of a given statistic.
>
>> 1. The current proposal expects every engine to share the same stats, and not store any "non-standard" stat in the metadata.
>
> We haven't explicitly stated it in the proposal but there were discussions on how to potentially support this and what implications it brings for readers/writers.
>
>> I'm still not clear on what the proposal is to handle stats for reserved columns <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I think there was some mention in the notes but it was light on details). It seems like it would be potentially useful to have stats for things like _row_id, and the multiplication would overflow for these column IDs (maybe this still yields unique column IDs though?)
>
> To handle stats for reserved columns we could start at *2,417,000,000* which should give us enough room to store 200 stats per metadata ID. We would also ensure that those ID ranges for table columns and reserved columns wouldn't overlap.
>
>> I assume we could put whatever these columns are under stats? Maybe we just need a more generic name for the top level struct?
>
> I haven't updated the proposal yet, but I think renaming *column_stats* to *content_stats* would make sense.
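
For reference, here is a minimal sketch of the id arithmetic discussed in the thread, assuming the *100,000 + 200 * <fieldID>* scheme and a signed 32-bit id space. The names `stat_field_id`, `max_data_field_id` and `stat_offset` are hypothetical and not part of the proposal; the sketch only illustrates where the 10m+ upper bound comes from and why ids near the reserved range overflow.

```python
INT32_MAX = 2**31 - 1    # top of a signed 32-bit id space (MAX_VALUE in the thread)

STATS_BASE = 100_000     # starting offset suggested in the thread
STATS_PER_FIELD = 200    # ids set aside per data column


def stat_field_id(field_id: int, stat_offset: int = 0) -> int:
    """Return the id of one statistic of `field_id` (hypothetical helper).

    Raises OverflowError when the result does not fit in a signed 32-bit int.
    """
    if not 0 <= stat_offset < STATS_PER_FIELD:
        raise ValueError("stat_offset must be in [0, 200)")
    result = STATS_BASE + STATS_PER_FIELD * field_id + stat_offset
    if result > INT32_MAX:
        raise OverflowError(f"stat id for field {field_id} exceeds int32")
    return result


def max_data_field_id() -> int:
    """Largest field id whose full block of 200 stat ids still fits in int32."""
    return (INT32_MAX - STATS_BASE - (STATS_PER_FIELD - 1)) // STATS_PER_FIELD


if __name__ == "__main__":
    print(stat_field_id(1))        # first stat id of field 1
    print(max_data_field_id())     # upper bound on representable field ids
    try:
        # An id at the edge of the reserved range (MAX_VALUE - 200) cannot be mapped.
        stat_field_id(INT32_MAX - 200)
    except OverflowError as err:
        print(err)
```

Under these assumptions the largest mappable field id is a little over 10 million, and any id in the reserved range near MAX_VALUE overflows, which is why reserved columns would need their own starting offset.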