If we focus strictly on file-level column statistics, then partition-level column statistics are not a concern. However, looking ahead, we will likely want to support column statistics at the table or partition level as well. It would be beneficial to adopt a consistent approach to ID generation and handling for partition statistics too.

Micah Kornfield <emkornfi...@gmail.com> wrote (on Thu, Jul 24, 2025, 23:50):

> Hi Dan,
>
> I largely agree that expressions will be useful and would limit the need
> for "custom stats". I just wanted to probe some of the points you made,
> since I think there might be some important distinctions that are getting
> glossed over.
>
>> The requirement would just be that you need to project all the stats for
>> an expression/sort order when copying metadata entries (though I may be
>> trivializing this and it's harder than I expect). I think the issue with a
>> low-bar to add new stats is basically the same effort as saying you need to
>> support arbitrary/unknown stats carry-over since older clients would either
>> have to handle the unknown cases or would end up dropping values.
>
> I think we should disambiguate two cases:
> 1. Custom stats. In this case I assume whoever is using them has a
>    custom writer that will do any carry-over necessary, and won't let
>    reference writers touch their table. We shouldn't require reference
>    implementations to carry over these stats.
> 2. Official non-required stats. In this case I think the projection is
>    entirely known because all possible stats would be enumerated for any
>    given version of the spec (i.e. it is different than unknown/arbitrary
>    stats). Older clients should never be writing to newer versions of the
>    table if they don't understand the version of the spec that is
>    currently used for the table. Manifest compaction could still occur
>    fairly easily without data loss (i.e. it seems like in this scenario
>    carrying over less used stats is the same effort as carrying over
>    stats for expressions)?
>
> What couldn't occur is file compaction/adding new files, but I think we
> have the same problem with custom expressions in this regard.
>
>> I'm still open to debate on this but if we need to support
>> expressions/sort-orders it feels like a good path to both handling
>> customization as well as providing a path to standardization if we find
>> specific cases that are commonly reused as expressions.
>
> We should probably distinguish between two types of expressions:
> 1. Scalar expressions - i.e. transform a value in a specific way (I
>    thought this was the main use case of expressions). Examples:
>    Timestamp->Date. String normalization/collation.
> 2. Aggregate expressions - We are transforming N values to 1 value. Note
>    that stats are all aggregate expressions.
>
> If aggregates aren't in scope for expressions then I'm not sure they would
> satisfy all custom stats requirements. If they are in scope, this brings
> up the question: do we actually need a specific concept of "stats"? It
> seems all stats could just be modelled as expressions?
>
> Cheers,
> Micah
>
> On Thu, Jul 24, 2025 at 1:38 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>>> Also off topic, but doesn't this just shift the burden of
>>> standardization to expressions? This might be controversial but maybe
>>> the bar for adding a new stat type should be relatively low? They are
>>> optional anyways, we can maybe define some stats as core
>>> (implementations are incomplete if they can't produce them) and others
>>> as non-core (not required for implementations, there can be optional
>>> configuration to either block writes that require producing the stats
>>> or just drop them).
>>
>> If we need to support stats for expressions/sort-orders, then we've
>> pretty much done the hard work already. The requirement would just be
>> that you need to project all the stats for an expression/sort order when
>> copying metadata entries (though I may be trivializing this and it's
>> harder than I expect).
>> I think the issue with a low-bar to add new stats is basically the same
>> effort as saying you need to support arbitrary/unknown stats carry-over
>> since older clients would either have to handle the unknown cases or
>> would end up dropping values. I think expressions are a better way to
>> handle customization because it wouldn't require the same consistency of
>> representation/interpretation as a formally adopted stat. The expression
>> would then really fall on whatever standard we set for portability which
>> provides more flexibility (yes, it shifts the burden, but we're going to
>> have to figure that out for expressions/udfs/etc anyway).
>>
>> I'm still open to debate on this but if we need to support
>> expressions/sort-orders it feels like a good path to both handling
>> customization as well as providing a path to standardization if we find
>> specific cases that are commonly reused as expressions.
>>
>> -Dan
>>
>> On Thu, Jul 24, 2025 at 12:03 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> After having thought about it some more, my current point of view is
>>> proceeding with something as simple as possible for V4 (I tried to
>>> formalize what I think the proposed algorithm is in the original
>>> proposal doc
>>> <https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0>
>>> [1]).
>>> If in the course of V4 development we find some flaw with the simple
>>> approach we can revise it (e.g. we run out of space). If something comes
>>> up after V4, we are not talking about a lot of code either way, so
>>> having a new scheme for V5+ would not be a major burden (all manifests
>>> are now written with a spec version, so detection is easy).
>>>
>>> Given the potential complications with custom stats, I think it is
>>> reasonable to allow implementations that want custom stats to use the
>>> upper bound of reserved offset range (e.g.
>>> we have 6 reserved out of 200 today, if implementations really need
>>> custom stats then they can start using offset 199, and then 198, etc).
>>> This poses a low risk of overlap in the short term, and I assume those
>>> using custom stats would have tight control over their environment
>>> anyways, so they have the ability to manage conflicts, compactions, in a
>>> way that fits them.
>>>
>>>> Another thing that both Russel and Ryan brought up is being able to
>>>> track stats for sort orders or expressions, but they don't share an id
>>>> space with field ids.
>>>
>>> Slightly off topic, but is there a reason we can't unify the field ID
>>> range for V4?
>>>
>>>> I feel like it would be better to work to formalize the stats so that
>>>> they are known and easier to project, but it's also hard to get
>>>> agreement for more complicated stats (like collations that have very
>>>> specific character set handling), but I think using expressions in lieu
>>>> of custom stats might address all of these cases and would be more
>>>> straightforward for the copy-forward requirement.
>>>
>>> Also off topic, but doesn't this just shift the burden of
>>> standardization to expressions? This might be controversial but maybe
>>> the bar for adding a new stat type should be relatively low? They are
>>> optional anyways, we can maybe define some stats as core
>>> (implementations are incomplete if they can't produce them) and others
>>> as non-core (not required for implementations, there can be optional
>>> configuration to either block writes that require producing the stats
>>> or just drop them).
>>>
>>> [1]
>>> https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0
>>>
>>> On Thu, Jul 24, 2025 at 11:19 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>>> The current proposal only leaves 10000+200 ids for other columns than
>>>>> stats.
>>>>> If in the future, we find some other feature which would require a
>>>>> manifest file column for every data column in the table, then we would
>>>>> need to change the spec.
>>>>
>>>> I do think we might want to put an upper bound on the column stats.
>>>> Ryan calculated the upper bound of what can be represented, but I don't
>>>> think we need to accommodate 10m+ field ids and that would block the
>>>> entire id range. It might make more sense to simply put an upper bound
>>>> on the stats space (e.g. 100k or 1m fields?). This would leave plenty
>>>> of space for future evolution of the spec without having to redefine
>>>> the stats range.
>>>>
>>>> Another thing that both Russel and Ryan brought up is being able to
>>>> track stats for sort orders or expressions, but they don't share an id
>>>> space with field ids. We might want to decide what the full stats space
>>>> should look like. For example:
>>>>
>>>> 8k+ sort orders
>>>> 9k+ expressions
>>>> 10k+ field ids
>>>> 1m+ <unreserved>
>>>> MAX_VALUE - 200 <reserved per spec>
>>>>
>>>> Since sort orders and expressions have much lower cardinality than
>>>> field ids, we can probably have a more constrained range.
>>>>
>>>> I'm leaning against custom stats because it does increase complexity
>>>> for all writers as Micah mentioned and introduces the potential for id
>>>> space collision. It would also easily compromise the performance of
>>>> engines if other writers drop them (via compaction or just any metadata
>>>> rewrite operation). I feel like it would be better to work to formalize
>>>> the stats so that they are known and easier to project, but it's also
>>>> hard to get agreement for more complicated stats (like collations that
>>>> have very specific character set handling), but I think using
>>>> expressions in lieu of custom stats might address all of these cases
>>>> and would be more straightforward for the copy-forward requirement.
>>>>
>>>> -Dan
>>>>
>>>> On Thu, Jul 24, 2025 at 4:03 AM Eduard Tudenhöfner
>>>> <eduard.tudenhoef...@databricks.com.invalid> wrote:
>>>>
>>>>>> 1. The current proposal only leaves 10000+200 ids for other
>>>>>>    columns than stats. If in the future, we find some other feature
>>>>>>    which would require a manifest file column for every data column
>>>>>>    in the table, then we would need to change the spec.
>>>>>
>>>>> For this I think we could start at *100,000* so that we use *100,000 +
>>>>> 200 * <fieldID>* to calculate the field ID of a given statistic.
>>>>>
>>>>>> 1. The current proposal expects every engine to share the same
>>>>>>    stats, and not store any "non-standard" stat in the metadata.
>>>>>
>>>>> We haven't explicitly stated it in the proposal but there were
>>>>> discussions on how to potentially support this and what implications
>>>>> it brings for readers/writers.
>>>>>
>>>>>> I'm still not clear on what the proposal is to handle stats for
>>>>>> reserved columns
>>>>>> <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I think
>>>>>> there was some mention in the notes but it was light on details). It
>>>>>> seems like it would be potentially useful to have stats for things
>>>>>> like _row_id, and the multiplication would overflow for these column
>>>>>> IDs (maybe this still yields unique column IDs though?)
>>>>>
>>>>> To handle stats for reserved columns we could start at *2,417,000,000*
>>>>> which should give us enough room to store 200 stats per metadata ID.
>>>>> We would also ensure that those ID ranges for table columns and
>>>>> reserved columns wouldn't overlap.
>>>>>
>>>>>> I assume we could put whatever these columns are under stats? Maybe
>>>>>> we just need a more generic name for the top level struct?
>>>>>
>>>>> I haven't updated the proposal yet, but I think renaming
>>>>> *column_stats* to *content_stats* would make sense.
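
The ID arithmetic discussed in the thread (*100,000 + 200 * <fieldID>*) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposal's actual algorithm: the names `stat_field_id`, `max_field_id`, `STATS_BASE` and `STATS_PER_FIELD` are hypothetical, the per-stat offset added on top of the formula is inferred from the "200 stats per ID" wording, and IDs are assumed to be signed 32-bit integers.

```python
# Sketch of the stat-ID scheme discussed in the thread. All names here are
# hypothetical; the base and stride come from the "100,000 + 200 * <fieldID>"
# formula, and the signed 32-bit ID limit is an assumption.

INT32_MAX = 2**31 - 1  # 2,147,483,647

STATS_BASE = 100_000    # first ID of the stats range (from the thread)
STATS_PER_FIELD = 200   # slots reserved per field (from the thread)


def stat_field_id(field_id, stat_offset, base=STATS_BASE, stride=STATS_PER_FIELD):
    """Return the ID of one statistic slot for a given table field ID."""
    if field_id < 0:
        raise ValueError("field_id must be non-negative")
    if not 0 <= stat_offset < stride:
        raise ValueError(f"stat_offset must be in [0, {stride})")
    result = base + stride * field_id + stat_offset
    if result > INT32_MAX:
        # Python ints do not overflow, so check the assumed limit explicitly.
        raise OverflowError(f"stat ID {result} exceeds the 32-bit range")
    return result


def max_field_id(base=STATS_BASE, stride=STATS_PER_FIELD):
    """Largest field ID whose full block of `stride` slots fits in 32 bits."""
    return (INT32_MAX - base - (stride - 1)) // stride


if __name__ == "__main__":
    print(stat_field_id(1, 0))    # first slot for field 1
    print(stat_field_id(1, 199))  # last slot for field 1
    print(max_field_id())         # largest field ID that still fits
```

Under these assumptions, a base of 100,000 and a stride of 200 leave room for a little over 10.7 million field ids, which is in line with the "10m+" figure mentioned in the thread. Under the same 32-bit assumption, a start value of 2,417,000,000 would not fit, since the maximum is 2,147,483,647.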