Was there discussion on whether the tuple or stats will be used for identity partition columns in column projection [1]? This is an edge case we support for migrated Hive tables.
Thanks,
Micah

[1] https://iceberg.apache.org/spec/#column-projection

On Wed, May 6, 2026 at 1:46 PM Steven Wu <[email protected]> wrote:

> I watched the recording. Ryan's arguments make sense (especially on where
> we spend the effort). I am onboard with keeping the partition tuple for now.
>
> I also agree with Russell's point about limiting partition tuples only to
> equality deletes in v4 and extending the stats approach to cover
> non-monotonic bucketing transforms and multi-arg transforms for pruning.
>
> On Mon, May 4, 2026 at 2:51 PM Russell Spitzer <[email protected]>
> wrote:
>
>> As we discussed in the community sync, I recommend we keep the partition
>> tuple for now. It's the simplest way to maintain the guarantees needed for
>> equality deletes.
>>
>> Going forward, we shouldn't rely on these values for filtering (imho) and
>> should instead work to extend the stats struct approach to cover bucketing,
>> non-range-preserving, and multi-arg transforms. To this end, I would try to
>> make sure none of our v4 planning code interacts with the tuple directly,
>> except when falling back for v3-based logic. Isolating tuple access this
>> way means we can cleanly remove it later without reworking v4 planning
>> paths.
>>
>> In my ideal world we drop the tuple and equality deletes, but this seems
>> like the way to make progress now while leaving the door open to remove the
>> tuple before v4 is finalized.
>>
>> On Mon, May 4, 2026 at 10:00 AM Anoop Johnson <[email protected]> wrote:
>>
>>> Amogh,
>>>
>>> That is a good point. But the partition and stats-based evaluation paths
>>> are typically separate. For partition evaluation, we compare against an
>>> exact value, and for stats-based pruning, we look at the range of values in
>>> the column stats.
>>>
>>> Even if we store partition values in the content stats, it would follow
>>> the partition evaluation path. The new V4 manifest reader would just need
>>> to look at the partition value's lower_bound in the content stats instead
>>> of an explicit partition tuple field. The partition evaluator itself will
>>> be unchanged.
>>>
>>> This is conceptually no different than the current partition tuple.
>>> Storing it in content_stats with only lower_bound preserves the same
>>> semantics, but aligns with how the rest of the column stats are stored.
>>>
>>> But let's discuss the tradeoffs of the various options. Looking forward
>>> to the discussion in an hour.
>>>
>>> Best,
>>> Anoop
>>>
>>> On Sun, May 3, 2026 at 6:45 PM Amogh Jahagirdar <[email protected]>
>>> wrote:
>>>
>>>> I realized I gave a poor example of the semantic issue with removing
>>>> upper bound for partition outputs, but the crux is that in that
>>>> modeling the stats on partition outputs would be treated in a special way
>>>> where upper bound being null means it's partitioned rather than "unknown",
>>>> which is inconsistent with the other stats.
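
To make the two evaluation paths discussed above concrete, here is a minimal sketch in Python. It is not Iceberg code: `ColumnStats`, `partition_matches`, and `stats_may_match` are hypothetical names invented for illustration, and integer values are assumed for simplicity. It contrasts an exact-value partition check with a range check over bounds, and shows what happens when a partition value stored as a lower bound only is passed through a generic range check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's entry in content stats."""
    lower_bound: Optional[int]
    upper_bound: Optional[int]


def partition_matches(partition_value: int, wanted: int) -> bool:
    """Partition path: the file holds a single value, so compare exactly."""
    return partition_value == wanted


def stats_may_match(stats: ColumnStats, wanted: int) -> bool:
    """Stats path: keep the file unless the bounds rule `wanted` out.

    A missing bound is read as "unknown", so it cannot prune anything.
    """
    if stats.lower_bound is not None and wanted < stats.lower_bound:
        return False
    if stats.upper_bound is not None and wanted > stats.upper_bound:
        return False
    return True


# A partition value stored as a lower bound only, as proposed in the thread.
as_stats = ColumnStats(lower_bound=7, upper_bound=None)

# Exact comparison against the stored value.
exact = partition_matches(as_stats.lower_bound, 9)

# Generic range check: the null upper bound is treated as "unknown".
ranged = stats_may_match(as_stats, 9)
```

Under these assumptions the exact comparison rejects the file while the generic range check keeps it, which is why a reader would have to route such entries to the partition path, or otherwise treat a null upper bound specially, as raised in the thread.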
