I won't be able to attend next Monday's meeting, so I will post my questions here. I have no problem continuing to use partition tuples; I'm just trying to understand the problem better. I will watch the recording later.

> Second, it has strict requirements to recover partition values: lower and upper bounds must be tight and bounds must be present for partition source columns. The last requirement is the biggest blocker because it would mean if we have a data file from v3 with a partition tuple but no stats, we could not correctly store it in v4.

I assume this does not pose a problem for the approach of storing the ranges of partition function output in column stats.

> We tried to make this work with a solution to both problems: store the range of values for the *output* of a partition function in column stats. ... This one is complicated: storing partition ranges in column stats requires storing several extra columns, tracking a table ID for the output of each partition function, and checking that lower_bound==upper_bound for each partition field’s stats.

1. partition tuples are not much different than extra columns
2. If we want to support expression column stats for derived/computed columns regardless of the partition tuple, it would require table ID for expressions.

On Fri, May 1, 2026 at 5:15 PM Ryan Blue <[email protected]> wrote:

> Hey everyone,
>
> We’ve had a lot of good discussion on the new manifest format in v4 and adding the new columnar metadata structures, but we still have an open issue: how to handle partition tuples. Amogh and I have been thinking about this and I think we have a good solution to propose.
>
> First, some background on the problem: Iceberg v3 and earlier encode the results of each partition field’s transform in a tuple of values for each data file, which is stored using a struct type. That struct type is specific to a partition spec, so manifests are written for one spec to have a uniform partition struct. Partitions are used for two purposes:
>
> 1. Filtering during scan planning: skipping data files by partition and metadata files by partition ranges
> 2. To match data files with equality deletes (deletes are scoped to a partition)
>
> Metadata changes in v4 mean that we need to write data files with different partition specs into the root manifest at the same time, which means we have to change how partition tracking works. We also need to replace partition ranges that v3 stores with an updated way to do manifest file pruning.
>
> Initially, I wanted to solve the problem by not storing partition tuples and instead recovering partition values from column ranges. This *almost* works. For example, consider a timestamp column with values in [ 2026-05-01 10:00:00, 2026-05-01 10:59:59 ]. The hour(ts) value is 2026-05-01-10 and we know it is partitioned because hour(lower_bound) is equal to hour(upper_bound). For monotonic functions, we can recover the partition value as long as the lower and upper bounds are tight bounds. And this also works for manifest filtering: keeping overall bounds for the ts column provides roughly the same metadata filtering as partition ranges. With this approach, filtering is simpler and we can still recover the partition values for equality delete matching.
>
> But there are problems with this approach. First, it doesn’t support non-monotonic functions, like bucket. Second, it has strict requirements to recover partition values: lower and upper bounds must be tight and bounds must be present for partition source columns. The last requirement is the biggest blocker because it would mean if we have a data file from v3 with a partition tuple but no stats, we could not correctly store it in v4.
>
> We tried to make this work with a solution to both problems: store the range of values for the *output* of a partition function in column stats. This is directly equivalent to partition ranges for manifests and for partitioned data files the lower and upper bound are always the same. We always have values that are the result of the partition functions, so this works for files without stats and for bucketing. This solution is still an option, but we think there is a simpler one. This one is complicated: storing partition ranges in column stats requires storing several extra columns, tracking a table ID for the output of each partition function, and checking that lower_bound==upper_bound for each partition field’s stats.
>
> If we retrace the logic to get to this point, there’s an alternative: keep storing a partition tuple for data files. We need to use a struct that is the union of all partition fields, but we already do this in other places. A partition tuple is a few fields, not a struct per field with multiple fields within. And we already have all the metadata needed to track the data, without needing to add table field IDs for partition fields.
>
> I originally wanted to get rid of the partition tuple, but now I think it is simpler to keep it. It requires changing fewer code paths for a new representation. And it is cleaner to remove the partition tuple if we remove the dependency on it by updating or removing equality deletes: we would just remove one struct.
>
> Using the partition tuple and the new column stats appears to me as the simplest solution. The one remaining issue is that metadata filtering on bucket fields is not addressed, but I think we can still introduce a range of bucket values to address this, independent of the decision to remove or keep partition tuples.
>
> Please think about this and comment. I think keeping the partition tuple is the simplest plan, but we are planning to cover this in Monday’s v4 metadata sync so please come if you want to discuss!
>
> Ryan
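
For anyone following along, the recovery argument in the quoted message (a partition value can be recovered when hour(lower_bound) equals hour(upper_bound), but not for bucket and not without stats) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Iceberg code: `hour_transform`, `toy_bucket`, and `recover_partition` are hypothetical names, bounds are assumed to be tight, and the toy bucket is a plain modulo rather than Iceberg's actual bucket hash.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_transform(ts: datetime) -> int:
    """Whole hours since the Unix epoch; monotonic in ts."""
    return (ts - EPOCH) // timedelta(hours=1)


def toy_bucket(value: int, num_buckets: int = 4) -> int:
    """Toy non-monotonic transform (plain modulo, NOT Iceberg's bucket hash)."""
    return value % num_buckets


def recover_partition(
    lower: Optional[Any],
    upper: Optional[Any],
    transform: Callable[[Any], int],
) -> Optional[int]:
    """Try to recover a partition value from column bounds.

    Assumes the bounds are tight. The result is only trustworthy when
    `transform` is monotonic; returns None when stats are missing or
    the bounds map to different partition values.
    """
    if lower is None or upper is None:
        return None  # no stats, e.g. a v3 file with only a partition tuple
    lo, hi = transform(lower), transform(upper)
    return lo if lo == hi else None


lower = datetime(2026, 5, 1, 10, 0, 0, tzinfo=timezone.utc)
upper = datetime(2026, 5, 1, 10, 59, 59, tzinfo=timezone.utc)

# Monotonic case: both bounds fall in the same hour, so the value is recovered.
recovered = recover_partition(lower, upper, hour_transform)

# Missing stats: nothing to recover from.
no_stats = recover_partition(None, None, hour_transform)

# Non-monotonic case: bounds 1 and 5 land in the same toy bucket, but a
# value of 3 inside that range lands in a different one, so equal
# transformed bounds do not prove the file holds a single bucket.
misleading = recover_partition(1, 5, toy_bucket)
```

The last case is the reason the check cannot be applied to bucket: matching transformed bounds are a false positive there, whereas a stored partition tuple (or stored ranges of the transform output) does not depend on the source column's bounds at all.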
