Hey everyone,

We’ve had a lot of good discussion on the new manifest format in v4 and adding the new columnar metadata structures, but we still have an open issue: how to handle partition tuples. Amogh and I have been thinking about this and I think we have a good solution to propose.

First, some background on the problem: Iceberg v3 and earlier encode the results of each partition field’s transform in a tuple of values for each data file, which is stored using a struct type. That struct type is specific to a partition spec, so manifests are written for one spec to have a uniform partition struct. Partitions are used for two purposes:

1. Filtering during scan planning: skipping data files by partition and metadata files by partition ranges
2. Matching data files with equality deletes (deletes are scoped to a partition)

Metadata changes in v4 mean that we need to write data files with different partition specs into the root manifest at the same time, which means we have to change how partition tracking works. We also need to replace the partition ranges that v3 stores with an updated way to do manifest file pruning.

Initially, I wanted to solve the problem by not storing partition tuples and instead recovering partition values from column ranges. This *almost* works. For example, consider a timestamp column with values in [ 2026-05-01 10:00:00, 2026-05-01 10:59:59 ]. The hour(ts) value is 2026-05-01-10 and we know the file is partitioned because hour(lower_bound) is equal to hour(upper_bound). For monotonic functions, we can recover the partition value as long as the lower and upper bounds are tight bounds. This also works for manifest filtering: keeping overall bounds for the ts column provides roughly the same metadata filtering as partition ranges. With this approach, filtering is simpler and we can still recover the partition values for equality delete matching.

But there are problems with this approach. First, it doesn’t support non-monotonic functions, like bucket. Second, it has strict requirements to recover partition values: lower and upper bounds must be tight and bounds must be present for partition source columns.

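To make the recovery rule and its limits concrete, here is a rough Python sketch. It is only an illustration, not code from any Iceberg implementation: recover_partition_value is a hypothetical helper, hour follows the general shape of the hour transform (whole hours since the Unix epoch), and toy_bucket is a simple modulo stand-in for a non-monotonic function, not Iceberg's actual hash-based bucket.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour(ts: datetime) -> int:
    """Whole hours since the Unix epoch (monotonic in ts)."""
    return (ts - EPOCH) // timedelta(hours=1)


def recover_partition_value(
    lower: Optional[Any], upper: Optional[Any], transform: Callable[[Any], int]
) -> Optional[int]:
    """Hypothetical helper: recover a partition value from column bounds.

    Only meaningful when the transform is monotonic and the bounds are tight.
    Returns None if a bound is missing or the bounds map to different values.
    """
    if lower is None or upper is None:
        return None  # a file without stats cannot be recovered
    lo, hi = transform(lower), transform(upper)
    return lo if lo == hi else None


def toy_bucket(value: int, n: int = 4) -> int:
    """Stand-in for a non-monotonic transform (NOT Iceberg's bucket hash)."""
    return value % n


lower = datetime(2026, 5, 1, 10, 0, 0, tzinfo=timezone.utc)
upper = datetime(2026, 5, 1, 10, 59, 59, tzinfo=timezone.utc)

# Both bounds fall in the same hour, so the partition value is recoverable.
recovered = recover_partition_value(lower, upper, hour)

# One second later the range crosses an hour boundary: nothing to recover.
spans_two_hours = recover_partition_value(lower, upper + timedelta(seconds=1), hour)

# No stats at all: nothing to recover, even if v3 stored a partition tuple.
missing_stats = recover_partition_value(None, None, hour)

# Non-monotonic case: the bounds 0 and 4 map to the same bucket, yet the
# values between them land in every bucket, so equal bounds prove nothing.
buckets_in_range = {toy_bucket(v) for v in range(0, 5)}
```

The last case is why the helper must never be applied to bucket-like functions: equal transformed bounds would return a value even though the file is not confined to that bucket.
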
The last requirement, that bounds be present for partition source columns, is the biggest blocker: a v3 data file with a partition tuple but no stats could not be stored correctly in v4.

We tried to make this work with a solution to both problems: store the range of values for the *output* of a partition function in column stats. This is directly equivalent to partition ranges for manifests, and for partitioned data files the lower and upper bound are always the same. We always have values that are the result of the partition functions, so this works for files without stats and for bucketing.

This solution is still an option, but we think there is a simpler one. The stats-based approach is complicated: storing partition ranges in column stats requires storing several extra columns, tracking a table field ID for the output of each partition function, and checking that lower_bound == upper_bound for each partition field’s stats.

If we retrace the logic to get to this point, there’s an alternative: keep storing a partition tuple for data files. We need to use a struct that is the union of all partition fields, but we already do this in other places. A partition tuple is a few fields, not a struct per field with multiple fields within. And we already have all the metadata needed to track the data, without needing to add table field IDs for partition fields.

I originally wanted to get rid of the partition tuple, but now I think it is simpler to keep it. It requires changing fewer code paths for a new representation. It is also cleaner to remove the partition tuple later, if we remove the dependency on it by updating or removing equality deletes: we would just remove one struct.

Using the partition tuple and the new column stats looks to me like the simplest solution. The one remaining issue is that metadata filtering on bucket fields is not addressed, but I think we can still introduce a range of bucket values to address this, independent of the decision to remove or keep partition tuples.

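For anyone who wants a picture of the union struct, here is a minimal Python sketch. Everything in it is an assumption for illustration: the spec IDs, field IDs, and field names are made up, and it assumes that fields not present in a file's spec are simply left null in the union tuple.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

PartitionField = Tuple[int, str]  # (partition field ID, name)

# Hypothetical partition specs, keyed by spec ID.
SPECS: Dict[int, List[PartitionField]] = {
    0: [(1000, "ts_hour")],
    1: [(1000, "ts_hour"), (1001, "id_bucket")],
    2: [(1002, "ts_day")],
}


def union_fields(specs: Dict[int, List[PartitionField]]) -> List[PartitionField]:
    """All partition fields across all specs, deduplicated by field ID."""
    seen: Dict[int, str] = {}
    for spec_id in sorted(specs):
        for field_id, name in specs[spec_id]:
            seen.setdefault(field_id, name)
    return sorted(seen.items())


def to_union_tuple(
    spec_id: int,
    values: Sequence[Any],
    specs: Dict[int, List[PartitionField]],
    union: List[PartitionField],
) -> Tuple[Optional[Any], ...]:
    """Place a spec-specific tuple into the union layout, null elsewhere."""
    by_id = {field_id: v for (field_id, _), v in zip(specs[spec_id], values)}
    return tuple(by_id.get(field_id) for field_id, _ in union)


UNION = union_fields(SPECS)

# Files written with different specs share one struct layout, so they can
# sit side by side in the same manifest.
file_a = to_union_tuple(1, [493786, 3], SPECS, UNION)
file_b = to_union_tuple(2, [20574], SPECS, UNION)
```

With this layout each data file still carries a few flat values, one slot per partition field, rather than a lower and upper bound struct per field.
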
Please think about this and comment. I think keeping the partition tuple is the simplest plan, but we are planning to cover this in Monday’s v4 metadata sync, so please come if you want to discuss!

Ryan
