Ryan, thank you for starting the discussion and the analysis of the options. Building on Steven's observation, I was wondering if we can simplify this by using content stats for the output of partition functions.
V4 content stats already use typed, per-field stat structs with a dynamically built schema. We conditionally include only the fields relevant to each column type: for instance, there is no `nan_count` for ints, no `avg_size` for non-string types, etc. We could use the same mechanism for partition values. Specifically:

1. Only store `lower_bound` for partition entries. `upper_bound` is redundant, and null/NaN/size stats are probably not useful. The unused fields will be physically absent from the manifest schema.
2. Partition fields get IDs from the same space as schema fields, so they are just regular entries in content stats with no special handling.
3. Writers must populate these entries for all partition fields.

This addresses the original concerns Ryan had with this approach:

- No extra columns: there is only one typed field (`lower_bound`) per partition field.
- Since the upper bound is not stored, there is no need to check whether the upper and lower bounds are the same.
- Table IDs for expressions: partition fields just get table IDs like other fields.

If this works, it would simplify the spec by treating partition values as a special case of column stats rather than a separate concept.

Thoughts? We can discuss this at the sync tomorrow. I will be there.

Best,
Anoop

On Sat, May 2, 2026 at 2:55 PM Steven Wu <[email protected]> wrote:

> I won't be able to attend next Monday's meeting, so I will post my
> questions here. I have no problem continuing to use partition tuples; I'm
> just trying to understand the problem better. I will watch the recording
> later.
>
> Second, it has strict requirements to recover partition values: lower
> and upper bounds must be tight and bounds must be present for partition
> source columns. The last requirement is the biggest blocker because it
> would mean if we have a data file from v3 with a partition tuple but no
> stats, we could not correctly store it in v4.
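As a concrete illustration of the content-stats proposal above, here is a minimal sketch in Python. The entry and field names are hypothetical, chosen for illustration only; they are not from the Iceberg v4 spec, and the real schema-building mechanism works at the file-format level rather than with runtime optionals.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch: content stats as a map from field ID to a typed
# stats entry, where the fields present depend on the column's type and
# role. In the real manifest schema the unused fields would be physically
# absent, not merely None.

@dataclass
class StatsEntry:
    lower_bound: Any                   # present for all fields
    upper_bound: Optional[Any] = None  # omitted for partition fields
    null_count: Optional[int] = None   # omitted for partition fields
    nan_count: Optional[int] = None    # only for float/double columns
    avg_size: Optional[int] = None     # only for variable-width columns

def partition_entry(value: Any) -> StatsEntry:
    """A partition field's entry stores only lower_bound: the transform
    output is a single value per file, so upper_bound is redundant."""
    return StatsEntry(lower_bound=value)

# Partition field IDs come from the same ID space as schema field IDs,
# so partition entries sit alongside ordinary column entries with no
# special handling.
content_stats = {
    1: StatsEntry(lower_bound=5, upper_bound=90, null_count=0),  # data column
    1000: partition_entry("2026-05-01-10"),                      # hour(ts) partition field
}
```

Readers only ever consult `lower_bound` for partition fields, which is what removes the `lower_bound == upper_bound` check from the earlier design.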
>
> I assume this does not pose a problem for the approach of storing the
> ranges of partition function output in column stats.
>
> We tried to make this work with a solution to both problems: store the
> range of values for the *output* of a partition function in column stats.
> ... This one is complicated: storing partition ranges in column stats
> requires storing several extra columns, tracking a table ID for the output
> of each partition function, and checking that lower_bound==upper_bound for
> each partition field’s stats.
>
> 1. Partition tuples are not much different from extra columns.
> 2. If we want to support expression column stats for derived/computed
> columns regardless of the partition tuple, it would require table IDs for
> expressions.
>
> On Fri, May 1, 2026 at 5:15 PM Ryan Blue <[email protected]> wrote:
>
>> Hey everyone,
>>
>> We’ve had a lot of good discussion on the new manifest format in v4 and
>> adding the new columnar metadata structures, but we still have an open
>> issue: how to handle partition tuples. Amogh and I have been thinking about
>> this and I think we have a good solution to propose.
>>
>> First, some background on the problem: Iceberg v3 and earlier encode the
>> results of each partition field’s transform in a tuple of values for each
>> data file, which is stored using a struct type. That struct type is
>> specific to a partition spec, so manifests are written for a single spec
>> and have a uniform partition struct. Partitions are used for two purposes:
>>
>> 1. Filtering during scan planning: skipping data files by partition
>> and metadata files by partition ranges
>> 2. Matching data files with equality deletes (deletes are scoped to a
>> partition)
>>
>> Metadata changes in v4 mean that we need to write data files with
>> different partition specs into the root manifest at the same time, which
>> means we have to change how partition tracking works.
>> We also need to replace the partition ranges that v3 stores with an
>> updated way to do manifest file pruning.
>>
>> Initially, I wanted to solve the problem by not storing partition tuples
>> and instead recovering partition values from column ranges. This *almost*
>> works. For example, consider a timestamp column with values in
>> [2026-05-01 10:00:00, 2026-05-01 10:59:59]. The hour(ts) value is
>> 2026-05-01-10, and we know the file is partitioned because hour(lower_bound)
>> is equal to hour(upper_bound). For monotonic functions, we can recover the
>> partition value as long as the lower and upper bounds are tight bounds. And
>> this also works for manifest filtering: keeping overall bounds for the ts
>> column provides roughly the same metadata filtering as partition ranges.
>> With this approach, filtering is simpler and we can still recover the
>> partition values for equality delete matching.
>>
>> But there are problems with this approach. First, it doesn’t support
>> non-monotonic functions, like bucket. Second, it has strict requirements
>> to recover partition values: lower and upper bounds must be tight and
>> bounds must be present for partition source columns. The last requirement
>> is the biggest blocker because it would mean if we have a data file from v3
>> with a partition tuple but no stats, we could not correctly store it in v4.
>>
>> We tried to make this work with a solution to both problems: store the
>> range of values for the *output* of a partition function in column
>> stats. This is directly equivalent to partition ranges for manifests, and
>> for partitioned data files the lower and upper bounds are always the same.
>> We always have values that are the result of the partition functions, so
>> this works for files without stats and for bucketing. This solution is
>> still an option, but we think there is a simpler one.
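The monotonic-recovery idea described above (recovering hour(ts) when the transform agrees on both tight bounds) can be sketched as follows. The helper names are hypothetical; the truncation stands in for Iceberg's hour() transform, which actually produces hours since the epoch:

```python
from datetime import datetime
from typing import Optional

def hour_transform(ts: datetime) -> datetime:
    # Truncate to the hour. Iceberg's real hour() transform yields hours
    # since the epoch, but truncation is equivalent for this check.
    return ts.replace(minute=0, second=0, microsecond=0)

def recover_hour_partition(lower: datetime, upper: datetime) -> Optional[datetime]:
    """Recover the hour(ts) partition value from tight column bounds.

    This works only because hour() is monotonic: if the transform agrees
    on both tight bounds, every value in between maps to the same hour.
    Returns None when the bounds span more than one hour (the file is not
    partitioned on this field, or the bounds are not tight enough)."""
    lo, hi = hour_transform(lower), hour_transform(upper)
    return lo if lo == hi else None

# The example from the message: values in [10:00:00, 10:59:59] recover
# the single partition value 2026-05-01 10:00.
assert recover_hour_partition(
    datetime(2026, 5, 1, 10, 0, 0),
    datetime(2026, 5, 1, 10, 59, 59),
) == datetime(2026, 5, 1, 10)

# A file spanning two hours cannot yield a single partition value.
assert recover_hour_partition(
    datetime(2026, 5, 1, 10, 0, 0),
    datetime(2026, 5, 1, 11, 30, 0),
) is None
```

No such check exists for a non-monotonic transform like bucket, which is the first problem noted above: source-column bounds say nothing about which bucket(s) a file contains.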
>> This one is complicated: storing partition ranges in column stats requires
>> storing several extra columns, tracking a table ID for the output of each
>> partition function, and checking that lower_bound==upper_bound for each
>> partition field’s stats.
>>
>> If we retrace the logic to get to this point, there’s an alternative:
>> keep storing a partition tuple for data files. We need to use a struct that
>> is the union of all partition fields, but we already do this in other
>> places. A partition tuple is a few fields, not a struct per field with
>> multiple fields within. And we already have all the metadata needed to
>> track the data, without needing to add table field IDs for partition fields.
>>
>> I originally wanted to get rid of the partition tuple, but now I think it
>> is simpler to keep it. It requires changing fewer code paths than a new
>> representation would. And it is cleaner to remove the partition tuple once
>> we remove the dependency on it by updating or removing equality deletes: we
>> would just remove one struct.
>>
>> Using the partition tuple and the new column stats appears to me to be the
>> simplest solution. The one remaining issue is that metadata filtering on
>> bucket fields is not addressed, but I think we can still introduce a range
>> of bucket values to address this, independent of the decision to remove or
>> keep partition tuples.
>>
>> Please think about this and comment. I think keeping the partition tuple
>> is the simplest plan, but we are planning to cover this in Monday’s v4
>> metadata sync, so please come if you want to discuss!
>>
>> Ryan
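For context on why bucket fields need their own range for metadata filtering: Iceberg's bucket transform hashes the source value and takes the result modulo the bucket count, so ordered source values land in unrelated buckets. A rough sketch, using a stand-in hash for illustration (the Iceberg spec actually requires 32-bit Murmur3):

```python
import hashlib

def bucket(num_buckets: int, value: str) -> int:
    # Stand-in hash for illustration only; the Iceberg spec mandates
    # 32-bit Murmur3 with specific byte encodings per type.
    h = int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")
    return (h & 0x7FFFFFFF) % num_buckets

# Consecutive source values map to scattered bucket numbers, so min/max
# bounds on the source column tell a reader nothing about which buckets a
# file contains. Metadata filtering on bucket fields therefore needs an
# explicit range of bucket values, as suggested above.
print([bucket(16, day) for day in ["2026-05-01", "2026-05-02", "2026-05-03"]])
```

The transform is deterministic, so a reader that knows the bucket-value range for a manifest can still prune it for an equality predicate on the source column.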
