Hey everyone,

We’ve had a lot of good discussion on the new manifest format in v4 and
adding the new columnar metadata structures, but we still have an open
issue: how to handle partition tuples. Amogh and I have been thinking about
this and I think we have a good solution to propose.

First, some background on the problem: Iceberg v3 and earlier encode the
results of each partition field’s transform in a tuple of values for each
data file, which is stored using a struct type. That struct type is
specific to a partition spec, so manifests are written for one spec to have
a uniform partition struct. Partitions are used for two purposes:

   1. Filtering during scan planning: skipping data files by partition and
   metadata files by partition ranges
   2. Matching data files with equality deletes (deletes are scoped to a
   partition)

Metadata changes in v4 mean that we need to write data files with different
partition specs into the root manifest at the same time, which means we
have to change how partition tracking works. We also need to replace
partition ranges that v3 stores with an updated way to do manifest file
pruning.

Initially, I wanted to solve the problem by not storing partition tuples
and instead recovering partition values from column ranges. This *almost*
works. For example, consider a timestamp column with values in [ 2026-05-01
10:00:00, 2026-05-01 10:59:59 ]. The hour(ts) value is 2026-05-01-10 and we
know the file falls in a single hour(ts) partition because
hour(lower_bound) is equal to hour(upper_bound). For monotonic functions,
we can recover the partition
value as long as the lower and upper bounds are tight bounds. And this also
works for manifest filtering: keeping overall bounds for the ts column
provides roughly the same metadata filtering as partition ranges. With this
approach, filtering is simpler and we can still recover the partition
values for equality delete matching.
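
As a rough illustration of the recovery rule above, here is a minimal
Python sketch. The names hour_transform and recover_partition_value are
made up for this example and are not Iceberg APIs; the sketch assumes tight
bounds and timestamps without a time zone.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional

EPOCH = datetime(1970, 1, 1)


def hour_transform(ts: datetime) -> int:
    # Whole hours since the Unix epoch; monotonic in ts.
    return (ts - EPOCH) // timedelta(hours=1)


def recover_partition_value(
    lower: Optional[datetime],
    upper: Optional[datetime],
    transform: Callable[[datetime], int],
) -> Optional[int]:
    # Hypothetical helper: only valid for monotonic transforms and tight bounds.
    if lower is None or upper is None:
        return None  # no stats for the source column, nothing to recover
    lo, hi = transform(lower), transform(upper)
    # For a monotonic transform, equal results at both bounds mean every
    # value in between maps to the same partition value.
    return lo if lo == hi else None


lower = datetime(2026, 5, 1, 10, 0, 0)
upper = datetime(2026, 5, 1, 10, 59, 59)
recovered = recover_partition_value(lower, upper, hour_transform)

# Bounds that cross an hour boundary do not yield a single partition value.
spans_two_hours = recover_partition_value(
    lower, datetime(2026, 5, 1, 11, 0, 0), hour_transform
)
```

Here recovered is the hour value shared by every row in the file, while
spans_two_hours is None because the bounds land in different hours.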

But there are problems with this approach. First, it doesn’t support
non-monotonic functions, like bucket. Second, it has strict requirements to
recover partition values: lower and upper bounds must be tight and bounds
must be present for partition source columns. The last requirement is the
biggest blocker because it would mean if we have a data file from v3 with a
partition tuple but no stats, we could not correctly store it in v4.
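
To show why the bounds check breaks down for non-monotonic functions, here
is a small Python sketch. toy_bucket is a deliberately simple stand-in
(value modulo the bucket count) and is NOT Iceberg's hash-based bucket
transform; the file contents are invented for the example.

```python
def toy_bucket(value: int, num_buckets: int = 4) -> int:
    # Stand-in for a non-monotonic transform; not Iceberg's bucket function.
    return value % num_buckets


# Tight bounds of a hypothetical file that is not partitioned by bucket.
values_in_file = [4, 5, 6, 7, 8]
lower, upper = min(values_in_file), max(values_in_file)

# Both bounds land in the same bucket...
bounds_agree = toy_bucket(lower) == toy_bucket(upper)

# ...but the rows in between do not, so equal results at the bounds say
# nothing about the rest of the file.
all_rows_agree = len({toy_bucket(v) for v in values_in_file}) == 1
```

In this sketch bounds_agree is True while all_rows_agree is False, which is
exactly the case a monotonic function rules out.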

We tried to make this work with a solution to both problems: store the
range of values for the *output* of a partition function in column stats.
This is directly equivalent to partition ranges for manifests, and for
partitioned data files the lower and upper bounds are always the same. We
always have values that are the result of the partition functions, so this
works for files without stats and for bucketing. This solution is still an
option, but we think there is a simpler one. The stats-based approach is
complicated: storing partition ranges in column stats requires storing
several extra columns, tracking a table field ID for the output of each
partition function, and checking that lower_bound==upper_bound for each
partition field’s stats.
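
For reference, the stats-based option could look roughly like the Python
sketch below. OutputRange, partition_value, and merge are hypothetical
names for this example only, and the sketch assumes integer transform
results such as bucket numbers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OutputRange:
    # Range of a partition function's *output*, stored like column stats.
    lower: int
    upper: int


def partition_value(stats: OutputRange) -> Optional[int]:
    # A data file is partitioned on this field only if the range collapses
    # to a single value (lower_bound == upper_bound).
    return stats.lower if stats.lower == stats.upper else None


def merge(ranges: Iterable[OutputRange]) -> OutputRange:
    # Manifest-level range: the overall bounds across the tracked files.
    ranges = list(ranges)
    return OutputRange(
        min(r.lower for r in ranges), max(r.upper for r in ranges)
    )


file_a = OutputRange(3, 3)  # data file in partition value 3
file_b = OutputRange(7, 7)  # data file in partition value 7
manifest_range = merge([file_a, file_b])
```

The per-file ranges collapse to a single value, while manifest_range covers
3 through 7 and plays the role of the v3 partition range.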

If we retrace the logic to get to this point, there’s an alternative: keep
storing a partition tuple for data files. We need to use a struct that is
the union of all partition fields, but we already do this in other places.
A partition tuple is a few fields, not a struct per field with multiple
fields within. And we already have all the metadata needed to track the
data, without needing to add table field IDs for partition fields.
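
A minimal Python sketch of the union idea follows. The spec contents, the
field IDs, and the helper names are invented for illustration, and leaving
fields from other specs as None is an assumption of the sketch, not
something decided here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical partition specs: partition field ID -> field name.
SPEC_0: Dict[int, str] = {1000: "ts_day"}
SPEC_1: Dict[int, str] = {1001: "ts_hour", 1002: "id_bucket"}


def union_fields(specs: List[Dict[int, str]]) -> List[int]:
    # Field IDs of the union struct, in a stable order.
    return sorted({field_id for spec in specs for field_id in spec})


def to_union_tuple(
    values: Dict[int, int], fields: List[int]
) -> Tuple[Optional[int], ...]:
    # Assumption: fields that are not part of the file's spec are left as None.
    return tuple(values.get(field_id) for field_id in fields)


fields = union_fields([SPEC_0, SPEC_1])

# One file written with each spec, both stored with the same struct shape.
file_in_spec_0 = to_union_tuple({1000: 20574}, fields)
file_in_spec_1 = to_union_tuple({1001: 493786, 1002: 5}, fields)
```

Both tuples have one slot per partition field across all specs, so files
from different specs can sit side by side in the same manifest.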

I originally wanted to get rid of the partition tuple, but now I think it
is simpler to keep it. It requires changing fewer code paths for a new
representation. It is also cleaner to remove the partition tuple later: if
we drop the dependency on it by updating or removing equality deletes, we
would just remove one struct.

Using the partition tuple and the new column stats looks to me like the
simplest solution. The one remaining issue is that metadata filtering on
bucket fields is not addressed, but I think we can still introduce a range
of bucket values to address this, independent of the decision to remove or
keep partition tuples.

Please think about this and comment. I think keeping the partition tuple is
the simplest plan, but we are planning to cover this in Monday’s v4
metadata sync so please come if you want to discuss!

Ryan
