I won't be able to attend next Monday's meeting, so I will post my
questions here. I have no problem continuing to use partition tuples; I'm
just trying to understand the problem better. I will watch the recording
later.

> Second, it has strict requirements to recover partition values: lower and
> upper bounds must be tight and bounds must be present for partition source
> columns. The last requirement is the biggest blocker because it would mean
> if we have a data file from v3 with a partition tuple but no stats, we
> could not correctly store it in v4.

I assume this does not pose a problem for the approach of storing the
ranges of partition function output in column stats.
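
To check that I am reading the two approaches correctly, here is a minimal
Python sketch of how I understand them. All names are hypothetical and the
hour transform is a stand-in that returns a string like 2026-05-01-10; this
is not Iceberg's actual API or encoding, only an illustration of the logic
described in the quoted message.

```python
from datetime import datetime
from typing import Optional, TypeVar

T = TypeVar("T")


def hour_transform(ts: datetime) -> str:
    # Hypothetical stand-in for hour(ts); the string form is an assumption
    # made for readability, not the real encoding.
    return ts.strftime("%Y-%m-%d-%H")


def recover_from_source_bounds(
    lower: Optional[datetime], upper: Optional[datetime]
) -> Optional[str]:
    """Recover the partition value from tight bounds on the source column.

    Returns None when bounds are missing (e.g. a v3 file without stats) or
    when the bounds span more than one hour.
    """
    if lower is None or upper is None:
        return None
    lo, hi = hour_transform(lower), hour_transform(upper)
    return lo if lo == hi else None


def recover_from_output_range(lower_out: T, upper_out: T) -> Optional[T]:
    """Recover the partition value from the stored range of the transform
    output. For a partitioned data file both ends are the same value, which
    can be taken from the existing partition tuple, so source column stats
    are not needed. This also covers non-monotonic transforms like bucket.
    """
    return lower_out if lower_out == upper_out else None
```

With the example from the quoted message, the source bounds
[2026-05-01 10:00:00, 2026-05-01 10:59:59] recover 2026-05-01-10, while a
file with no bounds recovers nothing in the first function. The second
function only needs the transform output, which is why I assume the
missing-stats case is not a blocker there.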

> We tried to make this work with a solution to both problems: store the
> range of values for the *output* of a partition function in column stats.
> ... This one is complicated: storing partition ranges in column stats
> requires storing several extra columns, tracking a table ID for the output
> of each partition function, and checking that lower_bound==upper_bound for
> each partition field’s stats.

1. Partition tuples are not much different from extra columns.
2. If we want to support expression column stats for derived/computed
columns regardless of the partition tuple, that would require table IDs for
expressions.


On Fri, May 1, 2026 at 5:15 PM Ryan Blue <[email protected]> wrote:

> Hey everyone,
>
> We’ve had a lot of good discussion on the new manifest format in v4 and
> adding the new columnar metadata structures, but we still have an open
> issue: how to handle partition tuples. Amogh and I have been thinking about
> this and I think we have a good solution to propose.
>
> First, some background on the problem: Iceberg v3 and earlier encode the
> results of each partition field’s transform in a tuple of values for each
> data file, which is stored using a struct type. That struct type is
> specific to a partition spec, so manifests are written for one spec to have
> a uniform partition struct. Partitions are used for two purposes:
>
>    1. Filtering during scan planning: skipping data files by partition
>    and metadata files by partition ranges
>    2. To match data files with equality deletes (deletes are scoped to a
>    partition)
>
> Metadata changes in v4 mean that we need to write data files with
> different partition specs into the root manifest at the same time, which
> means we have to change how partition tracking works. We also need to
> replace partition ranges that v3 stores with an updated way to do manifest
> file pruning.
>
> Initially, I wanted to solve the problem by not storing partition tuples
> and instead recovering partition values from column ranges. This *almost*
> works. For example, consider a timestamp column with values in [
> 2026-05-01 10:00:00, 2026-05-01 10:59:59 ]. The hour(ts) value is
> 2026-05-01-10 and we know it is partitioned because hour(lower_bound) is
> equal to hour(upper_bound). For monotonic functions, we can recover the
> partition value as long as the lower and upper bounds are tight bounds. And
> this also works for manifest filtering: keeping overall bounds for the ts
> column provides roughly the same metadata filtering as partition ranges.
> With this approach, filtering is simpler and we can still recover the
> partition values for equality delete matching.
>
> But there are problems with this approach. First, it doesn’t support
> non-monotonic functions, like bucket. Second, it has strict requirements
> to recover partition values: lower and upper bounds must be tight and
> bounds must be present for partition source columns. The last requirement
> is the biggest blocker because it would mean if we have a data file from v3
> with a partition tuple but no stats, we could not correctly store it in v4.
>
> We tried to make this work with a solution to both problems: store the
> range of values for the *output* of a partition function in column stats.
> This is directly equivalent to partition ranges for manifests and for
> partitioned data files the lower and upper bound are always the same. We
> always have values that are the result of the partition functions, so this
> works for files without stats and for bucketing. This solution is still an
> option, but we think there is a simpler one. This one is complicated:
> storing partition ranges in column stats requires storing several extra
> columns, tracking a table ID for the output of each partition function, and
> checking that lower_bound==upper_bound for each partition field’s stats.
>
> If we retrace the logic to get to this point, there’s an alternative: keep
> storing a partition tuple for data files. We need to use a struct that is
> the union of all partition fields, but we already do this in other places.
> A partition tuple is a few fields, not a struct per field with multiple
> fields within. And we already have all the metadata needed to track the
> data, without needing to add table field IDs for partition fields.
>
> I originally wanted to get rid of the partition tuple, but now I think it
> is simpler to keep it. It requires changing fewer code paths for a new
> representation. And it is cleaner to remove the partition tuple if we
> remove the dependency on it by updating or removing equality deletes: we
> would just remove one struct.
>
> Using the partition tuple and the new column stats appears to me as the
> simplest solution. The one remaining issue is that metadata filtering on
> bucket fields is not addressed, but I think we can still introduce a range
> of bucket values to address this, independent of the decision to remove or
> keep partition tuples.
>
> Please think about this and comment. I think keeping the partition tuple
> is the simplest plan, but we are planning to cover this in Monday’s v4
> metadata sync so please come if you want to discuss!
>
> Ryan
>
