Ryan, Steven, Anoop,

Thanks for kicking off the discussion here, Ryan, and thanks Steven and
Anoop for the thoughtful responses.

I've been mulling this over after our offline discussion and I'm still
leaning toward the stats-only approach where content stats include the
partition output largely as was proposed on the original AMT doc.

The core tradeoff I see is maintaining two representations of partition
information (tuples + bucket range stats in manifests) versus unifying
everything in stats at the cost of broader filter codepath changes. Also,
as Anoop pointed out, we can minimize metadata bloat for stats on partition
outputs by storing only lower_bound and scoping down to the relevant stats.

In terms of how I look at the upfront implementation complexity tradeoff,
we would need to solve bucket range pruning of manifests regardless of the
approach we take. I think that means we're necessarily implementing
pruning based on bucket ranges stored in stats. Once that range
pruning implementation exists, the incremental complexity of using stats
for all partition values (not just buckets) seems lower than maintaining
effectively two parallel representations.
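To make the range-pruning idea concrete, here's a rough sketch of what I
mean. All of the names, the stats layout, and the toy bucket function are
illustrative only (Iceberg's real bucket transform uses a Murmur3 hash),
not actual APIs:

```python
# Hypothetical sketch: pruning manifests using a bucket-output range kept
# in stats, assuming each manifest records (lower, upper) of bucket values
# per partition field. Field ID 1000 and the dict layout are made up.

def bucket(value: int, n: int) -> int:
    """Toy bucket transform; real Iceberg buckets via a Murmur3-based hash."""
    return hash(value) % n

def prune_manifests(manifests, field_id, query_value, n_buckets):
    """For an `id = <value>` predicate, keep only manifests whose stored
    bucket range could contain the bucket of the queried value."""
    target = bucket(query_value, n_buckets)
    kept = []
    for m in manifests:
        lower, upper = m["stats"][field_id]  # range of bucket outputs
        if lower <= target <= upper:
            kept.append(m)
    return kept
```

The point being: once this path exists for bucket ranges, extending it to
other partition outputs stored in stats is incremental.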

With partition tuples, we'd have:

   - Data files store partition tuples + column stats
   - Manifests store bucket ranges in stats + column stats
   - Writers populate both tuples and other column stats
   - Readers use tuples for equality delete matching and a combined
   partition tuple + data filter via column stats for pruning

With the stats-only approach (partition outputs in stats, indexed in the
table field ID space):

   - Partition field values go in content stats alongside column stats
   - Pruning of expressions in planning applies at both the data file and
   manifest level
   - Equality delete matching and scan planning extract partition values
   from stats

The second approach is more change upfront, but I believe it's largely
modifying the same pruning paths we'd need to change anyway for bucket
range pruning. The stats-only model keeps partition information in one
place rather than splitting it between tuples and stats.
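For what it's worth, here's an illustrative sketch of how a reader could
rebuild a partition tuple from per-field stats for equality delete
matching. The field IDs and dict layout are hypothetical, not Iceberg's
real structures:

```python
# Hypothetical layout: content stats as {field_id: {"lower_bound": value}},
# with partition outputs stored like any other column stat.

def partition_tuple(content_stats, partition_field_ids):
    """Rebuild a file's partition tuple from stats; when only lower_bound
    is stored per partition field, the bound *is* the partition value."""
    return tuple(content_stats[fid]["lower_bound"]
                 for fid in partition_field_ids)

def delete_applies(data_stats, delete_stats, partition_field_ids):
    """An equality delete is scoped to a partition, so it can only apply
    to data files whose partition tuple matches the delete file's."""
    return (partition_tuple(data_stats, partition_field_ids)
            == partition_tuple(delete_stats, partition_field_ids))
```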

On the future format evolution point: if equality deletes are eventually
dropped, the stats-only approach is just as simple as (if not simpler
than) removing partition tuples. In this approach, a writer can just keep
producing whatever stats it wants to.

All that said, I'm flexible if we end up keeping the tuple. I'm mainly
looking at this from a lens of simplifying the metadata representation.

Happy to discuss more on tomorrow's sync!

Thanks,

Amogh Jahagirdar


On Sun, May 3, 2026 at 1:12 PM Anoop Johnson <[email protected]> wrote:

> Ryan, thank you for starting the discussion and the analysis of the
> options. Building on Steven's observation, I was wondering if we can
> simplify to using content stats for the output of partition functions.
>
> V4 content stats already uses typed, per-field stat structs with a
> dynamically-built schema. We conditionally include only the fields relevant
> to each column type. For instance, there are no `nan_count` for ints, no
> `avg_size` for non-string types, etc. We could use the same mechanism for
> partition values. Specifically:
>
> 1. Only store `lower_bound` for partition entries. `upper_bound` is
> redundant and null/NaN/size stats are probably not useful. These
> unused fields will be physically absent from the manifest schema.
> 2. Partition fields get IDs from the same space as schema fields, so they
> are just regular entries in content stats with no special handling.
> 3. Writers must populate these entries for all partition fields.
>
>   This addresses the original concerns Ryan had with this approach:
>
>    - No extra columns: there is only one typed field (lower_bound) per
>    partition field.
>    - Since the upper bound is not stored, there is no need to check if
>    the upper bound and lower bound are the same.
>    - Table ID for expressions: Partition fields IDs just get table IDs
>    like other fields.
>
> If this works, it would simplify the spec by treating partition values as
> a special case of column stats rather than a separate concept.
>
> Thoughts? We can discuss this at the sync tomorrow. I will be there.
>
> Best,
> Anoop
>
> On Sat, May 2, 2026 at 2:55 PM Steven Wu <[email protected]> wrote:
>
>> I won't be able to attend next Monday's meeting. So I will post my
>> questions here. I have no problem continuing to use partition tuples. I'm
>> just trying to understand the problem better. I will watch the recording
>> later.
>>
>> > Second, it has strict requirements to recover partition values: lower
>> and upper bounds must be tight and bounds must be present for partition
>> source columns. The last requirement is the biggest blocker because it
>> would mean if we have a data file from v3 with a partition tuple but no
>> stats, we could not correctly store it in v4.
>>
>> I assume this does not pose a problem for the approach of storing the
>> ranges of partition function output in column stats.
>>
>> > We tried to make this work with a solution to both problems: store the
>> range of values for the *output* of a partition function in column
>> stats. ... This one is complicated: storing partition ranges in column
>> stats requires storing several extra columns, tracking a table ID for the
>> output of each partition function, and checking that
>> lower_bound==upper_bound for each partition field’s stats.
>>
>> 1. partition tuples are not much different than extra columns
>> 2. If we want to support expression column stats for derived/computed
>> columns regardless of the partition tuple, it would require table ID for
>> expressions.
>>
>>
>> On Fri, May 1, 2026 at 5:15 PM Ryan Blue <[email protected]> wrote:
>>
>>> Hey everyone,
>>>
>>> We’ve had a lot of good discussion on the new manifest format in v4 and
>>> adding the new columnar metadata structures, but we still have an open
>>> issue: how to handle partition tuples. Amogh and I have been thinking about
>>> this and I think we have a good solution to propose.
>>>
>>> First, some background on the problem: Iceberg v3 and earlier encode the
>>> results of each partition field’s transform in a tuple of values for each
>>> data file, which is stored using a struct type. That struct type is
>>> specific to a partition spec, so manifests are written for one spec to have
>>> a uniform partition struct. Partitions are used for two purposes:
>>>
>>>    1. Filtering during scan planning: skipping data files by partition
>>>    and metadata files by partition ranges
>>>    2. To match data files with equality deletes (deletes are scoped to
>>>    a partition)
>>>
>>> Metadata changes in v4 mean that we need to write data files with
>>> different partition specs into the root manifest at the same time, which
>>> means we have to change how partition tracking works. We also need to
>>> replace partition ranges that v3 stores with an updated way to do manifest
>>> file pruning.
>>>
>>> Initially, I wanted to solve the problem by not storing partition tuples
>>> and instead recovering partition values from column ranges. This
>>> *almost* works. For example, consider a timestamp column with values in [
>>> 2026-05-01 10:00:00, 2026-05-01 10:59:59 ]. The hour(ts) value is
>>> 2026-05-01-10 and we know it is partitioned because hour(lower_bound)
>>> is equal to hour(upper_bound). For monotonic functions, we can recover
>>> the partition value as long as the lower and upper bounds are tight bounds.
>>> And this also works for manifest filtering: keeping overall bounds for the
>>> ts column provides roughly the same metadata filtering as partition
>>> ranges. With this approach, filtering is simpler and we can still recover
>>> the partition values for equality delete matching.
>>>
>>> But there are problems with this approach. First, it doesn’t support
>>> non-monotonic functions, like bucket. Second, it has strict
>>> requirements to recover partition values: lower and upper bounds must be
>>> tight and bounds must be present for partition source columns. The last
>>> requirement is the biggest blocker because it would mean if we have a data
>>> file from v3 with a partition tuple but no stats, we could not correctly
>>> store it in v4.
>>>
>>> We tried to make this work with a solution to both problems: store the
>>> range of values for the *output* of a partition function in column
>>> stats. This is directly equivalent to partition ranges for manifests and
>>> for partitioned data files the lower and upper bound are always the same.
>>> We always have values that are the result of the partition functions, so
>>> this works for files without stats and for bucketing. This solution is
>>> still an option, but we think there is a simpler one. This one is
>>> complicated: storing partition ranges in column stats requires storing
>>> several extra columns, tracking a table ID for the output of each partition
>>> function, and checking that lower_bound==upper_bound for each partition
>>> field’s stats.
>>>
>>> If we retrace the logic to get to this point, there’s an alternative:
>>> keep storing a partition tuple for data files. We need to use a struct that
>>> is the union of all partition fields, but we already do this in other
>>> places. A partition tuple is a few fields, not a struct per field with
>>> multiple fields within. And we already have all the metadata needed to
>>> track the data, without needing to add table field IDs for partition fields.
>>>
>>> I originally wanted to get rid of the partition tuple, but now I think
>>> it is simpler to keep it. It requires changing fewer code paths for a new
>>> representation. And it is cleaner to remove the partition tuple if we
>>> remove the dependency on it by updating or removing equality deletes: we
>>> would just remove one struct.
>>>
>>> Using the partition tuple and the new column stats appears to me as the
>>> simplest solution. The one remaining issue is that metadata filtering on
>>> bucket fields is not addressed, but I think we can still introduce a range
>>> of bucket values to address this, independent of the decision to remove or
>>> keep partition tuples.
>>>
>>> Please think about this and comment. I think keeping the partition tuple
>>> is the simplest plan, but we are planning to cover this in Monday’s v4
>>> metadata sync so please come if you want to discuss!
>>>
>>> Ryan
>>>
>>