I watched the recording. Ryan's arguments make sense (especially on where
we spend the effort). I am on board with keeping the partition tuple for now.

I also agree with Russell's point about limiting partition tuples to
equality deletes in v4 and extending the stats approach to cover
non-monotonic bucketing transforms and multi-arg transforms for pruning.

On Mon, May 4, 2026 at 2:51 PM Russell Spitzer <[email protected]>
wrote:

> As we discussed in the community sync, I recommend we keep the partition
> tuple for now. It's the simplest way to maintain the guarantees needed for
> equality deletes.
>
> Going forward, we shouldn't rely on these values for filtering (imho) and
> should instead work to extend the stats struct approach to cover bucketing,
> non-range-preserving, and multi-arg transforms. To this end, I would try to
> make sure none of our v4 planning code interacts with the tuple directly,
> except when falling back for v3-based logic. Isolating tuple access this
> way means we can cleanly remove it later without reworking v4 planning
> paths.
>
> In my ideal world we drop the tuple and equality deletes, but this seems
> like the way to make progress now while leaving the door open to remove the
> tuple before v4 is finalized.
>
> On Mon, May 4, 2026 at 10:00 AM Anoop Johnson <[email protected]> wrote:
>
>> Amogh,
>>
>> That is a good point. But the partition and stats-based evaluation paths
>> are typically separate. For partition evaluation, we compare against an
>> exact value, and for stats-based pruning, we look at the range of values in
>> the column stats.
>>
>> Even if we store partition values in the content stats, it would follow
>> the partition evaluation path. The new v4 manifest reader would just need
>> to look at the partition value's lower_bound in the content stats instead
>> of an explicit partition tuple field. The partition evaluator itself will
>> be unchanged.
>>
>> This is conceptually no different than the current partition tuple.
>> Storing it in content_stats with only lower_bound preserves the same
>> semantics, but aligns with how the rest of the column stats are stored.
>>
>> But let's discuss the tradeoffs of the various options. Looking forward
>> to the discussion in an hour.
>>
>> Best,
>> Anoop
>>
>> On Sun, May 3, 2026 at 6:45 PM Amogh Jahagirdar <[email protected]> wrote:
>>
>>> I realized I gave a poor example of the semantic issue with removing the
>>> upper bound for partition outputs. The crux is that, under that model,
>>> the stats on partition outputs would be treated specially: a null upper
>>> bound would mean "it's partitioned" rather than "unknown", which is
>>> inconsistent with the other stats.
>>>
