As we discussed in the community sync, I recommend we keep the partition
tuple for now. It's the simplest way to maintain the guarantees needed for
equality deletes.

Going forward, we shouldn't rely on these values for filtering (imho) and
should instead work to extend the stats struct approach to cover bucketing,
non-range-preserving, and multi-arg transforms. To this end, I would try to
make sure none of our v4 planning code interacts with the tuple directly,
except when falling back for v3-based logic. Isolating tuple access this
way means we can cleanly remove it later without reworking v4 planning
paths.
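
To make the isolation idea concrete, here is a minimal sketch in Python. It
is only an illustration under assumptions: every name in it (FieldStats,
ManifestEntry, LegacyTuple, v4_may_match, v3_may_match) is hypothetical and
does not correspond to the actual Iceberg API or the v4 spec. The v4 path
prunes only on stats bounds, and the tuple is read in exactly one place that
the v3 fallback goes through.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class FieldStats:
    """Hypothetical per-field stats entry (not the real v4 stats struct)."""
    lower_bound: Any = None
    upper_bound: Any = None


@dataclass
class ManifestEntry:
    """Hypothetical entry that carries both representations for now."""
    partition_tuple: Optional[Tuple[Any, ...]] = None  # kept for equality deletes
    content_stats: Dict[int, FieldStats] = field(default_factory=dict)


class LegacyTuple:
    """The single place that reads the partition tuple directly."""

    @staticmethod
    def get(entry: ManifestEntry, pos: int) -> Any:
        return entry.partition_tuple[pos]


def v4_may_match(entry: ManifestEntry, field_id: int, value: Any) -> bool:
    """Stats-based pruning: keep the file unless the bounds rule the value out."""
    stats = entry.content_stats.get(field_id)
    if stats is None or stats.lower_bound is None or stats.upper_bound is None:
        return True  # unknown bounds, so the file cannot be pruned
    return stats.lower_bound <= value <= stats.upper_bound


def v3_may_match(entry: ManifestEntry, pos: int, value: Any) -> bool:
    """v3 fallback: exact comparison against the partition tuple."""
    return LegacyTuple.get(entry, pos) == value
```

With this shape, removing the tuple later would mean deleting LegacyTuple
and v3_may_match, while v4_may_match stays untouched.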

In my ideal world we drop the tuple and equality deletes, but this seems
like the way to make progress now while leaving the door open to remove the
tuple before v4 is finalized.

On Mon, May 4, 2026 at 10:00 AM Anoop Johnson <[email protected]> wrote:

> Amogh,
>
> That is a good point. But the partition and stats-based evaluation paths
> are typically separate. For partition evaluation, we compare against an
> exact value, and for stats-based pruning, we look at the range of values in
> the column stats.
>
> Even if we store partition values in the content stats, it would follow
> the partition evaluation path. The new V4 manifest reader would just need
> to look at the partition value's lower_bound in the content stats instead
> of an explicit partition tuple field. The partition evaluator itself will
> be unchanged.
>
> This is conceptually no different than the current partition tuple.
> Storing it in content_stats with only lower_bound preserves the same
> semantics, but aligns with how the rest of the column stats are stored.
>
> But let's discuss the tradeoffs of the various options. Looking forward
> to the discussion in an hour.
>
> Best,
> Anoop
>
> On Sun, May 3, 2026 at 6:45 PM Amogh Jahagirdar <[email protected]> wrote:
>
>> I realized I gave a poor example of the semantic issue with removing the
>> upper bound for partition outputs. The crux is that, under that modeling,
>> stats on partition outputs would be treated specially: a null upper bound
>> would mean "partitioned" rather than "unknown", which is inconsistent with
>> the other stats.
>>
