Re: [DISCUSS] Partition tuples in v4

Amogh Jahagirdar Sun, 03 May 2026 18:17:33 -0700

On second thought, I think there's a semantic issue with removing
upper_bound for partition outputs, so I'd rather not do that even
though it does reduce the metadata footprint.


I think storing only lower_bound for partition outputs means we need a
special rule: "a file is partitioned if the output lower_bound is set and
upper_bound is null". This constrains the model and changes the semantics
of stats specifically for partition outputs.

Keeping both lower and upper bound preserves consistent statistic semantics
and doesn't assume that stats on transform outputs necessarily mean the
file is partitioned. For example, a file could have hour(ts) values
spanning the range (shown as timestamps for readability, not the actual
integer values) [2026-05-03 at 10 PM, 2026-05-04 at 12 AM], which reflects
clustering on the hour transform without strict partitioning. With only
lower_bound, we'd have to treat any transform output stats as indicating
partitioning, which may be overly constraining.
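
To make the distinction concrete, here is a rough Python sketch of how a
reader could interpret transform output stats when both bounds are kept.
Everything in it is hypothetical: TransformStats, is_partitioned and
may_match are illustrative names, not part of the v4 spec or any Iceberg
API, and the integers are made-up hour ordinals. It also assumes that
"partitioned" would be expressed as lower_bound == upper_bound, which is
my reading rather than something we've settled on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransformStats:
    """Hypothetical stats for one transform output (e.g. hour(ts)) in a file."""
    lower_bound: Optional[int]
    upper_bound: Optional[int]


def is_partitioned(stats: TransformStats) -> bool:
    # Assumption: with both bounds kept, a file is partitioned on the
    # transform when every row maps to a single output value.
    return stats.lower_bound is not None and stats.lower_bound == stats.upper_bound


def may_match(stats: TransformStats, value: int) -> bool:
    # Ordinary min/max pruning; no special case for partition outputs.
    if stats.lower_bound is None or stats.upper_bound is None:
        return True  # missing stats: the file cannot be skipped
    return stats.lower_bound <= value <= stats.upper_bound


# Made-up hour ordinals, not real hour(ts) values.
partitioned_file = TransformStats(lower_bound=100, upper_bound=100)
clustered_file = TransformStats(lower_bound=100, upper_bound=102)
```

Here clustered_file still carries useful stats for pruning, but nothing
forces a reader to treat it as partitioned.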

Thanks,

Amogh Jahagirdar
