Hi, everyone -

As part of v4, we are adding aggregate column stats to the root manifests.
I wrote a short discussion doc
<https://docs.google.com/document/d/1glCxPNWHWmlxc5ULBcpxmsgOKR6i4Y4RErDRZD7vuJc/edit?tab=t.0>
on a couple of topics in this area:

   - Aggregation rules: define how these aggregate stats are computed.
   - Missing stats: column stats at the file level are optional, so a naive
   aggregation can lead to false pruning. We need a mechanism to avoid it.
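
To make the second point concrete, here is a minimal Python sketch (not taken from the doc). `FileStats`, `naive_bounds`, `conservative_bounds`, and `can_prune` are hypothetical names used only for illustration, and the conservative rule shown is just one possible way to avoid the problem, not necessarily what the v4 spec will adopt.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file stats for one column.

    Bounds are None when the writer did not record them.
    """
    lower: Optional[int]
    upper: Optional[int]


Bounds = Optional[Tuple[int, int]]


def naive_bounds(files: Sequence[FileStats]) -> Bounds:
    # Skips files without stats, so the result may not cover their values.
    known = [f for f in files if f.lower is not None and f.upper is not None]
    if not known:
        return None
    return min(f.lower for f in known), max(f.upper for f in known)


def conservative_bounds(files: Sequence[FileStats]) -> Bounds:
    # Assumption: a single file without stats makes the aggregate unknown.
    if any(f.lower is None or f.upper is None for f in files):
        return None
    return naive_bounds(files)


def can_prune(bounds: Bounds, value: int) -> bool:
    # For a filter `col = value`: prune only when bounds are known
    # and the value falls outside them.
    if bounds is None:
        return False
    lower, upper = bounds
    return value < lower or value > upper


# The second file has no stats and may contain any value, including 50.
files = [FileStats(1, 10), FileStats(None, None)]
```

With the naive bounds, a filter like `col = 50` would prune both files even though the second file could contain 50. The conservative version reports the bounds as unknown and so never prunes in that case.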

The doc covers these with examples and has some options for the v4 spec.

I'm looking for feedback on the approach, especially on using
`null_count` as a sentinel versus the alternatives. Please feel free to
comment directly on the doc. I've also added this as an agenda item for the
next v4 metadata tree community sync.

Best,
Anoop
