I've been reading the variant shredding spec (
https://github.com/apache/parquet-format/blob/master/VariantShredding.md)
and the variant binary encoding spec (
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md),
trying to understand how they work in practice, with the goal of
implementing support in DataFusion (https://github.com/apache/datafusion).

One thing that's not clear to me (because I don't fully understand Parquet
below the logical level exposed by readers) is how this data gets encoded
from the viewpoint of readers. If I have the metadata, the value, and a
shredded column, what do the logical columns look like to an existing
reader? Or would an existing reader fail to read such a Parquet file
entirely? It would be helpful if the spec had some sample data attached to it
(maybe linked to https://github.com/apache/parquet-testing).
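
For reference, here is my rough mental model of what the file schema looks
like for a shredded variant column, written as a small Rust snippet using the
parquet crate's schema parser. The table and field names (event_table, event,
event_type) are made up, and I've left off the VARIANT logical-type annotation
on the group since I'm not sure the schema parser understands it yet. Please
correct me if this shape is wrong:

use parquet::schema::parser::parse_message_type;

fn main() {
    // My reading of the shredding spec: an existing reader sees an ordinary
    // group column with binary `metadata` and `value` fields plus a
    // `typed_value` group containing one sub-group per shredded field.
    let message_type = "
      message event_table {
        OPTIONAL group event {
          REQUIRED BYTE_ARRAY metadata;
          OPTIONAL BYTE_ARRAY value;
          OPTIONAL group typed_value {
            REQUIRED group event_type {
              OPTIONAL BYTE_ARRAY value;
              OPTIONAL BYTE_ARRAY typed_value (UTF8);
            }
          }
        }
      }
    ";
    let schema = parse_message_type(message_type).expect("valid schema");
    println!("{schema:?}");
}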

My thought is that shredded columns must be "standalone" columns that can
be read by existing readers in order for stats pruning, etc. to work.
Assuming that's the case, my first step in tackling this problem is going to
be enabling per-file filter rewriting/pushdown in DataFusion, because it
would be useful not only for this but also for other use cases, and it is
relatively self-contained and free of breaking changes:
https://github.com/apache/datafusion/pull/15057. I mention this in case
anyone has insights into how this was implemented in Spark that could serve
as helpful guidance.
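
To make the idea concrete, below is a toy sketch (plain Rust, not actual
DataFusion code) of the kind of per-file rewrite I have in mind: given which
fields are shredded in a particular file, a predicate on a variant path could
be rewritten to point at the shredded leaf column so that normal stats pruning
applies. All names here (variant_get, event, event_type, user_id) are made up
for illustration:

use std::collections::HashSet;

/// Map a variant path like "$.event_type" to the shredded leaf column
/// ("event.typed_value.event_type.typed_value") if that field is shredded
/// in this file; otherwise return None and fall back to the binary `value`.
fn shredded_column_for(
    variant_column: &str,
    path: &str,
    shredded_fields: &HashSet<String>,
) -> Option<String> {
    let field = path.strip_prefix("$.")?;
    if shredded_fields.contains(field) {
        Some(format!("{variant_column}.typed_value.{field}.typed_value"))
    } else {
        None
    }
}

fn main() {
    let shredded: HashSet<String> = ["event_type".to_string()].into_iter().collect();

    // e.g. variant_get(event, '$.event_type') = 'login'
    //   -> event.typed_value.event_type.typed_value = 'login' (for this file only)
    assert_eq!(
        shredded_column_for("event", "$.event_type", &shredded).as_deref(),
        Some("event.typed_value.event_type.typed_value")
    );
    // '$.user_id' is not shredded in this file, so no rewrite is possible.
    assert_eq!(shredded_column_for("event", "$.user_id", &shredded), None);
}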

Thanks!
