I've been reading the variant shredding spec (https://github.com/apache/parquet-format/blob/master/VariantShredding.md) and the variant binary encoding spec (https://github.com/apache/parquet-format/blob/master/VariantEncoding.md), trying to understand how they work in practice, with the goal of implementing support in DataFusion (https://github.com/apache/datafusion).
One thing that's not clear to me (because I don't fully understand Parquet below the logical level exposed by readers) is how this data gets encoded from the viewpoint of readers. If I have the metadata, the value, and a shredded column, what do the logical columns look like to an existing reader? Or would an existing reader fail to read such a Parquet file entirely? It would be helpful if the spec had some sample data attached to it (perhaps linked from https://github.com/apache/parquet-testing). My working assumption is that shredded columns must be "standalone" columns that existing readers can read in order for stats pruning, etc. to work; I've sketched what I imagine that looks like at the end of this message.

Assuming that's the case, my first step in tackling this is going to be enabling per-file filter rewriting/pushdown in DataFusion, because it would be useful not only for variant shredding but also for other use cases, and it is relatively self-contained and free of breaking changes: https://github.com/apache/datafusion/pull/15057.

I mention this in case anyone has insights into how this was implemented in Spark that could serve as guidance. Thanks!
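
P.S. To make the question about logical columns concrete, below is a rough sketch (in Rust, using the arrow-schema crate) of the logical shape I imagine an existing reader might surface for a variant column "v" shredded on a single int64 field "a". The metadata / value / typed_value names come from my reading of the shredding spec; the exact nesting and nullability here are my own assumptions, which is precisely what I'd like to confirm.

```rust
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() {
    // One struct per shredded object field: a residual `value` (for values
    // that don't match the shredded type) plus the strongly typed column.
    let field_a = DataType::Struct(Fields::from(vec![
        Field::new("value", DataType::Binary, true),
        Field::new("typed_value", DataType::Int64, true),
    ]));

    // The shredded object: a struct with one child per shredded field.
    let typed_value = DataType::Struct(Fields::from(vec![
        Field::new("a", field_a, false),
    ]));

    // Top-level variant column: metadata + unshredded remainder + typed_value.
    let v = DataType::Struct(Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, true),
        Field::new("typed_value", typed_value, true),
    ]));

    let schema = Schema::new(vec![Field::new("v", v, true)]);
    println!("{schema:#?}");
}
```

If that is roughly right, then v.typed_value.a.typed_value would be an ordinary int64 leaf with its own column chunk and statistics, which is what I expect would make per-file pruning on shredded fields possible.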
