This one didn't work because I hadn't subscribed. Here's the actual discussion: https://lists.apache.org/thread/d86mxdvc1zv8ng0gfkmvq3qg5h11y14k
On Tue, Mar 25, 2025 at 3:35 PM Adrian Garcia Badaracco <[email protected]> wrote:

> I've been reading the variant shredding spec
> (https://github.com/apache/parquet-format/blob/master/VariantShredding.md)
> and the variant binary encoding spec
> (https://github.com/apache/parquet-format/blob/master/VariantEncoding.md),
> trying to understand how they work in practice, with the goal of
> implementing support in DataFusion (https://github.com/apache/datafusion).
>
> One thing that's not clear to me (because I don't fully understand Parquet
> below the logical level exposed by readers) is how this data gets encoded
> from the viewpoint of readers. If I have the metadata, the value, and a
> shredded column, what do the logical columns look like to an existing
> reader? Or would an existing reader fail to read such a Parquet file
> entirely? It would be helpful if the spec had some sample data attached.
>
> My thinking is that shredded columns must be "standalone" columns that can
> be read by existing readers in order for stats pruning, etc. to work.
> Assuming that's the case, my first step in tackling this is going to be
> enabling per-file filter rewriting/pushdown in DataFusion, because it
> would be useful not only for this but also for other use cases, and it is
> relatively self-contained and free of breaking changes:
> https://github.com/apache/datafusion/pull/15057. I mention this in case
> anyone has insights into how this was implemented in Spark that might
> offer helpful guidance.
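For anyone else trying to picture it, here is a minimal sketch (Rust, using the
parquet crate's schema parser) of the physical schema I believe shredding
produces, based on my reading of VariantShredding.md. The column name `event`
and the shredded field `event_ts` are made up for illustration, and I've left
off the VARIANT logical type annotation so that parsers predating it still
accept the schema; the point is that an existing reader just sees ordinary
binary and int64 columns:

    // Sketch only: shows the physical layout I believe VariantShredding.md
    // describes, not a confirmed implementation. Requires the `parquet` crate.
    use parquet::schema::parser::parse_message_type;
    use parquet::schema::printer::print_schema;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical variant column `event` with one shredded field,
        // `event_ts`. Per the spec, the outer group would also carry the
        // VARIANT logical type annotation; it is omitted here so that
        // readers that predate the annotation can still parse the schema.
        let message = "
            message event_log {
                optional group event {
                    required binary metadata;
                    optional binary value;
                    optional group typed_value {
                        required group event_ts {
                            optional binary value;
                            optional int64 typed_value (TIMESTAMP(MICROS,true));
                        }
                    }
                }
            }
        ";
        let schema = parse_message_type(message)?;
        // Print the parsed schema back out, as an existing reader sees it.
        print_schema(&mut std::io::stdout(), &schema);
        Ok(())
    }

If that reading is right, the shredded `typed_value` leaf is an ordinary int64
column to any existing reader, which is exactly what would let stats pruning
work on it.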
