This one didn't work because I hadn't subscribed. Here's the actual discussion: https://lists.apache.org/thread/d86mxdvc1zv8ng0gfkmvq3qg5h11y14k
On Tue, Mar 25, 2025 at 3:35 PM Adrian Garcia Badaracco <[email protected]> wrote:

> I've been reading the variant shredding spec
> (https://github.com/apache/parquet-format/blob/master/VariantShredding.md)
> and the variant binary encoding spec
> (https://github.com/apache/parquet-format/blob/master/VariantEncoding.md),
> trying to understand how they work in practice, with the goal of
> implementing support in DataFusion (https://github.com/apache/datafusion).
>
> One thing that's not clear to me (because I don't fully understand Parquet
> below the logical level exposed by readers) is how this data gets encoded
> from the viewpoint of readers. If I have the metadata, the value, and a
> shredded column, what do the logical columns look like to an existing
> reader? Or would an existing reader fail to read such a Parquet file
> entirely? It would be helpful if the spec had some sample data attached.
>
> My thinking is that shredded columns must be "standalone" columns that can
> be read by existing readers in order for stats pruning, etc. to work.
> Assuming that's the case, my first step in tackling this is going to be
> enabling per-file filter rewriting/pushdown in DataFusion, because it
> would be useful not only for this but also for other use cases, and it is
> relatively self-contained and free of breaking changes:
> https://github.com/apache/datafusion/pull/15057. I mention this in case
> anyone has insights into how this was implemented in Spark that might
> offer helpful guidance.
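For anyone else trying to picture it, here is a minimal sketch (Rust, using the
parquet crate's schema parser) of the physical schema I believe shredding
produces, based on my reading of VariantShredding.md. The column name `event`
and the shredded field `event_ts` are made up for illustration, and I've left
off the VARIANT logical type annotation so that parsers predating it still
accept the schema; the point is that an existing reader just sees ordinary
binary and int64 columns:

    // Sketch only: shows the physical layout I believe VariantShredding.md
    // describes, not a confirmed implementation. Requires the `parquet` crate.
    use parquet::schema::parser::parse_message_type;
    use parquet::schema::printer::print_schema;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical variant column `event` with one shredded field,
        // `event_ts`. Per the spec, the outer group would also carry the
        // VARIANT logical type annotation; it is omitted here so that
        // readers that predate the annotation can still parse the schema.
        let message = "
            message event_log {
                optional group event {
                    required binary metadata;
                    optional binary value;
                    optional group typed_value {
                        required group event_ts {
                            optional binary value;
                            optional int64 typed_value (TIMESTAMP(MICROS,true));
                        }
                    }
                }
            }
        ";
        let schema = parse_message_type(message)?;
        // Print the parsed schema back out, as an existing reader sees it.
        print_schema(&mut std::io::stdout(), &schema);
        Ok(())
    }

If that reading is right, the shredded `typed_value` leaf is an ordinary int64
column to any existing reader, which is exactly what would let stats pruning
work on it.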
