[DISCUSS] Variant shredding specification

Gene Pang Mon, 03 Jun 2024 09:09:20 -0700

Hi all,

We have been working on the Variant data type, which is designed to store
and process semi-structured data efficiently, even with heterogeneous
values. Users can store and process semi-structured data in a flexible way,
without having to specify or know any fixed schema on write. Variant data
is encoded in a self-describing format
<https://github.com/apache/spark/blob/master/common/variant/README.md>, and
the binary format uses offset-based encoding to speed up navigation within
a value.
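
To illustrate why offset-based encoding helps navigation, here is a minimal
toy sketch. It is not the Variant binary format from the README above; the
layout and the names encode_toy and get_toy are made up for illustration.
The point is only that a table of offsets lets a reader jump straight to the
i-th element without decoding the elements before it.

```python
import struct

def encode_toy(values):
    """Toy layout (an assumption, not Variant): u32 count, count+1 u32 offsets, then bytes."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    header = struct.pack("<I", len(values))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(values)

def get_toy(buf, i):
    """Return element i by reading two offsets, without scanning earlier elements."""
    (n,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= i < n:
        raise IndexError(i)
    # Offsets i and i+1 delimit element i inside the value area.
    start, end = struct.unpack_from("<2I", buf, 4 + 4 * i)
    base = 4 + 4 * (n + 1)  # value area starts after the count and the offset table
    return buf[base + start : base + end]

buf = encode_toy([b"a", b"bcd", b""])
print(get_toy(buf, 1))
```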


To further improve performance, we are also working on shredding, which is
the process of extracting some of the Variant fields from the binary value
and storing them in separate columns. We have written a specification for Variant
shredding <https://github.com/apache/spark/pull/46831> to augment the
existing Variant specification.

The benefits of shredding include:
- more compact data encoding
- min/max statistics for data skipping
- I/O and CPU savings from pruning fields that a query does not access
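
As a rough illustration of the idea, here is a small conceptual sketch. It
uses plain Python dicts as a stand-in for Variant values and does not follow
the column layout in the specification PR; shred, min_max, and can_skip are
hypothetical helpers. It pulls one field into a typed column, leaves
everything else in a residual value, and uses min/max statistics to decide
whether a batch can be skipped for a predicate of the form field > bound.

```python
def shred(rows, field, typ):
    """Split each row into a typed value for `field` and a residual of the rest."""
    typed, residual = [], []
    for row in rows:
        value = row.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, typ) and not isinstance(value, bool):
            typed.append(value)
            residual.append({k: v for k, v in row.items() if k != field})
        else:
            # Missing or differently typed: keep the row unshredded.
            typed.append(None)
            residual.append(dict(row))
    return typed, residual

def min_max(column):
    """Min/max statistics over the non-null entries of a typed column."""
    present = [v for v in column if v is not None]
    return (min(present), max(present)) if present else None

def can_skip(typed, residual, field, bound):
    """True if no row in the batch can satisfy `field > bound`."""
    # Statistics only cover shredded values, so any row that still carries
    # the field in its residual forces a read.
    if any(field in r for r in residual):
        return False
    stats = min_max(typed)
    return stats is None or stats[1] <= bound

rows = [{"id": 1, "name": "a"}, {"id": 7, "tag": "x"}, {"name": "b"}]
typed, residual = shred(rows, "id", int)
print(typed, min_max(typed), can_skip(typed, residual, "id", 10))
```

A query that reads only id touches the typed column and never decodes the
residual, which is where the I/O and CPU savings in the list above come from.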

Please take a look at the shredding specification PR
<https://github.com/apache/spark/pull/46831> and leave GitHub comments and
suggestions. Your feedback would be greatly appreciated!

Thanks,
Gene
