Hi all,

We have been working on the Variant data type, which is designed to store and process semi-structured data efficiently, even when the values are heterogeneous. Users can store and process such data flexibly, without having to specify or even know a fixed schema on write. Variant data is encoded in a self-describing format <https://github.com/apache/spark/blob/master/common/variant/README.md>, and the binary format uses offset-based encoding to speed up navigation.
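As a rough illustration of why offset-based encoding helps navigation, here is a toy sketch in Python. The layout, and the names `encode` and `lookup`, are made up for illustration and are NOT the actual Variant binary format; see the README linked above for the real encoding. The point is only that a reader can binary-search the offset tables and jump straight to one value without decoding the others.

```python
import struct

# Toy layout (hypothetical, not the Variant spec):
#   u32 n | (n+1) u32 key offsets | (n+1) u32 value offsets | key bytes | value bytes
# Keys are stored sorted, so a lookup can binary-search the offset table.

def encode(fields: dict[str, bytes]) -> bytes:
    items = sorted((k.encode("utf-8"), v) for k, v in fields.items())

    def offsets(blobs):
        out, pos = [0], 0
        for b in blobs:
            pos += len(b)
            out.append(pos)
        return out

    keys = [k for k, _ in items]
    vals = [v for _, v in items]
    n = len(items)
    header = struct.pack(f"<{1 + 2 * (n + 1)}I", n, *offsets(keys), *offsets(vals))
    return header + b"".join(keys) + b"".join(vals)

def _u32(buf: bytes, pos: int) -> int:
    return struct.unpack_from("<I", buf, pos)[0]

def lookup(buf: bytes, key: str):
    """Return the raw value bytes for `key`, or None, reading only the offsets it needs."""
    n = _u32(buf, 0)
    key_tab = 4                      # start of the key offset table
    val_tab = 4 + 4 * (n + 1)        # start of the value offset table
    keys_start = 4 + 8 * (n + 1)     # first key byte
    vals_start = keys_start + _u32(buf, key_tab + 4 * n)
    target = key.encode("utf-8")
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        ks = _u32(buf, key_tab + 4 * mid)
        ke = _u32(buf, key_tab + 4 * (mid + 1))
        k = buf[keys_start + ks:keys_start + ke]
        if k == target:
            vs = _u32(buf, val_tab + 4 * mid)
            ve = _u32(buf, val_tab + 4 * (mid + 1))
            return buf[vals_start + vs:vals_start + ve]
        if k < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```

For example, after `buf = encode({"id": b"42", "name": b"gene", "tags": b"[]"})`, calling `lookup(buf, "name")` touches only a handful of offsets and the bytes of the keys it compares against.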

To further improve performance, we are also working on shredding, which is the process of extracting some of the Variant fields from the binary and storing them in separate columns. We have written a specification for Variant shredding <https://github.com/apache/spark/pull/46831> to augment the existing Variant specification. The benefits of shredding include:

- more compact data encoding
- min/max statistics for data skipping
- I/O and CPU savings from pruning fields that a query does not access

Please take a look at the shredding specification PR <https://github.com/apache/spark/pull/46831> and leave GitHub comments and suggestions. Your feedback would be greatly appreciated!

Thanks,
Gene
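
P.S. For anyone who wants the intuition before reading the PR, here is a toy sketch of shredding in Python. It models rows as plain dicts rather than Variant binaries, and the names `shred` and `min_max` are hypothetical; the actual column layout is defined in the specification PR, not here. In this sketch a field is moved into its own column only when its value has the expected type, and everything else stays in a residual record.

```python
def shred(rows, schema):
    """Split rows into typed columns plus a residual record per row.

    `schema` maps field name -> expected Python type (an assumption of this
    sketch). Values that are missing or of another type stay in the residual.
    """
    columns = {name: [] for name in schema}
    residual = []
    for row in rows:
        rest = dict(row)
        for name, expected in schema.items():
            if name in rest and type(rest[name]) is expected:
                columns[name].append(rest.pop(name))
            else:
                columns[name].append(None)  # not shredded for this row
        residual.append(rest)
    return columns, residual

def min_max(column):
    """Min/max over the shredded values only (None entries are ignored)."""
    present = [v for v in column if v is not None]
    return (min(present), max(present)) if present else None

rows = [
    {"id": 1, "event": "click", "extra": {"x": 1}},
    {"id": 7, "event": "view"},
    {"id": "n/a", "event": "click"},  # heterogeneous: id is a string here
]
columns, residual = shred(rows, {"id": int})
stats = min_max(columns["id"])
```

A query that reads only `id` can scan `columns["id"]` and skip the residual for rows where the column is populated, which is where the pruning savings come from. Note that the statistics here cover only the shredded values, so a real reader would still have to account for values left in the residual.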