Re: [PR] [SPARK-48495][SQL][DOCS] Describe shredding scheme for Variant [spark]

via GitHub Thu, 18 Jul 2024 15:17:03 -0700


RussellSpitzer commented on code in PR #46831:
URL: https://github.com/apache/spark/pull/46831#discussion_r1683562693



##########
common/variant/shredding.md:
##########
@@ -0,0 +1,244 @@
+# Shredding Overview
+
+The Spark Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values. Query engines encode each variant 
value in a self-describing format, and store it as a group containing **value** 
and **metadata** binary fields in Parquet. Since data is often partially 
homogeneous, it can be beneficial to extract certain fields into separate 
Parquet columns to further improve performance. We refer to this process as 
"shredding". Each Parquet file remains fully self-describing, with no 
additional metadata required to read or fully reconstruct the Variant data from 
the file. Combining shredding with a binary residual provides the flexibility 
to represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+
+This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction. 
For now, it does not discuss which fields to shred, user-facing API changes, or 
any engine-specific considerations like how to use shredded columns. The 
approach builds on top of the generic Spark Variant representation, and 
leverages the existing Parquet specification for maximum compatibility with the 
open-source ecosystem.
+
+At a high level, we replace the **value** and **metadata** of the Variant 
Parquet group with one or more fields called **object**, **array**, 
**typed_value** and **untyped_value**. These represent a fixed schema suitable 
for constructing the full Variant value for each row.
+
+Shredding lets Spark (or any other query engine) reap the full benefits of 
Parquet's columnar representation, such as more compact data encoding, min/max 
statistics for data skipping, and I/O and CPU savings from pruning unnecessary 
fields not accessed by a query (including the non-shredded Variant binary data).

Review Comment:
   nit: "and I/O and CPU savings"
   
   If this were part of the comma-separated list earlier in the sentence, it would fit better.
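
For readers skimming this thread, the shredding idea in the quoted overview can be illustrated with a toy sketch. This is not the Parquet layout or any Spark API: plain Python dicts stand in for Variant values, and the helper names `shred` and `reconstruct` are hypothetical. It only shows the principle that a field whose value matches the expected type moves into a typed column, everything else stays in a residual, and the original row can be rebuilt from the two.

```python
def shred(rows, field, expected_type):
    """Split rows into a typed column for `field` and per-row residuals.

    Hypothetical helper for illustration only; dicts stand in for Variant values.
    """
    typed_column = []
    residuals = []
    for row in rows:
        # Compare exact types so that, for example, bool is not treated as int.
        if field in row and type(row[field]) is expected_type:
            typed_column.append(row[field])
            residuals.append({k: v for k, v in row.items() if k != field})
        else:
            # Missing field or mismatched type: keep the whole row in the residual.
            typed_column.append(None)
            residuals.append(dict(row))
    return typed_column, residuals


def reconstruct(typed_column, residuals, field):
    """Rebuild the original rows from the typed column and the residuals."""
    rows = []
    for typed, residual in zip(typed_column, residuals):
        row = dict(residual)
        if typed is not None:
            row[field] = typed
        rows.append(row)
    return rows


rows = [{"a": 1, "b": "x"}, {"a": "oops", "b": "y"}, {"c": 3.5}]
typed, rest = shred(rows, "a", int)
print(typed)  # typed column for field "a"
print(rest)   # residual values that were not shredded
print(reconstruct(typed, rest, "a") == rows)
```

A query that only touches `a` could, in this toy model, read `typed` alone and fall back to the residual only for rows where the typed slot is empty, which is the kind of pruning benefit the overview describes.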



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

