Jiri Humpolicek created SPARK-42872:
---------------------------------------

             Summary: Spark SQL reads unnecessary nested fields
                 Key: SPARK-42872
                 URL: https://issues.apache.org/jira/browse/SPARK-42872
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: Jiri Humpolicek


When higher-order functions are used in a Spark SQL query, it would be great if the following example could be written in a way that makes Spark read only the necessary nested fields.

Example:
1) Loading data
{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
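For reference, the schema inferred from the JSON (JSON schema inference orders struct fields alphabetically) should look roughly like this:
{code:scala}
df.printSchema
// root
//  |-- items: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- itemData: string (nullable = true)
//  |    |    |-- itemId: long (nullable = true)
{code}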
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(transform($"items", i => i.getItem("itemId")).as("itemIds")).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
We use only the *itemId* field of the struct inside the array, but the read schema still contains all fields of the struct.
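
For comparison, a direct nested-field access (without {{transform}}) should be covered by nested schema pruning, which suggests the problem is specific to lambda functions inside higher-order functions. A minimal check (the expected output below is a sketch, not verified on 3.3.2):
{code:scala}
// Direct field access on the array of structs gets pruned as expected:
read.select($"items.itemId".as("itemIds")).explain(true)
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
{code}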


