Jiri Humpolicek created SPARK-42872: ---------------------------------------
Summary: Spark SQL reads unnecessary nested fields
Key: SPARK-42872
URL: https://issues.apache.org/jira/browse/SPARK-42872
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek

When we use higher-order functions in a Spark SQL query, it would be great if Spark could read only the necessary nested fields. Example:

1) Loading data
{code:scala}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemData": "a"},
    {"itemId": 2, "itemData": "b"}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}

2) Read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select(transform($"items", i => i.getItem("itemId")).as('itemIds)).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}

We use only the *itemId* field of the struct in the array, but the read schema still contains all fields of the struct.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
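For comparison, a sketch of the same projection written as a direct nested-field extraction (no higher-order function), which is the case the nestedSchemaPruning rule is known to handle; the pruned-schema comment is an assumption about the expected explain output, not a verified result:

{code:scala}
// Assumes the same "persisted" table and spark session as above,
// with spark.sql.optimizer.nestedSchemaPruning.enabled = true.
// $"items.itemId" extracts itemId from each struct in the array,
// yielding an array<bigint> column without using transform().
val read = spark.table("persisted")
read.select($"items.itemId".as("itemIds")).explain(true)
// Expected (assumed) pruned schema:
// ReadSchema: struct<items:array<struct<itemId:bigint>>>
{code}

If this form does prune as expected, the improvement requested here is essentially to extend the same pruning to lambda expressions inside higher-order functions such as transform().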