Kai Kang created SPARK-29721:
--------------------------------

                 Summary: Spark SQL reads unnecessary nested fields from Parquet after using explode
                     Key: SPARK-29721
                     URL: https://issues.apache.org/jira/browse/SPARK-29721
                 Project: Spark
              Issue Type: Improvement
              Components: SQL
        Affects Versions: 2.4.4
                Reporter: Kai Kang
This is a follow-up to SPARK-4502, which correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source. We are building a Parquet store for a large pre-joined table between two tables with a one-to-many relationship, and this is a blocking issue for us.

The following code illustrates the issue.

Part 1: load and persist some nested data

{code:scala}
import spark.implicits._

val jsonStr = """{
  "items": [
    {
      "itemId": 1,
      "itemData": "a"
    },
    {
      "itemId": 1,
      "itemData": "b"
    }
  ]}"""

val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}

Part 2: read it back and explain the queries

{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select($"items.itemId").explain(true)          // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
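As a possible workaround until pruning handles explode() directly, one can try projecting the needed leaf array into a top-level column in its own select (which references only items.itemId and so may be pruned on read), and then exploding that already-extracted array in a second step. This is a hedged sketch, not verified against 2.4.4; the column names (itemIds, itemId) are illustrative:

{code:scala}
import org.apache.spark.sql.functions.explode
import spark.implicits._

val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// Step 1: project only the leaf array. Since this select references
// items.itemId alone, nested-schema pruning may apply to the scan.
val itemIds = read.select($"items.itemId".as("itemIds"))

// Step 2: explode the already-extracted top-level array column.
val exploded = itemIds.select(explode($"itemIds").as("itemId"))

// Inspect the plan to confirm whether the Parquet ReadSchema was pruned.
exploded.explain(true)
{code}

Whether the optimizer keeps the pruned schema through the second select would need to be confirmed from the physical plan's ReadSchema.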