[ https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448328#comment-17448328 ]

L. C. Hsieh commented on SPARK-37450:
-------------------------------------

Hmm, this is the optimized plan.

{code}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(true)#20299L]
+- Project
   +- Generate explode(items#20293), [0], false, [item#20296]
      +- Filter ((size(items#20293, true) > 0) AND isnotnull(items#20293))
         +- Relation default.table[items#20293] parquet
{code}

Because you are counting the exploded "item" rows here, Spark must read "items" and 
explode it to count the nested elements. And because no particular nested field is 
specified, Spark reads the full nested struct ("itemId" and "itemData") without 
any pruning.
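
For reference, this is the query from the description that produces the plan above 
(a minimal sketch, assuming the "persisted" table created in the description and 
nested-schema pruning enabled):

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{count, explode, lit}

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// count(lit(true)) references no nested field of "item",
// so the scan must keep the full struct:
spark.table("persisted")
  .select(explode($"items").as("item"))
  .select(count(lit(true)))
  .explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}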

For example, if you change the query to 
"read.select(explode($"items").as('item)).select(count($"item.itemData")).explain(true)",
 Spark will prune "itemId":


{code}
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(_extract_itemData#20302)], output=[count(item.itemData)#20300L])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#24668]
      +- HashAggregate(keys=[], functions=[partial_count(_extract_itemData#20302)], output=[count#20307L])
         +- Project [item#20296 AS _extract_itemData#20302]
            +- Generate explode(_extract_itemData#20304), false, [item#20296]
               +- Project [items#20293.itemData AS _extract_itemData#20304]
                  +- Filter ((size(items#20293.itemData, true) > 0) AND isnotnull(items#20293.itemData))
                     +- FileScan parquet default.table[items#20293] Batched: false, DataFilters: [(size(items#20293.itemData, true) > 0), isnotnull(items#20293.itemData)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<items:array<struct<itemData:string>>>
{code}
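
Put together, the pruned variant as one runnable snippet (again a sketch against 
the "persisted" table; the plan above was produced against a table named 
default.table, but the shape is the same):

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{count, explode}

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

// Counting a concrete nested field lets the optimizer rewrite
// explode(items) into explode(items.itemData), so "itemId" is pruned:
spark.table("persisted")
  .select(explode($"items").as("item"))
  .select(count($"item.itemData"))
  .explain(true)
// ReadSchema: struct<items:array<struct<itemData:string>>>
{code}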


> Spark SQL reads unnecessary nested fields (another type of pruning case)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-37450
>                 URL: https://issues.apache.org/jira/browse/SPARK-37450
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jiri Humpolicek
>            Priority: Major
>
> Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], I may 
> have found another nested-field pruning case. In this case a full read happens 
> with the `count` function.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>    {"itemId": 1, "itemData": "a"},
>    {"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) Read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}


