[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448878#comment-17448878
 ] 

Apache Spark commented on SPARK-37450:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34701

> Spark SQL reads unnecessary nested fields (another type of pruning case)
>
>                 Key: SPARK-37450
>                 URL: https://issues.apache.org/jira/browse/SPARK-37450
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jiri Humpolicek
>            Priority: Major
>
> Following up on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638],
> I may have found another nested-field pruning case: with the `count` function,
> Spark reads the full nested struct.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>   "items": [
>     {"itemId": 1, "itemData": "a"},
>     {"itemId": 2, "itemData": "b"}
>   ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) Read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-24 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448824#comment-17448824
 ] 

L. C. Hsieh commented on SPARK-37450:
-

Okay, this sounds like a possible optimization case, although I don't think it 
is actually a schema pruning issue. I'm working on adding an optimization rule for it.




[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448412#comment-17448412
 ] 

Jiri Humpolicek commented on SPARK-37450:
-

Thanks for the answer. Maybe I don't know the internals of Parquet well enough, 
but when I need the number of rows in a Parquet table:
{code:java}
read.select(count(lit(1))).explain(true)
// ReadSchema: struct<>
{code}
the read schema is empty (I suppose no column is accessed).

So is there any way to get the size of an array in Parquet without reading the 
whole sub-structure? If not, your example at least shows one optimization: read 
only the "smallest" attribute in the sub-structure.
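
The "smallest attribute" idea could be sketched in Scala roughly as follows. This is a minimal sketch, not a confirmed fix: it assumes a local SparkSession and the `persisted` table from the issue description, and the names `ArraySizeSketch`, `itemCounts` and `n_items` are made up for illustration. `items.itemId` has one entry per element of `items`, so its size matches `size(items)`. With nested schema pruning enabled the scan should then only need `itemId`, but that is an expectation to check in the ReadSchema of the printed plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size}

object ArraySizeSketch {
  // Hypothetical helper: array length computed from a single nested field
  // instead of the whole struct.
  def itemCounts(spark: SparkSession): DataFrame = {
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
    spark.table("persisted").select(size(col("items.itemId")).as("n_items"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-size-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the issue description, written as one JSON record.
    val jsonStr =
      """{"items": [{"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"}]}"""
    spark.read.json(Seq(jsonStr).toDS)
      .write.format("parquet").mode("overwrite").saveAsTable("persisted")

    val counts = itemCounts(spark)
    counts.explain(true) // inspect ReadSchema in the physical plan
    counts.show()
    spark.stop()
  }
}
```

Which nested field counts as "smallest" depends on the data; `itemId` is picked here only because it is a fixed-width number.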




[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448328#comment-17448328
 ] 

L. C. Hsieh commented on SPARK-37450:
-

Hmm, this is the optimized plan.

{code}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(true)#20299L]
+- Project
   +- Generate explode(items#20293), [0], false, [item#20296]
      +- Filter ((size(items#20293, true) > 0) AND isnotnull(items#20293))
         +- Relation default.table[items#20293] parquet
{code}

Because you are counting "item" here, Spark must read "items" and explode it 
to count the nested elements. And because no particular nested field is 
specified, Spark reads the full nested struct ("itemId" and "itemData") without 
any pruning.

For example, if you change the query to 
"read.select(explode($"items").as('item)).select(count($"item.itemData")).explain(true)",
Spark will prune "itemId":


{code}
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(_extract_itemData#20302)], output=[count(item.itemData)#20300L])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#24668]
      +- HashAggregate(keys=[], functions=[partial_count(_extract_itemData#20302)], output=[count#20307L])
         +- Project [item#20296 AS _extract_itemData#20302]
            +- Generate explode(_extract_itemData#20304), false, [item#20296]
               +- Project [items#20293.itemData AS _extract_itemData#20304]
                  +- Filter ((size(items#20293.itemData, true) > 0) AND isnotnull(items#20293.itemData))
                     +- FileScan parquet default.table[items#20293] Batched: false, DataFilters: [(size(items#20293.itemData, true) > 0), isnotnull(items#20293.itemData)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<items:array<struct<itemData:string>>>
{code}
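
To compare the two queries side by side, something like the following could be run. It is a sketch under assumptions: a local SparkSession, the `persisted` table from the issue description, and the made-up names `CountPruningSketch`, `countAll` and `countItemData`. Note that `count` over a column skips null values, so counting `item.itemData` only matches `count(lit(true))` when `itemData` is never null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, explode, lit}

object CountPruningSketch {
  // Counts every exploded element; no nested field is referenced.
  def countAll(read: DataFrame): DataFrame =
    read.select(explode(col("items")).as("item")).select(count(lit(true)))

  // Counts non-null itemData values; only one nested field is referenced.
  def countItemData(read: DataFrame): DataFrame =
    read.select(explode(col("items")).as("item")).select(count(col("item.itemData")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("count-pruning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the issue description, written as one JSON record.
    val jsonStr =
      """{"items": [{"itemId": 1, "itemData": "a"}, {"itemId": 2, "itemData": "b"}]}"""
    spark.read.json(Seq(jsonStr).toDS)
      .write.format("parquet").mode("overwrite").saveAsTable("persisted")

    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
    val read = spark.table("persisted")

    // Compare the ReadSchema entries of the two physical plans.
    countAll(read).explain(true)
    countItemData(read).explain(true)
    spark.stop()
  }
}
```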





[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448326#comment-17448326
 ] 

L. C. Hsieh commented on SPARK-37450:
-

Thanks [~hyukjin.kwon]. I will take a look.




[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448320#comment-17448320
 ] 

Hyukjin Kwon commented on SPARK-37450:
--

cc [~viirya] FYI




[jira] [Commented] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

2021-11-23 Thread Jiri Humpolicek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448001#comment-17448001
 ] 

Jiri Humpolicek commented on SPARK-37450:
-

Same situation with the {{size}} function:
{code:scala}
read.select(size('items)).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
