[ https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Kotlov updated SPARK-37201:
----------------------------------
Description:

In this example, reading unnecessary nested fields still happens.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
The v2 and v3 columns aren't needed here, but they still appear in the physical plan:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)

== Physical Plan ==
...
ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
{code}
If you simply remove the _filter_ or move the _explode_ into a second _select_, the schema is pruned correctly:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
*Yet another example: left_anti join after a double select:*
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, field1: String, field2: String)

Seq(
  Event(Struct("v1","v2","v3"), "fld1", "fld2")
).toDF().write.mode("overwrite").saveAsTable("table")

val joinDf = Seq("id1").toDF("id")

spark.table("table")
  .select("struct", "field1")
  .select($"struct.v1", $"field1")
  .join(joinDf, $"field1" === joinDf("id"), "left_anti")
  .explain(true)
// ===> ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,field1:string>
{code}
If you remove the first select or change the type of the join, the nested fields are pruned correctly:
{code:java}
// .select("struct", "field1")
// ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>

.join(joinDf, $"field1" === joinDf("id"), "left")
// ===> ReadSchema: struct<struct:struct<v1:string>,field1:string>
{code}
PS: The first select might look strange in the context of this example, but in a real system it might be part of a common API that other parts of the system use, with their own expressions on top of it.

> Spark SQL reads unnecessary nested fields (filter after explode)
> ----------------------------------------------------------------
>
>                 Key: SPARK-37201
>                 URL: https://issues.apache.org/jira/browse/SPARK-37201
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Sergey Kotlov
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
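For anyone reproducing this: the snippet below consolidates the data preparation and the projection workaround from the examples above into a single spark-shell session. It is a sketch, assuming Spark 3.2.x; `spark.sql.optimizer.nestedSchemaPruning.enabled` is the standard config that gates nested-field pruning (default true), and checking it first rules out a configuration cause rather than the optimizer bug reported here.
{code:java}
// Runnable in spark-shell (where `spark` and its implicits are in scope).
import org.apache.spark.sql.functions.explode
import spark.implicits._

case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(Event(Struct("v1", "v2", "v3"), Seq("cx1", "cx2")))
  .toDF().write.mode("overwrite").saveAsTable("table")

// Nested-field pruning only applies at all when this flag is true (its default):
println(spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled"))

// Workaround from the examples above: project the nested field to a
// top-level column before explode + filter, so only struct.v1 is read.
spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
// ReadSchema should show struct<struct:struct<v1:string>,array:array<string>>
{code}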