Sergey Kotlov created SPARK-37201:
-------------------------------------

             Summary: Spark SQL reads unnecessary nested fields (filter after explode)
                 Key: SPARK-37201
                 URL: https://issues.apache.org/jira/browse/SPARK-37201
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Sergey Kotlov
In this example, Spark still reads unnecessary nested fields.

Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1", "v2", "v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
The v2 and v3 columns aren't needed here, but they still appear in the read schema of the physical plan:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)

== Physical Plan ==
...
ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
{code}
If you simply remove the _filter_, or move the _explode_ into a second _select_, the schema is pruned correctly:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
// ...
// ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
// ...
// ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
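For anyone reproducing this: nested-field pruning is governed by the {{spark.sql.optimizer.nestedSchemaPruning.enabled}} flag (on by default in Spark 3.x). A minimal sketch to confirm the flag is active before comparing plans, assuming an existing SparkSession named {{spark}} (not part of the report above):

{code:java}
// Rule out a disabled optimizer rule before attributing the unpruned
// ReadSchema to the filter-after-explode pattern itself.
println(spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled"))

// If this prints "true", the pruned vs. unpruned ReadSchema difference
// shown above comes from the filter-after-explode shape alone.
{code}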