Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21320

Thanks @mallman for the schema pruning work, which will be a big win for our data access patterns. While testing this new feature, I found that a `where` clause on a selected nested column can break schema pruning. For example:

```scala
val q1 = sql("select name.first from contacts")
val q2 = sql("select name.first from contacts where name.first = 'David'")
q1.explain(true)
q2.explain(true)
```

The physical plan of `q1` is correct and what we expect with this feature:

```
== Physical Plan ==
*(1) Project [name#19.first AS first#36]
+- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<first:string>>
```

But the physical plan of `q2` has a pushed filter on `name`, resulting in reading the entire `name` column:

```
== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
   +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
```

I understand that predicate push-down on nested columns is not implemented yet, but with schema pruning and a `where` clause, we should be able to read only the selected nested columns plus the columns referenced in the `where` clause. Thanks.

cc @beettlle
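For reference, here is a minimal sketch of what the `contacts` data might look like, inferred from the `ReadSchema` in the plans above. The case class names and the sample row are assumptions for illustration, not the PR's actual test fixtures:

```scala
// Hypothetical case classes matching ReadSchema:
// struct<name:struct<first:string,middle:string,last:string>>
case class Name(first: String, middle: String, last: String)
case class Contact(name: Name)

// Sample data; with a SparkSession in scope this could back the queries above via
// spark.createDataset(contacts).createOrReplaceTempView("contacts")
val contacts = Seq(Contact(Name("David", "", "Smith")))
```

With this schema, pruning `q1` down to `struct<name:struct<first:string>>` avoids reading `middle` and `last` from Parquet, which is the win the feature provides.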