GitHub user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/21320
  
    Thanks @mallman for the schema pruning work, which will be a big win for the way we access our data.
    
    I'm testing this new feature, and found that a `where` clause on the selected nested column can break the schema pruning.
    
    For example,
    ```scala
        val q1 = sql("select name.first from contacts")
        val q2 = sql("select name.first from contacts where name.first = 'David'")
    
        q1.explain(true)
        q2.explain(true)
    ```
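    
    For reference, a minimal sketch of how a `contacts` table like ours could be set up (the nested field names come from the `ReadSchema` in the plans below; the path and sample rows are just placeholders):
    ```scala
        import spark.implicits._

        // Nested schema matching struct<name:struct<first:string,middle:string,last:string>>
        case class Name(first: String, middle: String, last: String)
        case class Contact(name: Name)

        // Write a tiny parquet table and expose it as the `contacts` view used above.
        Seq(Contact(Name("David", "A", "Smith")), Contact(Name("Jane", "B", "Doe")))
          .toDF()
          .write.mode("overwrite").parquet("/tmp/contacts")
        spark.read.parquet("/tmp/contacts").createOrReplaceTempView("contacts")
    ```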
    
    The physical plan of `q1` is correct, and is what we expect from this feature:
    ```
    == Physical Plan ==
    *(1) Project [name#19.first AS first#36]
    +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<first:string>>
    ```
    
    But the physical plan of `q2` has a filter pushed down on the whole `name` struct, resulting in reading the entire `name` column:
    ```
    == Physical Plan ==
    *(1) Project [name#19.first AS first#40]
    +- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
       +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
    ```
    
    I understand that predicate push-down on nested columns is not implemented yet, but with schema pruning and a `where` clause we should still be able to read only the selected nested columns plus the nested columns referenced in the `where` clause.
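    
    In case it helps with a regression test, here is a rough sketch of how the pruning could be asserted programmatically (assuming Spark 2.x's `FileSourceScanExec`; just a sketch, not tested against this branch):
    ```scala
        import org.apache.spark.sql.execution.FileSourceScanExec

        // Collect the schema each parquet scan actually reads; for q2 we would
        // want this to stay struct<name:struct<first:string>> once filters on
        // nested fields no longer defeat the pruning.
        q2.queryExecution.executedPlan.collect {
          case scan: FileSourceScanExec => scan.requiredSchema
        }.foreach(schema => println(schema.catalogString))
    ```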
    
    Thanks.
    
    cc @beettlle

