Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/22357
  
    I have reconstructed my original patch for this issue, but it will require 
more work to complete. While reconstructing it, I found a couple of cases where 
our patches create different physical plans. The query results are the same, but 
I'm not sure which plan, if either, is correct. I want to go into detail on that, 
but it's complicated and I have to call it quits tonight. I have a flight in the 
morning, and I'll be on break next week.
    
    In the meantime, I'll just copy and paste two queries—based on the data 
in `ParquetSchemaPruningSuite.scala`—with two query plans each.
    
    First query:
    
        select employer.id from contacts where employer is not null
    
    This PR (as of d68f808) produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4442.id AS id#4452]
    +- *(1) Filter isnotnull(employer#4442)
       +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    My WIP patch produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4442.id AS id#4452]
    +- *(1) Filter isnotnull(employer#4442)
       +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int,company:struct<name:string,address:string>>>
    ```
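
The two `ReadSchema` values above differ only in how much of the `employer` struct is read. As a toy model (plain Python, not Spark code; the `prune` and `render` helpers and the dict-based schema encoding are hypothetical, introduced only for illustration), one way such a difference could arise is whether the bare `employer` reference in `employer is not null` is counted as an access to the whole struct, or only the projected `employer.id` path is counted:

```python
# Toy model of nested schema pruning; NOT Spark's implementation.
# A struct is a dict of field name -> type; a leaf type is a string.
# The employer fields are taken from the ReadSchema shown in the plans.
FULL_SCHEMA = {
    "employer": {
        "id": "int",
        "company": {"name": "string", "address": "string"},
    },
}


def prune(schema, paths):
    """Keep only the parts of `schema` reachable via the dotted `paths`.

    A path that stops at a struct (e.g. "employer") keeps that whole struct.
    """
    pruned = {}
    for name, field in schema.items():  # iterating the schema preserves field order
        rests = [p.partition(".")[2] for p in paths if p.partition(".")[0] == name]
        if not rests:
            continue  # field is never referenced
        if "" in rests or not isinstance(field, dict):
            pruned[name] = field  # the whole field was requested
        else:
            pruned[name] = prune(field, rests)  # descend into the struct
    return pruned


def render(field):
    """Format a schema in the struct<...> notation used by the plans."""
    if isinstance(field, dict):
        inner = ",".join(f"{k}:{render(v)}" for k, v in field.items())
        return f"struct<{inner}>"
    return field


# Only the projected path counts as an access.
projection_only = prune(FULL_SCHEMA, ["employer.id"])
# The bare `employer` in the filter also counts, as a whole-struct access.
with_bare_reference = prune(FULL_SCHEMA, ["employer.id", "employer"])

print(render(projection_only))
print(render(with_bare_reference))
```

Under these assumptions, the first call yields the narrower schema and the second yields the full `employer` struct. This only illustrates the shape of the difference; it does not say which plan is correct.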
    
    Second query:
    
        select employer.id from contacts where employer.id = 0
    
    This PR produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4297.id AS id#4308]
    +- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
       +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    My WIP patch produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4445.id AS id#4456]
    +- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
       +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [],
        ReadSchema: struct<employer:struct<id:int>>
    ```
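
The two filters above guard against nulls at different levels: `isnotnull(employer)` versus `isnotnull(employer.id)`. A small Python sketch of SQL's three-valued logic, with `None` standing in for NULL, is consistent with the observation that both plans return the same rows. The helper names and sample rows are hypothetical, and this models only the filter semantics, not Spark's planner or Parquet filter pushdown:

```python
# None stands in for SQL NULL. Toy model only, not Spark code.


def sql_and(a, b):
    """Three-valued AND: False dominates, otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_eq(a, b):
    """Equality that yields NULL when either side is NULL."""
    return None if a is None or b is None else a == b


def employer_id(row):
    """employer.id, which is NULL when employer itself is NULL."""
    employer = row["employer"]
    return None if employer is None else employer["id"]


def filter_this_pr(row):
    # isnotnull(employer) && (employer.id = 0)
    return sql_and(row["employer"] is not None, sql_eq(employer_id(row), 0))


def filter_wip_patch(row):
    # isnotnull(employer.id) && (employer.id = 0)
    return sql_and(employer_id(row) is not None, sql_eq(employer_id(row), 0))


# Hypothetical rows covering each null case.
rows = [
    {"employer": None},
    {"employer": {"id": None}},
    {"employer": {"id": 0}},
    {"employer": {"id": 1}},
]

# A WHERE clause keeps a row only when the predicate is exactly TRUE.
kept_this_pr = [r for r in rows if filter_this_pr(r) is True]
kept_wip_patch = [r for r in rows if filter_wip_patch(r) is True]
print(kept_this_pr == kept_wip_patch)
```

In this model the two predicates can evaluate differently (NULL versus FALSE) for a row whose `employer` is present but whose `id` is NULL, yet neither keeps that row, so the filtered results match. The remaining visible difference between the plans is in `PushedFilters`.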
    
    I wanted to give my thoughts on the differences between these plans in 
detail, but I have to wrap up my work for the night. I'll be visiting family 
next week. I don't know how responsive I'll be in that time, but I'll at least 
try to check back.
    
    Cheers.

