Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/22357
  
    > @mallman It will be great that we can have this fix in 2.4 release as 
this can dramatically reduce the data being read in many applications which is 
the purpose of the original work.
    
    I agree it would be great to have this capability in 2.4. But I don't know that this PR is the right way to accomplish our intended goal, and I'm not sure it actually does. And I would like time to complete my review; I'm still running tests against this patch.
    
    I would also like to submit my patch as an alternative for review, because the approaches taken by this PR and by my patch are not compatible. Even though my patch is incomplete, I'm willing to submit it as-is with some notes on what's missing and what remains to be done. However, I can say for certain there is no way it would be accepted for Spark 2.4 in that state. The earliest I could get it submitted is tomorrow morning (EDT).
    
    However, to give you a sense of how my patch works, here is the gist of how I see the problem. Basically, constraint propagation as defined in `QueryPlanConstraints.scala` inhibits schema pruning. Indeed, if you turn off constraint propagation (by setting `spark.sql.constraintPropagation.enabled` to `false`), the following query
    
        select employer.id from contacts where employer.id = 0
    
    produces the following physical plan
    
    ```
    == Physical Plan ==
    *(1) Project [employer#36.id AS id#47]
    +- *(1) Filter (employer#36.id = 0)
       +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet,
          PartitionCount: 2, PartitionFilters: [], PushedFilters: [],
          ReadSchema: struct<employer:struct<id:int>>
    ```
    
    without applying _either_ patch. (FYI, I ran this on the master branch at commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56.) The only column read in this plan is `employer.id`, just as we'd like.
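
    If you want to reproduce this, here is a minimal spark-shell sketch. I'm assuming a `contacts` table with an `employer` struct column, like the test data in `ParquetSchemaPruningSuite.scala`:

    ```scala
    // Schema pruning is off by default on master; turn it on.
    spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "true")
    // Turn off constraint propagation before planning the query.
    spark.conf.set("spark.sql.constraintPropagation.enabled", "false")

    // ReadSchema in the printed plan should list only employer.id.
    spark.sql("select employer.id from contacts where employer.id = 0").explain()
    ```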
    
    Aside from the difference in approach, I have some other concerns about this PR. I don't think we should push down `IsNotNull(employer)` to the reader unless we need to. This PR includes that pushed-down filter for both of the sample queries I provided in my previous comment, https://github.com/apache/spark/pull/22357#issuecomment-419612555. The question is: how does that pushdown affect the reader's behavior?
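
    To see what actually reaches the scan, one way is to collect the scan node from the executed plan and print its pushed filters. A sketch, assuming the same `contacts` table as above:

    ```scala
    import org.apache.spark.sql.execution.FileSourceScanExec

    val df = spark.sql("select employer.id from contacts where employer.id = 0")

    // Collect every file scan node and print the filters pushed to it.
    val scans = df.queryExecution.executedPlan.collect {
      case scan: FileSourceScanExec => scan
    }
    scans.foreach(s => println(s.metadata("PushedFilters")))
    ```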
    
    That leads me to a concern about testing this functionality. Our intent is to read from as few columns as necessary. In the query

        select employer.id from contacts where employer.id = 0

    we need only read from the `employer.id` column, and we can tell the reader to read only that column. But how do we know that pushing down `IsNotNull(employer)` does not negate that instruction? One way to be certain is not to push that filter down in the first place. That is the approach my patch currently takes. Of course, this removes the pushdown. Identifying which plan leads to a faster scan requires a more robust testing capability; however, one thing is certain: the `FileScan` in my patch's plan gives no reason to believe that it reads anything other than that one column.
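
    The kind of assertion I have in mind looks something like the sketch below, in the spirit of the scan checks in `ParquetSchemaPruningSuite.scala` (the helper name here is illustrative):

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.FileSourceScanExec

    // Fail if any file scan in the plan reads more than the expected schema.
    def assertReadSchema(df: DataFrame, expected: String): Unit = {
      val readSchemas = df.queryExecution.executedPlan.collect {
        case scan: FileSourceScanExec => scan.requiredSchema.catalogString
      }
      assert(readSchemas.forall(_ == expected),
        s"unexpected read schemas: $readSchemas")
    }

    assertReadSchema(
      spark.sql("select employer.id from contacts where employer.id = 0"),
      "struct<employer:struct<id:int>>")
    ```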
    
    IMO, we can get closer to settling the question of relative performance and behavior by pushing down Parquet reader filters for just the columns we need, e.g. `IsNotNull(employer.id)` in the case above. Neither patch (currently) does that; however, I think my patch is closer to achieving it, because it already identifies `isnotnull(employer#4445.id)` as a filter predicate in the query plan. We just need to push it down.
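
    For what it's worth, the filter itself is easy to express in the data source `Filter` API; the work is in teaching the Parquet filter translation to handle a dotted nested field name, which neither patch does today. A sketch of just the representation:

    ```scala
    import org.apache.spark.sql.sources.{Filter, IsNotNull}

    // The filter we'd like the scan to carry: a null check on the nested
    // field only, rather than on the whole employer struct.
    val pushed: Filter = IsNotNull("employer.id")
    println(pushed) // IsNotNull(employer.id)
    ```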
    
    As I mentioned, I'll endeavor to have my patch posted as a PR by tomorrow morning, but I can't promise that.
    
    I'm sorry for the delay. I really wasn't expecting we'd work on this functionality for Spark 2.4. We do have a known bug in the schema pruning functionality that's in Spark 2.4, one that throws an error. We identified it in #21320 (look for the "ignored" test in `ParquetSchemaPruningSuite.scala`), but I don't think we have an issue in Jira for it. I'll try to take care of that by tomorrow morning as well; I was hoping we would prioritize that fix. I have a patch for that bug that is code complete but missing proper code documentation.
    
    Thanks all.

