Github user mallman commented on the issue: https://github.com/apache/spark/pull/22357

> @mallman It will be great that we can have this fix in 2.4 release as this can dramatically reduce the data being read in many applications which is the purpose of the original work.

I agree it would be great to have this capability in 2.4, but I don't know that this PR is the right way to accomplish our intended goal. I'm also not sure this patch accomplishes its intended goal, and I would like time to complete my review; I'm still running tests against this patch.

I would also like to submit my patch as an alternative for review, because the approach taken by this PR and the approach taken by my patch are not compatible. Even though it's incomplete, I'm willing to submit it as-is with some notes on how it's incomplete and what remains to be done. However, I can say for certain there is no way it would be accepted for Spark 2.4. The earliest I could get it submitted is tomorrow morning (EDT).

To give you a sense of how my patch works, here's the gist of how I see the problem. Basically, constraint propagation as defined in `QueryPlanConstraints.scala` inhibits schema pruning. Indeed, if you turn off constraint propagation (by setting `spark.sql.constraintPropagation.enabled` to `false`), the following query

```
select employer.id from contacts where employer.id = 0
```

produces the following physical plan

```
== Physical Plan ==
*(1) Project [employer#36.id AS id#47]
+- *(1) Filter (employer#36.id = 0)
   +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<employer:struct<id:int>>
```

without applying _either_ patch. (FYI, I ran this on the master branch, commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56.) The only column read in this plan is `employer.id`, just as we'd like.

Aside from the difference in approach, I have some other concerns around this PR.
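For reference, the experiment above can be reproduced with a snippet along these lines. This is only a sketch: it assumes a local SparkSession and a registered `contacts` table with a nested `employer` struct, as in the examples in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes a `contacts` table with a nested `employer` struct
// is already registered. Disabling constraint propagation is what allows
// the pruned read schema (struct<employer:struct<id:int>>) to survive.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.constraintPropagation.enabled", "false")
  .getOrCreate()

spark.sql("select employer.id from contacts where employer.id = 0").explain()
```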
I don't think we should push down `IsNotNull(employer)` to the reader unless we need to. This PR includes that pushed-down filter for both of the sample queries I provided in my previous comment: https://github.com/apache/spark/pull/22357#issuecomment-419612555. The question is: how does that pushdown affect the reader's behavior?

That leads me to a concern around the testing of this functionality. Our intent is to read from as few columns as necessary. In the query

```
select employer.id from contacts where employer.id = 0
```

we need only read from the `employer.id` column, and we can tell the reader to read only that column. But how do we know that pushing down `IsNotNull(employer)` does not negate that instruction? One way to be certain is to not push that filter down in the first place. That is the approach my patch currently takes. Of course, this removes the pushdown. Identifying which plan leads to a faster scan requires a more robust testing capability; however, one thing is certain: the `FileScan` in my patch's plan gives no reason to believe that it is reading anything other than that one column.

IMO, we can get closer to settling the question of relative performance/behavior by pushing down Parquet reader filters just for the columns we need, e.g. `IsNotNull(employer.id)` in the case above. Neither patch currently does that; however, I think my patch is closer to achieving it because it already identifies `isnotnull(employer#4445.id)` as a filter predicate in the query plan. We just need to push it down.

As I mentioned, I'll endeavor to have my patch posted as a PR by tomorrow morning, but I can't promise that. I'm sorry for the delay. I really wasn't expecting we'd work on this functionality for Spark 2.4. We do have a known bug in the schema pruning functionality that's in Spark 2.4, one that throws an error.
We identified it in #21320 (look for the "ignored" test in `ParquetSchemaPruningSuite.scala`), but I don't think we have an issue in Jira for it. I'll try to take care of that by tomorrow morning as well; I was hoping we would prioritize that fix. I have a patch for that bug that is code complete but missing proper code documentation. Thanks all.
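To make the "push down filters just for the columns we need" idea above concrete, here is a minimal, self-contained sketch. None of these names (`NestedFilter`, `pruneFilters`) are Spark APIs; this just models the selection rule: keep a pushed filter only if it references a column that the pruned read schema actually contains, so `IsNotNull(employer.id)` survives while the struct-level `IsNotNull(employer)` is dropped.

```scala
object FilterPruningSketch {
  // A pushed-down filter, identified by the dotted column path it
  // references, e.g. IsNotNull(employer.id) -> path "employer.id".
  final case class NestedFilter(kind: String, path: String)

  // Keep only filters whose referenced path is a column the pruned read
  // schema actually reads. Struct-level filters (e.g. on "employer")
  // are dropped because the scan no longer materializes the full struct.
  def pruneFilters(filters: Seq[NestedFilter], readPaths: Set[String]): Seq[NestedFilter] =
    filters.filter(f => readPaths.contains(f.path))

  def main(args: Array[String]): Unit = {
    // Mirrors the example query: only employer.id is read.
    val readPaths = Set("employer.id")
    val pushed = Seq(
      NestedFilter("IsNotNull", "employer"),    // dropped: whole struct not read
      NestedFilter("IsNotNull", "employer.id")  // kept: leaf column is read
    )
    println(pruneFilters(pushed, readPaths))
  }
}
```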