Github user mallman commented on the issue:

https://github.com/apache/spark/pull/22357

I have reconstructed my original patch for this issue, but I've discovered it will require more work to complete. However, as part of that reconstruction I've discovered a couple of cases where our patches create different physical plans. The query results are the same, but I'm not sure which (if either) plan is correct. I want to go into detail on that, but it's complicated and I have to call it quits tonight. I have a flight in the morning, and I'll be on break next week.

In the meantime, I'll just copy and paste two queries, based on the data in `ParquetSchemaPruningSuite.scala`, with two query plans each.

First query:

    select employer.id from contacts where employer is not null

This PR (as of d68f808) produces:

```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int>>
```

My WIP patch produces:

```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int,company:struct<name:string,address:string>>>
```

Second query:

    select employer.id from contacts where employer.id = 0

This PR produces:

```
== Physical Plan ==
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
   +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int>>
```

My WIP patch produces:

```
== Physical Plan ==
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
   +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<employer:struct<id:int>>
```

I wanted to give my thoughts on the differences of these in detail, but I have to wrap up my work for the night. I'll be visiting family next week. I don't know how responsive I'll be in that time, but I'll at least try to check back.

Cheers.