jackylee-ch opened a new pull request, #51046: URL: https://github.com/apache/spark/pull/51046
### Why are the changes needed? Recently, I have been testing TPC-DS queries based on DataSource V2, and noticed that column pruning does not occur in scenarios involving EXISTS (SELECT * FROM ... WHERE ...). As a result, the scan ends up reading all columns instead of just the required ones. This issue is reproducible in queries like Q10, Q16, Q35, Q69, and Q94. This PR introduces `PostV2ScanRelationPushdown` to address the column pruning issues that may arise after optimizer rules are applied. Below is the plan changes for the newly added test case. Before this PR ``` BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: [] ``` After this PR ``` BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: [] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added UT. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
