jackylee-ch opened a new pull request, #51046:
URL: https://github.com/apache/spark/pull/51046

   ### Why are the changes needed?
   While testing TPC-DS queries on DataSource V2, I noticed that column 
pruning does not occur for `EXISTS (SELECT * FROM ... WHERE ...)` subqueries. 
As a result, the scan reads all columns instead of only the required ones. 
The issue is reproducible with queries such as Q10, Q16, Q35, Q69, and Q94.
   
   This PR introduces `PostV2ScanRelationPushdown` to address the column 
pruning issues that may arise after optimizer rules are applied.
   
   Below are the plan changes for the newly added test case.
   Before this PR
   ```
   BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76b1f4fc-2e84-485c-aade-a62168987baf/t1[id#32L, col1#33L, col2#34L, col3#35L, col4#36L, col5#37L, col6#38L, col7#39L, col8#40L, col9#41L] ParquetScan DataFilters: [isnotnull(col1#33L), (col1#33L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-76..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint,col2:bigint,col3:bigint,col4:bigint,col5:bigint,col6:bigint,col7:big... RuntimeFilters: []
   ```
   After this PR
   ```
   BatchScan parquet file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd4b50d9-1643-40e6-a8e1-1429d3213411/t1[id#133L, col1#134L] ParquetScan DataFilters: [isnotnull(col1#134L), (col1#134L > 5)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/bb/4fvsn8r949d3kghh68lx3sqr0000gp/T/spark-cd..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,5)], PushedGroupBy: [], ReadSchema: struct<id:bigint,col1:bigint> RuntimeFilters: []
   ```
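
To illustrate why the pruned scan only needs `id` and `col1`, here is a toy Python model of the reasoning. It is not Spark code and not the implementation in this PR. The helper `required_columns` is hypothetical, and the query shape (a filter on `col1` plus a correlation on `id`) is an assumption inferred from the plans above, since the test query itself is not shown here.

```python
# Toy model of column pruning for an EXISTS subquery; this is not Spark code.
from typing import List

# Columns of table t1, matching the scan output in the "before" plan.
T1_COLUMNS: List[str] = ["id"] + [f"col{i}" for i in range(1, 10)]


def required_columns(
    table_columns: List[str],
    filter_refs: List[str],
    correlation_refs: List[str],
) -> List[str]:
    """Return the columns a scan must read to evaluate EXISTS (SELECT * ...).

    EXISTS only tests whether at least one row matches, so the subquery's
    SELECT list contributes nothing. Only columns referenced by the
    subquery's filter and by the correlation with the outer query are needed.
    """
    needed = set(filter_refs) | set(correlation_refs)
    unknown = needed - set(table_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Preserve the table's column order, as a scan's read schema would.
    return [c for c in table_columns if c in needed]


# Assumed query shape (hypothetical):
#   ... WHERE EXISTS (SELECT * FROM t1 WHERE t1.id = outer.id AND t1.col1 > 5)
pruned = required_columns(T1_COLUMNS, filter_refs=["col1"], correlation_refs=["id"])
print(pruned)  # ['id', 'col1']
```

Under these assumptions the result matches the `ReadSchema: struct<id:bigint,col1:bigint>` in the "after" plan, while the "before" plan reads all ten columns.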
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Newly added unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

