HeartSaVioR opened a new pull request, #46569: URL: https://github.com/apache/spark/pull/46569
### What changes were proposed in this pull request?

This PR proposes to add an end-to-end (e2e) regression test for SPARK-47305 (https://issues.apache.org/jira/browse/SPARK-47305).

As of commit cae2248bc13 (pre-Spark 4.0), the test query is represented by the following logical plans:

> Batch 0

>> analyzed plan

```
WriteToMicroBatchDataSource MemorySink, 5067923b-e1d0-484c-914c-b111c9e60aac, Append, 0
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- LocalRelation <empty>, [value#3]
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```

>> optimized plan

```
WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
   +- Project
      +- Filter (1 = id#12L)
         +- Range (1, 5, step=1, splits=Some(2))
```

> Batch 1

>> analyzed plan

```
WriteToMicroBatchDataSource MemorySink, d1c8be66-88e7-437a-9f25-6b87db8efe17, Append, 1
+- Project [value#1]
   +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
      :- Union false, false
      :  :- Project [value#1, 1 AS code#5]
      :  :  +- LocalRelation <empty>, [value#1]
      :  +- Project [value#3, cast(code#9 as int) AS code#16]
      :     +- Project [value#3, null AS code#9]
      :        +- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
      +- Project [id#12L AS ref_code#14L]
         +- Range (1, 5, step=1, splits=Some(2))
```

>> optimized plan

```
WriteToDataSourceV2 MicroBatchWrite[epoch: 1, writer: ...]
+- Join Inner
   :- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
   +- LocalRelation <empty>
```

Notice the difference in the optimized plan between batch 0 and batch 1. In the optimized plan for batch 1, the batch side is pruned out, which goes through the PruneFilters path.
The sequence of optimizations is:

1) The left stream side is collapsed into an empty local relation.
2) The union is replaced with the subtree for the right stream side, since the left stream side is just an empty local relation.
3) The value of the `code` column is now known to be `null`, and it is propagated to the join criteria (`null = ref_code`).
4) The join criteria is extracted from the join and pushed down to the batch side.
5) The value of the `ref_code` column can never be null, hence the filter is optimized to `filter false`.
6) `filter false` triggers PruneFilters (where we fixed a bug in SPARK-47305).

Before SPARK-47305, a new empty local relation was incorrectly marked as streaming.

### Why are the changes needed?

In the PR for SPARK-47305 we only added a unit test to verify the fix, but it was not an e2e test of the workload in which we encountered the issue. Given the complexity of the query optimizer (QO), it'd be ideal to add an e2e reproducer (albeit simplified) as a regression test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT.

### Was this patch authored or co-authored using generative AI tooling?

No.
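
For illustration only, step 6 of the optimization sequence can be sketched with a toy plan model in Python. This is not Spark code: `Node`, `prune_filters`, and `use_root_flag` are hypothetical names. The pre-fix behavior modeled here, taking the streaming flag from the whole query rather than from the pruned subtree, is an assumption about how the empty relation ended up marked as streaming. The toy also leaves the `Project` above the pruned filter in place, whereas the real optimized plan for batch 1 shows it removed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Toy logical plan node (hypothetical; not Spark's LogicalPlan)."""
    name: str
    children: Tuple["Node", ...] = ()
    streaming: bool = False  # set explicitly on leaves only

    @property
    def is_streaming(self) -> bool:
        # A subtree is streaming if any leaf below it is streaming.
        return self.streaming or any(c.is_streaming for c in self.children)


def prune_filters(plan: Node, root: Node, use_root_flag: bool) -> Node:
    """Replace every `Filter false` subtree with an empty local relation."""
    if plan.name == "Filter false":
        # Assumed pre-fix behavior: take the flag from the whole query.
        # Assumed post-fix behavior: take it from the pruned subtree.
        flag = root.is_streaming if use_root_flag else plan.is_streaming
        return Node("LocalRelation <empty>", streaming=flag)
    new_children = tuple(prune_filters(c, root, use_root_flag) for c in plan.children)
    return Node(plan.name, new_children, plan.streaming)


# Batch 1 after steps 1-5: stream on the left, `Filter false` over the batch Range.
stream_side = Node("StreamingDataSourceV2ScanRelation", streaming=True)
batch_side = Node("Project", (Node("Filter false", (Node("Range"),)),))
query = Node("Join Inner", (stream_side, batch_side))

buggy = prune_filters(query, query, use_root_flag=True)
fixed = prune_filters(query, query, use_root_flag=False)

# The leaf that replaced `Filter false` on the batch side of the join.
buggy_batch_leaf = buggy.children[1].children[0]
fixed_batch_leaf = fixed.children[1].children[0]
print(buggy_batch_leaf.name, buggy_batch_leaf.is_streaming)
print(fixed_batch_leaf.name, fixed_batch_leaf.is_streaming)
```

In this toy, the query as a whole is streaming in both variants because of the stream side of the join. Only the flag on the batch-side empty relation differs, which is the property an e2e reproducer with a mixed batch and streaming plan can exercise.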