[PR] [SPARK-48267][SS] Regression e2e test with SPARK-47305 [spark]

via GitHub Mon, 13 May 2024 20:56:10 -0700


HeartSaVioR opened a new pull request, #46569:
URL: https://github.com/apache/spark/pull/46569


   ### What changes were proposed in this pull request?
   
   This PR adds an end-to-end (e2e) regression test for SPARK-47305 
(https://issues.apache.org/jira/browse/SPARK-47305).
   
   As of commit cae2248bc13 (pre-Spark 4.0), the test query is represented by 
the logical plans below:
   
   > Batch 0
   
   >> analyzed plan
   
   ```
   WriteToMicroBatchDataSource MemorySink, 
5067923b-e1d0-484c-914c-b111c9e60aac, Append, 0
   +- Project [value#1]
      +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
         :- Union false, false
         :  :- Project [value#1, 1 AS code#5]
         :  :  +- StreamingDataSourceV2ScanRelation[value#1] 
MemoryStreamDataSource
         :  +- Project [value#3, cast(code#9 as int) AS code#16]
         :     +- Project [value#3, null AS code#9]
         :        +- LocalRelation <empty>, [value#3]
         +- Project [id#12L AS ref_code#14L]
            +- Range (1, 5, step=1, splits=Some(2))
   ```
   
   >> optimized plan
   
   ```
   WriteToDataSourceV2 MicroBatchWrite[epoch: 0, writer: ...]
   +- Join Inner
      :- StreamingDataSourceV2ScanRelation[value#1] MemoryStreamDataSource
      +- Project
         +- Filter (1 = id#12L)
            +- Range (1, 5, step=1, splits=Some(2))
   ```
   
   > Batch 1
   
   >> analyzed plan
   
   ```
   WriteToMicroBatchDataSource MemorySink, 
d1c8be66-88e7-437a-9f25-6b87db8efe17, Append, 1
   +- Project [value#1]
      +- Join Inner, (cast(code#5 as bigint) = ref_code#14L)
         :- Union false, false
         :  :- Project [value#1, 1 AS code#5]
         :  :  +- LocalRelation <empty>, [value#1]
         :  +- Project [value#3, cast(code#9 as int) AS code#16]
         :     +- Project [value#3, null AS code#9]
         :        +- StreamingDataSourceV2ScanRelation[value#3] 
MemoryStreamDataSource
         +- Project [id#12L AS ref_code#14L]
            +- Range (1, 5, step=1, splits=Some(2))
   ```
   
   >> optimized plan
   
   ```
   WriteToDataSourceV2 MicroBatchWrite[epoch: 1, writer: ...]
   +- Join Inner
      :- StreamingDataSourceV2ScanRelation[value#3] MemoryStreamDataSource
      +- LocalRelation <empty>
   ```
   
   Notice the difference in the optimized plan between batch 0 and batch 1. In 
the optimized plan for batch 1, the batch side is pruned out, which goes 
through PruneFilters. The sequence of optimizations is:
   
   1) the left stream side is collapsed into an empty local relation
   2) the union is replaced with the subtree for the right stream side, since 
the left stream side is simply an empty local relation
   3) the value of the 'code' column is now known to be 'null', and it is 
propagated to the join criteria (`null = ref_code`)
   4) the join criteria are extracted from the join and pushed down to the 
batch side
   5) the value of the 'ref_code' column can never be null, hence the filter is 
optimized to `filter false`
   6) `filter false` triggers PruneFilters (where the bug was fixed in 
SPARK-47305)
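
The six steps above can be sketched with a small toy model. This is not Spark's Catalyst optimizer: the classes `Empty`, `Scan`, `WithCode`, and `Filter` and the function `optimize` are hypothetical names introduced only for illustration, and the model covers only the case where exactly one union branch has data in a batch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Empty:
    """Toy stand-in for `LocalRelation <empty>`."""


@dataclass(frozen=True)
class Scan:
    """Toy stand-in for a leaf that can produce rows."""
    name: str


@dataclass(frozen=True)
class WithCode:
    """Toy stand-in for `Project [value, <constant> AS code]`; None models null."""
    child: object
    code: Optional[int]


@dataclass(frozen=True)
class Filter:
    """Toy stand-in for `Filter (<code> = id)` on the batch side."""
    code: int
    child: object


def optimize(left: WithCode, right: WithCode, batch: Scan) -> Tuple[object, object]:
    """Return the (stream side, batch side) of the join after the toy rewrites."""
    # Step 1: a projection over an empty relation collapses to an empty relation.
    live = [b for b in (left, right) if not isinstance(b.child, Empty)]
    # Step 2: the union is replaced by its only non-empty branch.
    if len(live) != 1:
        raise ValueError("toy model only covers exactly one non-empty branch")
    branch = live[0]
    # Step 3: 'code' is now a known constant, so the join criteria become
    # `<constant> = ref_code`.
    code = branch.code
    # Steps 4-5: the criteria are pushed to the batch side as a filter; since
    # ref_code is never null, `null = ref_code` can never hold (filter false).
    if code is None:
        # Step 6: `filter false` is pruned into an empty relation.
        return branch.child, Empty()
    return branch.child, Filter(code, batch)


batch_side = Scan("range")
# Batch 0: only the first stream has data, and its 'code' is the constant 1.
batch0 = optimize(WithCode(Scan("stream1"), 1), WithCode(Empty(), None), batch_side)
# Batch 1: only the second stream has data, and its 'code' is null.
batch1 = optimize(WithCode(Empty(), 1), WithCode(Scan("stream2"), None), batch_side)
print(batch0)
print(batch1)
```

In this sketch, batch 0 keeps a filter on the batch side, while batch 1 ends with an empty relation there, mirroring the shape of the two optimized plans shown earlier.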
   
   Before SPARK-47305, a new empty local relation was incorrectly marked as 
streaming.
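
A toy illustration of why the flag matters for this query is given below. It rests on an assumption that the text above does not spell out: that the pre-fix behavior took the streaming flag from the enclosing plan rather than from the subtree being pruned. The `Node` class and both `prune_*` functions are hypothetical and are not Spark APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy plan node carrying only the flag of interest."""
    name: str
    is_streaming: bool


def prune_before_fix(child: Node, enclosing_plan: Node) -> Node:
    # ASSUMPTION (not stated in the PR text): the flag is copied from the
    # enclosing plan, which is streaming for a stream-batch join.
    return Node("LocalRelation <empty>", enclosing_plan.is_streaming)


def prune_after_fix(child: Node, enclosing_plan: Node) -> Node:
    # The flag follows the subtree that is actually being replaced.
    return Node("LocalRelation <empty>", child.is_streaming)


query = Node("stream-batch join", is_streaming=True)
batch_side = Node("Filter false over Range", is_streaming=False)

print(prune_before_fix(batch_side, query).is_streaming)
print(prune_after_fix(batch_side, query).is_streaming)
```

Under that assumption, the pruned batch side would be replaced by an empty relation flagged as streaming, even though the subtree it replaces is a batch source.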
   
   ### Why are the changes needed?
   
   In the PR for SPARK-47305 we only added a unit test to verify the fix, but 
it was not an e2e test of the workload in which we encountered the issue. Given 
the complexity of the query optimizer (QO), it would be ideal to add an e2e 
reproducer (albeit simplified) as a regression test.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New UT.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

