HeartSaVioR opened a new pull request #28975:
URL: https://github.com/apache/spark/pull/28975


   ### What changes were proposed in this pull request?
   
   This patch fixes incorrect join results produced by stream-stream join 
with state store format V2.
   
   Several spots on the V2 path use UnsafeProjection. Because the result 
row is reused, the row must be copied so that its value does not change while 
it is being read (or the caller must be unaffected by the reuse), but 
`SymmetricHashJoinStateManager.removeByValueCondition` violates this requirement.
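
   The hazard can be illustrated without Spark. The sketch below is a toy 
model, not Spark's code: `MutableRow` and `ReusingProjection` are hypothetical 
stand-ins for a row and for a projection that writes every result into one 
shared buffer, which is the reuse behavior described above.

```scala
// Toy model of a projection that reuses its result row.
// MutableRow and ReusingProjection are hypothetical names, not Spark classes.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

final class ReusingProjection {
  // The same buffer instance is returned from every call.
  private val buffer = new MutableRow(0)

  def apply(input: Int): MutableRow = {
    buffer.value = input
    buffer
  }
}

object ReuseHazard {
  // Holds on to the returned rows without copying: all elements alias one buffer.
  def collectWithoutCopy(inputs: List[Int]): List[Int] = {
    val proj = new ReusingProjection
    val rows = inputs.map(i => proj(i))
    rows.map(_.value)
  }

  // Copies each row before the next projection call overwrites the buffer.
  def collectWithCopy(inputs: List[Int]): List[Int] = {
    val proj = new ReusingProjection
    val rows = inputs.map(i => proj(i).copy())
    rows.map(_.value)
  }

  def main(args: Array[String]): Unit = {
    println(collectWithoutCopy(List(1, 2, 3))) // prints List(3, 3, 3)
    println(collectWithCopy(List(1, 2, 3)))    // prints List(1, 2, 3)
  }
}
```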
   
   This patch makes `KeyWithIndexToValueRowConverterV2.convertValue` copy the 
row itself so that callers don't need to. To avoid double-copying, this patch 
leaves the behavior of 
`KeyWithIndexToValueRowConverterV2.convertToValueRow` unchanged, 
as the caller is expected to store the row in the state store, whose 
implementation calls `copy()`.
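
   The division of responsibility described above can be sketched as follows. 
This is a simplified model under stated assumptions, not the actual patch: 
`Row`, `ToyConverter`, and `ToyStateStore` are hypothetical stand-ins, and only 
the method names `convertValue` and `convertToValueRow` come from this 
description.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable row; not Spark's UnsafeRow.
final class Row(var value: Int, var matched: Boolean) {
  def copy(): Row = new Row(value, matched)
}

// Hypothetical converter whose two conversions each reuse one buffer.
final class ToyConverter {
  private val valueBuffer = new Row(0, false)
  private val storedBuffer = new Row(0, false)

  // Read path: returns a copy, so callers may safely hold on to the result.
  def convertValue(stored: Row): Row = {
    valueBuffer.value = stored.value
    valueBuffer.matched = stored.matched
    valueBuffer.copy()
  }

  // Write path: returns the shared buffer uncopied to avoid a double copy.
  // The caller is expected to pass it straight to the store.
  def convertToValueRow(value: Int, matched: Boolean): Row = {
    storedBuffer.value = value
    storedBuffer.matched = matched
    storedBuffer
  }
}

// Hypothetical store that copies on put (assumption taken from the description).
final class ToyStateStore {
  private val rows = mutable.ArrayBuffer.empty[Row]
  def put(row: Row): Unit = rows += row.copy()
  def get(i: Int): Row = rows(i)
}

object ContractDemo {
  def main(args: Array[String]): Unit = {
    val conv = new ToyConverter
    val store = new ToyStateStore
    store.put(conv.convertToValueRow(1, matched = false))
    store.put(conv.convertToValueRow(2, matched = true))

    // Both results stay valid because convertValue copied them.
    val first = conv.convertValue(store.get(0))
    val second = conv.convertValue(store.get(1))
    println(s"${first.value} ${second.value}") // prints 1 2
  }
}
```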
   
   This patch also documents this behavior in the doc of each method in 
`KeyWithIndexToValueRowConverter`, so that future contributors can read 
it and make sure a change or new addition doesn't break the contract.
   
   ### Why are the changes needed?
   
   Stream-stream join with state store format V2 (newly added in Spark 3.0.0) 
has a serious correctness bug that produces nondeterministic results.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Some Spark 3.0.0 users running stream-stream join from a new 
checkpoint (the bug exists only on the V2 format path) may encounter wrong join 
results. This patch fixes that.
   
   ### How was this patch tested?
   
   The reported case was converted into a new UT, which passes. All UTs 
in StreamingInnerJoinSuite and StreamingOuterJoinSuite pass as well.




