xuanyuanking edited a comment on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110


   cc @maropu @gatorsmile @HeartSaVioR @dongjoon-hyun 
   
   A new regression bug, SPARK-31990, was found while investigating the test 
failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The 
root cause is that [this 
line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458)
 in SPARK-31292 changed the order of groupCols in Deduplicate, and the 
reordering fails the validation logic in this PR. In other words, without 
this PR, the executor JVM could crash, throw a random exception, or even 
return a wrong answer when using a checkpoint written by a previous 
version.
   
   So there are 2 pieces of work related to this PR:
   
   - [x] **[Block]** Fix and merge the compatibility issue in #28830
   - [ ] [Follow-up] Add a new test (or modify the current Kafka test) in #28725
   
   ------------------
   ### More detailed analysis:
   The expected order of `Deduplicate.groupCols` in the UT 
KafkaMicroBatchV2SourceSuite is
   ```
   [timestamp, partition, timestampType, key, offset, topic, value]
   ```
   This is also the order in checkpoints written by versions before 
Spark 3.0.
   After the changes in SPARK-31292, groupCols changed to
   ```
   [key, value, topic, partition, offset, timestamp, timestampType]
   ```
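
   The two orders above hold the same seven columns, so a check that only 
compares the set of names would not notice the change. A minimal sketch of 
why position matters, in plain Python rather than Spark code: the helper 
names (`same_columns`, `mismatched_positions`, `read_by_position`) and the 
sample row are hypothetical and only illustrate the idea, not the actual 
validation logic or state store format.

```python
# Column orders quoted in the comment above.
OLD_ORDER = ["timestamp", "partition", "timestampType", "key", "offset", "topic", "value"]
NEW_ORDER = ["key", "value", "topic", "partition", "offset", "timestamp", "timestampType"]


def same_columns(stored, current):
    """True when both lists contain the same names, ignoring order."""
    return sorted(stored) == sorted(current)


def mismatched_positions(stored, current):
    """Return (index, stored_name, current_name) for every position that differs."""
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(stored, current))
        if s != c
    ]


def read_by_position(stored_row, current_order):
    """Hypothetical positional decode: label stored values with the current names."""
    return dict(zip(current_order, stored_row))


# A made-up row written in the old order.
row_written_in_old_order = (1591000000000, 0, 0, b"k", 42, "topic-a", b"v")

if __name__ == "__main__":
    print(same_columns(OLD_ORDER, NEW_ORDER))
    for mismatch in mismatched_positions(OLD_ORDER, NEW_ORDER):
        print(mismatch)
    print(read_by_position(row_written_in_old_order, NEW_ORDER))
```

   In this sketch only `offset` keeps its position; reading the sample row 
with the new order labels the timestamp value as `key`, which is the kind 
of silent mismatch that a positional schema check is meant to catch.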
   
   #### Why didn't this incompatibility bug fail 
`KafkaMicroBatchV2SourceSuite` when it was merged?
   
   Because the UT `default config of includeHeader doesn't break the existing 
query from Spark 2.4` doesn't exercise the deduplication scenario or check 
the answer.
   Although the UT uses a checkpoint written by version 2.4.3 and a streaming 
deduplication operation, it only aims to prove that the new header (added in 
SPARK-23539) doesn't break the original checkpoint file.




