xuanyuanking edited a comment on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110
cc @maropu @gatorsmile @HeartSaVioR @dongjoon-hyun

A new regression bug, SPARK-31990, was found while investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 changed the order of `groupCols` in `Deduplicate`, and the changed order breaks the validation logic here. That is to say, without this PR, the executor JVM could crash, throw a random exception, or even return a wrong answer when using a checkpoint written by a previous version.

So there are two pieces of work related to this PR:

- [x] **[Block]** Fix and merge the compatibility issue in #28830
- [ ] [Follow-up] Add a new test (or modify the current Kafka test) in #28725

------------------

### More detailed analysis:

The expected order of `Deduplicate.groupCols` in the UT `KafkaMicroBatchV2SourceSuite` is

```
[timestamp, partition, timestampType, key, offset, topic, value]
```

which is also the order in checkpoints written by versions before Spark 3.0.

After the changes in SPARK-31292, `groupCols` changed to

```
[key, value, topic, partition, offset, timestamp, timestampType]
```

#### Why didn't this incompatibility bug fail `KafkaMicroBatchV2SourceSuite` when it was merged?

Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` doesn't exercise the deduplication scenario and check the answer. Although the UT uses a checkpoint written by version 2.4.3 and a streaming deduplication operation, it only aims to prove that the new header (added in SPARK-23539) doesn't break the original checkpoint file.
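To make the failure mode above concrete, here is a minimal toy sketch in Python. It is not Spark code and does not reflect Spark's actual state store format: the helpers `encode`, `decode`, and `same_columns_different_order` and the sample record are hypothetical, and the only assumption is that stored rows are matched to columns by position rather than by name. The two column orders are the ones quoted in the analysis.

```python
# Toy model only: this is NOT Spark's state store format.
# Assumption: a stored row keeps values by position, with no field names.

# Column order in checkpoints written before Spark 3.0 (from the analysis).
OLD_ORDER = ["timestamp", "partition", "timestampType", "key", "offset", "topic", "value"]
# Column order after the changes in SPARK-31292 (from the analysis).
NEW_ORDER = ["key", "value", "topic", "partition", "offset", "timestamp", "timestampType"]


def encode(record, order):
    """Serialize a record as a tuple of values in the given column order."""
    return tuple(record[name] for name in order)


def decode(row, order):
    """Reattach names to a positional row by assuming the given column order."""
    return dict(zip(order, row))


def same_columns_different_order(a, b):
    """True when both lists hold the same column names but in a different order."""
    return sorted(a) == sorted(b) and a != b


# Hypothetical sample record, invented for illustration.
record = {
    "key": "k1", "value": "v1", "topic": "t", "partition": 0,
    "offset": 42, "timestamp": 1591000000000, "timestampType": 0,
}

stored = encode(record, OLD_ORDER)    # written by the "old version"
misread = decode(stored, NEW_ORDER)   # read back with the "new" order

print(same_columns_different_order(OLD_ORDER, NEW_ORDER))
print(misread)
```

In this toy model the `key` field comes back holding the timestamp, while `offset` happens to sit at the same position in both orders and survives. A check that compares column names in order, rather than as a set, would flag the mismatch before any row is decoded.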