[ https://issues.apache.org/jira/browse/SPARK-31990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135286#comment-17135286 ]
Jungtaek Lim commented on SPARK-31990:
--------------------------------------

Nice finding!

One thing I'd like to add: technically, SPARK-31292 can't be called the root cause. The root cause is that we use toSet, which doesn't guarantee order, in a place where order should be preserved. Paradoxically, "Seq.distinct" is the better fit for that requirement, according to the scaladoc.

Below is the description of "Seq.distinct":

{noformat}
def distinct: Seq[A]

Builds a new sequence from this sequence without any duplicate elements.

Note: will not terminate for infinite-sized collections.

returns
    A new sequence which contains the first occurrence of every element of this sequence.

Definition Classes
    SeqLike → GenSeqLike
{noformat}

(NOTE: the description is changed in 2.13 - I don't know why. Would they change the implementation? If we don't trust the Scala description of distinct, then we can probably implement some utility functions that preserve the order of elements.)

That said, I have to agree that it may break backward compatibility, and the chance of that is much higher than the chance of the toSet algorithm affecting the order.

It looks like we face a hard decision: "fix it to get it right" vs "leave it as it is unless a problem occurs".

> Streaming's state store compatibility is broken
> -----------------------------------------------
>
>                 Key: SPARK-31990
>                 URL: https://issues.apache.org/jira/browse/SPARK-31990
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Priority: Blocker
>
> [This line|https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdR2458]
> of [https://github.com/apache/spark/pull/28062] changed the order of
> groupCols in dropDuplicates(). Thus, the executor JVM could probably crash,
> throw a random exception or even return a wrong answer when using the
> checkpoint written by the previous version.
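
To make the toSet vs "Seq.distinct" point from the comment concrete, here is a minimal sketch. The object name, the helper distinctPreservingOrder and the column names are hypothetical and not taken from Spark; the helper only illustrates what an order-preserving utility could look like if we did not want to rely on the scaladoc of distinct.

```scala
import scala.collection.mutable

object OrderDemo {
  // Hypothetical column names, used only for illustration.
  val groupCols: Seq[String] = Seq("e", "d", "c", "b", "a", "c", "a")

  // Seq.distinct keeps the first occurrence of each element, in encounter order.
  val viaDistinct: Seq[String] = groupCols.distinct // List(e, d, c, b, a)

  // toSet promises the same elements but no particular iteration order,
  // so the order of this Seq is an implementation detail of the Set.
  val viaToSet: Seq[String] = groupCols.toSet.toSeq

  // Hypothetical helper (not a Spark API): removes duplicates while
  // explicitly keeping the first occurrence of each element.
  def distinctPreservingOrder[A](xs: Seq[A]): Seq[A] = {
    val seen = mutable.HashSet.empty[A]
    val out = Seq.newBuilder[A]
    // HashSet.add returns true only when the element was not already present.
    xs.foreach { x => if (seen.add(x)) out += x }
    out.result()
  }
}
```

Here viaDistinct and distinctPreservingOrder(groupCols) agree on both the elements and their order, while viaToSet is only guaranteed to contain the same elements.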