[ 
https://issues.apache.org/jira/browse/SPARK-31990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135286#comment-17135286
 ] 

Jungtaek Lim commented on SPARK-31990:
--------------------------------------

Nice finding!

One thing I'd like to add: strictly speaking, SPARK-31292 is not the root 
cause. The root cause is that we use toSet, which doesn't guarantee order, in a 
place where order must be preserved.
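
To illustrate, here is a minimal sketch in plain Scala (not the Spark code path; the column names are made up) of why a round trip through toSet cannot be relied on for ordering:

```scala
object ToSetOrdering {
  // Hypothetical grouping columns, in the order the caller supplied them.
  val cols: Seq[String] = Seq("e", "d", "c", "b", "a", "f")

  // Set has no ordering contract, so the resulting Seq may come back in any order.
  val roundTripped: Seq[String] = cols.toSet.toSeq

  def main(args: Array[String]): Unit = {
    // The same elements are always retained...
    println(roundTripped.sorted == cols.sorted) // true
    // ...but whether the order matches the input is an implementation detail.
    println(roundTripped == cols)
  }
}
```

The second comparison is deliberately left without an expected result, since it depends on the Set implementation in use.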

Paradoxically, "Seq.distinct" is more likely to fit the requirement, according 
to the scaladoc. Below is the description of "Seq.distinct":
{noformat}
def distinct: Seq[A]

Builds a new sequence from this sequence without any duplicate elements.

Note: will not terminate for infinite-sized collections.

returns 

A new sequence which contains the first occurrence of every element of this 
sequence.

Definition Classes SeqLike → GenSeqLike
{noformat}
(NOTE: the description changed in 2.13, and I don't know why. Would they change 
the implementation? If we don't trust the scaladoc description of distinct, we 
could implement our own utility functions that preserve the order of the 
elements.)
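
Such a utility could look roughly like the sketch below. The object and method names are hypothetical (nothing like this is proposed in the ticket), and it assumes an insertion-ordered set is an acceptable building block:

```scala
import scala.collection.mutable

object OrderPreservingUtils {
  // Hypothetical helper: removes duplicates while keeping first-occurrence
  // order, without depending on the documented behaviour of Seq.distinct.
  def distinctPreservingOrder[A](xs: Seq[A]): Seq[A] = {
    // LinkedHashSet iterates in insertion order; re-adding an element
    // that is already present does not move it.
    val seen = mutable.LinkedHashSet.empty[A]
    xs.foreach(seen += _)
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("e", "d", "c", "b", "a", "e", "f")
    println(distinctPreservingOrder(cols)) // List(e, d, c, b, a, f)
  }
}
```

Because the ordering comes from LinkedHashSet's insertion order, the result would not change if the description or implementation of distinct changed.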

That said, yes, it may break backward compatibility, and with a much higher 
chance than the chance that the algorithm of toSet affects the order.

Looks like we have to make a hard decision: "fix it to get it right" vs 
"leave it as it is unless a problem occurs".

 

> Streaming's state store compatibility is broken
> -----------------------------------------------
>
>                 Key: SPARK-31990
>                 URL: https://issues.apache.org/jira/browse/SPARK-31990
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Priority: Blocker
>
> [This 
> line|https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdR2458]
>  of [https://github.com/apache/spark/pull/28062] changed the order of 
> groupCols in dropDuplicates(). Thus, the executor JVM could probably crash, 
> throw a random exception or even return a wrong answer when using the 
> checkpoint written by the previous version. 



