[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117622#comment-14117622 ]
Matei Zaharia commented on SPARK-3098:
--------------------------------------
It's true that the ordering of values after a shuffle is nondeterministic, so,
for example, after a failure you might get a different order of keys in a
reduceByKey, distinct, or similar operation. However, I think that's the way it
should be (and we can document it). RDDs are deterministic when viewed as a
multiset, but not when viewed as an ordered collection, unless you do
sortByKey. Operations like zipWithIndex are meant more as a convenience to get
unique IDs or to act on something with a known ordering (such as a text file
where you want to know the line numbers). But the freedom to control fetch
ordering is quite important for performance, especially if you want to have a
push-based shuffle in the future.
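To illustrate, here is a minimal sketch (assuming the usual SparkContext `sc`,
and an RDD built the same way as in the reproduction below) of how you can get
reproducible indices today: fix an order with sortBy before calling
zipWithIndex, so the index no longer depends on shuffle fetch order.
{code}
// Sketch only (assumes a SparkContext named sc, as in the reproduction below).
val data = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 10000).map(p => i * 6000 + p)
}.distinct()

// Index depends on fetch order, so it can differ across recomputations.
val unstable = data.zipWithIndex()

// Sort first (sortBy(identity) here); the index is then a deterministic
// function of the data itself, at the cost of an extra sort.
val stable = data.sortBy(identity).zipWithIndex()
{code}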
If we wanted to get the same result every time, we could design reduce tasks to
tell the master the order they fetched stuff in after the first time they ran,
but even then, notice that it might limit the kind of shuffle mechanisms we
allow (e.g. it would be harder to make a push-based shuffle deterministic). I'd
rather not make that guarantee now.
> In some cases, the zipWithIndex operation returns wrong results
> ----------------------------------------------------------------
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.1
> Reporter: Guoqiang Li
> Priority: Critical
>
> The reproduce code:
> {code}
> val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 10000).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex()
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
> =>
> {code}
> Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)),
> (36579712,(13,14)))
> {code}