[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558 ]
Ye Xianjin commented on SPARK-3098:
-----------------------------------
Hi [~srowen] and [~gq], I think what [~matei] means is that because distinct()
does not guarantee the ordering of its elements, the result of zipWithIndex is
not deterministic. If you recompute an RDD that contains a distinct
transformation, you are not guaranteed to get the same indices. That explains
the behavior here: c is not cached, so the two sides of c.join(c) can each
evaluate it separately and assign different indices to the same element.
But as [~srowen] said, it is surprising to see different results from the same
RDD. [~matei], what do you think about this behavior?
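To make the explanation above concrete, here is a minimal sketch in plain Scala collections (no Spark involved). It only models the effect and is not how Spark implements distinct(): the object ZipWithIndexSketch and the helpers evaluate, indexed and mismatches are hypothetical names, and reversing the order stands in for "a second evaluation happened to produce a different ordering".

```scala
object ZipWithIndexSketch {
  // Hypothetical stand-in for one evaluation of distinct(): the set of
  // elements is always the same, but the order is not guaranteed. Here the
  // "other" order is simply the reverse (an assumption for illustration).
  def evaluate(xs: Seq[Int], reversed: Boolean): Seq[Int] = {
    val d = xs.distinct
    if (reversed) d.reverse else d
  }

  // Assign indices in arrival order, as zipWithIndex does.
  def indexed(xs: Seq[Int]): Map[Int, Long] =
    xs.zipWithIndex.map { case (x, i) => x -> i.toLong }.toMap

  // Keys whose index differs between two evaluations; this mirrors the
  // join followed by filter(t => t._2._1 != t._2._2) in the report.
  def mismatches(a: Map[Int, Long], b: Map[Int, Long]): Seq[Int] =
    a.keys.filter(k => a(k) != b(k)).toSeq.sorted

  def main(args: Array[String]): Unit = {
    val data = (1 to 20) ++ (10 to 30) // 30 distinct values, some repeated

    // Two evaluations with different orderings disagree on the indices.
    val first  = indexed(evaluate(data, reversed = false))
    val second = indexed(evaluate(data, reversed = true))
    println(s"unsorted mismatches: ${mismatches(first, second).size}")

    // Imposing a total order before indexing makes both evaluations agree.
    val s1 = indexed(evaluate(data, reversed = false).sorted)
    val s2 = indexed(evaluate(data, reversed = true).sorted)
    println(s"sorted mismatches: ${mismatches(s1, s2).size}")
  }
}
```

The second half suggests one possible workaround under these assumptions: fix the element order (for example by sorting) before assigning indices, so that a recomputation yields the same pairs.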
> In some cases, operation zipWithIndex get a wrong results
> ----------------------------------------------------------
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.1
> Reporter: Guoqiang Li
> Priority: Critical
>
> Code to reproduce:
> {code}
> val c = sc.parallelize(1 to 7899).flatMap { i =>
> (1 to 10000).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex()
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
> =>
> {code}
> Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)),
> (36579712,(13,14)))
> {code}