[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558 ]
Ye Xianjin commented on SPARK-3098:
-----------------------------------

Hi [~srowen] and [~gq],

I think what [~matei] means is that, because the ordering of elements produced by distinct() is not guaranteed, the result of zipWithIndex is not deterministic. If the RDD containing the distinct transformation is recomputed, you are not guaranteed to get the same result. That explains the behavior here.

But as [~srowen] said, it is surprising to see different results from the same RDD. [~matei], what do you think about this behavior?

> In some cases, operation zipWithIndex get a wrong results
> ----------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> Code to reproduce:
> {code}
> val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 10000).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex()
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
> =>
> {code}
> Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14)))
> {code}
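
To make the point about ordering concrete, here is a minimal sketch of one possible workaround. It is not a fix proposed in this thread. It assumes that imposing a total order with sortBy before zipWithIndex is acceptable for the job. Because the values are distinct, each index is then the element's rank in sorted order, so a recomputation should assign the same index to the same element. The object name StableZipWithIndex, the helper countMismatches, the smaller input ranges, and the local[2] master are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits: needed for join on Spark 1.0.x, unnecessary on later versions.
import org.apache.spark.SparkContext._

// Hypothetical example object; not part of Spark or of the issue.
object StableZipWithIndex {

  /** Counts keys whose index differs between the two sides of a self-join. */
  def countMismatches(sc: SparkContext): Long = {
    val indexed = sc.parallelize(1 to 200, 8).flatMap { i =>
      // Overlapping ranges, so distinct() has real duplicates to remove.
      (1 to 100).map(p => i * 60 + p)
    }.distinct()
      .sortBy(x => x)   // impose a total order before assigning indices
      .zipWithIndex()   // index is now the rank of the value in sorted order

    // Each side of the join may recompute the lineage independently.
    indexed.join(indexed).filter { case (_, (a, b)) => a != b }.count()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StableZipWithIndex").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(s"mismatched indices: ${countMismatches(sc)}")
    finally sc.stop()
  }
}
```

With the sort in place the mismatch count is expected to be 0. Removing the sortBy line brings back the pipeline shape from the report, where the count may be non-zero depending on how the shuffle output happens to be ordered. The sort costs an extra shuffle, so this is a trade-off rather than a free fix.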