[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117558#comment-14117558
 ] 

Ye Xianjin commented on SPARK-3098:
-----------------------------------

Hi [~srowen] and [~gq], I think what [~matei] means is that, because the 
ordering of elements produced by distinct() is not guaranteed, the result of 
zipWithIndex is not deterministic. If an RDD that includes the distinct 
transformation is recomputed, you are not guaranteed to get the same indices. 
That explains the behavior here: c.join(c) can evaluate the lineage of c more 
than once, so the two sides may pair the same key with different indices.

But as [~srowen] said, it is surprising to see different results from the same 
RDD. [~matei], what do you think about this behavior?
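
For anyone who needs stable indices in the meantime, one possible workaround 
is to impose a total order on the distinct values before calling zipWithIndex, 
so that each value's index is its rank no matter how the partitions come out 
on a recomputation. The sketch below is only an illustration under that 
assumption: the object name StableZipWithIndex, the helper indexStably, the 
local[4] master and the smaller input size are made up for this example and 
are not part of the report.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed for join on Spark 1.x
import org.apache.spark.rdd.RDD

object StableZipWithIndex {

  // Hypothetical helper: sort the (already distinct) values first, so the
  // index given by zipWithIndex is the value's rank in the sorted order.
  def indexStably(values: RDD[Int]): RDD[(Int, Long)] =
    values.sortBy(x => x).zipWithIndex()

  def main(args: Array[String]): Unit = {
    // Master and app name are placeholders for a local run.
    val conf = new SparkConf().setMaster("local[4]").setAppName("stable-zip-with-index")
    val sc = new SparkContext(conf)
    try {
      // Same shape as the reproduction in the report, with a smaller input.
      val values = sc.parallelize(1 to 100).flatMap { i =>
        (1 to 1000).map(p => i * 600 + p)
      }.distinct()

      val indexed = indexStably(values)

      // Count keys whose two joined indices disagree.
      val mismatches = indexed.join(indexed)
        .filter { case (_, (left, right)) => left != right }
        .count()
      println(s"mismatched indices: $mismatches")
    } finally {
      sc.stop()
    }
  }
}
```

With the sort in place the join should not report mismatches, at the cost of 
an extra shuffle. Persisting the indexed RDD may also help in practice, but as 
far as I understand it is not a guarantee, because lost or evicted partitions 
are recomputed.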

>  In some cases, operation zipWithIndex get a wrong results
> ----------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Critical
>
> Code to reproduce:
> {code}
>     val c = sc.parallelize(1 to 7899).flatMap { i =>
>       (1 to 10000).toSeq.map(p => i * 6000 + p)
>     }.distinct().zipWithIndex()
>     c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
