[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365895#comment-14365895 ]
Apache Spark commented on SPARK-5750:
-------------------------------------

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074

> Document that ordering of elements in shuffled partitions is not deterministic across runs
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5750
>                 URL: https://issues.apache.org/jira/browse/SPARK-5750
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Josh Rosen
>
> The ordering of elements in shuffled partitions is not deterministic across
> runs. For instance, consider the following example:
> {code}
> val largeFiles = sc.textFile(...)
> val airlines = largeFiles.repartition(2000).cache()
> println(airlines.first)
> {code}
> If this code is run twice, the two runs may output different results.
> There is non-determinism in the shuffle read code that accounts for this:
> Spark's shuffle read path processes blocks as soon as they are fetched. Spark
> uses
> [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
> to fetch shuffle data from mappers. In this code, requests for multiple
> blocks from the same host are batched together, so nondeterminism in where
> tasks are run means that the set of requests can vary across runs. In
> addition, there's an [explicit
> call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
> to randomize the order of the batched fetch requests. As a result, shuffle
> operations cannot be guaranteed to produce the same ordering of the elements
> in their partitions.
> Therefore, Spark should update its docs to clarify that the ordering of
> elements in shuffle RDDs' partitions is non-deterministic.
> Note, however, that the _set_ of elements in each partition will be
> deterministic: if we used {{mapPartitions}} to sort each partition, then the
> {{first()}} call above would produce a deterministic result.
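
The behaviour described in the issue can be illustrated without a cluster. The following is only a minimal sketch in plain Scala, not Spark's actual shuffle code: the object `ShuffleOrderSketch`, the sample `blocks`, and the helpers `readPartition` and `readSortedPartition` are hypothetical names made up for this illustration. A seed stands in for "one run", and a random permutation of the blocks stands in for the arbitrary order in which fetched blocks arrive.

```scala
import scala.util.Random

// Hypothetical stand-in for a shuffle read; none of these names come from Spark.
object ShuffleOrderSketch {

  // Blocks destined for one reduce partition, one block per (hypothetical) mapper.
  val blocks: Seq[Seq[String]] = Seq(
    Seq("delta", "alpha"),
    Seq("charlie"),
    Seq("bravo", "echo")
  )

  // Consume the blocks in an arbitrary arrival order. The seed plays the
  // role of "one run": different seeds may yield different element orders,
  // but always the same set of elements.
  def readPartition(seed: Long): Seq[String] =
    new Random(seed).shuffle(blocks).flatten

  // The remedy mentioned in the issue: sort within the partition before
  // reading from it, so the first element no longer depends on arrival order.
  def readSortedPartition(seed: Long): Seq[String] =
    readPartition(seed).sorted

  def main(args: Array[String]): Unit = {
    for (seed <- 1L to 3L) {
      println(s"run $seed: unsorted first = ${readPartition(seed).head}, " +
        s"sorted first = ${readSortedPartition(seed).head}")
    }
  }
}
```

Across seeds, the unsorted first element may change while the sorted first element stays the same. On a real RDD the analogous step would presumably look something like `airlines.mapPartitions(iter => iter.toArray.sorted.iterator)` before calling `first()`; that line is an untested assumption here, and it buffers each partition in memory.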