[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs

Apache Spark (JIRA) Tue, 17 Mar 2015 12:49:25 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365895#comment-14365895
 ]


Apache Spark commented on SPARK-5750:
-------------------------------------

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074

> Document that ordering of elements in shuffled partitions is not 
> deterministic across runs
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5750
>                 URL: https://issues.apache.org/jira/browse/SPARK-5750
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Josh Rosen
>
> The ordering of elements in shuffled partitions is not deterministic across 
> runs.  For instance, consider the following example:
> {code}
> val largeFiles = sc.textFile(...)
> val airlines = largeFiles.repartition(2000).cache()
> println(airlines.first)
> {code}
> If this code is run twice, then each run will output a different result.  
> There is non-determinism in the shuffle read code that accounts for this:
> Spark's shuffle read path processes blocks as soon as they are fetched  Spark 
> uses 
> [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
>  to fetch shuffle data from mappers.  In this code, requests for multiple 
> blocks from the same host are batched together, so nondeterminism in where 
> tasks are run means that the set of requests can vary across runs.  In 
> addition, there's an [explicit 
> call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
>  to randomize the order of the batched fetch requests.  As a result, shuffle 
> operations cannot be guaranteed to produce the same ordering of the elements 
> in their partitions.
> Therefore, Spark should update its docs to clarify that the ordering of 
> elements in shuffle RDDs' partitions is non-deterministic.  Note, however, 
> that the _set_ of elements in each partition will be deterministic: if we 
> used {{mapPartitions}} to sort each partition, then the {{first()}} call 
> above would produce a deterministic result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs

Reply via email to