[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551465#comment-17551465 ]

Igor Berman commented on SPARK-23207:
-------------------------------------

We are still facing this issue in production with v3.1.2 at very large 
workloads. It happens very rarely, but it does happen.
Our attempts to reproduce the problem with the reproduction above have failed, 
so at this point we have no reproduction; we will update this ticket if we find one.

> Shuffle+Repartition on a DataFrame could lead to incorrect answers
> ------------------------------------------------------------------
>
>                 Key: SPARK-23207
>                 URL: https://issues.apache.org/jira/browse/SPARK-23207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>            Reporter: Xingbo Jiang
>            Assignee: Xingbo Jiang
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.1.4, 2.2.3, 2.3.0
>
>
> Currently, shuffle repartition uses RoundRobinPartitioning; the generated 
> result is nondeterministic because the order of the input rows is not 
> deterministic.
> The bug can be triggered when a repartition call follows a shuffle (which 
> leads to non-deterministic row ordering), as in the pattern below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage are retried and produce a different row ordering, so some retried tasks 
> of the result stage generate different data.
> The following code returns 931532 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
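>   // On the first attempt of the tasks for partitions 0 and 1, kill the
>   // executor's Java processes to simulate an executor failure and force retries.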
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
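
For what it's worth, a minimal workaround sketch (not from the original report; it assumes the single-column output of spark.range and a spark-shell session like the one above): repartitioning by an explicit column uses hash partitioning, so a row's target partition depends only on its value rather than on the nondeterministic order in which round-robin sees the input rows, even when repartition-stage tasks are retried.

{code:java}
// Sketch only: key the repartition on the "id" column produced by spark.range,
// so each row is routed by a hash of its value rather than by arrival order.
import org.apache.spark.sql.functions.col

val res = spark.range(0, 1000 * 1000, 1)
  .repartition(200, col("id"))   // hash partitioning instead of round-robin
res.distinct().count()           // expected to remain 1000000 across task retries
{code}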


