Jakub Leś created SPARK-38137:
---------------------------------

             Summary: Repartition+Shuffle+ non deterministic function leads to bad results
                 Key: SPARK-38137
                 URL: https://issues.apache.org/jira/browse/SPARK-38137
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1, 3.1.1
            Reporter: Jakub Leś

Hi,

Using a non-deterministic function (such as rand) in repartition leads to incorrect results.

Reproduce (correct):
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.rand

val res = spark.range(0, 100 * 100, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  // On the first attempt of partitions 0 and 1, kill the Java processes
  // so that the task fails and has to be retried.
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}
Result: 10000 (correct)

Reproduce (bad):
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.rand

val res = spark.range(0, 100 * 100, 1).repartition(200).map { x =>
  x
}.repartition(10, Array(rand): _*).map { x =>
  // Same failure injection as above; only the second repartition differs:
  // it partitions by the non-deterministic rand expression.
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}
Result: 9396 (incorrect; expected 10000)