(spark) branch master updated: [SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default

wenchen Sun, 04 Jan 2026 23:23:00 -0800

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 25307abfbfbc [SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default
25307abfbfbc is described below

commit 25307abfbfbc26baa0ee5c9ea97c2ada3f59a3f2
Author: Tengfei Huang <[email protected]>
AuthorDate: Mon Jan 5 15:22:16 2026 +0800

    [SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default
    
    ### What changes were proposed in this pull request?
    Enable checksum based indeterminate shuffle retry by default.
    
    Increase the JVM heap size to 6g for `sql` module tests. The test case
    [SPARK-48037: Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate
    data](https://github.com/apache/spark/blob/316322cbcb55ff5c1b4e479bc2aae12babdae534/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala#L2696)
    sets the number of shuffle partitions to `16777216`, so computing the
    order-independent shuffle checksum needs more memory.
    
    ### Why are the changes needed?
    The checksum-based solution detects changes in indeterminate shuffle output
    more accurately, so this PR enables it by default to avoid query correctness
    issues caused by indeterminate shuffle retries.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #53652 from ivoson/SPARK-54556-followup.
    
    Authored-by: Tengfei Huang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 .sbtopts                                                              | 2 +-
 docs/sql-migration-guide.md                                           | 4 ++++
 project/SparkBuild.scala                                              | 3 +++
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala        | 4 ++--
 4 files changed, 10 insertions(+), 3 deletions(-)
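
Illustrative note (not part of the commit): the commit message relies on the checksum being order independent, so that a retried task that emits the same rows in a different order is not reported as a mismatch. A minimal sketch of that idea follows, assuming per-row CRC32 checksums combined with XOR and a wrapping sum. The object name, the `String` row representation, and the choice of CRC32 are all hypothetical, and Spark's actual implementation may differ.

```scala
import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

// Hypothetical sketch; not Spark's implementation.
object OrderIndependentChecksumSketch {

  // Checksum of a single row. CRC32 is an arbitrary choice for this sketch.
  def rowChecksum(row: String): Long = {
    val crc = new CRC32()
    crc.update(row.getBytes(StandardCharsets.UTF_8))
    crc.getValue
  }

  // Combine per-row checksums with commutative operations only (XOR and a
  // wrapping sum), so the result does not depend on the order of the rows.
  // The sum also distinguishes duplicated rows, which XOR alone would cancel.
  def partitionChecksum(rows: Seq[String]): (Long, Long) = {
    var xor = 0L
    var sum = 0L
    rows.foreach { row =>
      val c = rowChecksum(row)
      xor ^= c
      sum += c
    }
    (xor, sum)
  }
}
```

Under these assumptions, reordering the rows of a partition leaves the pair unchanged, while changing, adding, or dropping a row is expected to change it (up to hash collisions).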

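Illustrative note (not part of the commit): the SQLConf.scala hunks in this diff change two defaults from `false` to `true`. A session that wants the values those hunks replace could presumably set both keys explicitly, as sketched below. The first key is the one named in the migration-guide line; the second is taken from the `buildConf` call visible in the SQLConf.scala hunk, because the migration-guide line is truncated in this email. The object name, app name, and local master are hypothetical, and whether both keys can also be changed on a running session is not confirmed here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch; requires Spark on the classpath.
object RestorePreviousShuffleRetryDefaults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("restore-previous-shuffle-retry-defaults")
      // `false` is the default value that this commit replaces with `true`.
      .config("spark.sql.shuffle.orderIndependentChecksum.enabled", false)
      .config("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch", false)
      .getOrCreate()

    // Read one of the settings back to confirm what the session sees.
    println(spark.conf.get("spark.sql.shuffle.orderIndependentChecksum.enabled"))

    spark.stop()
  }
}
```
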
diff --git a/.sbtopts b/.sbtopts
index 3516fc4bd7eb..2b2bb68217d4 100644
--- a/.sbtopts
+++ b/.sbtopts
@@ -16,5 +16,5 @@
 #
 
 -J-Xmx8g
--J-Xms8g
+-J-Xms4g
 -J-XX:MaxMetaspaceSize=1g
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 0a2533d28f0b..d3fc220559f3 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -22,6 +22,10 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 4.1 to 4.2
+
+- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderI [...]
+
 ## Upgrading from Spark SQL 4.0 to 4.1
 
 - Since Spark 4.1, the Parquet reader no longer assumes all struct values to be null, if all the requested fields are missing in the parquet file. The new default behavior is to read an additional struct field that is present in the file to determine nullness. To restore the previous behavior, set `spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing` to `true`.
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 924b4df98a56..83b5ee84478a 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -1322,6 +1322,9 @@ object SqlApi {
 object SQL {
   import BuildCommons.protoVersion
   lazy val settings = Seq(
+    // SPARK-54830: avoid AdaptiveQueryExecSuite OOM, since computing order independent shuffle checksum needs more
+    // memory for test case introduced by SPARK-48037 which set shuffle partition to 16777216
+    (Test / javaOptions) += "-Xmx6g",
     // Setting version for the protobuf compiler. This has to be propagated to every sub-project
     // even if the project is not using it.
     PB.protocVersion := BuildCommons.protoVersion,
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 23a1769508a7..bb7bf3b54016 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -907,7 +907,7 @@ object SQLConf {
         "retry all tasks of the consumer stages to avoid correctness issues.")
       .version("4.1.0")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
 
   private[spark] val SHUFFLE_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED =
     buildConf("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch")
@@ -915,7 +915,7 @@ object SQLConf {
         "with its producer stages.")
       .version("4.1.0")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
 
   val SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE =
     buildConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize")


