jaylisde opened a new pull request, #12067:
URL: https://github.com/apache/gluten/pull/12067

   ## Summary
   
   Spark 4.1 introduced shuffle checksum end-to-end verification (SPARK-53322), 
requiring `MapStatus.checksumValue` to be non-zero and `.checksum` files to 
contain valid per-partition checksums. Gluten's `ColumnarShuffleWriter` was 
passing an empty checksum array to `writeMetadataFileAndCommit` and omitting 
the `checksumValue` parameter in `MapStatus`.
   
   **Fix:** After native shuffle write completes, read the data file (still in 
page cache) and compute per-partition checksums using the same algorithm as 
Spark's verification logic (`spark.shuffle.checksum.algorithm`, default 
ADLER32). Pass the checksums to `writeMetadataFileAndCommit` and an aggregated 
value to `MapStatus`.
   
   This is a pure Scala-layer fix; no C++/JNI changes are required. Because the 
data file was just written by the native shuffle writer, it is likely still in 
the OS page cache, so the sequential read should be served mostly from memory 
and add little overhead.
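
   The per-partition pass described above can be sketched roughly as follows. 
This is a minimal illustration, not the code in this PR: the object name, the 
`newChecksum` factory parameter, and the use of `java.util.zip.Adler32` directly 
(instead of `ShuffleChecksumHelper.getChecksumByAlgorithm()`) are assumptions 
made to keep the example self-contained. It assumes partitions are laid out 
back to back in the data file and that `partitionLengths(i)` is the byte length 
of partition `i`.

```scala
import java.io.{BufferedInputStream, EOFException, File, FileInputStream}
import java.util.zip.{Adler32, Checksum}

// Hypothetical helper for illustration; not part of Gluten or Spark.
object PartitionChecksumSketch {

  /** Reads `dataFile` sequentially and returns one checksum per partition.
    * `newChecksum` creates a fresh checksum per partition (ADLER32 by default here).
    */
  def computePartitionChecksums(
      dataFile: File,
      partitionLengths: Array[Long],
      newChecksum: () => Checksum = () => new Adler32()): Array[Long] = {
    val checksums = new Array[Long](partitionLengths.length)
    val buffer = new Array[Byte](64 * 1024)
    val in = new BufferedInputStream(new FileInputStream(dataFile))
    try {
      var i = 0
      while (i < partitionLengths.length) {
        val checksum = newChecksum()
        var remaining = partitionLengths(i)
        // Consume exactly `remaining` bytes for this partition.
        while (remaining > 0) {
          val toRead = math.min(buffer.length.toLong, remaining).toInt
          val n = in.read(buffer, 0, toRead)
          if (n < 0) {
            throw new EOFException(s"Data file ended inside partition $i")
          }
          checksum.update(buffer, 0, n)
          remaining -= n
        }
        // An empty partition keeps the algorithm's initial value (1 for Adler32).
        checksums(i) = checksum.getValue
        i += 1
      }
    } finally {
      in.close()
    }
    checksums
  }
}
```

   In the actual change the algorithm is selected via 
`spark.shuffle.checksum.algorithm`, and the computation is skipped when 
`spark.shuffle.checksum.enabled` is false; the sketch leaves both out.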
   
   ## Changes
   
   - `ColumnarShuffleWriter.scala`: Added a `computePartitionChecksums()` method 
that reads the shuffle data file and computes per-partition checksums using 
`ShuffleChecksumHelper.getChecksumByAlgorithm()`. It respects the 
`spark.shuffle.checksum.enabled` and `spark.shuffle.checksum.algorithm` configs.
   - `VeloxTestSettings.scala`: Enabled `GlutenMapStatusEndToEndSuite` 
(previously commented out).
   
   ## Test
   
   - `GlutenMapStatusEndToEndSuite` passes with the default config 
(`spark.gluten.sql.ansiFallback.enabled=true`).
   - Verified with `-Dspark.gluten.sql.ansiFallback.enabled=false`: 
`ColumnarShuffleWriter` produces correct ADLER32 checksums.
   
   Closes #11915


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

