[ https://issues.apache.org/jira/browse/SPARK-56511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xingbo Jiang resolved SPARK-56511.
----------------------------------
Resolution: Fixed
> NPE in ShuffleInMemorySorter.getMemoryUsage() when reset() fails to reallocate array
> ------------------------------------------------------------------------------------
>
> Key: SPARK-56511
> URL: https://issues.apache.org/jira/browse/SPARK-56511
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.1.0, 4.2.0, 4.1.1
> Reporter: Tim Lee
> Priority: Major
> Labels: pull-request-available
>
> SPARK-49386 introduced a getMemoryUsage() call on every insertRecord(). This exposes an NPE when the reset() performed during spill() fails to reallocate the pointer array:
> 1. insertRecord() → memory pressure → spill() → reset()
> 2. reset() sets array = null, then allocateArray() throws SparkOutOfMemoryError
> 3. OOM propagates to UnsafeShuffleWriter.write()'s finally block
> 4. cleanupResources() → freeMemory() → updatePeakMemoryUsed() → getMemoryUsage() → inMemSorter.getMemoryUsage() → NPE on array.size()
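Steps 1 and 2 can be reproduced in a minimal, self-contained Java model. This is a sketch, not Spark's code: SketchSorter, SketchOutOfMemoryError and the failAllocation flag are hypothetical stand-ins for ShuffleInMemorySorter, SparkOutOfMemoryError and real memory pressure; allocateArray() is moved onto the sorter for brevity, and a plain long[] replaces Spark's LongArray. It shows that when reset() clears the field before reallocating, a failed allocation leaves the object with array == null, so an unguarded getMemoryUsage() throws.

```java
// Hypothetical, simplified model of steps 1-2; not Spark's implementation.
class ResetFailureSketch {

    /** Stand-in for SparkOutOfMemoryError (assumption: any unchecked error behaves the same here). */
    static final class SketchOutOfMemoryError extends RuntimeException {
        SketchOutOfMemoryError(String msg) { super(msg); }
    }

    /** Stand-in for ShuffleInMemorySorter: owns a pointer array and reports its size. */
    static final class SketchSorter {
        long[] array = new long[1024];
        boolean failAllocation = false; // simulates memory pressure

        long[] allocateArray(int size) {
            if (failAllocation) {
                throw new SketchOutOfMemoryError("cannot allocate " + size + " longs");
            }
            return new long[size];
        }

        /** Same order as step 2: drop the old array first, then reallocate. */
        void reset(int initialSize) {
            array = null;                       // old array released
            array = allocateArray(initialSize); // may throw, leaving array == null
        }

        /** Pre-fix behaviour: dereferences array unconditionally (8 bytes per long). */
        long getMemoryUsage() {
            return array.length * 8L;
        }
    }

    public static void main(String[] args) {
        SketchSorter sorter = new SketchSorter();
        sorter.failAllocation = true;
        try {
            sorter.reset(1024);
        } catch (SketchOutOfMemoryError e) {
            System.out.println("reset() failed: " + e.getMessage());
        }
        System.out.println("array is null: " + (sorter.array == null));
        try {
            sorter.getMemoryUsage();
        } catch (NullPointerException e) {
            System.out.println("getMemoryUsage() threw NPE");
        }
    }
}
```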
> At this point inMemSorter is still non-null (the failed reset() prevented spill() from completing), but inMemSorter.array is null. cleanupResources() reaches the array through freeMemory() before it ever gets to inMemSorter.free().
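The ordering problem in cleanupResources() can be modeled the same way. In this hypothetical sketch, CleanupOrderSketch, InMemSorter, ExternalSorter and peakMemoryUsedBytes are illustrative names and the method bodies are assumptions rather than Spark's source; the sorter is constructed with a null array to stand for the state a failed reset() leaves behind. The peak-memory update inside freeMemory() reads the array before inMemSorter.free() runs, so cleanup aborts with an NPE and free() is never reached.

```java
// Hypothetical model of the cleanup order in step 4; not Spark's implementation.
class CleanupOrderSketch {

    static final class InMemSorter {
        long[] array;
        boolean freed = false;

        InMemSorter(long[] array) { this.array = array; }

        /** Pre-fix behaviour: no null check. */
        long getMemoryUsage() { return array.length * 8L; }

        void free() { array = null; freed = true; }
    }

    static final class ExternalSorter {
        final InMemSorter inMemSorter;
        long peakMemoryUsedBytes = 0;

        ExternalSorter(InMemSorter inMemSorter) { this.inMemSorter = inMemSorter; }

        long getMemoryUsage() { return inMemSorter.getMemoryUsage(); }

        void updatePeakMemoryUsed() {
            peakMemoryUsedBytes = Math.max(peakMemoryUsedBytes, getMemoryUsage());
        }

        void freeMemory() {
            updatePeakMemoryUsed(); // reads the array before anything is freed
            // (data pages would be released here in a real sorter)
        }

        void cleanupResources() {
            freeMemory();           // NPE here if inMemSorter.array == null
            if (inMemSorter != null) {
                inMemSorter.free(); // never reached in the failing case
            }
        }
    }

    public static void main(String[] args) {
        InMemSorter afterFailedReset = new InMemSorter(null);
        ExternalSorter sorter = new ExternalSorter(afterFailedReset);
        try {
            sorter.cleanupResources();
        } catch (NullPointerException e) {
            System.out.println("NPE during cleanup; free() reached: " + afterFailedReset.freed);
        }
    }
}
```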
> Stack trace we see in prod:
> {code:java}
> java.lang.NullPointerException
> at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.getMemoryUsage(ShuffleInMemorySorter.java:131)
> at org.apache.spark.shuffle.sort.ShuffleExternalSorter.getMemoryUsage(ShuffleExternalSorter.java:349)
> at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:472)
> at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:297)
> at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:213)
> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:58)
> at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:87)
> at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
> at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:82)
> at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
> ...
> {code}
> Fix: null-check array in getMemoryUsage() and return 0 when it is null. This is correct: when array is null, the pointer array was already freed by reset() and never reallocated, so its memory usage really is zero.
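The null-check fix can be sketched on a simplified, hypothetical model rather than the actual ShuffleInMemorySorter source. The class names, the cleanup() helper and the 8-bytes-per-entry size calculation are assumptions. With the guard in place, the peak-memory update sees 0 for a sorter whose reset() failed, and cleanup goes on to call free().

```java
// Hypothetical model of the null-safe getMemoryUsage(); not Spark's implementation.
class NullSafeMemoryUsageSketch {

    static final class InMemSorter {
        long[] array;
        boolean freed = false;

        InMemSorter(long[] array) { this.array = array; }

        /** Post-fix behaviour: a null array means nothing is allocated. */
        long getMemoryUsage() {
            if (array == null) {
                return 0L; // pointer array already released by reset() and never reallocated
            }
            return array.length * 8L; // 8 bytes per long entry
        }

        void free() { array = null; freed = true; }
    }

    /** Same order as before: peak update first, then free(); now safe when array == null. */
    static long cleanup(InMemSorter sorter, long peakSoFar) {
        long peak = Math.max(peakSoFar, sorter.getMemoryUsage());
        sorter.free();
        return peak;
    }

    public static void main(String[] args) {
        InMemSorter afterFailedReset = new InMemSorter(null);
        long peak = cleanup(afterFailedReset, 4096L);
        System.out.println("peak=" + peak + " freed=" + afterFailedReset.freed);
        System.out.println(new InMemSorter(new long[128]).getMemoryUsage());
    }
}
```

Returning 0 rather than skipping the peak update keeps the existing cleanup order unchanged, and Math.max() preserves whatever peak was recorded before the failed reset().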
--
This message was sent by Atlassian Jira
(v8.20.10#820010)