[ 
https://issues.apache.org/jira/browse/SPARK-56511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-56511.
----------------------------------
    Resolution: Fixed

> NPE in ShuffleInMemorySorter.getMemoryUsage() when reset() fails to 
> reallocate array
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-56511
>                 URL: https://issues.apache.org/jira/browse/SPARK-56511
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.1.0, 4.2.0, 4.1.1
>            Reporter: Tim Lee
>            Priority: Major
>              Labels: pull-request-available
>
> SPARK-49386 introduced a getMemoryUsage() call on every insertRecord(). This 
> exposes an NPE when spill's reset() fails to reallocate the pointer array:
> 1. insertRecord() → memory pressure → spill() → reset()
> 2. reset() sets array = null, then allocateArray() throws 
> SparkOutOfMemoryError
> 3. OOM propagates to UnsafeShuffleWriter.write()'s finally block
> 4. cleanupResources() → freeMemory() → updatePeakMemoryUsed() → 
> getMemoryUsage() → inMemSorter.getMemoryUsage() → NPE on array.size()
> inMemSorter is non-null (the failed reset prevented spill from completing), 
> but inMemSorter.array is null. cleanupResources() dereferences array via 
> freeMemory() before reaching inMemSorter.free().
> Stack trace we see in prod:
> {code:java}
> java.lang.NullPointerException
>       at 
> org.apache.spark.shuffle.sort.ShuffleInMemorySorter.getMemoryUsage(ShuffleInMemorySorter.java:131)
>       at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.getMemoryUsage(ShuffleExternalSorter.java:349)
>       at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:472)
>       at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:297)
>       at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:213)
>       at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:58)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:87)
>       at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:82)
>       at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
> ... {code}
> Fix: null-check array in getMemoryUsage() and return 0 when it is null. This 
> is correct: when array is null, the pointer array was already freed by 
> reset() and never reallocated, so its memory usage is zero.
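
The failure sequence and the null-check described above can be sketched in a small self-contained model. This is not the Spark source: the class names, the `simulateAllocationFailure` flag, the `IllegalStateException` standing in for SparkOutOfMemoryError, and the 8-bytes-per-entry accounting are all assumptions made for illustration.

```java
// Illustrative model only; names and sizes are hypothetical, not Spark's code.
final class NullSafeMemoryUsageSketch {

    /** Hypothetical stand-in for the in-memory sorter that owns the pointer array. */
    static final class InMemorySorterSketch {
        private long[] array;        // pointer array; null after a failed reset()
        private final int capacity;

        InMemorySorterSketch(int capacity) {
            this.capacity = capacity;
            this.array = new long[capacity];
        }

        /** Models the described reset(): drop the array first, then reallocate. */
        void reset(boolean simulateAllocationFailure) {
            array = null;
            if (simulateAllocationFailure) {
                // Assumed stand-in for the SparkOutOfMemoryError from allocateArray().
                throw new IllegalStateException("simulated allocation failure");
            }
            array = new long[capacity];
        }

        /** Pre-fix shape: dereferences the array unconditionally. */
        long getMemoryUsageUnchecked() {
            return array.length * 8L;
        }

        /** Fix as described: a null array means no pointer-array memory is held. */
        long getMemoryUsage() {
            if (array == null) {
                return 0L;
            }
            return array.length * 8L;
        }
    }

    public static void main(String[] args) {
        InMemorySorterSketch sorter = new InMemorySorterSketch(4);
        try {
            sorter.reset(true);      // leaves array == null, as in step 2
        } catch (IllegalStateException e) {
            // The cleanup path would query memory usage from here.
        }
        System.out.println(sorter.getMemoryUsage());
        try {
            sorter.getMemoryUsageUnchecked();
        } catch (NullPointerException e) {
            System.out.println("NPE from unchecked accessor");
        }
    }
}
```

In this model the checked accessor reports 0 after the failed reset, while the unchecked one throws NullPointerException, which matches the behavior the report describes for the cleanup path.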



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

