Hi,

We occasionally hit OutOfMemoryErrors when running Spark 3.1 on Java 17
with the G1 garbage collector (region size = 32MB) and a 200GB heap. The
OOM happens in ShuffleExternalSorter when it tries to allocate a 1GB
pointer array, even though about 80GB of heap is still free after a Full
GC. We observe many large allocations, such as
ShuffleExternalSorter.allocatedPages (64MB) and pointer arrays (512MB to
1GB).
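
For reference, the relevant settings look roughly like this (paraphrased
from our configs, values quoted from memory):

    spark.executor.memory=200g
    spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=32m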

Our theory is that, due to heap fragmentation caused by these large
objects, the JVM cannot find 32 contiguous free regions for the 1GB
pointer array (a humongous allocation from G1's point of view). I can set
'spark.buffer.pageSize' to match G1's region size (32MB), but there isn't
much I can do about the pointer array itself other than increasing the
number of shuffle partitions, which would shrink it. However, tuning every
Spark job individually isn't feasible.
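
For a single job, the workaround would look roughly like this (the values
are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-heavy-job")  // hypothetical job name
      // Align Spark's memory pages with the 32MB G1 region size.
      .config("spark.buffer.pageSize", "32m")
      // More shuffle partitions -> fewer records per task -> smaller pointer array.
      .config("spark.sql.shuffle.partitions", "4000")  // illustrative value
      .getOrCreate()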

I guess splitting the pointer array into 32MB chunks would help, since no
single chunk would then need a long run of contiguous regions.
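
Roughly this idea, sketched outside of Spark (the class and names below
are mine, not Spark internals):

    // Minimal sketch: back one logical long array with fixed-size chunks so
    // that no single allocation has to span many contiguous G1 regions.
    class ChunkedLongArray(totalEntries: Long, chunkBytes: Int = 32 * 1024 * 1024) {
      private val perChunk = chunkBytes / 8  // longs per chunk
      private val chunks: Array[Array[Long]] =
        Array.tabulate(((totalEntries + perChunk - 1) / perChunk).toInt) { i =>
          val remaining = totalEntries - i.toLong * perChunk
          new Array[Long](math.min(perChunk.toLong, remaining).toInt)
        }
      def get(i: Long): Long = chunks((i / perChunk).toInt)((i % perChunk).toInt)
      def set(i: Long, v: Long): Unit =
        chunks((i / perChunk).toInt)((i % perChunk).toInt) = v
    }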

Has anyone experienced this issue before? Any thoughts or workarounds?

Thanks.
--
Oleksii
