Hi,

We occasionally hit an OutOfMemoryError when running Spark 3.1 on Java 17 with the G1 garbage collector (region size = 32MB) and a 200GB heap. The OOM occurs in ShuffleExternalSorter when it tries to allocate a 1GB pointer array, even though roughly 80GB of heap is free after a Full GC. We see many large objects being allocated, such as ShuffleExternalSorter.allocatedPages (64MB) and pointer arrays (512MB - 1GB).
Our theory is that these allocations are "humongous" in G1 terms (larger than half a region), so G1 must place each of them in contiguous regions; with the heap fragmented by such objects, it cannot find the 32 contiguous regions needed for the 1GB pointer array. I can set spark.buffer.pageSize to match G1's region size (32MB), but I can't do much about the pointer array itself other than increasing the number of partitions, which would shrink the per-task pointer array; tuning every Spark job separately isn't feasible, though. I suspect that splitting the pointer array into 32MB chunks inside Spark would help (rough sketch in the P.S.). Has anyone run into this before? Any thoughts or workarounds?

Thanks.

-- Oleksii
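P.S. For concreteness, this is the kind of per-job tuning I mean. It's only a sketch: the class name and the 4096 partition count are illustrative, not recommendations.

    import org.apache.spark.SparkConf;

    public class G1AlignedConf {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                // Match the TaskMemoryManager page size to the G1 region
                // size so data pages don't straddle regions.
                .set("spark.buffer.pageSize", "32m")
                // Illustrative value only: more shuffle partitions means
                // fewer records per task, hence a smaller pointer array
                // per ShuffleExternalSorter.
                .set("spark.sql.shuffle.partitions", "4096")
                // Keep the executor JVM on G1 with an explicit 32MB region size.
                .set("spark.executor.extraJavaOptions",
                    "-XX:+UseG1GC -XX:G1HeapRegionSize=32m");
        }
    }

And this is roughly the chunking I have in mind for the pointer array. It's a hypothetical sketch, not Spark's actual code; since G1 treats any object larger than half a region as humongous, chunks of at most 16MB would sidestep humongous allocation entirely with 32MB regions:

    // Hypothetical sketch: back a large logical array with fixed-size
    // chunks instead of one humongous long[].
    final class ChunkedLongArray {
        private static final int CHUNK_BYTES = 16 * 1024 * 1024;       // half of a 32MB G1 region
        private static final int CHUNK_LEN = CHUNK_BYTES / Long.BYTES; // 2M longs per chunk

        private final long[][] chunks;
        private final long size;

        ChunkedLongArray(long size) {
            this.size = size;
            int n = (int) ((size + CHUNK_LEN - 1) / CHUNK_LEN); // ceil(size / CHUNK_LEN)
            chunks = new long[n][];
            for (int i = 0; i < n; i++) {
                long remaining = size - (long) i * CHUNK_LEN;   // last chunk may be partial
                chunks[i] = new long[(int) Math.min(CHUNK_LEN, remaining)];
            }
        }

        long get(long index) {
            return chunks[(int) (index / CHUNK_LEN)][(int) (index % CHUNK_LEN)];
        }

        void set(long index, long value) {
            chunks[(int) (index / CHUNK_LEN)][(int) (index % CHUNK_LEN)] = value;
        }
    }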