I'm trying to run a simple PySpark application that reads a JSON file,
flattens it (explode) and writes it back out as JSON partitioned by date
using DataFrameWriter.partitionBy(*cols).
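
For reference, the job is essentially just this (the column names and paths
below are made-up placeholders, but the shape is the same):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("explode-partitionBy").getOrCreate()

# hypothetical input: each record has a "date" field and an array field "events"
df = spark.read.json("s3://some-bucket/input/")

# flatten: one output row per array element
flat = df.select(col("date"), explode(col("events")).alias("event"))

# this is the write that OOMs; see the stack trace below
flat.write.partitionBy("date").json("s3://some-bucket/output/")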

I keep getting OOMEs like:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.<init>(UnsafeSorterSpillWriter.java:46)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
    .......

Explode can make the underlying RDD grow a lot, and sometimes in an
unbalanced way. On top of that, partitioning by date (in daily ETLs, for
instance) would probably cause data skew (right?). But why am I getting
OOMs? Isn't Spark supposed to spill to disk if the underlying RDD is too
big to fit in memory?

If I don't use partitionBy on the writer (still exploding), everything
works fine.
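
In other words, with the same hypothetical "flat" DataFrame from the snippet
above, this variant completes with no issues:

flat.write.json("s3://some-bucket/output/")

Only adding .partitionBy("date") makes it blow up.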

This happens both on EMR and locally (Mac) in the pyspark/spark shells
(tried both Python and Scala).

Thanks!
