You are right. I've checked the overall stage metrics, and it looks like the
largest shuffle write is over 9G. The partition completed successfully,
but its spilled file can't be removed until all the others are finished.
It's very likely caused by a stupid mistake in my design. A lookup table
grows
I'm running a small job on a cluster with 15G of memory and 8G of disk per
machine.
The job always gets into a deadlock, where the last error message is:
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at