Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc.), that might help with debugging.
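For example, something like the following on each worker collects most of that (standard Linux/JDK tools; exact output varies by distro):

```shell
free -m | head -2          # physical RAM and current usage, in MB
nproc                      # number of CPU cores
jps -lvm | grep -i spark   # running Spark JVMs and their -Xmx/-Xms flags
```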
Andrew
On Mon, Mar
After digging deeper, I realized all the workers had run out of memory, each
leaving an hs_error.log file in /tmp/jvm-PID with the header:
# Native memory allocation (malloc) failed to allocate 2097152 bytes for
committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap
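Note that the failed request itself is tiny: 2097152 bytes is a 2 MiB region, so the box was already exhausted by the combined footprint of all the JVMs on it, not by one huge allocation. A rough budget check, with hypothetical per-worker numbers:

```shell
# Hypothetical cluster numbers; substitute your own.
executors=4      # executor JVMs per worker
heap_gb=8        # -Xmx per executor
overhead_gb=1    # assumed off-heap/JVM overhead per executor
total=$(( executors * (heap_gb + overhead_gb) ))
echo "worker needs ~${total} GB"    # prints: worker needs ~36 GB
```

If that total exceeds the worker's physical RAM plus swap, small mallocs like the 2 MiB commit above will start failing once the JVMs grow into their limits.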
What does this error mean:
@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by: