Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Andrew Ash
Possibly one of your executors is in the middle of a large stop-the-world GC and doesn't respond to network traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc.) that might help with debugging.

Andrew
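
For reference, a minimal sketch (not from the original message) of the settings that usually matter for this symptom, assuming Spark 0.9-era config keys; the exact key names and defaults should be checked against the docs for the version in use, and the app name is just a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: master URL is assumed to come from the launcher/environment.
    val conf = new SparkConf()
      .setAppName("largish-job") // hypothetical app name
      // Per-executor JVM heap; keep the sum across a node well below its physical RAM.
      .set("spark.executor.memory", "8g")
      // Give Akka more patience so a long stop-the-world GC pause is less likely
      // to be treated as a dead peer (value is in seconds in this era).
      .set("spark.akka.timeout", "120")

    val sc = new SparkContext(conf)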

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Nathan Kronenfeld
After digging deeper, I realized all the workers ran out of memory, giving an hs_error.log file in /tmp/jvm-PID with the header:

# Native memory allocation (malloc) failed to allocate 2097152 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap
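
As a rough illustration of what that header implies (every number below is hypothetical): the JVM asked the OS to commit about 2 MB more and was refused, so the combined footprint of the JVMs on the node had already exhausted physical RAM plus swap. A back-of-the-envelope check:

    // Illustrative arithmetic only; all figures are assumptions, not from the thread.
    val physicalRamGb    = 32  // hypothetical node RAM
    val executorsPerNode = 4   // hypothetical JVMs per node
    val executorHeapGb   = 8   // per-executor heap (spark.executor.memory)
    val nativeOverheadGb = 1   // rough assumed per-JVM native/permgen overhead

    val neededGb = executorsPerNode * (executorHeapGb + nativeOverheadGb)
    println(s"~${neededGb}g needed vs ${physicalRamGb}g RAM") // 36g > 32g: malloc eventually fails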

Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
What does this error mean:

@hadoop-s2.oculus.local:45186]: Error [Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by: