After digging deeper, I realized all the workers ran out of memory, each producing an hs_error.log file in /tmp/jvm-<PID> with the header:
# Native memory allocation (malloc) failed to allocate 2097152 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2761), pid=31426, tid=139549745604352
#
# JRE version: OpenJDK Runtime Environment (7.0_51-b02) (build 1.7.0_51-mockbuild_2014_01_15_01_39-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.45-b08 mixed mode linux-amd64 )

We have 3 workers, each assigned 200G for Spark. The dataset is ~250G.

All I'm doing is data.map(r => (getKey(r), r)).sortByKey().map(_._2).coalesce(n).saveAsTextFile(), where n is the original number of files in the dataset. This worked fine under Spark 0.8.1 with the same setup; I haven't changed this code since upgrading to 0.9.0.

I took a look at a worker's memory before it ran out, using jmap and jhat; they indicated file handles as the biggest memory user (which I guess makes sense for a sort), but the total was nowhere close to 200G, so I find their output somewhat suspect.

On Tue, Mar 25, 2014 at 6:59 AM, Andrew Ash <and...@andrewash.com> wrote:

> Possibly one of your executors is in the middle of a large stop-the-world
> GC and doesn't respond to network traffic during that period? If you
> shared some information about how each node in your cluster is set up (heap
> size, memory, CPU, etc.) that might help with debugging.
>
> Andrew
>
>
> On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
>
>> What does this error mean:
>>
>> @hadoop-s2.oculus.local:45186]: Error [Association failed with
>> [akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://spark@hadoop-s2.oculus.local:45186]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186
>> ]
>>
>> ?
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone: +1-416-203-3003 x 238
>> Email: nkronenf...@oculusinfo.com

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
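For reference, the transformation chain in the top message (decorate with a key, sort, undecorate) can be sketched locally without a cluster. This is a minimal sketch using plain Scala collections in place of RDDs; the `Record` type and `getKey` implementation below are hypothetical stand-ins, not the actual job's code, and `coalesce`/`saveAsTextFile` are omitted since they only make sense on RDDs.

```scala
// Local sketch of: data.map(r => (getKey(r), r)).sortByKey().map(_._2)
// Record and getKey are illustrative placeholders for the real job's types.
object SortPipelineSketch {
  case class Record(id: Int, payload: String)

  // Hypothetical key extractor, standing in for getKey(r) in the job.
  def getKey(r: Record): Int = r.id

  // Decorate each record with its key, sort by that key, then drop the key.
  def sortRecords(data: Seq[Record]): Seq[Record] =
    data.map(r => (getKey(r), r)).sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit = {
    val data = Seq(Record(3, "c"), Record(1, "a"), Record(2, "b"))
    println(sortRecords(data).map(_.payload).mkString(",")) // prints a,b,c
  }
}
```

On an RDD the same chain triggers a full shuffle for the sort, which is where the memory pressure in a job like this would come from.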