After digging deeper, I realized all the workers ran out of memory, each
leaving an hs_error.log file in /tmp/jvm-<PID> with this header:

# Native memory allocation (malloc) failed to allocate 2097152 bytes for
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2761), pid=31426, tid=139549745604352
#
# JRE version: OpenJDK Runtime Environment (7.0_51-b02) (build
1.7.0_51-mockbuild_2014_01_15_01_39-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.45-b08 mixed mode linux-amd64 )



We have 3 workers, each assigned 200 GB for Spark.
The dataset is ~250 GB.
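
For reference, here's roughly how the per-worker memory is configured -
just a sketch, assuming the standard 0.9.0 SparkConf route; the master
URL and app name are placeholders, not our actual values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // placeholder master URL
      .setAppName("sort-job")                 // placeholder app name
      .set("spark.executor.memory", "200g")   // the 200 GB per worker mentioned above
    val sc = new SparkContext(conf)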

All I'm doing is

    data.map(r => (getKey(r), r)).sortByKey().map(_._2).coalesce(n).saveAsTextFile()

where n is the original number of files in the dataset.
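
Spelled out, the job is essentially the following - a sketch, not our
exact code; getKey, the paths, and the tab delimiter are stand-ins:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicits that provide sortByKey

    // Hypothetical key extractor standing in for our real getKey.
    def getKey(record: String): String = record.takeWhile(_ != '\t')

    def sortDataset(sc: SparkContext, inPath: String, outPath: String, n: Int) {
      val data = sc.textFile(inPath)
      data.map(r => (getKey(r), r)) // key each record
          .sortByKey()              // global sort by key
          .map(_._2)                // drop the keys again
          .coalesce(n)              // back to the original file count
          .saveAsTextFile(outPath)
    }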

This worked fine under Spark 0.8.1 with the same setup; I haven't changed
this code since upgrading to 0.9.0.

I took a look at a worker's memory with jmap and jhat before it ran out;
they indicated file handles as the biggest memory user (which I guess makes
sense for a sort), but the total was nowhere close to 200 GB, so I find
their output somewhat suspect. (Then again, since the failure is a native
malloc rather than a Java heap allocation, maybe the heap tools just can't
see the real culprit.)
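
As a cross-check, I may add some logging from inside the tasks using the
standard java.lang.management API - a minimal sketch, with names that are
mine rather than from the job:

    import java.lang.management.ManagementFactory

    // Log heap and non-heap usage as the JVM tracks it. This won't
    // capture every native allocation (direct buffers, thread stacks),
    // but it gives a second opinion alongside jmap/jhat.
    def logMemory(tag: String) {
      val bean    = ManagementFactory.getMemoryMXBean
      val heap    = bean.getHeapMemoryUsage
      val nonHeap = bean.getNonHeapMemoryUsage
      println(tag + ": heap used=" + heap.getUsed +
              " committed=" + heap.getCommitted +
              " nonHeap used=" + nonHeap.getUsed +
              " committed=" + nonHeap.getCommitted)
    }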



On Tue, Mar 25, 2014 at 6:59 AM, Andrew Ash <and...@andrewash.com> wrote:

> Possibly one of your executors is in the middle of a large stop-the-world
> GC and doesn't respond to network traffic during that period?  If you
> shared some information about how each node in your cluster is set up (heap
> size, memory, CPU, etc) that might help with debugging.
>
> Andrew
>
>
> On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> What does this error mean:
>>
>> @hadoop-s2.oculus.local:45186]: Error [Association failed with
>> [akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://spark@hadoop-s2.oculus.local:45186]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186
>> ]
>>
>> ?
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone:  +1-416-203-3003 x 238
>> Email:  nkronenf...@oculusinfo.com
>>
>
>


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
