Hmmm, I see this a lot (multiple times per second) in the stdout logs of my
application:

2014-12-19T16:12:35.748+0000: [GC (Allocation Failure) [ParNew:
286663K->12530K(306688K), 0.0074579 secs] 1470813K->1198034K(2063104K),
0.0075189 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]


And finally I see

2014-12-19 16:12:36,116 ERROR [SIGTERM handler]
executor.CoarseGrainedExecutorBackend (SignalLogger.scala:handle(57)) -
RECEIVED SIGNAL 15: SIGTERM

which I assume is coming from Yarn, after which the log contains this and
then ends:

Heap
 par new generation   total 306688K, used 23468K [0x0000000080000000,
0x0000000094cc0000, 0x0000000094cc0000)
  eden space 272640K,   4% used [0x0000000080000000, 0x0000000080abff10,
0x0000000090a40000)
  from space 34048K,  36% used [0x0000000092b80000, 0x00000000937ab488,
0x0000000094cc0000)
  to   space 34048K,   0% used [0x0000000090a40000, 0x0000000090a40000,
0x0000000092b80000)
 concurrent mark-sweep generation total 1756416K, used 1186756K
[0x0000000094cc0000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 52016K, capacity 52683K, committed 52848K, reserved
1095680K
  class space    used 7149K, capacity 7311K, committed 7392K, reserved
1048576K
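A rough reading of that heap summary (based only on the standard meaning of
the ParNew/CMS fields, not on anything stated elsewhere in this thread): the
Java heap is nowhere near full when the SIGTERM arrives, which fits a YARN
container kill rather than a JVM OutOfMemoryError.

    // Sketch of the arithmetic, using the numbers from the heap summary above:
    val oldGenUsedMiB  = 1186756 / 1024.0   // ~1159 MiB used in the CMS old gen
    val oldGenTotalMiB = 1756416 / 1024.0   // ~1715 MiB old-gen capacity
    // The young gen adds roughly another 300 MiB of capacity, so the heap as a
    // whole is about 2 GiB and only a bit over half of it is in use.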







On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jon.ch...@gmail.com> wrote:

> I'm actually already running 1.1.1.
>
> I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
> luck.  Still getting "ExecutorLostFailure (executor lost)".
>
>
>
> On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny <rafal.kwa...@gmail.com>
> wrote:
>
>> Hi,
>> Just upgrade to 1.1.1 - it was fixed some time ago
>>
>> /Raf
>>
>>
>> sandy.r...@cloudera.com wrote:
>>
>> Hi Jon,
>>
>> The fix for this is to increase spark.yarn.executor.memoryOverhead to
>> something greater than its default of 384.
>>
>> This will increase the gap between the executor's heap size and what it
>> requests from YARN. It's required because JVMs take up some memory beyond
>> their heap size.
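For reference, a typical way to pass that setting on the command line (the
numbers below are only illustrative, not a recommendation for this workload):

    spark-submit --master yarn-cluster \
      --class <main-class> \
      --executor-memory 6g \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      <your-app.jar> <args>

YARN then sizes the executor container at roughly executor memory plus the
overhead, so raising the overhead leaves more headroom for the JVM's non-heap
memory (metaspace, thread stacks, direct buffers, and so on).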
>>
>> -Sandy
>>
>> On Dec 19, 2014, at 9:04 AM, Jon Chase <jon.ch...@gmail.com> wrote:
>>
>> I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k
>> small files (~2MB each).  I do a simple map, then keyBy(), and then
>> rdd.saveAsHadoopDataset(...).  Depending on the memory settings given to
>> spark-submit, the time before the first ExecutorLostFailure varies (more
>> memory == longer until failure) - but this usually happens after about 100
>> files have been processed.
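A rough sketch of that job shape, just to make the pipeline concrete; it
assumes a spark-shell style `sc`, and the paths, the map step, and the key
function are placeholders rather than the actual code:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
    import org.apache.spark.SparkContext._  // pair-RDD functions (keyBy, saveAsHadoopDataset)

    val lines  = sc.textFile("s3://bucket/input/")  // ~100k small (~2MB) files
    val mapped = lines.map(_.trim)                  // the "simple map"
    val keyed  = mapped.keyBy(_.take(8))            // keyBy() on some derived key
    val pairs  = keyed.map { case (k, v) => (new Text(k), new Text(v)) }  // to Hadoop Writables

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    jobConf.setOutputKeyClass(classOf[Text])
    jobConf.setOutputValueClass(classOf[Text])
    FileOutputFormat.setOutputPath(jobConf, new Path("s3://bucket/output/"))
    pairs.saveAsHadoopDataset(jobConf)              // rdd.saveAsHadoopDataset(...)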
>>
>> I'm running Spark 1.1.0 on AWS EMR w/Yarn.    It appears that Yarn is
>> killing the executor b/c it thinks it's exceeding memory.  However, I can't
>> repro any OOM issues when running locally, no matter the size of the data
>> set.
>>
>> It seems like YARN thinks the container's memory usage keeps climbing,
>> according to the YARN logs:
>>
>> 2014-12-18 22:06:43,505 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.1 GB of 6.5 GB physical memory
>> used; 13.8 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:46,516 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
>> used; 13.9 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:49,524 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
>> used; 14.0 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:52,531 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.4 GB of 6.5 GB physical memory
>> used; 14.1 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:55,538 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
>> used; 14.2 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:58,549 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Memory usage of ProcessTree 24273 for container-id
>> container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
>> used; 14.3 GB of 32.5 GB virtual memory used
>> 2014-12-18 22:06:58,549 WARN
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Process tree for container:
>> container_1418928607193_0011_01_000002 has processes older than 1 iteration
>> running over the configured limit. Limit=6979321856, current usage =
>> 6995812352
>> 2014-12-18 22:06:58,549 WARN
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>> (Container Monitor): Container
>> [pid=24273,containerID=container_1418928607193_0011_01_000002] is running
>> beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical
>> memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
>> Dump of the process-tree for container_1418928607193_0011_01_000002 :
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> |- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c
>> /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
>> -Xms6144m -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError
>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
>> -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
>> org.apache.spark.executor.CoarseGrainedExecutorBackend
>> akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
>> 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1>
>> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stdout
>> 2>
>> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stderr
>> |- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660
>> /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m
>> -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
>> -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
>> org.apache.spark.executor.CoarseGrainedExecutorBackend
>> akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
>> 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4
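The numbers in that dump line up like this (a hedged reading; the split between
heap and everything else is inferred from the -Xmx flag in the process tree,
not stated anywhere in the logs):

    // Limit=6979321856 from the WARN line is the "6.5 GB" physical memory cap:
    val containerLimitMiB  = 6979321856L / (1024 * 1024)  // 6656 MiB
    val heapMiB            = 6144L                        // -Xmx6144m from the java command line
    val offHeapHeadroomMiB = containerLimitMiB - heapMiB   // 512 MiB left for everything that
    // is not Java heap: metaspace, thread stacks, direct/NIO buffers, the wrapping
    // bash process, and so on. Once the process tree's RSS crosses that limit, the
    // NodeManager kills the container, which is exactly what the WARN lines report.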
>>
>>
>> I've analyzed some heap dumps and see nothing out of the ordinary.
>> Would love to know what could be causing this.
>>
>>
>> On Fri, Dec 19, 2014 at 7:46 AM, bethesda <swearinge...@mac.com> wrote:
>>
>>> I have a job that runs fine on relatively small input datasets but then
>>> reaches a threshold where I begin to consistently get "Fetch failure" for
>>> the Failure Reason, late in the job, during a saveAsText() operation.
>>>
>>> The first error we are seeing on the "Details for Stage" page is
>>> "ExecutorLostFailure"
>>>
>>> My Shuffle Read is 3.3 GB and that's the only thing that seems high. We
>>> have three servers, and they are configured on this job for 5g of memory;
>>> the job is running in spark-shell.  The first error in the shell is "Lost
>>> executor 2 on (servername): remote Akka client disassociated."
>>>
>>> We are still trying to understand how best to diagnose jobs using the web
>>> UI, so it's likely that there's some helpful info here that we just don't
>>> know how to interpret -- is there any kind of "troubleshooting guide"
>>> beyond the Spark Configuration page?  I don't know if I'm providing
>>> enough info here.
>>>
>>> thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
