I'm getting the same error ("ExecutorLostFailure"). My input RDD is 100k
small files (~2MB each). I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...). Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory means it runs longer before failing), but the failure usually comes
after roughly 100 files have been processed.
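
For reference, the job is essentially the following. This is a minimal
sketch of what I'm doing; the S3 paths, the map function, and the key
extraction are placeholders rather than my real code:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.1

object SmallFilesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-job"))

    // Output goes through the classic mapred API so saveAsHadoopDataset applies.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputKeyClass(classOf[Text])
    jobConf.setOutputValueClass(classOf[Text])
    jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    FileOutputFormat.setOutputPath(jobConf, new Path("s3://example-bucket/output"))

    sc.textFile("s3://example-bucket/input/*")            // ~100k files, ~2MB each
      .map(line => line.trim)                             // stand-in for the real "simple map"
      .keyBy(line => line.take(8))                        // stand-in for the real key extraction
      .map { case (k, v) => (new Text(k), new Text(v)) }  // Writables for the output format
      .saveAsHadoopDataset(jobConf)

    sc.stop()
  }
}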

I'm running Spark 1.1.0 on AWS EMR with YARN. It appears that YARN is
killing the executor because it thinks the container is exceeding its memory
limit. However, I can't reproduce any OOM issues when running locally, no
matter the size of the data set.
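
For completeness, the submit command looks roughly like this. The executor
memory and cores are taken from the container dump below (-Xmx6144m and the
trailing "4" on the executor command line); the master mode, class name,
jar, and executor count are placeholders, and the memoryOverhead setting is
just an example of raising it from what I believe is the 384 MB default in
1.1:

# 6g matches -Xmx6144m in the container dump; 4 cores matches the trailing
# "4" on the executor command line; class/jar/num-executors are placeholders.
spark-submit \
  --master yarn-cluster \
  --class com.example.SmallFilesJob \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  small-files-job.jar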

According to the NodeManager logs, YARN sees the container's physical memory
usage climbing steadily until it hits the limit:

2014-12-18 22:06:43,505 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.1 GB of 6.5 GB physical memory
used; 13.8 GB of 32.5 GB virtual memory used
2014-12-18 22:06:46,516 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
used; 13.9 GB of 32.5 GB virtual memory used
2014-12-18 22:06:49,524 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
used; 14.0 GB of 32.5 GB virtual memory used
2014-12-18 22:06:52,531 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.4 GB of 6.5 GB physical memory
used; 14.1 GB of 32.5 GB virtual memory used
2014-12-18 22:06:55,538 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
used; 14.2 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
used; 14.3 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Process tree for container:
container_1418928607193_0011_01_000002 has processes older than 1 iteration
running over the configured limit. Limit=6979321856, current usage =
6995812352
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Container
[pid=24273,containerID=container_1418928607193_0011_01_000002] is running
beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical
memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1418928607193_0011_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
-Xms6144m -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1>
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stdout
2>
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stderr
|- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m
-Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4


I've analyzed some heap dumps and see nothing out of the ordinary. I would
love to know what could be causing this.
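
On the numbers: the container limit in the log, Limit=6979321856, is exactly
6656 MB (6.5 GiB), which is the 6144 MB heap (-Xmx6144m) plus roughly 500 MB
of YARN-side overhead (spark.yarn.executor.memoryOverhead defaults to 384 MB
in 1.1, I believe, and YARN rounds the request up to its allocation
increment). So the heap is allowed to fill nearly the whole container, and
whatever the JVM uses off-heap (direct buffers, thread stacks, permgen,
native I/O buffers) has only a few hundred MB of headroom before the process
RSS crosses the limit. That would also be consistent with the heap dumps
looking clean and with no OOM showing up locally, since the kill comes from
YARN's process monitor rather than from the JVM.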


On Fri, Dec 19, 2014 at 7:46 AM, bethesda <swearinge...@mac.com> wrote:

> I have a job that runs fine on relatively small input datasets but then
> reaches a threshold where I begin to consistently get "Fetch failure" as
> the Failure Reason, late in the job, during a saveAsTextFile() operation.
>
> The first error we are seeing on the "Details for Stage" page is
> "ExecutorLostFailure"
>
> My Shuffle Read is 3.3 GB, and that's the only thing that seems high. We
> have three servers, and they are configured with 5g of memory for this
> job; the job is running in spark-shell. The first error in the shell is
> "Lost executor 2 on (servername): remote Akka client disassociated."
>
> We are still trying to understand how best to diagnose jobs using the web
> UI, so it's likely that there's some helpful info here that we just don't
> know how to interpret. Is there any kind of "troubleshooting guide" beyond
> the Spark Configuration page? I don't know if I'm providing enough info
> here.
>
> thanks.
>
