Yes, same problem.

On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Do you hit the same errors? Is it now saying your containers exceed
> ~10 GB?
>
> On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jon.ch...@gmail.com> wrote:
>>
>> I'm actually already running 1.1.1.
>>
>> I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
>> luck. Still getting "ExecutorLostFailure (executor lost)".
>>
>> On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny <rafal.kwa...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>> Just upgrade to 1.1.1 - it was fixed some time ago.
>>>
>>> /Raf
>>>
>>> sandy.r...@cloudera.com wrote:
>>>
>>> Hi Jon,
>>>
>>> The fix for this is to increase spark.yarn.executor.memoryOverhead to
>>> something greater than its default of 384.
>>>
>>> This will increase the gap between the executor's heap size and what it
>>> requests from YARN. The gap is required because JVMs take up some memory
>>> beyond their heap size.
>>>
>>> -Sandy
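For anyone else hitting this, here is a minimal sketch of applying Sandy's
suggestion programmatically rather than on the spark-submit command line.
The 1024 value is illustrative, not a recommendation; the right number
depends on how much off-heap memory the workload actually uses.

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing --conf spark.yarn.executor.memoryOverhead=1024
    // to spark-submit. The value is in megabytes and is added on top of
    // the executor heap when Spark requests containers from YARN.
    val conf = new SparkConf()
      .setAppName("memory-overhead-example")  // hypothetical app name
      .set("spark.yarn.executor.memoryOverhead", "1024")
    val sc = new SparkContext(conf)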
>>> On Dec 19, 2014, at 9:04 AM, Jon Chase <jon.ch...@gmail.com> wrote:
>>>
>>> I'm getting the same error ("ExecutorLostFailure") - the input RDD is
>>> 100k small files (~2MB each). I do a simple map, then keyBy(), and then
>>> rdd.saveAsHadoopDataset(...). Depending on the memory settings given to
>>> spark-submit, the time before the first ExecutorLostFailure varies (more
>>> memory == longer until failure) - but it usually happens after about 100
>>> files have been processed.
>>>
>>> I'm running Spark 1.1.0 on AWS EMR with YARN. It appears that YARN is
>>> killing the executor because it thinks the executor is exceeding its
>>> memory limit. However, I can't repro any OOM issues when running locally,
>>> no matter the size of the data set.
>>>
>>> It seems like YARN thinks the process's memory usage keeps increasing,
>>> according to the YARN logs:
>>>
>>> 2014-12-18 22:06:43,505 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.1 GB of 6.5 GB physical memory used; 13.8 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:46,516 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory used; 13.9 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:49,524 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory used; 14.0 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:52,531 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.4 GB of 6.5 GB physical memory used; 14.1 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:55,538 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory used; 14.2 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:58,549 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory used; 14.3 GB of 32.5 GB virtual memory used
>>> 2014-12-18 22:06:58,549 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1418928607193_0011_01_000002 has processes older than 1 iteration running over the configured limit. Limit=6979321856, current usage = 6995812352
>>> 2014-12-18 22:06:58,549 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=24273,containerID=container_1418928607193_0011_01_000002] is running beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
>>> Dump of the process-tree for container_1418928607193_0011_01_000002 :
>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>> |- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stdout 2> /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stderr
>>> |- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4
>>>
>>> I've analyzed some heap dumps and see nothing out of the ordinary. Would
>>> love to know what could be causing this.
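The 6.5 GB limit in those logs is consistent with how the container is
sized: executor heap plus spark.yarn.executor.memoryOverhead, rounded up
to YARN's allocation increment. A sketch of the arithmetic - note that the
512 MB increment is an assumption about this cluster's
yarn.scheduler.minimum-allocation-mb, not something stated in the thread:

    // Rough model of where the "6.5 GB physical memory" limit comes from.
    val heapMb      = 6144                 // matches -Xmx6144m in the process tree
    val overheadMb  = 384                  // spark.yarn.executor.memoryOverhead default
    val minAllocMb  = 512                  // assumed YARN rounding increment
    val requestedMb = heapMb + overheadMb  // 6528 MB
    val containerMb = ((requestedMb + minAllocMb - 1) / minAllocMb) * minAllocMb
    // containerMb == 6656 MB == 6.5 GB. Off-heap usage (thread stacks, direct
    // buffers, permgen, JIT code cache) pushes the process past the limit,
    // so YARN kills the container even though the heap itself never OOMs.

This would also explain why adding heap only delays the failure: the heap
grows, but the fixed 384 MB cushion for off-heap memory stays the same.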
>>> On Fri, Dec 19, 2014 at 7:46 AM, bethesda <swearinge...@mac.com> wrote:
>>>
>>>> I have a job that runs fine on relatively small input datasets but then
>>>> reaches a threshold where I begin to consistently get "Fetch failure"
>>>> for the Failure Reason, late in the job, during a saveAsTextFile()
>>>> operation.
>>>>
>>>> The first error we see on the "Details for Stage" page is
>>>> "ExecutorLostFailure".
>>>>
>>>> My Shuffle Read is 3.3 GB, and that's the only thing that seems high. We
>>>> have three servers, they are configured with 5g memory for this job, and
>>>> the job is running in spark-shell. The first error in the shell is "Lost
>>>> executor 2 on (servername): remote Akka client disassociated".
>>>>
>>>> We are still trying to understand how best to diagnose jobs using the
>>>> web UI, so it's likely that there's some helpful info here that we just
>>>> don't know how to interpret - is there any kind of troubleshooting guide
>>>> beyond the Spark Configuration page? I don't know if I'm providing
>>>> enough info here.
>>>>
>>>> thanks.
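The symptoms here ("Lost executor ... remote Akka client disassociated",
then fetch failures late in the job) fit the same pattern as above: when
YARN kills an executor, the shuffle blocks it was serving vanish, and the
tasks that try to read them report fetch failures. If that's what's
happening, raising the memory overhead is worth a try here too. A common
rule of thumb - a heuristic, not official Spark guidance - is to reserve
roughly 7-10% of the executor heap, with a floor of 384 MB:

    // Heuristic overhead sizing; an assumption/rule of thumb, not a Spark API.
    def suggestedOverheadMb(executorMemoryMb: Int): Int =
      math.max(384, (executorMemoryMb * 0.10).toInt)

    suggestedOverheadMb(5120)  // 512 MB for the 5g executors described here
    suggestedOverheadMb(6144)  // 614 MB for the 6g executors earlier in the thread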