Re: Fetch Failure
I've eliminated the fetch failures with these parameters passed to spark-submit, running with 1.2.0 (I don't know which one was the right one for the problem):

  --conf spark.shuffle.compress=false \
  --conf spark.file.transferTo=false \
  --conf spark.shuffle.manager=hash \
  --conf spark.akka.frameSize=50 \
  --conf spark.core.connection.ack.wait.timeout=600

...but I too am unable to finish a job. Now I'm facing OOMs. Still trying, but at least the fetch failures are gone.

Bye

On 23/12/2014 21:10, Chen Song wrote:
> I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) but I am still seeing FetchFailedException.
>
> On Mon, Dec 22, 2014 at 8:27 AM, steghe <stefano.ghe...@icteam.it> wrote:
>> Which version of Spark are you running? It could be related to https://issues.apache.org/jira/browse/SPARK-3633, fixed in 1.1.1 and 1.2.0.

--
Stefano Ghezzi
ICTeam S.p.A - Project Manager, PMP
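For reference, a sketch of what the full spark-submit invocation with the settings above might look like (the --master, --class, and jar values are placeholders, not taken from this thread):

  spark-submit \
    --master yarn-cluster \
    --class com.example.MyJob \
    --conf spark.shuffle.compress=false \
    --conf spark.file.transferTo=false \
    --conf spark.shuffle.manager=hash \
    --conf spark.akka.frameSize=50 \
    --conf spark.core.connection.ack.wait.timeout=600 \
    my-job.jar

These flags disable shuffle compression and transferTo-based file copies, revert to the hash-based shuffle manager (the pre-1.2 default), raise the Akka frame size, and lengthen the connection ack timeout, so they are best treated as a diagnostic workaround rather than a fix.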
Re: Fetch Failure
Which version of Spark are you running? It could be related to https://issues.apache.org/jira/browse/SPARK-3633, which was fixed in 1.1.1 and 1.2.0.
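If you're not sure which version a cluster is actually running, a quick way to check (a general suggestion, not from the thread):

  # spark-shell prints the version in its startup banner, and sc.version
  # reports it from inside a running application.
  spark-shell

  # Recent 1.x releases of spark-submit also accept --version, which prints
  # the version and exits.
  spark-submit --version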
Re: Fetch Failure
I'm getting the same error (ExecutorLostFailure). The input RDD is 100k small files (~2 MB each). I do a simple map, then keyBy(), and then rdd.saveAsHadoopDataset(...). Depending on the memory settings given to spark-submit, the time before the first ExecutorLostFailure varies (more memory == longer until failure), but it usually happens after about 100 files have been processed.

I'm running Spark 1.1.0 on AWS EMR with YARN. It appears that YARN is killing the executor because it thinks it's exceeding memory. However, I can't repro any OOM issues when running locally, no matter the size of the data set. It seems like YARN thinks the heap size is increasing, according to the YARN logs:

2014-12-18 22:06:43,505 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.1 GB of 6.5 GB physical memory used; 13.8 GB of 32.5 GB virtual memory used
2014-12-18 22:06:46,516 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.2 GB of 6.5 GB physical memory used; 13.9 GB of 32.5 GB virtual memory used
2014-12-18 22:06:49,524 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.2 GB of 6.5 GB physical memory used; 14.0 GB of 32.5 GB virtual memory used
2014-12-18 22:06:52,531 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.4 GB of 6.5 GB physical memory used; 14.1 GB of 32.5 GB virtual memory used
2014-12-18 22:06:55,538 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.5 GB of 6.5 GB physical memory used; 14.2 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 24273 for container-id container_1418928607193_0011_01_02: 6.5 GB of 6.5 GB physical memory used; 14.3 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Process tree for container: container_1418928607193_0011_01_02 has processes older than 1 iteration running over the configured limit. Limit=6979321856, current usage = 6995812352
2014-12-18 22:06:58,549 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=24273,containerID=container_1418928607193_0011_01_02] is running beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.

Dump of the process-tree for container_1418928607193_0011_01_02 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stdout 2 /mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stderr
|- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m -Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler 1
Re: Fetch Failure
Hi Jon,

The fix for this is to increase spark.yarn.executor.memoryOverhead to something greater than its default of 384. This will increase the gap between the executor's heap size and what it requests from YARN. It's required because JVMs take up some memory beyond their heap size.

-Sandy

On Dec 19, 2014, at 9:04 AM, Jon Chase <jon.ch...@gmail.com> wrote:
> I'm getting the same error (ExecutorLostFailure) - input RDD is 100k small files (~2MB each). [...]
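To make the arithmetic in Sandy's suggestion concrete (a sketch based on the numbers in the logs above; the rounding step and the example overhead value of 1024 are assumptions, not stated in the thread):

  # Executor heap from the process dump: -Xmx6144m
  # Container request = heap + spark.yarn.executor.memoryOverhead
  #                   = 6144 MB + 384 MB (default) = 6528 MB,
  # which YARN rounds up to its allocation increment -- here apparently
  # 6656 MB = 6.5 GB, the "Limit=6979321856" in the kill message.
  # Raising the overhead widens the gap between -Xmx and the container limit:
  spark-submit \
    --master yarn-cluster \
    --executor-memory 6g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    my-job.jar   # placeholder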
Re: Fetch Failure
I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost).

On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny <rafal.kwa...@gmail.com> wrote:
> Hi,
> Just upgrade to 1.1.1 - it was fixed some time ago
> /Raf
>
> sandy.r...@cloudera.com wrote:
>> Hi Jon, The fix for this is to increase spark.yarn.executor.memoryOverhead to something greater than its default of 384. [...]
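One way to check whether the memoryOverhead=4096 setting is actually being applied (a general suggestion, not from the thread): if it is in effect, the limit reported in the NodeManager's kill message should rise from ~6.5 GB to roughly 6144 MB + 4096 MB = 10240 MB, i.e. about 10 GB.

  # Run on the node that lost the executor; the NodeManager log location is
  # cluster-specific, so the path below is only a placeholder.
  grep "beyond physical memory limits" /var/log/hadoop-yarn/yarn-*-nodemanager-*.log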
Re: Fetch Failure
Do you hit the same errors? Is it now saying your containers are exceeding ~10 GB?

On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jon.ch...@gmail.com> wrote:
> I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost). [...]
Re: Fetch Failure
Hmmm, I see this a lot (multiple times per second) in the stdout logs of my application:

2014-12-19T16:12:35.748+: [GC (Allocation Failure) [ParNew: 286663K->12530K(306688K), 0.0074579 secs] 1470813K->1198034K(2063104K), 0.0075189 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]

And finally I see

2014-12-19 16:12:36,116 ERROR [SIGTERM handler] executor.CoarseGrainedExecutorBackend (SignalLogger.scala:handle(57)) - RECEIVED SIGNAL 15: SIGTERM

which I assume is coming from YARN, after which the log contains this and then ends:

Heap
 par new generation   total 306688K, used 23468K [0x8000, 0x94cc, 0x94cc)
  eden space 272640K,   4% used [0x8000, 0x80abff10, 0x90a4)
  from space 34048K,  36% used [0x92b8, 0x937ab488, 0x94cc)
  to   space 34048K,   0% used [0x90a4, 0x90a4, 0x92b8)
 concurrent mark-sweep generation total 1756416K, used 1186756K [0x94cc, 0x0001, 0x0001)
 Metaspace       used 52016K, capacity 52683K, committed 52848K, reserved 1095680K
  class space    used 7149K, capacity 7311K, committed 7392K, reserved 1048576K

On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jon.ch...@gmail.com> wrote:
> I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost). [...]
Re: Fetch Failure
Yes, same problem.

On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Do you hit the same errors? Is it now saying your containers are exceeding ~10 GB?
>
> On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jon.ch...@gmail.com> wrote:
>> I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost). [...]