Re: Fetch Failure

2014-12-23 Thread Stefano Ghezzi
I've eliminated the fetch failures with these parameters (I don't know which one
was the right fix for the problem), passed to spark-submit running 1.2.0:

--conf spark.shuffle.compress=false \
--conf spark.file.transferTo=false \
--conf spark.shuffle.manager=hash \
--conf spark.akka.frameSize=50 \
--conf spark.core.connection.ack.wait.timeout=600
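
For reference, the same settings can also be set programmatically through SparkConf
(a rough Scala sketch, untested; I still don't know which of them actually matters):

import org.apache.spark.{SparkConf, SparkContext}

// Same settings as the spark-submit flags above, applied in code.
val conf = new SparkConf()
  .set("spark.shuffle.compress", "false")
  .set("spark.file.transferTo", "false")
  .set("spark.shuffle.manager", "hash")
  .set("spark.akka.frameSize", "50")
  .set("spark.core.connection.ack.wait.timeout", "600")

val sc = new SparkContext(conf)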

...but I'm still unable to finish a job: now I'm facing OOMs. Still trying, but at
least the fetch failures are gone.

bye

On 23/12/2014 21:10, Chen Song wrote:
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) 
but I am still seeing FetchFailedException.


On Mon, Dec 22, 2014 at 8:27 AM, steghe stefano.ghe...@icteam.it wrote:


Which version of spark are you running?

It could be related to this
https://issues.apache.org/jira/browse/SPARK-3633

fixed in 1.1.1 and 1.2.0









--
Chen Song




--

Stefano Ghezzi    ICTeam S.p.A
Project Manager - PMP
tel     035 4232129     fax 035 4522034
email   stefano.ghe...@icteam.it    url http://www.icteam.com
mobile  335 7308587




Re: Fetch Failure

2014-12-22 Thread steghe
Which version of Spark are you running?

It could be related to this:
https://issues.apache.org/jira/browse/SPARK-3633

It was fixed in 1.1.1 and 1.2.0.








Re: Fetch Failure

2014-12-19 Thread Jon Chase
I'm getting the same error (ExecutorLostFailure) - the input RDD is 100k
small files (~2 MB each).  I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...).  Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory == longer until failure), but this usually happens after about 100
files have been processed.
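
Roughly, the job has this shape (a simplified Scala sketch; the input path, key
derivation, and output format below are placeholders, not my actual code):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.1

object SmallFilesJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-job"))

    // ~100k small (~2 MB) input files; the glob is a placeholder.
    val keyed = sc.textFile("s3://my-bucket/input/*")
      .map(_.trim)                                  // the "simple map"
      .keyBy(line => line.take(8))                  // keyBy() on some derived key
      .map { case (k, v) => (new Text(k), new Text(v)) }

    // saveAsHadoopDataset() writes through a Hadoop JobConf.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputKeyClass(classOf[Text])
    jobConf.setOutputValueClass(classOf[Text])
    jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    FileOutputFormat.setOutputPath(jobConf, new Path("s3://my-bucket/output"))

    keyed.saveAsHadoopDataset(jobConf)
    sc.stop()
  }
}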

I'm running Spark 1.1.0 on AWS EMR with YARN.  It appears that YARN is
killing the executor because it thinks it's exceeding memory.  However, I can't
repro any OOM issues when running locally, no matter the size of the data
set.

It seems like YARN thinks the heap size is increasing, according to the YARN
logs:

2014-12-18 22:06:43,505 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.1 GB of 6.5 GB physical memory
used; 13.8 GB of 32.5 GB virtual memory used
2014-12-18 22:06:46,516 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.2 GB of 6.5 GB physical memory
used; 13.9 GB of 32.5 GB virtual memory used
2014-12-18 22:06:49,524 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.2 GB of 6.5 GB physical memory
used; 14.0 GB of 32.5 GB virtual memory used
2014-12-18 22:06:52,531 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.4 GB of 6.5 GB physical memory
used; 14.1 GB of 32.5 GB virtual memory used
2014-12-18 22:06:55,538 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.5 GB of 6.5 GB physical memory
used; 14.2 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_02: 6.5 GB of 6.5 GB physical memory
used; 14.3 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Process tree for container:
container_1418928607193_0011_01_02 has processes older than 1 iteration
running over the configured limit. Limit=6979321856, current usage =
6995812352
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Container
[pid=24273,containerID=container_1418928607193_0011_01_02] is running
beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical
memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1418928607193_0011_01_02 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
-Xms6144m -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stdout
2
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_02/stderr
|- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m
-Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_02/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkdri...@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 

Re: Fetch Failure

2014-12-19 Thread sandy . ryza
Hi Jon,

The fix for this is to increase spark.yarn.executor.memoryOverhead to something
greater than its default of 384.

This will increase the gap between the executor's heap size and what it requests
from YARN. It's required because JVMs take up some memory beyond their heap
size.
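
For example, you can pass --conf spark.yarn.executor.memoryOverhead=1024 to
spark-submit, or set it in code before creating the SparkContext. A Scala sketch
(1024 is an arbitrary example value, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

// The YARN container size requested per executor is roughly
// spark.executor.memory + spark.yarn.executor.memoryOverhead (in MB),
// so 6 GB of heap plus 1024 MB of overhead asks for about a 7 GB container.
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")
  .set("spark.yarn.executor.memoryOverhead", "1024")  // MB; default is 384

val sc = new SparkContext(conf)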

-Sandy

 On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote:
 
 [Jon's original message and the YARN container log, quoted in full, snipped; see his 2014-12-19 message earlier in this thread.]

Re: Fetch Failure

2014-12-19 Thread Jon Chase
I'm actually already running 1.1.1.

I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
luck.  Still getting ExecutorLostFailure (executor lost).



On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny rafal.kwa...@gmail.com
wrote:

 Hi,
 Just upgrade to 1.1.1 - it was fixed some time ago

 /Raf


 sandy.r...@cloudera.com wrote:

 Hi Jon,

 The fix for this is to increase spark.yarn.executor.memoryOverhead to
 something greater than it's default of 384.

 This will increase the gap between the executors heap size and what it
 requests from yarn. It's required because jvms take up some memory beyond
 their heap size.

 -Sandy

 On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote:

 [Jon's original message and the YARN container log, quoted in full, snipped; see his 2014-12-19 message earlier in this thread.]

Re: Fetch Failure

2014-12-19 Thread Sandy Ryza
Do you hit the same errors?  Is it now saying your containers exceed
~10 GB?

On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:

 I'm actually already running 1.1.1.

 I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
 luck.  Still getting ExecutorLostFailure (executor lost).



 On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny rafal.kwa...@gmail.com
 wrote:

 Hi,
 Just upgrade to 1.1.1 - it was fixed some time ago

 /Raf


 sandy.r...@cloudera.com wrote:

 Hi Jon,

 The fix for this is to increase spark.yarn.executor.memoryOverhead to
 something greater than it's default of 384.

 This will increase the gap between the executors heap size and what it
 requests from yarn. It's required because jvms take up some memory beyond
 their heap size.

 -Sandy

 On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote:

 [Jon's original message and the YARN container log, quoted in full, snipped; see his 2014-12-19 message earlier in this thread.]

Re: Fetch Failure

2014-12-19 Thread Jon Chase
Hmmm, I see this a lot (multiple times per second) in the stdout logs of my
application:

2014-12-19T16:12:35.748+: [GC (Allocation Failure) [ParNew:
286663K->12530K(306688K), 0.0074579 secs] 1470813K->1198034K(2063104K),
0.0075189 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]


And finally I see

2014-12-19 16:12:36,116 ERROR [SIGTERM handler]
executor.CoarseGrainedExecutorBackend (SignalLogger.scala:handle(57)) -
RECEIVED SIGNAL 15: SIGTERM

which I assume is coming from YARN, after which the log contains this and
then ends:

Heap
 par new generation   total 306688K, used 23468K [0x8000,
0x94cc, 0x94cc)
  eden space 272640K,   4% used [0x8000, 0x80abff10,
0x90a4)
  from space 34048K,  36% used [0x92b8, 0x937ab488,
0x94cc)
  to   space 34048K,   0% used [0x90a4, 0x90a4,
0x92b8)
 concurrent mark-sweep generation total 1756416K, used 1186756K
[0x94cc, 0x0001, 0x0001)
 Metaspace   used 52016K, capacity 52683K, committed 52848K, reserved
1095680K
  class spaceused 7149K, capacity 7311K, committed 7392K, reserved
1048576K







On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:

 I'm actually already running 1.1.1.

 I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
 luck.  Still getting ExecutorLostFailure (executor lost).



 On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny rafal.kwa...@gmail.com
 wrote:

 Hi,
 Just upgrade to 1.1.1 - it was fixed some time ago

 /Raf


 sandy.r...@cloudera.com wrote:

 Hi Jon,

 The fix for this is to increase spark.yarn.executor.memoryOverhead to
 something greater than it's default of 384.

 This will increase the gap between the executors heap size and what it
 requests from yarn. It's required because jvms take up some memory beyond
 their heap size.

 -Sandy

 On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote:

 [Jon's original message and the YARN container log, quoted in full, snipped; see his 2014-12-19 message earlier in this thread.]

Re: Fetch Failure

2014-12-19 Thread Jon Chase
Yes, same problem.

On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 Do you hit the same errors?  Is it now saying your containers exceed
 ~10 GB?

 On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:

 I'm actually already running 1.1.1.

 I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
 luck.  Still getting ExecutorLostFailure (executor lost).



 On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny rafal.kwa...@gmail.com
 wrote:

 Hi,
 Just upgrade to 1.1.1 - it was fixed some time ago

 /Raf


 sandy.r...@cloudera.com wrote:

 Hi Jon,

 The fix for this is to increase spark.yarn.executor.memoryOverhead to
 something greater than it's default of 384.

 This will increase the gap between the executors heap size and what it
 requests from yarn. It's required because jvms take up some memory beyond
 their heap size.

 -Sandy

 On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote:

 [Jon's original message and the YARN container log, quoted in full, snipped; see his 2014-12-19 message earlier in this thread.]