I think you need to increase the executor memory, either through the command-line argument "--executor-memory" or through the configuration "spark.executor.memory".
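For example, something along these lines (just a sketch: the 40g figure is only illustrative, and the class/jar names are placeholders for your own application):

    # "--executor-memory 40g" is equivalent to "--conf spark.executor.memory=40g";
    # com.example.YourApp and your-app.jar are hypothetical placeholders
    spark-submit --master yarn-cluster \
      --executor-memory 40g \
      --class com.example.YourApp \
      your-app.jar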
You may also need to raise yarn.scheduler.maximum-allocation-mb on the YARN side.

Thanks
Saisai

On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> I noticed that some executors have issues with scratch space.
> I see the following in the YARN app container stderr around the time when
> YARN killed the executor because it used too much memory.
>
> -- App container stderr --
> 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
> rdd_6_346 in memory! (computed 3.0 GB so far)
> 15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
> + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
> limit = 25.2 GB.
> 15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_346
> to disk instead.
> 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
> rdd_6_49 in memory! (computed 3.1 GB so far)
> 15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
> + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
> limit = 25.2 GB.
> 15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_49
> to disk instead.
>
> -- Yarn Nodemanager log --
> 2015-09-21 21:44:05,716 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> (Container Monitor): Container
> [pid=5114,containerID=container_1442869100946_0001_01_000056]
> is running beyond physical memory limits. Current usage: 52.2 GB of
> 52 GB physical memory used; 53.0 GB of 260 GB virtual memory used. Killing
> container.
> Dump of the process-tree for container_1442869100946_0001_01_000056 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 5117 5114 5114 5114 (java) 1322810 27563 56916316160 13682772
> /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError=kill %p
> -Xms47924m -Xmx47924m -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
> -XX:+CMSClassUnloadingEnabled
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp
> -Dspark.akka.failure-detector.threshold=3000.0
> -Dspark.akka.heartbeat.interval=10000s -Dspark.akka.threads=4
> -Dspark.history.ui.port=18080 -Dspark.akka.heartbeat.pauses=60000s
> -Dspark.akka.timeout=1000s -Dspark.akka.frameSize=50
> -Dspark.driver.port=52690
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
> --executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
> application_1442869100946_0001 --user-class-path
> file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar
> |- 5114 5112 5114 5114 (bash) 0 0 9658368 291 /bin/bash -c
> /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError='kill %p'
> -Xms47924m -Xmx47924m '-verbose:gc' '-XX:+PrintGCDetails'
> '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC'
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70'
> '-XX:+CMSClassUnloadingEnabled'
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/tmp
> '-Dspark.akka.failure-detector.threshold=3000.0'
> '-Dspark.akka.heartbeat.interval=10000s' '-Dspark.akka.threads=4'
> '-Dspark.history.ui.port=18080' '-Dspark.akka.heartbeat.pauses=60000s'
> '-Dspark.akka.timeout=1000s' '-Dspark.akka.frameSize=50'
> '-Dspark.driver.port=52690'
> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
> --executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
> application_1442869100946_0001 --user-class-path
> file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_000056/__app__.jar
> 1> /var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stdout
> 2> /var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_000056/stderr
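By the way, the numbers in that dump add up. This is my own back-of-the-envelope arithmetic, assuming the memoryOverhead value you quote further down the thread:

      47924 MB  executor heap (spark.executor.memory, the -Xmx above)
    +  5324 MB  spark.yarn.executor.memoryOverhead
    ----------
      53248 MB  = 52 GB YARN container size

So the container limit is exactly the "52 GB physical memory" in the NodeManager log, and once the process tree's resident memory reached 52.2 GB the container was killed.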
>
> Is it possible to find out what the scratch space is used for?
>
> What Spark setting should I try to adjust to solve the issue?
>
> On Thu, Sep 10, 2015 at 2:52 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> YARN will never kill processes for being unresponsive.
>>
>> It may kill processes for occupying more memory than it allows. To get
>> around this, you can either bump spark.yarn.executor.memoryOverhead or turn
>> off the memory checks entirely with yarn.nodemanager.pmem-check-enabled.
>>
>> -Sandy
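To make Sandy's two suggestions above concrete, a sketch (the 8192 value is an arbitrary example, and the class/jar names are placeholders again):

    # Option 1: give each executor a bigger off-heap cushion
    spark-submit --master yarn-cluster \
      --conf spark.yarn.executor.memoryOverhead=8192 \
      --class com.example.YourApp \
      your-app.jar

    # Option 2: stop YARN from enforcing the physical memory limit.
    # Set the following in yarn-site.xml on every NodeManager, then
    # restart the NodeManagers:
    #   yarn.nodemanager.pmem-check-enabled = false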
>>
>> On Tue, Sep 8, 2015 at 10:48 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> The problem we have now is skewed data (2360 tasks are done in 5 min,
>>> 3 tasks in 40 min, and 1 task in 2 hours).
>>>
>>> Some people on the team worry that the executor which runs the longest
>>> task could be killed by YARN (because the executor might become
>>> unresponsive due to GC, or might occupy more memory than YARN allows).
>>>
>>> On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>>> Those settings seem reasonable to me.
>>>>
>>>> Are you observing performance that's worse than you would expect?
>>>>
>>>> -Sandy
>>>>
>>>> On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>>>
>>>>> Hi Sandy
>>>>>
>>>>> Thank you for your reply.
>>>>> Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
>>>>> with the EMR setting for Spark "maximizeResourceAllocation": "true".
>>>>>
>>>>> It is automatically converted to the Spark settings
>>>>>   spark.executor.memory              47924M
>>>>>   spark.yarn.executor.memoryOverhead 5324
>>>>>
>>>>> We also set spark.default.parallelism = slave_count * 16.
>>>>>
>>>>> Does that look good to you? (We run a single heavy job on the cluster.)
>>>>>
>>>>> Alex
>>>>>
>>>>> On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>>
>>>>>> Hi Alex,
>>>>>>
>>>>>> If they're both configured correctly, there's no reason that Spark
>>>>>> Standalone should provide performance or memory improvements over
>>>>>> Spark on YARN.
>>>>>>
>>>>>> -Sandy
>>>>>>
>>>>>> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Everyone
>>>>>>>
>>>>>>> We are trying the latest AWS emr-4.0.0 and Spark, and my question is
>>>>>>> about YARN vs Standalone mode.
>>>>>>> Our use case is:
>>>>>>> - start a 100-150 node cluster every week,
>>>>>>> - run one heavy Spark job (5-6 hours),
>>>>>>> - save the data to S3,
>>>>>>> - stop the cluster.
>>>>>>>
>>>>>>> Officially, AWS emr-4.0.0 comes with Spark on YARN.
>>>>>>> It's probably possible to hack EMR with a bootstrap script that
>>>>>>> stops YARN and starts a master and slaves on each machine (to run
>>>>>>> Spark in standalone mode).
>>>>>>>
>>>>>>> My questions are:
>>>>>>> - Does Spark standalone provide a significant performance / memory
>>>>>>> improvement in comparison to YARN mode?
>>>>>>> - Is it worth hacking the official EMR Spark-on-YARN setup to switch
>>>>>>> Spark to standalone mode?
>>>>>>>
>>>>>>> I already created a comparison table and want you to check whether
>>>>>>> my understanding is correct.
>>>>>>>
>>>>>>> Let's say an r3.2xlarge machine has 52 GB of RAM available for Spark
>>>>>>> executor JVMs.
>>>>>>>
>>>>>>> standalone-to-YARN comparison:
>>>>>>>
>>>>>>>                                                      STANDALONE | YARN
>>>>>>> can executor allocate up to 52 GB RAM                yes        | yes
>>>>>>> will executor be unresponsive after using all
>>>>>>>   52 GB RAM because of GC                            yes        | yes
>>>>>>> additional JVMs on slave besides the Spark executor  worker     | node manager
>>>>>>> are additional JVMs lightweight                      yes        | yes
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Alex
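P.S. For readers landing on this thread later: the rule of thumb Alex mentions above (spark.default.parallelism = slave_count * 16) comes out, on a 150-slave cluster, to 150 * 16 = 2400 partitions, i.e. something like (the number and names are just this example's):

    spark-submit --master yarn-cluster \
      --conf spark.default.parallelism=2400 \
      --class com.example.YourApp \
      your-app.jar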