Hi, I suppose you are using --master yarn-client or yarn-cluster. Can you try boosting spark.yarn.driver.memoryOverhead, overriding it to 0.15 * executor memory rather than the default 0.1? Check out this link: https://spark.apache.org/docs/1.5.2/running-on-yarn.html. Also try adding SPARK_REPL_OPTS="-XX:MaxPermSize=1g" to increase the permanent-generation memory size too; that is one of the possible causes of the OOM: Java heap space error.
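Something like the following would set the overheads before the context starts (a minimal sketch, not tested against your job -- the 16g executor size and yarn-client mode are assumptions based on the figures quoted below in this thread):

    # Hedged sketch: the sizes here are assumptions, not the actual
    # configuration from this thread. memoryOverhead has to be set before
    # the SparkContext (and hence the YARN containers) is created.
    from pyspark import SparkConf, SparkContext

    executor_memory_mb = 16 * 1024                  # assuming 16g executors
    overhead_mb = int(0.15 * executor_memory_mb)    # 0.15 instead of the 0.10 default

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.memory", "16g")
            .set("spark.yarn.executor.memoryOverhead", str(overhead_mb))
            .set("spark.yarn.driver.memoryOverhead", str(overhead_mb)))
    sc = SparkContext(conf=conf)

    # Note: MaxPermSize is a JVM flag, so it goes in SPARK_REPL_OPTS (or
    # spark.driver.extraJavaOptions), not in SparkConf sets like the above.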
> On Feb 3, 2016, at 12:30 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
> Hi Stefan,
>
> Welcome to the OOM - heap space club. I have been struggling with similar errors (OOM and the YARN executor being killed) that fail the job or send it into retry loops. I bet the same job would run perfectly fine with fewer resources as a Hadoop MapReduce program; I have tested it for my program and it does work.
>
> Bottom line from my experience: Spark sucks at memory management when a job is processing a large (not huge) amount of data. It's failing for me with 16gb executors, 10 executors, 6 threads each, and the data it's processing is only 150GB! That's 1 billion rows for me. The same job works perfectly fine with 1 million rows.
>
> Hope that saves you some trouble.
>
> Nirav
>
> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com> wrote:
> I drastically increased the memory:
>
> spark.executor.memory = 50g
> spark.driver.memory = 8g
> spark.driver.maxResultSize = 8g
> spark.yarn.executor.memoryOverhead = 768
>
> I still see executors killed, but this time the memory does not seem to be the issue.
> The error in the Jupyter notebook is:
>
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755
>
> From the NodeManager log corresponding to worker 10.0.0.9:
>
> 2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
> 2016-02-03 17:31:44,918 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
> 2016-02-03 17:31:44,947 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
> 2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for container_1454509557526_0014_01_000093
> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011
>
> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is far below 50g, and then it suddenly gets killed!?!
>
> 2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
> 2016-02-03 17:33:17,613 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
> 2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
> 2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
> 2016-02-03 17:33:17,667 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1454509557526_0014 CONTAINERID=container_1454509557526_0014_01_000093
> 2016-02-03 17:33:17,670 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-02-03 17:33:17,670 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
> 2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
> 2016-02-03 17:33:17,671 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
> 2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
> 2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
> 2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
> 2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]
>
> I'll appreciate any suggestions.
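Two notes on the log above: exit code 143 is 128 + 15, i.e. the container received a SIGTERM from YARN rather than the JVM crashing on its own, and the Py4J traceback goes through PythonRDD.collectAndServe, which is what backs RDD.collect() in PySpark. If the job collects a large result to the driver, avoiding that collect is usually cheaper than adding memory. A rough sketch of the pattern (the wasb:// paths and the RDD itself are placeholders, not the actual ML pipeline from this thread):

    # Hedged sketch -- the paths are placeholders (wasb:// is guessed from
    # the azure-file-system messages in the executor log), not real ones.
    from pyspark import SparkContext

    sc = SparkContext()              # in a Jupyter notebook this already exists as `sc`

    big_rdd = sc.textFile("wasb:///data/input")

    # Instead of big_rdd.collect(), which serializes every partition back
    # to the driver through collectAndServe:
    big_rdd.saveAsTextFile("wasb:///data/output")   # keep the result distributed

    preview = big_rdd.take(100)      # pull only a small sample for inspection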
>
> Thanks,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>
> Date: Tue, 2 Feb 2016 15:40:10 -0800
> Subject: Re: Spark 1.5.2 memory error
> From: openkbi...@gmail.com
> To: spanayo...@msn.com
> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>
> Look at part #3 in the blog post below:
> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>
> You may want to increase the executor memory, not just spark.yarn.executor.memoryOverhead.
>
> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> For the memoryOverhead I have the default of 10% of 16g, and the Spark version is 1.5.2.
>
> Stefan Panayotov, PhD
> Sent from Outlook Mail for Windows 10 phone
>
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Tuesday, February 2, 2016 4:52 PM
> To: Jakob Odersky <ja...@odersky.com>
> Cc: Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
> Subject: Re: Spark 1.5.2 memory error
>
> What value do you use for spark.yarn.executor.memoryOverhead?
>
> Please see https://spark.apache.org/docs/latest/running-on-yarn.html for a description of the parameter.
>
> Which Spark release are you using?
>
> Cheers
>
> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:
>
> Can you share some code that produces the error? It is probably not due to Spark but rather the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source of OOM errors.
>
> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> > Hi Guys,
> >
> > I need help with Spark memory errors when executing ML pipelines.
> > The error that I see is:
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2271)
> >         at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager
> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2271)
> >         at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
> >
> > And ...
> >
> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
> >
> > I'll really appreciate any help here.
> >
> > Thank you,
> >
> > Stefan Panayotov, PhD
> > Home: 610-355-0919
> > Cell: 610-517-5586
> > email: spanayo...@msn.com
> > spanayo...@outlook.com
> > spanayo...@comcast.net
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
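One last observation on the numbers in that Feb 2 log: YARN enforces executor memory plus overhead as the container limit, so 384 MB of overhead leaves almost no room for off-heap allocations. A hedged back-of-the-envelope (the 0.10 default and 384 MB floor are from the running-on-yarn docs linked above; the rest is arithmetic on the logged figures):

    # Back-of-the-envelope container sizing, matching the YarnAllocator lines.
    executor_memory_mb = 16 * 1024                         # 16g executors
    overhead_mb = 384                                      # overhead actually used
    print(executor_memory_mb + overhead_mb)                # 16768 MB, as logged

    # "16.8 GB of 16.5 GB physical memory used" suggests off-heap usage
    # blew past the 384 MB allowance. At 15% the headroom is ~2.4 GB:
    boosted_overhead_mb = int(0.15 * executor_memory_mb)   # 2457 MB
    print(executor_memory_mb + boosted_overhead_mb)        # 18841 MB container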