Hi, I suppose you are using --master yarn-client or yarn-cluster. Can you try boosting spark.yarn.driver.memoryOverhead, overriding it to 0.15 * executor memory rather than the default 0.1? Check out this link: https://spark.apache.org/docs/1.5.2/running-on-yarn.html. Also try adding SPARK_REPL_OPTS="-XX:MaxPermSize=1g" to increase the permanent-generation memory size too; that is one of the possible causes of the OOM: Java heap space error.
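Something like the following would set the overheads before the context starts (a minimal sketch, not tested against your job -- the 16g executor size and yarn-client mode are assumptions based on the figures quoted below in this thread):

    # Hedged sketch: the sizes here are assumptions, not the actual
    # configuration from this thread. memoryOverhead has to be set before
    # the SparkContext (and hence the YARN containers) is created.
    from pyspark import SparkConf, SparkContext

    executor_memory_mb = 16 * 1024                  # assuming 16g executors
    overhead_mb = int(0.15 * executor_memory_mb)    # 0.15 instead of the 0.10 default

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.memory", "16g")
            .set("spark.yarn.executor.memoryOverhead", str(overhead_mb))
            .set("spark.yarn.driver.memoryOverhead", str(overhead_mb)))
    sc = SparkContext(conf=conf)

    # Note: MaxPermSize is a JVM flag, so it goes in SPARK_REPL_OPTS (or
    # spark.driver.extraJavaOptions), not in SparkConf sets like the above.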
> On Feb 3, 2016, at 12:30 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
> Hi Stefan,
>
> Welcome to the OOM - heap space club. I have been struggling with similar errors (OOM and the YARN executor being killed) that fail the job or send it into retry loops. I bet the same job would run perfectly fine with fewer resources as a Hadoop MapReduce program; I have tested it for my program and it does work.
>
> Bottom line from my experience: Spark sucks at memory management when a job is processing a large (not huge) amount of data. It's failing for me with 16gb executors, 10 executors, 6 threads each, and the data it's processing is only 150GB! That's 1 billion rows for me. The same job works perfectly fine with 1 million rows.
>
> Hope that saves you some trouble.
>
> Nirav
>
> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com> wrote:
> I drastically increased the memory:
>
> spark.executor.memory = 50g
> spark.driver.memory = 8g
> spark.driver.maxResultSize = 8g
> spark.yarn.executor.memoryOverhead = 768
>
> I still see executors killed, but this time the memory does not seem to be the issue.
> The error in the Jupyter notebook is:
>
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755
>
> From the NodeManager log corresponding to worker 10.0.0.9:
>
> 2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
> 2016-02-03 17:31:44,918 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
> 2016-02-03 17:31:44,947 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
> 2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for container_1454509557526_0014_01_000093
> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011
>
> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is far below 50g, and then it suddenly gets killed!?!
>
> 2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
> 2016-02-03 17:33:17,613 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
> 2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
> 2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
> 2016-02-03 17:33:17,667 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1454509557526_0014 CONTAINERID=container_1454509557526_0014_01_000093
> 2016-02-03 17:33:17,670 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2016-02-03 17:33:17,670 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
> 2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
> 2016-02-03 17:33:17,671 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
> 2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
> 2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
> 2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
> 2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]
>
> I'll appreciate any suggestions.
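Two notes on the log above: exit code 143 is 128 + 15, i.e. the container received a SIGTERM from YARN rather than the JVM crashing on its own, and the Py4J traceback goes through PythonRDD.collectAndServe, which is what backs RDD.collect() in PySpark. If the job collects a large result to the driver, avoiding that collect is usually cheaper than adding memory. A rough sketch of the pattern (the wasb:// paths and the RDD itself are placeholders, not the actual ML pipeline from this thread):

    # Hedged sketch -- the paths are placeholders (wasb:// is guessed from
    # the azure-file-system messages in the executor log), not real ones.
    from pyspark import SparkContext

    sc = SparkContext()              # in a Jupyter notebook this already exists as `sc`

    big_rdd = sc.textFile("wasb:///data/input")

    # Instead of big_rdd.collect(), which serializes every partition back
    # to the driver through collectAndServe:
    big_rdd.saveAsTextFile("wasb:///data/output")   # keep the result distributed

    preview = big_rdd.take(100)      # pull only a small sample for inspection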
>
> Thanks,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>
> Date: Tue, 2 Feb 2016 15:40:10 -0800
> Subject: Re: Spark 1.5.2 memory error
> From: openkbi...@gmail.com
> To: spanayo...@msn.com
> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>
> Look at part #3 in the blog post below:
> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>
> You may want to increase the executor memory, not just spark.yarn.executor.memoryOverhead.
>
> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> For the memoryOverhead I have the default of 10% of 16g, and the Spark version is 1.5.2.
>
> Stefan Panayotov, PhD
> Sent from Outlook Mail for Windows 10 phone
>
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Tuesday, February 2, 2016 4:52 PM
> To: Jakob Odersky <ja...@odersky.com>
> Cc: Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
> Subject: Re: Spark 1.5.2 memory error
>
> What value do you use for spark.yarn.executor.memoryOverhead?
>
> Please see https://spark.apache.org/docs/latest/running-on-yarn.html for a description of the parameter.
>
> Which Spark release are you using?
>
> Cheers
>
> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:
>
> Can you share some code that produces the error? It is probably not due to Spark but rather the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source of OOM errors.
>
> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> > Hi Guys,
> >
> > I need help with Spark memory errors when executing ML pipelines.
> > The error that I see is:
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2271)
> >         at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager
> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2271)
> >         at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
> >
> > And ...
> >
> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
> >
> > I'll really appreciate any help here.
> >
> > Thank you,
> >
> > Stefan Panayotov, PhD
> > Home: 610-355-0919
> > Cell: 610-517-5586
> > email: spanayo...@msn.com
> > spanayo...@outlook.com
> > spanayo...@comcast.net
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
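One last observation on the numbers in that Feb 2 log: YARN enforces executor memory plus overhead as the container limit, so 384 MB of overhead leaves almost no room for off-heap allocations. A hedged back-of-the-envelope (the 0.10 default and 384 MB floor are from the running-on-yarn docs linked above; the rest is arithmetic on the logged figures):

    # Back-of-the-envelope container sizing, matching the YarnAllocator lines.
    executor_memory_mb = 16 * 1024                         # 16g executors
    overhead_mb = 384                                      # overhead actually used
    print(executor_memory_mb + overhead_mb)                # 16768 MB, as logged

    # "16.8 GB of 16.5 GB physical memory used" suggests off-heap usage
    # blew past the 384 MB allowance. At 15% the headroom is ~2.4 GB:
    boosted_overhead_mb = int(0.15 * executor_memory_mb)   # 2457 MB
    print(executor_memory_mb + boosted_overhead_mb)        # 18841 MB container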