There is also the (deprecated) spark.storage.unrollFraction setting to consider.

On Wed, Feb 3, 2016 at 2:21 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> What I meant is that executor.cores and task.cpus dictate how many tasks
> will run in parallel on a given executor.
>
> Let's take this example setting:
>
> spark.executor.memory = 16GB
> spark.executor.cores = 6
> spark.task.cpus = 1
>
> Here I think Spark will assign 6 tasks to one executor, each using 1 core
> and 16/6 = about 2.7 GB.
>
> And out of those 2.7 GB, some goes to shuffle and some goes to storage:
>
> spark.shuffle.memoryFraction = 0.4
> spark.storage.memoryFraction = 0.6
>
> Again, this is my speculation from some past articles I read.
>
> On Wed, Feb 3, 2016 at 2:09 PM, Rishabh Wadhawan <rishabh...@gmail.com>
> wrote:
>
>> As far as I know, cores won't give you a larger portion of executor
>> memory, because they are just the CPU cores you are using per executor.
>> Reducing the number of cores, however, would reduce parallel processing
>> power. The executor memory that we specify with spark.executor.memory is
>> the maximum memory your executor might have, but the memory you actually
>> get is less than that. I don't clearly remember, but I think it's either
>> memory/2 or memory/4. I may be wrong, as I have been away from Spark for
>> months.
>>
>> On Feb 3, 2016, at 2:58 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>
>> About OP.
>>
>> How many cores do you assign per executor? Maybe reducing that number will
>> give a larger portion of executor memory to each task being executed on
>> that executor. Others, please comment if that makes sense.
>>
>> On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> I know it's a strong word, but when I have had a case open for that with
>>> MapR and Databricks for a month, and their only solution is to change to
>>> DataFrame, it frustrates you. I know DataFrame/SQL Catalyst has internal
>>> optimizations, but it requires a lot of code changes. I think there's
>>> something fundamentally wrong (or different from Hadoop) in the framework
>>> that is not allowing it to do robust memory management.
>>> I know my job is a memory hog: it does a groupBy and performs
>>> combinatorics on the reducer side, and uses additional data structures at
>>> the task level. Maybe Spark is running multiple heavier tasks on the same
>>> executor and collectively they cause the OOM. But suggesting DataFrame is
>>> NOT a solution for me (and most others who have already invested time
>>> with RDDs and love the type safety they provide). I'm not even sure that
>>> changing to DataFrame will solve the issue.
>>>
>>> On Wed, Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com>
>>> wrote:
>>>
>>>> Nirav,
>>>>
>>>> Sorry to hear about your experience with Spark; however, "sucks" is a
>>>> very strong word. Many organizations are processing a lot more than
>>>> 150GB of data with Spark.
>>>>
>>>> Mohammed
>>>>
>>>> Author: Big Data Analytics with Spark
>>>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>>>>
>>>> *From:* Nirav Patel [mailto:npa...@xactlycorp.com]
>>>> *Sent:* Wednesday, February 3, 2016 11:31 AM
>>>> *To:* Stefan Panayotov
>>>> *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
>>>> *Subject:* Re: Spark 1.5.2 memory error
>>>>
>>>> Hi Stefan,
>>>>
>>>> Welcome to the OOM - heap space club. I have been struggling with
>>>> similar errors (OOM and YARN executors being killed) that fail the job
>>>> or send it into retry loops. I bet the same job would run perfectly fine
>>>> with fewer resources as a Hadoop MapReduce program. I have tested that
>>>> for my program and it does work.
>>>>
>>>> Bottom line from my experience: Spark sucks at memory management when a
>>>> job is processing a large (not huge) amount of data. It's failing for me
>>>> with 10 executors of 16 GB, 6 threads each. And the data it's processing
>>>> is only 150GB! It's 1 billion rows for me. The same job works perfectly
>>>> fine with 1 million rows.
>>>>
>>>> Hope that saves you some trouble.
>>>>
>>>> Nirav
>>>>
>>>> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>>
>>>> I drastically increased the memory:
>>>>
>>>> spark.executor.memory = 50g
>>>> spark.driver.memory = 8g
>>>> spark.driver.maxResultSize = 8g
>>>> spark.yarn.executor.memoryOverhead = 768
>>>>
>>>> I still see executors killed, but this time the memory does not seem to
>>>> be the issue.
>>>> The error on the Jupyter notebook is:
>>>>
>>>> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755
>>>>
>>>> From nodemanagers log corresponding to worker 10.0.0.9:
>>>>
>>>> 2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
>>>> 2016-02-03 17:31:44,918 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
>>>> 2016-02-03 17:31:44,947 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
>>>> 2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
>>>> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for
>>>> container_1454509557526_0014_01_000093
>>>> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011
>>>>
>>>> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB,
>>>> which is far below 50g, and then the container suddenly getting
>>>> killed!?!
>>>>
>>>> 2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
>>>> 2016-02-03 17:33:17,613 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
>>>> 2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
>>>> 2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
>>>> 2016-02-03 17:33:17,667 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
>>>> 2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1454509557526_0014 CONTAINERID=container_1454509557526_0014_01_000093
>>>> 2016-02-03 17:33:17,670 INFO container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>>>> 2016-02-03 17:33:17,670 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
>>>> 2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
>>>> 2016-02-03 17:33:17,671 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
>>>> 2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
>>>> 2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
>>>> 2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
>>>> 2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]
>>>>
>>>> I'll appreciate any suggestions.
>>>>
>>>> Thanks,
>>>>
>>>> *Stefan Panayotov, PhD*
>>>> *Home*: 610-355-0919
>>>> *Cell*: 610-517-5586
>>>> *email*: spanayo...@msn.com
>>>> spanayo...@outlook.com
>>>> spanayo...@comcast.net
>>>>
>>>> ------------------------------
>>>>
>>>> Date: Tue, 2 Feb 2016 15:40:10 -0800
>>>> Subject: Re: Spark 1.5.2 memory error
>>>> From: openkbi...@gmail.com
>>>> To: spanayo...@msn.com
>>>> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>>>>
>>>> Look at part #3 in the blog post below:
>>>>
>>>> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>>>>
>>>> You may want to increase the executor memory, not just
>>>> spark.yarn.executor.memoryOverhead.
>>>>
>>>> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>>
>>>> For memoryOverhead I have the default of 10% of 16g, and the Spark
>>>> version is 1.5.2.
>>>>
>>>> Stefan Panayotov, PhD
>>>> Sent from Outlook Mail for Windows 10 phone
>>>>
>>>> *From:* Ted Yu <yuzhih...@gmail.com>
>>>> *Sent:* Tuesday, February 2, 2016 4:52 PM
>>>> *To:* Jakob Odersky <ja...@odersky.com>
>>>> *Cc:* Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
>>>> *Subject:* Re: Spark 1.5.2 memory error
>>>>
>>>> What value do you use for spark.yarn.executor.memoryOverhead?
>>>>
>>>> Please see https://spark.apache.org/docs/latest/running-on-yarn.html
>>>> for a description of the parameter.
>>>>
>>>> Which Spark release are you using?
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>>
>>>> Can you share some code that produces the error? It is probably not
>>>> due to Spark but rather the way data is handled in the user code.
>>>> Does your code call any reduceByKey operations? These are often a
>>>> source of OOM errors.
>>>>
>>>> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>> > Hi Guys,
>>>> >
>>>> > I need help with Spark memory errors when executing ML pipelines.
>>>> > The error that I see is:
>>>> >
>>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
>>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
>>>> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
>>>> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
>>>> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
>>>> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
>>>> > java.lang.OutOfMemoryError: Java heap space
>>>> >     at java.util.Arrays.copyOf(Arrays.java:2271)
>>>> >     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>>> >     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>>> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>>> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >     at java.lang.Thread.run(Thread.java:745)
>>>> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>>>> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298).
>>>> > 2004728720 bytes result sent via BlockManager)
>>>> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
>>>> > java.lang.OutOfMemoryError: Java heap space
>>>> >     at java.util.Arrays.copyOf(Arrays.java:2271)
>>>> >     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>>> >     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>>> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>>> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >     at java.lang.Thread.run(Thread.java:745)
>>>> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
>>>> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
>>>> >
>>>> > And ...
>>>> >
>>>> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
>>>> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
>>>> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
>>>> >
>>>> > I'll really appreciate any help here.
>>>> >
>>>> > Thank you,
>>>> >
>>>> > Stefan Panayotov, PhD
>>>> > Home: 610-355-0919
>>>> > Cell: 610-517-5586
>>>> > email: spanayo...@msn.com
>>>> > spanayo...@outlook.com
>>>> > spanayo...@comcast.net
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> www.openkb.info
>>>>
>>>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>>>
>>>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>>>>
>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] <https://twitter.com/Xactly> [image: Facebook] <https://www.facebook.com/XactlyCorp> [image: YouTube] <http://www.youtube.com/xactlycorporation>
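
For anyone following the numbers in this thread, here is a minimal Python sketch of the container arithmetic behind the two YARN log excerpts, plus the even per-task split from the example setting quoted above. The function names are hypothetical. The 512 MB rounding increment is an assumption inferred from the logged limits (16.5 GB and 51 GB), not a value stated anywhere in the thread; on a real cluster it would come from the YARN scheduler configuration. The per-task figure is only the rough estimate made in the thread, not a description of how Spark actually shares executor memory between tasks.

```python
import math

MB_PER_GB = 1024


def yarn_container_mb(executor_memory_mb, overhead_mb, increment_mb=512):
    """Return (requested_mb, granted_mb) for one executor container.

    requested = executor heap + memory overhead, as reported in the
    YarnAllocator "Will request ..." log line.
    granted   = requested rounded up to a multiple of increment_mb.
    increment_mb=512 is an ASSUMPTION inferred from the logged limits.
    """
    requested = executor_memory_mb + overhead_mb
    granted = math.ceil(requested / increment_mb) * increment_mb
    return requested, granted


def naive_per_task_gb(executor_memory_gb, executor_cores, task_cpus=1):
    """Even split of executor memory across concurrently running tasks.

    This mirrors the rough estimate from the thread (hypothetical helper);
    it is not how Spark necessarily divides memory at runtime.
    """
    concurrent_tasks = executor_cores // task_cpus
    return executor_memory_gb / concurrent_tasks


if __name__ == "__main__":
    # (executor memory in MB, memoryOverhead in MB) for the two runs in the thread
    for heap_mb, overhead_mb in [(16 * 1024, 384), (50 * 1024, 768)]:
        requested, granted = yarn_container_mb(heap_mb, overhead_mb)
        print(f"heap={heap_mb} MB, overhead={overhead_mb} MB -> "
              f"requested={requested} MB, limit={granted / MB_PER_GB:.1f} GB")
    print(f"naive per-task share: {naive_per_task_gb(16, 6):.2f} GB")
```

Under these assumptions, the 16 GB executor with 384 MB overhead gives a 16768 MB request and a 16.5 GB limit, and the 50 GB executor with 768 MB overhead gives a 51 GB limit, which lines up with the figures in the logs.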