What I meant is that spark.executor.cores and spark.task.cpus dictate how many parallel tasks will run on a given executor.
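Roughly, the arithmetic I have in mind looks like this (a back-of-the-envelope sketch only, using the example settings I walk through next; Spark does not literally hard-partition the heap per task, this is just how I reason about it):

    object TaskSlotSketch {
      def main(args: Array[String]): Unit = {
        // Example settings from this thread (illustrative only)
        val executorMemoryGB = 16.0   // spark.executor.memory
        val executorCores    = 6      // spark.executor.cores
        val taskCpus         = 1      // spark.task.cpus

        // Tasks that can run concurrently on one executor
        val concurrentTasks = executorCores / taskCpus

        // Naive per-task share if every slot is busy at once
        // (my speculation, not Spark's actual memory accounting)
        val perTaskGB = executorMemoryGB / concurrentTasks

        println(s"concurrent tasks per executor = $concurrentTasks")  // 6
        println(f"rough per-task share = $perTaskGB%.2f GB")          // ~2.67
      }
    }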
Let's take this example setting:

spark.executor.memory = 16GB
spark.executor.cores = 6
spark.task.cpus = 1

So here I think Spark will assign 6 tasks to one executor, each using 1 core and roughly 16/6 ≈ 2.7GB. And out of that ~2.7GB, some goes to shuffle and some goes to storage:

spark.shuffle.memoryFraction = 0.4
spark.storage.memoryFraction = 0.6

Again, this is my speculation from some past articles I read.

On Wed, Feb 3, 2016 at 2:09 PM, Rishabh Wadhawan <rishabh...@gmail.com> wrote:
> As of what I know, Cores won’t give you more portion of executor memory, > because its just cpu cores that you are using per executor. Reducing the > number of cores however would result in lack of parallel processing power. > The executor memory that we specify with spark.executor.memory would be the > max memory that your executor might have. But the memory that you get is > less then that. I don’t clearly remember but i think its either memory/2 or > memory/4. But I may be wrong as I have been out of spark for months. > >
On Feb 3, 2016, at 2:58 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> > About OP. > > How many cores you assign per executor? May be reducing that number will > give more portion of executor memory to each task being executed on that > executor. Others please comment if that make sense. > > > >
On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
> >> I know it;s a strong word but when I have a case open for that with MapR >> and Databricks for a month and their only solution to change to DataFrame >> it frustrate you. I know DataFrame/Sql catalyst has internal optimizations >> but it requires lot of code change. I think there's something fundamentally >> wrong (or different from hadoop) in framework that is not allowing it to do >> robust memory management. I know my job is memory hogger, it does a groupBy >> and perform combinatorics in reducer side; uses additional datastructures >> at task levels. May be spark is running multiple heavier tasks on same >> executor and collectively they cause OOM. But suggesting DataFrame is NOT a >> Solution for me (and most others who already invested time with RDD and >> loves the type safety it provides). Not even sure if changing to DataFrame >> will for sure solve the issue. >> >>
On Wed, Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com> >> wrote:
>> >>> Nirav, >>> >>> Sorry to hear about your experience with Spark; however, sucks is a very >>> strong word. Many organizations are processing a lot more than 150GB of >>> data with Spark. >>> >>> >>> >>> Mohammed >>> >>> Author: Big Data Analytics with Spark >>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/> >>> >>> >>> >>>
*From:* Nirav Patel [mailto:npa...@xactlycorp.com] >>> *Sent:* Wednesday, February 3, 2016 11:31 AM >>> *To:* Stefan Panayotov >>> *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org >>> >>> *Subject:* Re: Spark 1.5.2 memory error >>> >>> >>> >>> Hi Stefan, >>> >>> >>> >>> Welcome to the OOM - heap space club. I have been struggling with >>> similar errors (OOM and yarn executor being killed) and failing job or >>> sending it in retry loops. I bet the same job will run perfectly fine with >>> less resource on Hadoop MapReduce program. I have tested it for my program >>> and it does work. >>> >>> >>> >>> Bottomline from my experience. Spark sucks with memory management when >>> job is processing large (not huge) amount of data. It's failing for me with >>> 16gb executors, 10 executors, 6 threads each.
And data its processing is >>> only 150GB! It's 1 billion rows for me. Same job works perfectly fine with >>> 1 million rows. >>> >>> >>> >>> Hope that saves you some trouble. >>> >>> >>> >>> Nirav >>> >>> >>> >>> >>> >>> >>> >>> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com> >>> wrote: >>> >>> I drastically increased the memory: >>> >>> spark.executor.memory = 50g >>> spark.driver.memory = 8g >>> spark.driver.maxResultSize = 8g >>> spark.yarn.executor.memoryOverhead = 768 >>> >>> I still see executors killed, but this time the memory does not seem to >>> be the issue. >>> The error on the Jupyter notebook is: >>> >>> >>> Py4JJavaError: An error occurred while calling >>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. >>> >>> : org.apache.spark.SparkException: Job aborted due to stage failure: >>> Exception while getting task result: java.io.IOException: Failed to connect >>> to /10.0.0.9:48755 >>> >>> >>> From nodemanagers log corresponding to worker 10.0.0.9: >>> >>> >>> 2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService >>> (YarnShuffleService.java:initializeApplication(129)) - Initializing >>> application application_1454509557526_0014 >>> >>> >>> >>> 2016-02-03 17:31:44,918 INFO container.ContainerImpl >>> (ContainerImpl.java:handle(1131)) - Container >>> container_1454509557526_0014_01_000093 transitioned from LOCALIZING to >>> LOCALIZED >>> >>> >>> >>> 2016-02-03 17:31:44,947 INFO container.ContainerImpl >>> (ContainerImpl.java:handle(1131)) - Container >>> container_1454509557526_0014_01_000093 transitioned from LOCALIZED to >>> RUNNING >>> >>> >>> >>> 2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor >>> (DefaultContainerExecutor.java:buildCommandExecutor(267)) - >>> launchContainer: [bash, >>> /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh] >>> >>> >>> >>> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl >>> (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for >>> container_1454509557526_0014_01_000093 >>> >>> >>> >>> 2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl >>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for >>> container_1454509557526_0014_01_000011 >>> >>> >>> >>> >>> >>> >>> >>> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, >>> which is far below 50g, and the suddenly getting killed!?! 
>>> >>> >>> >>> >>> >>> >>> >>> 2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl >>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 >>> for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB >>> physical memory used; 52.8 GB of 107.1 GB virtual memory used >>> >>> >>> >>> 2016-02-03 17:33:17,613 INFO container.ContainerImpl >>> (ContainerImpl.java:handle(1131)) - Container >>> container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING >>> >>> >>> >>> 2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch >>> (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container >>> container_1454509557526_0014_01_000093 >>> >>> >>> >>> 2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor >>> (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from >>> container container_1454509557526_0014_01_000093 is : 143 >>> >>> >>> >>> 2016-02-03 17:33:17,667 INFO container.ContainerImpl >>> (ContainerImpl.java:handle(1131)) - Container >>> container_1454509557526_0014_01_000093 transitioned from KILLING to >>> CONTAINER_CLEANEDUP_AFTER_KILL >>> >>> >>> >>> 2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger >>> (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container >>> Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS >>> APPID=application_1454509557526_0014 >>> CONTAINERID=container_1454509557526_0014_01_000093 >>> >>> >>> >>> 2016-02-03 17:33:17,670 INFO container.ContainerImpl >>> (ContainerImpl.java:handle(1131)) - Container >>> container_1454509557526_0014_01_000093 transitioned from >>> CONTAINER_CLEANEDUP_AFTER_KILL to DONE >>> >>> >>> >>> 2016-02-03 17:33:17,670 INFO application.ApplicationImpl >>> (ApplicationImpl.java:transition(347)) - Removing >>> container_1454509557526_0014_01_000093 from application >>> application_1454509557526_0014 >>> >>> >>> >>> 2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl >>> (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering >>> container container_1454509557526_0014_01_000093 for log-aggregation >>> >>> >>> >>> 2016-02-03 17:33:17,671 INFO containermanager.AuxServices >>> (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId >>> application_1454509557526_0014 >>> >>> >>> >>> 2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService >>> (YarnShuffleService.java:stopContainer(161)) - Stopping container >>> container_1454509557526_0014_01_000093 >>> >>> >>> >>> 2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl >>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for >>> container_1454509557526_0014_01_000093 >>> >>> >>> >>> 2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl >>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 >>> for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB >>> physical memory used; 1.7 GB of 3.1 GB virtual memory used >>> >>> 2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl >>> (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) >>> - Removed completed containers from NM context: >>> [container_1454509557526_0014_01_000093] >>> >>> I'll appreciate any suggestions. 
>>> >>> Thanks, >>> >>> *Stefan Panayotov, PhD * >>> *Home*: 610-355-0919 >>> *Cell*: 610-517-5586 >>> *email*: spanayo...@msn.com >>> spanayo...@outlook.com >>> spanayo...@comcast.net >>> >>> >>> >>> ------------------------------ >>> >>> Date: Tue, 2 Feb 2016 15:40:10 -0800 >>> Subject: Re: Spark 1.5.2 memory error >>> From: openkbi...@gmail.com >>> To: spanayo...@msn.com >>> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org >>> >>> >>> >>> Look at part#3 in below blog: >>> >>> >>> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html >>> >>> >>> >>> You may want to increase the executor memory, not just the >>> spark.yarn.executor.memoryOverhead. >>> >>> >>> >>> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> >>> wrote: >>> >>> For the memoryOvethead I have the default of 10% of 16g, and Spark >>> version is 1.5.2. >>> >>> >>> >>> Stefan Panayotov, PhD >>> Sent from Outlook Mail for Windows 10 phone >>> >>> >>> >>> >>> *From: *Ted Yu <yuzhih...@gmail.com> >>> *Sent: *Tuesday, February 2, 2016 4:52 PM >>> *To: *Jakob Odersky <ja...@odersky.com> >>> *Cc: *Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org >>> *Subject: *Re: Spark 1.5.2 memory error >>> >>> >>> >>> What value do you use for spark.yarn.executor.memoryOverhead ? >>> >>> >>> >>> Please see https://spark.apache.org/docs/latest/running-on-yarn.html >>> for description of the parameter. >>> >>> >>> >>> Which Spark release are you using ? >>> >>> >>> >>> Cheers >>> >>> >>> >>> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote: >>> >>> Can you share some code that produces the error? It is probably not >>> due to spark but rather the way data is handled in the user code. >>> Does your code call any reduceByKey actions? These are often a source >>> for OOM errors. >>> >>> >>> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> >>> wrote: >>> > Hi Guys, >>> > >>> > I need help with Spark memory errors when executing ML pipelines. 
>>> > The error that I see is: >>> > >>> > >>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 >>> in >>> > stage 32.0 (TID 3298) >>> > >>> > >>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 >>> in >>> > stage 32.0 (TID 3278) >>> > >>> > >>> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called >>> with >>> > curMem=296303415, maxMem=8890959790 >>> > >>> > >>> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as >>> bytes in >>> > memory (estimated size 1911.9 MB, free 6.1 GB) >>> > >>> > >>> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL >>> 15: >>> > SIGTERM >>> > >>> > >>> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 >>> (TID >>> > 3278) >>> > >>> > >>> > java.lang.OutOfMemoryError: Java heap space >>> > >>> > >>> > at java.util.Arrays.copyOf(Arrays.java:2271) >>> > >>> > >>> > at >>> > >>> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) >>> > >>> > >>> > at >>> > >>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86) >>> > >>> > >>> > at >>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256) >>> > >>> > >>> > at >>> > >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >>> > >>> > >>> > at >>> > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >>> > >>> > >>> > at java.lang.Thread.run(Thread.java:745) >>> > >>> > >>> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called >>> > >>> > >>> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID >>> > 3298). 2004728720 bytes result sent via BlockManager) >>> > >>> > >>> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught >>> exception in >>> > thread Thread[Executor task launch worker-8,5,main] >>> > >>> > >>> > java.lang.OutOfMemoryError: Java heap space >>> > >>> > >>> > at java.util.Arrays.copyOf(Arrays.java:2271) >>> > >>> > >>> > at >>> > >>> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) >>> > >>> > >>> > at >>> > >>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86) >>> > >>> > >>> > at >>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256) >>> > >>> > >>> > at >>> > >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >>> > >>> > >>> > at >>> > >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >>> > >>> > >>> > at java.lang.Thread.run(Thread.java:745) >>> > >>> > >>> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called >>> > >>> > >>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system >>> metrics >>> > system... >>> > >>> > >>> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted. >>> > >>> > >>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics >>> system >>> > stopped. >>> > >>> > >>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics >>> system >>> > shutdown complete. >>> > >>> > >>> > >>> > >>> > >>> > And ….. 
>>> > >>> > >>> > >>> > >>> > >>> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening >>> proxy >>> > : 10.0.0.5:30050 >>> > >>> > >>> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container >>> > container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: >>> -104) >>> > >>> > >>> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for >>> > exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. >>> Consider >>> > boosting spark.yarn.executor.memoryOverhead. >>> > >>> > >>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor >>> > containers, each with 2 cores and 16768 MB memory including 384 MB >>> overhead >>> > >>> > >>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: >>> Any, >>> > capability: <memory:16768, vCores:2>) >>> > >>> > >>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container >>> > container_1454421662639_0011_01_000037 for on host 10.0.0.8 >>> > >>> > >>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. >>> > driverUrl: >>> > akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler >>> <http://10.0.0.15:47446/user/CoarseGrainedScheduler>, >>> > executorHostname: 10.0.0.8 >>> > >>> > >>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from >>> YARN, >>> > launching executors on 1 of them. >>> > >>> > >>> > I'll really appreciate any help here. >>> > >>> > Thank you, >>> > >>> > Stefan Panayotov, PhD >>> > Home: 610-355-0919 >>> > Cell: 610-517-5586 >>> > email: spanayo...@msn.com >>> > spanayo...@outlook.com >>> > spanayo...@comcast.net >>> > >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Thanks, >>> >>> www.openkb.info >>> >>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool) >>> >>> >>> >>> >>> >>> >>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> >>> >>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] >>> <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] >>> <https://twitter.com/Xactly> [image: Facebook] >>> <https://www.facebook.com/XactlyCorp> [image: YouTube] >>> <http://www.youtube.com/xactlycorporation> >>> >> >> > > > > [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> > > > -- [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] <https://twitter.com/Xactly> [image: Facebook] <https://www.facebook.com/XactlyCorp> [image: YouTube] <http://www.youtube.com/xactlycorporation>
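P.S. Regarding the "Container killed by YARN for exceeding memory limits ... Consider boosting spark.yarn.executor.memoryOverhead" lines in the log above: as I understand it, YARN sizes the container as spark.executor.memory plus spark.yarn.executor.memoryOverhead and kills the executor once its physical memory passes that limit, so all off-heap usage has to fit into the overhead slice. A minimal sketch of bumping the overhead on 1.5.x (the 2048 MB value is purely illustrative, not a tuned recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryOverheadSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("memory-overhead-sketch")
          .set("spark.executor.memory", "16g")
          .set("spark.executor.cores", "6")
          // The launch log above requested only 384 MB of overhead on top of
          // a 16 GB heap; off-heap buffers have to fit in this slice.
          .set("spark.yarn.executor.memoryOverhead", "2048") // MB, illustrative
        val sc = new SparkContext(conf)
        // ... job code ...
        sc.stop()
      }
    }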