What I meant is that spark.executor.cores and spark.task.cpus dictate how many
tasks can run in parallel on a given executor.

Let's take these example settings:

spark.executor.memory = 16GB
spark.executor.cores = 6
spark.task.cpus = 1

So here I think Spark will assign 6 concurrent tasks to one executor, each
using 1 core and roughly 2.7 GB (16/6) of the executor heap.

And out of that roughly 2.7 GB, some goes to shuffle and some goes to storage,
per these settings:

spark.shuffle.memoryFraction = 0.4
spark.storage.memoryFraction = 0.6

Again, this is my speculation based on some articles I read a while back.
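
If it helps to see everything in one place, here is a minimal PySpark sketch of
the same settings (Spark 1.5-era config keys; the comments just restate the
arithmetic above, which is still only my understanding):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("executor-memory-layout-example")
        .set("spark.executor.memory", "16g")
        .set("spark.executor.cores", "6")
        .set("spark.task.cpus", "1")
        .set("spark.shuffle.memoryFraction", "0.4")
        .set("spark.storage.memoryFraction", "0.6"))

# Concurrent tasks per executor = executor.cores / task.cpus = 6 / 1 = 6.
# Rough heap share per task = 16 GB / 6, about 2.7 GB, before the shuffle and
# storage fractions above are carved out of the executor heap.
sc = SparkContext(conf=conf)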








On Wed, Feb 3, 2016 at 2:09 PM, Rishabh Wadhawan <rishabh...@gmail.com>
wrote:

> As far as I know, cores won't give you a bigger portion of executor memory,
> because they are just the CPU cores you are using per executor. Reducing the
> number of cores, however, would cost you parallel processing power.
> The executor memory that we specify with spark.executor.memory is the
> max memory your executor might have, but the memory you actually get is
> less than that. I don't clearly remember, but I think it's either memory/2 or
> memory/4. I may be wrong, though, as I have been out of Spark for months.
>
> On Feb 3, 2016, at 2:58 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
> About the OP:
>
> How many cores do you assign per executor? Maybe reducing that number would
> give a larger portion of executor memory to each task being executed on that
> executor. Others, please comment if that makes sense.
>
>
>
> On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
>> I know it's a strong word, but when you have had a case open about this with
>> MapR and Databricks for a month and their only solution is to switch to
>> DataFrame, it frustrates you. I know DataFrame/SQL Catalyst has internal
>> optimizations, but it requires a lot of code change. I think there's something
>> fundamentally wrong (or different from Hadoop) in the framework that keeps it
>> from doing robust memory management. I know my job is a memory hog: it does a
>> groupBy and performs combinatorics on the reducer side, and it uses additional
>> data structures at the task level. Maybe Spark is running multiple heavy tasks
>> on the same executor and collectively they cause the OOM. But suggesting
>> DataFrame is NOT a solution for me (or for most others who have already
>> invested time in RDDs and love the type safety they provide). I'm not even
>> sure that changing to DataFrame would solve the issue.
>>
>> On Wed, Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com>
>> wrote:
>>
>>> Nirav,
>>>
>>> Sorry to hear about your experience with Spark; however, "sucks" is a very
>>> strong word. Many organizations are processing a lot more than 150 GB of
>>> data with Spark.
>>>
>>>
>>>
>>> Mohammed
>>>
>>> Author: Big Data Analytics with Spark
>>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>>>
>>>
>>>
>>> *From:* Nirav Patel [mailto:npa...@xactlycorp.com]
>>> *Sent:* Wednesday, February 3, 2016 11:31 AM
>>> *To:* Stefan Panayotov
>>> *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
>>>
>>> *Subject:* Re: Spark 1.5.2 memory error
>>>
>>>
>>>
>>> Hi Stefan,
>>>
>>>
>>>
>>> Welcome to the OOM (heap space) club. I have been struggling with
>>> similar errors (OOM and the YARN executor being killed) that fail the job or
>>> send it into retry loops. I bet the same job would run perfectly fine with
>>> fewer resources as a Hadoop MapReduce program; I have tested that for my
>>> program and it does work.
>>>
>>>
>>>
>>> Bottom line from my experience: Spark sucks at memory management when a
>>> job is processing a large (not huge) amount of data. It's failing for me with
>>> 10 executors of 16 GB and 6 threads each, and the data it's processing is
>>> only 150 GB (about 1 billion rows for me). The same job works perfectly fine
>>> with 1 million rows.
>>>
>>>
>>>
>>> Hope that saves you some trouble.
>>>
>>>
>>>
>>> Nirav
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com>
>>> wrote:
>>>
>>> I drastically increased the memory:
>>>
>>> spark.executor.memory = 50g
>>> spark.driver.memory = 8g
>>> spark.driver.maxResultSize = 8g
>>> spark.yarn.executor.memoryOverhead = 768
>>>
>>> I still see executors killed, but this time the memory does not seem to
>>> be the issue.
>>> The error on the Jupyter notebook is:
>>>
>>>
>>> Py4JJavaError: An error occurred while calling 
>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>>>
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>>> Exception while getting task result: java.io.IOException: Failed to connect 
>>> to /10.0.0.9:48755
>>>
>>>
>>> From the NodeManager log corresponding to worker 10.0.0.9:
>>>
>>>
>>> 2016-02-03 17:31:44,917 INFO  yarn.YarnShuffleService
>>> (YarnShuffleService.java:initializeApplication(129)) - Initializing
>>> application application_1454509557526_0014
>>>
>>>
>>>
>>> 2016-02-03 17:31:44,918 INFO  container.ContainerImpl
>>> (ContainerImpl.java:handle(1131)) - Container
>>> container_1454509557526_0014_01_000093 transitioned from LOCALIZING to
>>> LOCALIZED
>>>
>>>
>>>
>>> 2016-02-03 17:31:44,947 INFO  container.ContainerImpl
>>> (ContainerImpl.java:handle(1131)) - Container
>>> container_1454509557526_0014_01_000093 transitioned from LOCALIZED to
>>> RUNNING
>>>
>>>
>>>
>>> 2016-02-03 17:31:44,951 INFO  nodemanager.DefaultContainerExecutor
>>> (DefaultContainerExecutor.java:buildCommandExecutor(267)) -
>>> launchContainer: [bash,
>>> /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
>>>
>>>
>>>
>>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>>> (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for
>>> container_1454509557526_0014_01_000093
>>>
>>>
>>>
>>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>>> container_1454509557526_0014_01_000011
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB,
>>> which is far below 50 GB, and then the container suddenly gets killed!?!
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,350 INFO  monitor.ContainersMonitorImpl
>>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962
>>> for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB
>>> physical memory used; 52.8 GB of 107.1 GB virtual memory used
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,613 INFO  container.ContainerImpl
>>> (ContainerImpl.java:handle(1131)) - Container
>>> container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,613 INFO  launcher.ContainerLaunch
>>> (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container
>>> container_1454509557526_0014_01_000093
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,629 WARN  nodemanager.DefaultContainerExecutor
>>> (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from
>>> container container_1454509557526_0014_01_000093 is : 143
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,667 INFO  container.ContainerImpl
>>> (ContainerImpl.java:handle(1131)) - Container
>>> container_1454509557526_0014_01_000093 transitioned from KILLING to
>>> CONTAINER_CLEANEDUP_AFTER_KILL
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,669 INFO  nodemanager.NMAuditLogger
>>> (NMAuditLogger.java:logSuccess(89)) - USER=root       OPERATION=Container
>>> Finished - Killed    TARGET=ContainerImpl RESULT=SUCCESS
>>> APPID=application_1454509557526_0014
>>> CONTAINERID=container_1454509557526_0014_01_000093
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,670 INFO  container.ContainerImpl
>>> (ContainerImpl.java:handle(1131)) - Container
>>> container_1454509557526_0014_01_000093 transitioned from
>>> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,670 INFO  application.ApplicationImpl
>>> (ApplicationImpl.java:transition(347)) - Removing
>>> container_1454509557526_0014_01_000093 from application
>>> application_1454509557526_0014
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,671 INFO  logaggregation.AppLogAggregatorImpl
>>> (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering
>>> container container_1454509557526_0014_01_000093 for log-aggregation
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,671 INFO  containermanager.AuxServices
>>> (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId
>>> application_1454509557526_0014
>>>
>>>
>>>
>>> 2016-02-03 17:33:17,671 INFO  yarn.YarnShuffleService
>>> (YarnShuffleService.java:stopContainer(161)) - Stopping container
>>> container_1454509557526_0014_01_000093
>>>
>>>
>>>
>>> 2016-02-03 17:33:20,351 INFO  monitor.ContainersMonitorImpl
>>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>>> container_1454509557526_0014_01_000093
>>>
>>>
>>>
>>> 2016-02-03 17:33:20,383 INFO  monitor.ContainersMonitorImpl
>>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727
>>> for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB
>>> physical memory used; 1.7 GB of 3.1 GB virtual memory used
>>>
>>> 2016-02-03 17:33:22,627 INFO  nodemanager.NodeStatusUpdaterImpl
>>> (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529))
>>> - Removed completed containers from NM context:
>>> [container_1454509557526_0014_01_000093]
>>>
>>> I'll appreciate any suggestions.
>>>
>>> Thanks,
>>>
>>> *Stefan Panayotov, PhD *
>>> *Home*: 610-355-0919
>>> *Cell*: 610-517-5586
>>> *email*: spanayo...@msn.com
>>> spanayo...@outlook.com
>>> spanayo...@comcast.net
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> Date: Tue, 2 Feb 2016 15:40:10 -0800
>>> Subject: Re: Spark 1.5.2 memory error
>>> From: openkbi...@gmail.com
>>> To: spanayo...@msn.com
>>> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>>>
>>>
>>>
>>> Look at part#3 in below blog:
>>>
>>>
>>> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>>>
>>>
>>>
>>> You may want to increase the executor memory, not just the
>>> spark.yarn.executor.memoryOverhead.
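>>>
>>> For example, a minimal sketch of raising both together (assuming Spark 1.5 on
>>> YARN; the values below are placeholders, not recommendations):
>>>
>>> from pyspark import SparkConf
>>> conf = (SparkConf()
>>>         .set("spark.executor.memory", "20g")                  # bigger executor heap
>>>         .set("spark.yarn.executor.memoryOverhead", "2048"))   # off-heap overhead, in MB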
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com>
>>> wrote:
>>>
>>> For the memoryOverhead I have the default of 10% of 16g, and the Spark
>>> version is 1.5.2.
>>>
>>>
>>>
>>> Stefan Panayotov, PhD
>>> Sent from Outlook Mail for Windows 10 phone
>>>
>>>
>>>
>>>
>>> *From: *Ted Yu <yuzhih...@gmail.com>
>>> *Sent: *Tuesday, February 2, 2016 4:52 PM
>>> *To: *Jakob Odersky <ja...@odersky.com>
>>> *Cc: *Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
>>> *Subject: *Re: Spark 1.5.2 memory error
>>>
>>>
>>>
>>> What value do you use for spark.yarn.executor.memoryOverhead ?
>>>
>>>
>>>
>>> Please see https://spark.apache.org/docs/latest/running-on-yarn.html
>>> for description of the parameter.
>>>
>>>
>>>
>>> Which Spark release are you using ?
>>>
>>>
>>>
>>> Cheers
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>>
>>> Can you share some code that produces the error? It is probably not
>>> due to Spark but rather to the way data is handled in the user code.
>>> Does your code call any reduceByKey actions? These are often a source
>>> of OOM errors.
>>>
>>>
>>> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com>
>>> wrote:
>>> > Hi Guys,
>>> >
>>> > I need help with Spark memory errors when executing ML pipelines.
>>> > The error that I see is:
>>> >
>>> >
>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0
>>> in
>>> > stage 32.0 (TID 3298)
>>> >
>>> >
>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0
>>> in
>>> > stage 32.0 (TID 3278)
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called
>>> with
>>> > curMem=296303415, maxMem=8890959790
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as
>>> bytes in
>>> > memory (estimated size 1911.9 MB, free 6.1 GB)
>>> >
>>> >
>>> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL
>>> 15:
>>> > SIGTERM
>>> >
>>> >
>>> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0
>>> (TID
>>> > 3278)
>>> >
>>> >
>>> > java.lang.OutOfMemoryError: Java heap space
>>> >
>>> >
>>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>>> >
>>> >
>>> >        at
>>> >
>>> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>> >
>>> >
>>> >        at
>>> >
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>> >
>>> >
>>> >        at
>>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>> >
>>> >
>>> >        at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >
>>> >
>>> >        at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >
>>> >
>>> >        at java.lang.Thread.run(Thread.java:745)
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
>>> > 3298). 2004728720 bytes result sent via BlockManager)
>>> >
>>> >
>>> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught
>>> exception in
>>> > thread Thread[Executor task launch worker-8,5,main]
>>> >
>>> >
>>> > java.lang.OutOfMemoryError: Java heap space
>>> >
>>> >
>>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>>> >
>>> >
>>> >        at
>>> >
>>> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>> >
>>> >
>>> >        at
>>> >
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>> >
>>> >
>>> >        at
>>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>> >
>>> >
>>> >        at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> >
>>> >
>>> >        at
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >
>>> >
>>> >        at java.lang.Thread.run(Thread.java:745)
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system
>>> metrics
>>> > system...
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
>>> system
>>> > stopped.
>>> >
>>> >
>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
>>> system
>>> > shutdown complete.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > And …..
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening
>>> proxy
>>> > : 10.0.0.5:30050
>>> >
>>> >
>>> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
>>> > container_1454421662639_0011_01_000005 (state: COMPLETE, exit status:
>>> -104)
>>> >
>>> >
>>> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
>>> > exceeding memory limits. 16.8 GB of 16.5 GB physical memory used.
>>> Consider
>>> > boosting spark.yarn.executor.memoryOverhead.
>>> >
>>> >
>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
>>> > containers, each with 2 cores and 16768 MB memory including 384 MB
>>> overhead
>>> >
>>> >
>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host:
>>> Any,
>>> > capability: <memory:16768, vCores:2>)
>>> >
>>> >
>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container
>>> > container_1454421662639_0011_01_000037 for on host 10.0.0.8
>>> >
>>> >
>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
>>> > driverUrl:
>>> > akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler
>>> <http://10.0.0.15:47446/user/CoarseGrainedScheduler>,
>>> > executorHostname: 10.0.0.8
>>> >
>>> >
>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from
>>> YARN,
>>> > launching executors on 1 of them.
>>> >
>>> >
>>> > I'll really appreciate any help here.
>>> >
>>> > Thank you,
>>> >
>>> > Stefan Panayotov, PhD
>>> > Home: 610-355-0919
>>> > Cell: 610-517-5586
>>> > email: spanayo...@msn.com
>>> > spanayo...@outlook.com
>>> > spanayo...@comcast.net
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Thanks,
>>>
>>> www.openkb.info
>>>
>>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
>
>
