Re: Spark 1.5.2 memory error

Ted Yu Wed, 03 Feb 2016 14:40:31 -0800

There is also the (deprecated) spark.storage.unrollFraction setting to consider.
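
For context, here is a rough sketch of how the legacy (pre-1.6) static memory model sizes the storage pool and the unroll region inside it. It assumes the documented defaults for spark.storage.memoryFraction (0.6), spark.storage.safetyFraction (0.9) and spark.storage.unrollFraction (0.2); the function name is made up for illustration and is not a Spark API.

```python
GIB = 1024 ** 3

def legacy_storage_pools(heap_bytes,
                         storage_fraction=0.6,  # spark.storage.memoryFraction
                         safety_fraction=0.9,   # spark.storage.safetyFraction
                         unroll_fraction=0.2):  # spark.storage.unrollFraction
    """Return (storage_bytes, unroll_bytes) under the legacy static model."""
    storage = heap_bytes * storage_fraction * safety_fraction
    # The unroll region is a fraction of the storage pool, not of the whole heap.
    return storage, storage * unroll_fraction

storage, unroll = legacy_storage_pools(16 * GIB)
print(f"storage pool: {storage / GIB:.2f} GiB, of which unroll: {unroll / GIB:.2f} GiB")

# Working backwards from the maxMem value in the executor log quoted in this thread.
implied_heap = 8890959790 / (0.6 * 0.9)
print(f"heap implied by maxMem=8890959790: {implied_heap / GIB:.2f} GiB")
```

The implied heap comes out a little under 16 GiB, which would be consistent with the JVM reporting slightly less usable heap than the configured 16g, but that reading is an inference from the log rather than something confirmed in this thread.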

On Wed, Feb 3, 2016 at 2:21 PM, Nirav Patel <npa...@xactlycorp.com> wrote:


> What I meant is that executor.cores and task.cpus dictate how many parallel
> tasks will run on a given executor.
>
> Let's take this example setting.
>
> spark.executor.memory = 16GB
> spark.executor.cores = 6
> spark.task.cpus = 1
>
> So here I think Spark will assign 6 tasks to one executor, each using 1
> core and about 16/6 = 2.67GB.
>
> And out of those 2.67GB, some goes to shuffle and some goes to storage.
>
> spark.shuffle.memoryFraction = 0.4
> spark.storage.memoryFraction = 0.6
>
> Again, this is my speculation from some past articles I read.
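
A hedged sketch of the arithmetic in the message above. As far as I understand the 1.5 legacy model, the heap is not statically divided by core count: tasks share the shuffle pool, and with N active tasks each one can obtain between 1/(2N) and 1/N of it before spilling. The function names are hypothetical, and the 0.8 safety factor is the assumed default of spark.shuffle.safetyFraction.

```python
GIB = 1024 ** 3

def task_slots(executor_cores, task_cpus=1):
    """Concurrent tasks per executor: spark.executor.cores / spark.task.cpus."""
    return executor_cores // task_cpus

def shuffle_share_bounds(heap_bytes, active_tasks,
                         shuffle_fraction=0.4,  # spark.shuffle.memoryFraction from the example
                         shuffle_safety=0.8):   # assumed spark.shuffle.safetyFraction default
    """Return (min_bytes, max_bytes) of shuffle memory a single task can hold."""
    pool = heap_bytes * shuffle_fraction * shuffle_safety
    return pool / (2 * active_tasks), pool / active_tasks

slots = task_slots(6, 1)
low, high = shuffle_share_bounds(16 * GIB, slots)
print(f"{slots} slots; per-task shuffle share between "
      f"{low / GIB:.2f} and {high / GIB:.2f} GiB")
```

Under these assumptions the per-task shuffle budget is well below 16/6 GB, and objects a task allocates on its own (for example, per-task data structures) come out of whatever heap is left outside the storage and shuffle pools.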
>
>
>
>
>
>
>
>
> On Wed, Feb 3, 2016 at 2:09 PM, Rishabh Wadhawan <rishabh...@gmail.com>
> wrote:
>
>> As far as I know, cores won't give you a larger portion of executor memory,
>> because they are just the CPU cores that you are using per executor.
>> Reducing the number of cores, however, would reduce your parallel
>> processing power. The executor memory that we specify with
>> spark.executor.memory is the maximum memory that your executor might have,
>> but the memory that you actually get is less than that. I don't clearly
>> remember, but I think it's either memory/2 or memory/4. But I may be wrong,
>> as I have been away from Spark for months.
>>
>> On Feb 3, 2016, at 2:58 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>
>> About the OP:
>>
>> How many cores do you assign per executor? Maybe reducing that number will
>> give a larger portion of executor memory to each task being executed on
>> that executor. Others, please comment on whether that makes sense.
>>
>>
>>
>> On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> I know it's a strong word, but when you have had a case open about this
>>> with MapR and Databricks for a month and their only solution is to change
>>> to DataFrame, it frustrates you. I know DataFrame/SQL Catalyst has internal
>>> optimizations, but it requires a lot of code change. I think there's
>>> something fundamentally wrong (or different from Hadoop) in the framework
>>> that is not allowing it to do robust memory management. I know my job is a
>>> memory hog: it does a groupBy and performs combinatorics on the reducer
>>> side, and it uses additional data structures at the task level. Maybe Spark
>>> is running multiple heavier tasks on the same executor and collectively
>>> they cause the OOM. But suggesting DataFrame is NOT a solution for me (or
>>> for most others who have already invested time in RDDs and love the type
>>> safety they provide). I'm not even sure that changing to DataFrame would
>>> solve the issue.
>>>
>>> On Wed, Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com>
>>> wrote:
>>>
>>>> Nirav,
>>>>
>>>> Sorry to hear about your experience with Spark; however, "sucks" is a
>>>> very strong word. Many organizations are processing a lot more than 150GB
>>>> of data with Spark.
>>>>
>>>>
>>>>
>>>> Mohammed
>>>>
>>>> Author: Big Data Analytics with Spark
>>>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>>>>
>>>>
>>>>
>>>> *From:* Nirav Patel [mailto:npa...@xactlycorp.com]
>>>> *Sent:* Wednesday, February 3, 2016 11:31 AM
>>>> *To:* Stefan Panayotov
>>>> *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
>>>>
>>>> *Subject:* Re: Spark 1.5.2 memory error
>>>>
>>>>
>>>>
>>>> Hi Stefan,
>>>>
>>>>
>>>>
>>>> Welcome to the OOM - heap space club. I have been struggling with
>>>> similar errors (OOM and YARN executors being killed) that either fail the
>>>> job or send it into retry loops. I bet the same job would run perfectly
>>>> fine with fewer resources as a Hadoop MapReduce program. I have tested it
>>>> for my program and it does work.
>>>>
>>>>
>>>>
>>>> Bottom line from my experience: Spark sucks at memory management when a
>>>> job is processing a large (not huge) amount of data. It's failing for me
>>>> with 10 executors of 16GB each, 6 threads each. And the data it's
>>>> processing is only 150GB! It's 1 billion rows for me. The same job works
>>>> perfectly fine with 1 million rows.
>>>>
>>>>
>>>>
>>>> Hope that saves you some trouble.
>>>>
>>>>
>>>>
>>>> Nirav
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>>
>>>> I drastically increased the memory:
>>>>
>>>> spark.executor.memory = 50g
>>>> spark.driver.memory = 8g
>>>> spark.driver.maxResultSize = 8g
>>>> spark.yarn.executor.memoryOverhead = 768
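
A small sketch of how these settings translate into the container limit that YARN enforces: the request is executor memory plus overhead, rounded up to the scheduler's allocation increment. The 512 MB increment is a guess that happens to reproduce both limits seen in the logs in this thread (16.5 GB and 51 GB); the real value is cluster configuration, and the helper name is hypothetical.

```python
import math

def yarn_container_mb(executor_memory_mb, overhead_mb, increment_mb=512):
    """Executor memory plus overhead, rounded up to the allocation increment."""
    requested = executor_memory_mb + overhead_mb
    return math.ceil(requested / increment_mb) * increment_mb

for label, mem, overhead in [("16g + 384", 16 * 1024, 384),
                             ("50g + 768", 50 * 1024, 768)]:
    limit = yarn_container_mb(mem, overhead)
    print(f"{label}: request {mem + overhead} MB, "
          f"limit {limit} MB ({limit / 1024:.1f} GB)")
```

With 50g and 768 MB of overhead, only a few hundred MB of off-heap headroom sits between the JVM heap and the container limit.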
>>>>
>>>> I still see executors killed, but this time memory does not seem to
>>>> be the issue.
>>>> The error in the Jupyter notebook is:
>>>>
>>>>
>>>> Py4JJavaError: An error occurred while calling 
>>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>>>>
>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>>>> Exception while getting task result: java.io.IOException: Failed to 
>>>> connect to /10.0.0.9:48755
>>>>
>>>>
>>>> From the NodeManager log corresponding to worker 10.0.0.9:
>>>>
>>>>
>>>> 2016-02-03 17:31:44,917 INFO  yarn.YarnShuffleService
>>>> (YarnShuffleService.java:initializeApplication(129)) - Initializing
>>>> application application_1454509557526_0014
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:31:44,918 INFO  container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container
>>>> container_1454509557526_0014_01_000093 transitioned from LOCALIZING to
>>>> LOCALIZED
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:31:44,947 INFO  container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container
>>>> container_1454509557526_0014_01_000093 transitioned from LOCALIZED to
>>>> RUNNING
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:31:44,951 INFO  nodemanager.DefaultContainerExecutor
>>>> (DefaultContainerExecutor.java:buildCommandExecutor(267)) -
>>>> launchContainer: [bash,
>>>> /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>>>> (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for
>>>> container_1454509557526_0014_01_000093
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>>>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>>>> container_1454509557526_0014_01_000011
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB,
>>>> which is far below 50g, and then the container suddenly getting killed!?!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,350 INFO  monitor.ContainersMonitorImpl
>>>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962
>>>> for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB
>>>> physical memory used; 52.8 GB of 107.1 GB virtual memory used
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,613 INFO  container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container
>>>> container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,613 INFO  launcher.ContainerLaunch
>>>> (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container
>>>> container_1454509557526_0014_01_000093
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,629 WARN  nodemanager.DefaultContainerExecutor
>>>> (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from
>>>> container container_1454509557526_0014_01_000093 is : 143
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,667 INFO  container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container
>>>> container_1454509557526_0014_01_000093 transitioned from KILLING to
>>>> CONTAINER_CLEANEDUP_AFTER_KILL
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,669 INFO  nodemanager.NMAuditLogger
>>>> (NMAuditLogger.java:logSuccess(89)) - USER=root       OPERATION=Container
>>>> Finished - Killed    TARGET=ContainerImpl RESULT=SUCCESS
>>>> APPID=application_1454509557526_0014
>>>> CONTAINERID=container_1454509557526_0014_01_000093
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,670 INFO  container.ContainerImpl
>>>> (ContainerImpl.java:handle(1131)) - Container
>>>> container_1454509557526_0014_01_000093 transitioned from
>>>> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,670 INFO  application.ApplicationImpl
>>>> (ApplicationImpl.java:transition(347)) - Removing
>>>> container_1454509557526_0014_01_000093 from application
>>>> application_1454509557526_0014
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,671 INFO  logaggregation.AppLogAggregatorImpl
>>>> (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering
>>>> container container_1454509557526_0014_01_000093 for log-aggregation
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,671 INFO  containermanager.AuxServices
>>>> (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId
>>>> application_1454509557526_0014
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:17,671 INFO  yarn.YarnShuffleService
>>>> (YarnShuffleService.java:stopContainer(161)) - Stopping container
>>>> container_1454509557526_0014_01_000093
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:20,351 INFO  monitor.ContainersMonitorImpl
>>>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>>>> container_1454509557526_0014_01_000093
>>>>
>>>>
>>>>
>>>> 2016-02-03 17:33:20,383 INFO  monitor.ContainersMonitorImpl
>>>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727
>>>> for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB
>>>> physical memory used; 1.7 GB of 3.1 GB virtual memory used
>>>>
>>>> 2016-02-03 17:33:22,627 INFO  nodemanager.NodeStatusUpdaterImpl
>>>> (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529))
>>>> - Removed completed containers from NM context:
>>>> [container_1454509557526_0014_01_000093]
>>>>
>>>> I'll appreciate any suggestions.
>>>>
>>>> Thanks,
>>>>
>>>> *Stefan Panayotov, PhD *
>>>> *Home*: 610-355-0919
>>>> *Cell*: 610-517-5586
>>>> *email*: spanayo...@msn.com
>>>> spanayo...@outlook.com
>>>> spanayo...@comcast.net
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> Date: Tue, 2 Feb 2016 15:40:10 -0800
>>>> Subject: Re: Spark 1.5.2 memory error
>>>> From: openkbi...@gmail.com
>>>> To: spanayo...@msn.com
>>>> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>>>>
>>>>
>>>>
>>>> Look at part 3 of the blog post below:
>>>>
>>>>
>>>> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>>>>
>>>>
>>>>
>>>> You may want to increase the executor memory, not just the
>>>> spark.yarn.executor.memoryOverhead.
>>>>
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>>
>>>> For the memoryOverhead I have the default of 10% of 16g, and Spark
>>>> version is 1.5.2.
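
For reference, a minimal sketch of the default overhead rule as I understand it for Spark 1.5 on YARN: 10% of executor memory with a 384 MB floor. The function name is illustrative, not a Spark API.

```python
def default_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Default spark.yarn.executor.memoryOverhead: max(factor * memory, floor)."""
    return max(int(executor_memory_mb * factor), minimum_mb)

executor_mb = 16 * 1024
overhead = default_overhead_mb(executor_mb)
print(f"default overhead for {executor_mb} MB: {overhead} MB")
print(f"container request: {executor_mb + overhead} MB")
```

Note that the allocator log in the original message reports "16768 MB memory including 384 MB overhead", so the effective overhead on that cluster may have been 384 MB rather than 10%; checking the value actually in effect seems worthwhile.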
>>>>
>>>>
>>>>
>>>> Stefan Panayotov, PhD
>>>> Sent from Outlook Mail for Windows 10 phone
>>>>
>>>>
>>>>
>>>>
>>>> *From: *Ted Yu <yuzhih...@gmail.com>
>>>> *Sent: *Tuesday, February 2, 2016 4:52 PM
>>>> *To: *Jakob Odersky <ja...@odersky.com>
>>>> *Cc: *Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
>>>> *Subject: *Re: Spark 1.5.2 memory error
>>>>
>>>>
>>>>
>>>> What value do you use for spark.yarn.executor.memoryOverhead?
>>>>
>>>>
>>>>
>>>> Please see https://spark.apache.org/docs/latest/running-on-yarn.html
>>>> for description of the parameter.
>>>>
>>>>
>>>>
>>>> Which Spark release are you using?
>>>>
>>>>
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>>
>>>> Can you share some code that produces the error? It is probably not
>>>> due to Spark but rather the way data is handled in the user code.
>>>> Does your code call any reduceByKey operations? These are often a source
>>>> of OOM errors.
>>>>
>>>>
>>>> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com>
>>>> wrote:
>>>> > Hi Guys,
>>>> >
>>>> > I need help with Spark memory errors when executing ML pipelines.
>>>> > The error that I see is:
>>>> >
>>>> >
>>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
>>>> >
>>>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
>>>> >
>>>> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
>>>> >
>>>> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
>>>> >
>>>> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
>>>> >
>>>> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
>>>> >
>>>> > java.lang.OutOfMemoryError: Java heap space
>>>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>>>> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>>> >        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>>> >        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>>> >        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >        at java.lang.Thread.run(Thread.java:745)
>>>> >
>>>> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>>>> >
>>>> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
>>>> >
>>>> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
>>>> >
>>>> > java.lang.OutOfMemoryError: Java heap space
>>>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>>>> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>>>> >        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>>>> >        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>>>> >        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> >        at java.lang.Thread.run(Thread.java:745)
>>>> >
>>>> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>>>> >
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
>>>> >
>>>> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>>>> >
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
>>>> >
>>>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > And …..
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
>>>> >
>>>> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
>>>> >
>>>> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>>>> >
>>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>>>> >
>>>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
>>>> >
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
>>>> >
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
>>>> >
>>>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
>>>> >
>>>> >
>>>> > I'll really appreciate any help here.
>>>> >
>>>> > Thank you,
>>>> >
>>>> > Stefan Panayotov, PhD
>>>> > Home: 610-355-0919
>>>> > Cell: 610-517-5586
>>>> > email: spanayo...@msn.com
>>>> > spanayo...@outlook.com
>>>> > spanayo...@comcast.net
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> www.openkb.info
>>>>
>>>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
>
>
