To the OP:

How many cores do you assign per executor? Maybe reducing that number will
give each task running on that executor a larger share of the executor's
memory. Others, please comment if that makes sense.
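
For example, dropping executor cores while keeping executor memory fixed
raises each task's share of the heap. A minimal sketch (the values below are
placeholders, not tested settings):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: same 16g heap, but only 2 concurrent tasks
    // per executor instead of 6, so each task gets a larger share.
    val conf = new SparkConf()
      .setAppName("fewer-cores-per-executor")
      .set("spark.executor.memory", "16g")  // unchanged
      .set("spark.executor.cores", "2")     // reduced from 6
    val sc = new SparkContext(conf)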



On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel <npa...@xactlycorp.com> wrote:

> I know it's a strong word, but when I have had a case open about this with
> MapR and Databricks for a month and their only solution is to switch to
> DataFrame, it frustrates you. I know the DataFrame/SQL Catalyst layer has
> internal optimizations, but switching requires a lot of code change. I
> think there is something fundamentally wrong (or at least different from
> Hadoop) in the framework that keeps it from doing robust memory management.
> I know my job is a memory hog: it does a groupBy, performs combinatorics on
> the reducer side, and uses additional data structures at the task level.
> Maybe Spark is running multiple heavy tasks on the same executor and
> collectively they cause the OOM. But suggesting DataFrame is NOT a solution
> for me (or for most others who have already invested time in RDDs and love
> the type safety they provide). I'm not even sure that switching to
> DataFrame would solve the issue.
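>
> (If the per-key combinatorics can be restated as an incremental fold, one
> RDD-level option that avoids materializing whole groups is aggregateByKey.
> A minimal, hypothetical sketch, not my actual job:
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val sc = new SparkContext(new SparkConf().setAppName("agg-sketch"))
>   val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
>
>   // groupByKey buffers every value of a key in a single task's heap:
>   //   pairs.groupByKey().mapValues(vs => combinatorics(vs))
>   // aggregateByKey keeps only a running aggregate per key instead:
>   val sums = pairs.aggregateByKey(0)(
>     (acc, v) => acc + v,  // fold one value into the partition-local accumulator
>     (a, b) => a + b       // merge accumulators across partitions
>   )
>
> though whether that applies depends on whether the combinatorics decompose
> that way.)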
>
> On Wed, Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
>
>> Nirav,
>>
>> Sorry to hear about your experience with Spark; however, "sucks" is a very
>> strong word. Many organizations are processing a lot more than 150GB of
>> data with Spark.
>>
>>
>>
>> Mohammed
>>
>> Author: Big Data Analytics with Spark
>> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>>
>>
>>
>> *From:* Nirav Patel [mailto:npa...@xactlycorp.com]
>> *Sent:* Wednesday, February 3, 2016 11:31 AM
>> *To:* Stefan Panayotov
>> *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
>>
>> *Subject:* Re: Spark 1.5.2 memory error
>>
>>
>>
>> Hi Stefan,
>>
>>
>>
>> Welcome to the OOM (heap space) club. I have been struggling with similar
>> errors (OOM, and YARN killing executors), with the job failing or going
>> into retry loops. I bet the same job would run perfectly fine with fewer
>> resources as a Hadoop MapReduce program; I have tested that for my program
>> and it does work.
>>
>>
>>
>> Bottom line from my experience: Spark sucks at memory management when a
>> job is processing a large (not huge) amount of data. It's failing for me
>> with 10 executors of 16gb each, 6 threads per executor, and the data it's
>> processing is only 150GB (1 billion rows for me). The same job works
>> perfectly fine with 1 million rows.
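>>
>> Rough math on those numbers: 6 threads sharing a 16gb executor leaves at
>> most 16 GB / 6 ≈ 2.7 GB per concurrently running task, and Spark's own
>> storage and shuffle fractions claim part of that heap before the task code
>> sees any of it.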
>>
>>
>>
>> Hope that saves you some trouble.
>>
>>
>>
>> Nirav
>>
>> On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com>
>> wrote:
>>
>> I drastically increased the memory:
>>
>> spark.executor.memory = 50g
>> spark.driver.memory = 8g
>> spark.driver.maxResultSize = 8g
>> spark.yarn.executor.memoryOverhead = 768
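>>
>> For reference, the same settings expressed as a minimal SparkConf sketch
>> (the keys also work as --conf flags on spark-submit; note that
>> spark.driver.memory only takes effect if applied before the driver JVM
>> starts, e.g. via spark-submit or spark-defaults.conf):
>>
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     .set("spark.executor.memory", "50g")
>>     .set("spark.driver.memory", "8g")
>>     .set("spark.driver.maxResultSize", "8g")
>>     .set("spark.yarn.executor.memoryOverhead", "768")  // value is in MB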
>>
>> I still see executors killed, but this time the memory does not seem to
>> be the issue.
>> The error on the Jupyter notebook is:
>>
>>
>> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755
>>
>>
>> From the NodeManager log corresponding to worker 10.0.0.9:
>>
>>
>> 2016-02-03 17:31:44,917 INFO  yarn.YarnShuffleService
>> (YarnShuffleService.java:initializeApplication(129)) - Initializing
>> application application_1454509557526_0014
>>
>>
>>
>> 2016-02-03 17:31:44,918 INFO  container.ContainerImpl
>> (ContainerImpl.java:handle(1131)) - Container
>> container_1454509557526_0014_01_000093 transitioned from LOCALIZING to
>> LOCALIZED
>>
>>
>>
>> 2016-02-03 17:31:44,947 INFO  container.ContainerImpl
>> (ContainerImpl.java:handle(1131)) - Container
>> container_1454509557526_0014_01_000093 transitioned from LOCALIZED to
>> RUNNING
>>
>>
>>
>> 2016-02-03 17:31:44,951 INFO  nodemanager.DefaultContainerExecutor
>> (DefaultContainerExecutor.java:buildCommandExecutor(267)) -
>> launchContainer: [bash,
>> /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
>>
>>
>>
>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>> (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for
>> container_1454509557526_0014_01_000093
>>
>>
>>
>> 2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl
>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>> container_1454509557526_0014_01_000011
>>
>> Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which
>> is far below 50g, and then the container suddenly gets killed!?!
>>
>> 2016-02-03 17:33:17,350 INFO  monitor.ContainersMonitorImpl
>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962
>> for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB
>> physical memory used; 52.8 GB of 107.1 GB virtual memory used
>>
>>
>>
>> 2016-02-03 17:33:17,613 INFO  container.ContainerImpl
>> (ContainerImpl.java:handle(1131)) - Container
>> container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
>>
>>
>>
>> 2016-02-03 17:33:17,613 INFO  launcher.ContainerLaunch
>> (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container
>> container_1454509557526_0014_01_000093
>>
>>
>>
>> 2016-02-03 17:33:17,629 WARN  nodemanager.DefaultContainerExecutor
>> (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from
>> container container_1454509557526_0014_01_000093 is : 143
>>
>>
>>
>> 2016-02-03 17:33:17,667 INFO  container.ContainerImpl
>> (ContainerImpl.java:handle(1131)) - Container
>> container_1454509557526_0014_01_000093 transitioned from KILLING to
>> CONTAINER_CLEANEDUP_AFTER_KILL
>>
>>
>>
>> 2016-02-03 17:33:17,669 INFO  nodemanager.NMAuditLogger
>> (NMAuditLogger.java:logSuccess(89)) - USER=root       OPERATION=Container
>> Finished - Killed    TARGET=ContainerImpl RESULT=SUCCESS
>> APPID=application_1454509557526_0014
>> CONTAINERID=container_1454509557526_0014_01_000093
>>
>>
>>
>> 2016-02-03 17:33:17,670 INFO  container.ContainerImpl
>> (ContainerImpl.java:handle(1131)) - Container
>> container_1454509557526_0014_01_000093 transitioned from
>> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
>>
>>
>>
>> 2016-02-03 17:33:17,670 INFO  application.ApplicationImpl
>> (ApplicationImpl.java:transition(347)) - Removing
>> container_1454509557526_0014_01_000093 from application
>> application_1454509557526_0014
>>
>>
>>
>> 2016-02-03 17:33:17,671 INFO  logaggregation.AppLogAggregatorImpl
>> (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering
>> container container_1454509557526_0014_01_000093 for log-aggregation
>>
>>
>>
>> 2016-02-03 17:33:17,671 INFO  containermanager.AuxServices
>> (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId
>> application_1454509557526_0014
>>
>>
>>
>> 2016-02-03 17:33:17,671 INFO  yarn.YarnShuffleService
>> (YarnShuffleService.java:stopContainer(161)) - Stopping container
>> container_1454509557526_0014_01_000093
>>
>>
>>
>> 2016-02-03 17:33:20,351 INFO  monitor.ContainersMonitorImpl
>> (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for
>> container_1454509557526_0014_01_000093
>>
>>
>>
>> 2016-02-03 17:33:20,383 INFO  monitor.ContainersMonitorImpl
>> (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727
>> for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB
>> physical memory used; 1.7 GB of 3.1 GB virtual memory used
>>
>> 2016-02-03 17:33:22,627 INFO  nodemanager.NodeStatusUpdaterImpl
>> (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529))
>> - Removed completed containers from NM context:
>> [container_1454509557526_0014_01_000093]
>>
>> I'd appreciate any suggestions.
>>
>> Thanks,
>>
>> *Stefan Panayotov, PhD *
>> *Home*: 610-355-0919
>> *Cell*: 610-517-5586
>> *email*: spanayo...@msn.com
>> spanayo...@outlook.com
>> spanayo...@comcast.net
>>
>>
>>
>> ------------------------------
>>
>> Date: Tue, 2 Feb 2016 15:40:10 -0800
>> Subject: Re: Spark 1.5.2 memory error
>> From: openkbi...@gmail.com
>> To: spanayo...@msn.com
>> CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org
>>
>>
>>
>> Look at part #3 in the blog below:
>>
>> http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
>>
>>
>>
>> You may want to increase the executor memory, not just the
>> spark.yarn.executor.memoryOverhead.
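>>
>> Roughly, YARN sizes each container as executor memory plus that overhead,
>> so raising only the overhead leaves the executor heap itself unchanged.
>> For example, from the YarnAllocator lines quoted further down:
>>
>>   container size = spark.executor.memory + memoryOverhead
>>                  = 16384 MB + 384 MB
>>                  = 16768 MB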
>>
>>
>>
>> On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com>
>> wrote:
>>
>> For the memoryOverhead I have the default of 10% of 16g, and the Spark
>> version is 1.5.2.
>>
>>
>>
>> Stefan Panayotov, PhD
>> Sent from Outlook Mail for Windows 10 phone
>>
>>
>>
>>
>> *From: *Ted Yu <yuzhih...@gmail.com>
>> *Sent: *Tuesday, February 2, 2016 4:52 PM
>> *To: *Jakob Odersky <ja...@odersky.com>
>> *Cc: *Stefan Panayotov <spanayo...@msn.com>; user@spark.apache.org
>> *Subject: *Re: Spark 1.5.2 memory error
>>
>>
>>
>> What value do you use for spark.yarn.executor.memoryOverhead ?
>>
>>
>>
>> Please see https://spark.apache.org/docs/latest/running-on-yarn.html for
>> description of the parameter.
>>
>>
>>
>> Which Spark release are you using ?
>>
>>
>>
>> Cheers
>>
>>
>>
>> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>
>> Can you share some code that produces the error? It is probably not due
>> to Spark but rather to the way data is handled in the user code. Does your
>> code call any reduceByKey actions? These are often a source of OOM errors.
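>>
>> For instance (a hypothetical pattern, not necessarily yours): a wide
>> aggregation over too few partitions concentrates a lot of shuffle state in
>> each task, and explicitly raising the partition count spreads it out:
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>
>>   val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))
>>   // Hypothetical input path, purely for illustration.
>>   val words = sc.textFile("hdfs:///some/input").flatMap(_.split(" "))
>>
>>   val counts = words.map(w => (w, 1)).reduceByKey(_ + _)       // default partition count
>>   val spread = words.map(w => (w, 1)).reduceByKey(_ + _, 400)  // more, smaller tasks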
>>
>>
>> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com>
>> wrote:
>> > Hi Guys,
>> >
>> > I need help with Spark memory errors when executing ML pipelines.
>> > The error that I see is:
>> >
>> >
>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
>> >
>> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
>> >
>> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
>> >
>> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
>> >
>> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
>> >
>> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
>> >
>> >
>> > java.lang.OutOfMemoryError: Java heap space
>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>> >        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>> >        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>> >        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >        at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>> >
>> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
>> >
>> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
>> >
>> >
>> > java.lang.OutOfMemoryError: Java heap space
>> >        at java.util.Arrays.copyOf(Arrays.java:2271)
>> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>> >        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>> >        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>> >        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >        at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>> >
>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
>> >
>> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>> >
>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
>> >
>> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
>> >
>> > And ...
>> >
>> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
>> >
>> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
>> >
>> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>> >
>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>> >
>> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
>> >
>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
>> >
>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
>> >
>> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
>> >
>> >
>> > I'd really appreciate any help here.
>> >
>> > Thank you,
>> >
>> > Stefan Panayotov, PhD
>> > Home: 610-355-0919
>> > Cell: 610-517-5586
>> > email: spanayo...@msn.com
>> > spanayo...@outlook.com
>> > spanayo...@comcast.net
>> >
>>
>>
>> --
>>
>> Thanks,
>>
>> www.openkb.info
>>
>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>
>
>

