I drastically increased the memory:
 
spark.executor.memory = 50g
spark.driver.memory = 8g
spark.driver.maxResultSize = 8g
spark.yarn.executor.memoryOverhead = 768
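
(For reference, a minimal sketch of setting these programmatically in PySpark, assuming the context is created fresh; note that spark.driver.memory only takes effect if it is set before the driver JVM starts, e.g. via spark-submit or the notebook kernel's launch options:)

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "50g")
        .set("spark.driver.memory", "8g")                    # only effective at driver launch time
        .set("spark.driver.maxResultSize", "8g")
        .set("spark.yarn.executor.memoryOverhead", "768"))   # value is in MB
sc = SparkContext(conf=conf)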
 
I still see executors killed, but this time the memory does not seem to be the 
issue.
The error shown in the Jupyter notebook is:
 
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755
From the NodeManager log corresponding to worker 10.0.0.9:

2016-02-03 17:31:44,917 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
2016-02-03 17:31:44,918 INFO  container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
2016-02-03 17:31:44,947 INFO  container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
2016-02-03 17:31:44,951 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:31:45,686 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011

Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is
far below 50g, and then the container suddenly gets killed!?!
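
(For reference, the 51 GB limit reported in the log below looks like the executor container size YARN computes, i.e.:

  spark.executor.memory + spark.yarn.executor.memoryOverhead
    = 50 GB + 768 MB ≈ 51 GB

so the container was killed long before it came anywhere near that limit.)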


 


2016-02-03 17:33:17,350 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
2016-02-03 17:33:17,613 INFO  container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
2016-02-03 17:33:17,613 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
2016-02-03 17:33:17,629 WARN  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
2016-02-03 17:33:17,667 INFO  container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-02-03 17:33:17,669 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root       OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS       APPID=application_1454509557526_0014     CONTAINERID=container_1454509557526_0014_01_000093
2016-02-03 17:33:17,670 INFO  container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-02-03 17:33:17,670 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
2016-02-03 17:33:17,671 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
2016-02-03 17:33:17,671 INFO  containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
2016-02-03 17:33:17,671 INFO  yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
2016-02-03 17:33:20,351 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:33:20,383 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
2016-02-03 17:33:22,627 INFO  nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]
 
I'd appreciate any suggestions.

Thanks,


Stefan Panayotov, PhD 
Home: 610-355-0919 
Cell: 610-517-5586 
email: spanayo...@msn.com 
spanayo...@outlook.com 
spanayo...@comcast.net

 
Date: Tue, 2 Feb 2016 15:40:10 -0800
Subject: Re: Spark 1.5.2 memory error
From: openkbi...@gmail.com
To: spanayo...@msn.com
CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org

Look at part #3 in the blog below:
http://www.openkb.info/2015/06/resource-allocation-configurations-for.html

You may want to increase the executor memory, not just the 
spark.yarn.executor.memoryOverhead.
On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
For the memoryOverhead I have the default of 10% of 16g, and the Spark version is 1.5.2.

Stefan Panayotov, PhD
Sent from Outlook Mail for Windows 10 phone
From: Ted Yu
Sent: Tuesday, February 2, 2016 4:52 PM
To: Jakob Odersky
Cc: Stefan Panayotov; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

What value do you use for spark.yarn.executor.memoryOverhead? Please see
https://spark.apache.org/docs/latest/running-on-yarn.html for a description of
the parameter.

Which Spark release are you using?

Cheers

On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:
Can you share some code that produces the error? It is probably not
due to spark but rather the way data is handled in the user code.
Does your code call any reduceByKey actions? These are often a source
for OOM errors.
On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> Hi Guys,
>
> I need help with Spark memory errors when executing ML pipelines.
> The error that I see is:
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
> 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
> 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
> 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
> java.lang.OutOfMemoryError: Java heap space
>        at java.util.Arrays.copyOf(Arrays.java:2271)
>        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:745)
> 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
> 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
> java.lang.OutOfMemoryError: Java heap space
>        at java.util.Arrays.copyOf(Arrays.java:2271)
>        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:745)
> 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
> 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
>
> And …..
>
> 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
> 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
> 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
>
> I'll really appreciate any help here.
>
> Thank you,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net

-- 
Thanks,
www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
