I drastically increased the memory:

spark.executor.memory = 50g
spark.driver.memory = 8g
spark.driver.maxResultSize = 8g
spark.yarn.executor.memoryOverhead = 768

I still see executors getting killed, but this time memory does not seem to be the issue. The error in the Jupyter notebook is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755

From the NodeManager log corresponding to worker 10.0.0.9:
2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
2016-02-03 17:31:44,918 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
2016-02-03 17:31:44,947 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011

Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is far below 50g, and then the container suddenly gets killed!?!
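For reference, the 51 GB physical limit that shows up in the log below appears to line up with the settings above: 50g of executor memory plus the 768 MB of spark.yarn.executor.memoryOverhead (a rough sanity check only; the helper name here is illustrative, not a Spark API):

```python
def container_limit_mb(executor_memory_gb, overhead_mb):
    # YARN's enforced physical limit for an executor container:
    # spark.executor.memory plus spark.yarn.executor.memoryOverhead.
    return executor_memory_gb * 1024 + overhead_mb

container_limit_mb(50, 768)  # 51968 MB, which YARN reports rounded as "51 GB"
```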
2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
2016-02-03 17:33:17,613 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
2016-02-03 17:33:17,667 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1454509557526_0014 CONTAINERID=container_1454509557526_0014_01_000093
2016-02-03 17:33:17,670 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-02-03 17:33:17,670 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
2016-02-03 17:33:17,671 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]

I'll appreciate any suggestions.

Thanks,

Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net

Date: Tue, 2 Feb 2016 15:40:10 -0800
Subject: Re: Spark 1.5.2 memory error
From: openkbi...@gmail.com
To: spanayo...@msn.com
CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org

Look at part #3 in the blog below:
http://www.openkb.info/2015/06/resource-allocation-configurations-for.html
You may want to increase the executor memory, not just spark.yarn.executor.memoryOverhead.

On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> wrote:

For the memoryOverhead I have the default of 10% of 16g, and the Spark version is 1.5.2.

Stefan Panayotov, PhD
Sent from Outlook Mail for Windows 10 phone

From: Ted Yu
Sent: Tuesday, February 2, 2016 4:52 PM
To: Jakob Odersky
Cc: Stefan Panayotov; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

What value do you use for spark.yarn.executor.memoryOverhead ?
Please see https://spark.apache.org/docs/latest/running-on-yarn.html for a description of the parameter.
Which Spark release are you using ?

Cheers

On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:

Can you share some code that produces the error? It is probably not due to Spark but rather the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source of OOM errors.

On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> wrote:
> Hi Guys,
>
> I need help with Spark memory errors when executing ML pipelines.
> The error that I see is:
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
> 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
> 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
> 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
> 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
> 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.
>
> And …..
>
> 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
> 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
> 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
>
> I'll really appreciate any help here.
>
> Thank you,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Thanks,
www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
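Tying the numbers in the quoted logs together: the container Spark asks YARN for is executor memory plus spark.yarn.executor.memoryOverhead, and a container whose process tree (JVM heap, off-heap, Python workers) exceeds that limit is killed, which is exit code 143 / exit status -104 above. A rough model in Python; the function name is illustrative, and the 10% factor with a 384 MB floor is taken from the thread's description of the default, which may differ across Spark versions:

```python
def yarn_request_mb(executor_memory_mb, overhead_mb=None,
                    factor=0.10, floor_mb=384):
    """Approximate container size Spark requests from YARN:
    executor memory plus spark.yarn.executor.memoryOverhead,
    which defaults to a fraction of executor memory with a floor
    when not set explicitly (factor/floor are assumptions here)."""
    if overhead_mb is None:
        overhead_mb = max(floor_mb, int(factor * executor_memory_mb))
    return executor_memory_mb + overhead_mb

# The log above requests 16768 MB "including 384 MB overhead":
yarn_request_mb(16 * 1024, overhead_mb=384)  # 16768
```

With the overhead left at a 10% default instead, the same 16 GB executor would request roughly 18 GB, which is the direction of OpenKB's advice: give the container real headroom rather than only nudging the overhead.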