[ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22458.
-----------------------------

> OutOfDirectMemoryError with Spark 2.2
> -------------------------------------
>
>                 Key: SPARK-22458
>                 URL: https://issues.apache.org/jira/browse/SPARK-22458
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, SQL, YARN
>    Affects Versions: 2.2.0
>            Reporter: Kaushal Prajapati
>            Priority: Blocker
>
> For the last 6 months we have been using Spark 2.1 to run multiple Spark jobs 
> that each run for about 15 hours over 50+ TB of source data, successfully, 
> with the configuration below. 
> {quote}spark.master                                      yarn
> spark.driver.cores                                10
> spark.driver.maxResultSize                        5g
> spark.driver.memory                               20g
> spark.executor.cores                              5
> spark.executor.extraJavaOptions                   *-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=60000 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions                     *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances                          30
> spark.executor.memory                             30g
> *spark.kryoserializer.buffer.max                   512m*
> spark.network.timeout                             12000s
> spark.serializer                                  org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs                 false
> spark.sql.catalogImplementation                   hive
> spark.sql.shuffle.partitions                      5000
> spark.yarn.driver.memoryOverhead                  1536
> spark.yarn.executor.memoryOverhead                4096
> spark.core.connection.ack.wait.timeout            600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout       30000s
> spark.dynamicAllocation.enabled                   true
> spark.hadoop.yarn.timeline-service.enabled        false
> spark.shuffle.service.enabled                     true
> spark.yarn.am.extraJavaOptions                    *-Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes 
> in the latest version. Since the upgrade we have been hitting direct buffer 
> out-of-memory errors and executors exceeding their memoryOverhead limits. We 
> tried tweaking several properties to fix this, but the issue persists. 
> Relevant information is shared below.
> Please let me know if any other details are required.
>               
> Snapshot of the direct memory error stack trace:
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxxxxxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxxxxxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
>         at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:108)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
>         at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:530)
>         at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:484)
>         at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
>         at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:67)
>         at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
>         at io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:425)
>         at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:299)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>         at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>         at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         ... 1 more
> {code}
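
The numbers in the OutOfDirectMemoryError message can be checked by hand. The sketch below is only an illustration: it parses the message text quoted in the trace and computes how much direct memory was left under the reported limit. The function name `parse_direct_memory_error` is hypothetical and is not part of Spark or Netty. The trace itself does not say which setting produced the limit, so the code makes no claim about that.

```python
import re

# Message text copied from the stack trace in the issue description.
MESSAGE = (
    "failed to allocate 65536 byte(s) of direct memory "
    "(used: 1073699840, max: 1073741824)"
)

# Hypothetical helper (not a Spark or Netty API): pulls the three byte
# counts out of a message shaped like the one above.
def parse_direct_memory_error(message):
    match = re.search(
        r"failed to allocate (\d+) byte\(s\) of direct memory "
        r"\(used: (\d+), max: (\d+)\)",
        message,
    )
    if match is None:
        raise ValueError("message does not look like an OutOfDirectMemoryError")
    requested, used, limit = (int(group) for group in match.groups())
    return requested, used, limit

requested, used, limit = parse_direct_memory_error(MESSAGE)
headroom = limit - used          # bytes still free under the limit
limit_mib = limit / 2**20        # limit expressed in MiB

print(f"limit:     {limit_mib:.0f} MiB")
print(f"headroom:  {headroom} bytes")
print(f"requested: {requested} bytes")
print(f"request fits: {requested <= headroom}")
```

The reported limit is exactly 1024 MiB, and the 65536-byte request is larger than the bytes that remained, which is why the allocation failed.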
> If I remove the Netty configuration above, I get the error below instead.
> Snapshot of the stack trace for exceeding the memory overhead:
>               
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3372 in stage 5.0 failed 4 times, most recent failure: Lost task 3372.3 in stage 5.0 (TID 19534, dedwfprshd006.de.xxxxxxx.com, executor 125): ExecutorLostFailure (executor 125 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.1 GB of 34 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>         at scala.Option.foreach(Option.scala:257)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
>         ... 49 more
> {code}
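
The "34 GB" limit in the YARN message lines up with the posted configuration: the executor heap (spark.executor.memory 30g) plus spark.yarn.executor.memoryOverhead (4096 MB). The sketch below just redoes that arithmetic with the values from the description. It assumes YARN did not round the container size up any further, which matches the 34 GB reported here; the variable names are illustrative only.

```python
# Values taken from the configuration and the YARN message in the report.
executor_memory_mib = 30 * 1024    # spark.executor.memory 30g
memory_overhead_mib = 4096         # spark.yarn.executor.memoryOverhead
used_gib = 37.1                    # "37.1 GB of 34 GB physical memory used"

# Assumption: the container limit is heap plus overhead, with no extra
# rounding applied by the YARN scheduler.
container_limit_gib = (executor_memory_mib + memory_overhead_mib) / 1024
overshoot_gib = used_gib - container_limit_gib

print(f"container limit: {container_limit_gib:.1f} GB")
print(f"reported usage:  {used_gib:.1f} GB")
print(f"overshoot:       {overshoot_gib:.1f} GB")
```

Under these assumptions the executor was about 3.1 GB over its container limit when YARN killed it, so the memory used outside the 30 GB heap was well above the 4096 MB overhead allowance.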



