[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaushal Prajapati updated SPARK-22458:
--------------------------------------
    Description: 
We have been using Spark 2.1 for the last 6 months to run multiple Spark jobs, each running about 15 hours over 50+ TB of source data, successfully with the configuration below.

spark.master                                      yarn
spark.driver.cores                                10
spark.driver.maxResultSize                        5g
spark.driver.memory                               20g
spark.executor.cores                              5
spark.executor.extraJavaOptions                   -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=60000 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions                     *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances                          30
spark.executor.memory                             30g
*spark.kryoserializer.buffer.max                   512m*
spark.network.timeout                             12000s
spark.serializer                                  org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs                 false
spark.sql.catalogImplementation                   hive
spark.sql.shuffle.partitions                      5000
spark.yarn.driver.memoryOverhead                  1536
spark.yarn.executor.memoryOverhead                4096
spark.core.connection.ack.wait.timeout            600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000
spark.dynamicAllocation.executorIdleTimeout       30000s
spark.dynamicAllocation.enabled                   true
spark.hadoop.yarn.timeline-service.enabled        false
spark.shuffle.service.enabled                     true
spark.yarn.am.extraJavaOptions                    -Dhdp.version=2.5.3.0-37 *-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*
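
As a sanity check, the following spark-shell sketch (a hypothetical diagnostic of ours, not part of the job itself) can confirm that the highlighted direct-memory options actually reach both the driver and the executor JVMs:

{code}
// Hypothetical diagnostic sketch, run from spark-shell on the same cluster.
// It only reads back JVM arguments and Spark conf; the values above are assumed to be in effect.
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// JVM options the driver was really started with:
ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
  .filter(_.contains("DirectMemory")).foreach(println)

// What Spark will append to executor JVM command lines:
println(spark.conf.getOption("spark.executor.extraJavaOptions").getOrElse("<unset>"))

// The same check inside one executor JVM:
spark.sparkContext.parallelize(Seq(1), 1).map { _ =>
  ManagementFactory.getRuntimeMXBean.getInputArguments.asScala
    .filter(_.contains("DirectMemory")).mkString(" ")
}.collect().foreach(println)
{code}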

Recently we upgraded from Spark 2.1 to Spark 2.2 to pick up fixes from the latest version, but we started hitting a direct-buffer OutOfMemoryError and executors exceeding their memoryOverhead limits. We tried tweaking several properties to fix this, but the issue persists. The relevant information is shared below.

Please let me know if any other details are required.
                
Snapshot of the DirectMemory error stack trace:
{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxxxxxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxxxxxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
        at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:530)
        at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:484)
        at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
        at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:67)
        at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
        at io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:425)
        at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:299)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
        at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
        at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        ... 1 more
{code}
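
For reference, the "max: 1073741824" in this error is exactly 1 GiB (1024 * 1024 * 1024 bytes). The allocation is rejected by Netty's internal counter (PlatformDependent.incrementMemoryCounter), and that 1 GiB cap does not match the 2 GiB -XX:MaxDirectMemorySize we pass to executors, so the limit being hit appears to be Netty's own rather than the JVM's. A small, hypothetical spark-shell sketch to read the relevant settings from a live executor (the single-task trick and names are ours, not from the job):

{code}
// Hypothetical diagnostic sketch: report direct-memory settings from one executor.
import java.lang.management.{BufferPoolMXBean, ManagementFactory}
import scala.collection.JavaConverters._

val oneGiB = 1024L * 1024 * 1024
assert(oneGiB == 1073741824L)  // the "max" reported in the OutOfDirectMemoryError

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
spark.sparkContext.parallelize(Seq(1), 1).map { _ =>
  val nettyProp = System.getProperty("io.netty.maxDirectMemory", "<unset>")
  val directPool = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
    .find(_.getName == "direct")
  s"io.netty.maxDirectMemory=$nettyProp, " +
    s"JDK direct pool used=${directPool.map(_.getMemoryUsed).getOrElse(-1L)} bytes"
}.collect().foreach(println)
{code}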

If I remove the Netty configuration above, I get the error below.

Snapshot of the exceeding-memory-overhead stack trace:

{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3372 in stage 5.0 failed 4 times, most recent failure: Lost task 3372.3 in stage 5.0 (TID 19534, dedwfprshd006.de.xxxxxxx.com, executor 125): ExecutorLostFailure (executor 125 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.1 GB of 34 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
        ... 49 more
{code}
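
The 34 GB container limit in this message lines up with the configuration above: spark.executor.memory (30g) plus spark.yarn.executor.memoryOverhead (4096 MiB) is 34816 MiB, which YARN reports as 34 GB, and YARN's physical-memory check covers the whole executor process, including direct buffers. A small worked sketch of the arithmetic (the Spark 2.2 default overhead formula shown for comparison is our assumption):

{code}
// Worked arithmetic: the container size YARN enforces for these executors.
val executorHeapMiB     = 30 * 1024   // spark.executor.memory = 30g
val explicitOverheadMiB = 4096        // spark.yarn.executor.memoryOverhead
val defaultOverheadMiB  = math.max(384, (0.10 * executorHeapMiB).toInt)  // assumed 2.2 default if unset

val containerMiB = executorHeapMiB + explicitOverheadMiB
println(s"requested container = $containerMiB MiB = ${containerMiB / 1024.0} GiB")  // 34816 MiB = 34.0 GiB
println(s"default overhead would have been $defaultOverheadMiB MiB")                // 3072 MiB
{code}

So the executors ran roughly 3 GB over that budget (37.1 GB used). Raising spark.yarn.executor.memoryOverhead (or lowering spark.executor.memory) by at least that margin is the obvious knob the error message suggests, though we have not yet found a value that makes the job stable.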





> OutOfDirectMemoryError with Spark 2.2
> -------------------------------------
>
>                 Key: SPARK-22458
>                 URL: https://issues.apache.org/jira/browse/SPARK-22458
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, SQL, YARN
>    Affects Versions: 2.2.0
>            Reporter: Kaushal Prajapati
>            Priority: Blocker
>


