[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen closed SPARK-22458.
-----------------------------

> OutOfDirectMemoryError with Spark 2.2
> -------------------------------------
>
>                 Key: SPARK-22458
>                 URL: https://issues.apache.org/jira/browse/SPARK-22458
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, SQL, YARN
>    Affects Versions: 2.2.0
>            Reporter: Kaushal Prajapati
>            Priority: Blocker
>
> For the last 6 months we had been using Spark 2.1 to run multiple Spark jobs, each about 15 hours long over 50+ TB of source data, successfully with the configuration below.
> {quote}spark.master yarn
> spark.driver.cores 10
> spark.driver.maxResultSize 5g
> spark.driver.memory 20g
> spark.executor.cores 5
> spark.executor.extraJavaOptions *-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=60000 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances 30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max 512m*
> spark.network.timeout 12000s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation hive
> spark.sql.shuffle.partitions 5000
> spark.yarn.driver.memoryOverhead 1536
> spark.yarn.executor.memoryOverhead 4096
> spark.core.connection.ack.wait.timeout 600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize 524288000
> spark.dynamicAllocation.executorIdleTimeout 30000s
> spark.dynamicAllocation.enabled true
> spark.hadoop.yarn.timeline-service.enabled false
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions *-Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2
to pick up the fixes in the latest version. But we started hitting a DirectBuffer out-of-memory error and the "exceeding memory limits for executor memoryOverhead" issue. We tried tweaking multiple properties to fix it, but the issue persists. Relevant information is shared below.
> Please let me know if any other details are required.
>
> Snapshot of the DirectMemory error stack trace:
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxxxxxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxxxxxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
>         at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:108)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
>         at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:530)
>         at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:484)
>         at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
>         at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:67)
>         at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
>         at
> io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:425)
>         at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:299)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>         at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>         at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         ... 1 more
> {code}
> If I remove the above netty configuration, I get the error below instead.
> Snapshot of the memory-overhead-exceeded stack trace:
>
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3372 in stage 5.0 failed 4 times, most recent failure: Lost task 3372.3 in stage 5.0 (TID 19534, dedwfprshd006.de.xxxxxxx.com, executor 125): ExecutorLostFailure (executor 125 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.1 GB of 34 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>         at scala.Option.foreach(Option.scala:257)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
>         ... 49 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
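A note on the arithmetic behind the YARN message above: the "34 GB" limit is spark.executor.memory (30g) plus spark.yarn.executor.memoryOverhead (4096 MB), and the direct buffers netty allocates live outside the JVM heap, so they count against the container's physical memory without showing up in heap usage. A rough sketch of that budget (the helper functions and the off-heap figures are illustrative, not Spark or YARN APIs):

```python
# Rough sketch of the YARN container memory budget described in this ticket.
# Helper names and the sample off-heap numbers are illustrative only.

def container_limit_mb(executor_memory_mb: int, memory_overhead_mb: int) -> int:
    """YARN kills the container once physical memory exceeds heap + overhead."""
    return executor_memory_mb + memory_overhead_mb

def physical_usage_mb(heap_mb: int, direct_mb: int, other_offheap_mb: int) -> int:
    """Direct (netty) buffers and other off-heap allocations sit outside the
    JVM heap but still count toward the container's physical memory."""
    return heap_mb + direct_mb + other_offheap_mb

# spark.executor.memory=30g, spark.yarn.executor.memoryOverhead=4096
limit = container_limit_mb(30 * 1024, 4096)  # 34816 MB, the "34 GB" in the trace

# If the heap is fully used and direct buffers plus other off-heap usage
# (hypothetical 2 GB + 3 GB here) outgrow the overhead allowance, the
# container is killed even though the heap itself never overflows.
usage = physical_usage_mb(30 * 1024, 2048, 3072)
print(limit, usage, usage > limit)
```

This is why bounding netty with -Dio.netty.maxDirectMemory trades one failure for the other: capping direct memory triggers the OutOfDirectMemoryError, while uncapping it lets the container exceed the YARN limit.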