Perhaps this is https://issues.apache.org/jira/browse/SPARK-24578? That was reported as a performance issue, not OOMs, but it's in the exact same part of the code, and the change was made to reduce the memory pressure significantly.
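For reference, the change there bounds how much of a large Netty buffer gets copied onto the heap at once when writing a response, instead of materializing the whole CompositeByteBuf in one allocation. A rough sketch of the idea (illustrative only, not the actual Spark source; writeChunked and the 256 KB limit are my own names/values):

import java.nio.ByteBuffer
import io.netty.buffer.ByteBuf

// Copy a (possibly very large) Netty buffer through small, fixed-size
// NIO buffers. For a CompositeByteBuf, nioBuffer(index, length) may
// allocate and copy, but only `len` bytes at a time, so the transient
// heap allocation stays bounded rather than matching the message size.
val ChunkLimit = 256 * 1024 // 256 KB per copy; illustrative value

def writeChunked(buf: ByteBuf, write: ByteBuffer => Unit): Unit = {
  while (buf.isReadable) {
    val len = math.min(buf.readableBytes, ChunkLimit)
    val chunk = buf.nioBuffer(buf.readerIndex, len)
    write(chunk)       // hand one bounded chunk to the channel
    buf.skipBytes(len) // advance past what was just written
  }
}

That lines up with your stack trace: the allocation fails under CompositeByteBuf.nioBuffer inside MessageWithHeader.copyByteBuf, which is exactly the spot that patch touches.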
On Mon, Jul 16, 2018 at 1:43 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Hello.
>
> I am working to move our system from Spark 2.1.0 to Spark 2.3.0. Our
> system is running on Spark managed via Yarn. During the course of the move
> I mirrored the settings to our new cluster. However, on the Spark 2.3.0
> cluster with the same resource allocation I am seeing a number of
> executors die due to OOM:
>
> 18/07/16 17:23:06 ERROR YarnClusterScheduler: Lost executor 5 on wn80:
> Container killed by YARN for exceeding memory limits. 22.0 GB of 22 GB
> physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
> I increased spark.driver.memoryOverhead and spark.executor.memoryOverhead
> from the default (384) to 2048. I went ahead and disabled vmem and pmem
> Yarn checks on the cluster. With that disabled I see the following error:
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>   at io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1466)
>   at io.netty.buffer.AbstractByteBuf.nioBuffer(AbstractByteBuf.java:1203)
>   at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:140)
>   at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
>   at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
>   at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
>   at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>   at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
>   at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
>   at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
>   at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
>   at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
>   at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>   at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:802)
>   at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
>   at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
>   at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831)
>   at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041)
>   at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300)
>   at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:222)
>   at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:146)
>   at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
>   at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>
> Looking at GC:
>
> [Eden: 16.0M(8512.0M)->0.0B(8484.0M) Survivors: 4096.0K->4096.0K Heap: 8996.7M(20.0G)->8650.3M(20.0G)]
> [Times: user=0.03 sys=0.01, real=0.01 secs]
> 794.949: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 401255000 bytes]
> 794.949: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 401255000 bytes, attempted expansion amount: 402653184 bytes]
> 794.949: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
> 794.949: [Full GC (Allocation Failure) 801.766: [SoftReference, 0 refs, 0.0000359 secs]801.766: [WeakReference, 1604 refs, 0.0001191 secs]801.766: [FinalReference, 1180 refs, 0.0000882 secs]801.766: [PhantomReference, 0 refs, 12 refs, 0.0000117 secs]801.766: [JNI Weak Reference, 0.0000180 secs] 8650M->7931M(20G), 17.4838808 secs]
> [Eden: 0.0B(8484.0M)->0.0B(9588.0M) Survivors: 4096.0K->0.0B Heap: 8650.3M(20.0G)->7931.0M(20.0G)], [Metaspace: 52542K->52542K(1095680K)]
> [Times: user=20.80 sys=1.53, real=17.48 secs]
> 812.433: [Full GC (Allocation Failure) 817.544: [SoftReference, 112 refs, 0.0000858 secs]817.544: [WeakReference, 1606 refs, 0.0002380 secs]817.545: [FinalReference, 1175 refs, 0.0003358 secs]817.545: [PhantomReference, 0 refs, 12 refs, 0.0000135 secs]817.545: [JNI Weak Reference, 0.0000305 secs] 7931M->7930M(20G), 15.9899953 secs]
> [Eden: 0.0B(9588.0M)->0.0B(9588.0M) Survivors: 0.0B->0.0B Heap: 7931.0M(20.0G)->7930.8M(20.0G)], [Metaspace: 52542K->52477K(1095680K)]
> [Times: user=20.88 sys=0.87, real=15.99 secs]
> 828.425: [GC concurrent-mark-abort]
> Heap
>  garbage-first heap  total 20971520K, used 8465150K [0x00000002c0000000, 0x00000002c040a000, 0x00000007c0000000)
>   region size 4096K, 2 young (8192K), 0 survivors (0K)
>  Metaspace  used 52477K, capacity 52818K, committed 54400K, reserved 1095680K
>   class space  used 7173K, capacity 7280K, committed 7552K, reserved 1048576K
> Concurrent marking:
>   0 init marks: total time = 0.00 s (avg = 0.00 ms).
>   13 remarks: total time = 0.49 s (avg = 37.77 ms).
>     [std. dev = 21.54 ms, max = 83.71 ms]
>   13 final marks: total time = 0.03 s (avg = 2.00 ms).
>     [std. dev = 2.89 ms, max = 11.95 ms]
>   13 weak refs: total time = 0.46 s (avg = 35.77 ms).
>     [std. dev = 22.02 ms, max = 82.16 ms]
>   13 cleanups: total time = 0.14 s (avg = 10.63 ms).
>     [std. dev = 11.13 ms, max = 42.87 ms]
>   Final counting total time = 0.04 s (avg = 2.96 ms).
>   RS scrub total time = 0.05 s (avg = 3.91 ms).
> Total stop_world time = 0.63 s.
> Total concurrent time = 195.51 s (194.00 s marking).
>
> This appears to show that we're seeing allocated heap of 7930 MB out of
> 20000 MB (so about half). However, Spark is throwing a Java OOM (out of
> heap space) error. I validated that we're not using legacy memory
> management mode. I've tried this against a few applications (some batch,
> some streaming). Has something changed with memory allocation in Spark
> 2.3.0 that would cause these issues?
>
> Thank you for any help you can provide.
>
> Regards,
>
> Bryan Jeffrey
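One more note on the overhead settings quoted above, since the numbers line up: YARN enforces heap plus overhead as the container limit, which is where the "22.0 GB of 22 GB" comes from (20 GB executor heap + 2048 MB overhead). A minimal sketch of that configuration, with values mirroring the thread (the spark.maxRemoteBlockSizeFetchToMem line is a possible mitigation to try while on an affected build -- an assumption on my part, not a confirmed fix):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.memory", "20g")          // heap size shown in the GC log
  .set("spark.executor.memoryOverhead", "2048") // MiB; 2.3.x name for spark.yarn.executor.memoryOverhead
  .set("spark.driver.memoryOverhead", "2048")
  // Assumption: stream large remote blocks to disk instead of buffering
  // them fully in memory, to reduce pressure on big fetches.
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")

val spark = SparkSession.builder().config(conf).getOrCreate()

Overhead has to be known before executors are requested (spark-defaults or spark-submit --conf behaves the same way), since YARN sizes each container up front.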