Hello. I am working to move our system from Spark 2.1.0 to Spark 2.3.0. Our system is running on Spark managed via Yarn. During the course of the move I mirrored the settings to our new cluster. However, on the Spark 2.3.0 cluster with the same resource allocation I am seeing a number of executors die due to OOM:
18/07/16 17:23:06 ERROR YarnClusterScheduler: Lost executor 5 on wn80: Container killed by YARN for exceeding memory limits. 22.0 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. I increased spark.driver.memoryOverhead and spark.executor.memoryOverhead from the default (384) to 2048. I went ahead and disabled vmem and pmem Yarn checks on the cluster. With that disabled I see the following error: Caused by: java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1466) at io.netty.buffer.AbstractByteBuf.nioBuffer(AbstractByteBuf.java:1203) at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:140) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123) at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355) at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:802) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:222) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:146) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) Looking at GC: [Eden: 16.0M(8512.0M)->0.0B(8484.0M) Survivors: 4096.0K->4096.0K Heap: 8996.7M(20.0G)->8650.3M(20.0G)] [Times: user=0.03 sys=0.01, real=0.01 secs] 794.949: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 401255000 bytes] 794.949: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 401255000 bytes, attempted expansion amount: 402653184 bytes] 794.949: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded] 794.949: [Full GC (Allocation Failure) 801.766: [SoftReference, 0 refs, 0.0000359 secs]801.766: [WeakReference, 1604 refs, 0.0001191 secs]801.766: [FinalReference, 1180 refs, 0.0000882 secs]801.766: [PhantomReference, 0 refs, 12 refs, 0.0000117 secs]801.766: [JNI Weak Reference, 0.0000180 secs] 8650M->7931M(20G), 17.4838808 secs] [Eden: 0.0B(8484.0M)->0.0B(9588.0M) Survivors: 4096.0K->0.0B Heap: 8650.3M(20.0G)->7931.0M(20.0G)], [Metaspace: 52542K->52542K(1095680K)] [Times: user=20.80 sys=1.53, real=17.48 secs] 812.433: [Full GC (Allocation Failure) 817.544: [SoftReference, 112 refs, 0.0000858 secs]817.544: [WeakReference, 1606 refs, 0.0002380 secs]817.545: [FinalReference, 1175 refs, 0.0003358 secs]817.545: [PhantomReference, 0 refs, 12 refs, 0.0000135 secs]817.545: [JNI Weak Reference, 0.0000305 secs] 7931M->7930M(20G), 15.9899953 secs] [Eden: 0.0B(9588.0M)->0.0B(9588.0M) Survivors: 0.0B->0.0B Heap: 7931.0M(20.0G)->7930.8M(20.0G)], [Metaspace: 52542K->52477K(1095680K)] [Times: user=20.88 sys=0.87, real=15.99 secs] 828.425: [GC concurrent-mark-abort] Heap garbage-first heap total 20971520K, used 8465150K [0x00000002c0000000, 0x00000002c040a000, 0x00000007c0000000) region size 4096K, 2 young (8192K), 0 survivors (0K) Metaspace used 52477K, capacity 52818K, committed 54400K, reserved 1095680K class space used 7173K, capacity 7280K, committed 7552K, reserved 1048576K Concurrent marking: 0 init marks: total time = 0.00 s (avg = 0.00 ms). 13 remarks: total time = 0.49 s (avg = 37.77 ms). [std. dev = 21.54 ms, max = 83.71 ms] 13 final marks: total time = 0.03 s (avg = 2.00 ms). [std. dev = 2.89 ms, max = 11.95 ms] 13 weak refs: total time = 0.46 s (avg = 35.77 ms). [std. dev = 22.02 ms, max = 82.16 ms] 13 cleanups: total time = 0.14 s (avg = 10.63 ms). [std. dev = 11.13 ms, max = 42.87 ms] Final counting total time = 0.04 s (avg = 2.96 ms). RS scrub total time = 0.05 s (avg = 3.91 ms). Total stop_world time = 0.63 s. Total concurrent time = 195.51 s ( 194.00 s marking). This appears to show that we're seeing allocated heap of 7930 MB out of 20000 MB (so about half). However, Spark is throwing a Java OOM (out of heap space) error. I validated that we're not using legacy memory management mode. I've tried this against a few applications (some batch, some streaming). Has something changed with memory allocation in Spark 2.3.0 that would cause these issues? Thank you for any help you can provide. Regards, Bryan Jeffrey