Perhaps this is https://issues.apache.org/jira/browse/SPARK-24578?

That was reported as a performance issue, not OOMs, but it's in the exact
same part of the code, and the change was to reduce the memory pressure
significantly.
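
For context: the OOM in the stack trace below comes from
MessageWithHeader.copyByteBuf calling nioBuffer() on a CompositeByteBuf,
which allocates a single heap ByteBuffer large enough to hold the whole
buffer.  A rough sketch of the kind of change that JIRA describes
(bounding how much gets copied per write) is below.  This is a sketch
only, not the actual patch; the class name and the 256 KB cap are
illustrative:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    import io.netty.buffer.ByteBuf;

    class BoundedCopySketch {
        // Illustrative cap on how much of the buffer is materialized per write.
        private static final int NIO_BUFFER_LIMIT = 256 * 1024;

        int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOException {
            // nioBuffer(index, length) materializes only the requested region,
            // so the temporary heap copy stays at or below NIO_BUFFER_LIMIT.
            int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
            ByteBuffer buffer = buf.nioBuffer(buf.readerIndex(), length);
            int written = target.write(buffer);
            buf.skipBytes(written);
            return written;
        }
    }

If that is what's happening, serving shuffle/block fetches produces large,
short-lived heap allocations, which would line up with the failed ~400 MB
allocation request in your GC log.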

On Mon, Jul 16, 2018 at 1:43 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
wrote:

> Hello.
>
> I am working to move our system from Spark 2.1.0 to Spark 2.3.0.  Our
> system runs on Spark managed via YARN.  As part of the move I mirrored
> the settings to our new cluster.  However, on the Spark 2.3.0 cluster,
> with the same resource allocation, I am seeing a number of executors die
> due to OOM:
>
> 18/07/16 17:23:06 ERROR YarnClusterScheduler: Lost executor 5 on wn80:
> Container killed by YARN for exceeding memory limits. 22.0 GB of 22 GB
> physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
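> (For reference, the changes described in the next paragraph correspond to
> the following settings; the property names here assume the standard
> Spark 2.3 and Hadoop YARN configuration keys:
>
>     # spark-defaults.conf, or --conf on spark-submit
>     spark.executor.memoryOverhead=2048
>     spark.driver.memoryOverhead=2048
>
>     # yarn-site.xml (as <property> entries): disable container memory checks
>     yarn.nodemanager.pmem-check-enabled=false
>     yarn.nodemanager.vmem-check-enabled=false
> )
>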
> I increased spark.driver.memoryOverhead and spark.executor.memoryOverhead
> from the default (384) to 2048.  I also went ahead and disabled the YARN
> vmem and pmem checks on the cluster.  With those checks disabled I see the
> following error:
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
>       at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>       at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>       at 
> io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1466)
>       at io.netty.buffer.AbstractByteBuf.nioBuffer(AbstractByteBuf.java:1203)
>       at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:140)
>       at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
>       at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
>       at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
>       at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
>       at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>       at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
>       at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
>       at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
>       at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
>       at 
> io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:802)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831)
>       at 
> io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041)
>       at 
> io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300)
>       at 
> org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:222)
>       at 
> org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:146)
>       at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
>       at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
>
>
>
> Looking at GC:
>
>    [Eden: 16.0M(8512.0M)->0.0B(8484.0M) Survivors: 4096.0K->4096.0K Heap: 
> 8996.7M(20.0G)->8650.3M(20.0G)]
>  [Times: user=0.03 sys=0.01, real=0.01 secs]
>  794.949: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: 
> allocation request failed, allocation request: 401255000 bytes]
>  794.949: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion 
> amount: 401255000 bytes, attempted expansion amount: 402653184 bytes]
>  794.949: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap 
> already fully expanded]
> 794.949: [Full GC (Allocation Failure) 801.766: [SoftReference, 0 refs, 
> 0.0000359 secs]801.766: [WeakReference, 1604 refs, 0.0001191 secs]801.766: 
> [FinalReference, 1180 refs, 0.0000882 secs]801.766: [PhantomReference, 0 
> refs, 12 refs, 0.0000117 secs]801.766: [JNI Weak Reference, 0.0000180 secs] 
> 8650M->7931M(20G), 17.4838808 secs]
>    [Eden: 0.0B(8484.0M)->0.0B(9588.0M) Survivors: 4096.0K->0.0B Heap: 
> 8650.3M(20.0G)->7931.0M(20.0G)], [Metaspace: 52542K->52542K(1095680K)]
>  [Times: user=20.80 sys=1.53, real=17.48 secs]
> 812.433: [Full GC (Allocation Failure) 817.544: [SoftReference, 112 refs, 
> 0.0000858 secs]817.544: [WeakReference, 1606 refs, 0.0002380 secs]817.545: 
> [FinalReference, 1175 refs, 0.0003358 secs]817.545: [PhantomReference, 0 
> refs, 12 refs, 0.0000135 secs]817.545: [JNI Weak Reference, 0.0000305 secs] 
> 7931M->7930M(20G), 15.9899953 secs]
>    [Eden: 0.0B(9588.0M)->0.0B(9588.0M) Survivors: 0.0B->0.0B Heap: 
> 7931.0M(20.0G)->7930.8M(20.0G)], [Metaspace: 52542K->52477K(1095680K)]
>  [Times: user=20.88 sys=0.87, real=15.99 secs]
> 828.425: [GC concurrent-mark-abort]
> Heap
>  garbage-first heap   total 20971520K, used 8465150K [0x00000002c0000000, 
> 0x00000002c040a000, 0x00000007c0000000)
>   region size 4096K, 2 young (8192K), 0 survivors (0K)
>  Metaspace       used 52477K, capacity 52818K, committed 54400K, reserved 
> 1095680K
>   class space    used 7173K, capacity 7280K, committed 7552K, reserved 
> 1048576K
>  Concurrent marking:
>       0   init marks: total time =     0.00 s (avg =     0.00 ms).
>      13      remarks: total time =     0.49 s (avg =    37.77 ms).
>            [std. dev =    21.54 ms, max =    83.71 ms]
>         13  final marks: total time =     0.03 s (avg =     2.00 ms).
>               [std. dev =     2.89 ms, max =    11.95 ms]
>         13    weak refs: total time =     0.46 s (avg =    35.77 ms).
>               [std. dev =    22.02 ms, max =    82.16 ms]
>      13     cleanups: total time =     0.14 s (avg =    10.63 ms).
>            [std. dev =    11.13 ms, max =    42.87 ms]
>     Final counting total time =     0.04 s (avg =     2.96 ms).
>     RS scrub total time =     0.05 s (avg =     3.91 ms).
>   Total stop_world time =     0.63 s.
>   Total concurrent time =   195.51 s (  194.00 s marking).
>
>
> This appears to show roughly 7930 MB of heap in use out of a 20 GB
> maximum (so less than half).  However, Spark is throwing a Java OOM (out
> of heap space) error.  I validated that we're not using legacy memory
> management mode (spark.memory.useLegacyMode).  I've tried this against a
> few applications (some batch, some streaming).  Has something changed
> with memory allocation in Spark 2.3.0 that would cause these issues?
>
> Thank you for any help you can provide.
>
> Regards,
>
> Bryan Jeffrey
>
>
>
