Hello.

I am working to move our system from Spark 2.1.0 to Spark 2.3.0. Our
system runs Spark managed via YARN. During the move I mirrored the
settings to our new cluster. However, on the Spark 2.3.0 cluster, with the
same resource allocation, I am seeing a number of executors die due to
OOM:

18/07/16 17:23:06 ERROR YarnClusterScheduler: Lost executor 5 on wn80:
Container killed by YARN for exceeding memory limits. 22.0 GB of 22 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
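
For what it's worth, the 22 GB limit in that message lines up with a 20 GB
executor heap plus overhead. A quick sketch of the arithmetic (the 20 GB
heap is an assumption taken from the GC log further down; the default
overhead formula is from the Spark docs):

    object ContainerMath extends App {
      val executorHeapMb = 20 * 1024  // spark.executor.memory, assumed 20g
      // default spark.executor.memoryOverhead = max(384 MB, 10% of executor
      // memory), so a 20g executor already gets 2048 MB by default
      val overheadMb = math.max(384, executorHeapMb / 10)
      println(s"YARN container = ${(executorHeapMb + overheadMb) / 1024.0} GB") // 22.0 GB
    }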

I increased spark.driver.memoryOverhead and spark.executor.memoryOverhead
from the default (384 MB) to 2048 MB, and went ahead and disabled the vmem
and pmem checks in YARN.
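
Roughly, the settings now look like this (a sketch; spark.executor.memory
is an assumption, inferred from the 20 GB heap in the GC log further down,
and unitless memoryOverhead values are read as MiB):

    object SubmitSketch {
      val conf = new org.apache.spark.SparkConf()
        .set("spark.executor.memory", "20g")          // assumed
        .set("spark.driver.memoryOverhead", "2048")   // raised from 384
        .set("spark.executor.memoryOverhead", "2048") // raised from 384
    }
    // and on the YARN side, in yarn-site.xml:
    //   yarn.nodemanager.pmem-check-enabled = false
    //   yarn.nodemanager.vmem-check-enabled = false

With those checks disabled I see the following error: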

Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1466)
        at io.netty.buffer.AbstractByteBuf.nioBuffer(AbstractByteBuf.java:1203)
        at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:140)
        at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
        at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
        at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
        at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
        at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
        at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
        at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:802)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794)
        at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831)
        at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1041)
        at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:300)
        at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:222)
        at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:146)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
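
If I'm reading that trace right, it comes down to a single on-heap buffer
allocated for the whole outgoing fetch response. Illustratively (the size
below is the failed allocation request from the GC log that follows):

    import java.nio.ByteBuffer

    object AllocSketch extends App {
      // "allocation request: 401255000 bytes" in the GC log below
      val requestBytes = 401255000
      // A HeapByteBuffer is backed by one byte[], so this needs roughly
      // 383 MB of heap in a single contiguous chunk; under G1 that is a
      // humongous allocation spanning ~96 contiguous 4 MB regions.
      val buf = ByteBuffer.allocate(requestBytes)
      println(s"allocated ${buf.capacity() / 1024 / 1024} MB on-heap")
    }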
Looking at the GC log:

   [Eden: 16.0M(8512.0M)->0.0B(8484.0M) Survivors: 4096.0K->4096.0K Heap: 8996.7M(20.0G)->8650.3M(20.0G)]
 [Times: user=0.03 sys=0.01, real=0.01 secs]
 794.949: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 401255000 bytes]
 794.949: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 401255000 bytes, attempted expansion amount: 402653184 bytes]
 794.949: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
794.949: [Full GC (Allocation Failure) 801.766: [SoftReference, 0 refs, 0.0000359 secs]801.766: [WeakReference, 1604 refs, 0.0001191 secs]801.766: [FinalReference, 1180 refs, 0.0000882 secs]801.766: [PhantomReference, 0 refs, 12 refs, 0.0000117 secs]801.766: [JNI Weak Reference, 0.0000180 secs] 8650M->7931M(20G), 17.4838808 secs]
   [Eden: 0.0B(8484.0M)->0.0B(9588.0M) Survivors: 4096.0K->0.0B Heap: 8650.3M(20.0G)->7931.0M(20.0G)], [Metaspace: 52542K->52542K(1095680K)]
 [Times: user=20.80 sys=1.53, real=17.48 secs]
812.433: [Full GC (Allocation Failure) 817.544: [SoftReference, 112 refs, 0.0000858 secs]817.544: [WeakReference, 1606 refs, 0.0002380 secs]817.545: [FinalReference, 1175 refs, 0.0003358 secs]817.545: [PhantomReference, 0 refs, 12 refs, 0.0000135 secs]817.545: [JNI Weak Reference, 0.0000305 secs] 7931M->7930M(20G), 15.9899953 secs]
   [Eden: 0.0B(9588.0M)->0.0B(9588.0M) Survivors: 0.0B->0.0B Heap: 7931.0M(20.0G)->7930.8M(20.0G)], [Metaspace: 52542K->52477K(1095680K)]
 [Times: user=20.88 sys=0.87, real=15.99 secs]
828.425: [GC concurrent-mark-abort]
Heap
 garbage-first heap   total 20971520K, used 8465150K [0x00000002c0000000, 0x00000002c040a000, 0x00000007c0000000)
  region size 4096K, 2 young (8192K), 0 survivors (0K)
 Metaspace       used 52477K, capacity 52818K, committed 54400K, reserved 1095680K
  class space    used 7173K, capacity 7280K, committed 7552K, reserved 1048576K
 Concurrent marking:
       0  init marks: total time =     0.00 s (avg =     0.00 ms).
      13      remarks: total time =     0.49 s (avg =    37.77 ms).
            [std. dev =    21.54 ms, max =    83.71 ms]
      13  final marks: total time =     0.03 s (avg =     2.00 ms).
            [std. dev =     2.89 ms, max =    11.95 ms]
      13    weak refs: total time =     0.46 s (avg =    35.77 ms).
            [std. dev =    22.02 ms, max =    82.16 ms]
      13     cleanups: total time =     0.14 s (avg =    10.63 ms).
            [std. dev =    11.13 ms, max =    42.87 ms]
    Final counting total time =     0.04 s (avg =     2.96 ms).
    RS scrub total time =     0.05 s (avg =     3.91 ms).
  Total stop_world time =     0.63 s.
  Total concurrent time =   195.51 s (  194.00 s marking).


This appears to show used heap of about 7930 MB out of a 20 GB (20480 MB)
heap, so well under half, even after back-to-back full GCs. However, the
executor is throwing a Java OOM (Java heap space), apparently on the
single ~383 MB allocation request visible in the G1Ergonomics lines. I
validated that we're not using legacy memory management mode
(spark.memory.useLegacyMode). I've tried this against a few applications
(some batch, some streaming). Has something changed with memory allocation
in Spark 2.3.0 that would cause these issues?
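
Back-of-envelope from that log (numbers copied from it):

    object HeapMath extends App {
      val heapTotalMb = 20 * 1024                     // "20.0G" heap
      val heapUsedMb  = 7931                          // "7931M" after the first Full GC
      val requestMb   = math.round(401255000 / 1024.0 / 1024.0) // failed request, ~383 MB
      println(s"used $heapUsedMb of $heapTotalMb MB (~${100 * heapUsedMb / heapTotalMb}%), " +
        s"free ~${heapTotalMb - heapUsedMb} MB vs. a $requestMb MB request")
    }

So nominally there is over 12 GB of headroom, yet a ~383 MB request still
fails.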

Thank you for any help you can provide.

Regards,

Bryan Jeffrey
