I believe jmap is only showing you the Java heap usage, but the program is running out of direct memory. They are two different pools of memory.
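You can watch the two pools side by side from inside the JVM via the standard platform MXBeans. A minimal sketch (the class name is just for illustration):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class MemoryPoolProbe {
        public static void main(String[] args) {
            // Heap: this is the pool jmap's histogram accounts for.
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap:   used=%,d max=%,d%n", heap.getUsed(), heap.getMax());

            // The "direct" buffer pool lives outside the heap and is capped
            // by -XX:MaxDirectMemorySize; jmap never shows its contents.
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: used=%,d capacity=%,d buffers=%,d%n",
                        pool.getName(), pool.getMemoryUsed(),
                        pool.getTotalCapacity(), pool.getCount());
            }
        }
    }

One caveat: netty can also grab direct memory through Unsafe (note the allocateDirectNoCleaner frame in the trace below) and then tracks it with its own counter, which is where the used/max figures in the error come from, so the "direct" bean may undercount what netty has actually reserved.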
I haven't had to diagnose a direct memory problem before, but this blog post has some suggestions on how to do it: https://jkutner.github.io/2017/04/28/oh-the-places-your-java-memory-goes.html

On Thu, Mar 8, 2018 at 1:57 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:

> Hi
>
> Anybody got any pointers on this one?
>
> Regards
> Sumit Chawla
>
> On Tue, Mar 6, 2018 at 8:58 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
>
>> No, this is the only stack trace I get. I have tried DEBUG but didn't notice much of a change in the logs.
>>
>> Yes, I have tried bumping MaxDirectMemorySize to get rid of this error. It does work if I throw 4G+ of memory at it. However, I am trying to understand this behavior so that I can set the value appropriately.
>>
>> Regards
>> Sumit Chawla
>>
>> On Tue, Mar 6, 2018 at 8:07 AM, Vadim Semenov <va...@datadoghq.com> wrote:
>>
>>> Do you have a trace? i.e. what's the source of the `io.netty.*` calls?
>>>
>>> And have you tried bumping `-XX:MaxDirectMemorySize`?
>>>
>>> On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> I have a job which processes a large dataset. All items in the dataset are unrelated. To save on cluster resources, I process these items in chunks, and since the chunks are independent of each other, I start and shut down the Spark context for each chunk. This keeps the DAG small and avoids retrying the entire DAG in case of failures. This mechanism worked fine with Spark 1.6. Since we moved to 2.2, the job has started failing with OutOfDirectMemoryError:
>>>>
>>>> 2018-03-03 22:00:59,687 WARN [rpc-server-48-1] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(78)) - Exception in connection from /10.66.73.27:60374
>>>> io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 8388608 byte(s) of direct memory (used: 1023410176, max: 1029177344)
>>>>     at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:506)
>>>>     at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:460)
>>>>     at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
>>>>     at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:690)
>>>>     at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
>>>>     at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
>>>>     at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
>>>>     at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>>>>     at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>>>>     at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>>>>     at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>>>>     at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>>>>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
>>>>
>>>> I got some clue on what is causing this from https://github.com/netty/netty/issues/6343, however I am not able to add up the numbers on what is causing 1 GB of Direct Memory to fill up.
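A rough way to add those numbers up, assuming (per that netty issue) one pooled arena per event-loop thread and the 8 MiB chunk size visible in the failed allocation; the arena and allocator counts here are illustrative assumptions, not measurements:

    public class DirectMemoryEstimate {
        public static void main(String[] args) {
            long chunk = 8_388_608L;     // chunk size from the failed allocation above
            long max   = 1_029_177_344L; // the cap reported in the error (~981 MiB)

            // Per netty/netty#6343, each PooledByteBufAllocator creates up to
            // one direct arena per event-loop thread, and every arena that has
            // served a "normal" allocation pins at least one whole chunk.
            int arenasPerAllocator = 32; // assumption: one per core on a 32-CPU driver
            int allocators = 4;          // assumption: rpc server/clients, shuffle, etc.

            long floor = (long) allocators * arenasPerAllocator * chunk;
            System.out.printf("minimum pinned direct memory ~ %,d bytes%n", floor);
            System.out.printf("whole chunks that fit under the cap: %d%n", max / chunk);
        }
    }

Two details line up with this: the used figure in the error is exactly 122 x 8388608, i.e. 122 whole chunks, and the jmap output below shows 300 PoolArena$DirectArena and 36 PooledByteBufAllocator instances. Whole chunks pinned per arena, rather than individually leaked buffers, can plausibly account for the full 1 GB.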
>>>>
>>>> Output from jmap:
>>>>
>>>>  num    #instances   #bytes  class name
>>>>    7:        22230  1422720  io.netty.buffer.PoolSubpage
>>>>   12:         1370   804640  io.netty.buffer.PoolSubpage[]
>>>>   41:         3600   144000  io.netty.buffer.PoolChunkList
>>>>   98:         1440    46080  io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
>>>>  113:          300    40800  io.netty.buffer.PoolArena$HeapArena
>>>>  114:          300    40800  io.netty.buffer.PoolArena$DirectArena
>>>>  192:          198    15840  io.netty.buffer.PoolChunk
>>>>  274:          120     8320  io.netty.buffer.PoolThreadCache$MemoryRegionCache[]
>>>>  406:          120     3840  io.netty.buffer.PoolThreadCache$NormalMemoryRegionCache
>>>>  422:           72     3552  io.netty.buffer.PoolArena[]
>>>>  458:           30     2640  io.netty.buffer.PooledUnsafeDirectByteBuf
>>>>  500:           36     2016  io.netty.buffer.PooledByteBufAllocator
>>>>  529:           32     1792  io.netty.buffer.UnpooledUnsafeHeapByteBuf
>>>>  589:           20     1440  io.netty.buffer.PoolThreadCache
>>>>  630:           37     1184  io.netty.buffer.EmptyByteBuf
>>>>  703:           36      864  io.netty.buffer.PooledByteBufAllocator$PoolThreadLocalCache
>>>>  852:           22      528  io.netty.buffer.AdvancedLeakAwareByteBuf
>>>>  889:           10      480  io.netty.buffer.SlicedAbstractByteBuf
>>>>  917:            8      448  io.netty.buffer.UnpooledHeapByteBuf
>>>> 1018:           20      320  io.netty.buffer.PoolThreadCache$1
>>>> 1305:            4      128  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
>>>> 1404:            1       80  io.netty.buffer.PooledUnsafeHeapByteBuf
>>>> 1473:            3       72  io.netty.buffer.PoolArena$SizeClass
>>>> 1529:            1       64  io.netty.buffer.AdvancedLeakAwareCompositeByteBuf
>>>> 1541:            2       64  io.netty.buffer.CompositeByteBuf$Component
>>>> 1568:            1       56  io.netty.buffer.CompositeByteBuf
>>>> 1896:            1       32  io.netty.buffer.PoolArena$SizeClass[]
>>>> 2042:            1       24  io.netty.buffer.PooledUnsafeDirectByteBuf$1
>>>> 2046:            1       24  io.netty.buffer.UnpooledByteBufAllocator
>>>> 2051:            1       24  io.netty.buffer.PoolThreadCache$MemoryRegionCache$1
>>>> 2078:            1       24  io.netty.buffer.PooledHeapByteBuf$1
>>>> 2135:            1       24  io.netty.buffer.PooledUnsafeHeapByteBuf$1
>>>> 2302:            1       16  io.netty.buffer.ByteBufUtil$1
>>>> 2769:            1       16  io.netty.util.internal.__matchers__.io.netty.buffer.ByteBufMatcher
>>>>
>>>> My driver machine has 32 CPUs, and as of now I have 15 machines in my cluster. Currently the error happens while processing the 5th or 6th chunk. I suspect the error depends on the number of executors and would happen earlier if we added more executors.
>>>>
>>>> I am trying to come up with an explanation of what is filling up the Direct Memory and how to quantify it as a factor of the number of executors. Ours is a shared cluster, and we need to understand how much driver memory to allocate for most of the jobs.
>>>>
>>>> Regards
>>>> Sumit Chawla

--
Dave Cameron
Senior Platform Engineer
d...@digitalocean.com
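The -XX:MaxDirectMemorySize workaround above only helps if the flag reaches the driver and executor JVMs at launch. In Spark 2.x that is normally done through spark-defaults.conf or the equivalent --conf flags on spark-submit; a sketch (the 4g value mirrors what worked above, not a general recommendation):

    # spark-defaults.conf (or pass each line as --conf on spark-submit)
    spark.driver.extraJavaOptions     -XX:MaxDirectMemorySize=4g
    spark.executor.extraJavaOptions   -XX:MaxDirectMemorySize=4g

Alternatively, spark.shuffle.io.preferDirectBufs=false steers netty's shuffle allocations onto the heap, trading some transfer performance for a smaller direct-memory footprint.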