Could you search for "OutOfMemoryError" in the executor logs? It could be "OutOfMemoryError: Direct buffer memory" or something else.
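If it helps, you can grep the aggregated container logs in one pass; a rough sketch, assuming YARN log aggregation is enabled and with <applicationId> as a placeholder for your actual application ID:

    yarn logs -applicationId <applicationId> | grep -i -B2 -A20 "OutOfMemoryError"

The -A20 is only there to capture the stack trace that follows a match.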
On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
> Hi,
>
> We are using Spark 1.5.2 on YARN. We have a Spark application using about
> 15 GB of executor memory and 1500 MB of memory overhead. However, at a
> certain stage we notice high GC time (almost as long as the task time).
> These executors are bound to get killed at some point. However, the
> NodeManager and ResourceManager logs don't indicate a failure due to
> 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
> 'GC overhead limit exceeded' errors in the executor logs. Some of these
> high-GC executors do get killed eventually, but I can't seem to find the
> reason. Based on the application logs it seems like the executor didn't
> respond to the driver for a long period of time and the connection was
> reset.
>
> The following are logs from 'yarn logs -applicationId appId_1232_xxx':
>
> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0 (TID 15318). 2099 bytes result sent to driver
> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15333
> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0 (TID 15333)
> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125 non-empty blocks out of 3007 blocks
> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14 remote fetches in 10 ms
> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
> java.io.IOException: Connection from maprnode5 closed
>         at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>         at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:744)
> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 6 outstanding blocks after 5000 ms
> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive connection to maprnode5, creating a new one.
> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in connection from maprnode5
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:744)
> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
>
> Also, I can see a lot of ExecutorLost errors in the driver logs, but I
> can't seem to connect them to the executor logs above. I think the
> executor logs should at least mention the executor ID along with the task
> ID (EID-TID), not just the task ID (TID).
>
> This is a snippet of the driver logs from the UI:
>
> 189  15283  0  FAILED  PROCESS_LOCAL  15 / maprnode5  2016/02/24 11:08:55  /  ExecutorLostFailure (executor 15 lost)
>
> Here we can see the executor ID is 15, but the executor logs themselves
> don't use this ID as a reference in the log stream, so it's hard to
> cross-check the logs.
>
> Anyhow, my main issue is to determine the cause of the executors getting
> killed.
>
> Thanks
>
> Nirav
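Separately from the OOM check: the "quiet for 120000 ms" errors in the quoted logs suggest the remote end was either already gone or stuck in a long GC pause when the shuffle fetch timed out. If you want to experiment, something along these lines (the values are only placeholders, not recommendations) would raise the network timeout and the YARN executor memory overhead:

    spark-submit \
      --conf spark.network.timeout=300s \
      --conf spark.yarn.executor.memoryOverhead=3072 \
      ...

Bear in mind that raising spark.network.timeout only hides the symptom if long GC pauses are the underlying cause.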