Do you mind attaching the whole yarn app log?
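For reference, here is roughly what raising the two settings the quoted logs point at would look like. This is only a minimal sketch in Scala with placeholder values, not a recommendation for this particular job:

    // Minimal sketch (Scala, Spark 1.5.x); values below are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")                              // hypothetical app name
      .set("spark.executor.memory", "15g")               // executor heap, ~15 GB as in the thread
      .set("spark.yarn.executor.memoryOverhead", "1500") // extra container headroom YARN allows, in MB
      .set("spark.network.timeout", "300s")              // default is 120s; the log message suggests adjusting it
    val sc = new SparkContext(conf)

The same properties can also be passed on the command line via spark-submit --conf instead of being set in code.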
On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel <npa...@xactlycorp.com> wrote:

> Hi Ryan,
>
> I did search "OutOfMemoryError" earlier and just now, but it doesn't
> indicate anything else.
>
> Another thing: the job fails at the "saveAsHadoopDataset" call on a huge
> RDD. Most of the executors fail at this stage. I don't understand that
> either, because that should be a straight write job to the filesystem. All
> the complex joins and logic were in previous stages, which all ran
> successfully.
>
> Thanks
> Nirav
>
> On Wed, Mar 2, 2016 at 2:22 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
>
>> Could you search "OutOfMemoryError" in the executor logs? It could be
>> "OutOfMemoryError: Direct buffer memory" or something else.
>>
>> On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>
>>> Hi,
>>>
>>> We are using spark 1.5.2 on yarn. We have a spark application using
>>> about 15 GB of executor memory and 1500 MB of memory overhead. However, at
>>> a certain stage we notice high GC time (almost as long as the task time).
>>> These executors are bound to get killed at some point. However, the
>>> nodemanager and resource manager logs don't indicate a failure due to
>>> 'beyond physical/virtual memory limits', nor do I see any 'heap space' or
>>> 'gc overhead exceeded' errors in the executor logs. Some of these high-GC
>>> executors get killed eventually, but I can't seem to find the reason.
>>> Based on the application logs, it seems the executor didn't respond to
>>> the driver for a long period of time and the connection was reset.
>>>
>>> Following are logs from 'yarn logs -applicationId appId_1232_xxx':
>>>
>>> 16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0 (TID 15318). 2099 bytes result sent to driver
>>> 16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15333
>>> 16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0 (TID 15333)
>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125 non-empty blocks out of 3007 blocks
>>> 16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14 remote fetches in 10 ms
>>> 16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
>>> 16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
>>> 16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
>>> java.io.IOException: Connection from maprnode5 closed
>>>         at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
>>>         at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:91)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
>>>         at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
>>>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>>>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>         at java.lang.Thread.run(Thread.java:744)
>>> 16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 6 outstanding blocks after 5000 ms
>>> 16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive connection to maprnode5, creating a new one.
>>> 16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in connection from maprnode5
>>> java.io.IOException: Connection reset by peer
>>>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>>>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>>>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>>>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>>         at java.lang.Thread.run(Thread.java:744)
>>> 16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
>>> 16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
>>>
>>> Also, I can see a lot of ExecutorLost errors in the driver logs, but I
>>> can't seem to connect them to the above executor logs. I think executor
>>> logs should at least mention the executor ID along with the task ID
>>> (EID-TID), not just the task ID (TID).
>>>
>>> This is a snippet of the driver logs from the UI:
>>>
>>> 189   15283   0   FAILED   PROCESS_LOCAL   15 / maprnode5   2016/02/24 11:08:55   ExecutorLostFailure (executor 15 lost)
>>>
>>> Here we can see the executor id is 15, but the executor logs themselves
>>> don't use this id as a reference in the log stream, so it's hard to
>>> cross-check the logs.
>>>
>>> Anyhow, my main issue is to determine the cause of the executor being
>>> killed.
>>>
>>> Thanks
>>> Nirav

--
Best Regards

Jeff Zhang
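For readers following the thread, a minimal sketch of the kind of saveAsHadoopDataset call being described (Scala, Spark 1.5.x, old "mapred" Hadoop API). The key/value types, output format, and path below are hypothetical; the actual job is not shown in the thread:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}

    // Describe the output via a JobConf, then hand it to saveAsHadoopDataset.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputKeyClass(classOf[Text])
    jobConf.setOutputValueClass(classOf[IntWritable])
    jobConf.setOutputFormat(classOf[TextOutputFormat[Text, IntWritable]])
    FileOutputFormat.setOutputPath(jobConf, new Path("/tmp/example-output"))

    // hugeRdd stands in for the large pair RDD produced by the earlier joins.
    val hugeRdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }
    hugeRdd.saveAsHadoopDataset(jobConf)

Note that although the save itself is a plain write, if the RDD being saved is the output of earlier shuffles (the joins), the save stage is also the stage that fetches those shuffle blocks, which may be why the ShuffleBlockFetcherIterator activity and the fetch failures show up in the executor log right at this stage.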