[ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cong Feng closed SPARK-16146.
-----------------------------
    Resolution: Fixed

> Spark application failed by Yarn preempting
> -------------------------------------------
>
>                 Key: SPARK-16146
>                 URL: https://issues.apache.org/jira/browse/SPARK-16146
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: Amazon EC2, CentOS 6.6, 
> Spark-1.6.1-bin-hadoop-2.6 (binary from the official Spark website), Hadoop 2.7.2, 
> preemption and dynamic allocation enabled.
>            Reporter: Cong Feng
>
> Hi,
> We are setting up our Spark cluster on Amazon EC2. We run Spark in YARN 
> client mode with Spark-1.6.1-bin-hadoop-2.6 (the binary from the official 
> Spark website) and Hadoop 2.7.2, and we have preemption, dynamic allocation, 
> and spark.shuffle.service.enabled turned on.
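> For reference, our setup looks roughly like the sketch below; the 
> minExecutors/maxExecutors values are illustrative, not our exact numbers, 
> and the yarn-site.xml entries are shown flat as name/value pairs for brevity:
>   # spark-defaults.conf (sketch)
>   spark.master                            yarn-client
>   spark.shuffle.service.enabled           true
>   spark.dynamicAllocation.enabled         true
>   spark.dynamicAllocation.minExecutors    1
>   spark.dynamicAllocation.maxExecutors    50
>   # yarn-site.xml on each NodeManager, so the external shuffle service
>   # keeps serving shuffle blocks when executors are preempted
>   yarn.nodemanager.aux-services                      mapreduce_shuffle,spark_shuffle
>   yarn.nodemanager.aux-services.spark_shuffle.class  org.apache.spark.network.yarn.YarnShuffleService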
> During our tests we found that our Spark applications frequently get killed 
> when preemption happens. Mostly it appears the driver is trying to send an 
> RPC to an executor that has already been preempted; there are also some 
> "connection reset by peer" exceptions that likewise cause jobs to fail. 
> Below are the typical exceptions we see:
> 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49
> java.io.IOException: Failed to send RPC 5721681506291542850 to nodexx.xx.xxxx.ddns.xx.com/xx.xx.xx.xx:42857: java.nio.channels.ClosedChannelException
>         at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
>         at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
>         at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>         at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
>         at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801)
>         at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699)
>         at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
>         at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
>         at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
>         at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
>         at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
>         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
> And:
> 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122
> 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from nodexx-xx-xx.xxxx.ddns.xx.com/xx.xx.xx.xx:56618 is closed.
> This happens with both the capacity scheduler and the fair scheduler. The 
> strange thing is that when we rolled back to Spark 1.4.1, the issue 
> disappeared and preemption worked smoothly.
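> For completeness, "preemption enabled" on our side means roughly the 
> following yarn-site.xml properties (shown flat as name = value; queue 
> configuration omitted). Fair scheduler:
>   yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
>   yarn.scheduler.fair.preemption = true
> Capacity scheduler:
>   yarn.resourcemanager.scheduler.monitor.enable = true
>   yarn.resourcemanager.scheduler.monitor.policies = org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy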
> But we would still like to deploy Spark 1.6.1. Is this a bug, or something 
> we can fix on our side? Any ideas would be greatly appreciated.
> Thanks


