[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839435#comment-16839435 ]

Neeraj Gupta commented on SPARK-23191:
--------------------------------------

Hi [~Ngone51],

In our case we were running the driver with the "supervise" flag. During the
network glitch the driver was assumed to be dead and a new driver was spawned.
However, after the network glitch was over, we saw that the old driver was
also still running. This might be an edge case.
As a result, our fix was to remove the supervise flag and add
monitoring/alerting for the driver process.
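
For reference, below is a minimal sketch of the kind of external check we mean, run periodically (e.g. from cron) instead of relying on supervise. The driver main class and the alerting hook are placeholders, not our actual setup:

{code:scala}
// Minimal sketch of an external driver health check. The main class name and
// the alerting mechanism are placeholders; adapt both to the actual deployment.
import scala.sys.process._

object DriverMonitor {
  // Placeholder: fully qualified main class of the driver JVM to watch.
  val driverMainClass = "com.example.MyDriver"

  def main(args: Array[String]): Unit = {
    // `jps -l` lists running JVMs together with their main class names.
    val jvms = Seq("jps", "-l").!!
    val running = jvms.split("\n").exists(_.contains(driverMainClass))
    if (!running) {
      // Exit non-zero so the scheduler (cron + mail, Nagios, etc.) can raise an alert.
      System.err.println(s"ALERT: driver $driverMainClass is not running")
      sys.exit(1)
    }
  }
}
{code}

Any equivalent check (a systemd unit plus an alert, a Nagios probe, etc.) works just as well; the point is that an alerting system decides about restarts rather than the Spark master.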

> Worker registration fails in case of network drop
> ---------------------------------------------------
>
>                 Key: SPARK-23191
>                 URL: https://issues.apache.org/jira/browse/SPARK-23191
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.3, 2.2.1, 2.3.0
>         Environment: OS: CentOS 6.9 (64-bit)
>  
>            Reporter: Neeraj Gupta
>            Priority: Critical
>
> We have a 3-node cluster. In production we were facing issues with multiple
> drivers running in some scenarios.
> On further investigation we were able to reproduce the scenario in both the
> 1.6.3 and 2.2.1 versions with the following steps:
 # Set up a 3-node cluster. Start the master and slaves.
 # On any node where the worker process is running, block connections on
> port 7077 using iptables:
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
 # After about 10-15 seconds we get an error on the node that it is unable
> to connect to the master:
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from <servername>
> java.io.IOException: Connection timed out
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>         at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>         at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>         at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
 # Once we get this exception, we re-enable connections to port 7077 using:
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
 # The worker tries to register again with the master but is unable to do
> so. It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> <servername>:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>         at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>         at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to <servername>:7077
>         at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>         at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>         at 
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>         ... 4 more
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
> Connection timed out: <servername>:7077
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>         at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>         at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>         ... 1 more
> 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR 
> org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate 
> worker ID
> 2018-01-23 12:09:03,705 [dispatcher-event-loop-5] ERROR 
> org.apache.spark.deploy.worker.Worker - Worker registration failed: Duplicate 
> worker ID{code}
 # The worker state is changed to DEAD in the Spark UI. As a result, a
> duplicate driver is launched (see the sketch below for why the
> re-registration is rejected and a second driver appears).
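>
> A toy sketch of the failure mode as we understand it (illustrative only, this
> is not Spark's actual Master code): the master still tracks the old worker
> entry as ALIVE, so the re-registration under the same worker ID is rejected as
> a duplicate; once the worker is finally declared DEAD, its supervised driver
> is relaunched even though the original driver JVM is still running.
> {code:scala}
> // Illustrative toy model of the behaviour described above; NOT Spark's Master.scala.
> import scala.collection.mutable
>
> object WorkerState extends Enumeration { val ALIVE, DEAD = Value }
>
> case class ToyWorker(id: String, var state: WorkerState.Value)
>
> class ToyMaster {
>   private val idToWorker = mutable.Map[String, ToyWorker]()
>
>   /** Returns true if the registration is accepted. */
>   def registerWorker(w: ToyWorker): Boolean = idToWorker.get(w.id) match {
>     case Some(old) if old.state == WorkerState.ALIVE =>
>       // The network drop was one-sided: the master never saw the old entry die,
>       // so the worker's retry under the same ID is rejected: "Duplicate worker ID".
>       false
>     case _ =>
>       idToWorker(w.id) = w
>       true
>   }
>
>   /** When heartbeats finally time out, the master flips the entry to DEAD and
>     * relaunches any supervised drivers; it cannot kill the unreachable driver
>     * JVM on the old host, which is how two drivers end up running at once. */
>   def markDead(id: String): Unit =
>     idToWorker.get(id).foreach(_.state = WorkerState.DEAD)
> }
> {code}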



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
