Re: Cluster failure after zookeeper glitch.

Stefan Richter Fri, 20 Jan 2017 01:31:34 -0800

I would think that network problems between Flink and Zookeeper in HA mode 
could indeed lead to problems. Maybe Till (in CC) has a better idea of what is 
going on there).


> Am 19.01.2017 um 14:55 schrieb Andrew Ge Wu <andrew.ge...@eniro.com>:
> 
> Hi Stefan
> 
> Yes we are running in HA mode with dedicated zookeeper cluster. As far as I 
> can see it looks like a networking issue with zookeeper cluster.
> 2 out of 5 zookeeper reported something around the same time:
> 
> server1
> 2017-01-19 11:52:13,044 [myid:1] - WARN  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2017-01-19 11:52:13,045 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
> 2017-01-19 11:52:13,045 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket 
> connection for client /172.27.163.227:51800 which had sessionid 
> 0x159b505820a0009
> 2017-01-19 11:52:13,046 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket 
> connection for client /172.27.163.227:51798 which had sessionid 
> 0x159b505820a0008
> 2017-01-19 11:52:13,046 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket 
> connection for client /0:0:0:0:0:0:0:1:46891 which had sessionid 
> 0x1537b32bbe100ad
> 2017-01-19 11:52:13,046 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket 
> connection for client /172.27.165.64:50075 which had sessionid 
> 0x159b505820a000d
> 2017-01-19 11:52:13,046 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - 
> Shutting down
> 2017-01-19 11:52:13,046 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down
> 
> 
> server2
> 2017-01-19 11:52:13,061 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 1 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,082 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,083 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,284 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,285 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,310 [myid:2] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /172.27.163.227:39302
> 2017-01-19 11:52:13,311 [myid:2] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client 
> attempting to renew session 0x159b505820a0009 at /172.27.163.227:39302
> 2017-01-19 11:52:13,312 [myid:2] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating 
> client: 0x159b505820a0009
> 2017-01-19 11:52:13,687 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:13,687 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:14,488 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:14,489 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message 
> format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING 
> (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
> 2017-01-19 11:52:14,719 [myid:2] - INFO  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established 
> session 0x159b505820a0009 with negotiated timeout 40000 for client 
> /172.27.163.227:39302
> 
> I can’t say for sure if data in zookeeper is corrupted at that time. I guess 
> Flink is kinda sensitive on that?
> 
> 
> Thanks
> 
> 
> 
> Andrew
> 
> 
> 
> 
> 
>> On 19 Jan 2017, at 14:19, Stefan Richter <s.rich...@data-artisans.com 
>> <mailto:s.rich...@data-artisans.com>> wrote:
>> 
>> Hi,
>> 
>> I think depending on your configuration of Flink (are you using high 
>> availability mode?) and the type of ZK glitches we are talking about, it can 
>> very well be that some of Flink’s meta data in ZK got corrupted and the 
>> system can not longer operate. But for a deeper analysis, we would need more 
>> details about your configuration and the ZK problem.
>> 
>> Best,
>> Stefan
>> 
>>> Am 19.01.2017 um 13:16 schrieb Andrew Ge Wu <andrew.ge...@eniro.com 
>>> <mailto:andrew.ge...@eniro.com>>:
>>> 
>>> Hi,
>>> 
>>> 
>>> We recently had several zookeeper glitch, when that happens it seems to 
>>> take flink cluster with it.
>>> 
>>> We are running on 1.03
>>> 
>>> It started like this:
>>> 
>>> 
>>> 2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn               
>>>                 - Unable to read additional data from server sessionid 
>>> 0x159b505820a0008, likely server has closed socket, closing socket 
>>> connection and attempting reconnect
>>> 2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn               
>>>                 - Unable to read additional data from server sessionid 
>>> 0x159b505820a0009, likely server has closed socket, closing socket 
>>> connection and attempting reconnect
>>> 2017-01-19 11:52:13,151 INFO  
>>> org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager
>>>   - State change: SUSPENDED
>>> 2017-01-19 11:52:13,151 INFO  
>>> org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager
>>>   - State change: SUSPENDED
>>> 2017-01-19 11:52:13,166 WARN  
>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
>>> ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not 
>>> monitored (temporarily).
>>> 2017-01-19 11:52:13,169 INFO  
>>> org.apache.flink.runtime.jobmanager.JobManager                - JobManager 
>>> akka://flink/user/jobmanager#1976923422 
>>> <akka://flink/user/jobmanager#1976923422> was revoked leadership.
>>> 2017-01-19 11:52:13,179 INFO  
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - op1 -> 
>>> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from 
>>> RUNNING to FAILED
>>> 
>>> 
>>> 
>>> Then our web-ui stopped serving and job manager stuck in an exception loop 
>>> like this:
>>> 2017-01-19 13:05:13,521 WARN  
>>> org.apache.flink.runtime.jobmanager.JobManager                - Discard 
>>> message 
>>> LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 
>>> 13:05:13     Job execution switched to status RESTARTING.) because the 
>>> expected leader session I
>>> D None did not equal the received leader session ID 
>>> Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
>>> 2017-01-19 13:05:13,521 INFO  
>>> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy  
>>> - Delaying retry of job execution for xxxxx ms …
>>> 
>>> 
>>> Is it because we misconfigured anything? or this is expected behavior? When 
>>> this happens we have to restart the cluster to bring it back.
>>> 
>>> 
>>> Thanks!
>>> 
>>> 
>>> Andrew
>>> -- 
>>> Confidentiality Notice: This e-mail transmission may contain confidential 
>>> or legally privileged information that is intended only for the individual 
>>> or entity named in the e-mail address. If you are not the intended 
>>> recipient, you are hereby notified that any disclosure, copying, 
>>> distribution, or reliance upon the contents of this e-mail is strictly 
>>> prohibited and may be unlawful. If you have received this e-mail in error, 
>>> please notify the sender immediately by return e-mail and delete all copies 
>>> of this message.
>> 
> 
> 
> Confidentiality Notice: This e-mail transmission may contain confidential or 
> legally privileged information that is intended only for the individual or 
> entity named in the e-mail address. If you are not the intended recipient, 
> you are hereby notified that any disclosure, copying, distribution, or 
> reliance upon the contents of this e-mail is strictly prohibited and may be 
> unlawful. If you have received this e-mail in error, please notify the sender 
> immediately by return e-mail and delete all copies of this message.

Re: Cluster failure after zookeeper glitch.

Reply via email to