I’ve checked the logs of the zookeeper quorum nodes, and it seems that problem 
is not here.

I found that client is closing connection due to timeout, only after this 
zookeeper server has EOF exception. Here is the logs from hadoop zkfc process
2015-10-01 07:12:17,820 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3334ms for sessionid 
0x25020ff36c40002, closing socket connection and attempting reconnect
2015-10-01 07:12:17,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode…

As you can see client timeouted after 3334ms, but the timeout configured in 
hdfs-site.xml dfs.ha.fencing.ssh.connect-timeout is 10.000ms 
so why does the client timed out?

Thanks.


> On 24 Sep 2015, at 15:36, Flavio Junqueira <[email protected]> wrote:
> 
> I can see that the client is disconnecting from the server, and there is also 
> a new round of leader election for the zookeeper servers. If this is 
> happening, then yeah, your ensemble is unstable. If the ensemble leader 
> election is being triggered frequently, then I'd start by looking there. Try 
> to determine why the ensemble is failing to continue with the same leader. If 
> ensemble elections aren't happening frequently, then another possibility is 
> that GC pauses are causing the session to expire.
> 
> -Flavio
> 
>> On 24 Sep 2015, at 05:24, Akmal Abbasov <[email protected]> wrote:
>> 
>> Hi,
>> I am using zookeeper 3.4.6
>> I have a spark cluster configured with HA. Once per 1-2 days, the active 
>> spark master is shutting down with a message
>> 15/09/23 18:58:18 INFO zookeeper.ClientCnxn: Unable to read additional data 
>> from server sessionid 0x34ffa68dbd10021, likely server has closed socket, 
>> closing socket connection and attempting reconnect
>> 15/09/23 18:58:18 INFO state.ConnectionStateManager: State change: SUSPENDED
>> 15/09/23 18:58:18 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
>> leadership
>> 15/09/23 18:58:18 ERROR master.Master: Leadership has been revoked -- master 
>> shutting down.
>> 15/09/23 18:58:18 INFO util.Utils: Shutdown hook called
>> 
>> I don’t have the zookeeper logs from the same period, but the logs are full 
>> of the these messages 
>> 2015-09-24 05:07:42,228 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
>> Accepted socket connection from /10.0.8.4:34705
>> 2015-09-24 05:07:42,229 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection 
>> request from old client /10.0.8.4:34705; will be dropped if server is in r-o 
>> mode
>> 2015-09-24 05:07:42,229 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client 
>> attempting to establish new session at /10.0.8.4:34705
>> 2015-09-24 05:07:42,292 [myid:1] - INFO  
>> [CommitProcessor:1:ZooKeeperServer@617] - Established session 
>> 0x14ffd3670130030 with negotiated timeout 20001 for client /10.0.8.4:34705
>> 2015-09-24 05:07:42,302 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end 
>> of stream exception
>> EndOfStreamException: Unable to read additional data from client sessionid 
>> 0x14ffd3670130030, likely client has closed socket
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>      at java.lang.Thread.run(Thread.java:745)
>> 2015-09-24 05:07:42,303 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
>> socket connection for client /10.0.8.4:34705 which had sessionid 
>> 0x14ffd3670130030
>> 2015-09-24 05:07:42,314 [myid:1] - ERROR 
>> [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>      at 
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>      at 
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> 2015-09-24 05:07:42,334 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
>> Accepted socket connection from /10.0.8.4:34707
>> 2015-09-24 05:07:42,334 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection 
>> request from old client /10.0.8.4:34707; will be dropped if server is in r-o 
>> mode
>> 2015-09-24 05:07:42,335 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client 
>> attempting to establish new session at /10.0.8.4:34707
>> 2015-09-24 05:07:42,357 [myid:1] - INFO  
>> [CommitProcessor:1:ZooKeeperServer@617] - Established session 
>> 0x14ffd3670130031 with negotiated timeout 20001 for client /10.0.8.4:34707
>> 2015-09-24 05:07:42,364 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end 
>> of stream exception
>> EndOfStreamException: Unable to read additional data from client sessionid 
>> 0x14ffd3670130031, likely client has closed socket
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>      at java.lang.Thread.run(Thread.java:745)
>> 2015-09-24 05:07:42,365 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
>> socket connection for client /10.0.8.4:34707 which had sessionid 
>> 0x14ffd3670130031
>> 2015-09-24 05:07:42,376 [myid:1] - ERROR 
>> [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>      at 
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>      at 
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>      at 
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> 
>> Also there are 
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - 
>> Shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@105] - 
>> Shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  
>> [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - 
>> FollowerRequestProcessor exited loop!
>> 2015-09-24 06:29:54,460 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@181] - Shutting down
>> 2015-09-24 06:29:54,464 [myid:1] - INFO  
>> [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop!
>> 2015-09-24 06:29:54,465 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - 
>> shutdown of request processor complete
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - 
>> Shutting down
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  
>> [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  
>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING
>> 2015-09-24 06:29:54,584 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
>> Accepted socket connection from /10.0.8.58:36137
>> 2015-09-24 06:29:54,584 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
>> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
>> running
>> 2015-09-24 06:29:54,584 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
>> socket connection for client /10.0.8.58:36137 (no session established for 
>> client)
>> 2015-09-24 06:29:54,679 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
>> Accepted socket connection from /10.0.8.57:57410
>> 2015-09-24 06:29:54,680 [myid:1] - WARN  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
>> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
>> running
>> 2015-09-24 06:29:54,680 [myid:1] - INFO  
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed 
>> socket connection for client /10.0.8.57:57410 (no session established for 
>> client)
>> 
>> I also observed that hadoop-zkfc restarts very frequently.
>> Any ideas what could be wrong?
>> 
>> Thanks.
> 

Reply via email to