There are a few questions on the original thread that would be useful to answer here as well:

1) Why was the session closed: did the client close it, or did the cluster expire it?
2) Which server was the session attached to: the first (44sec max lat) or one of the others? Which server was the leader?
3) The znode exists on all 4 servers, is that right?

It would also be useful to attach the server logs related to the session expiration, as well as the LogFormatter output of the txn log files for these nodes.

Regards,
Andor

On Tue, Apr 3, 2018 at 10:34 AM, Andor Molnar <[email protected]> wrote:

> Hi Daniel,
>
> Thanks for the bugreport.
> Interesting that this issue should have been fixed already by ages:
> https://issues.apache.org/jira/browse/ZOOKEEPER-1208
>
> Regards,
> Andor
>
>
> On Tue, Apr 3, 2018 at 3:22 AM, Daniel Chan <[email protected]> wrote:
>
>> We have a live Zookeeper environment (quorum size is 2) and observed a
>> strange behavior:
>> Kafka created 2 ephemeral nodes /brokers/ids/822712429 and
>> /brokers/ids/707577499 on 2018-03-12 03:30:36.933
>> The Kafka clients were long gone but as of today, the two ephemeral nodes
>> are still present
>>
>> Troubleshooting:
>> 1) Lists the outstanding sessions and ephemeral nodes
>> $ echo dump | nc $SERVER1 2181
>> SessionTracker dump:
>> org.apache.zookeeper.server.quorum.LearnerSessionTracker@6d7fd863
>> ephemeral nodes dump:
>> Sessions with Ephemerals (2):
>> 0x162183ea9f70003:
>> /brokers/ids/822712429
>> 0x162183ea9f70002:
>> /brokers/ids/707577499
>> /controller
>>
>> 2) stat on /brokers/ids/822712429
>> zk> stat /brokers/ids/822712429
>> czxid: 4294967344
>> mzxid: 4294967344
>> pzxid: 4294967344
>> ctime: 1520825436933 (2018-03-11T20:30:36.933-0700)
>> mtime: 1520825436933 (2018-03-11T20:30:36.933-0700)
>> version: 0
>> cversion: 0
>> aversion: 0
>> owner: 99668799174148099
>> datalen: 102
>> children: 0
>>
>> 3) List full connection/session details for all clients connected
>> $ echo cons | nc $SERVER1 2181
>> /10.247.114.70:30401[0](queued=0,recved=1,sent=0)
>> /10.248.88.235:40430[1](queued=0,recved=345,sent=345,sid=0x162183ea9f70c22,lop=PING,est=1522713395028,to=40000,lcxid=0x12,lzxid=0xffffffffffffffff,lresp=1522717802117,llat=0,minlat=0,avglat=0,maxlat=31)
>>
>> $ echo cons | nc $SERVER2 2181
>> /10.196.18.61:28173[0](queued=0,recved=1,sent=0)
>> /10.247.114.69:42679[1](queued=0,recved=73800,sent=73800,sid=0x262183eaa21da96,lop=PING,est=1522651352906,to=9000,lcxid=0xe49f,lzxid=0x10004683d,lresp=1522717854847,llat=0,minlat=0,avglat=0,maxlat=1235)
>>
>> 4) health
>> $ echo mntr | nc $SERVER1 2181
>> zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT
>> zk_avg_latency 0
>> zk_max_latency 443
>> zk_min_latency 0
>> zk_packets_received 11158019
>> zk_packets_sent 11158244
>> zk_num_alive_connections 2
>> zk_outstanding_requests 0
>> zk_server_state follower
>> zk_znode_count 344
>> zk_watch_count 0
>> zk_ephemerals_count 3
>> zk_approximate_data_size 36654
>> zk_open_file_descriptor_count 33
>> zk_max_file_descriptor_count 65536
>>
>> 5) Could not find any special exception from zookeeper logs about the two
>> sessions
>>
>> Is this a known bug in version 3.4.6? what could be the potential cause
>> of the issue?
>>
>> Thanks,
>> Daniel
>>
>
>
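
For anyone cross-checking the quoted `stat` and `dump` output: `stat` prints the ephemeral owner in decimal, while `dump` prints session ids in hex, so it helps to convert before comparing. Below is a rough Python sketch of that conversion. The helper name `decode_stat` is made up for illustration. The zxid split (epoch in the high 32 bits, counter in the low 32 bits) is standard; reading the top 8 bits of the session id as the id of the server that created the session is an assumption about how 3.4.x generates session ids, so treat that field as a hint only.

```python
from datetime import datetime, timedelta, timezone


def decode_stat(owner: int, czxid: int, ctime_ms: int) -> dict:
    """Decode numeric znode stat fields (hypothetical helper, not a ZooKeeper API)."""
    # Build the timestamp from integer parts to avoid float rounding.
    created = datetime.fromtimestamp(ctime_ms // 1000, tz=timezone.utc)
    created += timedelta(milliseconds=ctime_ms % 1000)
    return {
        # `dump` shows session ids in hex; `stat` shows the owner in decimal.
        "session_id": hex(owner),
        # ASSUMPTION: top 8 bits hold the id of the server that created the session.
        "server_id": owner >> 56,
        # zxid layout: high 32 bits = leader epoch, low 32 bits = counter.
        "epoch": czxid >> 32,
        "counter": czxid & 0xFFFFFFFF,
        "created_utc": created.isoformat(timespec="milliseconds"),
    }


if __name__ == "__main__":
    # Values taken from the stat output on /brokers/ids/822712429.
    for key, value in decode_stat(99668799174148099, 4294967344, 1520825436933).items():
        print(f"{key}: {value}")
```

With the values from the report, the owner converts to 0x162183ea9f70003, which is the session listed for /brokers/ids/822712429 in the `dump` output, and the ctime converts to the creation time Daniel gave (2018-03-12 03:30:36.933 UTC).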
