[ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574296#comment-15574296 ]
guoping.gp commented on ZOOKEEPER-2615: --------------------------------------- REPRODUCE TEST CASE STEP BY STEP ==================================== 1. start zookeeper cluster, says server1(localhost:6181), server2(localhost:4181), server3(localhost:2181) let's assume now server2 is the LEADER. 2. use zkCli.sh create a persistence node /curator_exists_watch 3. use Curator Framework2.9.1 to create a simple Java Client Application, as following: ***** import org.apache.curator.framework.CuratorFramework; import org.apache.curator.framework.CuratorFrameworkFactory; import org.apache.curator.framework.api.CuratorWatcher; import org.apache.curator.retry.ExponentialBackoffRetry; import org.apache.zookeeper.WatchedEvent; public class ExistingWatchDemo { public static void main(String[] args) throws Exception { CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder(); builder = builder.connectString("localhost:2181,localhost:6181") .sessionTimeoutMs(15000) .connectionTimeoutMs(10000) .retryPolicy(new ExponentialBackoffRetry(5000, Integer.MAX_VALUE)); final CuratorFramework zkClient = builder.build(); zkClient.start(); CuratorWatcher watcher = new CuratorWatcher() { public void process(WatchedEvent event) throws Exception { System.out.println("received event " + event); System.out.println("watch again ..."); zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch"); } }; zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch"); Thread.sleep(600 * 1000); } } ***** to stay us to reproduce in more convenience, here we provide only 2 servers in the connection string instead of 3 both, says as server1(FOLLOWING) server3(FOLLOWING) 4. breakpoint at 2 places: FinalRequestProcessor.java: line 303 : zks.getZKDatabase().setWatches(relativeZxid, NIOServerCnxn.java: line 966: synchronized(factory.cnxns){ 5. running the client app ExistingWatchDemo.java, let say the client application connection to server3(2181) 6. use the IDE make a remote debug session connect to another server of the connection string which we provided, here server1(6181) 7. now kill -STOP server3(2181), the client app will find the server hang and reconnect to server1(6181), we can confirm the reconnect in the client's log: --------- 016-10-14 13:31:21,505 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session timed out, have not heard from server in 3754ms for sessionid 0x357c1a4462b000e, closing socket connection and attempting reconnect 2016-10-14 13:31:22,389 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to authenticate using SASL (unknown error) 2016-10-14 13:31:22,392 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection established to localhost/0:0:0:0:0:0:0:1:6181, initiating session 2016-10-14 13:31:22,392 [myid:] - DEBUG [main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session establishment request sent on localhost/0:0:0:0:0:0:0:1:6181 2016-10-14 13:31:22,407 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 0x357c1a4462b000e, negotiated timeout = 15000 --------- the reconnect will hit the breakpoint FinalRequestProcessor.java: line 303 as following: Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 8. now kill the process of the client app, kill -9 <pid of client app>, this will let us another breakpoint hit, as following Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 9. now resume the breakpoint(NIOServerCnxn:966) in step 8, after that, then resume the breakpoint (FinalRequestProcessor:303) in step 7 10. now use wchp and cons to check the leak watch, as following shows pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181 /curator_exists_watch 0x357c1a4462b000e pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181 /0:0:0:0:0:0:0:1:61702[0](queued=0,recved=1,sent=0) > Zookeeper server holds onto dead/expired session ids in the watch data > structures > --------------------------------------------------------------------------------- > > Key: ZOOKEEPER-2615 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.6 > Reporter: guoping.gp > > The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) still > can be found even with zookeeper 3.4.6. > this issue cause our production zookeeper cluster leak about 1 million > watchs, after restart the server one by one, the watch count decrease to only > about 40000. > I can reproduce the issue on my mac,here it is: > ------------------------------------------------------------------------ > pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181 > Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT > Latency min/avg/max: 0/1156/128513 > Received: 539 > Sent: 531 > Connections: 1 > Outstanding: 0 > Zxid: 0x100000037 > Mode: follower > Node count: 5 > ------------------------ > pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181 > > /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732d0000e,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0xffffffffffffffff,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17) > /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0) > ------------------------ > pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181 > /curator_exists_watch > 0x357be48e4d90007 > 0x357be48e4d90009 > 0x157be2732d0000e > as above 4-letter's report shows, 0x357be48e4d90007 and 0x357be48e4d90009 > are leaked after the two sessions expired -- This message was sent by Atlassian JIRA (v6.3.4#6332)