[jira] [Commented] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15615182#comment-15615182 ] guoping.gp commented on ZOOKEEPER-2615: --- Is there any recommended fix option on this issue? > Zookeeper server holds onto dead/expired session ids in the watch data > structures > - > > Key: ZOOKEEPER-2615 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.4.6 >Reporter: guoping.gp >Assignee: Camille Fournier > Fix For: 3.4.10, 3.5.3, 3.6.0 > > > The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) still > can be found even with zookeeper 3.4.6. > this issue cause our production zookeeper cluster leak about 1 million > watchs, after restart the server one by one, the watch count decrease to only > about 4. > I can reproduce the issue on my mac,here it is: > > pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181 > Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT > Latency min/avg/max: 0/1156/128513 > Received: 539 > Sent: 531 > Connections: 1 > Outstanding: 0 > Zxid: 0x10037 > Mode: follower > Node count: 5 > > pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181 > > /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732de,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0x,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17) > /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0) > > pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181 > /curator_exists_watch > 0x357be48e4d90007 > 0x357be48e4d90009 > 0x157be2732de > as above 4-letter's report shows, 0x357be48e4d90007 and 0x357be48e4d90009 > are leaked after the two sessions expired -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574315#comment-15574315 ] guoping.gp edited comment on ZOOKEEPER-1382 at 10/14/16 5:47 AM: - the same issue is still can be found on 3.4.6, https://issues.apache.org/jira/browse/ZOOKEEPER-2615. when the client app reconnect to a server, the setWatchs and close still stay in the race condition as following possible: Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 and then Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 the issue cause our production leak about 1 million watchs was (Author: guoping.gp): the same issue is still can be found on 3.4.6, https://issues.apache.org/jira/browse/ZOOKEEPER-2615. when the client app reconnect to a server, the setWatchs and close still stay in the race condition as following possible: Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 and then Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 > Zookeeper server holds onto dead/expired session ids in the watch data > structures > - > > Key: ZOOKEEPER-1382 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.4.5 >Reporter: Neha Narkhede >Assignee: Germán Blanco >Priority: Critical > Fix For: 3.4.6, 3.5.0 > > Attachments: ZOOKEEPER-1382-branch-3.4.patch, > ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382-branch-3.4.patch, > ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, > ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, > ZOOKEEPER-1382_3.3.4.patch > > > I've observed that zookeeper server holds onto expired session ids in the > watcher data structures. The result is the wchp command reports session ids > that cannot be found through cons/dump and those expired session ids sit > there maybe until the server is restarted. Here are snippets from the client > and the server logs that lead to this state, for one particular session id > 0x134485fd7bcb26f - > There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 > and I'm using ZkClient to connect to the cluster > From the application log - > application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] > [main-SendThread(223.prod:12913)] [application Session establishment complete > on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, > negotiated timeout = 6000 > application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] > [main-SendThread(223.prod:12913)] [application] Client session timed out, > have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing > socket connection and attempting reconnect > application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] > [main-SendThread(226.prod:12913)] [application] Unable to reconnect to > ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket > connection > On the leader zk, 225 - > zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO > [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, > timeout of 6000ms exceeded > zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO > [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination > for sessionid: 0x134485fd7bcb26f > On the server, the client was initially connected to, 223 - > zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO > [CommitProcessor:1:NIOServerCnxn@1580] - Established session > 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020 > zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO > [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client > /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f > Here are the log snippets from 226, which is the server, the client > reconnected to, before getting session expired event - > 2012-01-27 09:52:38,190 - INFO
[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574315#comment-15574315 ] guoping.gp commented on ZOOKEEPER-1382: --- the same issue is still can be found on 3.4.6, https://issues.apache.org/jira/browse/ZOOKEEPER-2615. when the client app reconnect to a server, the setWatchs and close still stay in the race condition as following possible: Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 and then Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 > Zookeeper server holds onto dead/expired session ids in the watch data > structures > - > > Key: ZOOKEEPER-1382 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.4.5 >Reporter: Neha Narkhede >Assignee: Germán Blanco >Priority: Critical > Fix For: 3.4.6, 3.5.0 > > Attachments: ZOOKEEPER-1382-branch-3.4.patch, > ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382-branch-3.4.patch, > ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, > ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, > ZOOKEEPER-1382_3.3.4.patch > > > I've observed that zookeeper server holds onto expired session ids in the > watcher data structures. The result is the wchp command reports session ids > that cannot be found through cons/dump and those expired session ids sit > there maybe until the server is restarted. Here are snippets from the client > and the server logs that lead to this state, for one particular session id > 0x134485fd7bcb26f - > There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 > and I'm using ZkClient to connect to the cluster > From the application log - > application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] > [main-SendThread(223.prod:12913)] [application Session establishment complete > on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, > negotiated timeout = 6000 > application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] > [main-SendThread(223.prod:12913)] [application] Client session timed out, > have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing > socket connection and attempting reconnect > application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] > [main-SendThread(226.prod:12913)] [application] Unable to reconnect to > ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket > connection > On the leader zk, 225 - > zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO > [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, > timeout of 6000ms exceeded > zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO > [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination > for sessionid: 0x134485fd7bcb26f > On the server, the client was initially connected to, 223 - > zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO > [CommitProcessor:1:NIOServerCnxn@1580] - Established session > 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020 > zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO > [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client > /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f > Here are the log snippets from 226, which is the server, the client > reconnected to, before getting session expired event - > 2012-01-27 09:52:38,190 - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client > attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367 > 2012-01-27 09:52:38,191 - INFO > [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session > 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired > 2012-01-27 09:52:38,191 - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed > socket connection for client /172.17.136.82:49367 which had sessionid > 0x134485fd7bcb26f > wchp output from 226, taken on 01/30 - > nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f > *226.*wchp* | wc -l > 3 > wchp output from 223, taken on 01/30 - > nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f > *223.*wchp* | wc -l > 0 > cons output from
[jira] [Comment Edited] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574296#comment-15574296 ] guoping.gp edited comment on ZOOKEEPER-2615 at 10/14/16 5:41 AM: - REPRODUCE TEST CASE STEP BY STEP 1). start zookeeper cluster, says server1(localhost:6181), server2(localhost:4181), server3(localhost:2181) let's assume now server2 is the LEADER. 2). use zkCli.sh create a persistence node /curator_exists_watch 3). use Curator Framework2.9.1 to create a simple Java Client Application, as following: * import org.apache.curator.framework.CuratorFramework; import org.apache.curator.framework.CuratorFrameworkFactory; import org.apache.curator.framework.api.CuratorWatcher; import org.apache.curator.retry.ExponentialBackoffRetry; import org.apache.zookeeper.WatchedEvent; public class ExistingWatchDemo { public static void main(String[] args) throws Exception { CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder(); builder = builder.connectString("localhost:2181,localhost:6181") .sessionTimeoutMs(15000) .connectionTimeoutMs(1) .retryPolicy(new ExponentialBackoffRetry(5000, Integer.MAX_VALUE)); final CuratorFramework zkClient = builder.build(); zkClient.start(); CuratorWatcher watcher = new CuratorWatcher() { public void process(WatchedEvent event) throws Exception { System.out.println("received event " + event); System.out.println("watch again ..."); zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch"); } }; zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch"); Thread.sleep(600 * 1000); } } * to stay us to reproduce in more convenience, here we provide only 2 servers in the connection string instead of 3 both, says as server1(FOLLOWING) server3(FOLLOWING) 4). breakpoint at 2 places: FinalRequestProcessor.java: line 303 : zks.getZKDatabase().setWatches(relativeZxid, NIOServerCnxn.java: line 966: synchronized(factory.cnxns){ 5). running the client app ExistingWatchDemo.java, let say the client application connection to server3(2181) 6). use the IDE make a remote debug session connect to another server of the connection string which we provided, here server1(6181) 7). now kill -STOP server3(2181), the client app will find the server hang and reconnect to server1(6181), we can confirm the reconnect in the client's log: - 016-10-14 13:31:21,505 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session timed out, have not heard from server in 3754ms for sessionid 0x357c1a4462b000e, closing socket connection and attempting reconnect 2016-10-14 13:31:22,389 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to authenticate using SASL (unknown error) 2016-10-14 13:31:22,392 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection established to localhost/0:0:0:0:0:0:0:1:6181, initiating session 2016-10-14 13:31:22,392 [myid:] - DEBUG [main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session establishment request sent on localhost/0:0:0:0:0:0:0:1:6181 2016-10-14 13:31:22,407 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 0x357c1a4462b000e, negotiated timeout = 15000 - the reconnect will hit the breakpoint FinalRequestProcessor.java: line 303 as following: Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 8). now kill the process of the client app, kill -9 , this will let us another breakpoint hit, as following Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 9). now resume the breakpoint(NIOServerCnxn:966) in step 8, after that, then resume the breakpoint (FinalRequestProcessor:303) in step 7 10).
[jira] [Commented] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574296#comment-15574296 ] guoping.gp commented on ZOOKEEPER-2615: --- REPRODUCE TEST CASE STEP BY STEP 1. start zookeeper cluster, says server1(localhost:6181), server2(localhost:4181), server3(localhost:2181) let's assume now server2 is the LEADER. 2. use zkCli.sh create a persistence node /curator_exists_watch 3. use Curator Framework2.9.1 to create a simple Java Client Application, as following: * import org.apache.curator.framework.CuratorFramework; import org.apache.curator.framework.CuratorFrameworkFactory; import org.apache.curator.framework.api.CuratorWatcher; import org.apache.curator.retry.ExponentialBackoffRetry; import org.apache.zookeeper.WatchedEvent; public class ExistingWatchDemo { public static void main(String[] args) throws Exception { CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder(); builder = builder.connectString("localhost:2181,localhost:6181") .sessionTimeoutMs(15000) .connectionTimeoutMs(1) .retryPolicy(new ExponentialBackoffRetry(5000, Integer.MAX_VALUE)); final CuratorFramework zkClient = builder.build(); zkClient.start(); CuratorWatcher watcher = new CuratorWatcher() { public void process(WatchedEvent event) throws Exception { System.out.println("received event " + event); System.out.println("watch again ..."); zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch"); } }; zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch"); Thread.sleep(600 * 1000); } } * to stay us to reproduce in more convenience, here we provide only 2 servers in the connection string instead of 3 both, says as server1(FOLLOWING) server3(FOLLOWING) 4. breakpoint at 2 places: FinalRequestProcessor.java: line 303 : zks.getZKDatabase().setWatches(relativeZxid, NIOServerCnxn.java: line 966: synchronized(factory.cnxns){ 5. running the client app ExistingWatchDemo.java, let say the client application connection to server3(2181) 6. use the IDE make a remote debug session connect to another server of the connection string which we provided, here server1(6181) 7. now kill -STOP server3(2181), the client app will find the server hang and reconnect to server1(6181), we can confirm the reconnect in the client's log: - 016-10-14 13:31:21,505 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session timed out, have not heard from server in 3754ms for sessionid 0x357c1a4462b000e, closing socket connection and attempting reconnect 2016-10-14 13:31:22,389 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to authenticate using SASL (unknown error) 2016-10-14 13:31:22,392 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection established to localhost/0:0:0:0:0:0:0:1:6181, initiating session 2016-10-14 13:31:22,392 [myid:] - DEBUG [main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session establishment request sent on localhost/0:0:0:0:0:0:0:1:6181 2016-10-14 13:31:22,407 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 0x357c1a4462b000e, negotiated timeout = 15000 - the reconnect will hit the breakpoint FinalRequestProcessor.java: line 303 as following: Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor)) FinalRequestProcessor.processRequest(Request) line: 303 CommitProcessor.run() line: 74 8. now kill the process of the client app, kill -9 , this will let us another breakpoint hit, as following Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn)) NIOServerCnxn.close() line: 966 NIOServerCnxn.doIO(SelectionKey) line: 360 NIOServerCnxnFactory.run() line: 208 Thread.run() line: 745 9. now resume the breakpoint(NIOServerCnxn:966) in step 8, after that, then resume the breakpoint (FinalRequestProcessor:303) in step 7 10. now use wchp and cons to check the leak watch, as following
[jira] [Created] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures
guoping.gp created ZOOKEEPER-2615: - Summary: Zookeeper server holds onto dead/expired session ids in the watch data structures Key: ZOOKEEPER-2615 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.6 Reporter: guoping.gp The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) still can be found even with zookeeper 3.4.6. this issue cause our production zookeeper cluster leak about 1 million watchs, after restart the server one by one, the watch count decrease to only about 4. I can reproduce the issue on my mac,here it is: pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181 Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT Latency min/avg/max: 0/1156/128513 Received: 539 Sent: 531 Connections: 1 Outstanding: 0 Zxid: 0x10037 Mode: follower Node count: 5 pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181 /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732de,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0x,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17) /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0) pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181 /curator_exists_watch 0x357be48e4d90007 0x357be48e4d90009 0x157be2732de as above 4-letter's report shows, 0x357be48e4d90007 and 0x357be48e4d90009 are leaked after the two sessions expired -- This message was sent by Atlassian JIRA (v6.3.4#6332)