[jira] [Commented] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures

guoping.gp (JIRA) Thu, 13 Oct 2016 22:41:28 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574296#comment-15574296
 ]


guoping.gp commented on ZOOKEEPER-2615:
---------------------------------------

REPRODUCE TEST CASE STEP BY STEP
====================================
1. start zookeeper cluster, says server1(localhost:6181), 
server2(localhost:4181), server3(localhost:2181)
    let's assume now server2 is the LEADER.

2. use zkCli.sh create a persistence node /curator_exists_watch

3. use Curator Framework2.9.1 to create a simple Java Client Application,  as 
following:

*****

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.WatchedEvent;

public class ExistingWatchDemo {

        public static void main(String[] args) throws Exception {

                CuratorFrameworkFactory.Builder builder = 
CuratorFrameworkFactory.builder();
                builder = builder.connectString("localhost:2181,localhost:6181")
                                         .sessionTimeoutMs(15000)
                                         .connectionTimeoutMs(10000)
                                        .retryPolicy(new 
ExponentialBackoffRetry(5000, Integer.MAX_VALUE));
                
                final CuratorFramework zkClient = builder.build();

                zkClient.start();

                CuratorWatcher watcher = new CuratorWatcher() {

                        public void process(WatchedEvent event) throws 
Exception {
                                 System.out.println("received event " + event);
                                 System.out.println("watch again ...");
                                
zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch");
                        }
                        
                };
                
                
zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch");

                Thread.sleep(600 * 1000);

        }

}

*****

to stay us to reproduce in more convenience, here we provide only 2 servers in 
the connection string instead of 3 both, says as server1(FOLLOWING) 
server3(FOLLOWING)

4.  breakpoint at 2 places: 
 
    FinalRequestProcessor.java: line 303 :         
zks.getZKDatabase().setWatches(relativeZxid, 
    NIOServerCnxn.java: line 966:                     
synchronized(factory.cnxns){


5. running the client app ExistingWatchDemo.java, let say the client 
application connection to server3(2181)

6. use the IDE make a remote debug session connect to another server of the 
connection string which we provided, here server1(6181)

7. now kill -STOP server3(2181), the client app will find the server hang and 
reconnect to  server1(6181), we can confirm the reconnect in
the client's log:

---------
016-10-14 13:31:21,505 [myid:] - INFO  
[main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session 
timed out, have not heard from server in 3754ms for sessionid 
0x357c1a4462b000e, closing socket connection and attempting reconnect
2016-10-14 13:31:22,389 [myid:] - INFO  
[main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket 
connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to 
authenticate using SASL (unknown error)
2016-10-14 13:31:22,392 [myid:] - INFO  
[main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection 
established to localhost/0:0:0:0:0:0:0:1:6181, initiating session
2016-10-14 13:31:22,392 [myid:] - DEBUG 
[main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session 
establishment request sent on localhost/0:0:0:0:0:0:0:1:6181
2016-10-14 13:31:22,407 [myid:] - INFO  
[main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session 
establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 
0x357c1a4462b000e, negotiated timeout = 15000
---------

the reconnect will hit the breakpoint FinalRequestProcessor.java: line 303 as 
following:

Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in 
FinalRequestProcessor))        
        FinalRequestProcessor.processRequest(Request) line: 303 
        CommitProcessor.run() line: 74  

8. now kill the process of  the client app, kill -9 <pid of client app>, this 
will let us another breakpoint hit, as following
 
Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended 
(breakpoint at line 966 in NIOServerCnxn)) 
        NIOServerCnxn.close() line: 966 
        NIOServerCnxn.doIO(SelectionKey) line: 360      
        NIOServerCnxnFactory.run() line: 208    
        Thread.run() line: 745  

9.  now resume the breakpoint(NIOServerCnxn:966) in step 8, after that, then 
resume the breakpoint (FinalRequestProcessor:303) in step 7

10. now use wchp and cons to check the leak watch, as following shows

pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
/curator_exists_watch
        0x357c1a4462b000e

pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
 /0:0:0:0:0:0:0:1:61702[0](queued=0,recved=1,sent=0)
 

> Zookeeper server holds onto dead/expired session ids in the watch data 
> structures
> ---------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2615
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: guoping.gp
>
> The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) still 
> can be found even with zookeeper 3.4.6.
> this issue cause our production zookeeper cluster leak about 1 million 
> watchs, after restart the server one by one, the watch count decrease to only 
> about 40000.
> I can reproduce the issue on my mac,here it is:
> ------------------------------------------------------------------------
> pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181
> Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
> Latency min/avg/max: 0/1156/128513
> Received: 539
> Sent: 531
> Connections: 1
> Outstanding: 0
> Zxid: 0x100000037
> Mode: follower
> Node count: 5
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
>  
> /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732d0000e,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0xffffffffffffffff,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17)
>  /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0)
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
> /curator_exists_watch
>               0x357be48e4d90007
>               0x357be48e4d90009
>               0x157be2732d0000e
> as above 4-letter's report shows,     0x357be48e4d90007 and 0x357be48e4d90009 
> are leaked after the two sessions expired 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures

Reply via email to