I have tested with Curator 2.0.1 , curator 2.0.2 branch, master. Same result. Tested different timeouts and the result is the same. Let me know the branch you are using and I can try again.
It looks that the issue is different in my opinion: I have checked Zookeeper server logs, and this is what I have found after starting again the server (Notice that is a server fresh start). Looks that zk client is trying to reconnect but ZK client last seen TX is zxid = 2, and last server zxid =0 because server has been restarted (and TX log is lost). 2013-06-14 21:37:46 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:46 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50436 2013-06-14 21:37:46 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50436 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:46 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50436 (no session established for client) 2013-06-14 21:37:46 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:47 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:47 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:47 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50437 2013-06-14 21:37:47 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50437 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:47 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50437 (no session established for client) 2013-06-14 21:37:47 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:49 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:49 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:49 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50438 2013-06-14 21:37:49 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50438 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:49 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50438 (no session established for client) 2013-06-14 21:37:49 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:50 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:50 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:50 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50439 2013-06-14 21:37:50 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50439 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:50 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50439 (no session established for client) 2013-06-14 21:37:50 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:52 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:52 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50440 2013-06-14 21:37:52 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:52 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50440 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:52 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50440 (no session established for client) 2013-06-14 21:37:52 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:54 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:54 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:54 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50441 2013-06-14 21:37:54 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50441 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:54 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50441 (no session established for client) 2013-06-14 21:37:54 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:55 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:55 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50442 2013-06-14 21:37:55 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:55 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50442 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:55 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50442 (no session established for client) 2013-06-14 21:37:55 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:56 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:56 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50443 2013-06-14 21:37:56 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:56 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50443 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:56 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50443 (no session established for client) 2013-06-14 21:37:56 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:37:58 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:37:58 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50444 2013-06-14 21:37:58 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:37:58 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50444 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:37:58 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50444 (no session established for client) 2013-06-14 21:37:58 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:38:00 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:38:00 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50445 2013-06-14 21:38:00 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:38:00 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50445 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:38:00 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50445 (no session established for client) 2013-06-14 21:38:00 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:38:02 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:38:02 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50446 2013-06-14 21:38:02 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:38:02 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50446 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:38:02 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50446 (no session established for client) 2013-06-14 21:38:02 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:38:04 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:38:04 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50447 2013-06-14 21:38:04 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:38:04 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50447 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:38:04 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50447 (no session established for client) 2013-06-14 21:38:04 INFO ClientCnxn:1085 - Unable to read additional data from server sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting reconnect 2013-06-14 21:38:05 INFO ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate using SASL (unknown error) 2013-06-14 21:38:05 INFO ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000, initiating session 2013-06-14 21:38:05 INFO NIOServerCnxnFactory:197 - Accepted socket connection from /127.0.0.1:50448 2013-06-14 21:38:05 INFO ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50448 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server 2013-06-14 21:38:05 INFO NIOServerCnxn:1001 - Closed socket connection for client /127.0.0.1:50448 (no session established for client) 2013-06 The same result Timing are not relevant in the tests. The ZK server is providing this logs ________________________________ De: Jordan Zimmerman <[email protected]> Para: Evaristo José Camarero <[email protected]> CC: "[email protected]" <[email protected]> Enviado: Viernes 14 de junio de 2013 21:13 Asunto: Re: CuratorFramework is not recovering connecion with ZK Play with the timings a bit. I consistently see RECONNECTED. Also, make sure you're using the latest Curator. There's a bug fix in 2.0.1 that can affect this. -JZ On Jun 14, 2013, at 12:09 PM, Evaristo José Camarero <[email protected]> wrote: Hi there, > > >Thanks for the fast response. > > >This is the output of the program with your modifications: > > >2013-06-14 21:05:01 INFO CuratorFrameworkImpl:221 - Starting >2013-06-14 21:05:01 INFO ConnectionStateManager:151 - State change: CONNECTED >NEw state CONNECTED >2013-06-14 21:05:13 INFO ConnectionStateManager:151 - State change: SUSPENDED >NEw state SUSPENDED >2013-06-14 21:05:16 ERROR CuratorFrameworkImpl:530 - Background operation >retry gave up >org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode >= ConnectionLoss >at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) >at >org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:487) >at >org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:50) >at >org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:606) >at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) >2013-06-14 21:05:16 INFO ConnectionStateManager:151 - State change: LOST >NEw state LOST > > > > >You can notice that: >"NEW state RECONNECTED never appears" > > > >what basically means that CuratorFramework listener is not called, and >therefore I assume that Curator does not reconnect. > > > > > > >Something very interesting is the fact that if you remove >"zkClient.create().forPath("/pedro", "juan".getBytes());" from the program >(basically you are not doing any TX, Curator is able to recover. > > >Regards, > > >Evaristo > > > > > > > >________________________________ > De: Jordan Zimmerman <[email protected]> >Para: [email protected]; Evaristo José Camarero ><[email protected]> >Enviado: Viernes 14 de junio de 2013 19:06 >Asunto: Re: CuratorFramework is not recovering connecion with ZK > > > >Curator uses the timeouts given when the CuratorFramework object is created to >manage the connection. In this instance, after LOST has been received Curator >may wait until session timeout before trying to re-create the ZooKeeper >object. I'm enclosing a modified version of your test that sets the Curator >timeouts to half the time that the test's sleeps wait and you can see that >Curator recovers correctly. > > >-Jordan > > > > > >On Jun 14, 2013, at 7:22 AM, Evaristo José Camarero ><[email protected]> wrote: > > >> >>Hi there, >> >> >>I have found a case in which CuratorFramework is not recovering connection >>with ZK servers. >> >> >>The use case is the following: >>- Start ZK servers >>- Start application with CuratorFramework >>- ZK servers goes down. >>- ZK servers start again. >>- CuratorFramework app it is not notified that connection is reconnected , >>but neither notifies that can not recover connection in any way, so >>application can not recover. Only option is to restart the app. >>Notice that if CuratorFRamework client is not making any transaction (just >>comment create() TX), the client is able to reconnect. >> >> >> >>I attach a program able to reproduce the problem >> >> >>Regards, >> >> >>Evaristo >> >> >> >> >> >><CuratorFails.java> > > >
