I think there needs to be some ZK-active methods to get the reconnect. I just
tried again and didn't always get the reconnect. However, if I add this line
after the second server start() it always works:
zkClient.create().forPath("/hey");
-JZ
On Jun 14, 2013, at 12:47 PM, Evaristo José Camarero
<[email protected]> wrote:
>
> I have tested with Curator 2.0.1 , curator 2.0.2 branch, master. Same result.
> Tested different timeouts and the result is the same. Let me know the branch
> you are using and I can try again.
>
> It looks that the issue is different in my opinion:
> I have checked Zookeeper server logs, and this is what I have found after
> starting again the server (Notice that is a server fresh start).
>
> Looks that zk client is trying to reconnect but ZK client last seen TX is
> zxid = 2, and last server zxid =0 because server has been restarted (and TX
> log is lost).
>
> 2013-06-14 21:37:46 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:46 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50436
> 2013-06-14 21:37:46 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50436 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:46 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50436 (no session established for client)
> 2013-06-14 21:37:46 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:47 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:47 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:47 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50437
> 2013-06-14 21:37:47 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50437 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:47 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50437 (no session established for client)
> 2013-06-14 21:37:47 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:49 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:49 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:49 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50438
> 2013-06-14 21:37:49 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50438 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:49 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50438 (no session established for client)
> 2013-06-14 21:37:49 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:50 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:50 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:50 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50439
> 2013-06-14 21:37:50 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50439 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:50 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50439 (no session established for client)
> 2013-06-14 21:37:50 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:52 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:52 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50440
> 2013-06-14 21:37:52 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:52 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50440 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:52 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50440 (no session established for client)
> 2013-06-14 21:37:52 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:54 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:54 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:54 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50441
> 2013-06-14 21:37:54 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50441 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:54 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50441 (no session established for client)
> 2013-06-14 21:37:54 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:55 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:55 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50442
> 2013-06-14 21:37:55 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:55 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50442 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:55 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50442 (no session established for client)
> 2013-06-14 21:37:55 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:56 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:56 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50443
> 2013-06-14 21:37:56 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:56 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50443 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:56 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50443 (no session established for client)
> 2013-06-14 21:37:56 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:37:58 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:37:58 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50444
> 2013-06-14 21:37:58 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:37:58 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50444 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:37:58 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50444 (no session established for client)
> 2013-06-14 21:37:58 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:38:00 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:38:00 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50445
> 2013-06-14 21:38:00 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:38:00 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50445 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:38:00 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50445 (no session established for client)
> 2013-06-14 21:38:00 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:38:02 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:38:02 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50446
> 2013-06-14 21:38:02 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:38:02 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50446 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:38:02 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50446 (no session established for client)
> 2013-06-14 21:38:02 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:38:04 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:38:04 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50447
> 2013-06-14 21:38:04 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:38:04 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50447 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:38:04 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50447 (no session established for client)
> 2013-06-14 21:38:04 INFO ClientCnxn:1085 - Unable to read additional data
> from server sessionid 0x13f4430e2930000, likely server has closed socket,
> closing socket connection and attempting reconnect
> 2013-06-14 21:38:05 INFO ClientCnxn:966 - Opening socket connection to
> server localhost.localdomain/127.0.0.1:40000. Will not attempt to
> authenticate using SASL (unknown error)
> 2013-06-14 21:38:05 INFO ClientCnxn:849 - Socket connection established to
> localhost.localdomain/127.0.0.1:40000, initiating session
> 2013-06-14 21:38:05 INFO NIOServerCnxnFactory:197 - Accepted socket
> connection from /127.0.0.1:50448
> 2013-06-14 21:38:05 INFO ZooKeeperServer:812 - Refusing session request for
> client /127.0.0.1:50448 as it has seen zxid 0x2 our last zxid is 0x0 client
> must try another server
> 2013-06-14 21:38:05 INFO NIOServerCnxn:1001 - Closed socket connection for
> client /127.0.0.1:50448 (no session established for client)
> 2013-06
>
> The same result
>
> Timing are not relevant in the tests. The ZK server is providing this logs
>
>
>
> De: Jordan Zimmerman <[email protected]>
> Para: Evaristo José Camarero <[email protected]>
> CC: "[email protected]" <[email protected]>
> Enviado: Viernes 14 de junio de 2013 21:13
> Asunto: Re: CuratorFramework is not recovering connecion with ZK
>
> Play with the timings a bit. I consistently see RECONNECTED. Also, make sure
> you're using the latest Curator. There's a bug fix in 2.0.1 that can affect
> this.
>
> -JZ
>
> On Jun 14, 2013, at 12:09 PM, Evaristo José Camarero
> <[email protected]> wrote:
>
>> Hi there,
>>
>> Thanks for the fast response.
>>
>> This is the output of the program with your modifications:
>>
>> 2013-06-14 21:05:01 INFO CuratorFrameworkImpl:221 - Starting
>> 2013-06-14 21:05:01 INFO ConnectionStateManager:151 - State change:
>> CONNECTED
>> NEw state CONNECTED
>> 2013-06-14 21:05:13 INFO ConnectionStateManager:151 - State change:
>> SUSPENDED
>> NEw state SUSPENDED
>> 2013-06-14 21:05:16 ERROR CuratorFrameworkImpl:530 - Background operation
>> retry gave up
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>> at
>> org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:487)
>> at
>> org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:50)
>> at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:606)
>> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> 2013-06-14 21:05:16 INFO ConnectionStateManager:151 - State change: LOST
>> NEw state LOST
>>
>>
>> You can notice that:
>> "NEW state RECONNECTED never appears"
>>
>> what basically means that CuratorFramework listener is not called, and
>> therefore I assume that Curator does not reconnect.
>>
>>
>>
>> Something very interesting is the fact that if you remove
>> "zkClient.create().forPath("/pedro", "juan".getBytes());" from the program
>> (basically you are not doing any TX, Curator is able to recover.
>>
>> Regards,
>>
>> Evaristo
>>
>>
>>
>> De: Jordan Zimmerman <[email protected]>
>> Para: [email protected]; Evaristo José Camarero
>> <[email protected]>
>> Enviado: Viernes 14 de junio de 2013 19:06
>> Asunto: Re: CuratorFramework is not recovering connecion with ZK
>>
>> Curator uses the timeouts given when the CuratorFramework object is created
>> to manage the connection. In this instance, after LOST has been received
>> Curator may wait until session timeout before trying to re-create the
>> ZooKeeper object. I'm enclosing a modified version of your test that sets
>> the Curator timeouts to half the time that the test's sleeps wait and you
>> can see that Curator recovers correctly.
>>
>> -Jordan
>>
>>
>>
>> On Jun 14, 2013, at 7:22 AM, Evaristo José Camarero
>> <[email protected]> wrote:
>>
>>>
>>> Hi there,
>>>
>>> I have found a case in which CuratorFramework is not recovering connection
>>> with ZK servers.
>>>
>>> The use case is the following:
>>> - Start ZK servers
>>> - Start application with CuratorFramework
>>> - ZK servers goes down.
>>> - ZK servers start again.
>>> - CuratorFramework app it is not notified that connection is reconnected ,
>>> but neither notifies that can not recover connection in any way, so
>>> application can not recover. Only option is to restart the app.
>>> Notice that if CuratorFRamework client is not making any transaction (just
>>> comment create() TX), the client is able to reconnect.
>>>
>>> I attach a program able to reproduce the problem
>>>
>>> Regards,
>>>
>>> Evaristo
>>>
>>>
>>>
>>> <CuratorFails.java>
>>
>>
>>
>
>
>