Re: zk keeps disconnecting and reconnecting

2011-08-31 Thread Patrick Hunt
Based on past experience I believe it's going to take a fix release or
two before 3.4 is rock solid, so I personally think we should do a 3.3.4.
Notice there are 6 blockers currently listed for 3.3.4:
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+ZOOKEEPER+AND+fixVersion+%3D+12316276+ORDER+BY+priority+DESC%2C+key+DESC

I'd be happy to RM 3.3.4 if no one else is available to do it. My goal
would be to push out a fix release containing the current listed
blockers plus anything else that's currently available.

Patrick

On Tue, Aug 30, 2011 at 4:45 PM, Benjamin Reed br...@apache.org wrote:
 i have been wondering about 3.3.4. there are so many great bugs that
 were fixed in 3.4.0 that it isn't clear what we should put into 3.3.4
 or if we should even do it. the chroot bug does seem like a good one
 to do a 3.3.4 release for.

 ben

 On Mon, Aug 29, 2011 at 12:45 PM, Mahadev Konar maha...@hortonworks.com 
 wrote:
 Camille,
  I will be cutting a branch sometime this week. Just waiting for
 ZOOKEEPER-999 to get in. Other than that, we are probably 2 weeks away from
 the release.
  3.3.4 would be good even if we have 3.4 coming in a week or 2. That's
 because 3.4.0 might take some time to stabilize, and 3.3.4 would be a good
 stable release (recommended for production use) until 3.4 stabilizes.
  Does that sound reasonable? Others?

 thanks
 mahadev

 On Aug 29, 2011, at 12:38 PM, Fournier, Camille F. wrote:

 Yeah let's put it in 3.3.4. What's the plan for 3.4? I thought we were 
 almost ready for that.

 C

 -Original Message-
 From: Mahadev Konar [mailto:maha...@hortonworks.com]
 Sent: Monday, August 29, 2011 2:10 PM
 To: u...@zookeeper.apache.org
 Subject: Re: zk keeps disconnecting and reconnecting

 Camille,
 Do you think we should put the fix in 3.3.4? I think 3.4 might take a while 
 to stabilize, so 3.3.4 would be a good release to get this in.

 Thoughts?

 mahadev

 On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote:

 Well, it causes the problem you are seeing. If you set any watchers with a 
 chroot and then your client gets disconnected with these watches 
 outstanding, when you reconnect you will try to reset them and they are 
 probably on paths that don't exist (if you are creating everything under 
 path /kafka-tracking). So you get a notification about the watches 
 immediately after resetting them, which causes the string out of bounds 
 exception.

 The only fix is to disable auto watch reset, and then have your own client 
 reset watches when it gets a reconnected event. I suspect it would be 
 easier for you to take a shot at fixing the bug than to rewrite your 
 client to handle this. Thomas provided a patch with tests that presumably 
 show the error, so all you need is a fix to make them pass.
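
A minimal sketch of the workaround described above, assuming the client-side system property is named zookeeper.disableAutoWatchReset and must be set before the client is created. ManualWatchReset, the reRegister callback and the example paths are hypothetical and not part of the ZooKeeper API. In a real client the callback would issue a call such as exists, getData or getChildren with a watcher.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical helper, not part of the ZooKeeper API: remembers which
// client-relative paths are watched and re-registers them after a reconnect.
class ManualWatchReset {
    private final Set<String> watchedPaths = new LinkedHashSet<>();
    private final Consumer<String> reRegister;

    // reRegister stands in for a real call such as zk.exists(path, watcher).
    ManualWatchReset(Consumer<String> reRegister) {
        this.reRegister = reRegister;
    }

    synchronized void track(String clientPath) {
        watchedPaths.add(clientPath);
    }

    synchronized void untrack(String clientPath) {
        watchedPaths.remove(clientPath);
    }

    // Call this from the session watcher when the state becomes SyncConnected.
    void onSyncConnected() {
        List<String> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(watchedPaths);
        }
        for (String path : snapshot) {
            reRegister.accept(path);
        }
    }

    public static void main(String[] args) {
        // Assumed property name; set it before constructing the client.
        System.setProperty("zookeeper.disableAutoWatchReset", "true");

        List<String> reRegistered = new ArrayList<>();
        ManualWatchReset reset = new ManualWatchReset(reRegistered::add);
        reset.track("/consumers"); // hypothetical paths, relative to the chroot
        reset.track("/brokers");
        reset.onSyncConnected();
        System.out.println(reRegistered); // prints [/consumers, /brokers]
    }
}
```

Because the re-registration goes through the normal request path, the chroot is prepended as usual. Changes that happen between the disconnect and the re-registration are not delivered as events, so the callback should also re-read the state it cares about.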


 C

 -Original Message-
 From: Jun Rao [mailto:jun...@gmail.com]
 Sent: Monday, August 29, 2011 12:39 PM
 To: u...@zookeeper.apache.org; tho...@koch.ro
 Subject: Re: zk keeps disconnecting and reconnecting

 What's the impact of ZOOKEEPER-961? If it shows up, does that mean the
 client won't get any watcher events afterwards? If so, this sounds like a
 blocker for 3.4 release to me. What's the temporary solution for 3.3.3?

 Also, for the very first time that the ZK client gets disconnected, I saw
 the following entry in the log. It seems that the client can't ping the
 server for 4 seconds. The ZK server was up at that time and the load was
 minimal. What could cause the time out? Client GC pauses?

 2011/08/26 10:58:22.306 INFO [ClientCnxn]
 [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out,
 have not heard from server in 4001ms for sessionid 0x131fddd84bc0006,
 closing socket connection and attempting reconnect
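
A note on the 4001ms figure, as a sketch rather than a diagnosis: as far as I know, the Java client gives up on a server after two thirds of the negotiated session timeout has passed with no traffic. With the "negotiated timeout = 6000" logged elsewhere in this thread, that works out to 4000ms. The class and method names below are hypothetical. This only accounts for the threshold, not for why the server was silent for that long.

```java
// Hypothetical illustration of the client read-timeout arithmetic.
class ReadTimeoutSketch {
    // Assumption: read timeout = 2/3 of the negotiated session timeout.
    static int readTimeoutMs(int negotiatedSessionTimeoutMs) {
        return negotiatedSessionTimeoutMs * 2 / 3;
    }

    public static void main(String[] args) {
        System.out.println(readTimeoutMs(6000)); // prints 4000
    }
}
```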

 Thanks,

 Jun

 On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch tho...@koch.ro wrote:

 Fournier, Camille F.:
 Did anyone ever check resetting watches at client reconnect on a client
 with a chroot? Looking at the code, we store the watches associated with
 the non-chroot path, but they are set by the original request prepending
 chroot to the request. However, it looks like the SetWatches request on
 reconnect just calls get on the various watch lists from ZooKeeper, which
 don't have the prepended chroot.

 I haven't written a test but I would bet dollars to donuts this is the
 problem.

 C
 seems to be this:
 ZOOKEEPER-961, ZOOKEEPER-1091

 Regards,

 Thomas Koch, http://www.koch.ro







RE: zk keeps disconnecting and reconnecting

2011-08-29 Thread Fournier, Camille F.
Did anyone ever check resetting watches at client reconnect on a client with a 
chroot? Looking at the code, we store the watches associated with the 
non-chroot path, but they are set by the original request prepending chroot to 
the request. However, it looks like the SetWatches request on reconnect just 
calls get on the various watch lists from ZooKeeper, which don't have the 
prepended chroot.

I haven't written a test but I would bet dollars to donuts this is the problem.
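
To illustrate the suspected mismatch, here is a minimal sketch of the translation the reconnect path would need to apply: turning the stored client-relative watch paths into server paths before they go into the SetWatches request. This is not the actual patch for ZOOKEEPER-961. ChrootPaths and its methods are hypothetical names, and the example watch paths are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps client-relative paths to server paths.
class ChrootPaths {
    static String toServerPath(String chrootPath, String clientPath) {
        if (chrootPath == null) {
            return clientPath; // no chroot configured, nothing to prepend
        }
        // The client's "/" corresponds to the chroot node itself.
        return clientPath.equals("/") ? chrootPath : chrootPath + clientPath;
    }

    static List<String> toServerPaths(String chrootPath, List<String> clientPaths) {
        List<String> serverPaths = new ArrayList<>(clientPaths.size());
        for (String clientPath : clientPaths) {
            serverPaths.add(toServerPath(chrootPath, clientPath));
        }
        return serverPaths;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("/", "/a/b"); // made-up watch paths
        System.out.println(toServerPaths("/kafka-tracking", stored));
        // prints [/kafka-tracking, /kafka-tracking/a/b]
    }
}
```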

C

-Original Message-
From: Jun Rao [mailto:jun...@gmail.com] 
Sent: Monday, August 29, 2011 12:34 AM
To: u...@zookeeper.apache.org
Subject: Re: zk keeps disconnecting and reconnecting

We cleaned up all ZK server data and restarted both the servers and the
clients. We also upgraded the client to 3.3.3. After running for a day and a
half, the same weird reconnect issue showed up in one of the clients. Our ZK
connection string is
esv4-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking.
We are on Java 1.6.0_21 on RedHat Linux. Note that our ZK client had been
running fine until we upgraded the client code recently. The new version
makes one extra ZK connection to the same ZK cluster. Here are the log
entries; the ZK client keeps connecting to and disconnecting from each of
the 4 ZK servers.

2011/08/26 10:58:39.864 INFO [ClientCnxn]
[main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to
server esv4-app27.stg/172.18.98.88:12913
2011/08/26 10:58:39.865 INFO [ClientCnxn]
[main-SendThread(esv4-app27.stg:12913)] [kafka] Socket connection
established to esv4-app27.stg/172.18.98.88:12913, initiating session
2011/08/26 10:58:39.867 INFO [ClientCnxn]
[main-SendThread(esv4-app27.stg:12913)] [kafka] Session establishment
complete on server esv4-app27.stg/172.18.98.88:12913, sessionid =
0x131fddd84bc0006, negotiated timeout = 6000
2011/08/26 10:58:39.867 INFO [ZkClient] [main-EventThread] [kafka] zookeeper
state changed (SyncConnected)
2011/08/26 10:58:39.868 WARN [ClientCnxn]
[main-SendThread(esv4-app27.stg:12913)] [kafka] Session 0x131fddd84bc0006
for server esv4-app27.stg/172.18.98.88:12913, unexpected error, closing
socket connection and attempting reconnect
java.lang.StringIndexOutOfBoundsException: String index out of range: -3
at java.lang.String.substring(String.java:1937)
at java.lang.String.substring(String.java:1904)
at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:794)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:881)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)
2011/08/26 10:58:39.969 INFO [ZkClient] [main-EventThread] [kafka] zookeeper
state changed (Disconnected)
2011/08/26 10:58:40.276 INFO [ClientCnxn]
[main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to
server esv4-app29.stg/172.18.98.89:12913
2011/08/26 10:58:40.276 INFO [ClientCnxn]
[main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection
established to esv4-app29.stg/172.18.98.89:12913, initiating session
2011/08/26 10:58:40.278 INFO [ClientCnxn]
[main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment
complete on server esv4-app29.stg/172.18.98.89:12913, sessionid =
0x131fddd84bc0006, negotiated timeout = 6000
2011/08/26 10:58:40.278 INFO [ZkClient] [main-EventThread] [kafka] zookeeper
state changed (SyncConnected)
2011/08/26 10:58:40.279 WARN [ClientCnxn]
[main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x131fddd84bc0006
for server esv4-app29.stg/172.18.98.89:12913, unexpected error, closing
socket connection and attempting reconnect
java.lang.StringIndexOutOfBoundsException: String index out of range: -3
at java.lang.String.substring(String.java:1937)
at java.lang.String.substring(String.java:1904)
at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:794)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:881)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)
2011/08/26 10:58:40.380 INFO [ZkClient] [main-EventThread] [kafka] zookeeper
state changed (Disconnected)
2011/08/26 10:58:40.515 INFO [ClientCnxn]
[main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection to
server esv4-app30.stg/172.18.98.90:12913

Thanks,

Jun

On Thu, Aug 25, 2011 at 9:34 AM, Patrick Hunt ph...@apache.org wrote:

 The client seeing the problem in this case is 3.3.0, I see this based
 on the line number in the stack trace not matching up with 3.3.3, with
 3.3.0 it's this line:

   event.setPath(serverPath.substring(chrootPath.length()));

 so for some reason your chroot path is negative in length? That's just
 not possible (string.length() should never return negative).
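
For what it's worth, substring can fail this way without any negative length: it is enough for the server path to be shorter than the chroot path. A minimal sketch, assuming the chroot /kafka-tracking from the connect string posted in this thread and a made-up 12-character server path. ChrootSubstringDemo and stripChroot are hypothetical names.

```java
// Hypothetical demo of the quoted substring call with a too-short server path.
class ChrootSubstringDemo {
    // Same shape as the quoted 3.3.0 line: drop the chroot prefix.
    static String stripChroot(String serverPath, String chrootPath) {
        return serverPath.substring(chrootPath.length());
    }

    public static void main(String[] args) {
        String chrootPath = "/kafka-tracking"; // 15 characters

        // A path that really starts with the chroot is stripped as intended.
        System.out.println(stripChroot("/kafka-tracking/a/b", chrootPath)); // prints /a/b

        // A path without the prefix: 12 characters, 3 fewer than the chroot.
        try {
            stripChroot("/twelve-char", chrootPath);
        } catch (StringIndexOutOfBoundsException e) {
            // The exact message wording depends on the JDK version.
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

A difference of 3 between the two lengths would be consistent with the "-3" in the reported trace, though that is an inference from the numbers and not something confirmed in the logs.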

 What JVM are you using? What does your client connect string look like?

 Patrick

 On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao jun...@gmail.com wrote:
  I have a ZK server

Re: zk keeps disconnecting and reconnecting

2011-08-29 Thread Thomas Koch
Fournier, Camille F.:
 Did anyone ever check resetting watches at client reconnect on a client
 with a chroot? Looking at the code, we store the watches associated with
 the non-chroot path, but they are set by the original request prepending
 chroot to the request. However, it looks like the SetWatches request on
 reconnect just calls get on the various watch lists from ZooKeeper, which
 don't have the prepended chroot.
 
 I haven't written a test but I would bet dollars to donuts this is the
 problem.
 
 C
seems to be this:
ZOOKEEPER-961, ZOOKEEPER-1091

Regards,

Thomas Koch, http://www.koch.ro