Ted, Yes, we could have used just 3 servers. It just happens that we are overlapping ZK servers on each of the 4 Kafka servers.
Jun On Tue, Aug 23, 2011 at 4:07 PM, Ted Dunning <[email protected]> wrote: > Jun, > > I don't see what your connection loss problem is, but I do have a question > about why you are using 4 ZK servers. This doesn't increase availability > or > durability over having just three ZK servers. > > On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao <[email protected]> wrote: > > > I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few > ZK > > clients (version 3.3.0). After the clients have been running for a while, > > each of them starts to constantly disconnect and reconnect to the ZK > > server. > > On the client, I saw lots of entries like the following: > > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > > to > > server esv4-app28.stg/172.18.98.101:12913 > > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > > established to esv4-app28.stg/172.18.98.101:12913, initiating session > > 2011/08/23 14:42:06.581 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment > > complete on server esv4-app28.stg/172.18.98.101:12913, sessionid = > > 0x331f77a1ed80004, negotiated timeout = 6000 > > 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (SyncConnected) > > 2011/08/23 14:42:06.583 WARN [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 > > for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing > > socket connection and attempting reconnect > > java.lang.StringIndexOutOfBoundsException: String index out of range: > > -3 at java.lang.String.substring(String.java:1937) > > at java.lang.String.substring(String.java:1904) > > at > > > > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > > at > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > > 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (Disconnected) > > 2011/08/23 14:42:07.510 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection > > to > > server esv4-app29.stg/172.18.98.89:12913 > > 2011/08/23 14:42:07.511 INFO [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > > established to esv4-app29.stg/172.18.98.89:12913, initiating session > > 2011/08/23 14:42:07.512 INFO [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment > > complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f > > 77a1ed80004, negotiated timeout = 6000 > > 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (SyncConnected) > > 2011/08/23 14:42:07.552 WARN [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 > > for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos > > ing socket connection and attempting reconnect > > java.lang.StringIndexOutOfBoundsException: String index out of range: > > -3 at java.lang.String.substring(String.java:1937) > > at java.lang.String.substring(String.java:1904) > > at > > > > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > > at > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > > 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (Disconnected) > > > > On the ZK server, I saw lots of entries like these: > > 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket > > connection from /172.18.98.95:345072011-08-23 14:34:28,803 - INFO > > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client > > attempting to renew session 0x231db9b5f3f010e at /172.18.98.95:34507 > > 2011-08-23 14:34:28,803 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:Learner@103] - Revalidating client: > > 1581489222990563982011-08-23 14:34:28,803 - INFO > > [QuorumPeer:/0:0:0:0:0:0:0:0:12913:NIOServerCnxn@1580] - Established > > session > > 0x231db9b5f3f010e with negotiated timeout 6000 for client / > > 172.18.98.95:345 > > 072011-08-23 14:34:28,805 - WARN [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] - EndOfStreamException: Unable > to > > read additional data from client sessionid 0x231db9b5f3 > > f010e, likely client has closed socket > > 2011-08-23 14:34:28,805 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for > > client /172.18.98.95:34507 which had sessionid 0x231db9b5 > > f3f010e > > 2011-08-23 14:34:29,512 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket > > connection from /172.18.98.94:588812011-08-23 14:34:29,512 - INFO > > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client > > attempting to renew session 0x231db9b5f3f010c at /172.18.98.94:58881 > > 2011-08-23 14:34:29,512 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:Learner@103] - Revalidating client: > > 1581489222990563962011-08-23 14:34:29,513 - INFO > > [QuorumPeer:/0:0:0:0:0:0:0:0:12913:NIOServerCnxn@1580] - Established > > session > > 0x231db9b5f3f010c with negotiated timeout 6000 for client / > > 172.18.98.94:588 > > 812011-08-23 14:34:29,514 - WARN [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] - EndOfStreamException: Unable > to > > read additional data from client sessionid 0x231db9b5f3 > > f010c, likely client has closed socket > > 2011-08-23 14:34:29,514 - INFO [NIOServerCxn.Factory: > > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for > > client /172.18.98.94:58881 which had sessionid 0x231db9b5 > > f3f010c > > > > Anybody knows what could be the problem? A few other things that may help > > understand the problem: each ZK client is also connecting to another > > cluster > > of ZK servers in the same JVM and all the ZK connections use the ZK > > namespace (chroot). > > > > Thanks, > > > > Jun > > >
