Thanks so much for your reply; we appreciate your help with this. We have 10 Solr4 nodes (5 shards with replication factor 2) and three ZooKeeper instances. When we bring up all 10 Solr4 nodes while all ZooKeeper instances are down, we see this exception in the Solr4 logs (which makes sense):

java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
862352 [main-SendThread(d136274-003.dc.gs.com:2181)] WARN org.apache.zookeeper.ClientCnxn ? Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

When we then bring up all ZooKeeper instances, the exception above stops. We see the following messages in the log, and after that nothing further is logged:

INFO - 2013-08-09 15:48:41.447; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@203727c5 name:ZooKeeperConnection Watcher:zk1.test.com:2181,zk2.test.com:2181,zk3.test.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
998962 [main-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager ? Watcher org.apache.solr.common.cloud.ConnectionManager@203727c5 name:ZooKeeperConnection Watcher:zk1.test.com:2181,zk2.test.com:2181,qa-zk3.test.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
INFO - 2013-08-09 15:48:41.528; org.apache.solr.common.cloud.ConnectionManager; Client->ZooKeeper status change trigger but we are already closed
999043 [main-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager ? Client->ZooKeeper status change trigger but we are already closed

At this point we cannot load the admin page or run queries on any of the Solr nodes until we restart the entire cloud; after that everything is fine. So we must add checks to make sure that N/2 + 1 ZooKeeper instances are up before we bring up any Solr nodes.
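
For reference, a pre-start check along these lines could look roughly like the sketch below. It is only an illustration, not something we have in place: the function names (quorum_size, parse_zk_hosts, is_zk_alive, enough_zk_up) are made up for this example, and it assumes the ZooKeeper "ruok" four-letter command is reachable on the client port (newer ZooKeeper releases require four-letter commands to be whitelisted). Note that "ruok" only shows that a server process is running, not that the ensemble has actually formed a quorum.

```python
import socket


def quorum_size(ensemble_size):
    """Smallest majority of an ensemble: N/2 + 1 using integer division."""
    return ensemble_size // 2 + 1


def parse_zk_hosts(zk_host):
    """Split 'host1:2181,host2:2181[/chroot]' into a list of (host, port)."""
    servers = zk_host.split("/", 1)[0]  # drop an optional chroot suffix
    result = []
    for entry in servers.split(","):
        host, _, port = entry.strip().partition(":")
        result.append((host, int(port) if port else 2181))
    return result


def is_zk_alive(host, port, timeout=2.0):
    """Send 'ruok' to one server; a running server is expected to reply 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Connection refused, timeout, DNS failure, etc. all count as "down".
        return False


def enough_zk_up(zk_host, probe=is_zk_alive):
    """True if at least N/2 + 1 of the listed servers answer the probe."""
    servers = parse_zk_hosts(zk_host)
    alive = sum(1 for host, port in servers if probe(host, port))
    return alive >= quorum_size(len(servers))
```

A Solr start script could call enough_zk_up("zk1.test.com:2181,zk2.test.com:2181,zk3.test.com:2181") in a retry loop and only launch the node once it returns True.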

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, August 08, 2013 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: external zookeeper with SolrCloud

On 8/8/2013 3:03 PM, Joshi, Shital wrote:
> We did quite a bit of testing and we think bug
> https://issues.apache.org/jira/browse/SOLR-4899 is not resolved in Solr 4.4

The commit for SOLR-4899 was made to branch_4x on June 10th. lucene_solr_4_4 code branch was created from branch_4x on July 8th. The change is definitely present in 4.4.

It's an extremely simple one-line change - instead of waiting for DEFAULT_CLIENT_CONNECT_TIMEOUT, a zookeeper reconnect will wait for Long.MAX_VALUE milliseconds.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java?r1=1491451&r2=1491450&pathrev=1491451

Either you are having a problem that's unrelated to the change committed by SOLR-4899 or there's something strange going on. Can you describe exactly what you are trying, what you are seeing, and what you expect to see?

Thanks,
Shawn