Thanks so much for your reply. Appreciate your help with this. 

We have 10 Solr4 nodes (5 shards with replication factor 2) and three ZooKeeper 
instances. When we bring up the 10 Solr4 nodes while all ZooKeeper instances are 
down, we see this exception in the Solr4 logs (which makes sense):

java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
862352 [main-SendThread(d136274-003.dc.gs.com:2181)] WARN  org.apache.zookeeper.ClientCnxn  – Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

Once we bring up all ZooKeeper instances, the exception above stops. We then see 
the following messages, after which the log stops moving:

INFO  - 2013-08-09 15:48:41.447; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@203727c5 name:ZooKeeperConnection Watcher:zk1.test.com:2181,zk2.test.com:2181,zk3.test.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
998962 [main-EventThread] INFO  org.apache.solr.common.cloud.ConnectionManager  – Watcher org.apache.solr.common.cloud.ConnectionManager@203727c5 name:ZooKeeperConnection Watcher:zk1.test.com:2181,zk2.test.com:2181,qa-zk3.test.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
INFO  - 2013-08-09 15:48:41.528; org.apache.solr.common.cloud.ConnectionManager; Client->ZooKeeper status change trigger but we are already closed
999043 [main-EventThread] INFO  org.apache.solr.common.cloud.ConnectionManager  – Client->ZooKeeper status change trigger but we are already closed

At this point we cannot load the admin page or run queries against any Solr node 
until we restart the entire cloud; after the restart everything works. So we 
must add checks to make sure that N/2 + 1 ZooKeeper instances (a majority of the 
N-instance ensemble) are up before we bring up any Solr nodes.
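
For illustration, here is a minimal sketch of the kind of pre-start check we 
have in mind. It is not something we run today. It assumes the ZooKeeper 
servers answer the "ruok" four-letter command on the client port (on newer 
ZooKeeper releases that command may have to be whitelisted), and the function 
names zk_is_ok, quorum_size and has_quorum are made up for this example. Note 
that "imok" only tells us the server process is running, not that the ensemble 
has elected a leader, so this is a lower bound rather than a full health check.

```python
import socket


def zk_is_ok(host, port=2181, timeout=2.0):
    """Return True if the server at host:port answers 'ruok' with 'imok'.

    Any connection error or timeout is treated as "not up".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        return False


def quorum_size(ensemble_size):
    """Majority of the ensemble: N/2 + 1 using integer division."""
    return ensemble_size // 2 + 1


def has_quorum(servers, probe=zk_is_ok):
    """Return True if at least N/2 + 1 of the (host, port) pairs respond.

    `probe` is injectable so the logic can be exercised without a network.
    """
    alive = sum(1 for host, port in servers if probe(host, port))
    return alive >= quorum_size(len(servers))
```

A Solr start script could call has_quorum([("zk1.test.com", 2181), 
("zk2.test.com", 2181), ("zk3.test.com", 2181)]) in a retry loop and only 
launch the Solr nodes once it returns True.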




-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Thursday, August 08, 2013 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: external zookeeper with SolrCloud

On 8/8/2013 3:03 PM, Joshi, Shital wrote:
> We did quite a bit of testing and we think bug 
> https://issues.apache.org/jira/browse/SOLR-4899 is not resolved in Solr 4.4

The commit for SOLR-4899 was made to branch_4x on June 10th. 
lucene_solr_4_4 code branch was created from branch_4x on July 8th.

The change is definitely present in 4.4.  It's an extremely simple 
one-line change - instead of waiting for DEFAULT_CLIENT_CONNECT_TIMEOUT, 
a zookeeper reconnect will wait for Long.MAX_VALUE milliseconds.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java?r1=1491451&r2=1491450&pathrev=1491451

Either you are having a problem that's unrelated to the change committed 
by SOLR-4899 or there's something strange going on.

Can you describe exactly what you are trying, what you are seeing, and 
what you expect to see?

Thanks,
Shawn
