Thank you all for your help! The zookeeper log rolled over, thisis from Solr.log:
Looks like the solr and zk connection is gone for some reason INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@19789a96 name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent state:Disconnected type:None path:null path:null type:None INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected INFO - 2016-04-21 12:38:24.248; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one... INFO - 2016-04-21 12:38:24.262; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper INFO - 2016-04-21 12:38:24.269; org.apache.solr.common.cloud.ConnectionManager; Connected:true Then it publishes all cores on the hosts are down. I just list three cores here: INFO - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=down INFO - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController; publishing core=collection1 state=down INFO - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController; numShards not found on descriptor - reading it from system property INFO - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController; publishing core=product2_shard5_replica1 state=down INFO - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController; publishing core=product2_shard13_replica1 state=down product1 has only one shard one replica and it's able to be active successfully: INFO - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController; Register replica - core:product1_shard1_replica1 address:http:// {internalIp}:8983/solr collection:product1 shard:shard1 WARN - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove INFO - 2016-04-21 12:38:26.393; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard1 INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue. INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/ INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product1_shard1_replica1/ has no replicas INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1 INFO - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product1/leaders/shard1 INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is http://{internalIp}:8983/solr/product1_shard1_replica1/ INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No LogReplay needed for core=product1_replica1 baseURL=http:// {internalIp}:8983/solr INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am the leader, no recovery necessary INFO - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=active product2 has 15 shards one replica but only two shards lived on this machine, this is one of the failed shard that I never seen the message of the core product2_shard5_replica1 active: INFO - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController; Register replica - product2_shard5_replica1 address:http:// {internalIp}:8983/solr collection:product2 shard:shard5 WARN - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove INFO - 2016-04-21 12:38:26.625; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard5 INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue. INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync replicas to http:// {internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ has no replicas INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ shard5 INFO - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product2_shard5_replica1/leaders/shard5 INFO - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and leader is http://{internalIp}:8983/solr product2_shard5_replica1_shard5_replica1/ INFO - 2016-04-21 12:38:26.646; org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from ZooKeeper... Before I restarted this server, a bunch of queries failed for this collection product2. But I don't think it will affect the core status. Do you guys have any idea about why this particular core is not published as active since from the log, most steps are done except the very last one to publish info to ZK. Thanks, Li On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari <rajeshhaz...@gmail.com> wrote: > Hi Li, > > Do you see timeouts liek "CLUSTERSTATUS the collection time out:180s" > if its the case, this may be related to > https://issues.apache.org/jira/browse/SOLR-7940, > and i would say either use the patch file or upgrade. > > > *Thanks,* > *Rajesh,* > *8328789519,* > *If I don't answer your call please leave a voicemail with your contact > info, * > *will return your call ASAP.* > > On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang <yypvsxf19870...@gmail.com> > wrote: > > > Hi > > We have used Solr4.6 for 2 years,If you post more logs ,maybe we can > > fixed it. > > > > 2016-04-21 6:50 GMT+08:00 Li Ding <li.d...@bloomreach.com>: > > > > > Hi All, > > > > > > We are using SolrCloud 4.6.1. We have observed following behaviors > > > recently. A Solr node in a Solrcloud cluster is up but some of the > cores > > > on the nodes are marked as down in Zookeeper. If the cores are parts > of > > a > > > multi-sharded collection with one replica, the queries to that > > collection > > > will fail. However, when this happened, if we issue queries to the > core > > > directly, it returns 200 and correct info. But once Solr got into the > > > state, the core will be marked down forever unless we do a restart on > > Solr. > > > > > > Has anyone seen this behavior before? Is there any to get out of the > > state > > > on its own? > > > > > > Thanks, > > > > > > Li > > > > > >