Hey guys, I've been having an issue with one of my four replicas being inconsistent, and I've been trying to fix it. At the core of the issue, I've noticed that /clusterstate.json doesn't seem to receive updates when cores become unhealthy, or even when they're added or removed.
Today I decided to remove the "bad" replica from the SolrCloud and force a sync of a new, clean replica, so I ran '/admin/cores?command=UNLOAD&name=name' to drop it. Afterwards, on the instance with the "bad" replica, the core was removed from solr.xml but, strangely, NOT from /clusterstate.json in Zookeeper - the entry remained unchanged, still with "state: active" :(. So I manually edited clusterstate.json with a Perl script, removing the JSON data for the "bad" replica. I checked that all nodes saw the change, and things looked good. Then I bounced the node to confirm it was properly adding/removing itself from the /live_nodes znode in Zookeeper. That all worked perfectly, too.

Here is the really odd part: when I created a new replica on this node (to replace the "bad" one), the core was created on the node, but NO update was made to /clusterstate.json. At that point the node had no cores, no core state in /clusterstate.json, and all data dirs deleted, so this is quite confusing.

I also checked the ACLs on /clusterstate.json, and it is world/anyone accessible:

[zk: localhost:2181(CONNECTED) 18] getAcl /clusterstate.json
'world,'anyone
: cdrwa

Also, keep in mind my external Perl script had no trouble updating /clusterstate.json. Can anyone suggest why /clusterstate.json isn't getting updated when I create this new core?

One other thing I checked was the health of the Zookeeper ensemble: all 3 Zookeepers have the same mZxid, ctime, mtime, etc. for /clusterstate.json and receive updates without a problem - it's just this node that somehow isn't updating Zookeeper.

Any thoughts are much appreciated! Thanks!

Tim
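In case it matters, the Perl edit I made to /clusterstate.json amounted to something like the following (sketched here in Python rather than Perl; the collection/shard/replica names are placeholders, and the nested layout is the simplified shape I see in my 4.x-era clusterstate.json - I pulled the real JSON with 'get /clusterstate.json' in zkCli.sh and wrote it back with 'set'):

```python
import json

# Simplified clusterstate.json layout with placeholder names;
# the real document came from `get /clusterstate.json` in zkCli.sh.
clusterstate = {
    "collection1": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"state": "active"},
                    "core_node4": {"state": "active"},  # the "bad" replica
                }
            }
        }
    }
}

def drop_replica(state, collection, shard, node):
    """Remove one replica's entry, leaving everything else untouched."""
    replicas = state[collection]["shards"][shard]["replicas"]
    replicas.pop(node, None)  # no error if the entry is already gone
    return state

cleaned = drop_replica(clusterstate, "collection1", "shard1", "core_node4")
print(json.dumps(cleaned, indent=2))
```

Nothing fancy - it only deletes the one replica's entry, which is why I'm fairly confident the edit itself didn't corrupt the znode.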