Hi,
I have a 3-node external ZK (zookeeper-3.4.8) ensemble managing my 6-node 
SolrCloud (Solr 6.1) cluster. Recently, due to Dirty COW, I had to reboot my 
Solr and ZooKeeper clusters. I rebooted the Solr nodes one by one and all was 
fine. I then rebooted ZooKeeper nodes 1 and 2 (with at least a 10 minute delay 
between reboots) and again all was fine - no errors reported by ZooKeeper's 
ruok, and SolrCloud cluster health was all green. When I rebooted ZK 3, Solr 
reported it could no longer connect to ZK and all the leaders lost their 
replicas. After a short time Solr started rebuilding its replicas (it 
recovered everything automagically) - I didn’t restart Solr. The only issue 
was a spike in load on the Solr leaders. 

My best guess is that SolrCloud doesn’t reconnect effectively if a ZooKeeper 
node disappears for a period (zkClientTimeout is set to 15 sec (15000)).  

Relevant config in start-up script: -DzkClientTimeout=1500 
-DzkHost=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/solr/production

My questions: 
Has anyone experienced this upon rebooting ZooKeeper? Was anything I did above 
wrong? Should I increase zkClientTimeout?
Is there any monitoring that would alert me that Solr has an issue connecting 
to an individual ZK node (ideally something that would have alerted me before 
I rebooted ZK3)?
Is there any other relevant info in the docs I should be reading? (I believe I 
have read/looked relatively exhaustively.)  
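
To make the monitoring question concrete, a minimal sketch of a per-node check 
could look like the following. It takes the zkHost string from the start-up 
script, splits it into individual nodes, and sends ZooKeeper's `ruok` 
four-letter command to each one. Note the assumptions: the function names 
(`parse_zk_host`, `four_letter`, `check_ensemble`) are hypothetical, port 2181 
is assumed when an entry has no port, and the check probes each ZK node 
directly rather than inspecting Solr's own ZK session, so it is only a rough 
proxy for what Solr sees.

```python
import socket

DEFAULT_ZK_PORT = 2181  # assumption: used only when an entry omits ":port"


def parse_zk_host(zk_host):
    """Split a zkHost string into ([(host, port), ...], chroot)."""
    servers, slash, chroot = zk_host.partition("/")
    nodes = []
    for entry in servers.split(","):
        host, _, port = entry.strip().partition(":")
        nodes.append((host, int(port) if port else DEFAULT_ZK_PORT))
    return nodes, slash + chroot


def four_letter(host, port, command, timeout=2.0):
    """Send a ZooKeeper four-letter command; return the reply text, or None
    if the node cannot be reached within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(command.encode("ascii"))
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:  # server closes the connection after replying
                    break
                chunks.append(chunk)
        return b"".join(chunks).decode("utf-8", errors="replace")
    except OSError:
        return None


def check_ensemble(zk_host):
    """Return {"host:port": True/False}; True means the node answered 'imok'."""
    nodes, _chroot = parse_zk_host(zk_host)
    return {
        f"{host}:{port}": four_letter(host, port, "ruok") == "imok"
        for host, port in nodes
    }
```

Calling `check_ensemble` with the zkHost value above would return one entry per 
ZK node, and any `False` would be worth an alert. Since `ruok` only shows that 
the server process is running, sending `srvr` through `four_letter` and looking 
at the `Mode:` line in the reply is probably a better indication of whether the 
node has actually joined the quorum.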

Thanks, and let me know if further info is required. Unfortunately, I didn’t 
collect logs for this period. My next step is to reproduce in non-prod (but 
thought I’d reach out first).
- Brendan 
 
