James Hardwick created SOLR-6707:
------------------------------------

             Summary: Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
                 Key: SOLR-6707
                 URL: https://issues.apache.org/jira/browse/SOLR-6707
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.10
            Reporter: James Hardwick

We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:

- We were running a Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since the core is not currently in use.
- Solr experienced an "internal server error", I believe due in part to a fairly new feature we are using, which seemingly caused all cores to go down.
- Solr immediately went into recovery, followed by leader election for each shard of each core.
- Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't because of the improper configs.
- Solr then began rapid-fire reattempting recovery of said core, trying maybe 20-30 times per second.
- This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
- At some point /overseer/queue became so backed up that normal cluster coordination could no longer play out, and Solr toppled over.

I know this is a bit of an unusual circumstance, since we were keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
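
To illustrate why a retry rate of 20-30 attempts per second can clog a queue, here is a minimal back-of-the-envelope sketch in Python. It is not Solr or ZooKeeper code. It assumes each recovery attempt enqueues exactly one message and that the queue is drained at a fixed rate; both are simplifications, and the function name `queue_backlog` and the drain rate used below are hypothetical.

```python
def queue_backlog(enqueue_per_s, drain_per_s, seconds):
    """Return the queue depth at the end of each second, starting empty.

    Assumption: arrivals and drains happen at constant whole-number
    rates per second. Real queue behaviour is burstier than this.
    """
    depth = 0
    history = []
    for _ in range(seconds):
        # Depth grows by the surplus of arrivals over drains, never below zero.
        depth = max(0, depth + enqueue_per_s - drain_per_s)
        history.append(depth)
    return history


if __name__ == "__main__":
    # 25/s is the midpoint of the reported 20-30 attempts per second;
    # the drain rate of 10/s is an invented figure for illustration only.
    print(queue_backlog(25, 10, 5))
```

Under these assumed numbers the backlog grows by 15 entries every second and never shrinks, so it would reach 54,000 entries after an hour. Whenever the drain rate exceeds the retry rate, the same model stays at zero.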