[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813 ]
James Hardwick edited comment on SOLR-6707 at 11/7/14 10:19 PM:
----------------------------------------------------------------

Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?

{code:javascript}
{
  "gemindex":{
    "shards":{
      "shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{
      "shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
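For anyone who wants to pull the same state for comparison, here is a minimal sketch that reads /clusterstate.json straight from ZooKeeper with the plain ZooKeeper client. The connect string (and the absence of a chroot) are assumptions; adjust both for your ensemble.

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: dump the shared cluster-state node used by Solr 4.x.
// The connect string below is an assumption -- point it at the ensemble
// backing the cluster, and prefix the path with your chroot if you use one.
public class DumpClusterState {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            Stat stat = new Stat();
            byte[] data = zk.getData("/clusterstate.json", false, stat);
            System.out.println("znode version: " + stat.getVersion());
            System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}
{code}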
> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it is not currently in use.
> - Solr experienced an "internal server error", supposedly because of "No space left on device", even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, with subsequent leader election for each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire re-attempts at recovery of said core, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
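A quick way to see whether a recovery loop like the one described above is backing up the overseer is to poll the child count of /overseer/queue. The following is only a sketch against the raw ZooKeeper client, with an assumed connect string; running `ls /overseer/queue` in zkCli.sh gives the same information interactively.

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Sketch: poll the overseer work-queue depth. The connect string is an
// assumption -- use the ensemble (and chroot, if any) backing this cluster.
public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            for (int i = 0; i < 30; i++) {
                List<String> children = zk.getChildren("/overseer/queue", false);
                System.out.println("/overseer/queue children: " + children.size());
                Thread.sleep(1000); // a steadily climbing count matches the behaviour described above
            }
        } finally {
            zk.close();
        }
    }
}
{code}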