[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202813#comment-14202813 ]
James Hardwick edited comment on SOLR-6707 at 11/7/14 10:19 PM:
----------------------------------------------------------------

Interesting clusterstate.json in ZK. Why would we have null range/parent properties for an implicitly routed index that has never been split?

{code:javascript}
{
  "gemindex":{
    "shards":{
      "shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search",
            "leader":"true"},
          "core_node3":{
            "state":"active",
            "core":"gemindex",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search"}}}},
    "router":{"name":"implicit"}},
  "text-analytics":{
    "shards":{
      "shard1":{
        "range":null,
        "state":"active",
        "parent":null,
        "replicas":{
          "core_node1":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.26.109:8081_extera-search",
            "base_url":"http://10.128.26.109:8081/extera-search"},
          "core_node2":{
            "state":"recovery_failed",
            "core":"text-analytics",
            "node_name":"10.128.225.154:8081_extera-search",
            "base_url":"http://10.128.225.154:8081/extera-search"},
          "core_node3":{
            "state":"down",
            "core":"text-analytics",
            "node_name":"10.128.226.160:8081_extera-search",
            "base_url":"http://10.128.226.160:8081/extera-search",
            "leader":"true"}}}},
    "router":{"name":"implicit"}}}
{code}
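For anyone who wants to pull the same state for comparison, here is a minimal sketch that reads /clusterstate.json straight from ZooKeeper with the plain ZooKeeper client. The connect string (and the absence of a chroot) are assumptions; adjust both for your ensemble.

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: dump the shared cluster-state node used by Solr 4.x.
// The connect string below is an assumption -- point it at the ensemble
// backing the cluster, and prefix the path with your chroot if you use one.
public class DumpClusterState {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            Stat stat = new Stat();
            byte[] data = zk.getData("/clusterstate.json", false, stat);
            System.out.println("znode version: " + stat.getVersion());
            System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}
{code}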
> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it is not currently in use.
> - Solr experienced an "internal server error", supposedly because of "No space left on device", even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, with subsequent leader election for each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire re-attempts at recovery of said core, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
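A quick way to see whether a recovery loop like the one described above is backing up the overseer is to poll the child count of /overseer/queue. The following is only a sketch against the raw ZooKeeper client, with an assumed connect string; running `ls /overseer/queue` in zkCli.sh gives the same information interactively.

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Sketch: poll the overseer work-queue depth. The connect string is an
// assumption -- use the ensemble (and chroot, if any) backing this cluster.
public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            for (int i = 0; i < 30; i++) {
                List<String> children = zk.getChildren("/overseer/queue", false);
                System.out.println("/overseer/queue children: " + children.size());
                Thread.sleep(1000); // a steadily climbing count matches the behaviour described above
            }
        } finally {
            zk.close();
        }
    }
}
{code}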