[ 
https://issues.apache.org/jira/browse/SOLR-12087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Bao updated SOLR-12087:
-----------------------------
    Description: 
Sometimes when deleting replicas, a replica fails to be removed from the 
cluster state. This occurs especially when deleting replicas en masse; the 
result is that the data is deleted but the replicas aren't removed from the 
cluster state. Attempting to delete the downed replicas then fails because 
the core no longer exists.
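
To illustrate how such leftover entries might be spotted, here is a minimal 
sketch that walks a parsed CLUSTERSTATUS response from the Collections API and 
lists replicas reported as "down". The sample data (collection "test", the 
core and node names) is hypothetical, and a "down" replica is not necessarily 
one of the stuck ones described above; it may simply live on a node that is 
offline.

```python
from typing import Any, Dict, List, Tuple


def find_down_replicas(cluster_status: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (collection, shard, replica) triples whose state is 'down'."""
    down = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("state") == "down":
                    down.append((coll_name, shard_name, replica_name))
    return down


# Hypothetical, trimmed-down CLUSTERSTATUS payload for illustration only.
sample = {
    "cluster": {
        "collections": {
            "test": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active", "leader": "true"},
                            "core_node2": {"state": "down"},
                        }
                    }
                }
            }
        }
    }
}

print(find_down_replicas(sample))
```

Each triple returned is a candidate for a follow-up DELETEREPLICA call once 
it is confirmed that the core is really gone.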

This also occurs when trying to move replicas, since that move is an add and 
delete.

Some more information regarding this issue: when the MOVEREPLICA command is 
issued, the new replica is created successfully, but the replica to be deleted 
fails to be removed from state.json (the core is deleted, though), and we see 
two log messages spammed:
 # The node containing the leader replica continually attempts to initiate 
recovery on the replica and fails to do so because the core does not exist. As 
a result, it continually publishes a down state for the replica to ZooKeeper.
 # The replica node repeatedly logs that it cannot locate the core because it 
has been deleted.

During this period, we see an overall increase in ZK network connectivity, 
until the replica is finally deleted (by spamming DELETEREPLICA on the shard 
until it's removed from the state).
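
The workaround of repeating DELETEREPLICA until the replica leaves the state 
could be sketched roughly as follows. This is an assumption-laden outline, not 
what was actually run: the base URL, the names, and the `send` / 
`still_listed` callables are hypothetical stand-ins for an HTTP call and a 
cluster-state check, so the sketch runs without a live cluster.

```python
import time
from typing import Callable
from urllib.parse import urlencode


def delete_replica_url(base_url: str, collection: str, shard: str, replica: str) -> str:
    """Build a Collections API DELETEREPLICA request URL."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def delete_until_gone(
    url: str,
    send: Callable[[str], object],
    still_listed: Callable[[], bool],
    max_attempts: int = 5,
    delay_s: float = 1.0,
) -> int:
    """Re-issue the delete until the replica is no longer listed.

    Returns the number of attempts used; raises if the replica is still
    listed after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(url)
        except Exception:
            # Individual delete calls may fail (e.g. the core is already
            # gone); keep going and re-check the state.
            pass
        if not still_listed():
            return attempt
        time.sleep(delay_s)
    raise RuntimeError(f"replica still listed after {max_attempts} attempts")


# Demo with fakes: the replica disappears from the state on the third check.
url = delete_replica_url("http://localhost:8983/solr", "test", "shard1", "core_node2")
calls = []
states = iter([True, True, False])
attempts = delete_until_gone(url, calls.append, lambda: next(states), delay_s=0)
print(url)
print(attempts)
```

Bounding the number of attempts and pausing between them keeps the retry loop 
from adding to the ZK load described above.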

My guess is that there are two issues at hand here:
 # The leader continually attempts to recover a downed replica that is 
unrecoverable because the core does not exist.
 # The replica to be deleted is having trouble being deleted from state.json in 
ZK.

This is mostly consistent for my use case. I'm running 7.2.1 with 66 nodes.


> Deleting replicas sometimes fails and causes the replicas to exist in the 
> down state
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-12087
>                 URL: https://issues.apache.org/jira/browse/SOLR-12087
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 7.2
>            Reporter: Jerry Bao
>            Priority: Major
>         Attachments: Screen Shot 2018-03-16 at 11.50.32 AM.png
>


