[ 
https://issues.apache.org/jira/browse/SOLR-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836870#comment-15836870
 ] 

Mike Drob commented on SOLR-9555:
---------------------------------

I've been thinking about this a bunch and have some observations.

We already have special nodes for LIR (leader-initiated recovery). It looks 
like these are currently only checked during leader election (or, more 
generally, on core load). If we instead have replicas watch these nodes at all 
times, then we wouldn't need to forcefully publish a down state; we could 
publish a down LIR state instead. Does this model just move the race instead 
of eliminating it, though? Alan's sequence would look like:
* A node goes down, and then restarts
* The leader tries to send a document to the starting node, and gets a 503 'not 
ready yet'
* The node publishes its state as RECOVERING
* The leader's LIR thread publishes the recovery node's LIR state as DOWN
* The node sends a PREPRECOVERY request to the leader
* The leader waits for the node's state to be RECOVERING, which it already is, 
and can proceed.
* At some point (possibly already happened) the node sees the new LIR state, 
abandons its current recovery, and starts a new one.
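
To make the sequence above concrete, here is a rough in-memory sketch. It is 
not Solr code and does not use the ZooKeeper API: WatchedNode, LirWatchSketch, 
startRecovery and replay are hypothetical names invented for illustration, and 
the watcher fires synchronously and stays registered, whereas real ZooKeeper 
watches are one-shot and asynchronous. It assumes the LIR node starts out as 
ACTIVE so that there is something to watch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical simulation of the proposed model; not Solr or ZooKeeper code. */
final class LirWatchSketch {

    /** In-memory stand-in for a znode with a persistent, synchronous watcher. */
    static final class WatchedNode {
        private String value;
        private final List<Consumer<String>> watchers = new ArrayList<>();

        WatchedNode(String initial) { value = initial; }

        String get() { return value; }

        void watch(Consumer<String> watcher) { watchers.add(watcher); }

        void set(String newValue) {
            value = newValue;
            for (Consumer<String> w : watchers) { w.accept(newValue); }
        }
    }

    final WatchedNode replicaState = new WatchedNode("DOWN");  // what the replica thinks
    final WatchedNode lirState = new WatchedNode("ACTIVE");    // what the leader thinks
    int recoveryAttempts = 0;
    boolean leaderProceeded = false;

    /** The replica (re)starts recovery and publishes its own state. */
    void startRecovery() {
        recoveryAttempts++;
        replicaState.set("RECOVERING");
    }

    LirWatchSketch replay() {
        // The replica watches its LIR node at all times, not only on core load.
        lirState.watch(v -> { if ("DOWN".equals(v)) startRecovery(); });

        // The node has restarted and the leader's update got a 503.
        startRecovery();        // the node publishes RECOVERING
        lirState.set("DOWN");   // the leader's LIR thread touches only the LIR state

        // PREPRECOVERY: the leader waits for the replica's own state to be RECOVERING.
        leaderProceeded = "RECOVERING".equals(replicaState.get());
        return this;
    }

    public static void main(String[] args) {
        LirWatchSketch s = new LirWatchSketch().replay();
        System.out.println("leaderProceeded=" + s.leaderProceeded
                + " recoveryAttempts=" + s.recoveryAttempts);
    }
}
```

In this toy version the leader never overwrites the replica's published 
state, so the PREPRECOVERY check passes, and the replica still ends up 
starting a second recovery because of the LIR watch.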

In the case where we get an error during recovery, the recovering replica would 
know to restart the recovery process, so that works too.

We would also need to keep the Active state in the LIR path instead of deleting 
it, so that there is a node that replicas can set a watcher on.

The potential downside here is that we end up keeping two copies of the state, 
but I think it's ok? One is what the replica thinks it is, and one is what the 
leader thinks it is. I'll keep thinking about this more, but I wonder if 
there's a way to condense all these operations down to one znode safely.

> Leader incorrectly publishes state for replica when it puts replica into LIR.
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-9555
>                 URL: https://issues.apache.org/jira/browse/SOLR-9555
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Alan Woodward
>
> See 
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/17888/consoleFull 
> for an example


