[ https://issues.apache.org/jira/browse/SOLR-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836870#comment-15836870 ]
Mike Drob commented on SOLR-9555:
---------------------------------

I've been thinking about this a bunch and have some observations.

We already have special nodes for LIR. It looks like these are currently only checked during leader election (or more generally, core load). If we instead have replicas watch for this at all times, then we wouldn't need to forcefully publish a down state; we could publish a down LIR state instead.

Does this model move the race instead of eliminating it, though? Alan's sequence would look like:

* A node goes down, and then restarts.
* The leader tries to send a document to the starting node, and gets a 503 'not ready yet'.
* The node publishes its state as RECOVERING.
* The leader's LIR thread publishes the recovering node's LIR state as DOWN.
* The node sends a PREPRECOVERY request to the leader.
* The leader waits for the node's state to be RECOVERING, which it already is, and can proceed.
* At some point (possibly already happened) the node sees the new LIR state, abandons its current recovery, and starts a new one.

In the case where we get an error during recovery, the recovering replica would know to restart the recovery process, so that works too.

We would also need to keep the Active state in the LIR path instead of deleting it, so that there is a node that replicas can set a watcher on. The potential downside here is that we end up keeping two copies of the state, but I think it's ok? One is what the replica thinks it is, and one is what the leader thinks it is.

I'll keep thinking about this more, but I wonder if there's a way to condense all these operations down to one znode safely.

> Leader incorrectly publishes state for replica when it puts replica into LIR.
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-9555
>                 URL: https://issues.apache.org/jira/browse/SOLR-9555
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Alan Woodward
>
> See https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/17888/consoleFull for an example
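
To make the sequence in the comment concrete, here is a rough in-memory replay of it. This is only a sketch under loud assumptions: `FakeZk`, `Replica`, `Leader`, the `/state/...` and `/lir/...` paths, and the lowercase state strings are all hypothetical stand-ins, not Solr or ZooKeeper APIs. Watches here are persistent and delivered only when `deliver()` is called, which is a simplification (real ZooKeeper watches are one-shot). It replays one interleaving and does not show whether the race is eliminated.

```python
from typing import Callable, Dict, List, Tuple


class FakeZk:
    """In-memory stand-in for the znodes (hypothetical, not a real client)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.watchers: Dict[str, List[Callable[[str], None]]] = {}
        self.pending: List[Tuple[str, str]] = []  # undelivered (path, value) events

    def watch(self, path: str, callback: Callable[[str], None]) -> None:
        self.watchers.setdefault(path, []).append(callback)

    def set(self, path: str, value: str) -> None:
        self.data[path] = value
        self.pending.append((path, value))  # watchers only hear about it later

    def deliver(self) -> None:
        """Deliver queued change events, including ones queued by callbacks."""
        while self.pending:
            path, value = self.pending.pop(0)
            for callback in self.watchers.get(path, []):
                callback(value)


class Replica:
    def __init__(self, zk: FakeZk, name: str) -> None:
        self.zk, self.name = zk, name
        self.recovery_attempts = 0
        # Watch the LIR node at all times, not only at core load.
        zk.watch(f"/lir/{name}", self.on_lir_change)

    def start_recovery(self) -> None:
        self.recovery_attempts += 1
        self.zk.set(f"/state/{self.name}", "recovering")

    def on_lir_change(self, lir_state: str) -> None:
        if lir_state == "down":
            # Abandon whatever recovery is in flight and start a new one.
            self.start_recovery()


class Leader:
    def __init__(self, zk: FakeZk) -> None:
        self.zk = zk

    def put_in_lir(self, name: str) -> None:
        # Publish only the leader's view; the replica's own state is untouched.
        self.zk.set(f"/lir/{name}", "down")

    def prep_recovery(self, name: str) -> bool:
        # Stand-in for the PREPRECOVERY check: proceed once the replica says RECOVERING.
        return self.zk.data.get(f"/state/{name}") == "recovering"


def run_sequence() -> Dict[str, object]:
    zk = FakeZk()
    zk.set("/lir/node1", "active")  # LIR node is kept (not deleted) so it can be watched
    replica, leader = Replica(zk, "node1"), Leader(zk)
    zk.deliver()                    # the initial "active" event is ignored by the replica

    replica.start_recovery()        # node restarted and publishes RECOVERING
    leader.put_in_lir("node1")      # leader saw the 503 and publishes LIR state down
    can_proceed = leader.prep_recovery("node1")
    attempts_before = replica.recovery_attempts
    zk.deliver()                    # "at some point" the replica sees the new LIR state
    return {
        "leader_can_proceed": can_proceed,
        "attempts_before_watch": attempts_before,
        "attempts_after_watch": replica.recovery_attempts,
        "replica_state": zk.data["/state/node1"],
        "lir_state": zk.data["/lir/node1"],
    }


if __name__ == "__main__":
    print(run_sequence())
```

The two paths mirror the "two copies of the state" trade-off: `/state/node1` is what the replica thinks, `/lir/node1` is what the leader thinks, and neither side overwrites the other's node.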