[ https://issues.apache.org/jira/browse/SOLR-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779897#comment-17779897 ]
Vincent Primault commented on SOLR-17049:
-----------------------------------------
Thread on [email protected]:
[https://lists.apache.org/thread/3q5t2kxbpq7poc6nb06qgs1gld2f6ny0]
I can see two ways of fixing this:
* Rely on the cluster state to determine which collections have a replica on this node
* Rely on CoresLocator, to be consistent with how cores are discovered at startup
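
To make the difference between the two options concrete, here is a minimal sketch in plain Java. It is a toy model and not Solr code: the class LocalReplicaSources, the record ReplicaInfo, and both method names are hypothetical, and the node and core names are made up for illustration. It only shows where the list of "replicas to wait for" would come from in each case.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model, not Solr code: every type and method name here is hypothetical. */
final class LocalReplicaSources {

  /** Hypothetical stand-in for one replica entry in the cluster state. */
  record ReplicaInfo(String collection, String coreName, String nodeName) {}

  /** Option 1: derive the local core names by filtering cluster state on this node's name. */
  static Set<String> fromClusterState(List<ReplicaInfo> clusterState, String nodeName) {
    Set<String> cores = new TreeSet<>();
    for (ReplicaInfo replica : clusterState) {
      if (replica.nodeName().equals(nodeName)) {
        cores.add(replica.coreName());
      }
    }
    return cores;
  }

  /** Option 2: derive the local core names from what a locator finds on disk. */
  static Set<String> fromDiscoveredCores(List<String> coresOnDisk) {
    return new TreeSet<>(coresOnDisk);
  }

  public static void main(String[] args) {
    List<ReplicaInfo> state = List.of(
        new ReplicaInfo("c1", "c1_shard1_replica_n1", "node1:8983_solr"),
        new ReplicaInfo("c1", "c1_shard1_replica_n2", "node2:8983_solr"),
        new ReplicaInfo("c2", "c2_shard1_replica_n1", "node1:8983_solr"));

    // Both sources agree here; they could differ if disk and cluster state are out of sync.
    System.out.println(fromClusterState(state, "node1:8983_solr"));
    System.out.println(fromDiscoveredCores(
        List.of("c1_shard1_replica_n1", "c2_shard1_replica_n1")));
  }
}
```

In this model the two sources return the same set only while the disk and the cluster state agree. Which one is preferable when they disagree is the open question behind the two options above.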
> Marking replicas down at startup and waiting does not wait
> ----------------------------------------------------------
>
> Key: SOLR-17049
> URL: https://issues.apache.org/jira/browse/SOLR-17049
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 8.6
> Reporter: Vincent Primault
> Priority: Major
>
> We observed an unexpected behaviour where a node was taking traffic for a
> replica that was not ready to take it. It seems to happen when the node is
> marked as live and the replica is marked as active, while the corresponding
> core is not loaded yet on the node.
>
> I looked at the code and in theory it should not happen, since the following
> happens in {{ZkController#init}}: mark node as down, wait for replicas to
> be marked as down, and then register the node as live. However, after looking
> at the code of {{publishAndWaitForDownStates}}, I observed that we wait
> for down states for replicas associated with cores as returned by
> {{CoreContainer#getCoreDescriptors}}... which is empty at this point
> since {{ZkController#init}} is called before cores are discovered (which
> happens later in {{CoreContainer#load}}).
>
> It hence seems to me that we basically never wait for any replicas to be
> marked as down. The startup sequence continues by marking the node as
> live, and the node hence _might_ take traffic for a short period of time for
> a replica that is not ready (e.g., if the node previously crashed and the
> replica stayed active).
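
The ordering problem described above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the actual Solr implementation: StartupOrderModel and its fields are hypothetical, and init() and load() only mimic the order of steps reported for ZkController#init and CoreContainer#load.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reported startup ordering; not Solr code, all names are hypothetical. */
final class StartupOrderModel {
  // Stand-in for the descriptors known to the container; filled only by load().
  private final List<String> coreDescriptors = new ArrayList<>();
  private final List<String> coresOnDisk;
  final List<String> log = new ArrayList<>();

  StartupOrderModel(List<String> coresOnDisk) {
    this.coresOnDisk = List.copyOf(coresOnDisk);
  }

  /** Mimics the described init step: mark down, wait on known descriptors, go live. */
  void init() {
    log.add("node marked down");
    // The wait is driven by the descriptor list, which is still empty at this point.
    log.add("waiting for " + coreDescriptors.size() + " replica(s) to be down");
    log.add("node registered as live");
  }

  /** Mimics the described load step: core discovery only happens here. */
  void load() {
    coreDescriptors.addAll(coresOnDisk);
    log.add("discovered " + coreDescriptors.size() + " core(s)");
  }

  public static void main(String[] args) {
    StartupOrderModel node = new StartupOrderModel(List.of("c1_shard1_replica_n1"));
    node.init(); // runs first, as in the reported sequence
    node.load(); // discovery comes too late to influence the wait
    node.log.forEach(System.out::println);
  }
}
```

In this model the wait step always sees zero replicas because discovery runs after it, which matches the reported symptom that the node goes live without waiting.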