[
https://issues.apache.org/jira/browse/SOLR-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825412#comment-17825412
]
Chris M. Hostetter commented on SOLR-17200:
-------------------------------------------
FWIW:
* I found these race conditions because I heard anecdotally from a coworker
that they had seen a kube pod with 100+ PULL replicas (which were all current
and didn't need recovery) report it was "READY" within a minute or so of restart,
even though the logs showed it was still loading cores.
* There may be other race conditions I haven't considered – those are just the
two that jumped out at me when skimming the code
I think at a minimum one or both of the following changes should be made:
* Instead of using {{coreContainer.getCores().stream().map(c ->
c.getCoreDescriptor().getCloudDescriptor())}} to get the list of
{{CloudDescriptors}} we should stick to using the registered
{{CoreDescriptors}} directly (ignoring the question of whether the {{SolrCore}}
itself is loaded) via {{coreContainer.getCoreDescriptors().stream().map(cd ->
cd.getCloudDescriptor())}}
* Before even looking at the {{CoreDescriptors}}, {{HealthCheckHandler}}
should inspect {{CoreContainer.getStatus()}}
** But since I'm really not a fan of methods that require the caller to check
bitmasks, we should probably just add a {{public boolean isLoadComplete()}} to
{{CoreContainer}}
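To make the two suggestions concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not Solr code: {{ToyCore}}, {{ToyContainer}}, {{ToyHealthCheck}} and all of their methods are hypothetical stand-ins invented for illustration, and only {{isLoadComplete()}} borrows its name from the proposal above. The real {{HealthCheckHandler}} would of course consult cluster state rather than a boolean field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one registered core; NOT a Solr class. */
final class ToyCore {
    final String name;
    final boolean loaded; // has this core finished initializing?
    final boolean active; // is its replica "active" (needs no recovery)?

    ToyCore(String name, boolean loaded, boolean active) {
        this.name = name;
        this.loaded = loaded;
        this.active = active;
    }
}

/** Hypothetical stand-in for CoreContainer, reduced to what the check needs. */
final class ToyContainer {
    private final List<ToyCore> registered = new ArrayList<>();
    private volatile boolean loadComplete = false;

    void register(ToyCore core) { registered.add(core); }
    void markLoadComplete() { loadComplete = true; }

    /** The proposed accessor: a plain boolean instead of a status bitmask. */
    boolean isLoadComplete() { return loadComplete; }

    /** Plays the role of getCores(): only cores that finished loading. */
    List<ToyCore> loadedCores() {
        List<ToyCore> result = new ArrayList<>();
        for (ToyCore c : registered) {
            if (c.loaded) result.add(c);
        }
        return result;
    }

    /** Plays the role of getCoreDescriptors(): every registered core. */
    List<ToyCore> registeredCores() { return new ArrayList<>(registered); }
}

final class ToyHealthCheck {
    static long countUnhealthy(List<ToyCore> cores) {
        long n = 0;
        for (ToyCore c : cores) {
            if (!c.active) n++;
        }
        return n;
    }

    /** Behavior described in the issue: only loaded cores are inspected. */
    static boolean healthyCurrent(ToyContainer c) {
        return countUnhealthy(c.loadedCores()) == 0;
    }

    /** Both suggestions combined: load-complete gate, then all registered cores. */
    static boolean healthyProposed(ToyContainer c) {
        if (!c.isLoadComplete()) return false;
        return countUnhealthy(c.registeredCores()) == 0;
    }
}

final class HealthCheckRaceDemo {
    public static void main(String[] args) {
        ToyContainer c = new ToyContainer();
        c.register(new ToyCore("loaded-and-active", true, true));
        c.register(new ToyCore("still-loading", false, false));
        System.out.println("current:  " + ToyHealthCheck.healthyCurrent(c));
        System.out.println("proposed: " + ToyHealthCheck.healthyProposed(c));
    }
}
```

In this model an empty container, or one where some cores are still loading, passes the "current" check but fails the "proposed" one. The load-complete gate matters separately from the descriptor change because a request can arrive before any core has been registered at all.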
/ping [~houston], [~broustant], [~tflobbe]
> Race conditions on startup using /health?requireHealthyCores=true
> -----------------------------------------------------------------
>
> Key: SOLR-17200
> URL: https://issues.apache.org/jira/browse/SOLR-17200
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
>
> There seem to be at least two possible thread race conditions that can lead
> {{/health?requireHealthyCores=true}} to return a false positive while
> {{CoreContainer}} is in the process of starting up.
> # If the request comes in _after_ {{CoreContainer}} has initialized
> {{healthCheckHandler}} but _before_ initializing & running the
> {{coreLoadExecutor}}
> # A more complex situation where the request comes in _while_
> {{coreLoadExecutor}} is loading cores, and all of the cores that have
> _finished_ initialization are "active" in SolrCloud, but other SolrCores
> remain to be initialized (and may need recovery)
> In both cases, the root of the issue is that {{requireHealthyCores=true}}
> works by checking...
> {code:java}
> Collection<CloudDescriptor> coreDescriptors =
> coreContainer.getCores().stream()
> .map(c -> c.getCoreDescriptor().getCloudDescriptor())
> .collect(Collectors.toList());
> long unhealthyCores = findUnhealthyCores(coreDescriptors, clusterState);
> {code}
> ...but that means the only {{CloudDescriptor}} instances that are checked are
> the ones that come from _loaded_ cores (which is what
> {{coreContainer.getCores()}} returns), and any {{currentlyLoadingCores}}
> (registered by {{CoreContainer}} calling {{solrCores.markCoreAsLoading(cd)}}
> before starting the {{coreLoadExecutor}}) are not considered.