[ 
https://issues.apache.org/jira/browse/SOLR-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825412#comment-17825412
 ] 

Chris M. Hostetter commented on SOLR-17200:
-------------------------------------------

FWIW:
 * I found these race conditions because I heard anecdotally from a coworker 
that they had seen a kube pod with 100+ PULL replicas (which were all current 
and didn't need recovery) report it was "READY" w/in a min or so of restart 
even though the logs showed it was still loading cores.
 * There may be other race conditions I haven't considered – those are just the 
two that jumped out at me when skimming the code

I think at a minimum one or both of the following changes should be made:
 * Instead of using {{coreContainer.getCores().stream().map(c -> 
c.getCoreDescriptor().getCloudDescriptor())}} to get the list of 
{{CloudDescriptors}} we should stick to using the registered 
{{CoreDescriptors}} directly (ignoring the question of whether the {{SolrCore}} 
itself is loaded) via {{coreContainer.getCoreDescriptors().stream().map(cd -> 
cd.getCloudDescriptor())}}
 * Before even looking at the {{CoreDescriptors}}, {{HealthCheckHandler}} 
should inspect {{CoreContainer.getStatus()}}
 ** But since I'm really not a fan of methods that require the caller to check 
bitmasks, we should probably just add a {{public boolean isLoadComplete()}} to 
{{CoreContainer}}
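
To make the two suggestions above concrete, here is a rough, self-contained model of the check. It is not Solr code: {{SketchCoreContainer}} and every method on it are hypothetical stand-ins, and only the name {{isLoadComplete()}} is taken from the suggestion above. It is single-threaded and just replays the startup sequence step by step.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for CoreContainer; none of this is the real Solr API. */
class SketchCoreContainer {
  // Every registered core, mapped to whether cluster state says its replica is active.
  private final Map<String, Boolean> registered = new LinkedHashMap<>();
  // Only the cores that have finished loading.
  private final Set<String> loaded = new LinkedHashSet<>();
  private boolean loadComplete = false;

  void register(String coreName, boolean activeInClusterState) {
    registered.put(coreName, activeInClusterState);
  }

  void markActive(String coreName) {
    registered.put(coreName, true);
  }

  void finishLoading(String coreName) {
    loaded.add(coreName);
  }

  void markLoadComplete() {
    loadComplete = true;
  }

  /** Assumed shape of the proposed accessor: true once all core loading has finished. */
  boolean isLoadComplete() {
    return loadComplete;
  }

  /** Models the existing check: only cores that have finished loading are inspected. */
  boolean healthyLoadedOnly() {
    return loaded.stream().allMatch(name -> registered.getOrDefault(name, false));
  }

  /** Models both suggestions combined: gate on load completion, then check every registered core. */
  boolean healthyProposed() {
    return isLoadComplete() && registered.values().stream().allMatch(active -> active);
  }
}

class HealthCheckSketchDemo {
  public static void main(String[] args) {
    SketchCoreContainer container = new SketchCoreContainer();
    container.register("core1", true);   // already active, no recovery needed
    container.register("core2", false);  // will need recovery once loaded

    // Request arrives before any core has loaded.
    System.out.println("before loading:  loadedOnly=" + container.healthyLoadedOnly()
        + " proposed=" + container.healthyProposed());

    // Request arrives while core2 is still loading.
    container.finishLoading("core1");
    System.out.println("during loading:  loadedOnly=" + container.healthyLoadedOnly()
        + " proposed=" + container.healthyProposed());

    // Loading finished and core2 has recovered.
    container.finishLoading("core2");
    container.markLoadComplete();
    container.markActive("core2");
    System.out.println("after recovery:  loadedOnly=" + container.healthyLoadedOnly()
        + " proposed=" + container.healthyProposed());
  }
}
```

In this model the loaded-only check reports healthy in the first two steps, while the gated check over all registered cores only reports healthy in the last one. Note that checking registered descriptors alone would still pass for a pod whose replicas are all "active" but still loading, which is why the sketch also gates on {{isLoadComplete()}}.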

/ping [~houston], [~broustant], [~tflobbe]

> Race conditions on startup using /health?requireHealthyCores=true
> -----------------------------------------------------------------
>
>                 Key: SOLR-17200
>                 URL: https://issues.apache.org/jira/browse/SOLR-17200
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> There seem to be at least two possible thread race conditions that can lead 
> {{/health?requireHealthyCores=true}} to return a false positive while 
> {{CoreContainer}} is in the process of starting up.
>  # If the request comes in _after_ {{CoreContainer}} has initialized 
> {{healthCheckHandler}} but _before_ initializing & running the 
> {{coreLoadExecutor}}
>  # A more complex situation where the request comes in _while_ 
> {{coreLoadExecutor}} is loading cores, and all of the cores that have 
> _finished_ initialization are "active" in SolrCloud, but other SolrCores 
> remain to be initialized (and may need recovery)
> In both cases, the root of the issue is that {{requireHealthyCores=true}} 
> works by checking...
> {code:java}
>       Collection<CloudDescriptor> coreDescriptors =
>           coreContainer.getCores().stream()
>               .map(c -> c.getCoreDescriptor().getCloudDescriptor())
>               .collect(Collectors.toList());
>       long unhealthyCores = findUnhealthyCores(coreDescriptors, clusterState);
> {code}
> ...but that means the only {{CloudDescriptor}} s that are checked are the ones 
> that come from _loaded_ cores (which is what {{coreContainer.getCores()}} 
> returns), and any {{currentlyLoadingCores}} (registered by {{CoreContainer}} 
> calling {{solrCores.markCoreAsLoading(cd)}} before starting the 
> {{coreLoadExecutor}} ) are not considered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

