[ 
https://issues.apache.org/jira/browse/SOLR-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200439#comment-15200439
 ] 

Scott Blum commented on SOLR-8862:
----------------------------------

I think I can comment on this just a bit.  The first call to 
createEphemeralLiveNode() in the code is not actually made from the 
constructor; it's made from the OnReconnect handler much later, and only if 
you lose your ZK session and have to create a new one.  At least, that's the 
theory.  Are you seeing it actually get called early?

More generally, the important race being resolved here is making sure the 
call to registerAllCoresAsDown() happens before createEphemeralLiveNode().  
Any client is supposed to join the cluster state (i.e., whether a core is 
marked ACTIVE) with the live_nodes list.  So the idea is: mark everything as 
DOWN, then put in the live_nodes child, then go mark things ACTIVE as they 
actually come up.  This works reasonably well for things like routing search 
requests.  I can see how it might fall over if you're depending on live_nodes 
alone for doing cluster level operations.
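
To make that join concrete, here is a minimal sketch in plain Java.  It does 
not use any Solr or ZooKeeper classes: State, Replica, routableCores() and the 
"core1"/"node1" names are all hypothetical stand-ins I made up for 
illustration, not the real client code.  It only shows why the ordering 
(DOWN first, then live_nodes, then ACTIVE) matters to a client that does the 
join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LiveReplicaFilter {

    // Hypothetical two-state model; only the states named above are included.
    enum State { DOWN, ACTIVE }

    // Hypothetical stand-in for one core's entry in the cluster state.
    static final class Replica {
        final String coreName;
        final String nodeName;
        final State state;

        Replica(String coreName, String nodeName, State state) {
            this.coreName = coreName;
            this.nodeName = nodeName;
            this.state = state;
        }
    }

    // The client-side join: a core is routable only if it is marked ACTIVE
    // AND its node currently has a child under live_nodes.
    static List<String> routableCores(List<Replica> clusterState, Set<String> liveNodes) {
        List<String> result = new ArrayList<>();
        for (Replica r : clusterState) {
            if (r.state == State.ACTIVE && liveNodes.contains(r.nodeName)) {
                result.add(r.coreName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> noLiveNodes = Collections.emptySet();
        Set<String> node1Live = new HashSet<>(Arrays.asList("node1"));

        // Cluster state left over from an earlier session: still says ACTIVE.
        List<Replica> staleActive = Arrays.asList(new Replica("core1", "node1", State.ACTIVE));
        // Cluster state after every core on the node has been marked DOWN.
        List<Replica> markedDown = Arrays.asList(new Replica("core1", "node1", State.DOWN));
        // Cluster state once the core has really come up.
        List<Replica> reallyActive = Arrays.asList(new Replica("core1", "node1", State.ACTIVE));

        // Intended order: stale ACTIVE is masked because the node is not live yet.
        System.out.println(routableCores(staleActive, noLiveNodes));  // []
        // Node is live, but everything was marked DOWN first.
        System.out.println(routableCores(markedDown, node1Live));     // []
        // Core came up and was marked ACTIVE.
        System.out.println(routableCores(reallyActive, node1Live));   // [core1]

        // Wrong order: the live_nodes child appears while the stale ACTIVE
        // entry is still there, so the join selects a core that is not up.
        System.out.println(routableCores(staleActive, node1Live));    // [core1]
    }
}
```

Note that the join only protects per-core routing.  A client that looks at 
live_nodes alone (the last case above, minus the state check) has nothing to 
tell it whether the node is ready for cluster level operations.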

> /live_nodes is populated too early to be very useful for clients -- 
> CloudSolrClient (and MiniSolrCloudCluster.createCollection) need some other 
> ephemeral zk node to know which servers are "ready"
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8862
>                 URL: https://issues.apache.org/jira/browse/SOLR-8862
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> {{/live_nodes}} is populated surprisingly early (and multiple times) in the 
> life cycle of a Solr node startup, and as a result probably shouldn't be used 
> by {{CloudSolrClient}} (or other "smart" clients) for deciding what servers 
> are fair game for requests.
> we should either fix {{/live_nodes}} to be created later in the lifecycle, or 
> add some new ZK node for this purpose.
> {panel:title=original bug report}
> I haven't been able to make sense of this yet, but what i'm seeing in a new 
> SolrCloudTestCase subclass i'm writing is that the code below, which 
> (reasonably) attempts to create a collection immediately after configuring 
> the MiniSolrCloudCluster gets a "SolrServerException: No live SolrServers 
> available to handle this request" -- in spite of the fact that (as far as i 
> can tell at first glance) MiniSolrCloudCluster's constructor is supposed to 
> block until all the servers are live.
> {code}
>     configureCluster(numServers)
>       .addConfig(configName, configDir.toPath())
>       .configure();
>     Map<String, String> collectionProperties = ...;
>     assertNotNull(cluster.createCollection(COLLECTION_NAME, numShards, repFactor,
>                                            configName, null, null, collectionProperties));
> {code}
> {panel}


