[ 
https://issues.apache.org/jira/browse/SOLR-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386590#comment-17386590
 ] 

Mark Robert Miller commented on SOLR-12386:
-------------------------------------------

It’s really all related. It means moving away from a strategy of N actors on X 
nodes all trying to independently create and manage zk paths (at a segment 
level), with retries and the same ground-zero strategy every time they come 
up. That would be fairly difficult to do without saying: no, this is going to 
be the sensible strategy for how we manage zk nodes. And that comes down to 
pretty much doing it like a human would on paper. We know what nodes should 
exist when. We know what nodes and paths should be created or removed, and 
when. We share this knowledge with the computer.

I’m just talking about a simple zk lock for cluster creation. With Curator, 
it’s minimal code. With a fullish zk lock recipe, it’s a bit more. But you 
could also do something more basic, like designating an ephemeral zk node as 
the cluster-wide lock for the core zk layout state. On startup, try to create 
it. If it already exists, someone else has it, so wait for a key final node to 
show up (/collections or whatever) and continue. If you succeed in creating 
it, create the rest of the cluster layout and continue.
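
A minimal sketch of that "more basic" variant, written against a hypothetical 
PathStore interface with an in-memory stand-in so it runs without a ZooKeeper 
server. PathStore, InMemoryPathStore, ensureLayout, the lock path name and the 
layout list are all assumptions made for illustration; none of them are Solr, 
ZooKeeper or Curator APIs. A real version would create the lock as an 
ephemeral node so it disappears if the creator's session dies, and would 
likely use a watch instead of polling.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ClusterLayoutBootstrap {

    /** Hypothetical minimal view of a ZooKeeper-like store; not a real client API. */
    interface PathStore {
        /** Atomically creates the path; returns false if it already exists. */
        boolean createIfAbsent(String path);

        boolean exists(String path);
    }

    /** In-memory stand-in. It does not model sessions, so the lock is not truly ephemeral here. */
    static final class InMemoryPathStore implements PathStore {
        private final Set<String> paths = ConcurrentHashMap.newKeySet();

        @Override
        public boolean createIfAbsent(String path) {
            return paths.add(path);
        }

        @Override
        public boolean exists(String path) {
            return paths.contains(path);
        }
    }

    // Hypothetical lock path; the comment does not name one.
    static final String LOCK_PATH = "/cluster_layout_lock";
    // The "key final node" from the comment; it is created last on purpose.
    static final String FINAL_PATH = "/collections";
    // Illustrative layout only, not the full set of paths Solr creates.
    static final String[] LAYOUT = {"/configs", "/live_nodes", FINAL_PATH};

    /** Returns true if this caller created the layout, false if it waited for someone else. */
    static boolean ensureLayout(PathStore store, long timeoutMs) {
        if (store.createIfAbsent(LOCK_PATH)) {
            // We won the lock: create the whole layout once, final node last.
            for (String path : LAYOUT) {
                store.createIfAbsent(path);
            }
            return true;
        }
        // Someone else holds the lock: wait for the final node instead of creating paths.
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!store.exists(FINAL_PATH)) {
            if (System.nanoTime() - deadline > 0) {
                throw new IllegalStateException("Timed out waiting for " + FINAL_PATH);
            }
            try {
                Thread.sleep(10); // polling keeps the sketch short
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted waiting for " + FINAL_PATH, e);
            }
        }
        return false;
    }
}
```

The first caller of ensureLayout on a fresh store creates every path and gets 
true back. Every later caller only checks for the final node and gets false, 
so no one else is retrying individual path creation.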

How much this alone helps depends on what you want. I’d love to be able to 
start up hundreds of nodes, each with hundreds of cores, quickly and reliably. 
That’s a bit of a challenge with all of them, plus their internal objects, 
fighting to create and retry each individual zk path. Throw in some connection 
loss, perhaps a zk node blinked, or the hardware is chugging getting 
everything up. Now you have a storm of indeterminate duration, ferocity and 
fallout. The system can scale like nuts, but not with this kind of behavior.

> Test fails for "Can't find resource" for files in the _default configset
> ------------------------------------------------------------------------
>
>                 Key: SOLR-12386
>                 URL: https://issues.apache.org/jira/browse/SOLR-12386
>             Project: Solr
>          Issue Type: Test
>          Components: SolrCloud
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: cant find resource, stacktrace.txt
>
>
> Some tests, especially ConcurrentCreateRoutedAliasTest, have sporadically 
> failed with the message "Can't find resource" pertaining to a 
> file that is in the default ConfigSet yet mysteriously can't be found.  This 
> happens when a collection is being created that ultimately fails for this 
> reason.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

