[ https://issues.apache.org/jira/browse/SOLR-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200194#comment-15200194 ]
Hoss Man edited comment on SOLR-8862 at 3/17/16 10:01 PM:
----------------------------------------------------------

Ok, so here's what i've found so far...

* Just adding a single line of logging to my test after {{configureCluster}} and before {{cluster.createCollection}} was enough to make the seed start passing fairly reliably.
** so clearly a finicky timing problem
* {{MiniSolrCloudCluster}}'s constructor has logic that waits for {{/live_nodes}} to have {{numServer}} children before returning
** this was added in SOLR-7146 precisely because of problems like the one i'm seeing
** if there aren't the expected number of {{/live_nodes}} children the first time it checks, then it sleeps in 1 second increments until there are.
* {{/live_nodes}} gets populated by {{ZkController.createEphemeralLiveNode}}
** -*THIS METHOD IS SUSPICIOUSLY CALLED IN TWO DIFF PLACES:*-
**# EDIT: this is actually part of an {{OnReconnect}} handler that I misconstrued as something that would be called on the initial connect. -fairly early in the {{ZkController}} constructor-...{code}
// we have to register as live first to pick up docs in the buffer
createEphemeralLiveNode();
{code}
**# again as the very last thing in {{ZkController.init}}...{code}
// Do this last to signal we're up.
createEphemeralLiveNode();
{code}...this line+comment was added recently in SOLR-8696 when it replaced another previously existing call to {{createEphemeralLiveNode}} that was earlier in the init method (see https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=commitdiff;h=8ac4fdd;hp=7d32456efa4ade0130c3ed0ae677aa47b29355a9 )
* Even if {{/live_nodes}} were only populated as the very last line in {{ZkController.init}}, that's far from the last thing that happens when a solr node starts up. Things that happen after {{ZkController}} is initialized but before {{CoreContainer.createAndLoad}} returns and the {{SolrDispatchFilter}} starts accepting requests:
** {{ZkContainer.initZooKeeper}}...
*** whatever the hell this is supposed to do...{code}
if (zkRun != null && zkServer.getServers().size() > 1 && confDir == null && boostrapConf == false) {
  // we are part of an ensemble and we are not uploading the config - pause to give the config time
  // to get up
  Thread.sleep(10000);
}
{code}
*** any node that has a confDir uploads it to zk: {{configManager.uploadConfigDir(configPath, confName);}} (even if it's not bootstrapping???)
*** any node that *IS* doing bootstrap does that: {{ZkController.bootstrapConf(zkController.getZkClient(), cc, solrHome);}}
** {{CoreContainer.load()}}...
*** Authentication plugins are initialized
*** core & collection & configset & container handlers are initialized
*** *{{CoreDescriptor}}s FOR EACH CORE DIR ON DISK ARE LOADED*
**** which of course means opening transaction logs, opening indexwriters, opening searchers, newSearcher event listeners, etc...
*** {{ZkController.checkOverseerDesignate()}} is called (no idea what that does)

Which all leads me to the following conclusions...

# when using {{MiniSolrCloudCluster}}, if you are lucky, there will be at least one node not yet in {{/live_nodes}} when it does its first check, and then it will sleep 1 second giving those nodes time to _actually_ start up & load their cores, and hopefully at least one of them will be completely finished by the time you actually try to use a {{CloudSolrClient}} pointed at that ZK {{/live_nodes}} data.
# unless there is some other "i'm alive" data in ZK that {{MiniSolrCloudCluster}} should be consulting, it seems like it's doing the best it can to ensure that all the nodes are live before returning to the caller
# *This does not seem like a problem that only affects tests.* This seems like a real world problem we should address -- {{CloudSolrClient}} should be able to consult some info in ZK to know when a node is _really_ alive and ready for requests.
#* if there is a reason why the {{/live_nodes}} entry needs to be created as early as it is (ie: {{// we have to register as live first to pick up docs in the buffer}}) then it should only be created that one time and some other ephemeral node should be used
#* whatever ephemeral node is used should be created by a very explicit very special method call made as the very last thing in {{SolrDispatchFilter}}

was (Author: hossman):

Ok, so here's what i've found so far...

* Just adding a single line of logging to my test after {{configureCluster}} and before {{cluster.createCollection}} was enough to make the seed start passing fairly reliably.
** so clearly a finicky timing problem
* {{MiniSolrCloudCluster}}'s constructor has logic that waits for {{/live_nodes}} to have {{numServer}} children before returning
** this was added in SOLR-7146 precisely because of problems like the one i'm seeing
** if there aren't the expected number of {{/live_nodes}} children the first time it checks, then it sleeps in 1 second increments until there are.
* {{/live_nodes}} gets populated by {{ZkController.createEphemeralLiveNode}}
** *THIS METHOD IS SUSPICIOUSLY CALLED IN TWO DIFF PLACES:*
**# fairly early in the {{ZkController}} constructor...{code}
// we have to register as live first to pick up docs in the buffer
createEphemeralLiveNode();
{code}
**# again as the very last thing in {{ZkController.init}}...{code}
// Do this last to signal we're up.
createEphemeralLiveNode();
{code}...this line+comment was added recently in SOLR-8696 when it replaced another previously existing call to {{createEphemeralLiveNode}} that was earlier in the init method (see https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=commitdiff;h=8ac4fdd;hp=7d32456efa4ade0130c3ed0ae677aa47b29355a9 )
* Even if {{/live_nodes}} were only populated as the very last line in {{ZkController.init}}, that's far from the last thing that happens when a solr node starts up.
Things that happen after {{ZkController}} is initialized but before {{CoreContainer.createAndLoad}} returns and the {{SolrDispatchFilter}} starts accepting requests:
** {{ZkContainer.initZooKeeper}}...
*** whatever the hell this is supposed to do...{code}
if (zkRun != null && zkServer.getServers().size() > 1 && confDir == null && boostrapConf == false) {
  // we are part of an ensemble and we are not uploading the config - pause to give the config time
  // to get up
  Thread.sleep(10000);
}
{code}
*** any node that has a confDir uploads it to zk: {{configManager.uploadConfigDir(configPath, confName);}} (even if it's not bootstrapping???)
*** any node that *IS* doing bootstrap does that: {{ZkController.bootstrapConf(zkController.getZkClient(), cc, solrHome);}}
** {{CoreContainer.load()}}...
*** Authentication plugins are initialized
*** core & collection & configset & container handlers are initialized
*** *{{CoreDescriptor}}s FOR EACH CORE DIR ON DISK ARE LOADED*
**** which of course means opening transaction logs, opening indexwriters, opening searchers, newSearcher event listeners, etc...
*** {{ZkController.checkOverseerDesignate()}} is called (no idea what that does)

Which all leads me to the following conclusions...

# when using {{MiniSolrCloudCluster}}, if you are lucky, there will be at least one node not yet in {{/live_nodes}} when it does its first check, and then it will sleep 1 second giving those nodes time to _actually_ start up & load their cores, and hopefully at least one of them will be completely finished by the time you actually try to use a {{CloudSolrClient}} pointed at that ZK {{/live_nodes}} data.
# unless there is some other "i'm alive" data in ZK that {{MiniSolrCloudCluster}} should be consulting, it seems like it's doing the best it can to ensure that all the nodes are live before returning to the caller
# *This does not seem like a problem that only affects tests.* This seems like a real world problem we should address -- {{CloudSolrClient}} should be able to consult some info in ZK to know when a node is _really_ alive and ready for requests.
#* if there is a reason why the {{/live_nodes}} entry needs to be created as early as it is (ie: {{// we have to register as live first to pick up docs in the buffer}}) then it should only be created that one time and some other ephemeral node should be used
#* whatever ephemeral node is used should be created by a very explicit very special method call made as the very last thing in {{SolrDispatchFilter}}

> /live_nodes is populated too early to be very useful for clients --
> CloudSolrClient (and MiniSolrCloudCluster.createCollection) need some other
> ephemeral zk node to know which servers are "ready"
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-8862
>                 URL: https://issues.apache.org/jira/browse/SOLR-8862
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> {{/live_nodes}} is populated surprisingly early (and multiple times) in the
> life cycle of a solr node startup, and as a result probably shouldn't be used
> by {{CloudSolrClient}} (or other "smart" clients) for deciding what servers
> are fair game for requests.
> we should either fix {{/live_nodes}} to be created later in the lifecycle, or
> add some new ZK node for this purpose.
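
To make the "some new ZK node" option concrete, here is a rough sketch of how a smart client could pick targets if nodes announced readiness separately from liveness. Everything in it is an assumption for illustration: the {{/ready_nodes}} path and the {{ReadyNodes}} class do not exist in Solr, and the two sets simply stand in for the children a client would read from ZK.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper; not part of Solr or SolrJ. */
final class ReadyNodes {

    /**
     * Returns the nodes that are both registered as live and have announced
     * that they are ready for requests. liveNodes stands in for the children
     * of /live_nodes; readyNodes stands in for the children of an assumed
     * second ephemeral path (called /ready_nodes here) that a node would
     * create only after its cores are loaded.
     */
    static Set<String> eligible(Set<String> liveNodes, Set<String> readyNodes) {
        Set<String> result = new LinkedHashSet<>(liveNodes); // copy, keep input intact
        result.retainAll(readyNodes);                         // set intersection
        return result;
    }

    public static void main(String[] args) {
        Set<String> live = new LinkedHashSet<>(
            Arrays.asList("127.0.0.1:8983_solr", "127.0.0.1:7574_solr"));
        // Only the first node has finished loading its cores in this example.
        Set<String> ready = new LinkedHashSet<>(
            Arrays.asList("127.0.0.1:8983_solr"));
        System.out.println(eligible(live, ready));
    }
}
```

With this shape, {{/live_nodes}} could keep being written early for whatever internal reason requires it, while clients would only route requests to nodes present in both sets.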
> {panel:title=original bug report}
> I haven't been able to make sense of this yet, but what i'm seeing in a new
> SolrCloudTestCase subclass i'm writing is that the code below, which
> (reasonably) attempts to create a collection immediately after configuring
> the MiniSolrCloudCluster gets a "SolrServerException: No live SolrServers
> available to handle this request" -- in spite of the fact that (as far as i
> can tell at first glance) MiniSolrCloudCluster's constructor is supposed to
> block until all the servers are live..
> {code}
> configureCluster(numServers)
>   .addConfig(configName, configDir.toPath())
>   .configure();
> Map<String, String> collectionProperties = ...;
> assertNotNull(cluster.createCollection(COLLECTION_NAME, numShards, repFactor,
>                                        configName, null, null,
>                                        collectionProperties));
> {code}
> {panel}
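
For reference, the "block until enough {{/live_nodes}} children exist, sleeping between checks" behavior discussed in this thread can be sketched roughly as below. This is not the actual {{MiniSolrCloudCluster}} code: the {{LiveNodeWait}} class, the {{IntSupplier}} standing in for a ZK children count, and the timeout parameter are all assumptions added for illustration.

```java
import java.util.function.IntSupplier;

/** Hypothetical polling helper; not the MiniSolrCloudCluster implementation. */
final class LiveNodeWait {

    /**
     * Polls childCount until it reports at least `expected` entries.
     * childCount stands in for "number of children under /live_nodes".
     * Returns false if the timeout elapses or the thread is interrupted.
     */
    static boolean awaitCount(IntSupplier childCount, int expected,
                              long pollMillis, long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (childCount.getAsInt() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up waiting
            }
            try {
                Thread.sleep(pollMillis); // the thread describes 1 second increments
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Simulated cluster: one more node shows up on every poll.
        boolean ok = awaitCount(() -> ++polls[0], 3, 10, 5_000);
        System.out.println("all registered: " + ok + " after " + polls[0] + " polls");
    }
}
```

Note that a loop like this can only confirm that every node has registered. If registration happens before cores are loaded, returning {{true}} here says nothing about whether the nodes can serve requests, which is the gap this issue is about.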