[
https://issues.apache.org/jira/browse/SOLR-17765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-17765:
--------------------------------------
Attachment: SOLR-17765.patch
Status: Open (was: Open)
A trivial attempt to uncomment this code and run all tests w/multiple seeds
only surfaced a single test failure that seemed to be related: {{ZkFailoverTest}}
would error for some seeds (related to PRS randomization IIUC?) during
{{MiniSolrCloudCluster}} shutdown at the end of the test. The exception thrown
would come from the (now uncommented) call to
{{zkController.publishNodeAsDown(...)}}, because the {{ZkTestServer}} had
already been shut down, causing a {{SolrException}} from the low level
{{ZkStateReader}} code.
In my attached patch, I've made two small additions to the originally commented
code:
* a quick check of {{getZkClient().isConnected()}} in
{{CoreContainerProvider}} _before_ calling {{publishNodeAsDown()}}
* _inside_ of {{publishNodeAsDown()}} I added {{SolrException}} to the existing
"warn on ZK exceptions and treat as No-Op" handling, since the javadocs for the
method explicitly say {{Best effort to set DOWN state...}} suggesting that
underlying SolrExceptions should not be propagated.
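The shape of those two additions can be sketched roughly as follows. This is not the code from the attached patch: every type below is a hypothetical stand-in (none of them are real Solr classes), the connectivity check is simplified to a single {{isConnected()}} method on the stand-in controller, and the "logging" is just a list of strings so the behavior is easy to observe.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not a Solr class. */
public class PublishDownSketch {

  /** Stand-in for a checked ZK-level failure. */
  public static class SketchZkException extends Exception {
    public SketchZkException(String msg) { super(msg); }
  }

  /** Stand-in for an unchecked exception raised by lower-level state-reading code. */
  public static class SketchSolrException extends RuntimeException {
    public SketchSolrException(String msg) { super(msg); }
  }

  /** Stand-in for the controller; records events instead of talking to ZK. */
  public static class SketchZkController {
    public final List<String> events = new ArrayList<>();
    private final String nodeName;
    private final boolean connected;
    private final boolean stateReaderFails;

    public SketchZkController(String nodeName, boolean connected, boolean stateReaderFails) {
      this.nodeName = nodeName;
      this.connected = connected;
      this.stateReaderFails = stateReaderFails;
    }

    public String getNodeName() { return nodeName; }

    public boolean isConnected() { return connected; }

    /** Best effort: failures are recorded as warnings and never propagated. */
    public void publishNodeAsDown(String node) {
      try {
        writeDownState(node);
        events.add("DOWN:" + node);
      } catch (SketchZkException | SketchSolrException e) {
        // Second addition: the unchecked exception joins the existing
        // "warn and treat as no-op" handling.
        events.add("WARN:" + e.getMessage());
      }
    }

    private void writeDownState(String node) throws SketchZkException {
      if (stateReaderFails) {
        throw new SketchSolrException("state reader unavailable for " + node);
      }
    }
  }

  /** First addition: only attempt the publish while still connected. */
  public static void close(SketchZkController zkController) {
    if (zkController != null && zkController.isConnected()) {
      zkController.publishNodeAsDown(zkController.getNodeName());
    }
  }
}
```

In this sketch a disconnected controller records nothing at all, while a connected controller whose state lookup fails records a single warning instead of throwing out of {{close()}}.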
/ping [[email protected]] & [~gus] given their previous work on this code
> Nodes should publish themselves as DOWN ASAP during shutdown
> ------------------------------------------------------------
>
> Key: SOLR-17765
> URL: https://issues.apache.org/jira/browse/SOLR-17765
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-17765.patch
>
>
> While working on SOLR-17744 i noticed this comment in
> {{CoreContainerProvider.close()}} (that seems to date back to SOLR-15590) ...
> {noformat}
> // Mark Miller suggested that we should be publishing that we are down before anything else
> // which makes good sense, but the following causes test failures, so that improvement can be
> // the subject of another PR/issue. Also, jetty might already be refusing requests by this point
> // so that's a potential issue too. Digging slightly I see that there's a whole mess of code
> // looking up collections and calculating state changes associated with this call, which smells
> // a lot like we're duplicating node state in collection stuff, but it will take a lot of code
> // reading to figure out if that's really what it is, why we did it and if there's room for
> // improvement.
> // if (cc != null) {
> //   ZkController zkController = cc.getZkController();
> //   if (zkController != null) {
> //     zkController.publishNodeAsDown(zkController.getNodeName());
> //   }
> // }
> {noformat}
> ...I'm creating this Jira because I see no other existing Jira addressing
> this idea.