[
https://issues.apache.org/jira/browse/SOLR-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912241#comment-17912241
]
David Smiley commented on SOLR-16116:
-------------------------------------
Also, the same test failed in a different run, on a PR of mine
[here|https://github.com/apache/solr/actions/runs/12721792820/job/35465118919?pr=3025],
and it shows interesting behavior. ObjectReleaseTracker reports ZkCollectionTerms
and a shard terms object as not closed. I looked at the logs closely while
examining ZkController, and they show that ZkController.reconnect on one of the
4 nodes was called _after_ the nodes were shut down (thus after
ZkController.close). This is not an area I'm very comfortable in, but the
easy/naive answer is that maybe onReconnect() needs to check the shutdown
state? Adding shutdown checks blindly is hacky, though, because there is
usually still a race between the check and the shutdown itself. Thinking out
loud here: objects that are "ref-counted" are more resilient. If ZkController
were ref-counted, then onReconnect would first have to incref; if that failed
it would quit, and if it succeeded, the reference it held would keep a racing
shutdown from closing the controller underneath it. I'm not actually
recommending ref-counting here, just pointing out its merits. I suppose a
better solution is to ensure that the executor that actually runs the
onReconnect logic is shut down before ZkController closes.
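
To make the ref-counting idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: RefCountedController, tryIncRef, decRef, and isReleased are hypothetical names, not Solr's ZkController API, and the "release" step is only a flag standing in for the real cleanup.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical ref-counted controller; NOT Solr's actual ZkController. */
class RefCountedController {
  // Starts at 1: the owner's reference, dropped by close().
  private final AtomicInteger refCount = new AtomicInteger(1);
  private final AtomicBoolean closeRequested = new AtomicBoolean(false);
  private volatile boolean released = false;

  /** Takes a reference, or returns false if the count already hit zero. */
  boolean tryIncRef() {
    while (true) {
      int current = refCount.get();
      if (current <= 0) {
        return false; // fully shut down; the caller must not proceed
      }
      if (refCount.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Drops a reference; the last one out performs the real cleanup. */
  void decRef() {
    if (refCount.decrementAndGet() == 0) {
      released = true; // stand-in for closing terms, watchers, etc.
    }
  }

  /** Owner's close: drops the initial reference exactly once. */
  void close() {
    if (closeRequested.compareAndSet(false, true)) {
      decRef();
    }
  }

  /** Runs the reconnect work only while holding a reference. */
  boolean onReconnect(Runnable work) {
    if (!tryIncRef()) {
      return false; // already closed: quit instead of recreating state
    }
    try {
      work.run();
      return true;
    } finally {
      decRef(); // may trigger the deferred cleanup if close() raced us
    }
  }

  boolean isReleased() {
    return released;
  }
}
```

With this shape, a close() that arrives while onReconnect is running only drops the owner's reference; the cleanup runs when the reconnect work finishes, and any later onReconnect call returns false without running its work. As written it does not stop the work from creating objects after close() was requested, so the work would still need to clean up after itself.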
> Refactor the Solr Zookeeper logic to use Apache Curator
> -------------------------------------------------------
>
> Key: SOLR-16116
> URL: https://issues.apache.org/jira/browse/SOLR-16116
> Project: Solr
> Issue Type: Sub-task
> Components: SolrCloud
> Reporter: Houston Putman
> Assignee: Houston Putman
> Priority: Major
> Labels: pull-request-available
> Fix For: main (10.0)
>
> Time Spent: 10h 10m
> Remaining Estimate: 0h
>