[ https://issues.apache.org/jira/browse/SOLR-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877229#comment-17877229 ]
ASF subversion and git services commented on SOLR-17421:
--------------------------------------------------------
Commit 8513516ed39dd3190786aa93bdbe02e608ff7d9d in solr's branch
refs/heads/main from Pierre Salagnac
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=8513516ed39 ]
SOLR-17421: Make sure overseer drops leadership after QUIT failure (#2663)
Overseer might be stopped after a failure of the QUIT operation if we don't
give up on leadership.
At least, this happens if we get a timeout error when closing the thread pool
for cluster state updater. There might be other scenarios.
> With overseer node role enabled, overseer may be stopped without giving-up
> leadership
> -------------------------------------------------------------------------------------
>
> Key: SOLR-17421
> URL: https://issues.apache.org/jira/browse/SOLR-17421
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 8.11, 9.6
> Reporter: Pierre Salagnac
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Overseer may retain leadership even though the thread pool that is
> supposed to consume the collection state mutator queue has already been shut
> down. Occurrences of this bug are probably infrequent, but when it happens it
> has a huge impact: the overseer cluster state updater is stuck and all
> collection admin requests are very likely to fail. Because the overseer is
> stuck, all enqueued operations (collection creation, deletion...) fail
> and remain in the collection API queue.
> h2. Root cause
> The root cause is that the {{QUIT}} command does not cancel the overseer
> election if any error occurs while shutting down the state updater thread pool.
> {code:java}
> level: ERROR
> logger: org.apache.solr.cloud.Overseer
> message: Overseer could not process the current clusterstate state update message, skipping the message: {
> "operation":"quit",
> "id":"72073405485023239-<host>_solr-n_0000000948"}
> node_name: <host>:8983_solr
> threadId: 281272
> threadName: OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
> thrown: java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
> at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
> at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> h2. Proximate cause
> It seems to me that long-running operations in the collection API could
> trigger the bug more frequently. When such an operation is still running,
> shutting down the thread pool throws an exception once its 60-second timeout
> expires.