[jira] [Commented] (SOLR-17421) With overseer node role enabled, overseer may be stopped without giving-up leadership

ASF subversion and git services (Jira) Tue, 27 Aug 2024 19:04:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877229#comment-17877229
 ]


ASF subversion and git services commented on SOLR-17421:
--------------------------------------------------------

Commit 8513516ed39dd3190786aa93bdbe02e608ff7d9d in solr's branch 
refs/heads/main from Pierre Salagnac
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=8513516ed39 ]

SOLR-17421: Make sure overseer drops leadership after QUIT failure (#2663)

Overseer might be stopped after a failure of the QUIT operation if we don't 
give up on leadership.
At least, this happens if we get a timeout error when closing the thread pool 
for cluster state updater.  There might be other scenarios.
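
The guarantee this commit message describes can be sketched as follows. This is a minimal illustration, not the code from the commit: the class QuitHandlerSketch, its handleQuit method and the closeStateUpdaterPool parameter are hypothetical names, and the actual change in Solr may be structured differently. The idea shown is only that leadership is released on every exit path of QUIT handling, including the one where closing the thread pool throws.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the overseer; these names are illustrative, not Solr's.
class QuitHandlerSketch {
    private final AtomicBoolean leader = new AtomicBoolean(true);

    boolean isLeader() {
        return leader.get();
    }

    // Runs the shutdown step and drops leadership whether or not that step fails.
    // Returns true if the shutdown step completed without an exception.
    boolean handleQuit(Runnable closeStateUpdaterPool) {
        try {
            closeStateUpdaterPool.run();
            return true;
        } catch (RuntimeException e) {
            // For example "Timeout waiting for pool ... to shutdown".
            return false;
        } finally {
            // Always give up leadership, even after a failed shutdown.
            leader.set(false);
        }
    }
}
```

With this shape, a node whose shutdown step times out no longer reports itself as leader, so another node can take over the election.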

> With overseer node role enabled, overseer may be stopped without giving-up 
> leadership
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-17421
>                 URL: https://issues.apache.org/jira/browse/SOLR-17421
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.11, 9.6
>            Reporter: Pierre Salagnac
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Overseer may retain the leadership status while the thread pool that is 
> supposed to consume the collection state mutator queue was already shut down.
> Occurrences of this bug are probably not frequent. But when it happens, it 
> has a huge impact. The overseer cluster state updater is stuck and all 
> collection admin requests are very likely to fail. Because of the stuck 
> overseer, all the enqueued operations (collection creation, deletion...) fail 
> and remain in the collection API queue.
> h2. Root cause
> The root cause is that the {{QUIT}} command does not cancel the overseer 
> election if any error happens while shutting down the state updater thread pool.
> {code:java}
> level:  ERROR
>     logger:  org.apache.solr.cloud.Overseer
>     message:  Overseer could not process the current clusterstate state update message, skipping the message: {
> "operation":"quit",
> "id":"72073405485023239-<host>_solr-n_0000000948"}
>     node_name:  <host>:8983_solr
>     threadId:  281272
>     threadName:  OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
>     thrown:  java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
> at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
> at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
> at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
> at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> h2. Proximate cause
> It seems to me that long-running operations in the collection API could 
> trigger the bug more frequently. Because of a long-running operation, we get 
> an exception when shutting down the thread pool, which has a 60-second timeout.
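
The proximate cause described above can be reproduced with plain java.util.concurrent, under some assumptions. The sketch below uses a hypothetical helper, shutdownAndAwait, that follows the usual shutdown-then-await pattern; it is not Solr's ExecutorUtil, and the 100 ms timeout is a scaled-down stand-in for the 60 seconds mentioned in the report. It shows that a task still running when shutdown begins makes the await step time out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not Solr's ExecutorUtil.
class PoolShutdownTimeoutSketch {

    // Returns true if the pool terminated within the timeout.
    static boolean shutdownAndAwait(ExecutorService pool, long timeoutMs)
            throws InterruptedException {
        // shutdown() rejects new tasks but does not interrupt running ones.
        pool.shutdown();
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Simulates a long-running operation that is still active at shutdown time.
    static boolean demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        pool.submit(() -> {
            started.countDown();
            try {
                release.await(); // stands in for a long collection API operation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            started.await(); // make sure the task is running before shutdown
            return shutdownAndAwait(pool, 100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the task finish so the pool can terminate
        }
    }
}
```

Here demo() returns false because the single worker thread is still busy when the timeout expires, which corresponds to the "Timeout waiting for pool" exception in the log above.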



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
