[ 
https://issues.apache.org/jira/browse/SOLR-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SOLR-17405:
----------------------------------
    Labels: pull-request-available  (was: )

> Zookeeper session can be re-established by multiple threads concurrently
> ------------------------------------------------------------------------
>
>                 Key: SOLR-17405
>                 URL: https://issues.apache.org/jira/browse/SOLR-17405
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.11, 9.6
>            Reporter: Pierre Salagnac
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: stack.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Because of a bug in SolrCloud, the Zookeeper session can be re-established by 
> multiple threads concurrently when an expiration occurs.
> This portion of the code assumes it is mono-threaded. Because of the bug, the 
> last thread re-establishing the session can waif for 30 seconds per core, 
> waiting for it to be marked {{DOWN}} while it was previously marked 
> {{ACTIVE}} by another thread. With a high number of cores, the Solr server 
> can hang for hours before taking traffic again.
> Following exception shows two threads were reestablishing the session 
> concurrently. {{ZkController.createEphemeralLiveNode()}} should never be 
> invoked twice for the same Zookeeper session.
> {code:java}
> thrown: java. lang.RuntimeException: 
> org.apache.solr.common.cloud.ZooKeeperException:
>          at 
> org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:178)
>          at org.apache.solr. common.cloud.DefaultConnectionStrategy. 
> reconnect(DefaultConnectionStrategy.java:57)
>          at 
> org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:152)
>          at org.apache. 
> zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>          at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.solr.common.cloud.ZooKeeperException:
>          at 
> org.apache.solr.cloud.ZkController$1.command(ZkController.java:462)
>          at 
> org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:170)
>          ... 4 more
> Caused by: org.apache. zookeeper.KeeperException$NodeExistsException. 
> KeeperErrorCode = NodeExists
>          at org.apache.zookeeper.KeeperException.create(KeeperException.java: 
> 126)
>          at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925)
>          at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830)
>          at 
> org.apache.solr.common.cloud.SolrZkClient.lambda$multi$11(SolrZkClient.java:666)
>          at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retry0peration(ZkCmdExecutor.java:71)
>          at 
> org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:666)
>          at org.apache.sol.cloud.ZkController 
> CreateEphemeralLiveNode(ZkController.java:1086)
>          at 
> org.apache.solr.cloud.ZkController$1.command(ZkController.java:411)
> ... 5 more {code}
> h2. Root cause
> This bug occurs because several threads can re-establish the session 
> concurrently.
> It cannot happen at the first expiration of the session, thanks to a thread 
> pool with a single thread to execute the zookeeper Watcher.
> Bellow is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}}
> {code:java}
>         if (watcher instanceof ConnectionManager) {
>           zkConnManagerCallbackExecutor.submit(() -> watcher.process(event));
>         } else {
>            .......
>         }
> {code}
> Using this dedicated thread pool (with a single thread) is supposed to ensure 
> we don’t handle watches for connection related events with multiple threads. 
> This works well for the first session expiration.
> Now, when we re-establish the session after the first expiration, we don’t 
> use this wrapper to register the watch.
> It is done directly in {{ConnectionManager}} without wrapping the ZK watch. 
> In the following snippet, _“this”_ is the ZK watcher instance, but it is not 
> wrapper to use a {{{}ProcessWatchWithExecutor{}}}. This means the next events 
> will directly be handled by any ZK callback thread.
> {code:java}
> connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(), 
> this,
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to