[ 
https://issues.apache.org/jira/browse/SOLR-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Salagnac updated SOLR-17405:
-----------------------------------
    Description: 
Because of a bug in SolrCloud, the ZooKeeper session can be re-established by 
multiple threads concurrently when an expiration occurs.

This portion of the code assumes it runs on a single thread. Because of the 
bug, the last thread re-establishing the session can wait 30 seconds per core 
for the core to be marked {{DOWN}}, when another thread has already marked it 
{{ACTIVE}}. With a high number of cores, the Solr server can hang for hours 
before taking traffic again.
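
To give a sense of scale, here is a rough back-of-the-envelope sketch. It assumes the 30-second waits happen one core after another and that every core hits the full wait; the class name and the core counts are made up for illustration.

```java
import java.time.Duration;

class RecoveryDelayEstimate {
    // Per-core wait taken from the description above.
    static final Duration WAIT_PER_CORE = Duration.ofSeconds(30);

    /** Worst case under the stated assumption: every core waits the full 30 s, sequentially. */
    static Duration worstCaseDelay(int coreCount) {
        return WAIT_PER_CORE.multipliedBy(coreCount);
    }

    public static void main(String[] args) {
        // Hypothetical core counts, not measurements from a real cluster.
        for (int cores : new int[] {10, 100, 500}) {
            System.out.printf("%d cores -> %d minutes%n",
                    cores, worstCaseDelay(cores).toMinutes());
        }
    }
}
```

With 500 cores the estimate is 250 minutes, which is consistent with a node staying out of rotation for hours.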

The following exception shows that two threads were re-establishing the 
session concurrently. {{ZkController.createEphemeralLiveNode()}} should never 
be invoked twice for the same ZooKeeper session.
{code:java}
thrown: java.lang.RuntimeException: org.apache.solr.common.cloud.ZooKeeperException:
    at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:178)
    at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
    at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:152)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.solr.common.cloud.ZooKeeperException:
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:462)
    at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:170)
    ... 4 more
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830)
    at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$11(SolrZkClient.java:666)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
    at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:666)
    at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1086)
    at org.apache.solr.cloud.ZkController$1.command(ZkController.java:411)
    ... 5 more
{code}
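
The {{NodeExists}} failure at the bottom of the trace can be illustrated with a toy model. This is only a sketch: the map stands in for the ZooKeeper namespace, and the {{NodeExists}} class, the node path, and the session id are all hypothetical stand-ins, not the real ZooKeeper or Solr API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class LiveNodeSketch {
    /** Hypothetical stand-in for ZooKeeper's NodeExistsException. */
    static class NodeExists extends RuntimeException {
        NodeExists(String path) {
            super("KeeperErrorCode = NodeExists for " + path);
        }
    }

    // Toy namespace: node path -> id of the session that owns the ephemeral node.
    private final ConcurrentMap<String, Long> nodes = new ConcurrentHashMap<>();

    /** Creating a node that already exists fails, even for the same session. */
    void createEphemeral(String path, long sessionId) {
        if (nodes.putIfAbsent(path, sessionId) != null) {
            throw new NodeExists(path);
        }
    }

    /** Runs the "create live node" step `attempts` times for one session; returns the failure count. */
    static int failedAttempts(int attempts) {
        LiveNodeSketch zk = new LiveNodeSketch();
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                zk.createEphemeral("/live_nodes/example-node", 42L); // made-up path and session id
            } catch (NodeExists e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures with 1 attempt:  " + failedAttempts(1)); // 0
        System.out.println("failures with 2 attempts: " + failedAttempts(2)); // 1
    }
}
```

A single attempt per session succeeds; a second attempt for the same session is the one that fails, which matches the pattern in the trace.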
h2. Root cause

This bug occurs because several threads can re-establish the session 
concurrently.
It cannot happen on the first expiration of the session, because the 
ZooKeeper Watcher is executed by a thread pool with a single thread.

Below is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}}:
{code:java}
        if (watcher instanceof ConnectionManager) {
          zkConnManagerCallbackExecutor.submit(() -> watcher.process(event));
        } else {
           .......
        }
{code}
Using this dedicated thread pool (with a single thread) is supposed to ensure 
that watches for connection-related events are never handled by multiple 
threads. This works well for the first session expiration.
However, when we re-establish the session after the first expiration, we 
don’t use this wrapper to register the watch.

Instead, the watch is registered directly in {{ConnectionManager}} without 
wrapping the ZK watcher. In the following snippet, _“this”_ is the ZK watcher 
instance, but it is not wrapped in a {{ProcessWatchWithExecutor}}. This means 
subsequent events are handled directly by any ZK callback thread.
{code:java}
connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(), this,
{code}
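
The difference between the two registration paths can be reproduced in a small standalone model. This is a sketch under assumptions, not Solr code: a {{Consumer<String>}} plays the role of the watcher, plain threads play the role of ZK callback threads, and all class and method names are invented for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class WatcherDispatchSketch {

    /**
     * Delivers one event from each of `callbackThreads` threads and returns
     * how many distinct threads ended up running the watcher.
     */
    static int distinctHandlerThreads(boolean wrapped, int callbackThreads) {
        Set<String> handlers = ConcurrentHashMap.newKeySet();
        // Toy watcher: records the name of the thread that runs it.
        Consumer<String> watcher = event -> handlers.add(Thread.currentThread().getName());

        ExecutorService single = Executors.newSingleThreadExecutor();
        // Wrapped: hand the event over to the single-thread executor.
        // Unwrapped: run the watcher on whichever thread delivers the event.
        Consumer<String> registered = wrapped
                ? event -> single.submit(() -> watcher.accept(event))
                : watcher;

        Thread[] callers = new Thread[callbackThreads];
        for (int i = 0; i < callers.length; i++) {
            callers[i] = new Thread(() -> registered.accept("Expired"), "callback-" + i);
            callers[i].start();
        }
        try {
            for (Thread t : callers) {
                t.join();
            }
            single.shutdown();
            single.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handlers.size();
    }

    public static void main(String[] args) {
        System.out.println("wrapped:   " + distinctHandlerThreads(true, 3));  // 1
        System.out.println("unwrapped: " + distinctHandlerThreads(false, 3)); // 3
    }
}
```

With the wrapper, every event is processed by the executor's single thread regardless of which thread delivered it. Without it, each delivering thread runs the watcher itself, which is the situation described above after the first expiration.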

  was:
Because of a bug in SolrCloud, the ZooKeeper session can be re-established by 
multiple threads concurrently when an expiration occurs.

This portion of the code assumes it runs on a single thread. Because of the 
bug, the last thread re-establishing the session can wait 30 seconds per core 
for the core to be marked {{DOWN}}, when another thread has already marked it 
{{ACTIVE}}. With a high number of cores, the Solr server can hang for hours 
before taking traffic again.

The following exception shows that two threads were re-establishing the 
session concurrently. {{ZkController.createEphemeralLiveNode()}} should never 
be invoked twice for the same ZooKeeper session.

!stack.png!

h2. Root cause

This bug occurs because several threads can re-establish the session 
concurrently.
It cannot happen on the first expiration of the session, because the 
ZooKeeper Watcher is executed by a thread pool with a single thread.

Below is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}}:
{code:java}
        if (watcher instanceof ConnectionManager) {
          zkConnManagerCallbackExecutor.submit(() -> watcher.process(event));
        } else {
           .......
        }
{code}
Using this dedicated thread pool (with a single thread) is supposed to ensure 
that watches for connection-related events are never handled by multiple 
threads. This works well for the first session expiration.
However, when we re-establish the session after the first expiration, we 
don’t use this wrapper to register the watch.

Instead, the watch is registered directly in {{ConnectionManager}} without 
wrapping the ZK watcher. In the following snippet, _“this”_ is the ZK watcher 
instance, but it is not wrapped in a {{ProcessWatchWithExecutor}}. This means 
subsequent events are handled directly by any ZK callback thread.
{code:java}
connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(), this,
{code}


> Zookeeper session can be re-established by multiple threads concurrently
> ------------------------------------------------------------------------
>
>                 Key: SOLR-17405
>                 URL: https://issues.apache.org/jira/browse/SOLR-17405
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.11, 9.6
>            Reporter: Pierre Salagnac
>            Priority: Major
>         Attachments: stack.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
