[jira] [Commented] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

ASF subversion and git services (JIRA) Wed, 05 Apr 2017 03:32:54 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956634#comment-15956634
 ]


ASF subversion and git services commented on SOLR-10277:
--------------------------------------------------------

Commit 60303028debf3927e0c3abfaaa4015f73b88e689 in lucene-solr's branch 
refs/heads/master from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6030302 ]

SOLR-10277: On 'downnode', lots of wasteful mutations are done to ZK


> On 'downnode', lots of wasteful mutations are done to ZK
> --------------------------------------------------------
>
>                 Key: SOLR-10277
>                 URL: https://issues.apache.org/jira/browse/SOLR-10277
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>            Reporter: Joshua Humphries
>            Assignee: Scott Blum
>              Labels: leader, zookeeper
>         Attachments: SOLR-10277-5.5.3.patch, SOLR-10277.patch, 
> SOLR-10277.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it performs far more writes to ZK 
> than necessary. In our cluster of 48 hosts, the majority of collections have 
> only 1 shard and 1 replica, so a single node restarting should result in 
> only ~1/40th of the collections being updated with new replica states (to 
> indicate that the node is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. As a result, we have to do rolling restarts very slowly to avoid 
> a severe outage caused by the overseer doing far too much work for each 
> restarted host. Messages for subsequent shards becoming leader can't be 
> processed until the 'downnode' message is fully processed, so a fast rolling 
> restart can cause the overseer queue to grow very large, leaving nearly 
> all shards leader-less until that backlog is processed.
> The fix is a trivial logic change: only add a ZkWriteCommand for 
> collections that actually have an impacted replica.
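
The logic change described above can be illustrated with a minimal, self-contained sketch. This is not the committed patch and does not use Solr's classes: DownNodeSketch, Replica, WriteCommand and the string states are hypothetical stand-ins for NodeMutator#downNode, the cluster-state replica entries and ZkWriteCommand. It assumes that every replica hosted on the downed node is marked "down" and that a write is queued only for collections where at least one replica was touched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Solr's real API.
final class DownNodeSketch {

    /** Stand-in for one replica entry in the cluster state. */
    static final class Replica {
        final String name;
        final String nodeName;
        final String state;

        Replica(String name, String nodeName, String state) {
            this.name = name;
            this.nodeName = nodeName;
            this.state = state;
        }
    }

    /** Stand-in for a pending ZK write covering a single collection. */
    static final class WriteCommand {
        final String collection;
        final List<Replica> replicas;

        WriteCommand(String collection, List<Replica> replicas) {
            this.collection = collection;
            this.replicas = replicas;
        }
    }

    /**
     * Builds write commands for a downed node. A collection gets a command
     * only if at least one of its replicas lives on that node.
     */
    static List<WriteCommand> downNode(Map<String, List<Replica>> collections,
                                       String downedNode) {
        List<WriteCommand> writes = new ArrayList<>();
        for (Map.Entry<String, List<Replica>> entry : collections.entrySet()) {
            boolean needsUpdate = false;
            List<Replica> updated = new ArrayList<>();
            for (Replica replica : entry.getValue()) {
                if (replica.nodeName.equals(downedNode)) {
                    // Replica is on the downed node: mark it down.
                    updated.add(new Replica(replica.name, replica.nodeName, "down"));
                    needsUpdate = true;
                } else {
                    updated.add(replica);
                }
            }
            // The key point: skip collections with no impacted replica.
            if (needsUpdate) {
                writes.add(new WriteCommand(entry.getKey(), updated));
            }
        }
        return writes;
    }
}
```

With two single-replica collections on different nodes, calling downNode for one of those nodes yields exactly one write command instead of two, which is the saving the issue describes for clusters with many small collections.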


