[ https://issues.apache.org/jira/browse/SOLR-8973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241669#comment-15241669 ]

Shalin Shekhar Mangar commented on SOLR-8973:
---------------------------------------------

Ah, you are right that the watcher won't be set if the new collection's znode 
is not visible.

But the solution I proposed will still work: if you look at 
ZkStateReader.constructState(), you'll see that it tries to re-create the 
watcher whenever a collection is in the "interestingCollections" set but in 
neither legacyCollectionStates nor watchedCollectionStates. Since 
constructState is called every time any watched znode changes, the new 
collection's state will eventually be cached (a sketch of this retry loop is 
below). I don't think an alternative solution would achieve a better result. 
This may be a side-effect of the way the state is managed, but it works, and 
we can document it with a code comment. The bug here is that we do not call 
ZkStateReader.addCollectionWatch at all if the collection is not yet visible 
to the node.
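
To make that retry behaviour concrete, here is a minimal, self-contained 
sketch of the loop described above. Only the field names 
(interestingCollections, legacyCollectionStates, watchedCollectionStates) 
mirror what this comment refers to; the class name, onAnyWatchedZnodeChanged 
and ensureWatchExists are illustrative and not the real ZkStateReader API.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConstructStateSketch {
    private final Set<String> interestingCollections = ConcurrentHashMap.newKeySet();
    private final Map<String, Object> legacyCollectionStates = new ConcurrentHashMap<>();
    private final Map<String, Object> watchedCollectionStates = new ConcurrentHashMap<>();

    // Called every time any watched znode changes.
    void onAnyWatchedZnodeChanged() {
        constructState();
    }

    // Rebuilds the cluster state and, crucially, retries the watch for any
    // interesting collection whose state is not cached yet.
    private synchronized void constructState() {
        // ... merge legacyCollectionStates and watchedCollectionStates
        //     into a new ClusterState (omitted) ...

        for (String coll : interestingCollections) {
            if (!legacyCollectionStates.containsKey(coll)
                    && !watchedCollectionStates.containsKey(coll)) {
                // The collection's znode may not have existed when the watch was
                // first requested; retrying here means its state is eventually
                // cached once the znode appears and any watched znode changes.
                ensureWatchExists(coll);
            }
        }
    }

    private void ensureWatchExists(String coll) {
        // Placeholder: in the real code this would (re)register a ZooKeeper data
        // watch on the collection's state znode and cache the fetched state.
    }
}
{code}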

[~dragonsinth] Since you recently wrote most of this state management code, 
what do you think?

> TX-frenzy on Zookeeper when collection is put to use
> ----------------------------------------------------
>
>                 Key: SOLR-8973
>                 URL: https://issues.apache.org/jira/browse/SOLR-8973
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, master, 5.6
>            Reporter: Janmejay Singh
>            Assignee: Shalin Shekhar Mangar
>              Labels: collections, patch-available, solrcloud, zookeeper
>         Attachments: SOLR-8973.patch
>
>
> This is to do with a distributed data race. Core creation happens at a time 
> when the collection is not yet visible to the node. In this case a fallback 
> code-path is used which de-references the collection state lazily (on 
> demand) instead of setting a watch and keeping it cached locally.
> Because of this, as requests towards the core mount, it generates a 
> proportionate number of ZK fetches of the collection state. On a large 
> solr-cloud cluster this produces several Gbps of TX traffic on the ZK nodes, 
> which floors indexing throughput in addition to running the ZK nodes out of 
> network bandwidth.
> On smaller solr-cloud clusters this is hard to run into, because the 
> probability of the race materializing is lower.
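
For context, a minimal, self-contained sketch of the two code-paths the report 
contrasts. All names here are illustrative only (not the real ZkStateReader / 
ClusterState API): the lazy fallback re-reads the state from ZooKeeper on 
every dereference, so ZK traffic grows with request volume, while the watched 
path serves a locally cached copy that is refreshed only when the watch fires.

{code:java}
import java.util.function.Supplier;

public class CollectionRefSketch {

    interface CollectionRef {
        Object get(); // returns the collection's current state
    }

    // Fallback path: every request that needs the state triggers a ZK read,
    // so ZK transmit traffic scales with the request rate against the core.
    static class LazyRef implements CollectionRef {
        private final Supplier<Object> zkFetch;
        LazyRef(Supplier<Object> zkFetch) { this.zkFetch = zkFetch; }
        @Override public Object get() { return zkFetch.get(); }
    }

    // Watched path: the state is cached locally and only re-read when the
    // ZooKeeper watch reports a change, independent of the request rate.
    static class WatchedRef implements CollectionRef {
        private volatile Object cached;
        WatchedRef(Object initial) { this.cached = initial; }
        void onWatchFired(Object fresh) { this.cached = fresh; }
        @Override public Object get() { return cached; }
    }
}
{code}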


