[ https://issues.apache.org/jira/browse/SOLR-8973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241669#comment-15241669 ]
Shalin Shekhar Mangar commented on SOLR-8973:
---------------------------------------------

Ah, you are right that the watcher won't be set if the new collection's znode is not visible. But the solution that I proposed will still work -- if you see ZkStateReader.constructState(), you'll see that it tries to re-create the watcher every time a collection is in the "interestingCollections" set but neither in legacyCollectionStates nor in watchedCollectionStates. Since the constructState method is called each and every time any watched znode changes, the new collection's state will eventually be cached.

I don't think that any alternate solution will achieve a better result. This may be a side-effect of the way the state is managed but it will work. We can document this as a code comment as well.

The bug here is that we do not call ZkStateReader.addCollectionWatch at all if the collection is not visible to the node yet.

[~dragonsinth] Since you recently wrote most of this state management code, what do you think?

> TX-frenzy on Zookeeper when collection is put to use
> ----------------------------------------------------
>
>                 Key: SOLR-8973
>                 URL: https://issues.apache.org/jira/browse/SOLR-8973
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, master, 5.6
>            Reporter: Janmejay Singh
>            Assignee: Shalin Shekhar Mangar
>              Labels: collections, patch-available, solrcloud, zookeeper
>         Attachments: SOLR-8973.patch
>
>
> This is to do with a distributed data-race. Core-creation happens at a time
> when collection is not yet visible to the node. In this case a fallback
> code-path is used which de-references collection-state lazily (on demand) as
> opposed to setting a watch and keeping it cached locally.
> Due to this, as requests towards the core mount, it generates ZK fetch for
> collection proportionately. On a large solr-cloud cluster, this generates
> several Gbps of TX traffic on ZK nodes.
> This affects indexing
> throughput (which floors) in addition to running the ZK node out of network
> bandwidth.
> On smaller solr-cloud clusters it's hard to run into, because the probability of
> this race materializing reduces.
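
To make the retry behaviour described in the comment concrete, here is a minimal sketch in Java. It is not Solr's code: FakeZk and ToyStateReader are hypothetical stand-ins, the method names addCollectionWatch and constructState are borrowed from the comment but their bodies here are simplified assumptions, and legacyCollectionStates is left out entirely. The sketch only illustrates the idea that a collection registered as "interesting" but not yet visible is fetched lazily on every request until a later state rebuild manages to cache it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WatchRetrySketch {

    /** Hypothetical stand-in for ZooKeeper: collection name -> state, with a read counter. */
    static class FakeZk {
        final Map<String, String> znodes = new HashMap<>();
        int reads = 0;

        String read(String name) {
            reads++;
            return znodes.get(name); // null while the znode is not visible yet
        }
    }

    /** Toy model of the reader; an assumption-laden simplification, not ZkStateReader. */
    static class ToyStateReader {
        private final FakeZk zk;
        private final Set<String> interestingCollections = new HashSet<>();
        private final Map<String, String> watchedCollectionStates = new HashMap<>();

        ToyStateReader(FakeZk zk) {
            this.zk = zk;
        }

        /** Register interest even if the collection cannot be seen yet. */
        void addCollectionWatch(String name) {
            interestingCollections.add(name);
            tryWatch(name);
        }

        /** Modeled as running whenever any watched znode changes. */
        void constructState() {
            for (String name : interestingCollections) {
                if (!watchedCollectionStates.containsKey(name)) {
                    tryWatch(name); // retry the watch that could not be set earlier
                }
            }
        }

        /** Cached lookup if watched, otherwise a lazy fetch that costs one ZK read. */
        String getState(String name) {
            String cached = watchedCollectionStates.get(name);
            return cached != null ? cached : zk.read(name);
        }

        private void tryWatch(String name) {
            String state = zk.read(name);
            if (state != null) {
                watchedCollectionStates.put(name, state);
            }
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        ToyStateReader reader = new ToyStateReader(zk);

        reader.addCollectionWatch("c1");          // znode not visible: nothing cached
        for (int i = 0; i < 3; i++) {
            reader.getState("c1");                // every request goes to ZK
        }
        System.out.println("reads while uncached: " + zk.reads);

        zk.znodes.put("c1", "active");            // collection becomes visible
        reader.constructState();                  // some watched znode changed
        int readsAfterRetry = zk.reads;
        for (int i = 0; i < 3; i++) {
            reader.getState("c1");                // now served from the local cache
        }
        System.out.println("extra reads once cached: " + (zk.reads - readsAfterRetry));
    }
}
```

In this model the per-request reads stop only after constructState runs once the znode is visible, which is why the comment stresses that addCollectionWatch has to be called even when the collection cannot be seen yet: without the entry in the interesting set, the retry never happens.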