[ https://issues.apache.org/jira/browse/SOLR-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-6530:
----------------------------------------
    Attachment: SOLR-6530.patch

A trivial test which demonstrates the problem by partitioning the leader from a replica and sending a commit to the replica, which then marks the leader as "down".

> Commits under network partition can put any node in down state by any node
> --------------------------------------------------------------------------
>
>                 Key: SOLR-6530
>                 URL: https://issues.apache.org/jira/browse/SOLR-6530
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Priority: Critical
>             Fix For: 4.11, 5.0
>
>         Attachments: SOLR-6530.patch
>
>
> Commits are executed by any node in SolrCloud, i.e. they're not routed via the leader like other updates.
> # Suppose there's 1 collection, 1 shard, 2 replicas (A and B), and A is the leader.
> # Suppose a commit request is made to node B while B cannot talk to A due to a partition for any reason (failing switch, heavy GC, whatever).
> # B fails to distribute the commit to A (times out) and asks A to recover.
> # This was okay earlier because a leader simply ignored recovery requests, but with the leader-initiated recovery code, B puts A in the "down" state and A can never get out of that state.
> tl;dr: During network partitions, if enough commit/optimize requests are sent to the cluster, all the nodes in the cluster will eventually be marked as "down".

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
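The failure sequence described in the issue can be sketched as a toy simulation. This is not Solr code: the node/state names are hypothetical stand-ins (real replica state lives in ZooKeeper, and recovery goes through the leader-initiated recovery machinery), but it captures why fanning commits out from any node and marking unreachable peers "down" lets a non-leader take down the leader.

```python
# Toy model of SOLR-6530 (hypothetical names, not Solr APIs).
class Node:
    def __init__(self, name):
        self.name = name
        self.state = "active"

def commit(receiver, replicas, partitioned):
    """A commit sent to `receiver` is distributed to every replica
    directly, not routed via the leader. If distribution to a peer
    fails (simulated by `partitioned` pairs), the receiver marks
    that peer "down" -- even when the peer is the leader."""
    for peer in replicas:
        if peer is receiver:
            continue
        if (receiver.name, peer.name) in partitioned:
            # Distribution times out; leader-initiated recovery
            # logic puts the unreachable peer in the "down" state.
            peer.state = "down"

a = Node("A")  # leader
b = Node("B")  # replica
partitioned = {("B", "A")}  # B cannot reach A
commit(b, [a, b], partitioned)
print(a.state)  # the leader is now marked "down"
```

Under repeated partitions, any node that receives a commit can mark any other node "down" this way, which is the tl;dr of the report.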