[ https://issues.apache.org/jira/browse/SOLR-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998379#comment-13998379 ]
Timothy Potter commented on SOLR-5468:
--------------------------------------

Hoping to commit this in the next couple of days. Any feedback before then would be appreciated.

> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-5468
>                 URL: https://issues.apache.org/jira/browse/SOLR-5468
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.5
>        Environment: All
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Minor
>         Attachments: SOLR-5468.patch, SOLR-5468.patch
>
>
> I've been thinking about how SolrCloud deals with write-availability
> using in-sync replica sets, in which writes will continue to be
> accepted so long as there is at least one healthy node per shard.
>
> For a little background (and to verify my understanding of the process
> is correct), SolrCloud only considers active/healthy replicas when
> acknowledging a write. Specifically, when a shard leader accepts an
> update request, it forwards the request to all active/healthy replicas
> and only considers the write successful if all active/healthy replicas
> ack the write. Any down/gone replicas are not considered and will sync
> up with the leader when they come back online using peer sync or
> snapshot replication. For instance, if a shard has 3 nodes, A, B, C,
> with A being the current leader, then writes to the shard will
> continue to succeed even if B & C are down.
>
> The issue is that if a shard leader continues to accept updates even
> after it loses all of its replicas, then we have acknowledged updates
> on only 1 node. If that node, call it A, then fails and one of the
> previous replicas, call it B, comes back online before A does, then
> any writes that A accepted while the other replicas were offline are
> at risk of being lost.
>
> SolrCloud does provide a safeguard for this problem with the
> leaderVoteWait setting, which puts any replicas that come back online
> before node A into a temporary wait state. If A comes back online
> within the wait period, then all is well, as it will become the leader
> again and no writes will be lost. As a side note, sys admins
> definitely need to be made more aware of this situation; when I first
> encountered it in my cluster, I had no idea what it meant.
>
> My question is whether we want to consider an approach where SolrCloud
> will not accept writes unless a majority of replicas is available to
> accept the write. For my example, under this approach, we wouldn't
> accept writes if both B & C failed, but would if only C did, leaving
> A & B online. Admittedly, this lowers the write-availability of the
> system, so it may be something that should be tunable.
>
> From Mark M: Yeah, this is kind of like one of many little features
> that we have just not gotten to yet. I've always planned for a param
> that lets you say how many replicas an update must be verified on
> before responding success. Seems to make sense to fail that type of
> request early if you notice there are not enough replicas up to
> satisfy the param to begin with.
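For anyone following along, here is a rough SolrJ sketch of what Mark's per-request
tunable could look like from the client side. The "min_rf" parameter name and the
idea that the response reports back the achieved replication factor follow the
approach discussed for the attached patch; treat both as provisional assumptions
rather than a committed API:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class MinRfSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster via ZooKeeper (SolrJ 4.x client).
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Hypothetical tunable: ask the leader to only report success if
            // the update is acknowledged by at least 2 replicas ("min_rf" is
            // the name used in the patch discussion; treat as provisional).
            req.setParam("min_rf", "2");

            NamedList<Object> rsp = server.request(req);
            // The response would carry the achieved replication factor, so
            // the client can retry or alert when it falls below the minimum.
            System.out.println(rsp);

            server.shutdown();
        }
    }

The key design point is that the minimum replication factor is requested per
update, so a client that cares about durability can demand a quorum while other
clients keep today's at-least-one-healthy-replica behavior.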