[ https://issues.apache.org/jira/browse/SOLR-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249976#comment-17249976 ]
ASF subversion and git services commented on SOLR-15029:
--------------------------------------------------------

Commit bf7b438f12d65904b461e595594fc9a64cfcc899 in lucene-solr's branch refs/heads/master from Mike Drob
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bf7b438 ]

SOLR-15029 Trigger leader election on index writer tragedy

SOLR-13027 Use TestInjection so that we always have a Tragic Event

When we encounter a tragic error in the index writer, we can trigger a
leader election instead of queuing up a delete and re-add of the node in
question. This should result in a more graceful transition, and the
previous leader will eventually be put into recovery by a new leader.

closes #2120

> More gracefully allow Shard Leader to give up leadership
> --------------------------------------------------------
>
>                 Key: SOLR-15029
>                 URL: https://issues.apache.org/jira/browse/SOLR-15029
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Mike Drob
>            Assignee: Mike Drob
>            Priority: Major
>             Fix For: 8.8, master (9.0)
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently (via SOLR-12412), when a leader sees an index writing error
> during an update, it gives up leadership by deleting the replica and
> adding a new replica. One stated benefit of this was that, because we are
> using the overseer and a known code path, this is done asynchronously and
> very efficiently.
> I would argue that this approach is too heavy handed.
> In the case of a corrupt index exception, it makes some sense to completely
> delete the index dir and attempt to sync from a good peer. Even in this case,
> however, it might be better to allow fingerprinting and other index delta
> mechanisms to take over and allow for a more efficient data transfer.
> In an alternate case where the index error arises due to a disconnected file
> system (possible with shared file systems, e.g. S3, HDFS, some k8s systems)
> and the required solution is some kind of reconnect, this approach has
> several shortcomings: the core deletions and creations are going to fail,
> leaving dangling replicas. Further, the data is still present, so there is no
> need to do so many extra copies.
> I propose that we bring in a mechanism to give up leadership via the existing
> shard terms language. I believe we would be able to set all replicas
> currently equal to leader term T to T+1, and then trigger a new leader
> election. The current leader would know it is ineligible, while the other
> replicas that were current before the failed update would be eligible. This
> improvement would entail adding an additional possible operation to the terms
> state machine.
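
The shard-terms proposal in the description can be illustrated with a small, self-contained sketch. This is only a toy model under stated assumptions: the class ShardTermsSketch and its methods are hypothetical names invented for this example, terms are kept in an in-memory map rather than in ZooKeeper, and none of this is Solr's actual shard terms API or the code in the commit. It shows the one operation the description asks for: replicas whose term equals the leader's term T move to T+1, the leader stays at T, and only replicas holding the highest term are eligible in the following election.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of shard terms. Not Solr's real API. */
public class ShardTermsSketch {
    // replica name -> term; an in-memory stand-in for the shared terms state
    private final Map<String, Long> terms = new HashMap<>();

    public void setTerm(String replica, long term) {
        terms.put(replica, term);
    }

    public long getTerm(String replica) {
        return terms.get(replica);
    }

    /**
     * Models the proposed operation: every replica whose term equals the
     * leader's term T is moved to T+1, while the leader itself stays at T.
     * Replicas that were already behind the leader are left untouched.
     */
    public void leaderGivesUp(String leader) {
        long leaderTerm = terms.get(leader);
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (!e.getKey().equals(leader) && e.getValue() == leaderTerm) {
                e.setValue(leaderTerm + 1);
            }
        }
    }

    /** Replicas holding the highest term are the only election candidates. */
    public Set<String> eligibleForLeadership() {
        long max = terms.values().stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
        Set<String> eligible = new TreeSet<>();
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            if (e.getValue() == max) {
                eligible.add(e.getKey());
            }
        }
        return eligible;
    }
}
```

In this model, with a leader and one replica at term 5 and a second replica lagging at term 4, calling leaderGivesUp on the leader leaves only the first replica eligible. The old leader keeps its data and its term, so it can be sent into recovery by the new leader instead of being deleted and re-created.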