[ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964979#comment-14964979 ]
Shalin Shekhar Mangar commented on SOLR-7569:
---------------------------------------------

Thanks Ishan.

# ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep. Why? Isn't waitForRecoveriesToFinish() enough?
# Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached. What is this steady state, and is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.
# Can you please add some javadocs on the various test methods describing the scenario that they are testing?
# minor nit - can you use assertEquals instead of assertTrue when testing equality of state etc.? The advantage of assertEquals is that it logs the mismatched values in the exception message.
# In OverseerCollectionMessageHandler, lirPath can never be null, so the null check is redundant. The LIR path should probably be logged at DEBUG rather than INFO.
{code}
// Clear out any LIR state
String lirPath = overseer.getZkController().getLeaderInitiatedRecoveryZnodePath(collection, sliceId);
if (lirPath != null && zkStateReader.getZkClient().exists(lirPath, true)) {
  StringBuilder sb = new StringBuilder();
  zkStateReader.getZkClient().printLayout(lirPath, 4, sb);
  log.info("Cleaning out LIR data, which was: " + sb);
  zkStateReader.getZkClient().clean(lirPath);
}
{code}
# There's no need to send an empty string as the role while publishing the state of the replica.
# minor nit - you can compare enums directly using == instead of .equals
# Referring to the following, what is the thinking behind it? When can this happen? Is there a test which specifically exercises this scenario? It seems like this could interfere with the leader election if the election was taking some time.
{code}
// If we still don't have an active leader by now, it maybe possible that the replica at the head of the election queue
// was the leader at some point and never left the queue, but got marked as down. So, if the election queue is not empty,
// and the replica at the head of the queue is live, then mark it as a leader.
{code}

> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
>                 Key: SOLR-7569
>                 URL: https://issues.apache.org/jira/browse/SOLR-7569
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-high
>         Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard, e.g. all
> replicas' last published state was recovery, or bugs which cause a
> leader to be marked as 'down'. While the best solution is that they never get
> into this state, we need a manual way to fix this when it does get into this
> state. Right now we can do a dance involving bouncing the node
> (since recovery paths between bouncing and REQUESTRECOVERY are different),
> but that is difficult when running a large cluster. Although it is possible
> that such a manual API may lead to some data loss, in some cases it is
> the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force
> replicas into recovering a leader while avoiding data loss on a best effort
> basis.
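
On the Thread.sleep points in the review above: one common alternative to a fixed sleep is to poll for the expected condition with a deadline, so the test proceeds as soon as the state is reached and fails with a clear message if it never is. The following is only a minimal sketch of that idea. `WaitFor` and `condition` are hypothetical names, not part of Solr or its test framework, and the 50 ms poll interval is an arbitrary assumption.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: polls a condition instead of sleeping for a fixed time. */
final class WaitFor {
  private WaitFor() {}

  /**
   * Blocks until the condition returns true, or throws TimeoutException
   * once the timeout has elapsed. The condition is re-checked every 50 ms
   * (an arbitrary interval chosen for this sketch).
   */
  static void condition(String description, long timeout, TimeUnit unit,
                        BooleanSupplier condition)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!condition.getAsBoolean()) {
      // Compare via subtraction so the check stays correct if nanoTime wraps.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Timed out waiting for: " + description);
      }
      Thread.sleep(50);
    }
  }
}
```

In a test, a call such as `WaitFor.condition("shard1 has an active leader", 15, TimeUnit.SECONDS, () -> shardHasActiveLeader())` would replace the 15 second sleep, where `shardHasActiveLeader()` stands for a hypothetical predicate the test supplies by reading the cluster state. The timeout then acts as an upper bound rather than a fixed cost on every run.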