[ https://issues.apache.org/jira/browse/SOLR-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758564#comment-16758564 ]
Hoss Man commented on SOLR-13189:
---------------------------------

{quote}In older versions these tests might have worked because before the request returns to the client, the leader would have called to the replica and told it to go into recovery. I believe we no longer make these calls (for good reason, http calls tied to updates was no good). So a replica will only enter recovery when it realizes it should via ZooKeeper communication.
{quote}

Ok ... so to re-iterate and make sure I'm following everything:

* OLD LIR:
** LIR was pushed to the replica via HTTP immediately after the replica returned a non-200 status
** was bad in real life because if the replica was having problems, it might not recognize/respond to LIR appropriately
** was good in tests because it meant that immediately after doing an index update, you could {{waitForRecoveriesToFinish}} and the replica would already be in recovery
* CURRENT LIR:
** LIR status is managed via flags in ZK (this is the "terms" concept, correct?)
** replicas monitor ZK to see if/when they need to go into LIR
** this is good in real life because it's less dependent on healthy network/http requests
** this is bad in tests because there is an inherent and hard-to-predict delay before the replica even realizes it needs to go into recovery
*** ie: {{waitForRecoveriesToFinish}} now seems completely useless?

does that cover it?

{quote}The system will be eventually consistent, but there is no promise it will be consistent even when all replicas are active. You must be willing to wait a short time for consistency and this test does not.
{quote}

Right ... I understand that ... the question at the heart of this jira is what a test can/should do to know "the system should now be consistent enough for me to make the assertions I want to make" (and how do we make that as easy as possible for tests to do).

I haven't dug into your patch that deep, but so far it seems really hackish? ...
sleep looping until all the replicas are live and the first 1000 docs from a {{*:*}} query to each replica match each other? If nothing else this creates a (slow) chicken-and-egg diagnosis problem in tests – did {{waitForConsistency}} eventually time out because the recovery is broken, or because the code I'm writing a test for (example: distributed atomic updates) is broken?

I'm not saying the {{checkConsistency}} logic is bad – if anything it seems like something that might be good to have in the tear down of every test – but I'm concerned that just trying to do a "wait for" on it doesn't really get to the heart of the problem of tests being able to know when the cluster *_should_* be consistent – it makes the test wait (or timeout) until it *_is_* consistent.

----

If recovery is driven by these flags in ZK, then why couldn't we re-write {{waitForRecoveriesToFinish}} to check those flags first (in addition to the {{Replica.State}}) to know if recovery is pending (or in progress)?

> Need reliable example (Test) of how to use TestInjection.failReplicaRequests
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-13189
>                 URL: https://issues.apache.org/jira/browse/SOLR-13189
>             Project: Solr
>          Issue Type: Sub-task
>   Security Level: Public(Default Security Level. Issues are Public)
>             Reporter: Hoss Man
>             Priority: Major
>         Attachments: SOLR-13189.patch, SOLR-13189.patch, SOLR-13189.patch
>
>
> We need a test that reliably demonstrates the usage of
> {{TestInjection.failReplicaRequests}} and shows what steps a test needs to
> take after issuing updates to reliably "pass" (finding all index updates that
> succeeded from the clients perspective) even in the event of an (injected)
> replica failure.
> As things stand now, it does not seem that any test using
> {{TestInjection.failReplicaRequests}} passes reliably -- *and it's not clear
> if this is due to poorly designed tests, or an indication of a bug in
> distributed updates / LIR*

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
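To make the proposal near the end of the comment concrete – re-writing {{waitForRecoveriesToFinish}} to consult the ZK "terms" flags in addition to {{Replica.State}} – here is a rough, hypothetical sketch. None of these names ({{ShardView}}, {{recoveryPending}}, {{waitForShardInSync}}) are real Solr APIs; the sketch just models one reading of the terms idea, namely that a replica is in sync only when its term has caught back up to the highest term in the shard:

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch, NOT real Solr code. It models the "terms" concept
// from the comment above: the leader bumps the term when a replica misses
// an update, and a replica is considered in sync only once its term equals
// the highest term in the shard.
public class TermsAwareWait {

  /** Snapshot of one shard: per-replica terms plus per-replica state. */
  public static class ShardView {
    final Map<String, Long> terms;    // replica name -> term (as stored in ZK)
    final Map<String, String> states; // replica name -> Replica.State name
    public ShardView(Map<String, Long> terms, Map<String, String> states) {
      this.terms = terms;
      this.states = states;
    }
  }

  /**
   * A replica has recovery pending (or in progress) if its term lags the
   * highest term in the shard, OR if its published state is not ACTIVE.
   * Checking the terms catches the window where the replica has not yet
   * noticed (via ZK) that it needs to recover.
   */
  public static boolean recoveryPending(ShardView shard, String replica) {
    long maxTerm = shard.terms.values().stream()
        .mapToLong(Long::longValue).max().orElse(0L);
    long replicaTerm = shard.terms.getOrDefault(replica, 0L);
    return replicaTerm < maxTerm
        || !"ACTIVE".equals(shard.states.getOrDefault(replica, "DOWN"));
  }

  /** waitForRecoveriesToFinish-style loop that also consults the terms. */
  public static boolean waitForShardInSync(Supplier<ShardView> fetch,
                                           long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      ShardView shard = fetch.get();
      boolean pending = shard.terms.keySet().stream()
          .anyMatch(r -> recoveryPending(shard, r));
      if (!pending) return true;
      Thread.sleep(100);
    }
    return false;
  }
}
```

The point of the sketch: a test that waits on this predicate would not return merely because every replica still *says* it is ACTIVE – it would also wait out the window where a replica's term lags the leader's and recovery hasn't started yet, which is exactly the gap that makes the current {{waitForRecoveriesToFinish}} unreliable here.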