[ 
https://issues.apache.org/jira/browse/SOLR-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758564#comment-16758564
 ] 

Hoss Man commented on SOLR-13189:
---------------------------------

{quote}In older versions these tests might have worked because before the 
request returns to the client, the leader would have called to the replica and 
told it to go into recovery. I believe we no longer make these calls (for good 
reason, http calls tied to updates was no good). So a replica will only enter 
recovery when it realizes it should via ZooKeeper communication.
{quote}
Ok ... so to re-iterate and make sure i'm following everything:
 * OLD LIR:
 ** LIR was pushed to replica via HTTP immediately after replica returned 
non-200 status
 ** was bad in real life because if replica was having problems, it might not 
recognize/respond to LIR appropriately
 ** was good in tests because it meant immediately after doing an index update, 
you could {{waitForRecoveriesToFinish}} and the replica would already be in 
recovery
 * CURRENT LIR:
 ** LIR status is managed via flags in ZK (this is the "terms" concept correct?)
 ** replicas monitor ZK to see if/when they need to go into LIR
 ** this is good in real life because it's less dependent on healthy 
network/http requests
 ** this is bad in tests because there is an inherent and hard to predict delay 
before the replica even realizes it needs to go into recovery
 *** ie: {{waitForRecoveriesToFinish}} now seems completely useless?

does that cover it?
{quote}The system will be eventually consistent, but there is no promise it 
will be consistent even when all replicas are active. You must be willing to 
wait a short time for consistency and this test does not.
{quote}
Right ... i understand that ... the question at the heart of this jira is what 
a test can/should do to know "the system should now be consistent enough for me 
to make the assertions I want to make" (and how do we make that as easy as 
possible for tests to do).

I haven't dug into your patch that deep, but so far it seems really hackish? 
... sleep looping until all the replicas are live and the first 1000 docs from a 
{{*:*}} query to each replica match each other?

If nothing else this creates a (slow) chicken and egg diagnosis problem in 
tests – did {{waitForConsistency}} eventually time out because the recovery is 
broken, or because the code i'm writing a test for (example: distributed 
atomic updates) is broken?
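
To make the diagnosis concern concrete, here is a minimal sketch of that kind of sleep loop. It is NOT the code from the patch: {{ConsistencyWaitSketch}}, {{allMatch}}, the {{fetch}} supplier and the attempt/sleep parameters are all hypothetical names made up for illustration, and the supplier stands in for "run {{*:*}} against each replica and collect the first N doc ids".

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a "poll until replicas agree" helper; not Solr's API. */
final class ConsistencyWaitSketch {

  /** True when every replica returned the same doc ids in the same order. */
  static boolean allMatch(List<List<String>> perReplicaIds) {
    for (List<String> ids : perReplicaIds) {
      if (!ids.equals(perReplicaIds.get(0))) {
        return false;
      }
    }
    return true;
  }

  /**
   * Polls the (assumed) per-replica query results until they agree or the
   * attempts run out. Returns false on timeout.
   */
  static boolean waitForConsistency(
      Supplier<List<List<String>>> fetch, int maxAttempts, long sleepMs)
      throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (allMatch(fetch.get())) {
        return true;
      }
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMs);
      }
    }
    // Timed out. Nothing here says whether recovery never ran or whether the
    // code under test produced divergent replicas: the caller only sees "false".
    return false;
  }
}
```

The {{false}} return is the whole problem in miniature: the loop observes that the replicas differ, but carries no information about why.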

I'm not saying the {{checkConsistency}} logic is bad – if anything it seems 
like something that might be good to have in the tear down of every test – but 
I'm concerned that just trying to do a "wait for" on it doesn't really get to 
the heart of the problem of tests being able to know when the cluster 
*_should_* be consistent – it makes the test wait (or time out) until it *_is_* 
consistent.
----
If recovery is driven by these flags in ZK, then why couldn't we re-write 
{{waitForRecoveriesToFinish}} to check those flags first (in addition to the 
{{Replica.State}}) to know if recovery is pending (or in progress)?
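
A rough sketch of what that could look like, under heavy assumptions: every type below ({{RecoveryWaitSketch}}, {{ReplicaView}}, the local {{State}} enum, the {{recoveryPendingInZk}} field) is a hypothetical stand-in, not Solr's real API, and how the "pending" flag would actually be derived from ZK is left to whoever supplies the snapshot.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch: none of these types are Solr's real classes. */
final class RecoveryWaitSketch {

  /** Stand-in for Replica.State, reduced to what this sketch needs. */
  enum State { ACTIVE, RECOVERING, DOWN }

  /** Assumed snapshot of one replica: published state plus the ZK-side flag. */
  static final class ReplicaView {
    final String name;
    final State state;
    final boolean recoveryPendingInZk;

    ReplicaView(String name, State state, boolean recoveryPendingInZk) {
      this.name = name;
      this.state = state;
      this.recoveryPendingInZk = recoveryPendingInZk;
    }

    @Override
    public String toString() {
      return name + "[" + state + ", pending=" + recoveryPendingInZk + "]";
    }
  }

  /** A replica counts as settled only if it is ACTIVE and ZK shows nothing pending. */
  static boolean isSettled(ReplicaView r) {
    return r.state == State.ACTIVE && !r.recoveryPendingInZk;
  }

  /** Polls until every replica is settled, or throws once the timeout elapses. */
  static void waitForRecoveriesToFinish(
      Supplier<List<ReplicaView>> snapshot, long timeoutMs, long pollMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      List<ReplicaView> views = snapshot.get();
      if (views.stream().allMatch(RecoveryWaitSketch::isSettled)) {
        return;
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("replicas not settled: " + views);
      }
      Thread.sleep(pollMs);
    }
  }
}
```

The point of checking the flag first is that a replica which still reports {{ACTIVE}} but has recovery pending in ZK no longer looks "done" to the test, which is exactly the window where the current wait returns too early.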

> Need reliable example (Test) of how to use TestInjection.failReplicaRequests
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-13189
>                 URL: https://issues.apache.org/jira/browse/SOLR-13189
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13189.patch, SOLR-13189.patch, SOLR-13189.patch
>
>
> We need a test that reliably demonstrates the usage of 
> {{TestInjection.failReplicaRequests}} and shows what steps a test needs to 
> take after issuing updates to reliably "pass" (finding all index updates that 
> succeeded from the client's perspective) even in the event of an (injected) 
> replica failure.
> As things stand now, it does not seem that any test using 
> {{TestInjection.failReplicaRequests}} passes reliably -- *and it's not clear 
> if this is due to poorly designed tests, or an indication of a bug in 
> distributed updates / LIR*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

