[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824270#comment-13824270 ]

Jessica Cheng commented on SOLR-4260:
-------------------------------------

Looking at the code a bit, I realized that the scenario I described can in fact 
happen if the Old Leader dies (or somehow becomes unreachable, for example due 
to tripping the kernel SYN flood detection, as ours did), because it looks like 
the sync run during runLeaderProcess() is called with 
cantReachIsSuccess=true. Since the New Leader can't reach the Old Leader, it 
won't find out about 4 and 5 (assuming no other replicas have them either), but 
will still successfully "sync" and become the new leader. This can be remedied 
if the "// TODO: optionally fail if n replicas are not reached..." in 
DistributedUpdateProcessor.doFinish() is implemented, so that at least one 
other replica must have 4 and 5 before the request is ack'd to the user, but of 
course if the New Leader can't reach that other replica either, it isn't much 
help.
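
To illustrate the idea, here is a rough sketch of what implementing that TODO 
could look like. This is not the actual Solr code: the class, method, and field 
names below are hypothetical, and it only shows the shape of a "require at 
least n replica acks before acknowledging the write" check.

import java.util.List;

/** Hypothetical sketch only; not Solr's actual DistributedUpdateProcessor. */
class ReplicaAck {
    final String replicaName;  // hypothetical: name of the replica core
    final boolean succeeded;   // whether the forwarded update succeeded there

    ReplicaAck(String replicaName, boolean succeeded) {
        this.replicaName = replicaName;
        this.succeeded = succeeded;
    }
}

class MinReplicaFinish {

    /**
     * Sketch of the "optionally fail if n replicas are not reached" idea:
     * called after the leader has forwarded an update to its replicas; throws
     * (so the request is NOT ack'd to the user) when fewer than minAckReplicas
     * replicas confirmed the update.
     */
    static void doFinish(List<ReplicaAck> acks, int minAckReplicas) {
        long confirmed = acks.stream().filter(a -> a.succeeded).count();
        if (confirmed < minAckReplicas) {
            throw new RuntimeException(
                "update reached only " + confirmed + " replica(s), "
                + minAckReplicas + " required; not acknowledging the write");
        }
        // Enough replicas have the update (e.g. 4 and 5), so even if the
        // leader dies before a new election, some surviving replica has them.
    }

    public static void main(String[] args) {
        // Example: one replica confirmed, one unreachable, minimum of 2 required.
        List<ReplicaAck> acks = List.of(
            new ReplicaAck("shard1_replica2", true),
            new ReplicaAck("shard1_replica3", false));
        try {
            doFinish(acks, 2);
        } catch (RuntimeException e) {
            System.out.println("Write rejected: " + e.getMessage());
        }
    }
}

With a check like that, an acknowledged update is guaranteed to exist on at 
least one other replica, so losing only the leader can't lose it; the cost is 
that writes start failing when replicas are unreachable, which is exactly the 
availability-for-consistency trade-off discussed below.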

I feel like, in general, the code may be trying too hard to find a new leader 
to take over, thereby compromising data consistency. This is probably the right 
thing to do for many, if not most, search solutions. However, if Solr is indeed 
moving toward being a possible NoSQL solution, or toward use cases where 
reindexing the entire corpus is extremely expensive, then maybe a more 
consistent mode can be implemented where the user can choose to trade 
availability for consistency.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 5.0
>
>         Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer, we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and slave deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing it to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return a different 
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor 
> of two, and we frequently reindex using a fresh build from trunk. I've not 
> seen this issue for quite some time until a few days ago.


