[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869916#comment-13869916 ]

Timothy Potter commented on SOLR-4260:
--------------------------------------

Makes sense about not waiting because of the penalty, now that I've had a chance 
to get into the details of that code.

I spent a lot of time on Friday and over the weekend trying to track down the 
docs getting dropped, but unfortunately I have not been able to find the source 
of the issue yet. I'm fairly certain the problem happens before docs get 
submitted to CUSS, meaning that the lost docs never seem to hit the queue in 
ConcurrentUpdateSolrServer. My original thinking was that, given the complex 
nature of CUSS, there might be some sort of race condition, but after adding a 
log of what hits the queue, it appears the documents that get lost never reach 
the queue at all. Not to mention that the actual use of CUSS is mostly 
single-threaded, because StreamingSolrServers constructs them with a 
threadCount of 1.
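
For context, here's a minimal sketch of how CUSS behaves when driven with a 
single runner thread; the replica URL, queue size, and doc fields are made up 
for illustration, and only the public SolrJ API is used:

{code:java}
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussSketch {
  public static void main(String[] args) throws Exception {
    // threadCount of 1 means a single runner drains the internal queue,
    // so the consumer side of CUSS is effectively single-threaded here.
    ConcurrentUpdateSolrServer cuss = new ConcurrentUpdateSolrServer(
        "http://hostB:8983/solr/foo_shard1_replica2", /* queueSize */ 10, /* threadCount */ 1);

    for (int i = 1; i <= 3; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      cuss.add(doc); // enqueues the update; the runner thread forwards it to the replica
    }

    cuss.blockUntilFinished(); // wait for the queue to drain
    cuss.shutdown();
  }
}
{code}

The point of the sketch is only that whatever reaches cuss.add() does get 
drained and sent; the docs I'm losing never make it that far.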

As a side note, one thing I noticed is that direct updates don't necessarily 
hit the correct core initially when a Solr node hosts more than one shard of a 
collection. In other words, if host X has shard1 and shard3 of collection foo, 
then some update requests hit shard1 on host X when they should go to shard3 on 
the same host; shard1 correctly forwards them on, but it's still an extra hop. 
Of course that is probably not a big deal in production, as it would be rare to 
host multiple shards of the same collection on the same Solr host unless you 
are over-sharding.
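
For anyone wanting to double-check which shard a given id should land on, this 
is the rough sanity check I use; the zkHost, collection name, and doc id are 
made up, it assumes plain ids (no "shardKey!" prefix) under the default 
compositeId router, and it leans on internal-ish classes (Hash, DocRouter.Range), 
so treat it as a sketch rather than a supported API:

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.util.Hash;

public class ShardForIdSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrServer cloud = new CloudSolrServer("localhost:2181"); // hypothetical zkHost
    cloud.connect();

    String collection = "foo";
    String id = "doc-3";

    // Default compositeId routing of a plain id hashes the whole id with murmur3, seed 0.
    int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);

    for (Slice slice : cloud.getZkStateReader().getClusterState().getSlices(collection)) {
      if (slice.getRange() != null && slice.getRange().includes(hash)) {
        System.out.println(id + " should land on " + slice.getName());
      }
    }
    cloud.shutdown();
  }
}
{code}

Comparing that against the core a direct update actually hits is how I spotted 
the extra hop described above.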

In terms of this issue, here's what I'm seeing:

- Assume a SolrCloud environment with shard1 having replicas on hosts A and B; 
  A is the current leader.
- The client sends a direct update request to shard1 on host A containing 3 
  docs (1, 2, 3), for example.
- The batch from the client gets broken up into individual docs during request 
  parsing.
- Docs 1, 2, 3 get indexed on host A (the leader).
- Docs 1 and 2 get queued into CUSS and sent on to the replica on host B 
  (sometimes in the same request, sometimes in separate requests).
- Doc 3 never makes it and, from what I can tell, never hits the queue (a quick 
  way to verify the per-core counts is sketched below).
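
Here's a minimal sketch of that verification, assuming made-up core URLs for 
the shard1 leader on host A and the replica on host B; querying each core 
directly with distrib=false shows whether doc 3 actually landed on the replica:

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class CompareCoreCounts {
  public static void main(String[] args) throws Exception {
    // Hypothetical core URLs for the shard1 leader (host A) and replica (host B).
    String[] coreUrls = {
        "http://hostA:8983/solr/foo_shard1_replica1",
        "http://hostB:8983/solr/foo_shard1_replica2"
    };

    SolrQuery q = new SolrQuery("*:*");
    q.set("distrib", "false"); // query only the local core, no fan-out

    for (String url : coreUrls) {
      HttpSolrServer core = new HttpSolrServer(url);
      long numFound = core.query(q).getResults().getNumFound();
      System.out.println(url + " numFound=" + numFound);
      core.shutdown();
    }
  }
}
{code}

After a commit on both cores, the leader consistently reports one more doc than 
the replica in the scenario above.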

This may be anecdotal, but from what I can tell it's always docs at the end of 
a batch and not in the middle; I haven't seen a case where 1 and 3 make it and 
2 does not ... maybe useful, maybe not. The only other thing I'll mention is 
that it does seem timing / race-condition related, as it's almost impossible to 
reproduce on my Mac when running 2 shards across 2 nodes, but much easier to 
trigger if I ramp up to, say, 8 shards on 2 nodes, i.e. the busier my CPU is, 
the easier it is to see docs getting dropped.



> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and replica deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention; there were small IDF differences for exactly the same record, 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return a different 
> numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.


