[ 
https://issues.apache.org/jira/browse/SOLR-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004472#comment-14004472
 ] 

Shalin Shekhar Mangar commented on SOLR-5309:
---------------------------------------------

I am looking at these failures again today. Yeah, it's been that busy around 
here :(

I implemented a RateLimitedDirectoryFactory for Solr with a very small limit 
and forced ShardSplitTest to use it always. This helped reproduce the issue for 
me. I have finally managed to track down the root cause. It always perplexed me 
that the difference between expected and actual doc counts was almost always 1.

Whenever we add/delete documents during shard splitting, we synchronously 
forward the request to the appropriate sub-shard. For add requests, a single 
sub-shard is selected, but for delete-by-id requests we don't select a single 
sub-shard. Instead, we forward the delete-by-id to all sub-shards. This works 
out fine and doesn't cause any damage in practice because the id exists on 
only one shard. However, when one sub-shard (the right one) accepts the delete 
and the other rejects it (maybe because it became active in the meantime), the 
client (ShardSplitTest) gets an error back and assumes that the delete did not 
succeed, whereas it actually succeeded on the right sub-shard.
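
To make the failure mode concrete, here is a toy model in plain Java. It is 
not Solr code: the SubShard class, the hash() stand-in, and both delete 
methods are hypothetical names made up for this sketch. It only illustrates 
why broadcasting a delete-by-id can report an error even though the document 
is gone, and how routing to the single owning sub-shard avoids that.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only; every name here is hypothetical and not part of Solr. */
class SubShardRoutingSketch {

    /** A sub-shard owning the hash range [min, max]; it can be told to reject updates. */
    static class SubShard {
        final String name;
        final int min;
        final int max;
        final Set<String> docs = new HashSet<>();
        boolean rejecting = false;

        SubShard(String name, int min, int max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        boolean owns(String id) {
            int h = hash(id);
            return h >= min && h <= max;
        }

        /** Returns false when the request is rejected; otherwise removes the id if present. */
        boolean delete(String id) {
            if (rejecting) {
                return false;
            }
            docs.remove(id);
            return true;
        }
    }

    /** Stand-in for the real document hash; maps an id into 0..99. */
    static int hash(String id) {
        return Math.floorMod(id.hashCode(), 100);
    }

    /** Broadcast: forward to every sub-shard and report failure if any one rejects. */
    static boolean deleteBroadcast(List<SubShard> subShards, String id) {
        boolean ok = true;
        for (SubShard s : subShards) {
            ok &= s.delete(id); // non-short-circuit: every sub-shard is still called
        }
        return ok;
    }

    /** Routed: forward only to the sub-shard whose hash range covers the id. */
    static boolean deleteRouted(List<SubShard> subShards, String id) {
        for (SubShard s : subShards) {
            if (s.owns(id)) {
                return s.delete(id);
            }
        }
        return false; // no sub-shard covers this id
    }

    public static void main(String[] args) {
        SubShard a = new SubShard("shard1_0", 0, 49);
        SubShard b = new SubShard("shard1_1", 50, 99);
        List<SubShard> subShards = new ArrayList<>();
        subShards.add(a);
        subShards.add(b);

        SubShard owner = a.owns("doc1") ? a : b;
        SubShard other = (owner == a) ? b : a;
        owner.docs.add("doc1");
        other.rejecting = true; // e.g. it changed state in the meantime

        System.out.println("broadcast reported success: " + deleteBroadcast(subShards, "doc1"));
        System.out.println("doc still present: " + owner.docs.contains("doc1"));

        owner.docs.add("doc1");
        System.out.println("routed reported success: " + deleteRouted(subShards, "doc1"));
    }
}
```

In this model the broadcast delete removes the document yet still reports a 
failure, which is the mismatch a client that only counts successes would see.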

We always advise our users to retry update operations upon failure, and they 
would be fine if they follow this advice during shard splitting as well. 
ShardSplitTest unfortunately doesn't follow that advice: it just counts 
successes/failures and ends up with an inconsistent state.
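
A minimal sketch of the retry advice, again in plain Java with hypothetical 
names (withRetries and the BooleanSupplier standing in for an update call are 
assumptions for illustration, not SolrJ API). It relies on the update being 
safe to repeat, which holds for a delete-by-id since deleting an already 
deleted id is a no-op.

```java
import java.util.function.BooleanSupplier;

/** Toy retry helper; the names are hypothetical and not part of Solr or SolrJ. */
class RetryingUpdateSketch {

    /**
     * Runs the update until it reports success or maxAttempts is reached.
     * Returns true if some attempt succeeded, false otherwise.
     */
    static boolean withRetries(BooleanSupplier update, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (update.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated update that is rejected twice before it is accepted.
        boolean ok = withRetries(() -> ++calls[0] >= 3, 5);
        System.out.println("succeeded: " + ok + " after " + calls[0] + " attempts");
    }
}
```

A client that counts a delete as successful only once a retry goes through 
would not be thrown off by a transient rejection from one sub-shard.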

I'll start by fixing delete-by-id to route requests to the correct (single) 
sub-shard and enabling this test again.

> Investigate ShardSplitTest failures
> -----------------------------------
>
>                 Key: SOLR-5309
>                 URL: https://issues.apache.org/jira/browse/SOLR-5309
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Blocker
>
> Investigate why ShardSplitTest is failing sporadically.
> Some recent failures:
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3328/
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/7760/
> http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/861/



--
This message was sent by Atlassian JIRA
(v6.2#6252)
