[ 
https://issues.apache.org/jira/browse/SOLR-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047261#comment-16047261
 ] 

Erick Erickson commented on SOLR-10873:
---------------------------------------

Good idea! Plus it seems lightweight as well....

> Explore a utility for periodically checking the document counts for replicas 
> of a shard
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-10873
>                 URL: https://issues.apache.org/jira/browse/SOLR-10873
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>
> We've had several situations "in the field" and on the users' list where the 
> number of documents on different replicas of the same shard differs. I've also 
> seen situations where the numbers are wildly different (two orders of 
> magnitude). I can force this situation by, say, taking down nodes, adding 
> replicas that become the leader, then starting the nodes back up. But it 
> doesn't matter whether the discrepancy is a result of "pilot error" or a 
> problem with the code; in either case it would be useful to flag it.
> Straw-man proposal:
> We create a processor (modeled on DocExpirationUpdateProcessorFactory 
> perhaps?) that periodically wakes up and checks that each replica in the 
> given shard has the same document count (and perhaps other checks TBD?). It 
> would send some kind of notification if a problem is detected.
> Issues:
> 1> This will require some way to deal with the differing commit times. 
> 1a> If we require a timestamp on each document we could check the config file 
> to see the autocommit interval and, say, count only documents older than 
> NOW-(2 x openSearcher interval). In that case the config would just need to 
> specify the field to use.
> 1b> we could require that part of the configuration is a query to use to 
> check document counts. I kind of like this one.
> 2> How to let the admins know a discrepancy was found? e-mail? ERROR level 
> log message? Other?
> 3> How does this fit into the autoscaling initiative? This is a "monitor the 
> system and do something" item. If we go forward with this we should do it 
> with an eye toward fitting it in that framework.
> 3a> Is there anything we can do to auto-correct this situation? 
> Auto-correction could be tricky. Heuristics like "make the replica with the 
> most documents the leader and force full index replication on all the 
> replicas that don't agree" seem dangerous. 
> 4> How to keep the impact minimal? The simple approach would be for each 
> replica to check all other replicas in the shard. So say there are 10 
> replicas on a single shard, that would be 90 queries. It would suffice for 
> just one of those to check the other 9, rather than having all 10 check the 
> other 9. Maybe restrict the checker to be the leader? Or otherwise just make 
> it one replica/shard that does the checking?
> 5> It's probably useful to add a collections API call to fire this off 
> manually. Or maybe as part of CHECKSTATUS?
> What do people think?
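
A rough sketch of what the check in the straw-man proposal might look like, for illustration only. It is plain Python, not Solr code, and does not talk to a cluster. The function names, the "timestamp" field name, and the choice to express the interval in seconds are all assumptions made here; the only Solr-specific pieces are the standard query parameters rows=0 and distrib=false and Solr date math (NOW-30SECONDS).

```python
def build_count_params(timestamp_field, open_searcher_interval_s):
    """Per-replica count query for issue 1a (hypothetical helper).

    Only documents older than 2 x the openSearcher interval are counted,
    so replicas that have not yet opened a new searcher are not flagged.
    """
    cutoff_s = 2 * open_searcher_interval_s
    return {
        "q": "*:*",
        "rows": "0",          # only numFound is needed, no documents
        "distrib": "false",   # ask this core only, not the whole collection
        "fq": f"{timestamp_field}:[* TO NOW-{cutoff_s}SECONDS]",
    }


def replicas_disagree(counts):
    """True if the replicas of one shard report different document counts.

    `counts` maps a replica name to the numFound it returned.
    """
    return len(set(counts.values())) > 1


def queries_per_check(replica_count, single_checker):
    """Number of count queries for one round of checking a shard (issue 4)."""
    if single_checker:
        # One replica (e.g. the leader) queries each of the others.
        return replica_count - 1
    # Every replica queries every other replica.
    return replica_count * (replica_count - 1)
```

With 10 replicas, queries_per_check(10, single_checker=False) gives the 90 queries mentioned in issue 4, and queries_per_check(10, single_checker=True) gives 9. For option 1b, build_count_params would simply be replaced by whatever query the configuration supplies.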


