[ https://issues.apache.org/jira/browse/SOLR-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047261#comment-16047261 ]
Erick Erickson commented on SOLR-10873:
---------------------------------------

Good idea! Plus it seems lightweight as well.

> Explore a utility for periodically checking the document counts for replicas
> of a shard
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-10873
>                 URL: https://issues.apache.org/jira/browse/SOLR-10873
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Erick Erickson
>
> We've had several situations "in the field" and on the users' list where the
> number of documents on different replicas of the same shard differs. I've
> also seen situations where the numbers are wildly different (two orders of
> magnitude). I can force this situation by, say, taking down nodes, adding
> replicas that become the leader, then starting the nodes back up. But it
> doesn't matter whether the discrepancy is a result of "pilot error" or a
> problem with the code; in either case it would be useful to flag it.
>
> Straw-man proposal:
> We create a processor (modeled on DocExpirationUpdateProcessorFactory
> perhaps?) that periodically wakes up and checks that each replica in the
> given shard has the same document count (and perhaps other checks TBD?). Send
> some kind of notification if a problem is detected.
>
> Issues:
> 1> This will require some way to deal with the differing commit times.
> 1a> If we require a timestamp on each document, we could check the config
> file to see the autocommit interval and, say, check NOW-(2 x openSearcher
> interval). In that case the config would just require the field to use to be
> specified.
> 1b> We could require that part of the configuration is a query to use to
> check document counts. I kind of like this one.
> 2> How to let the admins know a discrepancy was found? E-mail? ERROR level
> log message? Other?
> 3> How does this fit into the autoscaling initiative? This is a "monitor the
> system and do something" item. If we go forward with this we should do it
> with an eye toward fitting it into that framework.
> 3a> Is there anything we can do to auto-correct this situation?
> Auto-correction could be tricky. Heuristics like "make the replica with the
> most documents the leader and force full index replication on all the
> replicas that don't agree" seem dangerous.
> 4> How to keep the impact minimal? The simple approach would be for each
> replica to check all other replicas in the shard. So say there are 10
> replicas on a single shard; that would be 90 queries. It would suffice for
> just one of those to check the other 9, not have all 10 check the other nine.
> Maybe restrict the checker to be the leader? Or otherwise just make it one
> replica/shard that does the checking?
> 5> It's probably useful to add a collections API call to fire this off
> manually. Or maybe as part of CHECKSTATUS?
>
> What do people think?
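
For illustration only, the check described in 1a and 4 of the quoted proposal could be sketched roughly as below. This is not anything that exists in Solr: the function names, the `fetch_count` callable, and the replica identifiers are all hypothetical, and the real thing would presumably live in Java inside Solr. The only Solr features assumed are date math in a range query (`NOW-30SECONDS`), `rows=0`, and `distrib=false` to keep each query on a single replica.

```python
from typing import Callable, Dict, List


def build_count_params(timestamp_field: str, open_searcher_interval_s: int) -> Dict[str, str]:
    """Issue 1a: only count documents older than 2 x the openSearcher interval,
    so replicas that have not yet opened a new searcher are not flagged."""
    cutoff_s = 2 * open_searcher_interval_s
    return {
        "q": f"{timestamp_field}:[* TO NOW-{cutoff_s}SECONDS]",
        "rows": "0",          # only the hit count is needed, not the documents
        "distrib": "false",   # ask this replica only, do not fan out
    }


def check_shard(
    local_count: int,
    other_replicas: List[str],
    params: Dict[str, str],
    fetch_count: Callable[[str, Dict[str, str]], int],  # hypothetical: returns numFound for one replica
) -> Dict[str, int]:
    """Run by a single checker (e.g. the leader). Returns the replicas whose
    count differs from the checker's own count; empty dict means all agree."""
    mismatches = {}
    for replica in other_replicas:
        count = fetch_count(replica, params)
        if count != local_count:
            mismatches[replica] = count
    return mismatches


def queries_per_check(replicas: int, single_checker: bool) -> int:
    """Issue 4: every replica checking every other one costs n * (n - 1)
    queries; one designated checker costs n - 1."""
    return replicas - 1 if single_checker else replicas * (replicas - 1)
```

With 10 replicas, `queries_per_check(10, False)` gives the 90 queries mentioned in issue 4 and `queries_per_check(10, True)` gives 9. How a non-empty result from `check_shard` gets reported (issue 2) is left open here, as it is in the proposal.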