Hi Aaron, late to this party for sure, sorry. I feel your pain: this is happening for us too, and I've seen reports of it occurring across versions, but with very little information to go on I don't think any progress has been made. I don't actually think there's an issue raised for it; perhaps that should be a first step.
We call this problem a "Flappy Item", because the item appears and disappears in search results depending on whether the search hits the primary or the replica shard. It flaps back and forth.

The only way we've found to repair it is to rebuild the replica shard. You can disable _all_ replicas and then re-enable them, and the primary shard will be used as the source for the rebuilt copies, which fixes it. That's if you can live with the lack of redundancy for that length of time... (There's a rough sketch of the settings calls at the very bottom of this mail, below the quoted message.)

Alternatively, we have found that issuing a Move command to relocate the replica shard off its current host and onto another also causes ES to generate a new replica shard using the primary as the source, and that corrects the problem (second sketch below). A caveat we've found with this approach, at least on the old version of ES we're sadly still using (0.19... hmm), is that after the move the cluster will likely want to rebalance, and the shard allocation after rebalancing can from time to time put the replica back where it was. ES on that original node then goes "Oh look, here's the same shard I had earlier, let's use that", which means you're back to square one. You can force _all_ replica shards to move by coming up with a Move command that shuffles them around, and that definitely does work, but it obviously takes longer on large clusters.

In terms of tooling around this, I offer you these:

Scrutineer - https://github.com/Aconex/scrutineer - can detect differences between your source of truth (a db?) and your index (ES). This does pick up the case where the replica is reporting an item that should have been deleted.

Flappy Item Detector - https://github.com/Aconex/es-flappyitem-detector - given a set of suspect IDs, can check the primary vs the replica to confirm or deny that each is one of these cases (third sketch below). There is also support for issuing basic Move commands, with some simple logic, to attempt to rebuild that replica.

Hope that helps.

cheers,
Paul Smith

On 8 August 2014 01:14, aaron <atdi...@gmail.com> wrote:

> I've noticed on a few of my clusters that some shard replicas will be
> perpetually inconsistent w/ other shards. Even when all of my writes are
> successful and use write_consistency = ALL and replication = SYNC.
>
> A GET by id will return 404/missing for one replica but return the
> document for the other two replicas. Even after refresh, the shard is
> never "repaired".
>
> Using ES 0.90.7.
>
> Is this a known defect? Is there a means to detect, prevent, or at least
> detect & repair when this occurs?
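Sketch 1: a minimal sketch of the "drop replicas and re-enable" repair via the index settings API. The host and the index name "myindex" are placeholders, and the replica count of 1 is just an example; remember you have zero redundancy between the two calls.

    import json
    import urllib.request

    def put_settings(index, body):
        # PUT a settings body to /<index>/_settings
        req = urllib.request.Request(
            "http://localhost:9200/%s/_settings" % index,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        return urllib.request.urlopen(req).read()

    # Drop all replicas -- ES deletes the (possibly stale) replica copies.
    put_settings("myindex", {"index": {"number_of_replicas": 0}})

    # Re-enable them -- ES recovers brand-new replicas from the primaries.
    put_settings("myindex", {"index": {"number_of_replicas": 1}})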
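Sketch 2: the single-shard Move repair via the cluster reroute API (your 0.90.7 has it; I believe our 0.19 vintage predates the commands form, so check your version's docs). The index name, shard number, and node names are all placeholders you'd fill in from the cluster state.

    import json
    import urllib.request

    # Move replica shard 3 of "myindex" from nodeA to nodeB; ES builds the
    # new copy on nodeB by recovering from the primary shard.
    command = {
        "commands": [
            {"move": {
                "index": "myindex",
                "shard": 3,
                "from_node": "nodeA",
                "to_node": "nodeB",
            }}
        ]
    }
    req = urllib.request.Request(
        "http://localhost:9200/_cluster/reroute",
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    print(urllib.request.urlopen(req).read().decode("utf-8"))

Bear in mind the caveat above: after the move completes, rebalancing may shuffle shards again, so watch where the replica ends up.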
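Sketch 3: roughly the primary-vs-replica comparison the Flappy Item Detector automates: run the same id query pinned to the primary, then pinned to the node holding the replica, and compare hit counts. The host, index, doc id, and node id are placeholders (node ids come from the cluster state), and the preference values are the ones I remember from this era of ES, so verify against your version.

    import json
    import urllib.request

    def hit_count(index, doc_id, preference):
        # Search for the doc by id, pinned to a specific shard copy via
        # the "preference" parameter.
        url = ("http://localhost:9200/%s/_search?q=_id:%s&preference=%s"
               % (index, doc_id, preference))
        resp = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
        return resp["hits"]["total"]

    doc_id = "12345"
    on_primary = hit_count("myindex", doc_id, "_primary")
    on_replica = hit_count("myindex", doc_id, "_only_node:replicaNodeId")
    if on_primary != on_replica:
        print("flappy item %s: primary=%d, replica=%d"
              % (doc_id, on_primary, on_replica))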