For updates, the document will always get routed to the leader of the appropriate shard, no matter what server first receives the request.
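For example, an update can be posted to any node and SolrCloud will forward
it to the right shard leader; the host, collection and field names below are
only placeholders:

    curl 'http://any-node:8983/solr/collection1/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc1","title_s":"example"}]'

The same holds when requests arrive through a load balancer such as haproxy.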
-----Original Message-----
From: Martin de Vries [mailto:mar...@downnotifier.com]
Sent: Thursday, March 05, 2015 4:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Solrcloud Index corruption

Hi Erick,

Thank you for your detailed reply.

You say that in our case some docs didn't make it to the node, but that's
not really true: the docs can be found on the corrupted nodes when I
search on ID, and the docs are complete. The problem is that the docs do
not appear when I filter on certain fields (even though the fields are in
the doc and have the right value when I search on ID). So something seems
to be corrupt in the filter index.

We will try CheckIndex and hope it is able to identify the problematic
cores.

I understand there is no "master" in SolrCloud. In our case we use haproxy
as a load balancer for every request, so when indexing, each document is
sent to a different Solr server, immediately after the previous one. Maybe
SolrCloud is not able to handle that correctly?

Thanks,

Martin

Erick Erickson wrote on 05.03.2015 19:00:
> Wait up. There's no "master" index in SolrCloud. Raw documents are
> forwarded to each replica, indexed and put in the local tlog. If a
> replica falls too far out of sync (say you take it offline), then the
> entire index _can_ be replicated from the leader, and if the leader's
> index was incomplete then that might propagate the error.
>
> The practical consequence of this is that if _any_ replica has a
> complete index, you can recover. Before going there, though, the
> brute-force approach is to just re-index everything from scratch.
> That's likely easier, especially on indexes this size.
>
> Here's what I'd do.
>
> Assuming you have the Collections API calls for ADDREPLICA and
> DELETEREPLICA, then:
> 0> Identify the complete replicas. If you're lucky you have at least
> one for each shard.
> 1> Copy one good index from each shard somewhere, just to have a backup.
> 2> DELETEREPLICA on all the incomplete replicas.
> 2.5> I might shut down all the nodes at this point and check that all
> the cores I'd deleted were gone. If any remnants exist,
> 'rm -rf deleted_core_dir'.
> 3> ADDREPLICA to get the ones you removed back.
>
> ADDREPLICA should copy the entire index from the leader for each
> replica. As you do this the leadership will change, and after you've
> deleted all the incomplete replicas, one of the complete ones will be
> the leader and you should be OK.
>
> If you don't want to/can't use the Collections API, then:
> 0> Identify the complete replicas. If you're lucky you have at least
> one for each shard.
> 1> Shut 'em all down.
> 2> Copy the good index somewhere just to have a backup.
> 3> 'rm -rf data' for all the incomplete cores.
> 4> Bring up the good cores.
> 5> Bring up the cores that you deleted the data dirs from.
>
> What this should do is replicate the entire index from the leader. When
> you restart the good cores (step 4 above), they'll _become_ the leader.
>
> bq: Is it possible to make Solrcloud invulnerable for network problems
> I'm a little surprised that this is happening. It sounds like the
> network problems were such that some nodes were out of touch, but not
> long enough for Zookeeper to sense that they were down and put them
> into recovery. Not sure there's any way to secure against that.
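For reference, the DELETEREPLICA and ADDREPLICA calls in steps 2> and 3>
above would look something like the following; the host, collection, shard
and replica names are placeholders for your own:

    curl 'http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard1&replica=core_node3'
    curl 'http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1'

The replica names per shard can be found with action=CLUSTERSTATUS or in
the Cloud tab of the admin UI.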
> bq: Is it possible to see if a core is corrupt?
> There's "CheckIndex"; here's at least one link:
> http://java.dzone.com/news/lucene-and-solrs-checkindex
> What you're describing, though, is that docs just didn't make it to
> the node, _not_ that the index has unexpected bits, bad disk sectors
> and the like, so CheckIndex can't detect that. How would it know what
> _should_ have been in the index?
>
> bq: I noticed a difference in the "Gen" column on Overview -
> Replication. Does this mean there is something wrong?
> You cannot infer anything from this. In particular, the merging will
> be significantly different between a single full re-index and the
> state of segment merges in an incrementally built index.
>
> The admin UI screen is rooted in the pre-cloud days; the Master/Slave
> thing there is entirely misleading. In SolrCloud, since all the raw
> data is forwarded to all replicas, and any auto commits that happen
> may very well be slightly out of sync, the index size, number of
> segments, generations, and all that can pretty safely be ignored.
>
> Best,
> Erick
>
> On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries
> <mar...@downnotifier.com> wrote:
>
>> Hi Andrew,
>>
>> Even our master index is corrupt, so I'm afraid this won't help in
>> our case.
>>
>> Martin
>>
>> Andrew Butkus wrote on 05.03.2015 16:45:
>>
>>> Force a fetchindex on slave from master command:
>>> http://slave_host:port/solr/replication?command=fetchindex - from
>>> http://wiki.apache.org/solr/SolrReplication [1]
>>>
>>> The above command will download the whole index from master to
>>> slave. There are configuration options in Solr to make this problem
>>> happen less often (allowing it to recover from new documents added
>>> and only send the changes, with a wider gap) - but I can't remember
>>> what those were.
>>
>> Links:
>> ------
>> [1] http://wiki.apache.org/solr/SolrReplication
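For anyone who wants to try the CheckIndex tool mentioned above, a typical
invocation looks roughly like this; the Lucene jar version and index path
are placeholders for your own installation, and it is safest to run it on a
stopped core or on a copy of the index:

    java -cp lucene-core-4.10.3.jar org.apache.lucene.index.CheckIndex \
         /var/solr/data/collection1_shard1_replica1/data/index

Depending on your Solr version you may also need lucene-codecs on the
classpath. Adding -fix will drop any corrupt segments (and the documents in
them), so take a backup first.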