For updates, the document will always get routed to the leader of the appropriate shard, no matter what server first receives the request.
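For example, an update can be posted to any node and SolrCloud will forward
it to the right shard leader; the host, collection and field names below are
only placeholders:

    curl 'http://any-node:8983/solr/collection1/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc1","title_s":"example"}]'

The same holds when requests arrive through a load balancer such as haproxy.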
-----Original Message-----
From: Martin de Vries [mailto:mar...@downnotifier.com]
Sent: Thursday, March 05, 2015 4:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Solrcloud Index corruption

Hi Erick,

Thank you for your detailed reply.

You say that in our case some docs didn't make it to the node, but that's
not really true: the docs can be found on the corrupted nodes when I
search on ID, and the docs are complete. The problem is that the docs do
not appear when I filter on certain fields (even though the fields are in
the doc and have the right value when I search on ID). So something seems
to be corrupt in the filter index.

We will try CheckIndex and hope it is able to identify the problematic
cores.

I understand there is no "master" in SolrCloud. In our case we use haproxy
as a load balancer for every request, so when indexing, each document is
sent to a different Solr server, immediately after the previous one. Maybe
SolrCloud is not able to handle that correctly?

Thanks,

Martin

Erick Erickson wrote on 05.03.2015 19:00:
> Wait up. There's no "master" index in SolrCloud. Raw documents are
> forwarded to each replica, indexed and put in the local tlog. If a
> replica falls too far out of sync (say you take it offline), then the
> entire index _can_ be replicated from the leader, and if the leader's
> index was incomplete then that might propagate the error.
>
> The practical consequence of this is that if _any_ replica has a
> complete index, you can recover. Before going there, though, the
> brute-force approach is to just re-index everything from scratch.
> That's likely easier, especially on indexes this size.
>
> Here's what I'd do.
>
> Assuming you have the Collections API calls for ADDREPLICA and
> DELETEREPLICA, then:
> 0> Identify the complete replicas. If you're lucky you have at least
> one for each shard.
> 1> Copy one good index from each shard somewhere, just to have a backup.
> 2> DELETEREPLICA on all the incomplete replicas.
> 2.5> I might shut down all the nodes at this point and check that all
> the cores I'd deleted were gone. If any remnants exist,
> 'rm -rf deleted_core_dir'.
> 3> ADDREPLICA to get the ones you removed back.
>
> ADDREPLICA should copy the entire index from the leader for each
> replica. As you do this the leadership will change, and after you've
> deleted all the incomplete replicas, one of the complete ones will be
> the leader and you should be OK.
>
> If you don't want to/can't use the Collections API, then:
> 0> Identify the complete replicas. If you're lucky you have at least
> one for each shard.
> 1> Shut 'em all down.
> 2> Copy the good index somewhere just to have a backup.
> 3> 'rm -rf data' for all the incomplete cores.
> 4> Bring up the good cores.
> 5> Bring up the cores that you deleted the data dirs from.
>
> What this should do is replicate the entire index from the leader. When
> you restart the good cores (step 4 above), they'll _become_ the leader.
>
> bq: Is it possible to make Solrcloud invulnerable for network problems
> I'm a little surprised that this is happening. It sounds like the
> network problems were such that some nodes were out of touch, but not
> long enough for Zookeeper to sense that they were down and put them
> into recovery. Not sure there's any way to secure against that.
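For reference, the DELETEREPLICA and ADDREPLICA calls in steps 2> and 3>
above would look something like the following; the host, collection, shard
and replica names are placeholders for your own:

    curl 'http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard1&replica=core_node3'
    curl 'http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1'

The replica names per shard can be found with action=CLUSTERSTATUS or in
the Cloud tab of the admin UI.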
> bq: Is it possible to see if a core is corrupt?
> There's "CheckIndex"; here's at least one link:
> http://java.dzone.com/news/lucene-and-solrs-checkindex
> What you're describing, though, is that docs just didn't make it to
> the node, _not_ that the index has unexpected bits, bad disk sectors
> and the like, so CheckIndex can't detect that. How would it know what
> _should_ have been in the index?
>
> bq: I noticed a difference in the "Gen" column on Overview -
> Replication. Does this mean there is something wrong?
> You cannot infer anything from this. In particular, the merging will
> be significantly different between a single full re-index and the
> state of segment merges in an incrementally built index.
>
> The admin UI screen is rooted in the pre-cloud days; the Master/Slave
> thing there is entirely misleading. In SolrCloud, since all the raw
> data is forwarded to all replicas, and any auto commits that happen
> may very well be slightly out of sync, the index size, number of
> segments, generations, and all that can pretty safely be ignored.
>
> Best,
> Erick
>
> On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries
> <mar...@downnotifier.com> wrote:
>
>> Hi Andrew,
>>
>> Even our master index is corrupt, so I'm afraid this won't help in
>> our case.
>>
>> Martin
>>
>> Andrew Butkus wrote on 05.03.2015 16:45:
>>
>>> Force a fetchindex on slave from master command:
>>> http://slave_host:port/solr/replication?command=fetchindex - from
>>> http://wiki.apache.org/solr/SolrReplication [1]
>>>
>>> The above command will download the whole index from master to
>>> slave. There are configuration options in Solr to make this problem
>>> happen less often (allowing it to recover from new documents added
>>> and only send the changes, with a wider gap) - but I can't remember
>>> what those were.
>>
>> Links:
>> ------
>> [1] http://wiki.apache.org/solr/SolrReplication
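For anyone who wants to try the CheckIndex tool mentioned above, a typical
invocation looks roughly like this; the Lucene jar version and index path
are placeholders for your own installation, and it is safest to run it on a
stopped core or on a copy of the index:

    java -cp lucene-core-4.10.3.jar org.apache.lucene.index.CheckIndex \
         /var/solr/data/collection1_shard1_replica1/data/index

Depending on your Solr version you may also need lucene-codecs on the
classpath. Adding -fix will drop any corrupt segments (and the documents in
them), so take a backup first.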