Hi Erick,

I should add that our Solr cluster is in production and new documents
are constantly being indexed. The new cluster has been up for three
weeks now. We only discovered the problem now because, in our use case,
Atomic Updates and RealTime Gets are mostly performed on new documents.
It is almost certain that the index already contains documents that
were distributed to the shards according to the new hash ranges. If we
just changed the hash ranges in ZooKeeper, the index would still be in
an inconsistent state.

Is there any way to recover from this without having to re-index all
documents?

Best,
Gary

2016-06-15 19:23 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:
> Simplest, though a bit risky, is to manually edit the znode and
> correct the entry. There are various tools out there, including one
> that ships with ZooKeeper (see the ZK documentation).
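>
> With the zkCli.sh that ships with ZooKeeper, that could look roughly
> like the following (host, port, collection name and znode path are
> placeholders; depending on the collection's stateFormat the ranges
> live in /clusterstate.json or /collections/<collection>/state.json):
>
>   # connect to one member of the ensemble
>   ./zkCli.sh -server zk1:2181
>
>   # then, inside the zkCli shell, inspect the current shard ranges:
>   get /collections/mycollection/state.json
>
>   # and write the corrected JSON back with:
>   # set /collections/mycollection/state.json '<corrected JSON>'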
>
> Or you can use the zkcli scripts (the ZooKeeper ones) to get the znode
> down to your local machine, edit it there, and then push it back up to
> ZK.
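>
> Solr's own zkcli.sh (in server/scripts/cloud-scripts) can do that
> round trip with getfile/putfile; something along these lines (zkhost,
> collection name and local path are placeholders to adjust):
>
>   # download the znode to a local file
>   server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
>     -cmd getfile /collections/mycollection/state.json /tmp/state.json
>
>   # fix the hash ranges in /tmp/state.json, then upload it again
>   server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
>     -cmd putfile /collections/mycollection/state.json /tmp/state.json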
>
> I'd do all this with my Solr nodes shut down, then ensure that my ZK
> ensemble was consistent after the update etc....
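>
> One quick way to check the ensemble is ZooKeeper's four-letter-word
> commands (host names below are placeholders):
>
>   # 'srvr' reports each node's mode (leader/follower) and zxid
>   echo srvr | nc zk1 2181
>   echo srvr | nc zk2 2181
>   echo srvr | nc zk3 2181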
>
> Best,
> Erick
>
> On Wed, Jun 15, 2016 at 8:36 AM, Gary Yao <gary....@zalando.de> wrote:
>> Hi all,
>>
>> My team at work maintains a SolrCloud 5.3.2 cluster with multiple
>> collections configured with sharding and replication.
>>
>> We recently backed up our Solr indexes using the built-in backup
>> functionality. After the cluster was restored from the backup, we
>> noticed that atomic updates of documents are failing occasionally with
>> the error message 'missing required field [...]'. The exceptions are
>> thrown on a host on which the document to be updated is not stored.
>> From this we deduced that there is a problem with routing updates to
>> the right host by the hash of the uniqueKey. Indeed, our
>> investigation so far showed that for at least one collection in the
>> new cluster, the shards now have different hash ranges assigned. We
>> checked the hash ranges by querying
>> /admin/collections?action=CLUSTERSTATUS (an example request is shown
>> after the listing). Below are the shard hash ranges of one collection
>> that we debugged.
>>
>>   Old cluster:
>>     shard1_0 80000000 - aaa9ffff
>>     shard1_1 aaaa0000 - d554ffff
>>     shard2_0 d5550000 - fffeffff
>>     shard2_1 ffff0000 - 2aa9ffff
>>     shard3_0 2aaa0000 - 5554ffff
>>     shard3_1 55550000 - 7fffffff
>>
>>   New cluster:
>>     shard1 80000000 - aaa9ffff
>>     shard2 aaaa0000 - d554ffff
>>     shard3 d5550000 - ffffffff
>>     shard4 0 - 2aa9ffff
>>     shard5 2aaa0000 - 5554ffff
>>     shard6 55550000 - 7fffffff
>>
>>   Note that the shard names differ because the old cluster's shards were
>>   split.
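>>
>> For reference, the ranges above were retrieved with a request along
>> these lines (host and collection name are placeholders):
>>
>>   curl 'http://solr-host:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json'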
>>
>> As you can see, the ranges of shard3 and shard4 differ from those in
>> the old cluster. This change of hash ranges matches the symptoms we
>> are currently experiencing.
>>
>> We found this JIRA ticket https://issues.apache.org/jira/browse/SOLR-5750
>> in which David Smiley comments:
>>
>>   shard hash ranges aren't restored; this error could be disasterous
>>
>> It seems that this is what happened to us. We would like to hear some
>> suggestions on how we could recover from this problem.
>>
>> Best,
>> Gary
