Hi Pushkar,

So the shard had only 2 replicas, and the leader stayed the same throughout.

Does anything in the log snippet look odd to you? For example, why would the
leader have (numVersions=104904602) and the replica have (numVersions=104904618),
which is more than the leader, after the updates were applied?

On Sat, Jan 6, 2018 at 8:05 PM, Pushkar Raste <[email protected]>
wrote:

> Hi Varun,
> - I noticed that your logs contain multiple version ranges. This would
> happen only if the replica had missed versions intermittently. I would expect
> a single range, or a couple of ranges at most. One possible
> explanation is that there was another leader when the recovering replica went
> down, and then some other replica became leader while the recovering replica
> came back up.
>
>
> - Not sure which threshold you are referring to. The line you are pointing
> to checks whether the recovering replica's highest version is newer than the
> leader's highest version. This check happens before the version diff (version
> ranges) is computed, and it happens irrespective of whether the fingerprint
> check is enabled. Last time I looked at this code, there was a check
> to ensure that the replica's versions and the leader's versions had enough
> overlap (I think the heuristic was at least 20% overlap). I don't see that
> check anymore, but there are still comments about the overlap lingering in
> the code.
>
>   While the check to ensure the leader has higher versions than the replica
> happens very early in PeerSync, the fingerprint check happens only after
> the updates are applied to the replica. Think of the version check as a
> short circuit test.
>
> I made a lot of changes to the PeerSync code some time ago, and would
> be happy to share the details of what I know.
>
> On Fri, Jan 5, 2018 at 4:18 PM, Varun Thacker <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I was looking into a scenario where PeerSync failed even though we had
>> high values for maxNumLogsToKeep (200) and numRecordsToKeep (200000).
>>
>> The log excerpt is at
>> https://gist.github.com/vthacker/fb536c6f1146dd0d7513afb9960a10e3 and I am
>> still trying to pinpoint the actual cause. It looks to me like the replica
>> has more documents up to that version (numVersions) than the leader, and I
>> can't tell why. Does this look like a bug?
>>
>> While trying to reproduce it locally, here is one scenario that I ran into:
>>
>>    1. I kept numRecordsToKeep very low (5), indexed 3 or 4
>>    docs while the replica was down, and then started it up. PeerSync failed
>>    because of the check at
>>    https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L655
>>    Do we need to do a threshold check when we are verifying via
>>    fingerprinting whether the indexes are the same? From my understanding, we
>>    can avoid this check when fingerprinting is enabled, but I wanted to check
>>    before filing a Jira.
>>
>>
>>
>>
>
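
A minimal sketch of the two checks discussed in this thread, under simplifying assumptions: the early test that the replica's highest version is not newer than the leader's, and the fingerprint comparison that runs only after updates are applied. The class, method, and field names below are hypothetical and are not Solr's actual PeerSync or IndexFingerprint API; versions are assumed to be positive longs, and the hash is an arbitrary order-independent sum. The sample data only illustrates how a replica can pass the early test while still holding more versions than the leader; it is not a diagnosis of the logs above.

```java
import java.util.List;

// Hypothetical sketch: class, method, and field names are illustrative
// and are NOT Solr's actual PeerSync or IndexFingerprint API.
class PeerSyncSketch {

    /** Simplified fingerprint: a version count plus an order-independent hash. */
    static final class Fingerprint {
        final long numVersions;
        final long versionsHash;

        Fingerprint(long numVersions, long versionsHash) {
            this.numVersions = numVersions;
            this.versionsHash = versionsHash;
        }

        boolean matches(Fingerprint other) {
            return numVersions == other.numVersions
                    && versionsHash == other.versionsHash;
        }
    }

    /** Highest version in the list (Long.MIN_VALUE if the list is empty). */
    static long highest(List<Long> versions) {
        long max = Long.MIN_VALUE;
        for (long v : versions) {
            max = Math.max(max, v);
        }
        return max;
    }

    /** Early short-circuit test, run before any version diff is computed. */
    static boolean replicaIsAhead(List<Long> replica, List<Long> leader) {
        return highest(replica) > highest(leader);
    }

    /** Fingerprint over all versions; compared only after updates are applied. */
    static Fingerprint fingerprint(List<Long> versions) {
        long hash = 0;
        for (long v : versions) {
            // Wrapping sum, so the order of versions does not matter.
            hash += v * 0x9E3779B97F4A7C15L;
        }
        return new Fingerprint(versions.size(), hash);
    }

    public static void main(String[] args) {
        // Same highest version on both sides, but the replica holds one extra version.
        List<Long> leader = List.of(10L, 20L, 40L);
        List<Long> replica = List.of(10L, 20L, 30L, 40L);

        System.out.println("replica ahead of leader: "
                + replicaIsAhead(replica, leader));
        System.out.println("fingerprints match: "
                + fingerprint(replica).matches(fingerprint(leader)));
    }
}
```

In this sketch the early test cannot detect the extra version, because it looks only at the highest version on each side; the mismatch shows up only in the fingerprint comparison.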
