Hi Varun,
- I noticed that your logs contain multiple version ranges. This would
happen only if the replica missed versions intermittently. I would expect
a single range, or a couple of ranges at most. One possible explanation is
that there was another leader when the recovering replica went down, and
then some other replica became leader while the recovering replica came
back up.
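
To illustrate what I mean by version ranges, here is a rough sketch. It is
not the actual PeerSync code: the class name, the method name and the
"start...end" range format are made up for illustration. It only shows how
one outage yields a single range while intermittent misses yield several.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class VersionRangeSketch {
    // Hypothetical helper (not part of Solr): walks the leader's versions in
    // order and groups consecutive versions the replica lacks into ranges.
    static List<String> missingRanges(List<Long> leaderVersions, Set<Long> replicaVersions) {
        List<String> ranges = new ArrayList<>();
        Long start = null;
        Long end = null;
        for (Long v : leaderVersions) {
            if (!replicaVersions.contains(v)) {
                if (start == null) {
                    start = v;          // a new gap begins here
                }
                end = v;                // extend the current gap
            } else if (start != null) {
                ranges.add(start + "..." + end);  // gap closed by a known version
                start = null;
            }
        }
        if (start != null) {
            ranges.add(start + "..." + end);      // gap runs to the newest version
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Long> leader = List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L);
        // Replica went down once and missed everything after version 3.
        System.out.println(missingRanges(leader, Set.of(1L, 2L, 3L)));
        // Replica missed versions intermittently.
        System.out.println(missingRanges(leader, Set.of(1L, 3L, 4L, 7L)));
    }
}
```

The first call produces one range and the second produces three, which is
the pattern I would not normally expect to see in a recovery log.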


- Not sure which threshold you are referring to. The line you are pointing
to checks if the recovering replica's highest version is newer than the
leader's highest version. This check happens before the version diff
(version ranges) is computed, and it happens irrespective of whether the
fingerprint check is enabled. Last time I looked at this code, there was a
check to ensure that the replica's versions and the leader's versions have
enough overlap (I think the heuristic was at least 20% overlap). I don't
see that check anymore, but there are still comments lingering about the
overlap.

  While the check to ensure the leader has higher versions than the replica
happens very early in PeerSync, the fingerprint check happens only after
the updates are applied to the replica. Think of the version check as a
short-circuit test.
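
  A simplified sketch of that ordering, under loud assumptions: this is not
the real PeerSync implementation. The class, the Outcome enum, the toy
fingerprint function and the choice of what the early exit returns are all
hypothetical. It only shows that the version check runs first and ignores
the fingerprint setting, while the fingerprint comparison runs last.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PeerSyncOrderSketch {
    enum Outcome { SHORT_CIRCUITED, SYNCED, FINGERPRINT_MISMATCH }

    // Toy stand-in for an index fingerprint: an order-independent sum over versions.
    static long fingerprint(List<Long> versions) {
        long sum = 0;
        for (long v : versions) {
            sum += Long.hashCode(v) * 31L;
        }
        return sum;
    }

    // Assumes both version lists are non-empty.
    static Outcome sync(List<Long> replica, List<Long> leader, boolean fingerprintEnabled) {
        // Step 1: early version check. Runs whether or not fingerprinting is enabled.
        if (Collections.max(replica) > Collections.max(leader)) {
            return Outcome.SHORT_CIRCUITED;  // no diff is computed at all
        }
        // Step 2: compute the version diff (what the replica is missing).
        List<Long> missing = new ArrayList<>(leader);
        missing.removeAll(replica);
        // Step 3: apply the missing updates to the replica.
        List<Long> updated = new ArrayList<>(replica);
        updated.addAll(missing);
        // Step 4: only now compare fingerprints, and only if enabled.
        if (fingerprintEnabled && fingerprint(updated) != fingerprint(leader)) {
            return Outcome.FINGERPRINT_MISMATCH;
        }
        return Outcome.SYNCED;
    }

    public static void main(String[] args) {
        // Replica's highest version is newer: exits at step 1 either way.
        System.out.println(sync(List.of(1L, 2L, 9L), List.of(1L, 2L, 3L), true));
        System.out.println(sync(List.of(1L, 2L, 9L), List.of(1L, 2L, 3L), false));
        // Replica holds a version the leader lacks: caught only at step 4.
        System.out.println(sync(List.of(1L, 2L), List.of(2L, 3L), true));
    }
}
```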

I made a lot of changes to the PeerSync code some time ago and would be
happy to share the details of what I know.

On Fri, Jan 5, 2018 at 4:18 PM, Varun Thacker <[email protected]> wrote:

> Hi Everyone,
>
> I was looking into a scenario where PeerSync failed even though we had
> high values for maxNumLogsToKeep (200) and numRecordsToKeep (200000).
>
> The log excerpt is at
> https://gist.github.com/vthacker/fb536c6f1146dd0d7513afb9960a10e3 and I am
> still trying to pinpoint the actual cause. It looks to me like the replica
> has more documents up to that version (numVersions) than the leader, and I
> can't tell why. Does this look like a bug?
>
> While trying to reproduce it locally here is one scenario that I ran into :
>
>    1. I kept a very low numRecordsToKeep (5). I indexed 3 or 4 docs
>    while the replica was down and then started it up. PeerSync failed because
>    of
>    https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L655 .
>    Do we need to do a threshold check when we are verifying via
>    fingerprinting whether the indexes are the same? From my understanding we
>    can avoid this check when fingerprinting is enabled, but I wanted to check
>    before filing a Jira.
>
>
>
>