Hi Pushkar,

The shard had only 2 replicas, and the leader stayed the same throughout.
Does anything in the log snippet look odd to you? For example, why would the leader have numVersions=104904602 while the replica has numVersions=104904618, which is more than the leader, after the updates were applied?

On Sat, Jan 6, 2018 at 8:05 PM, Pushkar Raste <[email protected]> wrote:

> Hi Varun,
>
> - I noticed in your logs there are multiple version ranges. This would
> happen only if the replica has missed versions intermittently. I would
> expect only a single range, or a couple of ranges at most. One possible
> explanation is that there was another leader when the recovering replica
> went down, and then some other replica became leader while the recovering
> replica came back up.
>
> - Not sure which threshold you are referring to. The line you are pointing
> to checks if the recovering replica's highest version is newer than the
> leader's highest version. This check happens before the version diff
> (version ranges) is computed, and it happens irrespective of whether the
> fingerprint check is enabled or not. Last time I looked at this code, there
> was a check to ensure the replica's versions and the leader's versions have
> enough overlap (I think the heuristic was at least 20% overlap), which I
> don't see anymore, but you can see there are comments still lingering about
> the overlap.
>
> While the check to ensure the leader has higher versions than the replica
> happens very early in PeerSync, the fingerprint check happens only after
> the updates are applied to the replica. Think of the version check as a
> short-circuit test.
>
> I made a lot of changes to the PeerSync code some time ago, and would
> be happy to provide you details about what I know.
>
> On Fri, Jan 5, 2018 at 4:18 PM, Varun Thacker <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I was looking into a scenario where PeerSync failed even though we had
>> high values for maxNumLogsToKeep (200) and numRecordsToKeep (200000).
>>
>> The log excerpt is at
>> https://gist.github.com/vthacker/fb536c6f1146dd0d7513afb9960a10e3
>> and I am still trying to pinpoint the actual cause. It looks to me like
>> the replica has more documents up to that version (numVersions) than the
>> leader, and I can't tell why. Does this look like a bug?
>>
>> While trying to reproduce it locally, here is one scenario that I ran
>> into:
>>
>> 1. I kept a very low numRecordsToKeep (5), indexed 3 or 4 docs while the
>> replica was down, and then started it up. PeerSync failed because of
>> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L655
>> Do we need to do a threshold check when we are verifying via
>> fingerprinting whether the indexes are the same or not? From my
>> understanding we can avoid this check when fingerprinting is enabled, but
>> I wanted to check before filing a Jira.
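
For readers following the thread, the two ideas discussed above (the early "is the replica ahead of the leader" short-circuit, and how intermittent misses produce multiple version ranges) can be illustrated with a rough Java sketch. This is not Solr's PeerSync code: the class `PeerSyncSketch` and the methods `replicaAheadOfLeader` and `missingRanges` are hypothetical names made up for illustration, versions are treated as plain positive longs, and the fingerprint comparison that runs after updates are applied is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the actual org.apache.solr.update.PeerSync logic.
public class PeerSyncSketch {

    // Short-circuit test: true when the replica's highest version is newer
    // than the leader's highest version.
    static boolean replicaAheadOfLeader(List<Long> leaderVersions, List<Long> replicaVersions) {
        long leaderMax = leaderVersions.stream().mapToLong(Long::longValue).max().orElse(0L);
        long replicaMax = replicaVersions.stream().mapToLong(Long::longValue).max().orElse(0L);
        return replicaMax > leaderMax;
    }

    // Groups the leader versions that the replica lacks into runs that are
    // contiguous in the leader's (ascending) version list. Each run is {first, last}.
    static List<long[]> missingRanges(List<Long> leaderVersionsAscending, Set<Long> replicaVersions) {
        List<long[]> ranges = new ArrayList<>();
        long[] current = null;
        for (long v : leaderVersionsAscending) {
            if (replicaVersions.contains(v)) {
                current = null;               // replica has this version, so any open run ends
            } else if (current == null) {
                current = new long[] {v, v};  // start a new run of missed versions
                ranges.add(current);
            } else {
                current[1] = v;               // extend the open run
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Long> leader = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);

        // Replica went down once and missed everything after version 6.
        Set<Long> downOnce = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L));
        // Replica missed versions intermittently.
        Set<Long> intermittent = new HashSet<>(Arrays.asList(1L, 2L, 5L, 6L, 9L));

        System.out.println("ranges after one outage: " + missingRanges(leader, downOnce).size());
        System.out.println("ranges after intermittent misses: " + missingRanges(leader, intermittent).size());
    }
}
```

In this toy model a single outage yields one missing range, while intermittent misses yield several, which matches the expectation stated in the quoted reply that a healthy recovery should show one range or only a couple.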
