Ah, so the replica has a higher version after the updates are applied. One possible reason is that the replica did not buffer the updates that came in while it was recovering. In your test, however, it failed before the updates were applied, because the check you are pointing to happens even before the version ranges are computed.
As mentioned, the only thing that stands out in the logs is that there are too many version ranges. Ideally there should be only one version range, with the lower version corresponding to the last version the replica received before going down and the higher version corresponding to the last version the replica received before it started buffering updates.

My apologies for not being able to pinpoint the exact issue. There is a PeerSyncReplicationTest that can be useful to verify whether PeerSync is broken. If you can send me the test, I can take a look at it.

On Jan 6, 2018 11:45 PM, "Varun Thacker" <[email protected]> wrote:

Hi Pushkar,

So the shard had only 2 replicas and the leader was always constant. Does anything look odd to you in the log snippet? For example, why would the leader have ( numVersions=104904602 ) and the replica have ( numVersions=104904618 ), which is more than the leader, after the updates were applied?

On Sat, Jan 6, 2018 at 8:05 PM, Pushkar Raste <[email protected]> wrote:

> Hi Varun,
>
> - I noticed in your logs there are multiple version ranges; this would
> happen only if the replica has missed versions intermittently. I would
> expect only a single range, or a couple of ranges at most. One possible
> explanation is that there was another leader when the recovering replica
> went down, and then some other replica became leader while the recovering
> replica came back up.
>
> - Not sure which threshold you are referring to. The line you are pointing
> to checks if the recovering replica's highest version is newer than the
> leader's highest version. This check happens before the version diff
> (version ranges) is computed, and it happens irrespective of whether the
> fingerprint check is enabled. Last time I looked at this code, there was a
> check to ensure the replica's versions and the leader's versions have
> enough overlap (I think the heuristic was at least 20% overlap), which I
> don't see anymore, but you can see there are comments still lingering
> about the overlap.
>
> While the check to ensure the leader has higher versions than the replica
> happens very early in PeerSync, the fingerprint check happens only after
> the updates are applied to the replica. Think of the version check as a
> short-circuit test.
>
> I made a lot of changes to the PeerSync code some time ago, and would
> be happy to provide you details about what I know.
>
> On Fri, Jan 5, 2018 at 4:18 PM, Varun Thacker <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I was looking into a scenario where PeerSync failed even when we had a
>> high maxNumLogsToKeep ( 200 ) and numRecordsToKeep ( 200000 ).
>>
>> The log excerpt is at
>> https://gist.github.com/vthacker/fb536c6f1146dd0d7513afb9960a10e3
>> and I am still trying to pinpoint the actual cause. It looks to me that
>> the replica has more documents up to that version ( numVersions ) than
>> the leader, and I can't tell why. Does this look like a bug?
>>
>> While trying to reproduce it locally, here is one scenario that I ran
>> into:
>>
>> 1. I kept a very low numRecordsToKeep ( 5 ), indexed 3 or 4 docs while
>> the replica was down, and then started it up. PeerSync failed because of
>> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L655
>> . Do we need to do a threshold check when we are verifying via
>> fingerprinting whether the indexes are the same? From my understanding
>> we can avoid this check when fingerprinting is enabled, but I wanted to
>> check before filing a Jira.
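
To make the two ideas discussed in this thread concrete (the early "is the replica ahead of the leader?" short circuit, and the version ranges a replica would need to fetch), here is a minimal sketch in Java. It is not the PeerSync implementation: the class and method names (VersionRangeSketch, replicaIsAhead, missedRanges) are hypothetical, and it assumes versions are plain positive longs held in memory.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Illustrative only; names and structure are assumptions, not Solr's PeerSync code. */
public class VersionRangeSketch {

    /**
     * Short-circuit style check: true when the replica's highest version is
     * newer than the leader's highest version. Empty inputs yield false.
     */
    public static boolean replicaIsAhead(Collection<Long> leaderVersions,
                                         Collection<Long> replicaVersions) {
        if (leaderVersions.isEmpty() || replicaVersions.isEmpty()) {
            return false;
        }
        return Collections.max(replicaVersions) > Collections.max(leaderVersions);
    }

    /**
     * Walks the leader's versions in ascending order and groups consecutive
     * versions that the replica lacks into closed ranges {low, high}.
     */
    public static List<long[]> missedRanges(List<Long> leaderVersions,
                                            Set<Long> replicaVersions) {
        List<Long> sorted = new ArrayList<>(leaderVersions);
        Collections.sort(sorted);

        List<long[]> ranges = new ArrayList<>();
        long[] open = null; // range currently being extended, if any
        for (long v : sorted) {
            if (replicaVersions.contains(v)) {
                open = null;                  // replica has v, so the gap ends here
            } else if (open == null) {
                open = new long[] {v, v};     // start a new gap at v
                ranges.add(open);
            } else {
                open[1] = v;                  // extend the current gap
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Long> leader = List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
        // The replica missed versions in two separate stretches.
        Set<Long> replica = Set.of(1L, 2L, 3L, 7L, 8L);

        System.out.println("replica ahead: " + replicaIsAhead(leader, replica));
        for (long[] r : missedRanges(leader, replica)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

In this sketch, a replica that was down once and then buffered everything afterwards produces a single range, while a replica that missed versions intermittently produces several, which is the pattern described above as unexpected in the logs.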
