[ https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ramsey Haddad updated SOLR-8760:
--------------------------------
    Attachment: solr-8760-fixB.patch
                solr-8760-fixA.patch

> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8760
>                 URL: https://issues.apache.org/jira/browse/SOLR-8760
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ramsey Haddad
>            Priority: Minor
>         Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we do rolling restarts of our Solr servers, we sometimes go painfully
> long without a shard leader. What happens is that a new leader is elected,
> but it first needs to fully sync old updates before it assumes the
> leadership role and accepts new updates. The syncing process takes
> unusually long because of an interaction between one of our hourly garbage
> collection DBQs in the update logs and the replaying of old ADDs. If there
> is a single DBQ and 1000 older ADDs that are getting replayed, then the DBQ
> is replayed 1000 times instead of once. This itself may be hard to fix. The
> thing that is easier to fix is that most of the ADDs getting replayed
> shouldn't need to be replayed in the first place, since they are older than
> ourLowThreshold.
>
> The problem can be fixed by eliminating, or by modifying, the way that the
> "completeList" term is used to affect the PeerSync lists.
>
> We propose two alternatives to fix this:
>
> FixA: Based on my possibly incomplete understanding of PeerSync, the
> completeList term should be eliminated. If updates older than
> ourLowThreshold need to be replayed, then aren't all the prerequisites for
> PeerSync violated, and hence we should fall back to SnapPull? (My gut
> suspects that a later bug fix to PeerSync fixed whatever issue completeList
> was trying to deal with.)
>
> FixB: The patch that added the ourLowThreshold term mentions that it is
> needed for the replay of some DELETEs. If that is true and we do need to
> replay some DELETEs older than ourLowThreshold, then there is still no need
> to replay any ADDs older than ourLowThreshold, right?
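
For illustration only, the FixB idea and the replay cost described in the issue can be sketched as a toy model. Nothing below is Solr's PeerSync or UpdateLog code: the class PeerSyncSketch, the Update type, and both methods are hypothetical names, versions are treated as plain positive numbers, and the cost model (each replayed ADD that is older than a logged DBQ triggers one re-application of that DBQ) is an assumption taken from the description above rather than from the patches.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical types, not Solr's PeerSync or UpdateLog API.
final class PeerSyncSketch {
    enum Kind { ADD, DELETE_BY_QUERY }

    // A requested update, reduced to its kind and a positive version number.
    static final class Update {
        final Kind kind;
        final long version;

        Update(Kind kind, long version) {
            this.kind = kind;
            this.version = version;
        }
    }

    // FixB idea: drop ADDs older than ourLowThreshold, keep deletes of any age.
    static List<Update> filterForReplay(List<Update> requested, long ourLowThreshold) {
        List<Update> kept = new ArrayList<>();
        for (Update u : requested) {
            boolean oldAdd = u.kind == Kind.ADD && u.version < ourLowThreshold;
            if (!oldAdd) {
                kept.add(u);
            }
        }
        return kept;
    }

    // Assumed cost model: every replayed ADD that is older than a DBQ already
    // in the update log causes that DBQ to be re-applied once.
    static long countDbqReapplications(List<Update> replayed, List<Long> loggedDbqVersions) {
        long count = 0;
        for (Update u : replayed) {
            if (u.kind != Kind.ADD) {
                continue;
            }
            for (long dbqVersion : loggedDbqVersions) {
                if (dbqVersion > u.version) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 1000 old ADDs plus one old DBQ are requested; one newer DBQ is in the log.
        List<Update> requested = new ArrayList<>();
        for (long v = 1; v <= 1000; v++) {
            requested.add(new Update(Kind.ADD, v));
        }
        requested.add(new Update(Kind.DELETE_BY_QUERY, 1500));
        List<Long> loggedDbqs = List.of(5000L);
        long ourLowThreshold = 2000;

        List<Update> filtered = filterForReplay(requested, ourLowThreshold);
        System.out.println("unfiltered: " + countDbqReapplications(requested, loggedDbqs));
        System.out.println("filtered:   " + countDbqReapplications(filtered, loggedDbqs));
    }
}
```

In this toy setup the unfiltered list costs 1000 DBQ re-applications and the filtered list costs none, while the old DBQ itself is still kept for replay. Whether dropping old ADDs is actually safe in PeerSync is exactly the open question the issue raises.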