[ https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171913#comment-15171913 ]
Ramsey Haddad commented on SOLR-8760:
-------------------------------------

More details about the conditions leading up to this problem are in:
http://mail-archives.apache.org/mod_mbox/lucene-dev/201602.mbox/%3ccac2x+z3at7ileypotx3xzrp5qysklaatgm-xtjn1a8zpxus...@mail.gmail.com%3E

> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8760
>                 URL: https://issues.apache.org/jira/browse/SOLR-8760
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ramsey Haddad
>            Priority: Minor
>         Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we do rolling restarts of our Solr servers, we sometimes go painfully
> long without a shard leader. What happens is that a new leader is elected,
> but it must first fully sync old updates before it assumes the leadership
> role and accepts new updates. The syncing process takes unusually long
> because of an interaction between one of our hourly garbage collection DBQs
> being in the update logs and the replaying of old ADDs. If there is a single
> DBQ and 1000 older ADDs are being replayed, then the DBQ is replayed 1000
> times instead of once. This itself may be hard to fix. What is easier to fix
> is that most of the ADDs being replayed shouldn't need to be replayed in the
> first place, since they are older than ourLowThreshold.
>
> The problem can be fixed by eliminating, or by modifying, the way that the
> "completeList" term is used to affect the PeerSync lists.
>
> We propose two alternatives to fix this:
>
> FixA: Based on my possibly incomplete understanding of PeerSync, the
> completeList term should be eliminated. If updates older than ourLowThreshold
> need to be replayed, then aren't all the prerequisites for PeerSync violated,
> and hence shouldn't we fall back to SnapPull? (My gut suspects that a later
> bug fix to PeerSync fixed whatever issue completeList was trying to deal
> with.)
>
> FixB: The patch that added the ourLowThreshold term mentions that it is
> needed for the replay of some DELETEs. Well, if that is true and we do need
> to replay some DELETEs older than ourLowThreshold, then there is still no
> need to replay any ADDs older than ourLowThreshold, right?
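
To make the cost described in the issue concrete, here is a minimal sketch of the FixB idea in Python. It is an illustration only and is not taken from either attached patch or from Solr's PeerSync code. The names Update, updates_to_replay and dbq_reapplications are hypothetical, versions are simplified to plain integers where larger means newer, and the cost model (each replayed ADD re-runs every newer DBQ) is an assumption based on the "1000 ADDs, DBQ replayed 1000 times" description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Update:
    version: int  # hypothetical simplification: larger means newer
    kind: str     # "ADD", "DELETE" (by id), or "DBQ" (delete-by-query)


def updates_to_replay(missing: List[Update], our_low_threshold: int) -> List[Update]:
    """FixB-style filter: drop ADDs older than the threshold, keep everything else."""
    return [
        u for u in missing
        if not (u.kind == "ADD" and u.version < our_low_threshold)
    ]


def dbq_reapplications(replayed: List[Update], dbq_versions: List[int]) -> int:
    """Count DBQ re-executions, assuming each replayed ADD re-runs every newer DBQ."""
    return sum(
        1
        for u in replayed
        if u.kind == "ADD"
        for v in dbq_versions
        if v > u.version
    )


if __name__ == "__main__":
    # 1000 old ADDs plus one old DELETE; a single DBQ (version 5000) is in the log.
    missing = [Update(v, "ADD") for v in range(1, 1001)] + [Update(500, "DELETE")]
    dbqs = [5000]

    print(dbq_reapplications(missing, dbqs))  # prints 1000

    kept = updates_to_replay(missing, our_low_threshold=2000)
    print(dbq_reapplications(kept, dbqs))     # prints 0
    print(kept)                               # only the old DELETE remains
```

Under these assumptions, filtering out ADDs older than ourLowThreshold removes the repeated DBQ work while still replaying the older DELETE, which is the behavior FixB argues for.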