[ 
https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171913#comment-15171913
 ] 

Ramsey Haddad commented on SOLR-8760:
-------------------------------------

More details about the conditions leading up to this problem are in: 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201602.mbox/%3ccac2x+z3at7ileypotx3xzrp5qysklaatgm-xtjn1a8zpxus...@mail.gmail.com%3E


> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to 
> stall new leadership
> ------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8760
>                 URL: https://issues.apache.org/jira/browse/SOLR-8760
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ramsey Haddad
>            Priority: Minor
>         Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch
>
>
> When we are doing rolling restarts of our Solr servers, we are sometimes 
> hitting painfully long times without a shard leader. What happens is that a 
> new leader is elected, but first needs to fully sync old updates before it 
> assumes the leadership role and accepts new updates. The syncing process is 
> taking unusually long because of an interaction between having one of our 
> hourly garbage collection DBQs in the update logs and the replaying of old 
> ADDs. If there is a single DBQ and 1000 older ADDs are being replayed, 
> then the DBQ is replayed 1000 times instead of once. That in itself may be 
> hard to fix. What is easier to fix is that most of the ADDs being replayed 
> should not need to be replayed in the first place, since they are older 
> than ourLowThreshold.
> The problem can be fixed by eliminating or by modifying the way that the 
> "completeList" term is used to affect the PeerSync lists.
> We propose two alternatives to fix this:
> FixA: Based on my possibly incomplete understanding of PeerSync, the 
> completeList term should be eliminated. If updates older than ourLowThreshold 
> need to be replayed, then aren't all the prerequisites for PeerSync violated 
> and hence we should fall back to SnapPull? (My gut suspects that a later bug 
> fix to PeerSync fixed whatever issue completeList was trying to deal with.)
> FixB: The patch that added the ourLowThreshold term mentions that it is 
> needed for the replay of some DELETEs. Well, if that is true and we do need 
> to replay some DELETEs older than ourLowThreshold, then there is still no 
> need to replay any ADDs older than ourLowThreshold, right?
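
The FixB idea above can be sketched roughly as follows. This is only an illustration and not the code in solr-8760-fixB.patch: the Update class and the selectForReplay method are hypothetical stand-ins rather than Solr's real PeerSync types, and the sketch assumes that the magnitude of a version number orders updates in time, with deletes carrying negative versions. Deletes are always kept, while ADDs older than ourLowThreshold are dropped from the replay list.

```java
import java.util.ArrayList;
import java.util.List;

public class PeerSyncFilterSketch {

    /** Hypothetical stand-in for an update-log entry; not Solr's real type. */
    static final class Update {
        final long version;   // assumption: magnitude orders updates in time
        final boolean isAdd;  // false means a delete (by id or by query)

        Update(long version, boolean isAdd) {
            this.version = version;
            this.isAdd = isAdd;
        }
    }

    /**
     * FixB idea: keep every delete, but skip ADDs whose version is older
     * than ourLowThreshold, since those should not need to be replayed.
     */
    static List<Update> selectForReplay(List<Update> candidates, long ourLowThreshold) {
        List<Update> toReplay = new ArrayList<>();
        for (Update u : candidates) {
            boolean olderThanThreshold = Math.abs(u.version) < ourLowThreshold;
            if (u.isAdd && olderThanThreshold) {
                continue; // old ADD: leave it out of the replay list
            }
            toReplay.add(u);
        }
        return toReplay;
    }

    public static void main(String[] args) {
        List<Update> candidates = new ArrayList<>();
        candidates.add(new Update(100L, true));    // old ADD, skipped
        candidates.add(new Update(-150L, false));  // old delete, kept
        candidates.add(new Update(300L, true));    // recent ADD, kept
        List<Update> kept = selectForReplay(candidates, 200L);
        System.out.println(kept.size());
    }
}
```

With fewer old ADDs in the replay list, a DBQ that gets re-applied once per replayed ADD is re-applied correspondingly fewer times, which is the effect the description is after.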



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
