[jira] [Updated] (SOLR-8760) PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to stall new leadership

2016-02-29 Thread Ramsey Haddad (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramsey Haddad updated SOLR-8760:

Description: 
When we do rolling restarts of our Solr servers, we sometimes go painfully 
long without a shard leader. A new leader is elected, but it must fully sync 
old updates before it assumes the leadership role and accepts new updates. The 
sync takes unusually long because of an interaction between one of our hourly 
garbage-collection DBQs (delete-by-query) in the update log and the replay of 
old ADDs. If there is a single DBQ and 1000 older ADDs are replayed, then the 
DBQ is replayed 1000 times instead of once. That in itself may be hard to fix. 
What is easier to fix is that most of the replayed ADDs should not need to be 
replayed in the first place, since they are older than ourLowThreshold.
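
To make the cost concrete, here is a toy model of the behaviour described 
above. It is only a sketch under a stated assumption: every replayed ADD that 
is older than a DBQ triggers one more application of that DBQ. The class and 
method names are hypothetical and are not taken from Solr or from the attached 
patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy cost model only; DbqReplayCost and dbqApplications are hypothetical names.
class DbqReplayCost {

    /**
     * Counts DBQ applications under the assumption that each replayed ADD
     * older than a DBQ causes that DBQ to be applied once more.
     */
    static int dbqApplications(List<Long> replayedAddVersions, List<Long> dbqVersions) {
        int applications = 0;
        for (long add : replayedAddVersions) {
            for (long dbq : dbqVersions) {
                if (dbq > add) {
                    applications++; // DBQ is newer than this ADD, so it is applied again
                }
            }
        }
        return applications;
    }

    public static void main(String[] args) {
        List<Long> adds = new ArrayList<>();
        for (long v = 1; v <= 1000; v++) {
            adds.add(v); // 1000 ADDs, all older than the DBQ below
        }
        List<Long> dbqs = Arrays.asList(5000L); // a single DBQ
        System.out.println(dbqApplications(adds, dbqs)); // prints 1000
    }
}
```

With 1000 older ADDs and one DBQ the model reports 1000 DBQ applications, 
which is the slowdown we see while the shard has no leader.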

The problem can be fixed by eliminating or modifying the way that the 
"completeList" term is used to affect the PeerSync lists.

We propose two alternatives to fix this:

FixA: Based on my possibly incomplete understanding of PeerSync, the 
completeList term should be eliminated. If updates older than ourLowThreshold 
need to be replayed, then aren't all the prerequisites for PeerSync violated, 
so that we should fall back to SnapPull? (My gut suspects that a later bug fix 
to PeerSync fixed whatever issue completeList was trying to deal with.)

FixB: The patch that added the completeList term mentions that it is needed for 
the replay of some DELETEs. If that is true and we do need to replay some 
DELETEs older than ourLowThreshold, then there is still no need to replay any 
ADDs older than ourLowThreshold, right?
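
The selection rule behind FixB can be sketched as follows. This illustrates 
the idea only and is not the contents of solr-8760-fixB.patch: the Update 
type, the Kind enum and selectForReplay are hypothetical, and versions are 
modelled as plain positive numbers where a smaller value means an older 
update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of the FixB idea; all names here are hypothetical.
class PeerSyncFilterSketch {

    enum Kind { ADD, DELETE }

    static final class Update {
        final Kind kind;
        final long version;

        Update(Kind kind, long version) {
            this.kind = kind;
            this.version = version;
        }
    }

    /**
     * Keeps every update at or above ourLowThreshold. Below the threshold,
     * keeps DELETEs and drops ADDs.
     */
    static List<Update> selectForReplay(List<Update> candidates, long ourLowThreshold) {
        List<Update> selected = new ArrayList<>();
        for (Update u : candidates) {
            boolean olderThanThreshold = u.version < ourLowThreshold;
            if (olderThanThreshold && u.kind == Kind.ADD) {
                continue; // old ADD: skip the replay
            }
            selected.add(u);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Update> candidates = Arrays.asList(
                new Update(Kind.ADD, 90L),     // old ADD, dropped
                new Update(Kind.DELETE, 95L),  // old DELETE, kept
                new Update(Kind.ADD, 100L),    // at the threshold, kept
                new Update(Kind.ADD, 110L));   // newer than the threshold, kept
        for (Update u : selectForReplay(candidates, 100L)) {
            System.out.println(u.kind + " " + u.version);
        }
    }
}
```

Under this rule the old DELETEs that completeList was added for still get 
replayed, while the old ADDs that make the DBQ replay expensive do not.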


  was:
When we do rolling restarts of our Solr servers, we sometimes go painfully 
long without a shard leader. A new leader is elected, but it must fully sync 
old updates before it assumes the leadership role and accepts new updates. The 
sync takes unusually long because of an interaction between one of our hourly 
garbage-collection DBQs (delete-by-query) in the update log and the replay of 
old ADDs. If there is a single DBQ and 1000 older ADDs are replayed, then the 
DBQ is replayed 1000 times instead of once. That in itself may be hard to fix. 
What is easier to fix is that most of the replayed ADDs should not need to be 
replayed in the first place, since they are older than ourLowThreshold.

The problem can be fixed by eliminating or modifying the way that the 
"completeList" term is used to affect the PeerSync lists.

We propose two alternatives to fix this:

FixA: Based on my possibly incomplete understanding of PeerSync, the 
completeList term should be eliminated. If updates older than ourLowThreshold 
need to be replayed, then aren't all the prerequisites for PeerSync violated, 
so that we should fall back to SnapPull? (My gut suspects that a later bug fix 
to PeerSync fixed whatever issue completeList was trying to deal with.)

FixB: The patch that added the ourLowThreshold term mentions that it is needed 
for the replay of some DELETEs. If that is true and we do need to replay some 
DELETEs older than ourLowThreshold, then there is still no need to replay any 
ADDs older than ourLowThreshold, right?



> PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to 
> stall new leadership
> 
>
> Key: SOLR-8760
> URL: https://issues.apache.org/jira/browse/SOLR-8760
> Project: Solr
> Issue Type: Bug
> Reporter: Ramsey Haddad
> Priority: Minor
> Attachments: solr-8760-fixA.patch, solr-8760-fixB.patch

[jira] [Updated] (SOLR-8760) PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to stall new leadership

2016-02-29 Thread Ramsey Haddad (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-8760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramsey Haddad updated SOLR-8760:

Attachment: solr-8760-fixB.patch
solr-8760-fixA.patch



