[ 
https://issues.apache.org/jira/browse/KUDU-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2800:
--------------------------------
    Description: 
As implemented in
https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576
 , the logic for tracking 'health' of tablet replicas cannot differentiate 
between bootstrapping and failed replicas.

As a result, if a tablet replica is bootstrapping for times longer than the 
interval specified by {{--follower_unavailable_considered_failed_sec}} run-time 
flag, the system can start the process of re-replication of the tablet replica 
elsewhere.

One option might be sending a specific error with {{ConsensusResponsePB}} in 
response to a Raft message sent by a leader replica, maybe adding extra 
information on the current progress of the replica bootstrap process.  As soon 
as such bootstrapping follower replica isn't failing behind leader's WAL GC 
threshold, the leader replica will not evict it.  But if the bootstrapping 
follower replica falls behind the WAL GC threshold, leader replica will evict 
it and the system will start re-replicating it elsewhere.  In cases when the 
amount of Raft transactions for a tablet is low, this approach would allow for 
longer bootstrapping times of tablet replicas.  That might be especially 
beneficial in cases when a tablet server with IO-heavy tablet replicas is being 
restarted, and there aren't many incoming updates/inserts for tablets hosted by 
the tablet server.

However, the approach above requires the Raft consensus object for a 
bootstrapping replica to be at least partially functional, so it entails 
reading at least some information about a replica from the on-disk consensus 
metadata prior to proper bootstrapping of a tablet replica by a tablet server.


  was:
As implemented in
https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576
 , the logic for tracking 'health' of tablet replicas cannot differentiate 
between bootstrapping and failed replicas.

As a result, if a tablet replica is bootstrapping for times longer than the 
interval specified by {{--follower_unavailable_considered_failed_sec}} run-time 
flag, the system can start the process of re-replication of the tablet replica 
elsewhere.

One option might be sending a special {{PeerStatus}} for a bootstrapping 
replica with a response to a Raft message sent by a leader replica and updating 
the logic referenced above.  The response might also include additional 
information on the current progress of the bootstrap process.  Probably, we 
need add a separate timeout to track a stale bootstrapping replica, so its 
health would be reported as FAILED after the leader observes the replica being 
stuck in bootstrapping with no forward progress for a time interval longer than 
the timeout specified by the new parameter.

However, the approach above requires the Raft consensus object for a 
bootstrapping replica to be at least partially functional, so it entails 
reading at least some information about a replica from the on-disk consensus 
metadata prior to proper bootstrapping of a tablet replica by a tablet server.




> Avoid 'unintended' re-replication of long-bootstrapping tablet replicas
> -----------------------------------------------------------------------
>
>                 Key: KUDU-2800
>                 URL: https://issues.apache.org/jira/browse/KUDU-2800
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.9.1, 1.10.0
>            Reporter: Alexey Serbin
>            Assignee: Vladimir Verjovkin
>            Priority: Major
>              Labels: newbie
>
> As implemented in
> https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576
>  , the logic for tracking 'health' of tablet replicas cannot differentiate 
> between bootstrapping and failed replicas.
> As a result, if a tablet replica is bootstrapping for times longer than the 
> interval specified by {{--follower_unavailable_considered_failed_sec}} 
> run-time flag, the system can start the process of re-replication of the 
> tablet replica elsewhere.
> One option might be sending a specific error with {{ConsensusResponsePB}} in 
> response to a Raft message sent by a leader replica, maybe adding extra 
> information on the current progress of the replica bootstrap process.  As 
> soon as such bootstrapping follower replica isn't failing behind leader's WAL 
> GC threshold, the leader replica will not evict it.  But if the bootstrapping 
> follower replica falls behind the WAL GC threshold, leader replica will evict 
> it and the system will start re-replicating it elsewhere.  In cases when the 
> amount of Raft transactions for a tablet is low, this approach would allow 
> for longer bootstrapping times of tablet replicas.  That might be especially 
> beneficial in cases when a tablet server with IO-heavy tablet replicas is 
> being restarted, and there aren't many incoming updates/inserts for tablets 
> hosted by the tablet server.
> However, the approach above requires the Raft consensus object for a 
> bootstrapping replica to be at least partially functional, so it entails 
> reading at least some information about a replica from the on-disk consensus 
> metadata prior to proper bootstrapping of a tablet replica by a tablet server.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to