[ https://issues.apache.org/jira/browse/KUDU-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir reassigned KUDU-2800:
------------------------------

    Assignee: Vladimir

> Avoid 'unintended' re-replication of long-bootstrapping tablet replicas
> -----------------------------------------------------------------------
>
>                 Key: KUDU-2800
>                 URL: https://issues.apache.org/jira/browse/KUDU-2800
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.9.1, 1.10.0
>            Reporter: Alexey Serbin
>            Assignee: Vladimir
>            Priority: Major
>              Labels: newbie
>
> As implemented in
> https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576,
> the logic for tracking the 'health' of tablet replicas cannot differentiate
> between bootstrapping and failed replicas.
>
> As a result, if a tablet replica takes longer to bootstrap than the
> interval specified by the {{--follower_unavailable_considered_failed_sec}}
> run-time flag, the system can start re-replicating the tablet replica
> elsewhere.
>
> One option might be for a bootstrapping replica to send a special
> {{PeerStatus}} in its response to a Raft message from the leader replica,
> and to update the logic referenced above accordingly. The response might
> also include additional information on the current progress of the
> bootstrap process.
>
> We probably also need to add a separate timeout to track a stale
> bootstrapping replica, so that its health is reported as FAILED once the
> leader observes the replica stuck in bootstrapping with no forward progress
> for longer than the timeout specified by the new parameter.
>
> However, the approach above requires the Raft consensus object for a
> bootstrapping replica to be at least partially functional, so it entails
> reading at least some information about the replica from the on-disk
> consensus metadata before the tablet server properly bootstraps the
> tablet replica.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
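
The health-tracking change proposed in the issue description can be sketched roughly as follows. This is a language-neutral illustration in Python, not Kudu code: the `BOOTSTRAPPING` status, the `PeerState` fields, the `bootstrap_stall_sec` parameter, and both function names are hypothetical and introduced only for this example. The sketch assumes the leader records a progress value from each bootstrapping response and treats the replica as failed only when that value stops advancing.

```python
from dataclasses import dataclass
from enum import Enum


class PeerStatus(Enum):
    OK = "OK"
    BOOTSTRAPPING = "BOOTSTRAPPING"  # hypothetical status proposed in the issue


class Health(Enum):
    HEALTHY = "HEALTHY"
    FAILED = "FAILED"


@dataclass
class PeerState:
    """Leader-side bookkeeping for one follower (hypothetical fields)."""
    status: PeerStatus = PeerStatus.OK
    last_ok_time: float = 0.0        # last successful exchange, in seconds
    bootstrap_progress: int = 0      # last progress value reported by the peer
    last_progress_time: float = 0.0  # when that progress value last advanced


def record_bootstrap_response(peer: PeerState, progress: int, now: float) -> None:
    """Record a response in which the peer reports that it is bootstrapping."""
    first_report = peer.status is not PeerStatus.BOOTSTRAPPING
    # Restart the stall clock on the first report or on forward progress.
    if first_report or progress > peer.bootstrap_progress:
        peer.bootstrap_progress = progress
        peer.last_progress_time = now
    peer.status = PeerStatus.BOOTSTRAPPING


def evaluate_health(peer: PeerState, now: float,
                    unavailable_sec: float, bootstrap_stall_sec: float) -> Health:
    """Decide health, applying a separate timeout to bootstrapping peers."""
    if peer.status is PeerStatus.BOOTSTRAPPING:
        # A bootstrapping peer fails only after stalling for too long.
        stalled_for = now - peer.last_progress_time
        return Health.FAILED if stalled_for > bootstrap_stall_sec else Health.HEALTHY
    # Otherwise apply the existing rule: too long since the last good exchange.
    silent_for = now - peer.last_ok_time
    return Health.FAILED if silent_for > unavailable_sec else Health.HEALTHY
```

With illustrative timeouts of 300 and 600 seconds, a peer that keeps reporting bootstrap progress stays HEALTHY past the 300-second mark, while one whose progress has not advanced for more than 600 seconds is reported as FAILED.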