[
https://issues.apache.org/jira/browse/KUDU-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365047#comment-15365047
]
Jean-Daniel Cryans commented on KUDU-1408:
------------------------------------------
KUDU-1512 has logs that show this in action.
What really worries me about this is that with replication=3 there's no room
for machine failures (like in KUDU-1512). Start a big insert, have a few slow
nodes, one node dies, and suddenly you can't use a bunch of tablets. Being next
to the point of no return (PONR) should be considered a crisis. It would be
different with replication=5, because one more node can fail before you reach
the PONR.
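
To make the arithmetic concrete, here is a minimal sketch assuming a plain
majority-quorum model, where a tablet stays usable only while a strict majority
of its replicas is healthy. The function names are made up for illustration and
are not Kudu APIs; "already_unavailable" counts replicas that are stuck, for
example one that keeps failing to catch up after a copy.

```python
def failures_tolerated(replication_factor: int) -> int:
    """Replicas that can be lost while a strict majority still remains."""
    majority = replication_factor // 2 + 1
    return replication_factor - majority


def spare_failures(replication_factor: int, already_unavailable: int) -> int:
    """Further failures that can be absorbed; 0 means we are at the PONR."""
    return failures_tolerated(replication_factor) - already_unavailable


# One replica is stuck in both cases (hypothetical scenario).
spare_rf3 = spare_failures(3, already_unavailable=1)
spare_rf5 = spare_failures(5, already_unavailable=1)
```

Under these assumptions, replication=3 with one stuck replica has no spare
failures left, while replication=5 still has one.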
> Adding a replica may never succeed if copying tablet takes longer than the
> log retention time
> ---------------------------------------------------------------------------------------------
>
> Key: KUDU-1408
> URL: https://issues.apache.org/jira/browse/KUDU-1408
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tserver
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> Currently, while a remote bootstrap session is in progress, we anchor the
> logs from the time at which it started. However, as soon as the session
> finishes, we drop the anchor and delete any logs that only the anchor was
> retaining. When the tablet copy itself takes longer than the log retention
> period, a scenario like the following is likely:
> - TS A starts downloading from TS B. It plans to download segments 1-4 and
> adds an anchor.
> - TS B handles writes for 20 minutes, rolling the log many times (e.g. up to
> log segment 20)
> - TS A finishes downloading, and ends the remote bootstrap session
> - TS B no longer has an anchor, so it GCs logs 1-16.
> - TS A finishes opening the tablet it just copied, but is immediately unable
> to catch up (because it only has segments 1-4, while the leader, TS B, only has 17-20)
> - TS B evicts TS A
> This loop will go on basically forever until the write workload stops on TS B.
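
The loop described in the issue can be reproduced with a toy model. This is
only a sketch under simplifying assumptions, not Kudu code: LeaderLog,
copy_attempt and the "retain" parameter are hypothetical names, retention is
modelled as "keep the newest N segments" rather than by time, and the segment
numbers are taken from the example in the description.

```python
from typing import Optional


class LeaderLog:
    """Toy model of the log segments held by the leader (TS B)."""

    def __init__(self, oldest: int, newest: int, retain: int):
        self.oldest = oldest    # oldest segment still on disk
        self.newest = newest    # newest segment
        self.retain = retain    # hypothetical: segments kept when unanchored
        self.anchor: Optional[int] = None

    def gc(self) -> None:
        # Delete everything older than the retention window, unless an
        # anchor pins an older segment.
        floor = self.newest - self.retain + 1
        if self.anchor is not None:
            floor = min(floor, self.anchor)
        self.oldest = max(self.oldest, floor)


def copy_attempt(leader: LeaderLog, rolls_during_copy: int) -> bool:
    """Run one remote bootstrap session; True if the new replica can catch up."""
    leader.anchor = leader.oldest          # session starts: anchor the logs
    last_copied = leader.newest            # TS A plans to copy oldest..newest
    leader.newest += rolls_during_copy     # writes keep rolling the log
    leader.gc()                            # anchored: nothing needed is deleted
    leader.anchor = None                   # session ends: anchor is dropped
    leader.gc()                            # old segments are now collected
    # TS A needs the segment right after the last one it copied.
    return last_copied + 1 >= leader.oldest


leader = LeaderLog(oldest=1, newest=4, retain=4)
results = [copy_attempt(leader, rolls_during_copy=16) for _ in range(3)]
```

In this model every attempt made while the write workload continues fails,
because the segment TS A needs next has already been collected. A later call
with rolls_during_copy=0 (writes stopped) succeeds, which matches the
observation that the loop only ends when the workload stops.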