[
https://issues.apache.org/jira/browse/KUDU-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363554#comment-15363554
]
David Alves commented on KUDU-1512:
-----------------------------------
Duping the chat I had with JD about this here:
[5:36 PM] David Alves: think I found the culprit
[5:37 PM] David Alves: in ts_tablet_manager.cc, StartRemoteBootstrap() deletes
the remote bootstrap client before we open the tablet
[5:38 PM] David Alves: we need to do that at least _after_ we open the tablet,
or even after we're sure we're connected to the leader
[5:38 PM] David Alves: (deleting the remote bootstrap client releases anchors)
[5:38 PM] Jean-Daniel Cryans: but bootstrapping can take seconds/minutes
[5:39 PM] David Alves: yeah, that's the problem, at the very least it will take
some seconds
[5:40 PM] David Alves: doing it after the tablet is open is not enough though,
we need to make sure we're at least connected to a leader
[5:40 PM] David Alves: that avoids the messy coordination on the leader side
[5:44 PM] David Alves: I think something like a "last heard from leader" field
that gets set each time we receive a leader message would be helpful not only
there, but in other situations where we need to track contact with the leader
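
The ordering proposed in the chat can be sketched roughly as below. This is only
an illustration under stated assumptions, not Kudu's actual code: AnchorRegistry,
BootstrapSession, LeaderContact and MaybeEndSession are hypothetical names, and
the anchor registry (which would live on the leader) and the session (which
would live on the new follower) are placed in one process for brevity. The
indexes 256736 and 276284 are taken from the log excerpt in the issue. The idea
is that the session, and with it the anchor, is only released once the tablet
is open and a leader message has arrived since it was opened.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical registry: log GC may not delete ops at or above the
// smallest anchored index.
class AnchorRegistry {
 public:
  void Register(const std::string& owner, int64_t index) { anchors_[owner] = index; }
  void Unregister(const std::string& owner) { anchors_.erase(owner); }

  // Clamps the index GC would like to reach to the smallest anchor.
  int64_t GcBound(int64_t wanted) const {
    int64_t bound = wanted;
    for (const auto& kv : anchors_) bound = std::min(bound, kv.second);
    return bound;
  }

 private:
  std::map<std::string, int64_t> anchors_;
};

// Hypothetical session: holds an anchor for its whole lifetime, so
// destroying it is what lets GC move past the anchored index.
class BootstrapSession {
 public:
  BootstrapSession(AnchorRegistry* registry, std::string id, int64_t first_needed_index)
      : registry_(registry), id_(std::move(id)) {
    registry_->Register(id_, first_needed_index);
  }
  ~BootstrapSession() { registry_->Unregister(id_); }
  BootstrapSession(const BootstrapSession&) = delete;
  BootstrapSession& operator=(const BootstrapSession&) = delete;

 private:
  AnchorRegistry* registry_;
  std::string id_;
};

// The "last heard from leader" field: set on every leader message.
class LeaderContact {
 public:
  void OnLeaderMessage(Clock::time_point now) { heard_ = true; last_heard_ = now; }
  bool HeardSince(Clock::time_point t) const { return heard_ && last_heard_ >= t; }

 private:
  bool heard_ = false;
  Clock::time_point last_heard_{};
};

// Ends the session only if the tablet is open and the leader has been
// heard from since it was opened. Returns true if the session was ended.
bool MaybeEndSession(std::unique_ptr<BootstrapSession>* session, bool tablet_open,
                     const LeaderContact& contact, Clock::time_point opened_at) {
  if (!*session || !tablet_open || !contact.HeardSince(opened_at)) return false;
  session->reset();
  return true;
}

// Walks through the scenario and returns the GC bound after the session ends.
int64_t RunExample() {
  AnchorRegistry registry;
  std::unique_ptr<BootstrapSession> session(
      new BootstrapSession(&registry, "new-follower", 256736));
  LeaderContact contact;
  const Clock::time_point opened_at = Clock::time_point() + std::chrono::seconds(100);

  // Tablet is open but no leader message yet: the anchor must stay.
  const bool ended_early = MaybeEndSession(&session, true, contact, opened_at);
  assert(!ended_early);
  (void)ended_early;
  assert(registry.GcBound(276284) == 256736);

  // A leader message arrives after the tablet was opened: safe to release.
  contact.OnLeaderMessage(opened_at + std::chrono::seconds(5));
  const bool ended = MaybeEndSession(&session, true, contact, opened_at);
  assert(ended);
  (void)ended;
  return registry.GcBound(276284);
}
```

Tying the anchor to the session object's lifetime mirrors the observation above
that deleting the remote bootstrap client releases anchors. The sketch only
moves the point at which that deletion is allowed to happen.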
> Remote bootstrap always fails under heavy insert load
> -----------------------------------------------------
>
> Key: KUDU-1512
> URL: https://issues.apache.org/jira/browse/KUDU-1512
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Reporter: Jean-Daniel Cryans
> Priority: Blocker
>
> On a test cluster, I just noticed a case where, after remote bootstrapping a
> tablet, the leader no longer had the log segments needed to start replicating
> to it. Here's an excerpt from the log:
> {noformat}
> I0701 17:07:23.387614 61379 consensus_peers.cc:296] T
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending
> request to remotely bootstrap
> ...
> I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path:
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217
> (GCed ops < 256735)
> ... (TS stopped GCing logs while remote bootstrap was finishing)
> I0701 17:22:48.354138 413 remote_bootstrap_service.cc:242] Request end of
> remote bootstrap session
> d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received
> from {real_user=kudu, eff_user=} at 10.20.130.116:48132
> I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path:
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218
> (GCed ops < 276284)
> I0701 17:23:02.591763 7627 consensus_queue.cc:577] T
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]:
> Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false,
> Last received: 21.256735, Next index: 256736, Last known committed idx:
> 256493, Last exchange result: ERROR, Needs remote bootstrap: false
> I0701 17:23:02.608044 7627 consensus_peers.cc:181] T
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not
> obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status:
> Not found: Failed to read ops 256736..279156: Segment 218 which contained
> index 256736 has been GCed
> {noformat}
> So 9e59a4c24de44e3f9de219df865b4f3b was sending data to
> d80a7427c7d040ac8d949d0cadb3e7c5 for about 16 minutes while receiving
> inserts. As soon as the new follower was done bootstrapping, we GC'd the logs
> we were holding for it. The leader then dropped the new node from the config
> and started remote bootstrap all over again, over and over. Eventually the
> other follower died for a different reason and we never recovered the tablet.