[ https://issues.apache.org/jira/browse/KUDU-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363554#comment-15363554 ]

David Alves commented on KUDU-1512:
-----------------------------------

Duping the chat I had with JD about this here:

[5:36 PM] David Alves: think I found the culprit
[5:37 PM] David Alves: in ts_tablet_manager.cc StartRemoteBootstrap() we delete 
the remote bootstrap client before we open the tablet
[5:38 PM] David Alves: we need to do that at least _after_ we open the tablet, 
or even after we're sure we're connected to the leader
[5:38 PM] David Alves: (deleting the remote bootstrap client releases anchors)
[5:38 PM] Jean-Daniel Cryans: but bootstrapping can take seconds/minutes
[5:39 PM] David Alves: yeah, that's the problem, at the very least it will take 
some seconds
[5:40 PM] David Alves: doing it after the tablet is open is not enough though, 
we need to make sure we're at least connected to a leader
[5:40 PM] David Alves: that avoids the messy coordination on the leader side
[5:44 PM] David Alves: i think something like a "last heard from leader" field 
that gets set each time we receive a leader message would be helpful not only 
there, but for other situations where we need to track that stuff
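
To make the proposed ordering concrete, here is a minimal self-contained sketch (illustrative names only; this is not the actual ts_tablet_manager.cc code or the real RemoteBootstrapClient API). The key point is that the object whose destruction releases the log anchors has to outlive tablet open and the first successful contact with the leader:

{noformat}
// Sketch only: stand-in types with made-up names, not Kudu's classes.
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

// Stand-in for the remote bootstrap client: while it is alive, the leader
// keeps its WAL segments anchored; destroying it lets the leader GC them.
class RemoteBootstrapClientSketch {
 public:
  RemoteBootstrapClientSketch()  { std::cout << "log anchors registered on the leader\n"; }
  ~RemoteBootstrapClientSketch() { std::cout << "log anchors released; leader may GC\n"; }
  void FetchAll() { std::cout << "copied blocks and WAL segments\n"; }
};

// Stand-ins for the expensive steps that happen after the data copy.
void OpenTablet() { std::this_thread::sleep_for(std::chrono::seconds(1)); }
bool HeardFromLeaderRecently() { return true; }  // e.g. a "last heard from leader" check

int main() {
  auto client = std::make_unique<RemoteBootstrapClientSketch>();
  client->FetchAll();

  // Buggy ordering from the report: resetting 'client' here releases the
  // anchors while OpenTablet() can still take seconds or minutes, so the
  // leader may GC ops the new replica hasn't replicated yet.

  OpenTablet();                         // tablet bootstrap + consensus start
  while (!HeardFromLeaderRecently()) {  // wait until we're talking to the leader
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  client.reset();  // only now is it safe to drop the anchors
  return 0;
}
{noformat}

And a sketch of the "last heard from leader" idea, again with made-up names: a timestamp that gets bumped on every message handled from the current leader, which the code above (or anything else) could poll before dropping the anchors:

{noformat}
// Sketch only: illustrative class, not an existing Kudu type.
#include <atomic>
#include <chrono>

class LeaderContactTrackerSketch {
 public:
  using Clock = std::chrono::steady_clock;

  // Call from the path that processes a request from the current leader.
  void RecordLeaderContact() {
    last_heard_.store(Clock::now().time_since_epoch().count(),
                      std::memory_order_relaxed);
  }

  // True if we have heard from the leader within 'max_staleness'.
  bool HeardFromLeaderWithin(std::chrono::milliseconds max_staleness) const {
    const auto raw = last_heard_.load(std::memory_order_relaxed);
    if (raw == 0) return false;  // never heard from a leader yet
    const auto last = Clock::time_point(Clock::duration(raw));
    return Clock::now() - last <= max_staleness;
  }

 private:
  std::atomic<Clock::rep> last_heard_{0};
};
{noformat}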

> Remote bootstrap always fails under heavy insert load
> -----------------------------------------------------
>
>                 Key: KUDU-1512
>                 URL: https://issues.apache.org/jira/browse/KUDU-1512
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>
> I just noticed on a test cluster a case where after remote bootstrapping a 
> tablet we were lacking the proper logs to start replicating to it. Here's a 
> bit of log:
> {noformat}
> I0701 17:07:23.387614 61379 consensus_peers.cc:296] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending 
> request to remotely bootstrap
> ...
> I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path: 
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217 
> (GCed ops < 256735)
> ... (TS stopped GCing logs while remote bootstrap was finishing)
> I0701 17:22:48.354138   413 remote_bootstrap_service.cc:242] Request end of 
> remote bootstrap session 
> d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received 
> from {real_user=kudu, eff_user=} at 10.20.130.116:48132
> I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path: 
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218 
> (GCed ops < 276284)
> I0701 17:23:02.591763  7627 consensus_queue.cc:577] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]: 
> Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false, 
> Last received: 21.256735, Next index: 256736, Last known committed idx: 
> 256493, Last exchange result: ERROR, Needs remote bootstrap: false
> I0701 17:23:02.608044  7627 consensus_peers.cc:181] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not 
> obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status: 
> Not found: Failed to read ops 256736..279156: Segment 218 which contained 
> index 256736 has been GCed
> {noformat}
> So 9e59a4c24de44e3f9de219df865b4f3b was sending data to 
> d80a7427c7d040ac8d949d0cadb3e7c5 for about 16 minutes while receiving 
> inserts. As soon as the new follower was done bootstrapping, we GC'd the logs 
> we were holding for it. What happened after is that the leader dropped that 
> new node from the config, and started all over again... over and over. 
> Eventually the other follower died for a different reason and we never 
> recovered the tablet.


