[ 
https://issues.apache.org/jira/browse/KUDU-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364918#comment-15364918
 ] 

Todd Lipcon commented on KUDU-1512:
-----------------------------------

Is this a duplicate of KUDU-1408?

> Remote bootstrap always fails under heavy insert load
> -----------------------------------------------------
>
>                 Key: KUDU-1512
>                 URL: https://issues.apache.org/jira/browse/KUDU-1512
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>
> I just noticed a case on a test cluster where, after remote bootstrapping a 
> tablet, we lacked the logs needed to start replicating to it. Here's a 
> bit of the log:
> {noformat}
> I0701 17:07:23.387614 61379 consensus_peers.cc:296] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending 
> request to remotely bootstrap
> ...
> I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path: 
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217 
> (GCed ops < 256735)
> ... (TS stopped GCing logs while remote bootstrap was finishing)
> I0701 17:22:48.354138   413 remote_bootstrap_service.cc:242] Request end of 
> remote bootstrap session 
> d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received 
> from {real_user=kudu, eff_user=} at 10.20.130.116:48132
> I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path: 
> /data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218 
> (GCed ops < 276284)
> I0701 17:23:02.591763  7627 consensus_queue.cc:577] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]: 
> Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false, 
> Last received: 21.256735, Next index: 256736, Last known committed idx: 
> 256493, Last exchange result: ERROR, Needs remote bootstrap: false
> I0701 17:23:02.608044  7627 consensus_peers.cc:181] T 
> 807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
> d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not 
> obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status: 
> Not found: Failed to read ops 256736..279156: Segment 218 which contained 
> index 256736 has been GCed
> {noformat}
> So 9e59a4c24de44e3f9de219df865b4f3b was sending data to 
> d80a7427c7d040ac8d949d0cadb3e7c5 for about 16 minutes while receiving 
> inserts. As soon as the new follower finished bootstrapping, we GC'd the logs 
> we were holding for it. The leader then dropped the new node from the config 
> and started all over again, over and over. Eventually the other follower died 
> for a different reason and we never recovered the tablet.
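
The arithmetic behind the "Not found" error in the log above can be sketched as follows. This is only an illustrative model, not Kudu's actual code: the `LeaderLog` class and `can_serve` method are hypothetical names, and the sketch assumes "GCed ops < N" means that index N is the lowest op still retained. The index values are taken from the log excerpt.

```python
# Hypothetical model of WAL retention on the leader; not Kudu's implementation.
from dataclasses import dataclass


@dataclass
class LeaderLog:
    min_retained_index: int  # ops with an index below this have been GCed

    def can_serve(self, next_index: int) -> bool:
        """Return True if the op at next_index is still in a retained segment."""
        return next_index >= self.min_retained_index


# Values from the log excerpt above.
peer_next_index = 256736        # "Next index: 256736"
before_gc = LeaderLog(256735)   # after wal-000000217 was deleted ("GCed ops < 256735")
after_gc = LeaderLog(276284)    # after wal-000000218 was deleted ("GCed ops < 276284")

print(before_gc.can_serve(peer_next_index))  # True
print(after_gc.can_serve(peer_next_index))   # False
```

Under these assumptions, the new follower could have been caught up from the leader's log until segment 218 was deleted; after that, the first op it needs is no longer available.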



