Jean-Daniel Cryans created KUDU-1512:
----------------------------------------

             Summary: Remote bootstrap always fails under heavy insert load
                 Key: KUDU-1512
                 URL: https://issues.apache.org/jira/browse/KUDU-1512
             Project: Kudu
          Issue Type: Bug
          Components: consensus
            Reporter: Jean-Daniel Cryans
            Priority: Blocker


I just noticed, on a test cluster, a case where after remote bootstrapping a 
tablet we lacked the logs needed to start replicating to it. Here's a bit 
of the log:

{noformat}
I0701 17:07:23.387614 61379 consensus_peers.cc:296] T 
807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Sending 
request to remotely bootstrap
...
I0701 17:12:15.867938 65505 log.cc:728] Deleting log segment in path: 
/data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000217 (GCed 
ops < 256735)
... (TS stopped GC logs while remote bootstrap was finishing)
I0701 17:22:48.354138   413 remote_bootstrap_service.cc:242] Request end of 
remote bootstrap session 
d80a7427c7d040ac8d949d0cadb3e7c5-807ff8e42640482d8d947b693d56ce03 received from 
{real_user=kudu, eff_user=} at 10.20.130.116:48132
I0701 17:22:48.494417 65505 log.cc:728] Deleting log segment in path: 
/data/1/kudu-tserver/wals/807ff8e42640482d8d947b693d56ce03/wal-000000218 (GCed 
ops < 276284)
I0701 17:23:02.591763  7627 consensus_queue.cc:577] T 
807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b [LEADER]: 
Connected to new peer: Peer: d80a7427c7d040ac8d949d0cadb3e7c5, Is new: false, 
Last received: 21.256735, Next index: 256736, Last known committed idx: 256493, 
Last exchange result: ERROR, Needs remote bootstrap: false
I0701 17:23:02.608044  7627 consensus_peers.cc:181] T 
807ff8e42640482d8d947b693d56ce03 P 9e59a4c24de44e3f9de219df865b4f3b -> Peer 
d80a7427c7d040ac8d949d0cadb3e7c5 (e1316.halxg.cloudera.com:7050): Could not 
obtain request from queue for peer: d80a7427c7d040ac8d949d0cadb3e7c5. Status: 
Not found: Failed to read ops 256736..279156: Segment 218 which contained index 
256736 has been GCed
{noformat}

So 9e59a4c24de44e3f9de219df865b4f3b spent about 16 minutes sending data to 
d80a7427c7d040ac8d949d0cadb3e7c5 while still receiving inserts. As soon as the 
new follower finished bootstrapping, we GC'd the logs we had been holding for 
it. The leader then dropped that new node from the config and started all over 
again... over and over. 

Eventually the other follower died for a different reason and we never 
recovered the tablet.
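To make the race concrete, here's a toy model (not Kudu code; all names are hypothetical) of the sequence above: the leader retains WAL segments while the remote bootstrap session is open, but releases that retention the instant the session ends, before the new follower has caught up via normal replication. Under heavy insert load the healthy peers' replication watermark has raced far ahead, so GC immediately deletes the segments the new peer still needs:

```python
# Toy model (not Kudu code) of the WAL GC race described above.
# All class and variable names here are hypothetical.

class Wal:
    def __init__(self):
        self.segments = {}    # first_op_index -> last_op_index
        self.anchors = set()  # op indexes that GC must not pass

    def append_segment(self, first, last):
        self.segments[first] = last

    def gc(self, min_replicated_index):
        # Delete segments lying entirely below the replication
        # watermark AND below every registered anchor.
        floor = min([min_replicated_index, *self.anchors])
        for first in list(self.segments):
            if self.segments[first] < floor:
                del self.segments[first]

    def can_read(self, index):
        return any(f <= index <= l for f, l in self.segments.items())


wal = Wal()
wal.append_segment(1, 100)    # ops the new follower still needs
wal.append_segment(101, 200)  # ops written during the 16-minute copy

catch_up_index = 1            # next op the bootstrapped peer must fetch
wal.anchors.add(catch_up_index)

# Session ends; observed behavior: retention is dropped immediately,
# and the healthy peers have already replicated past op 150.
wal.anchors.discard(catch_up_index)
wal.gc(min_replicated_index=150)
assert not wal.can_read(catch_up_index)  # -> "Segment ... has been GCed"

# One possible fix: keep the anchor until the new peer actually
# catches up, not merely until the bootstrap session closes.
wal2 = Wal()
wal2.append_segment(1, 100)
wal2.append_segment(101, 200)
wal2.anchors.add(catch_up_index)
wal2.gc(min_replicated_index=150)        # anchor pins the segment
assert wal2.can_read(catch_up_index)
```

This is only a sketch of one plausible fix (holding a log anchor for the bootstrapped peer until it catches up); whatever the real mechanism, the leader has to keep the post-copy ops readable or it will evict the peer and re-bootstrap in a loop, as seen here.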



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
