Todd Lipcon created KUDU-1436:
---------------------------------

             Summary: Concurrent remote bootstrap calls from same server can 
crash or result in corrupt replicas
                 Key: KUDU-1436
                 URL: https://issues.apache.org/jira/browse/KUDU-1436
             Project: Kudu
          Issue Type: Bug
          Components: consensus, tserver
    Affects Versions: 0.8.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Blocker


If a BeginRemoteBootstrapSession call times out, the client may send a second 
call, and the server may end up processing both. If the second call (C2 below) 
is processed before the original call (C1), the following race occurs:

- C2: initializing remote bootstrap session
- C1: waiting on lock
- C2: finishes initializing, drops lock, and starts to copy the tablet metadata 
out of the session object (outside of any lock)
- C1: acquires lock and follows the "Re-initializing" code path. This code path 
calls Clear() on the session's snapshots of the tablet metadata.
- C2: may crash, or may copy an incomplete version of the metadata (e.g. with 
missing fields)
- C2: responds to the client
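
The schedule above can be replayed deterministically in a small model. This is 
only a sketch in Python, not Kudu code: BootstrapSession, its method names, and 
the metadata fields are all made up for illustration, and the two calls are 
interleaved by hand in a single thread rather than with real RPC threads. 
init_and_copy() shows one possible mitigation (copying the snapshot before 
dropping the lock), which is not necessarily the fix adopted in Kudu.

```python
import copy
import threading


class BootstrapSession:
    """Hypothetical stand-in for the server-side session object."""

    def __init__(self):
        self.lock = threading.Lock()
        self.metadata = {}  # snapshot of the tablet metadata

    def clear_snapshot(self):
        # First half of the "Re-initializing" path: drop the old snapshot.
        self.metadata.clear()

    def fill_snapshot(self):
        # Second half: take a fresh snapshot (field names are made up).
        self.metadata.update({"tablet_id": "t1", "schema": "s1", "blocks": [1, 2]})

    def init(self):
        with self.lock:
            self.clear_snapshot()
            self.fill_snapshot()

    def init_and_copy(self):
        # Safer variant: copy the snapshot before releasing the lock.
        with self.lock:
            self.clear_snapshot()
            self.fill_snapshot()
            return copy.deepcopy(self.metadata)


def racy_interleaving():
    """Replays the problematic schedule step by step in one thread."""
    session = BootstrapSession()
    session.init()                # C2: initializes, then drops the lock
    with session.lock:            # C1: acquires the lock ...
        session.clear_snapshot()  # C1: ... and clears the snapshot
        # C2: copies the metadata without holding the lock
        response = copy.deepcopy(session.metadata)
        session.fill_snapshot()   # C1: finishes re-initializing
    return response


def safe_interleaving():
    session = BootstrapSession()
    response = session.init_and_copy()  # C2: copy taken under the lock
    session.init()                      # C1: re-initializes afterwards
    return response
```

In this model racy_interleaving() hands C2 a snapshot with every field 
missing, while safe_interleaving() returns the complete snapshot. Copying under 
the lock only covers the incomplete-metadata case; it does nothing about C1 
re-anchoring logs and blocks underneath C2.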

This can cause a number of issues:
- If C2 ends up getting a partially-initialized metadata protobuf, we can 
trigger a crash in RPC (we don't handle sending responses that have missing 
required fields)
- C2 might actually get a fully correct response back to the client. In the 
meantime, however, C1 has un-anchored and re-anchored logs and blocks. This 
means that C2 will eventually copy log entries that are newer than its 
metadata snapshot, which can trigger an assertion like the one in KUDU-1046
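
One way to avoid the second problem would be to make session initialization 
idempotent, so that a retried or delayed call reuses the existing snapshot and 
anchors instead of re-initializing them. The sketch below is a heavily 
simplified model under that assumption, not the actual Kudu fix: 
IdempotentSession, the tablet dict, and the index fields are hypothetical 
names, and log anchoring is reduced to a single integer.

```python
import threading


class IdempotentSession:
    """Hypothetical session that initializes at most once."""

    def __init__(self, tablet):
        self.lock = threading.Lock()
        self.tablet = tablet        # dict standing in for live tablet state
        self.initialized = False
        self.snapshot_index = None  # last log index covered by the metadata snapshot
        self.anchored_index = None  # log index this session keeps anchored

    def init(self):
        with self.lock:
            if self.initialized:
                # A retried call reuses the existing snapshot and anchor.
                return self.snapshot_index, self.anchored_index
            self.snapshot_index = self.tablet["last_log_index"]
            self.anchored_index = self.tablet["last_log_index"]
            self.initialized = True
            return self.snapshot_index, self.anchored_index


tablet = {"last_log_index": 10}
session = IdempotentSession(tablet)
first = session.init()         # C2 is processed first
tablet["last_log_index"] = 25  # the tablet keeps accepting writes
second = session.init()        # C1 arrives later and changes nothing
```

Because the second call returns the same pair as the first, the anchor never 
moves past the metadata snapshot in this model.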

Basically all bets are off when this race is triggered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
