Alexey Serbin created KUDU-3530:
-----------------------------------

             Summary: Add guardrails to prevent inconsistencies on attempts to add multiple Kudu masters at once in a cluster
                 Key: KUDU-3530
                 URL: https://issues.apache.org/jira/browse/KUDU-3530
             Project: Kudu
          Issue Type: Improvement
          Components: master
            Reporter: Alexey Serbin

There have been a few reports of inconsistencies in the system catalog tablet's Raft configuration after trying to add multiple new masters to a Kudu cluster at once. It seems the current implementation of the {{AddMaster}} RPC isn't thread-safe: the Raft configuration of the system catalog tablet became corrupted after an attempt to add several extra masters at once (i.e. starting several of the to-be-added masters simultaneously). The original Kudu master reported an error like the one below upon its next restart:

{noformat}
Invalid argument: RunMasterServer() failed: Unable to initialize catalog manager: Failed to initialize sys tables async: on-disk master list (:0) and provided master list (m1.my.org:7051, m2.my.org:7051, m3.my.org:7051) differ by more than one address. Their symmetric difference is: :0, m1.my.org:7051, m2.my.org:7051, m3.my.org:7051
{noformat}

It would be great to have guardrails preventing such corruption. Essentially, we should enforce the one-new-master-at-a-time invariant, which the current implementation implicitly assumes but doesn't consistently enforce.
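
To make the proposed invariant concrete, here is a minimal sketch in Python of what a one-new-master-at-a-time guard could look like. It is not Kudu's code and does not reflect how {{AddMaster}} is actually implemented: the class {{MasterConfigGuard}}, its methods, and the helper {{symmetric_difference}} are hypothetical names made up for illustration. The assumption is that each addition is split into a "begin" and a "commit" step, and that a second "begin" is rejected while one is still pending. The helper only mirrors the "differ by more than one address" comparison quoted in the error message above.

```python
import threading


def symmetric_difference(on_disk, provided):
    """Return the sorted addresses present in exactly one of the two lists."""
    return sorted(set(on_disk) ^ set(provided))


class MasterConfigGuard:
    """Hypothetical guard: allows at most one in-flight master addition."""

    def __init__(self, masters):
        self._lock = threading.Lock()
        self._masters = set(masters)
        self._pending = None  # address of the master currently being added

    def begin_add(self, addr):
        # Check and reserve under one lock so that concurrent callers
        # cannot both pass the "nothing pending" test.
        with self._lock:
            if self._pending is not None:
                raise RuntimeError(
                    "add of %s still in progress; rejecting %s"
                    % (self._pending, addr))
            if addr in self._masters:
                raise ValueError("%s is already a master" % addr)
            self._pending = addr

    def commit_add(self, addr):
        # Only the reserved address may be committed.
        with self._lock:
            if self._pending != addr:
                raise RuntimeError("no pending add for %s" % addr)
            self._masters.add(addr)
            self._pending = None

    def abort_add(self, addr):
        # Release the reservation without changing the master list.
        with self._lock:
            if self._pending == addr:
                self._pending = None

    def masters(self):
        with self._lock:
            return sorted(self._masters)
```

With such a guard, starting three new masters at once would result in one accepted request and two rejected ones that could be retried after the first addition commits, so the stored list would never differ from the intended one by more than a single address per step.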