[ https://issues.apache.org/jira/browse/KUDU-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bankim Bhavsar updated KUDU-3312:
---------------------------------
    Description:
When bringing up a new Kudu cluster with multiple masters, the masters must be brought up together and should start within a short time window of 30 seconds (FLAGS_raft_get_node_instance_timeout_ms).

However, while bringing up multiple masters on Kubernetes, it was noticed that the bring-up sometimes fails because the masters aren't started together within that short time window. Simply setting FLAGS_raft_get_node_instance_timeout_ms to a higher value didn't help in some cases, as DNS resolution would fail in SetPermanentUuidForRemotePeer() at the very beginning.

{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager: Network error: Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to create new distributed Raft config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port: 7051 }: unable to resolve address for kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or service not known
{code}

So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS resolution failure, in addition to retrying the RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627

  was:
When bringing up a new Kudu cluster with multiple masters, these masters must be brought up together and should start within a short time window of 30 secs (FLAGS_raft_get_node_instance_timeout_ms)

However bringing up multiple masters on Kubernetes noticed that bring up of multiple masters fail sometimes since masters aren't brought up together within a short time window. Simply configuring FLAGS_raft_get_node_instance_timeout_ms to a higher timeout didn't help in some cases as the DNS resolution would fail in SetPermanentUuidForRemotePeer() at the very beginning.
{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager: Network error: Unable to initialize catalog manager: Failed to initialize sys tables async: Failed to create new distributed Raft config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { host: "kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port: 7051 }: unable to resolve address for kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or service not known
{code}

So the function SetPermanentUuidForRemotePeer() needs to be retry for proxy creation/DNS failure in addition to RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627

> SetPermanentUuidForRemotePeer() isn't resilient to DNS resolution failure
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3312
>                 URL: https://issues.apache.org/jira/browse/KUDU-3312
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, master
>            Reporter: Bankim Bhavsar
>            Priority: Major
>
> When bringing up a new Kudu cluster with multiple masters, the masters must
> be brought up together and should start within a short time window of 30
> seconds (FLAGS_raft_get_node_instance_timeout_ms).
>
> However, while bringing up multiple masters on Kubernetes, it was noticed
> that the bring-up sometimes fails because the masters aren't started
> together within that short time window. Simply setting
> FLAGS_raft_get_node_instance_timeout_ms to a higher value didn't help in
> some cases, as DNS resolution would fail in SetPermanentUuidForRemotePeer()
> at the very beginning.
>
> {code}
> E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog
> manager: Network error: Unable to initialize catalog manager: Failed to
> initialize sys tables async: Failed to create new distributed Raft
> config: Unable to resolve UUID for peer member_type: VOTER last_known_addr {
> host:
> "kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local"
> port: 7051 }: unable to resolve address for
> kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or
> service not known
> {code}
>
> So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS
> resolution failure, in addition to retrying the RPC request.
> https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
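
The retry requested in the description could look roughly like the following C++ sketch. It is not Kudu's code: AttemptResult, RetryUntilDeadline, and FlakyResolver are hypothetical names introduced here for illustration, and the backoff values are arbitrary. The point it tries to show is that the whole attempt (address resolution, proxy creation, and the RPC) sits inside the retried callable, so a transient "Name or service not known" is retried until the deadline instead of failing on the first try.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Outcome of one attempt. Hypothetical type for this sketch only;
// it is not part of Kudu.
struct AttemptResult {
  bool ok;
  std::string error;
};

// Calls `attempt` until it succeeds or `timeout` has elapsed, sleeping
// with capped exponential backoff between tries. `attempt` is meant to
// cover every step that can fail transiently (resolve the address,
// create the proxy, send the RPC), so a failure at any step is retried.
inline AttemptResult RetryUntilDeadline(
    const std::function<AttemptResult()>& attempt,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds initial_backoff = std::chrono::milliseconds(10),
    std::chrono::milliseconds max_backoff = std::chrono::milliseconds(1000)) {
  using Clock = std::chrono::steady_clock;
  assert(timeout.count() >= 0);
  const auto deadline = Clock::now() + timeout;
  auto backoff = initial_backoff;
  while (true) {
    AttemptResult last = attempt();  // always tried at least once
    if (last.ok) return last;
    const auto now = Clock::now();
    if (now >= deadline) {
      last.error = "timed out; last error: " + last.error;
      return last;
    }
    // Never sleep past the deadline.
    const auto remaining =
        std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
    std::this_thread::sleep_for(std::min(backoff, remaining));
    backoff = std::min(backoff * 2, max_backoff);
  }
}

// Simulated peer whose name becomes resolvable only after `failures`
// failed attempts, like a master pod that has not started yet.
// `calls` counts how many times the attempt ran.
inline std::function<AttemptResult()> FlakyResolver(int failures, int* calls) {
  return [failures, calls]() -> AttemptResult {
    if (++*calls <= failures) return {false, "Name or service not known"};
    return {true, ""};
  };
}
```

With a simulated peer that fails twice, RetryUntilDeadline(FlakyResolver(2, &calls), std::chrono::milliseconds(2000)) succeeds on the third call. If the peer never becomes resolvable, the result carries the last error seen, so the original DNS message is still visible in the log.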