[ https://issues.apache.org/jira/browse/KUDU-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bankim Bhavsar updated KUDU-3312:
---------------------------------
    Description: 
When bringing up a new Kudu cluster with multiple masters, the masters must be 
brought up together: they should all start within a short time window of 30 
seconds (FLAGS_raft_get_node_instance_timeout_ms).

However, while bringing up multiple masters on Kubernetes, we noticed that 
bring-up sometimes fails because the masters aren't started within that window. 
Simply raising FLAGS_raft_get_node_instance_timeout_ms didn't help in some 
cases, because DNS resolution would fail in SetPermanentUuidForRemotePeer() at 
the very beginning.

{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager: 
Network error: Unable to initialize catalog manager: Failed to initialize sys 
tables async: Failed to create new distributed Raft config: Unable to 
resolve UUID for peer member_type: VOTER last_known_addr { host: 
"kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port: 
7051 }: unable to resolve address for 
kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or 
service not known
{code}

So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS 
resolution failures in addition to RPC request failures.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
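
One way to picture the requested behavior is a retry loop that wraps every fallible step, not only the RPC. The following is a minimal, self-contained sketch under stated assumptions, not Kudu's actual code: AttemptResult and RetryUntilDeadline are hypothetical names introduced here for illustration, and the backoff values are arbitrary. The caller-supplied attempt callback is assumed to perform address resolution, proxy creation and the RPC together, so that a "Name or service not known" error is retried the same way as an RPC failure until the timeout expires.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for a status type; this is NOT Kudu's Status class.
struct AttemptResult {
  bool ok;
  std::string error;
};

// Retries `attempt` until it succeeds or `timeout` elapses, sleeping with
// exponential backoff (capped at `max_backoff`) between tries.
// Assumption: `attempt` covers every fallible step (DNS resolution, proxy
// creation and the RPC itself), so all of them are retried uniformly.
// At least one attempt is always made; the last failure is returned if the
// deadline passes without a success.
AttemptResult RetryUntilDeadline(
    const std::function<AttemptResult()>& attempt,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds initial_backoff = std::chrono::milliseconds(10),
    std::chrono::milliseconds max_backoff = std::chrono::milliseconds(1000)) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + timeout;
  auto backoff = initial_backoff;
  AttemptResult last{false, "no attempt made"};
  while (true) {
    last = attempt();
    if (last.ok) return last;

    const auto now = Clock::now();
    if (now >= deadline) return last;

    // Never sleep past the deadline.
    const auto remaining =
        std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
    std::this_thread::sleep_for(std::min(backoff, remaining));
    backoff = std::min(backoff * 2, max_backoff);
  }
}
```

In this sketch the timeout argument would play the role of FLAGS_raft_get_node_instance_timeout_ms, so raising that flag would also extend how long an unresolvable peer hostname is tolerated during startup.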
 

  was:
When bringing up a new Kudu cluster with multiple masters, these masters must 
be brought up together and should start within a short time window of 30 secs 
(FLAGS_raft_get_node_instance_timeout_ms)

However bringing up multiple masters on Kubernetes noticed that bring up of 
multiple masters fail sometimes since masters aren't brought up together within 
a short time window. Simply configuring FLAGS_raft_get_node_instance_timeout_ms 
to a higher timeout didn't help in some cases as the DNS resolution would fail 
in SetPermanentUuidForRemotePeer() at the very beginning.

{code}
E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog manager: 
Network error: Unable to initialize catalog manager: Failed to initialize sys 
tables async: Failed to create new distributed Raft config: Unable to 
resolve UUID for peer member_type: VOTER last_known_addr { host: 
"kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" port: 
7051 }: unable to resolve address for 
kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or 
service not known
{code}

So the function SetPermanentUuidForRemotePeer() needs to be retry for proxy 
creation/DNS failure in addition to RPC request.
https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
 


> SetPermanentUuidForRemotePeer() isn't resilient to DNS resolution failure
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3312
>                 URL: https://issues.apache.org/jira/browse/KUDU-3312
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, master
>            Reporter: Bankim Bhavsar
>            Priority: Major
>
> When bringing up a new Kudu cluster with multiple masters, the masters must 
> be brought up together: they should all start within a short time window of 
> 30 seconds (FLAGS_raft_get_node_instance_timeout_ms).
> However, while bringing up multiple masters on Kubernetes, we noticed that 
> bring-up sometimes fails because the masters aren't started within that 
> window. Simply raising FLAGS_raft_get_node_instance_timeout_ms didn't help 
> in some cases, because DNS resolution would fail in 
> SetPermanentUuidForRemotePeer() at the very beginning.
> {code}
> E0827 19:28:53.052981 91 master.cc:279] Unable to init master catalog 
> manager: Network error: Unable to initialize catalog manager: Failed to 
> initialize sys tables async: Failed to create new distributed Raft 
> config: Unable to resolve UUID for peer member_type: VOTER last_known_addr { 
> host: 
> "kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local" 
> port: 7051 }: unable to resolve address for 
> kudu-master-0.kudu-masters.warehouse-1630092493-z2sz.svc.cluster.local: Name or 
> service not known
> {code}
> So SetPermanentUuidForRemotePeer() needs to retry on proxy creation/DNS 
> resolution failures in addition to RPC request failures.
> https://github.com/apache/kudu/blob/master/src/kudu/consensus/consensus_peers.cc#L627
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
