[ 
https://issues.apache.org/jira/browse/KUDU-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875253#comment-16875253
 ] 

Grant Henke commented on KUDU-1620:
-----------------------------------

There was a discussion in the Slack channel about how this impacts 
Kubernetes/Docker deployments. 

If a container/instance/host running a master is down or has crashed, tservers 
are unable to restart and fail with something like:
{noformat}
Tservers start crashing with the logs saying 'Couldn't resolve master service 
address 'master-1': unable to resolve address for master-1: Temporary failure 
in name resolution'. 
{noformat}

[~tlipcon] mentioned an option for a workaround in the thread:
{quote}
ah, maybe we will stay up if the master resolution fails at runtime, but will 
refuse to start if one of the masters is unresolvable.
which is probably on purpose to avoid people starting with typos in their 
master list and not noticing
for the k8s use case I can see why you might want it, though -- perhaps we need 
some flag to explicitly allow starting when some number of the masters are 
unresolvable, and be sure it keeps retrying to resolve
{quote}
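The flag idea above could be sketched roughly like this (plain Python with hypothetical names, not Kudu code -- Kudu itself is C++; 7051 is the default master RPC port): tolerate up to some number of unresolvable masters at startup, retrying each resolution a few times, and only refuse to start when too many fail:
{noformat}
import socket
import time

def resolve_masters(masters, max_unresolvable=0, attempts=3, delay=0.1):
    """Resolve each master hostname, retrying transient DNS failures.

    Returns {host: address-or-None}. Raises only when more than
    max_unresolvable hosts still fail after all attempts.
    """
    resolved = {}
    for host in masters:
        addr = None
        for _ in range(attempts):
            try:
                # First sockaddr of the first getaddrinfo() result.
                addr = socket.getaddrinfo(host, 7051)[0][4][0]
                break
            except socket.gaierror:
                time.sleep(delay)
        resolved[host] = addr
    unresolvable = [h for h, a in resolved.items() if a is None]
    if len(unresolvable) > max_unresolvable:
        raise RuntimeError("too many unresolvable masters: %s" % unresolvable)
    return resolved
{noformat}
With max_unresolvable=0 this keeps today's strict behavior (typos in the master list fail fast); a nonzero value gives the k8s-friendly mode, provided the runtime layer keeps retrying resolution afterwards.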

This also appears to be why we added a 2-second delay to the docker entrypoint 
script: 
https://github.com/apache/kudu/blob/ad798391fdf22c1632a641dbb6be80085636602a/docker/kudu-entrypoint.sh#L80
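A fixed sleep only papers over slow DNS registration. A hypothetical sketch (not the actual entrypoint logic) of the more robust alternative -- poll until the name resolves, up to a deadline -- would be:
{noformat}
import socket
import time

def wait_for_resolution(host, timeout=30.0, interval=0.5):
    """Poll DNS until `host` resolves or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.getaddrinfo(host, None)
            return True
        except socket.gaierror:
            time.sleep(interval)
    return False
{noformat}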



 

> Consensus peer proxy hostnames should be reresolved on failure
> --------------------------------------------------------------
>
>                 Key: KUDU-1620
>                 URL: https://issues.apache.org/jira/browse/KUDU-1620
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.0.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: docker
>
> Noticed this while documenting the workflow to replace a dead master, which 
> currently bypasses Raft config changes in favor of having the replacement 
> master "masquerade" as the dead master via DNS changes.
> Internally we never rebuild consensus peer proxies in the event of network 
> failure; we assume that the peer will return at the same location. Nominally 
> this is reasonable; allowing peers to change host/port information on the fly 
> is tricky and has yet to be implemented. But, we should at least retry the 
> DNS resolution; not doing so forces the workflow to include steps to restart 
> the existing masters, which creates a (small) availability outage.
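The shape of the suggested fix can be illustrated with a hypothetical Python sketch (Kudu's actual proxies are C++; `transport` here is a stand-in for the RPC send path): cache the resolved address as today, but on a network error drop the cache and re-resolve once before retrying, so a replacement peer behind the same DNS name is picked up:
{noformat}
import socket

class ReResolvingPeer:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self.addr = None  # lazily resolved, dropped on network failure

    def _resolve(self):
        self.addr = socket.getaddrinfo(self.host, self.port)[0][4][0]

    def send(self, payload, transport):
        if self.addr is None:
            self._resolve()
        try:
            return transport(self.addr, self.port, payload)
        except ConnectionError:
            # The peer may have moved (e.g. a replacement master
            # masquerading via DNS): re-resolve and retry once.
            self.addr = None
            self._resolve()
            return transport(self.addr, self.port, payload)
{noformat}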



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
