I think this is the cause. When using the ALB, each target master must have the 
Router running. Either this is not documented well enough, or I’m not reading 
the docs correctly, but my understanding was that the connection comes into 
some generic receiver and OpenShift takes over from there. With the ALB, the 
connection actually lands on the designated master box, and that box, 
apparently, must have everything it needs to act as a designated oc master. It 
looks like I need to remove the masters that I no longer consider real ones 
from the ALB targets, and that should take care of my situation.
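
If it helps anyone else, a rough sketch of that cleanup with the AWS CLI; the 
target group ARN and instance ID below are placeholders, not my real values:

    # See which targets the ALB currently has and their health
    aws elbv2 describe-target-health \
        --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/masters/0123456789abcdef

    # Deregister the nodes that should no longer receive traffic
    aws elbv2 deregister-targets \
        --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/masters/0123456789abcdef \
        --targets Id=i-0123456789abcdef0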

 
From: Clayton Coleman <ccole...@redhat.com> 
Sent: Sunday, September 2, 2018 9:31 PM
To: Stan Varlamov <stan.varla...@exlinc.com>
Cc: users@lists.openshift.redhat.com
Subject: Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

 
When you were experiencing the outage, was the ALB listing 2/3 healthy backends?  
I’m not as familiar with ALB as I am with ELB, but what you are describing 
sounds like the frontend was only able to see one of the pods.
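
One way to confirm what the ALB was seeing at the time is the AWS CLI; the 
target group ARN here is just a placeholder:

    aws elbv2 describe-target-health \
        --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/router/0123456789abcdef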


On Sep 2, 2018, at 9:21 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

AWS ALB

 
From: Clayton Coleman <ccole...@redhat.com> 
Sent: Sunday, September 2, 2018 9:11 PM
To: Stan Varlamov <stan.varla...@exlinc.com>
Cc: users@lists.openshift.redhat.com
Subject: Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

 
Routers all watch all routes.  What are you fronting your routers with for HA?  
VRRP?  An F5 or cloud load balancer?  DNS?
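
A sanity check for the "all routers watch all routes" behavior is to compare 
the route list with what each router pod has actually loaded into HAProxy; the 
pod name and config path below are typical for a 3.x router image, so treat 
them as assumptions:

    # All routes the cluster knows about
    oc get routes --all-namespaces

    # Count the backends a given router pod has written into its HAProxy config
    oc -n default exec router-1-abcde -- \
        grep -c "^backend" /var/lib/haproxy/conf/haproxy.config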


On Sep 2, 2018, at 6:18 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

Went through a pretty scary experience of a partial and uncontrollable outage 
in a 3.9 cluster that turned out to be caused by issues in the default, 
out-of-the-box Router. The original installation had 3 region=infra nodes where 
the 3 router pods were installed by the generic ansible cluster installation. 2 
of the 3 nodes were subsequently re-labeled at some point in the past, and 
after one node was restarted, random routes suddenly started “disappearing”, 
causing 502s. I noticed that one of the 3 Router pods was in Pending due to a 
lack of available nodes. Bottom line: until I got all 3 pods back into 
operation (I tried dropping the nodeselector requirements but ended up 
re-labeling the nodes back to infra), the routes would not come back. I would 
expect even one working Router to be able to serve all routes in the cluster, 
but that was not the case. I couldn’t find a pattern in which routes were off 
vs. those that stayed on, and some routes would pop in and out of operation. Is 
there something in the Router design that relies on all of its pods working? It 
appears that individual Router pods are “responsible” for some routes in the 
cluster rather than just providing redundancy.
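
For reference, the checks and recovery steps looked roughly like this; node 
names are placeholders, and in 3.9 the default router is a deploymentconfig in 
the default project:

    # See where the router pods are (or are not) running
    oc get pods -n default -o wide | grep router

    # Check which node selector the router deploymentconfig asks for
    oc get dc router -n default -o jsonpath='{.spec.template.spec.nodeSelector}'

    # Confirm node labels, then put region=infra back on the nodes
    oc get nodes --show-labels
    oc label node node1.example.com region=infra --overwrite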

 
 
 

_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
