Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down

Clayton Coleman Sun, 02 Sep 2018 08:31:43 -0700

On Sep 2, 2018, at 9:51 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:


I think this is the cause. With the ALB, each target master must have a
working Router. Either this is not documented well enough, or I’m not
reading the docs correctly, but my understanding was that traffic comes
in to some generic receiver and OpenShift takes over from there. With the
ALB, traffic arrives at the actual designated master box, and that box,
apparently, must be able to do everything a designated oc master does.
It looks like I may need to remove from the ALB targets the masters that I
no longer consider real ones, and that would take care of my situation.
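
A minimal sketch of the failure mode described above, assuming the load balancer spreads requests round-robin over three targets and that one of them has no working router behind it. The target names and the round-robin policy are illustrative assumptions, not details taken from this cluster or from the ALB itself.

```python
from itertools import cycle

# Hypothetical ALB target group: True means a router pod is serving on that target.
TARGETS = {"target-a": True, "target-b": True, "target-c": False}


def simulate(requests, targets=TARGETS):
    """Return the HTTP status codes a client would see, assuming plain round-robin."""
    rotation = cycle(targets.items())
    statuses = []
    for _ in range(requests):
        _name, has_router = next(rotation)
        # A target without a router cannot answer, so the client sees a 502.
        statuses.append(200 if has_router else 502)
    return statuses


def healthy_targets(targets=TARGETS):
    """Targets that should stay registered with the load balancer."""
    return [name for name, has_router in targets.items() if has_router]


if __name__ == "__main__":
    print(simulate(9))
    print(healthy_targets())
```

Under these assumptions, every third request fails no matter which route it is for, which from the outside could look like random routes dropping in and out. Keeping only the targets returned by `healthy_targets` in the rotation removes the failures.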


I’m really confused about what you are trying to do.  You should not front
the apiserver with a router.  It is generally best not to colocate the
router and the masters unless your bandwidth requirements are low, but it’s
much more effective to schedule the routers on nodes and keep that traffic
separate from a resiliency perspective.

The routers need the masters to be available (at least 2 of 3) to receive
their route configuration when restarting, but require no interconnection
to serve traffic.
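
As a rough illustration of the "2 of 3" figure, here is a small sketch that assumes the masters follow a simple majority rule, as etcd-style quorum does. The function names are hypothetical and this is not router or OpenShift code.

```python
def quorum(members: int) -> int:
    """Smallest majority of a group of the given size."""
    return members // 2 + 1


def can_load_routes(masters_up: int, masters_total: int = 3) -> bool:
    """Assumption: a restarting router can fetch its route configuration
    only while a majority of masters is available."""
    return masters_up >= quorum(masters_total)


if __name__ == "__main__":
    for up in range(4):
        print(up, can_load_routes(up))
```

Note that this only models the restart case. A router that is already running keeps serving the routes it has loaded.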



*From:* Clayton Coleman <ccole...@redhat.com>
*Sent:* Sunday, September 2, 2018 9:31 PM
*To:* Stan Varlamov <stan.varla...@exlinc.com>
*Cc:* users@lists.openshift.redhat.com
*Subject:* Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down



When you were experiencing the outage, was the ALB listing 2/3 healthy
backends?  I’m not as familiar with ALB as with ELB, but what you are
describing sounds like the frontend was only able to see one of the pods.


On Sep 2, 2018, at 9:21 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

AWS ALB



*From:* Clayton Coleman <ccole...@redhat.com>
*Sent:* Sunday, September 2, 2018 9:11 PM
*To:* Stan Varlamov <stan.varla...@exlinc.com>
*Cc:* users@lists.openshift.redhat.com
*Subject:* Re: 3.9 Default Router Malfunction When 1 of 3 Pods is Down



Routers all watch all routes.  What are you fronting your routers with for
HA?  VRRP?  An F5 or cloud load balancer?  DNS?


On Sep 2, 2018, at 6:18 AM, Stan Varlamov <stan.varla...@exlinc.com> wrote:

Went through a pretty scary experience of a partial and uncontrollable outage
in a 3.9 cluster that turned out to be caused by issues in the default
out-of-the-box Router. The original installation had 3 region=infra nodes
where the 3 router pods got installed via the generic ansible cluster
installation. 2 of the 3 nodes were subsequently re-labeled at some point
in the past, and after one node was restarted, all of a sudden random routes
started “disappearing”, causing 502s. I noticed that one of the 3 Router
pods was in Pending, due to a lack of available nodes. Bottom line: until I
got all 3 pods back into operation (I tried dropping the nodeselector
requirements but ended up re-labeling the nodes back to infra), the routes
would not come back. I would expect that even one working Router can
control all routes in the cluster, but no. I couldn’t find a pattern in which
routes were off vs. those that stayed on, and some routes would pop in and
out of operation. Is there something in the Router design that relies on
all its pods working? It appears that individual Router pods are
“responsible” for some routes in the cluster vs. just providing redundancy.







_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
