Hi. Sincere apologies for the delay in following up on this. We are now able to share the details of this incident, which are as follows.
On May 26 at around 1:56 PM PT, some or all pods (and the containers running in them) in some GKE clusters in the us-central1 region were forcibly restarted. The root cause was packet loss from a temporary GCP networking problem, which caused the GKE masters to miss some or all of their nodes' heartbeat messages for long enough to conclude that the nodes were down. When the affected nodes became reachable from the master again, the nodes terminated their pods as expected.

Pods that had been running on those nodes and were managed by a controller (e.g. ReplicationController) were rescheduled at one of two points: after the node was declared dead, if other nodes in the cluster had free resources, or when the node became reachable again, if no other nodes were available to run the replacement pods.

Since the incident, we have taken measures to make GKE and Kubernetes more resilient to correlated node failure (e.g. PR #25571), and are working on additional protections that will be included in the 1.4 release (see issue #28832).

On Thu, Jun 9, 2016 at 12:52 PM, Zaar Hai <[email protected]> wrote:

> Thanks for sharing. It's good to know that the problem is being worked on.
>
> On 8 Jun 2016 02:29, "'Daniel Smith' via Containers at Google" <[email protected]> wrote:
>
>> We're aware of this issue and are preparing an incident report.
>>
>> You'll have to wait for that for details about the particular trigger in
>> this case, but #24200
>> <https://github.com/kubernetes/kubernetes/issues/24200> is the basic
>> problem. A partial amelioration
>> <https://github.com/kubernetes/kubernetes/pull/25571> is already in 1.3.
>> At the moment we believe only a single zone in a single region was affected.
>>
>> On Sat, Jun 4, 2016 at 12:46 AM, Zaar Hai <[email protected]> wrote:
>>
>>> We opened a ticket there. I'll update this thread if something
>>> interesting pops up.
>>>
>>> Multi zone k8s will arrive only in 1.4 AFAIK.
>>>
>>> On 4 Jun 2016 01:55, "Chris Hiestand" <[email protected]> wrote:
>>>
>>>> Seems like something went down. Even if the master went down, your pods
>>>> shouldn't have, so that is disconcerting. So perhaps there was an outage
>>>> affecting your master and nodes. In my limited experience, smaller outages
>>>> or problems in GCP might not get reported on the cloud status page.
>>>>
>>>> And I imagine these were all in one zone. I wonder if ubernetes lite
>>>> (additional-zone) would have mitigated the problem.
>>>>
>>>> To find out more, you'd probably need to pay for GCP support.
>>>>
>>>> --
>>>> You received this message because you are subscribed to a topic in the
>>>> Google Groups "Containers at Google" group.
>>>> To unsubscribe from this topic, visit
>>>> https://groups.google.com/d/topic/google-containers/AB8MEDiLSik/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to
>>>> [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/google-containers.
>>>> For more options, visit https://groups.google.com/d/optout.
