[ https://issues.apache.org/jira/browse/FLINK-29939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630875#comment-17630875 ]

Gyula Fora commented on FLINK-29939:
------------------------------------

Sounds good +1

> Add metrics for Kubernetes Client Response 5xx count and rate
> -------------------------------------------------------------
>
>                 Key: FLINK-29939
>                 URL: https://issues.apache.org/jira/browse/FLINK-29939
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.3.0
>            Reporter: Zhou Jiang
>            Priority: Minor
>
> The operator now publishes the k8s client response count by response code.
> In addition to the cumulative count, adding a rate for k8s client error
> responses could help set up alerts that proactively detect the status of
> the underlying cluster API server. This enhances the metrics available when
> the Flink Operator is deployed to shared / multi-tenant k8s clusters.
>
> Why is rate needed for certain response codes?
> To detect issues proactively by setting up alerts in certain cases. It may
> be not the total number but the rate that indicates the start / end of an
> unavailability issue.
>
> Why do some 4xx matter in prod?
> For example, a noisy-neighbor issue may happen at a random time in shared
> clusters, and the operator may start to see an increased number of 429
> responses if the cluster does not have fairness in rate limiting. Another
> example is about churn: when the cluster has namespace quotas defined and a
> namespace is under pod churn, there could be an increasing number of 409
> responses. In these cases, metrics and alerting on the count / rate of
> certain 4xx codes are critical to understanding the start / end of a prod
> outage.
>
> Why is 5xx needed?
> To identify infrastructure issues faster. Alerting on an aggregate 5xx
> response count + rate is more straightforward than enumerating possible 5xx
> codes when setting up prod alerts.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)