[ 
https://issues.apache.org/jira/browse/FLINK-29939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630875#comment-17630875
 ] 

Gyula Fora commented on FLINK-29939:
------------------------------------

Sounds good +1

> Add metrics for Kubernetes Client Response 5xx count and rate
> -------------------------------------------------------------
>
>                 Key: FLINK-29939
>                 URL: https://issues.apache.org/jira/browse/FLINK-29939
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.3.0
>            Reporter: Zhou Jiang
>            Priority: Minor
>
> The operator currently publishes the k8s client response count by response 
> code. In addition to the cumulative count, adding a rate for k8s client error 
> responses would help set up alerts that proactively detect the status of the 
> underlying cluster API server. This enhances the metrics available when the 
> Flink Operator is deployed to shared / multi-tenant k8s clusters. 
>  
> Why is rate needed for certain response codes?
> To detect issues proactively by setting up alerts in certain cases. It may be 
> not the total number but the rate that indicates the start / end of an 
> unavailability issue.
>  
> Why do some 4xx matter in prod?
> For example, a noisy-neighbor issue may happen at a random time in shared 
> clusters, and the operator may start to see an increased number of 429s if 
> the cluster does not have fairness in rate limiting. Another example is 
> churn: when the cluster has namespace quotas defined and a namespace is under 
> pod churn, there could be an increasing number of 409s. In these cases, 
> metrics and alerting on the count / rate of certain 4xx codes are critical to 
> understanding the start / end of a prod outage.
>  
> Why is 5xx needed?
> To identify infrastructure issues faster. With a 5xx response count + rate, 
> setting up prod alerts is more straightforward than enumerating the possible 
> 5xx codes.
>  
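
For illustration only, here is a minimal sketch of the idea in the quoted
description: keep the cumulative count per response code, add a count per
status class (so "5xx" needs no enumeration of codes), and derive a rate over
a sliding window. It is not the operator's implementation; the class
ResponseMetrics, its methods, the helper is_5xx, and the 60-second window are
all hypothetical names and values chosen for this example.

```python
import time
from collections import defaultdict


def is_5xx(status_code):
    """True for any server-error status code (500-599)."""
    return 500 <= status_code < 600


class ResponseMetrics:
    """Hypothetical tracker, not the operator's actual metrics code."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds  # assumed window length
        self.clock = clock                    # injectable for testing
        self.count_by_code = defaultdict(int)   # cumulative, e.g. 429 -> 3
        self.count_by_class = defaultdict(int)  # cumulative, e.g. "5xx" -> 7
        self.events = []                        # (timestamp, status_code)

    def record(self, status_code):
        """Record one k8s client response."""
        self.count_by_code[status_code] += 1
        self.count_by_class[f"{status_code // 100}xx"] += 1
        self.events.append((self.clock(), status_code))

    def rate(self, predicate):
        """Matching responses per second over the most recent window."""
        cutoff = self.clock() - self.window_seconds
        # Drop events that have fallen out of the window.
        self.events = [(t, c) for t, c in self.events if t > cutoff]
        matching = sum(1 for _, c in self.events if predicate(c))
        return matching / self.window_seconds
```

With this shape, an alert on infrastructure trouble could watch
rate(is_5xx), while an alert on throttling could watch
rate(lambda c: c == 429). The cumulative counts keep growing, but the rate
falls back to zero once the errors stop, which is what marks the end of an
incident.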



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
