[ 
https://issues.apache.org/jira/browse/FLINK-29939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhou Jiang updated FLINK-29939:
-------------------------------
    Description: 
The operator currently publishes k8s client response counts by response code. 
In addition to the cumulative count, adding a rate for k8s client error 
responses would help in setting up alerts that proactively detect problems 
with the underlying cluster API server. This enhances the metrics for cases 
where the Flink Operator is deployed to shared / multi-tenant k8s clusters. 

 

Why is rate needed for certain response codes?

To detect issues proactively by setting up alerts in certain cases. It is 
often not the total number but the rate that indicates the start / end of an 
unavailability issue.

 

Why do some 4xx matter in prod?

For example, a noisy-neighbor issue may occur at any time in shared clusters, 
and the operator may start to see an increased number of 429 responses if the 
cluster does not enforce fairness in rate limiting. Another example is churn: 
when the cluster has namespace quotas defined and a namespace is under pod 
churn, the number of 409 responses may increase. In these cases, metrics and 
alerting on the count / rate of certain 4xx codes are critical for 
understanding the start / end of a prod outage.

 

Why is 5xx needed?

To identify infrastructure issues faster. A combined 5xx response count + 
rate is more straightforward than enumerating possible 5xx codes when setting 
up prod alerts.
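
As a rough illustration of the request, the sketch below counts responses 
under both their exact code and their class (e.g. 5xx), and derives a rate 
from two cumulative readings. It is plain Java under stated assumptions, not 
the operator's implementation: the class ResponseCodeMetrics and the methods 
record, count, codeClass and ratePerSecond are hypothetical names chosen for 
this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: cumulative response counts per code and per code class. */
final class ResponseCodeMetrics {
    // Keys are either an exact code ("503") or a class label ("5xx").
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Maps a status code to its class label, e.g. 503 -> "5xx". */
    static String codeClass(int code) {
        return (code / 100) + "xx";
    }

    /** Records one response under both its exact code and its class. */
    void record(int code) {
        counts.computeIfAbsent(String.valueOf(code), k -> new LongAdder()).increment();
        counts.computeIfAbsent(codeClass(code), k -> new LongAdder()).increment();
    }

    /** Returns the cumulative count for a key, or 0 if nothing was recorded. */
    long count(String key) {
        LongAdder adder = counts.get(key);
        return adder == null ? 0L : adder.sum();
    }

    /** Average responses per second between two cumulative readings. */
    static double ratePerSecond(long previousCount, long currentCount, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (currentCount - previousCount) / elapsedSeconds;
    }
}
```

With this shape, an alert can watch the single "5xx" key instead of listing 
500, 502, 503 and so on, while exact keys such as "429" or "409" remain 
available for the 4xx cases described above.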

 

  was:
The operator currently publishes k8s client response counts by response code. 
In addition to the cumulative count, adding a rate for k8s client error 
responses would help in setting up alerts that proactively detect problems 
with the underlying cluster API server. This enhances the metrics for cases 
where the Flink Operator is deployed to shared / multi-tenant k8s clusters. 

 

Why is rate needed for certain response codes?

To detect issues proactively by setting up alerts in certain cases. It is 
often not the total number but the rate that indicates the start / end of an 
unavailability issue.

 

Why do some 4xx matter in prod?

For example, a noisy-neighbor issue may occur at any time in shared clusters, 
and the operator may start to see an increased number of 429 responses if the 
cluster does not enforce fairness in rate limiting. Another example is churn: 
when the cluster has namespace quotas defined and a namespace is under pod 
churn, the number of 409 responses may increase. In these cases, metrics and 
alerting on the count / rate of certain 4xx codes are critical for 
understanding the start / end of a prod outage.

 

Why is 5xx needed?

To identify infrastructure issues faster. A combined 5xx response count + 
rate is more straightforward than enumerating possible 5xx codes.

 


> Add metrics for Kubernetes Client Response 5xx count and rate
> -------------------------------------------------------------
>
>                 Key: FLINK-29939
>                 URL: https://issues.apache.org/jira/browse/FLINK-29939
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.3.0
>            Reporter: Zhou Jiang
>            Priority: Minor
>
> The operator currently publishes k8s client response counts by response 
> code. In addition to the cumulative count, adding a rate for k8s client 
> error responses would help in setting up alerts that proactively detect 
> problems with the underlying cluster API server. This enhances the metrics 
> for cases where the Flink Operator is deployed to shared / multi-tenant k8s 
> clusters. 
>  
> Why is rate needed for certain response codes?
> To detect issues proactively by setting up alerts in certain cases. It is 
> often not the total number but the rate that indicates the start / end of an 
> unavailability issue.
>  
> Why do some 4xx matter in prod?
> For example, a noisy-neighbor issue may occur at any time in shared 
> clusters, and the operator may start to see an increased number of 429 
> responses if the cluster does not enforce fairness in rate limiting. Another 
> example is churn: when the cluster has namespace quotas defined and a 
> namespace is under pod churn, the number of 409 responses may increase. In 
> these cases, metrics and alerting on the count / rate of certain 4xx codes 
> are critical for understanding the start / end of a prod outage.
>  
> Why is 5xx needed?
> To identify infrastructure issues faster. A combined 5xx response count + 
> rate is more straightforward than enumerating possible 5xx codes when 
> setting up prod alerts.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
