Yang Wang created FLINK-24819:
---------------------------------
Summary: Higher CPU load after replacing the naked Kubernetes watch
with SharedIndexInformer
Key: FLINK-24819
URL: https://issues.apache.org/jira/browse/FLINK-24819
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.14.0
Reporter: Yang Wang
In FLINK-22054, Flink replaced the naked K8s watch with a shared informer for
ConfigMaps. Since then, each Flink JVM process (JM/TM) needs only one
connection to the APIServer for ConfigMap watching. The change aimed to reduce
the network pressure on the K8s APIServer.
However, in our recent tests we found that the CPU and memory cost of the
APIServer doubled while running the same Flink workloads. After digging into
the K8s details, I think the root cause might be that ETCD does not have
indexes for labels. This means the APIServer needs to pull all the events from
ETCD for each watch and then filter them internally by the specified labels
(e.g. app=xxx,type=flink-native-kubernetes,configmap-type=high-availability).
Before FLINK-22054, we started a dedicated connection for each ConfigMap
watch, and it seems that the APIServer only needed to pull the events for the
specified ConfigMap name.
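The suspected difference can be illustrated with a toy model. This is only a sketch of the hypothesis above, not of how the APIServer or ETCD is actually implemented; the event list, the name index, and all function names are hypothetical.

```python
from collections import defaultdict

def matches(labels, selector):
    """Return True if every key=value pair of the selector is present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())

def deliver_by_label(events, selector):
    """Label-selector watch: every event in the namespace is examined."""
    examined = 0
    delivered = []
    for ev in events:
        examined += 1
        if matches(ev["labels"], selector):
            delivered.append(ev["name"])
    return delivered, examined

def deliver_by_name(index, name):
    """Name-scoped watch: only events stored under that name are examined."""
    hits = index.get(name, [])
    return [ev["name"] for ev in hits], len(hits)

# Hypothetical namespace: 3 Flink clusters, 2 HA ConfigMaps each, 5 updates each.
events = []
for cluster in ("a", "b", "c"):
    for cm in ("resourcemanager-leader", "dispatcher-leader"):
        for _ in range(5):
            events.append({
                "name": f"{cluster}-{cm}",
                "labels": {
                    "app": cluster,
                    "type": "flink-native-kubernetes",
                    "configmap-type": "high-availability",
                },
            })

# Assumed name index, standing in for a key-based lookup.
index = defaultdict(list)
for ev in events:
    index[ev["name"]].append(ev)

selector = {
    "app": "a",
    "type": "flink-native-kubernetes",
    "configmap-type": "high-availability",
}
by_label, label_examined = deliver_by_label(events, selector)
by_name, name_examined = deliver_by_name(index, "a-resourcemanager-leader")

print(f"label watch: examined {label_examined}, delivered {len(by_label)}")
print(f"name watch:  examined {name_examined}, delivered {len(by_name)}")
```

In this model the label-based watch examines every event in the namespace to deliver the few that match, while the name-based watch touches only the events for its own ConfigMap. Whether the real APIServer behaves this way is exactly what needs to be confirmed.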
Watch URL example (Before):
https://kubernetes.default:6443/api/v1/namespaces/vvp-workload/configmaps?metadata.name=job-009d4f51-ca02-4793-a49b-a3344538719b-resourcemanager-leader&watch=true
Watch URL example (After):
https://kubernetes.default:6443/api/v1/namespaces/vvp-workload/configmaps?labelSelector=app%3Dk8s-ha-app-1-1636077491-23461%2Ctype%3Dflink-native-kubernetes%2Cconfigmap-type%3Dhigh-availability&resourceVersion=1153687321&watch=true
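For reference, the two query strings can be rebuilt and decoded with a few lines of Python. This is a minimal sketch using only the standard library; the helper name configmap_watch_url is hypothetical. The "Before" parameters are copied as written in this report, and the exact parameters the client sent may differ.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://kubernetes.default:6443"

def configmap_watch_url(namespace, params):
    """Build a ConfigMap watch URL from an ordered dict of query parameters."""
    return f"{BASE}/api/v1/namespaces/{namespace}/configmaps?{urlencode(params)}"

# Before: one watch per ConfigMap, scoped by name (as written in the report).
before = configmap_watch_url("vvp-workload", {
    "metadata.name": "job-009d4f51-ca02-4793-a49b-a3344538719b-resourcemanager-leader",
    "watch": "true",
})

# After: one shared watch per JVM process, scoped by label selector.
labels = {
    "app": "k8s-ha-app-1-1636077491-23461",
    "type": "flink-native-kubernetes",
    "configmap-type": "high-availability",
}
selector = ",".join(f"{k}={v}" for k, v in labels.items())
after = configmap_watch_url("vvp-workload", {
    "labelSelector": selector,
    "resourceVersion": "1153687321",
    "watch": "true",
})

# Decoding the "After" URL recovers the plain label selector.
decoded = parse_qs(urlsplit(after).query)["labelSelector"][0]

print(before)
print(after)
print(decoded)
```

The decoded selector makes the difference easier to see: the "After" watch matches every high-availability ConfigMap of the application rather than a single named object.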
--
This message was sent by Atlassian Jira
(v8.20.1#820001)