[ https://issues.apache.org/jira/browse/FLINK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759570#comment-17759570 ]
Yang Wang edited comment on FLINK-32678 at 8/29/23 5:15 AM:
------------------------------------------------------------

*Stress Test*

Run 1000 Flink jobs, each with 1 JM and 1 TM.

1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}

The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher, JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}} for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager pods) / 5 s (renew interval). The QPS of {{GET ConfigMap}} is twice that of {{PUT}}.

2. Flink version 1.17.1 (same as 1.15.4 with {{high-availability.use-old-ha-services=false}})

Flink has only one shared leader elector. So the QPS of {{PUT ConfigMap}} for 1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 s (renew interval). The QPS of {{GET ConfigMap}} is twice that of {{PUT}}.

3. Flink version 1.18-snapshot

Flink has only one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 s (renew interval). The QPS of {{GET ConfigMap}} is the same as that of {{PATCH}}.

!qps-configmap-put-115.png|width=694,height=176!

!qps-configmap-put-117.png|width=694,height=176!

!qps-configmap-patch-118.png|width=694,height=176!

From the pictures above, we can verify that the new leader elector in 1.18 sends only a quarter of the write requests that the old one in 1.15 sends to the K8s APIServer. This significantly reduces the stress on the K8s APIServer.

!qps-configmap-get-115.png|width=694,height=176!

!qps-configmap-get-117.png|width=694,height=176!

!qps-configmap-get-118.png|width=694,height=176!

We can also see that the read requests in 1.18 are half of those in 1.17. The reason is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating the leader annotation, which saves a GET request for each update.
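The request-rate estimates above are simple arithmetic, and a short sketch makes them easy to re-run for other cluster sizes. This is only an illustration: the function name {{estimate_qps}} and the parameter {{gets_per_write}} are hypothetical, the renew interval is assumed to be in seconds, and the GET-to-write ratios (2 for PUT, 1 for PATCH) are taken from the statements in the comment, not measured.

```python
def estimate_qps(jobs, leader_electors, renew_interval_s, gets_per_write):
    """Return (write_qps, read_qps) for ConfigMap leader-election traffic.

    Assumes every leader elector issues one write per renew interval and
    that each write is accompanied by `gets_per_write` GET requests.
    """
    write_qps = leader_electors * jobs / renew_interval_s
    read_qps = write_qps * gets_per_write
    return write_qps, read_qps


# Scenario parameters as described in the comment (illustrative labels).
SCENARIOS = {
    "1.15.4, old HA services (PUT)": {"leader_electors": 4, "gets_per_write": 2},
    "1.17.1, shared elector (PUT)": {"leader_electors": 1, "gets_per_write": 2},
    "1.18-snapshot, shared elector (PATCH)": {"leader_electors": 1, "gets_per_write": 1},
}

for name, params in SCENARIOS.items():
    write_qps, read_qps = estimate_qps(jobs=1000, renew_interval_s=5, **params)
    print(f"{name}: write ~{write_qps:.0f} req/s, read ~{read_qps:.0f} req/s")
```

With 1000 jobs and a 5 s renew interval, the sketch gives write rates of 800, 200, and 200 req/s for the three setups, consistent with the estimates in the comment. The read rates follow from the stated ratios.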

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs run normally as before.

was (Author: fly_in_gis):
*Stress Test*

Run 1000 Flink jobs, each with 1 JM and 1 TM.

1. Flink version 1.15.4 with {{high-availability.use-old-ha-services=true}}

The Flink JobManager has 4 leader electors (RestServer, ResourceManager, Dispatcher, JobManager) that periodically update the K8s ConfigMap. So the QPS of {{PUT ConfigMap}} for 1000 jobs will be roughly 800 req/s ≈ 4 (leader electors) * 1000 (Flink JobManager pods) / 5 s (renew interval).

2. Flink version 1.18-snapshot

Flink has only one shared leader elector. So the QPS of {{PATCH ConfigMap}} for 1000 jobs will be roughly 200 req/s ≈ 1 (leader elector) * 1000 (Flink JobManager pods) / 5 s (renew interval).

!qos-configmap-put-115.png|width=694,height=176!

!qos-configmap-patch-118.png|width=694,height=176!

From the above two pictures, we can verify that the new leader elector in 1.18 sends only a quarter of the write requests that the old one in 1.15 sends to the K8s APIServer. This significantly reduces the stress on the K8s APIServer.

!qos-configmap-get-115.png|width=694,height=176!

!qos-configmap-get-118.png|width=694,height=176!

We also find that the read requests are only 1/8 of those of the old one. The reason is that fabric8 6.6.2 (FLINK-31997) introduced the PATCH HTTP method for updating the leader annotation, which saves a GET request for each update.

All in all, Flink 1.18 puts less stress on the K8s APIServer while all 1000 Flink jobs run normally as before.

> Release Testing: Stress-Test to cover multiple low-level changes in Flink
> -------------------------------------------------------------------------
>
>                 Key: FLINK-32678
>                 URL: https://issues.apache.org/jira/browse/FLINK-32678
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0
>            Reporter: Matthias Pohl
>            Assignee: Yang Wang
>            Priority: Major
>              Labels: release-testing
>             Fix For: 1.18.0
>
>         Attachments: qps-configmap-get-115.png, qps-configmap-get-117.jpg,
> qps-configmap-get-118.png, qps-configmap-patch-118.png,
> qps-configmap-put-115.png, qps-configmap-put-117.jpg
>
>
> -We decided to do another round of testing for the LeaderElection refactoring
> which happened in
> [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box].-
> This release testing task is about running a larger number of jobs in a Flink
> environment to look for unusual behavior. This Jira issue shall cover the
> following 1.18 efforts:
> * Leader Election refactoring
> ([FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+refactoring+leaderelection+to+make+flink+support+multi-component+leader+election+out-of-the-box],
> FLINK-26522)
> * Akka to Pekko transition (FLINK-32468)
> * flink-shaded 17.0 updates (FLINK-32032)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)