[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status
mbalassi commented on code in PR #558: URL: https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1155589002 ## flink-kubernetes-operator/src/test/java/org/apache/flink/kubernetes/operator/metrics/FlinkDeploymentMetricsTest.java: ## @@ -187,6 +193,66 @@ public void testMetricsMultiNamespace() { } } +@Test +public void testResourceMetrics() { +var namespace1 = "ns1"; +var namespace2 = "ns2"; +var deployment1 = TestUtils.buildApplicationCluster("deployment1", namespace1); +var deployment2 = TestUtils.buildApplicationCluster("deployment2", namespace1); +var deployment3 = TestUtils.buildApplicationCluster("deployment3", namespace2); + +deployment1 +.getStatus() +.getClusterInfo() +.putAll( +Map.of( +AbstractFlinkService.FIELD_NAME_TOTAL_CPU, "5", Review Comment: Could you please add a test that has unexpected values (null, empty string etc) - given that the status field could be (but not expected to be) modified externally we want to make sure that the operator logic does not fail on that (This is why I used `NumberUtils` in the implementation). ## flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java: ## @@ -645,10 +641,19 @@ public Map getClusterInfo(Configuration conf) throws Exception { dashboardConfiguration.getFlinkRevision()); } -// JobManager resource usage can be deduced from the CR -var jmParameters = -new KubernetesJobManagerParameters( -conf, new KubernetesClusterClientFactory().getClusterSpecification(conf)); +clusterInfo.putAll( +calculateClusterResourceMetrics( +conf, getTaskManagersInfo(conf).getTaskManagerInfos().size())); + +return clusterInfo; +} + +private HashMap calculateClusterResourceMetrics( Review Comment: nit: maybe call this `calculateClusterResourceUsage` or `calculateClusterResourceFootprint`, since technically not the metrics yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status
mbalassi commented on code in PR #558: URL: https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1155274830 ## flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java: ## @@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration conf) throws Exception { .toSeconds(), TimeUnit.SECONDS); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_VERSION, dashboardConfiguration.getFlinkVersion()); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_REVISION, dashboardConfiguration.getFlinkRevision()); } -return runtimeVersion; + +// JobManager resource usage can be deduced from the CR +var jmParameters = +new KubernetesJobManagerParameters( +conf, new KubernetesClusterClientFactory().getClusterSpecification(conf)); +var jmTotalCpu = +jmParameters.getJobManagerCPU() +* jmParameters.getJobManagerCPULimitFactor() +* jmParameters.getReplicas(); +var jmTotalMemory = +Math.round( +jmParameters.getJobManagerMemoryMB() +* Math.pow(1024, 2) +* jmParameters.getJobManagerMemoryLimitFactor() +* jmParameters.getReplicas()); + +// TaskManager resource usage is best gathered from the REST API to get current replicas Review Comment: Thanks @mateczagany, this approach looks good. If you have the bandwidth would you mind pushing your suggestions to this PR branch so that the commit can be attributed to you? I have invited you as a collaborator to my fork, you might need to accept that. I would ask the following if you have the time: 1. Get resource configuration from the config as you suggested uniformly for JMs and TMs 2. Get JM replicas from config, TM replicas from the REST API (we are trying to be careful with the TM replicas because we foresee that we might be changing things dynamically there via the autoscaler soon) 3. Add a test to `FlinkDeploymentMetricsTest` that verifies that given that the `status.clusterInfo` is properly filled out we fill out the metrics properly. Currently we do not have meaningful test for creating the clusterInfo and since we are relying on the application's REST API I do not see an easy way of testing it properly, so I would accept this change without that (but it might merit a separate JIRA). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status
mbalassi commented on code in PR #558: URL: https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1154290969 ## flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java: ## @@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration conf) throws Exception { .toSeconds(), TimeUnit.SECONDS); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_VERSION, dashboardConfiguration.getFlinkVersion()); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_REVISION, dashboardConfiguration.getFlinkRevision()); } -return runtimeVersion; + +// JobManager resource usage can be deduced from the CR +var jmParameters = +new KubernetesJobManagerParameters( +conf, new KubernetesClusterClientFactory().getClusterSpecification(conf)); +var jmTotalCpu = +jmParameters.getJobManagerCPU() +* jmParameters.getJobManagerCPULimitFactor() +* jmParameters.getReplicas(); +var jmTotalMemory = +Math.round( +jmParameters.getJobManagerMemoryMB() +* Math.pow(1024, 2) +* jmParameters.getJobManagerMemoryLimitFactor() +* jmParameters.getReplicas()); + +// TaskManager resource usage is best gathered from the REST API to get current replicas Review Comment: There is a limit factor for TaskManager cores that Flink allows to be configured on top of the resources defined on the Kubernestes level, similarly to have I calculated the JobManager resources. I setup an example to validate your suggestion where I have one JM and TM each, with 0.5 cpus configured in the resources field each. The cpu limit factors are 1.0. We end up with 1.5 cpus (0.5 for the JM accurately reported and 1.0 for the TM). ``` jobManager: replicas: 1 resource: cpu: 0.5 memory: 2048m serviceAccount: flink taskManager: resource: cpu: 0.5 memory: 2048m status: clusterInfo: flink-revision: DeadD0d0 @ 1970-01-01T01:00:00+01:00 flink-version: 1.16.1 tm-cpu-limit-factor: "1.0" jm-cpu-limit-factor: "1.0" total-cpu: "1.5" total-memory: "4294967296" jobManagerDeploymentStatus: READY ``` It is a bit of a tough problem, because the Flink UI also shows 1 core for the TM (using the same value that we get from the REST API). https://user-images.githubusercontent.com/5990983/229091963-f5e9a985-2ebe-4518-9623-6a4d4da9ad3c.png;> So ultimately we have to decide whether to stick with Flink or with Kubernetes, I am leaning towards the latter (with calculating in the limit factor, but avoiding the rounding). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status
mbalassi commented on code in PR #558: URL: https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1154032017 ## flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java: ## @@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration conf) throws Exception { .toSeconds(), TimeUnit.SECONDS); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_VERSION, dashboardConfiguration.getFlinkVersion()); -runtimeVersion.put( +clusterInfo.put( DashboardConfiguration.FIELD_NAME_FLINK_REVISION, dashboardConfiguration.getFlinkRevision()); } -return runtimeVersion; + +// JobManager resource usage can be deduced from the CR +var jmParameters = +new KubernetesJobManagerParameters( +conf, new KubernetesClusterClientFactory().getClusterSpecification(conf)); +var jmTotalCpu = +jmParameters.getJobManagerCPU() +* jmParameters.getJobManagerCPULimitFactor() +* jmParameters.getReplicas(); +var jmTotalMemory = +Math.round( +jmParameters.getJobManagerMemoryMB() +* Math.pow(1024, 2) +* jmParameters.getJobManagerMemoryLimitFactor() +* jmParameters.getReplicas()); + +// TaskManager resource usage is best gathered from the REST API to get current replicas Review Comment: Good catch @mateczagany. I had this suspicion in the back of my mind, that the CPU consumption might be overreported, but the way we pass the values to the taskmanagers via `flink-kubernetes` (which does have proper fractional values) convinced me that it should be ok. I will dive a bit deeper into this and come back. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org