[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status

2023-04-03 Thread via GitHub


mbalassi commented on code in PR #558:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1155589002


##
flink-kubernetes-operator/src/test/java/org/apache/flink/kubernetes/operator/metrics/FlinkDeploymentMetricsTest.java:
##
@@ -187,6 +193,66 @@ public void testMetricsMultiNamespace() {
 }
 }
 
+@Test
+public void testResourceMetrics() {
+var namespace1 = "ns1";
+var namespace2 = "ns2";
+var deployment1 = TestUtils.buildApplicationCluster("deployment1", 
namespace1);
+var deployment2 = TestUtils.buildApplicationCluster("deployment2", 
namespace1);
+var deployment3 = TestUtils.buildApplicationCluster("deployment3", 
namespace2);
+
+deployment1
+.getStatus()
+.getClusterInfo()
+.putAll(
+Map.of(
+AbstractFlinkService.FIELD_NAME_TOTAL_CPU, "5",

Review Comment:
   Could you please add a test that has unexpected values (null, empty string 
etc) - given that the status field could be (but not expected to be) modified 
externally we want to make sure that the operator logic does not fail on that 
(This is why I used `NumberUtils` in the implementation).



##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java:
##
@@ -645,10 +641,19 @@ public Map getClusterInfo(Configuration 
conf) throws Exception {
 dashboardConfiguration.getFlinkRevision());
 }
 
-// JobManager resource usage can be deduced from the CR
-var jmParameters =
-new KubernetesJobManagerParameters(
-conf, new 
KubernetesClusterClientFactory().getClusterSpecification(conf));
+clusterInfo.putAll(
+calculateClusterResourceMetrics(
+conf, 
getTaskManagersInfo(conf).getTaskManagerInfos().size()));
+
+return clusterInfo;
+}
+
+private HashMap calculateClusterResourceMetrics(

Review Comment:
   nit: maybe call this `calculateClusterResourceUsage` or 
`calculateClusterResourceFootprint`, since technically not the metrics yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status

2023-04-02 Thread via GitHub


mbalassi commented on code in PR #558:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1155274830


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java:
##
@@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration 
conf) throws Exception {
 .toSeconds(),
 TimeUnit.SECONDS);
 
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_VERSION,
 dashboardConfiguration.getFlinkVersion());
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_REVISION,
 dashboardConfiguration.getFlinkRevision());
 }
-return runtimeVersion;
+
+// JobManager resource usage can be deduced from the CR
+var jmParameters =
+new KubernetesJobManagerParameters(
+conf, new 
KubernetesClusterClientFactory().getClusterSpecification(conf));
+var jmTotalCpu =
+jmParameters.getJobManagerCPU()
+* jmParameters.getJobManagerCPULimitFactor()
+* jmParameters.getReplicas();
+var jmTotalMemory =
+Math.round(
+jmParameters.getJobManagerMemoryMB()
+* Math.pow(1024, 2)
+* jmParameters.getJobManagerMemoryLimitFactor()
+* jmParameters.getReplicas());
+
+// TaskManager resource usage is best gathered from the REST API to 
get current replicas

Review Comment:
   Thanks @mateczagany, this approach looks good. If you have the bandwidth 
would you mind pushing your suggestions to this PR branch so that the commit 
can be attributed to you?  I have invited you as a collaborator to my fork, 
you might need to accept that.
   
   I would ask the following if you have the time:
   
   1. Get resource configuration from the config as you suggested uniformly for 
JMs and TMs
   2. Get JM replicas from config, TM replicas from the REST API (we are trying 
to be careful with the TM replicas because we foresee that we might be changing 
things dynamically there via the autoscaler soon)
   3. Add a test to `FlinkDeploymentMetricsTest` that verifies that given that 
the `status.clusterInfo` is properly filled out we fill out the metrics 
properly.
   
   Currently we do not have meaningful test for creating the clusterInfo and 
since we are relying on the application's REST API I do not see an easy way of 
testing it properly, so I would accept this change without that (but it might 
merit a separate JIRA).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status

2023-03-31 Thread via GitHub


mbalassi commented on code in PR #558:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1154290969


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java:
##
@@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration 
conf) throws Exception {
 .toSeconds(),
 TimeUnit.SECONDS);
 
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_VERSION,
 dashboardConfiguration.getFlinkVersion());
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_REVISION,
 dashboardConfiguration.getFlinkRevision());
 }
-return runtimeVersion;
+
+// JobManager resource usage can be deduced from the CR
+var jmParameters =
+new KubernetesJobManagerParameters(
+conf, new 
KubernetesClusterClientFactory().getClusterSpecification(conf));
+var jmTotalCpu =
+jmParameters.getJobManagerCPU()
+* jmParameters.getJobManagerCPULimitFactor()
+* jmParameters.getReplicas();
+var jmTotalMemory =
+Math.round(
+jmParameters.getJobManagerMemoryMB()
+* Math.pow(1024, 2)
+* jmParameters.getJobManagerMemoryLimitFactor()
+* jmParameters.getReplicas());
+
+// TaskManager resource usage is best gathered from the REST API to 
get current replicas

Review Comment:
   There is a limit factor for TaskManager cores that Flink allows to be 
configured on top of the resources defined on the Kubernestes level, similarly 
to have I calculated the JobManager resources. I setup an example to validate 
your suggestion where I have one JM and TM each, with 0.5 cpus configured in 
the resources field each. The cpu limit factors are 1.0. We end up with 1.5 
cpus (0.5 for the JM accurately reported and 1.0 for the TM).
   
   ```
 jobManager:
   replicas: 1
   resource:
 cpu: 0.5
 memory: 2048m
 serviceAccount: flink
 taskManager:
   resource:
 cpu: 0.5
 memory: 2048m
   status:
 clusterInfo:
   flink-revision: DeadD0d0 @ 1970-01-01T01:00:00+01:00
   flink-version: 1.16.1
   tm-cpu-limit-factor: "1.0"
   jm-cpu-limit-factor: "1.0"
   total-cpu: "1.5"
   total-memory: "4294967296"
 jobManagerDeploymentStatus: READY
   ```
   
   It is a bit of a tough problem, because the Flink UI also shows 1 core for 
the TM (using the same value that we get from the REST API).
   
   https://user-images.githubusercontent.com/5990983/229091963-f5e9a985-2ebe-4518-9623-6a4d4da9ad3c.png;>
   
   So ultimately we have to decide whether to stick with Flink or with 
Kubernetes, I am leaning towards the latter (with calculating in the limit 
factor, but avoiding the rounding).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-kubernetes-operator] mbalassi commented on a diff in pull request #558: [FLINK-31303] Expose Flink application resource usage via metrics and status

2023-03-30 Thread via GitHub


mbalassi commented on code in PR #558:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/558#discussion_r1154032017


##
flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java:
##
@@ -627,14 +637,42 @@ public Map getClusterInfo(Configuration 
conf) throws Exception {
 .toSeconds(),
 TimeUnit.SECONDS);
 
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_VERSION,
 dashboardConfiguration.getFlinkVersion());
-runtimeVersion.put(
+clusterInfo.put(
 DashboardConfiguration.FIELD_NAME_FLINK_REVISION,
 dashboardConfiguration.getFlinkRevision());
 }
-return runtimeVersion;
+
+// JobManager resource usage can be deduced from the CR
+var jmParameters =
+new KubernetesJobManagerParameters(
+conf, new 
KubernetesClusterClientFactory().getClusterSpecification(conf));
+var jmTotalCpu =
+jmParameters.getJobManagerCPU()
+* jmParameters.getJobManagerCPULimitFactor()
+* jmParameters.getReplicas();
+var jmTotalMemory =
+Math.round(
+jmParameters.getJobManagerMemoryMB()
+* Math.pow(1024, 2)
+* jmParameters.getJobManagerMemoryLimitFactor()
+* jmParameters.getReplicas());
+
+// TaskManager resource usage is best gathered from the REST API to 
get current replicas

Review Comment:
   Good catch @mateczagany. I had this suspicion in the back of my mind, that 
the CPU consumption might be overreported, but the way we pass the values to 
the taskmanagers via `flink-kubernetes` (which does have proper fractional 
values) convinced me that it should be ok. I will dive a bit deeper into this 
and come back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org