Hi Oliver,
I believe you are almost there. One thing that could be improved: in your
job YAML, instead of using
kubernetes.operator.metrics.reporter.prommetrics.reporters: prom
kubernetes.operator.metrics.reporter.prommetrics.reporter.prom.factory.class:
org.apache.flink.metrics.prometheus.PrometheusReporterFactory
kubernetes.operator.metrics.reporter.prom.port: 9249-9250
you should use
metrics.reporter.prom.factory.class:
org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: "9249"
Configs with the `kubernetes.operator` prefix are for the Flink Kubernetes
operator itself (you may use them if you want to collect the operator's own
metrics). For the job config, the prefix is not needed.
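For reference, here is a minimal sketch of how that could look inside your
FlinkDeployment manifest, based on the manifest you posted (only the
flinkConfiguration section changes; the rest of the spec stays the same):

```
flinkConfiguration:
  taskmanager.numberOfTaskSlots: "2"
  # Job-level reporter config uses plain metrics.* keys, without the
  # kubernetes.operator prefix
  metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
  metrics.reporter.prom.port: "9249"
```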
I created a detailed demo
<https://github.com/bgeng777/pyflink-learning/tree/main/flink-k8s-operator-monitor>
of using Prometheus to monitor jobs started by the Flink Kubernetes
operator. Maybe it can be helpful.
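As a quick sanity check (just a sketch; the pod name below is hypothetical,
list yours with `kubectl get pods`), you can port-forward a TaskManager pod
and confirm the reporter answers on the configured port:

```
# Hypothetical pod name - replace with one of your TaskManager pods
kubectl port-forward pod/flink-cluster-taskmanager-1-1 9249:9249
# In another shell: the Prometheus reporter should return metrics in text format
curl -s http://localhost:9249/ | head
```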
Best,
Biao Geng
Oliver Schmied <[email protected]> wrote on Sun, May 19, 2024 at 04:21:
> Dear Apache Flink Community,
>
> I am currently trying to monitor an Apache Flink cluster deployed on
> Kubernetes using Prometheus and Grafana. Despite following the official
> guide (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.8/docs/operations/metrics-logging/)
> on how to set up Prometheus, I have not been able to get Flink-specific
> metrics to appear in Prometheus. I am reaching out to seek your assistance,
> as I've tried many things but nothing worked.
>
>
>
> # My setup:
>
> * Kubernetes
>
> * Flink v1.18 deployed as a FlinkDeployment
>
> with this manifest:
>
> ```
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   namespace: default
>   name: flink-cluster
> spec:
>   image: flink:1.18
>   flinkVersion: v1_18
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "2"
>     # Added
>     kubernetes.operator.metrics.reporter.prommetrics.reporters: prom
>     kubernetes.operator.metrics.reporter.prommetrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
>     kubernetes.operator.metrics.reporter.prom.port: 9249-9250
>   serviceAccount: flink
>   jobManager:
>     resource:
>       memory: "1048m"
>       cpu: 1
>   taskManager:
>     resource:
>       memory: "1048m"
>       cpu: 1
> ```
>
> * Prometheus operator installed via:
>
> helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
> helm install prometheus prometheus-community/kube-prometheus-stack
>
>
> * deployed a pod-monitor.yaml:
> ```
> apiVersion: monitoring.coreos.com/v1
> kind: PodMonitor
> metadata:
>   name: flink-kubernetes-operator
>   labels:
>     release: prometheus
> spec:
>   selector:
>     matchLabels:
>       app: flink-cluster
>   podMetricsEndpoints:
>   - port: metrics
> ```
>
> # The problem
>
> * I can access Prometheus fine, and judging by the logs of the pod monitor
> it seems to collect Flink-specific metrics, but I can't access these Flink
> metrics in Prometheus
> * Did I even set up Prometheus correctly in my Flink deployment manifest?
> * I also added the following to my values.yaml file, but apart from
> that I changed nothing:
> ```
> metrics:
>   port: 9999
> ```
>
> # My questions
>
> * Can anyone see the mistake in my deployment?
> * Or does anyone have a better idea on how to monitor my flink deployment?
>
>
> I would be very grateful for your answers. Thank you very much.
>
> Best regards,
> Oliver
>