Hi,
I have a few Flink jobs running on Kubernetes using the Flink Kubernetes
Operator. Following the documentation [1], I was able to set up monitoring
for the Operator itself, but I'm unsure how to set it up properly for the
jobs themselves. Here's my FlinkDeployment configuration:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sample-job
  namespace: flink
spec:
  image: flink:1.17
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
    state.savepoints.dir: file:///flink-data/savepoints
    state.checkpoints.dir: file:///flink-data/checkpoints
    high-availability.type: kubernetes
    high-availability.storageDir: file:///flink-data/ha
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: 9249-9250
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - mountPath: /flink-data
              name: flink-volume
      volumes:
        - name: flink-volume
          emptyDir: {}
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 1
    upgradeMode: savepoint
    state: running
    savepointTriggerNonce: 0
When I exec into the pod, I can curl http://localhost:9249 and see the
JobManager metrics, but the TaskManager metrics aren't there, and nothing
is listening on port 9250. Both the JobManager and the TaskManager are
running on the same machine.
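For reference, instead of eyeballing the curl output I've been listing the
metric families the endpoint exposes with a small script like the one below.
The endpoint URL is an assumption on my side (it's reachable after exec'ing
into the pod or via a port-forward); the parsing itself is just standard
Prometheus exposition text.

```python
import re
import urllib.request

def metric_names(exposition_text):
    """Extract the set of metric family names from Prometheus exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if m:
            names.add(m.group(1))
    return names

def fetch_metric_names(url="http://localhost:9249"):
    # Assumes the reporter endpoint is reachable from wherever this runs
    # (e.g. inside the pod, or after a `kubectl port-forward`).
    with urllib.request.urlopen(url) as resp:
        return metric_names(resp.read().decode())
```

With this I can quickly see that only `flink_jobmanager_*` families show up
and no `flink_taskmanager_*` ones.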
The documentation doesn't explain how to scrape the jobs themselves, so I
adapted the PodMonitor config provided for the Operator, but that didn't
work: the target shows up in the Prometheus dashboard, yet it always stays
completely blank. Here's the config I used:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sample-job
  namespace: monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: sample-job
  namespaceSelector:
    matchNames:
      - flink
  podMetricsEndpoints:
    - targetPort: 9249
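To see why the target stays blank, I've also been querying Prometheus's
targets API programmatically rather than just the dashboard. The Prometheus
URL is whatever my kube-prometheus install exposes (reached via a
port-forward here), so treat that part as an assumption; the
`/api/v1/targets` response fields are standard.

```python
import json
import urllib.request

def down_targets(targets_json):
    """Return (scrapeUrl, lastError) for every active target that isn't up."""
    active = targets_json.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl"), t.get("lastError"))
        for t in active
        if t.get("health") != "up"
    ]

def fetch_down_targets(prom_url="http://localhost:9090"):
    # Assumes something like `kubectl port-forward svc/prometheus 9090` is active.
    with urllib.request.urlopen(prom_url + "/api/v1/targets") as resp:
        return down_targets(json.load(resp))
```

The `lastError` field for the failing target is what I'd hope points at the
actual scrape problem.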
So, here's what I want to know:
1. What should the appropriate scraping configuration look like?
2. How can I retrieve the TaskManager metrics as well?
3. If multiple jobs end up running on the same machine, how can I get
metrics for all of them?
Any help would be appreciated.
Versions:
Flink: 1.17.1
Flink Kubernetes Operator: 1.5.0
[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.5/docs/operations/metrics-logging/#how-to-enable-prometheus-example
Thanks,
Sunny