Re: [EXTERNAL] Flink and Prometheus monitoring question
Hi Jesús,

If your job has checkpointing enabled, you can monitor 'numberOfCompletedCheckpoints' to see whether the job is still alive and healthy.

Thanks,
Zhu Zhu

On Tue, Dec 17, 2019 at 2:43 AM, Jesús Vásquez wrote:

> The thing about the numRunningJobs metric is that I have to configure the
> Prometheus rules in advance with the number of jobs I expect to be running
> in order to alert; I need this rule to alert on individual jobs. I
> initially thought of flink_jobmanager_downtime{job_id=~".*"} == -1, but it
> turned out that the metric just emits 0 for running jobs and doesn't emit -1
> for failed jobs.
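A rough sketch of how the checkpoint-based check could be written as a Prometheus alerting rule. It assumes the Flink Prometheus reporter exposes the metric as flink_jobmanager_job_numberOfCompletedCheckpoints with a job_name label, and that the job normally completes at least one checkpoint every 10 minutes. The group name, alert name, window, and severity are illustrative and not from this thread.

```yaml
groups:
  - name: flink-job-health                      # hypothetical group name
    rules:
      - alert: FlinkJobNoCompletedCheckpoints   # hypothetical alert name
        # changes() counts how often the value changed inside the window,
        # so 0 means no checkpoint completed in the last 10 minutes (assumed window).
        expr: changes(flink_jobmanager_job_numberOfCompletedCheckpoints[10m]) == 0
        for: 5m
        labels:
          severity: critical                    # assumed severity label
        annotations:
          summary: "Flink job {{ $labels.job_name }} has not completed a checkpoint in 10 minutes"
```

Because the expression is evaluated per series, it alerts per job without knowing the job count in advance. It can only fire while the series is still being scraped, so a job whose metrics vanish entirely would need a separate check.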
Re: [EXTERNAL] Flink and Prometheus monitoring question
The thing about the numRunningJobs metric is that I have to configure the Prometheus rules in advance with the number of jobs I expect to be running in order to alert; I need this rule to alert on individual jobs. I initially thought of flink_jobmanager_downtime{job_id=~".*"} == -1, but it turned out that the metric just emits 0 for running jobs and doesn't emit -1 for failed jobs.

On Mon, Dec 16, 2019 at 7:01 PM, PoolakkalMukkath, Shakir <shakir_poolakkalmukk...@comcast.com> wrote:

> You could use “flink_jobmanager_numRunningJobs” to check the number of
> running jobs.
>
> Thanks
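One possible way to alert on individual jobs without hard-coding the expected job count is to look for series that existed a while ago but are gone now. A minimal sketch, assuming the metric is exposed as flink_jobmanager_job_downtime with a job_id label (as in the original question) and a job_name label; the group name, alert name, and 10-minute offset are made up for illustration.

```yaml
groups:
  - name: flink-job-presence                # hypothetical group name
    rules:
      - alert: FlinkJobMetricsDisappeared   # hypothetical alert name
        # Left side: jobs that reported the metric 10 minutes ago.
        # "unless" drops every job_id that still reports it now,
        # leaving only jobs whose series has vanished.
        expr: |
          flink_jobmanager_job_downtime offset 10m
            unless on (job_id) flink_jobmanager_job_downtime
        for: 1m
        annotations:
          summary: "Flink job {{ $labels.job_name }} ({{ $labels.job_id }}) stopped reporting metrics"
```

The alert resolves on its own once the offset window has passed, and it cannot tell a failed job from one that finished or was cancelled on purpose, so treat it as a starting point rather than a complete failure detector.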
Re: [EXTERNAL] Flink and Prometheus monitoring question
You could use “flink_jobmanager_numRunningJobs” to check the number of running jobs.

Thanks

From: Jesús Vásquez
Date: Monday, December 16, 2019 at 12:47 PM
To: "user@flink.apache.org"
Subject: [EXTERNAL] Flink and Prometheus monitoring question

Hi,

I want to monitor Flink streaming jobs using Prometheus.

My first goal is to send alerts when a Flink job has failed.

The thing is that, looking at the documentation, I haven't found a metric that helps me define an alerting rule.

As a starting point, I thought the metric flink_jobmanager_job_downtime could help, since the docs say this metric emits -1 for a completed job.

But when I tested this, I found that it doesn't work: the metric always emits 0, and after the job has completed no metric is emitted at all.

Has anyone managed to alert on a failed Flink job with Prometheus?

Thanks for your help.
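For illustration, a threshold rule on this metric might look like the sketch below. The expected count of 3 is a hypothetical value that would have to be kept in sync with the cluster by hand, and the group and alert names are likewise assumptions.

```yaml
groups:
  - name: flink-cluster                     # hypothetical group name
    rules:
      - alert: FlinkFewerJobsThanExpected   # hypothetical alert name
        # 3 is an assumed expected number of running jobs.
        expr: flink_jobmanager_numRunningJobs < 3
        for: 2m
```

This reports that some job is missing, but not which one.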