Re: [EXTERNAL] Flink and Prometheus monitoring question

2019-12-16 Thread Zhu Zhu
Hi Jesús,
If your job has checkpointing enabled, you can monitor
'numberOfCompletedCheckpoints' to see whether the job is still alive and
healthy.
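
As a rough illustration of this suggestion, a Prometheus alerting rule could watch the completed-checkpoint count per job and fire when it stops changing. This is only a sketch under assumptions: the exported metric name flink_jobmanager_job_numberOfCompletedCheckpoints and the job_name label are what the Flink Prometheus reporter is assumed to produce here, the group and alert names are hypothetical, and the 10m/5m windows are placeholders that should exceed the job's checkpoint interval.

```yaml
groups:
  - name: flink-job-health                  # hypothetical group name
    rules:
      - alert: FlinkJobCheckpointsStalled   # hypothetical alert name
        # Assumed metric name (job-scoped metric as exported by the
        # Prometheus reporter). changes() counts how often the value
        # changed inside the window, so 0 means no new completed
        # checkpoint in the last 10 minutes. The expression is evaluated
        # per series, which gives one alert per job.
        expr: changes(flink_jobmanager_job_numberOfCompletedCheckpoints[10m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          # job_name is an assumed label; adjust to the labels actually exported.
          summary: "No completed checkpoints for Flink job {{ $labels.job_name }}"
```

One caveat: if the series disappears entirely once the job has failed, this expression soon returns no data and the alert may resolve or never fire, so it may need to be paired with an absent() check on a known job name.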

Thanks,
Zhu Zhu


Re: [EXTERNAL] Flink and Prometheus monitoring question

2019-12-16 Thread Jesús Vásquez
The problem with the numRunningJobs metric is that I have to configure the
Prometheus rules in advance with the number of jobs I expect to be running
in order to alert, and I need the rule to alert on individual jobs. I
initially thought of flink_jobmanager_job_downtime{job_id=~".*"} == -1, but it
turned out that the metric just emits 0 for running jobs and doesn't emit -1
for failed jobs.


Re: [EXTERNAL] Flink and Prometheus monitoring question

2019-12-16 Thread PoolakkalMukkath, Shakir
You could use “flink_jobmanager_numRunningJobs” to check the number of running 
jobs.
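
A minimal sketch of what such a rule might look like is shown here. The group name, alert name, and the threshold of 3 are hypothetical placeholders; the expected job count has to be known and hard-coded in advance, and the rule says nothing about which individual job stopped.

```yaml
groups:
  - name: flink-cluster-health              # hypothetical group name
    rules:
      - alert: FlinkRunningJobsBelowExpected   # hypothetical alert name
        # 3 is a placeholder for the number of jobs expected to be running.
        expr: flink_jobmanager_numRunningJobs < 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Fewer Flink jobs are running than expected"
```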

Thanks

From: Jesús Vásquez 
Date: Monday, December 16, 2019 at 12:47 PM
To: "user@flink.apache.org" 
Subject: [EXTERNAL] Flink and Prometheus monitoring question

Hi,
I want to monitor Flink streaming jobs using Prometheus.
My first goal is to send alerts when a Flink job has failed.
The thing is that, looking at the documentation, I haven't found a metric that
helps me define an alerting rule.
As a starting point I thought that the metric flink_jobmanager_job_downtime
could help, since the doc says this metric emits -1 for a completed job.
But when I tested this I found that it doesn't work: the metric always
emits 0, and after the job is completed there is no metric at all.
Has anyone managed to alert with Prometheus when a Flink job has failed?
Thanks for your help.