Hi all,

I think what Meghajit is trying to understand is how to measure the uptime
of a submitted Flink job. Prior to the K8s operator, the job manager was
presumably torn down together with the job, so the uptime metric stopped
when the job did; in effect, uptime also measured how long the job had been
running. As you have described, this is not the behavior with the K8s
operator, so a different metric must be used.

With 1.15, there are some new availability metrics [1]. Does <jobStatus>Time,
where <jobStatus> is "running", work for you?

[1]
https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/metrics/#availability
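
A minimal sketch of how those metrics could be exposed for alerting,
assuming the operator's FlinkDeployment CRD and the Prometheus reporter
bundled with Flink 1.15 (the exported name, e.g.
flink_jobmanager_job_runningTime, depends on the reporter and the scope
formats you configure):

  spec:
    flinkConfiguration:
      # Expose Flink metrics for Prometheus to scrape (default port 9249).
      metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
      metrics.reporter.prom.port: "9249"

If I read the docs right, runningTime is non-zero only while the job is
actually in the RUNNING state, so a "runningTime > 0" check should
distinguish a running job from a suspended one.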

Best,
Mason

On Wed, Oct 12, 2022 at 11:36 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Sorry, what I said applies to Flink 1.15+ and the savepoint upgrade mode
> (not stateless).
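>
> To be concrete, a minimal sketch of the combination I mean (the savepoint
> directory below is only a placeholder):
>
>   spec:
>     flinkConfiguration:
>       state.savepoints.dir: s3://my-bucket/savepoints
>     job:
>       upgradeMode: savepoint   # rather than stateless or last-state
>       state: suspended
>
> With this combination, suspending stops the job with a savepoint, and on
> Flink 1.15+ the jobmanager deployment is kept around.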
>
> In any case, if there is no job manager, there are no metrics... so I am
> not sure how to answer your question.
>
> Gyula
>
> On Thu, Oct 13, 2022 at 8:24 AM Meghajit Mazumdar <
> meghajit.mazum...@gojek.com> wrote:
>
>> Hi Gyula,
>>
>> Thanks for the prompt response.
>>
>> > The Flink operator currently does not delete the jobmanager pod when a
>> deployment is suspended.
>> Are you sure this is true? I have retried this many times, and each time
>> the pods get deleted, along with the deployment resources.
>>
>> Additionally, the flink-operator logs also show that the resources are
>> being deleted (see the FlinkUtils deletion lines below) after I change the
>> state in the FlinkDeployment YAML from running to suspended.
>> (Note: my FlinkDeployment name is *my-sample-dagger-v7*.)
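>>
>> For reference, the only change I make is this one field (the rest of the
>> spec is unchanged and omitted here):
>>
>>   spec:
>>     job:
>>       state: suspended   # changed from: running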
>>
>> 2022-10-13 06:11:47,392 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
>> 2022-10-13 06:11:49,879 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/parquet-savepoint-test] Starting reconciliation
>> 2022-10-13 06:11:49,880 o.a.f.k.o.o.JobStatusObserver  [INFO ][flink-operator/parquet-savepoint-test] Observing job status
>> 2022-10-13 06:11:52,710 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
>> 2022-10-13 06:11:52,712 o.a.f.k.o.o.JobStatusObserver  [INFO ][flink-operator/my-sample-dagger-v7] Observing job status
>> 2022-10-13 06:11:52,721 o.a.f.k.o.o.JobStatusObserver  [INFO ][flink-operator/my-sample-dagger-v7] Job status (RUNNING) unchanged
>> 2022-10-13 06:11:52,723 o.a.f.k.o.c.FlinkConfigManager [INFO ][flink-operator/my-sample-dagger-v7] Generating new config
>> 2022-10-13 06:11:52,725 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Detected spec change, starting reconciliation.
>>
>> 2022-10-13 06:11:52,788 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][flink-operator/my-sample-dagger-v7] Stateless job, ready for upgrade
>> 2022-10-13 06:11:52,798 o.a.f.k.o.s.FlinkService       [INFO ][flink-operator/my-sample-dagger-v7] Job is running, cancelling job.
>> 2022-10-13 06:11:52,815 o.a.f.k.o.s.FlinkService       [INFO ][flink-operator/my-sample-dagger-v7] Job successfully cancelled.
>> 2022-10-13 06:11:52,815 o.a.f.k.o.u.FlinkUtils         [INFO ][flink-operator/my-sample-dagger-v7] Deleting JobManager deployment and HA metadata.
>> 2022-10-13 06:11:56,863 o.a.f.k.o.u.FlinkUtils         [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
>> 2022-10-13 06:11:56,903 o.a.f.k.o.u.FlinkUtils         [INFO ][flink-operator/my-sample-dagger-v7] Cluster shutdown completed.
>> 2022-10-13 06:11:56,904 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] End of reconciliation
>> 2022-10-13 06:11:56,928 o.a.f.k.o.c.FlinkDeploymentController [INFO ][flink-operator/my-sample-dagger-v7] Starting reconciliation
>> 2022-10-13 06:11:56,930 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][flink-operator/my-sample-dagger-v7] Resource fully reconciled, nothing to do...
>>
>> Also, my original question was about the uptime metric itself. What is the
>> correct metric to use for monitoring the status (running or suspended) of
>> a job managed by the Flink Operator?
>> The *jobmanager_job_uptime_value* metric seems to give the wrong status,
>> as mentioned in the earlier mail.
>>
>> Regards,
>> Meghajit
>>
>>
>> On Wed, Oct 12, 2022 at 7:32 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hello!
>>> The Flink operator currently does not delete the jobmanager pod when a
>>> deployment is suspended.
>>> This way the REST API stays available, but no other resources are
>>> consumed (taskmanagers are deleted).
>>>
>>> When you delete the FlinkDeployment resource completely, the jobmanager
>>> deployment is also deleted.
>>>
>>> In theory we could improve the logic to eventually delete the jobmanager
>>> for suspended resources, but we currently use this as a way to guarantee
>>> more resiliency for the operator flow.
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Wed, Oct 12, 2022 at 3:56 PM Meghajit Mazumdar <
>>> meghajit.mazum...@gojek.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently deployed the Flink Operator in Kubernetes and wrote a simple
>>>> FlinkDeployment resource to run a job in application mode, following
>>>> this example:
>>>> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/pod-template.yaml
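>>>>
>>>> A minimal sketch of the kind of resource I mean (the values below are
>>>> placeholders, not the ones from the linked example):
>>>>
>>>>   apiVersion: flink.apache.org/v1beta1
>>>>   kind: FlinkDeployment
>>>>   metadata:
>>>>     name: basic-example
>>>>   spec:
>>>>     image: flink:1.15
>>>>     flinkVersion: v1_15
>>>>     serviceAccount: flink
>>>>     jobManager:
>>>>       resource:
>>>>         memory: "2048m"
>>>>         cpu: 1
>>>>     taskManager:
>>>>       resource:
>>>>         memory: "2048m"
>>>>         cpu: 1
>>>>     job:
>>>>       jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
>>>>       parallelism: 2
>>>>       upgradeMode: stateless
>>>>       state: running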
>>>>
>>>> I noticed that, even after I edited the resource and marked the
>>>> spec.job.state field as *suspended*, the metric
>>>> *jobmanager_job_uptime_value* continued to show the job status as
>>>> *running*. I did verify that after re-applying these changes, the JM and
>>>> TM pods were deleted and the cluster was not running anymore.
>>>>
>>>> Am I doing something incorrect, or is there some other metric to monitor
>>>> the job status when using the Flink Operator?
>>>>
>>>>
>>>>
>>>> --
>>>> *Regards,*
>>>> *Meghajit*
>>>>
>>>
>>
>> --
>> *Regards,*
>> *Meghajit*
>>
>
