Re: [DISCUSSION] Consider Flink operator having a way to monitor the status of bounded streaming jobs after they finish or error?

richard.su Thu, 07 Dec 2023 01:03:52 -0800

Thanks, Gyula. This is part of the technical debt from our platform's 
development, and there is little I can do about it.


If the operator's observer will sometimes miss a bounded job, we may change 
our strategy and modify the docker-entrypoint.sh of our flink-custom-image to 
capture the exit code of the JM process and do what we want after the job is 
done. This is weird, but it works.
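
A minimal sketch of such a wrapper, assuming the image's original entrypoint 
is /docker-entrypoint.sh and that it is run as a child process so the wrapper 
outlives it. The function name run_and_report and the JM_EXIT_CODE_FILE 
variable are hypothetical, and writing the code to a file only stands in for 
whatever the platform needs to do. How JM exit codes map to job states would 
still need to be verified for the Flink version in use:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper logic for a custom Flink image
# (not part of Flink or the Flink operator).

# Run a command as a child process, record its exit code,
# and return the same code to the caller.
run_and_report() {
    local status_file="$1"
    shift
    local code=0
    # "|| code=$?" keeps the wrapper alive even when the shell uses "set -e".
    "$@" || code=$?
    # Placeholder for the platform-specific action taken after the job ends.
    printf '%s\n' "$code" > "$status_file"
    return "$code"
}

# Assumed usage as the image ENTRYPOINT (path and variable are assumptions):
#   run_and_report "${JM_EXIT_CODE_FILE:-/tmp/jm-exit-code}" /docker-entrypoint.sh "$@"
#   exit $?
```

Because the JM pod is deleted shortly after the job ends, the placeholder 
would have to report somewhere outside the pod to be useful.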

I think some tips about this should be added to the Flink operator docs. The 
start-up and shutdown process of Flink jobs actually works well with Flink 
1.14; only this situation does not work.

I will go through all the version checks in the Flink operator code to find 
other potential issues. I hope this will be helpful for other users.

Thanks again.

Richard Su

> On Dec 7, 2023, at 16:45, Gyula Fóra <[email protected]> wrote:
> 
> This config has nothing to do with the operator (it's a core flink feature)
> and is not an issue after Flink 1.15.
> Newer operator versions (1.7+) drop support for Flink 1.13 and 1.14 as it's
> not feasible to maintain too many legacy codepaths.
> 
> The only solution for you is to update your Flink versions, you are missing
> out on so many improvements.
> 
> Gyula
> 
> On Thu, Dec 7, 2023 at 9:32 AM richard.su <[email protected]> wrote:
> 
>> Hi Gyula, our Flink version is 1.14.
>> It is hard for us to upgrade because we have existing users on our
>> platform.
>> Sorry, I had not noticed this configuration. It's confusing because the
>> Flink operator announced support for Flink 1.13 through 1.17/1.18.
>> 
>> Is there any other solution that would work in our situation?
>> 
>> Thanks
>> Richard Su
>> 
>>> On Dec 7, 2023, at 16:22, Gyula Fóra <[email protected]> wrote:
>>> 
>>> Hi!
>>> 
>>> What Flink version are you using?
>>> The operator always sets execution.shutdown-on-application-finish to false
>>> so that finished / failed application clusters should not exit immediately
>>> and we can observe them.
>>> 
>>> This is however only available in Flink 1.15 and above.
>>> 
>>> Cheers,
>>> Gyula
>>> 
>>> On Thu, Dec 7, 2023 at 9:15 AM richard.su <[email protected]> wrote:
>>> 
>>>> Hi Community, I found this issue, but I'm not sure whether it has a
>>>> solution. I have tried Flink operator 1.6, and the issue still exists
>>>> there.
>>>> 
>>>> If not, I think we could create a Jira issue to follow up.
>>>> 
>>>> When we create a bounded streaming job, it eventually reaches the
>>>> Finished status. After the job's status changes from Running to Finished,
>>>> Flink shuts down the Kubernetes cluster: in the flink-kubernetes package,
>>>> the deregisterApplication method of class KubernetesResourceManagerDriver
>>>> deletes the JM deployment directly, within a second (in our env).
>>>> But with our operator config, when the JM deployment status is Ready and
>>>> no savepoint is in progress, the observer interval is 15s, which means
>>>> the operator will never observe the job status change.
>>>> So if the job failed rather than finished, we cannot tell the difference.
>>>> All we know is that the JM deployment is Missing and the job status is
>>>> Reconciling.
>>>> We want to integrate the Flink operator into our platform, but it cannot
>>>> monitor the real job status, which is weird.
>>>> 
>>>> Maybe it is still related to the cleanup logic of Flink native mode. From
>>>> my side, this situation is hard to deal with on the operator side because
>>>> we cannot directly get the exit code of the container when the pod is
>>>> missing and the JM deployment is missing.
>>>> 
>>>> Thanks for taking the time to read this issue.
>>>> Richard Su
>>>>> 
>>>>> On Dec 6, 2023, at 13:34, richard.su <[email protected]> wrote:
>>>>> 
>>>>> More information to reproduce this problem:
>>>>> 
>>>>> version: flink operator 1.4
>>>>> mode: native
>>>>> job: wordcount
>>>>> language: java
>>>>> type: FlinkDeployment
>>>>> 
>>>>>> On Dec 6, 2023, at 10:52, richard.su <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Community, the default configuration of the Flink operator is:
>>>>>> 
>>>>>> kubernetes.operator.reconcile.interval: 15s
>>>>>> kubernetes.operator.observer.progress-check.interval: 5s
>>>>>> 
>>>>>> When a bounded streaming job has already stopped or failed, the JM
>>>>>> deployment stays Missing. If I set the configuration:
>>>>>> 
>>>>>> kubernetes.operator.jm-deployment-recover.enabled: false
>>>>>> 
>>>>>> then the Flink operator can only observe the job status as Reconciling
>>>>>> and the JM deployment status as Missing.
>>>>>> 
>>>>>> We cannot check whether the Flink job finished or failed, because by
>>>>>> the time the observer.progress-check interval elapses, the Flink web UI
>>>>>> is already down.
>>>>>> 
>>>>>> So we hope someone in the community could show us a way to monitor a
>>>>>> bounded streaming job's status.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Richard Su
>>>>> 
>>>> 
>>>> 
>> 
>>
