Re: Job leak in attached mode (batch scenario)

qi luo Wed, 17 Jul 2019 19:32:11 -0700

Thanks Haibo for the response!

Is there any community issue or plan to implement a heartbeat mechanism between 
the Dispatcher and the Client? If not, should I create one?
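To make the proposal concrete, here is a minimal sketch of the cluster-side half of such a heartbeat. It is hypothetical and not a Flink API. The class `HeartbeatWatchdog`, its methods, and the 30-second timeout are illustrative assumptions. So is the idea that the client would reach `record_heartbeat()` through some REST call on the Dispatcher. While attached, the client pings periodically. If the client dies (e.g. OOM), the pings stop, and after the timeout the watchdog calls a cancel callback. In a real deployment, that callback would cancel the job or shut down the cluster.

```python
import threading
import time
from typing import Callable, Dict, List


class HeartbeatWatchdog:
    """Hypothetical cluster-side watchdog (not a Flink API).

    Tracks the last heartbeat per job and invokes `on_expired(job_id)` for any
    job whose client has been silent for longer than `timeout_s` seconds.
    """

    def __init__(self, timeout_s: float, on_expired: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._on_expired = on_expired
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}
        self._lock = threading.Lock()

    def register(self, job_id: str) -> None:
        # Treat job submission as the first heartbeat.
        self.record_heartbeat(job_id)

    def record_heartbeat(self, job_id: str) -> None:
        # Would be triggered by the (assumed) REST heartbeat from the client.
        with self._lock:
            self._last_seen[job_id] = self._clock()

    def unregister(self, job_id: str) -> None:
        # Job finished normally: stop watching it, nothing to cancel.
        with self._lock:
            self._last_seen.pop(job_id, None)

    def check(self) -> List[str]:
        """Cancel and forget every job whose client has gone silent too long."""
        now = self._clock()
        with self._lock:
            expired = [j for j, t in self._last_seen.items()
                       if now - t > self._timeout_s]
            for j in expired:
                del self._last_seen[j]
        # Run callbacks outside the lock so a slow cancel does not block heartbeats.
        for j in expired:
            self._on_expired(j)
        return expired

    def run(self, interval_s: float, stop: threading.Event) -> None:
        # Sweep every `interval_s` seconds until `stop` is set (e.g. in a daemon thread).
        while not stop.wait(interval_s):
            self.check()


if __name__ == "__main__":
    fake_now = [0.0]
    cancelled: List[str] = []
    wd = HeartbeatWatchdog(timeout_s=30.0, on_expired=cancelled.append,
                           clock=lambda: fake_now[0])
    wd.register("job-a")
    wd.register("job-b")
    fake_now[0] = 20.0
    wd.record_heartbeat("job-a")  # client A is still alive
    fake_now[0] = 45.0            # client B has been silent for 45 s (> 30 s)
    print(wd.check())             # prints ['job-b']
```

A monotonic clock avoids false expirations when wall-clock time jumps. The timeout should span several heartbeat intervals so that a single delayed ping (for example, during a GC pause on the client) does not kill a healthy job.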


Regards,
Qi

> On Jul 17, 2019, at 10:19 AM, Haibo Sun <sunhaib...@163.com> wrote:
> 
> Hi, Qi
> 
> As far as I know, there is no such mechanism at the moment. To achieve this, I
> think it may be necessary to add a REST-based heartbeat mechanism between the
> Dispatcher and the Client. For now, you could perhaps add a monitoring service
> that cleans up these residual Flink clusters.
> 
> Best,
> Haibo
> 
> At 2019-07-16 14:42:37, "qi luo" <luoqi...@gmail.com> wrote:
> Hi guys,
> 
> We run thousands of Flink batch jobs every day. The batch jobs are submitted
> in attached mode, so the client tells us when a job has finished and we can
> then take further actions. To respond to user abort actions, we submit the
> jobs with "--shutdownOnAttachedExit" so that the Flink cluster is shut down
> when the client exits.
> 
> However, in some cases when the Flink client exits abnormally (e.g. OOM),
> the shutdown signal is never sent to the Flink cluster, causing a “job
> leak”: the lingering Flink job keeps running and never ends, consuming a
> large amount of resources and possibly even producing unexpected results.
> 
> Does Flink have any mechanism to handle such a scenario (e.g. Spark has client
> mode, where the driver runs on the client side, so the job exits when the
> client exits)? Any ideas would be much appreciated!
> 
> Thanks,
> Qi
