Re: Task Manager shutdown causing jobs to fail

2022-03-07 Thread Zhilong Hong
Hi, Puneet:

As Terry says, if your job fails unexpectedly, check the restart-strategy
setting in your flink-conf.yaml. If the restart strategy is set to disable
or none, the job transitions to the failed state as soon as it encounters a
failure. The job also fails if the failure rate or the number of restart
attempts exceeds the configured limit. For more information, please refer
to [1] and [2].
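
As a quick way to run the check described above, here is a minimal Python
sketch. It is not part of Flink and the function names are hypothetical. It
assumes the flat "key: value" layout of flink-conf.yaml and treats none, off,
and disable as the values that switch restarts off, as listed in [1].

```python
# Hypothetical helper, not a Flink API: inspects flink-conf.yaml text.
# Assumption: these values mean "no restart strategy" (see [1]).
NO_RESTART = {"none", "off", "disable"}


def restart_strategy(conf_text):
    """Return the configured restart-strategy value, or None if it is unset."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Match the exact key only, not restart-strategy.fixed-delay.* etc.
        if key.strip() == "restart-strategy":
            return value.strip().lower()
    return None


def restarts_after_failure(conf_text):
    """True if a restart strategy is set, False if disabled, None if unset."""
    value = restart_strategy(conf_text)
    if value is None:
        return None
    return value not in NO_RESTART


# Example configuration that retries a failed job 3 times, 10 s apart.
sample = """
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
"""

if __name__ == "__main__":
    print(restarts_after_failure(sample))
```

If the key is not set at all, the function returns None. In that case the
default depends on whether checkpointing is enabled for the job, so it is
worth checking [1] for your Flink version.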

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#fault-tolerance
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

On Mon, Mar 7, 2022 at 11:45 PM Puneet Duggal wrote:

> Hi Terry Wang,
>
> To add to the context above: whenever a task manager goes down, its jobs
> go into the failed state and do not restart, even though there are enough
> free slots available on other task managers to restart them on.
>
> Regards,
> Puneet
>
> On 04-Mar-2022, at 4:54 PM, Terry Wang wrote:
>
> Hi, Puneet~
>
> AFAIK, it is expected behavior that jobs on a crashed TaskManager
> restart. HA means there is no single point of failure, but a Flink job
> still needs to go through failover to ensure state and data consistency.
> You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
>
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal wrote:
>
>> Hi,
>>
>> Currently in production, I have an HA session-mode Flink cluster with 3
>> job managers and multiple task managers with more than enough free task
>> slots. But I have seen multiple times that whenever a task manager goes
>> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
>> even when there are standby task managers available with free slots to
>> run them on. Has anyone faced this issue?
>>
>> Regards,
>> Puneet
>
>
>
> --
> Best Regards,
> Terry Wang
>
>
>


Re: Task Manager shutdown causing jobs to fail

2022-03-07 Thread Puneet Duggal
Hi Terry Wang,

To add to the context above: whenever a task manager goes down, its jobs go
into the failed state and do not restart, even though there are enough free
slots available on other task managers to restart them on.

Regards,
Puneet

> On 04-Mar-2022, at 4:54 PM, Terry Wang wrote:
> 
> Hi, Puneet~
> 
> AFAIK, it is expected behavior that jobs on a crashed TaskManager
> restart. HA means there is no single point of failure, but a Flink job
> still needs to go through failover to ensure state and data consistency.
> You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
> 
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal wrote:
> Hi,
> 
> Currently in production, I have an HA session-mode Flink cluster with 3
> job managers and multiple task managers with more than enough free task
> slots. But I have seen multiple times that whenever a task manager goes
> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
> even when there are standby task managers available with free slots to
> run them on. Has anyone faced this issue?
> 
> Regards, 
> Puneet
> 
> 
> -- 
> Best Regards,
> Terry Wang



Re: Task Manager shutdown causing jobs to fail

2022-03-04 Thread Terry Wang
Hi, Puneet~

AFAIK, it is expected behavior that jobs on a crashed TaskManager
restart. HA means there is no single point of failure, but a Flink job
still needs to go through failover to ensure state and data consistency.
You may refer to
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
for more details.

On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal wrote:

> Hi,
>
> Currently in production, I have an HA session-mode Flink cluster with 3
> job managers and multiple task managers with more than enough free task
> slots. But I have seen multiple times that whenever a task manager goes
> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
> even when there are standby task managers available with free slots to
> run them on. Has anyone faced this issue?
>
> Regards,
> Puneet



-- 
Best Regards,
Terry Wang


Task Manager shutdown causing jobs to fail

2022-03-03 Thread Puneet Duggal
Hi,

Currently in production, I have an HA session-mode Flink cluster with 3 job
managers and multiple task managers with more than enough free task slots.
But I have seen multiple times that whenever a task manager goes down (e.g.
due to a heartbeat issue), so do all the jobs running on it, even when there
are standby task managers available with free slots to run them on. Has
anyone faced this issue?

Regards, 
Puneet