Hi, Puneet:

As Terry says, if your job failed unexpectedly, you could check the
restart-strategy configuration in your flink-conf.yaml. If the restart
strategy is set to disabled or none, the job transitions to FAILED as soon
as it encounters a failover. The job will also fail if the failover rate or
the number of restart attempts exceeds the configured limit. For more
information, please refer to [1] and [2].
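For example, to enable a fixed-delay restart strategy, you could set the
following in flink-conf.yaml (the attempt count and delay below are only
illustrative values, not recommendations):

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s

With such a setting, a failed task is restarted up to 3 times, waiting 10
seconds between attempts, before the whole job transitions to FAILED.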

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#fault-tolerance
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

On Mon, Mar 7, 2022 at 11:45 PM Puneet Duggal <puneetduggal1...@gmail.com>
wrote:

> Hi Terry Wang,
>
> To add to the context above: whenever a task manager goes down, the jobs
> running on it go into FAILED state and do not restart, even though there
> are enough free slots available on other task managers to restart them on.
>
> Regards,
> Puneet
>
> On 04-Mar-2022, at 4:54 PM, Terry Wang <zjuwa...@gmail.com> wrote:
>
> Hi, Puneet~
>
> AFAIK, it is expected behavior that jobs on a crashed TaskManager restart
> through failover. HA means there is no single point of failure, but a
> Flink job still needs to go through failover to ensure state and data
> consistency. You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
>
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal <puneetduggal1...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Currently in production, I have an HA session-mode Flink cluster with 3
>> job managers and multiple task managers with more than enough free task
>> slots. But I have seen multiple times that whenever a task manager goes
>> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
>> even when there are standby task managers available with free slots to
>> run them on. Has anyone faced this issue?
>>
>> Regards,
>> Puneet
>
>
>
> --
> Best Regards,
> Terry Wang
>
>
>