Hi Puneet,

As Terry says, if your job failed unexpectedly, you can check the restart-strategy configuration in your flink-conf.yaml. If the restart strategy is set to disable or none, the job transitions to FAILED as soon as it encounters a failover. The job will also fail if the failure rate or the number of restart attempts exceeds the configured limit. For more information, please refer to [1] and [2].
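As a minimal sketch, the relevant flink-conf.yaml keys look roughly like the following (exact option names depend on your Flink version; newer versions also accept restart-strategy.type, so double-check against the configuration docs in [1]):

```yaml
# With "none" (or "disable"), any failover moves the job straight to FAILED:
# restart-strategy: none

# A fixed-delay strategy retries up to N times before the job fails:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Alternatively, failure-rate fails the job once too many failures
# occur within the measuring interval:
# restart-strategy: failure-rate
# restart-strategy.failure-rate.max-failures-per-interval: 3
# restart-strategy.failure-rate.failure-rate-interval: 5 min
# restart-strategy.failure-rate.delay: 10 s
```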
Best,
Zhilong

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#fault-tolerance
[2] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

On Mon, Mar 7, 2022 at 11:45 PM Puneet Duggal <puneetduggal1...@gmail.com> wrote:

> Hi Terry Wang,
>
> Adding to the context provided above: whenever a task manager goes down,
> the jobs go into a failed state and do not restart, even though there are
> enough free slots available on other task managers to restart them on.
>
> Regards,
> Puneet
>
> On 04-Mar-2022, at 4:54 PM, Terry Wang <zjuwa...@gmail.com> wrote:
>
> Hi, Puneet~
>
> AFAIK, it is expected behavior that jobs on a crashed TaskManager
> restart. HA means there is no single point of failure, but a Flink job
> still needs to go through failover to ensure state and data consistency.
> You may refer to
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/
> for more details.
>
> On Fri, Mar 4, 2022 at 2:50 AM Puneet Duggal <puneetduggal1...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Currently in production, I have an HA session-mode Flink cluster with 3
>> job managers and multiple task managers with more than enough free task
>> slots. But I have seen multiple times that whenever a task manager goes
>> down (e.g. due to a heartbeat issue), so do all the jobs running on it,
>> even when there are standby task managers available with free slots to
>> run them on. Has anyone faced this issue?
>>
>> Regards,
>> Puneet
>
>
> --
> Best Regards,
> Terry Wang