Hi everyone! I have run into similar behavior with native k8s HA and multiple JobManagers, so I have a question: are there any plans to make it possible to keep the job running, without a restart, when the JobManager leader changes? Or are there insurmountable obstacles that prevent this?
I'm not very familiar with the internals of Flink, but I've tried to figure out whether this is possible. As far as I understand, the first major problem is that the JM does not have the information it needs to manage the job after a leader re-election. Currently the new JM can only obtain the ExecutionAttemptIds of the tasks running on the TaskExecutors from the heartbeat payload (as the reconciliation mechanism does now) and the information in the HA store (checkpoint info, JobGraph). Important data such as the assignment of tasks to slots is lost when the previous leader dies.

At first I thought the ArchivedExecutionGraph could be adapted to save this runtime information, but then I realized it was created for a completely different purpose. In any case, if we could obtain such information, we could add a SchedulingStrategy wrapper that, at the very least, reuses the currently active tasks instead of deploying new Executions. What do you think?

Best regards,
Manasyan Tigran

On 2022/05/10 08:59:13 Konstantin Knauf wrote:
> Hi Matyas,
>
> yes, that's expected. The feature should have never been called "high
> availability", but something like "Flink Jobmanager failover", because
> that's what it is.
>
> With standby Jobmanagers what you gain is a faster failover, because a new
> Jobmanager does not need to be started before restarting the Job. That's
> all.
>
> Cheers,
>
> Konstantin
>
> On Tue, May 10, 2022 at 10:56 AM, Őrhidi Mátyás <matyas.orh...@gmail.com> wrote:
>
> > Hi Folks!
> >
> > I've been goofing around with the JobManager HA configs using multiple JM
> > replicas (in the Flink Kubernetes Operator). It's working seemingly fine,
> > however the job itself is being restarted when you kill the leader JM pod.
> > Is this expected?
> >
> > Thanks,
> > Matyas
> >
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
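
To make the wrapper idea above a bit more concrete, here is a rough Java sketch of the decision a new leader would have to make for each subtask. It is not Flink code. `LeaderChangeReconciler`, `PersistedPlacement` and `Action` are hypothetical names, and it assumes the previous leader had persisted the task placement somewhere (for example in the HA store), which, as described above, is exactly the information that is lost today. The only input modeled on existing behavior is the set of ExecutionAttemptIds each TaskExecutor reports in its heartbeat.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types are Flink classes.
final class LeaderChangeReconciler {

    /** Decision the new leader takes for one subtask. */
    enum Action { ADOPT_RUNNING_TASK, DEPLOY_NEW_EXECUTION }

    /**
     * Assumed record of where one subtask was running. The previous leader
     * would have had to persist this; it is an assumption of this sketch.
     */
    static final class PersistedPlacement {
        final String subtask;
        final String executionAttemptId;
        final String taskExecutorId;

        PersistedPlacement(String subtask, String executionAttemptId, String taskExecutorId) {
            this.subtask = subtask;
            this.executionAttemptId = executionAttemptId;
            this.taskExecutorId = taskExecutorId;
        }
    }

    /**
     * Compares the persisted placement with the attempt ids reported in
     * heartbeats (keyed by TaskExecutor id) and decides, per subtask, whether
     * the running task can be adopted or a new execution must be deployed.
     */
    static Map<String, Action> reconcile(
            List<PersistedPlacement> persisted,
            Map<String, Set<String>> attemptsReportedByTaskExecutor) {
        Map<String, Action> plan = new LinkedHashMap<>();
        for (PersistedPlacement p : persisted) {
            // A TaskExecutor that never reported is treated as running nothing.
            Set<String> reported = attemptsReportedByTaskExecutor
                    .getOrDefault(p.taskExecutorId, Collections.emptySet());
            plan.put(p.subtask, reported.contains(p.executionAttemptId)
                    ? Action.ADOPT_RUNNING_TASK
                    : Action.DEPLOY_NEW_EXECUTION);
        }
        return plan;
    }
}
```

The sketch only covers the matching step. A real implementation would presumably also have to rebuild the rest of the scheduler's runtime state and decide what to do with the tasks connected to one that has to be redeployed; both are left out here.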