Tomoyuki NAKAMURA created FLINK-32883:
-----------------------------------------
Summary: Support for standby task managers
Key: FLINK-32883
URL: https://issues.apache.org/jira/browse/FLINK-32883
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.0
Reporter: Tomoyuki NAKAMURA
[https://docs.ververica.com/user_guide/application_operations/deployments/scaling.html#run-with-standby-taskmanager]
I would like the operator to support standby task managers, because on K8s pods
are often evicted or deleted due to node failure or autoscaling.
With the current implementation, it is not possible to set up a standby task
manager, and jobs cannot run until all task managers are up and running. If
standby task managers were supported, jobs could continue to run without
downtime on the standby task manager, even if a task manager is unexpectedly
deleted.
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigBuilder.java#L370-L380]
If the task manager's number of replicas is set, the job's parallelism setting
is ignored. It should be possible to support a standby task manager by
automatically setting parallelism to replicas * task slots only if the job's
parallelism is not set (i.e. 0), and using the job's value if parallelism is set.
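
A minimal sketch of the proposed rule, assuming the three inputs are already
available as plain values. The class and method names below are hypothetical
and are not taken from FlinkConfigBuilder; the sketch only illustrates the
precedence described above.

```java
// Hypothetical helper illustrating the proposed precedence; not operator code.
class StandbyParallelismSketch {

    /**
     * @param taskManagerReplicas configured task manager replicas, or null if unset
     * @param taskSlots           task slots per task manager
     * @param jobParallelism      job parallelism, where 0 means "not set"
     */
    static int effectiveParallelism(Integer taskManagerReplicas, int taskSlots, int jobParallelism) {
        // An explicitly set job parallelism wins, so spare slots can act as standby.
        if (jobParallelism > 0) {
            return jobParallelism;
        }
        // Parallelism not set: derive it from replicas * task slots, as proposed.
        if (taskManagerReplicas != null) {
            return taskManagerReplicas * taskSlots;
        }
        // Neither value is set: return the unset value and leave defaults to the caller.
        return jobParallelism;
    }
}
```

For example, with 3 replicas, 2 task slots each, and a job parallelism of 4,
the job would use 4 of the 6 slots, leaving one task manager's worth of slots
free as standby capacity.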
If this change looks good, I will send a PR on GitHub.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)