[jira] [Created] (FLINK-32883) Support for standby task managers

Tomoyuki NAKAMURA (Jira) Wed, 16 Aug 2023 09:03:12 -0700

Tomoyuki NAKAMURA created FLINK-32883:
-----------------------------------------


             Summary: Support for standby task managers
                 Key: FLINK-32883
                 URL: https://issues.apache.org/jira/browse/FLINK-32883
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.0
            Reporter: Tomoyuki NAKAMURA


[https://docs.ververica.com/user_guide/application_operations/deployments/scaling.html#run-with-standby-taskmanager]
I would like the operator to support standby task managers, because on K8s 
pods are often evicted or deleted due to node failure or autoscaling.

With the current implementation, it is not possible to set up a standby task 
manager, and a job cannot run until all of its task managers are up and 
running. If standby task managers were supported, a job could fail over to 
the standby and keep running without waiting for a replacement pod, even if 
a task manager is unexpectedly deleted.

[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigBuilder.java#L370-L380]
If the task manager's number of replicas is set, the job's parallelism 
setting is ignored. It should be possible to support a standby task manager 
by changing this: use the job's parallelism when it is set, and fall back to 
replicas * task slots only when the job's parallelism is not set (i.e. 0).

If this change looks good, I will send a PR on GitHub.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
