[ https://issues.apache.org/jira/browse/FLINK-34152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822512#comment-17822512 ]
Maximilian Michels commented on FLINK-34152:
--------------------------------------------

Hi [~yang]! Thanks for taking a look at the recent changes. There have been two more follow-up PRs since the initial PR you linked. I'm very curious to hear your feedback.

{quote}We may need to dynamically adjust the Kubernetes CPU and memory limits for both the job manager and task manager eventually, to align with the automatically tuned memory and CPU parameters and prevent unnecessary resource allocation.{quote}

Tuning JobManager memory is still pending, but I agree that tuning only TaskManagers is not enough. As for CPU, I think we eventually want to tune the number of task slots to match the CPUs assigned. Scaling CPU itself is already taken care of by the autoscaler, which essentially scales based on the CPU usage of the TaskManagers.

{quote}In our specific use-case, our Flink cluster is deployed on a dedicated node group with predefined CPU and memory settings, unlike a typical Kubernetes cluster. Consequently, this auto-tuning feature might not aid in reducing infrastructure costs, as billing is based on the allocated nodes behind the scenes.{quote}

Autoscaling assumes some form of Kubernetes cluster autoscaling is active. When fewer resources are allocated, that should result in fewer nodes, but in practice it isn't quite that easy: it takes some extra work for nodes to be released when fewer resources are in use. The default Kubernetes scheduler does not bin-pack; its default behavior is to spread pods evenly across nodes, but it can be reconfigured to do bin-packing instead.
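To illustrate the bin-packing remark: the default spreading behavior can be replaced by configuring the {{NodeResourcesFit}} plugin with the {{MostAllocated}} scoring strategy in the kube-scheduler configuration. A minimal sketch (the profile name is illustrative, not something the operator requires):

{code:yaml}
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # illustrative profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            # Prefer nodes that are already heavily allocated, so lightly
            # used nodes drain and can be removed by the cluster autoscaler.
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
{code}

Pods would then need to opt into this profile via {{spec.schedulerName}} for the packing behavior to apply.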
> Tune TaskManager memory
> -----------------------
>
>                 Key: FLINK-34152
>                 URL: https://issues.apache.org/jira/browse/FLINK-34152
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Autoscaler, Kubernetes Operator
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.8.0
>
> The current autoscaling algorithm adjusts the parallelism of the job's task
> vertices according to the processing needs. By adjusting the parallelism, we
> systematically scale the amount of CPU for a task. At the same time, we also
> indirectly change the amount of memory tasks have at their disposal. However,
> there are some problems with this:
> # Memory is overprovisioned: on scale-up we may add more memory than we
> actually need. Even on scale-down, the memory/CPU ratio can still be off,
> leaving too much memory allocated.
> # Memory is underprovisioned: for stateful jobs, we risk running into
> OutOfMemoryErrors on scale-down. Even before running out of memory, too
> little memory can have a negative impact on the effectiveness of the scaling.
>
> We lack the capability to tune memory proportionally to the processing needs.
> In the same way that we measure CPU usage and size the tasks accordingly, we
> need to evaluate memory usage and adjust the heap memory size.
>
> https://docs.google.com/document/d/19GXHGL_FvN6WBgFvLeXpDABog2H_qqkw1_wrpamkFSc/edit
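The sizing idea in the description ("evaluate memory usage and adjust the heap memory size") can be sketched roughly as follows. This is an illustrative Python sketch only, not the autoscaler's actual implementation; the function name, the 20% headroom, and the 512 MB floor are all assumptions:

{code:python}
def tune_heap_mb(observed_max_usage_mb: float,
                 container_limit_mb: float,
                 headroom_ratio: float = 0.2,   # assumed safety margin
                 floor_mb: float = 512.0) -> float:
    """Size the heap to observed peak usage plus headroom.

    Instead of letting memory follow parallelism blindly, measure peak
    heap usage and target usage * (1 + headroom). Clamp to a safety
    floor (to avoid underprovisioning/OOM) and to the container limit
    (to avoid overprovisioning beyond what the pod can hold).
    """
    target = observed_max_usage_mb * (1.0 + headroom_ratio)
    return min(max(target, floor_mb), container_limit_mb)
{code}

For example, a job observed to peak at 1000 MB of heap would be sized to 1200 MB, while a tiny job is held at the floor and a heavy one is capped at the container limit.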