Aviv Dozorets created FLINK-35594:
-------------------------------------

             Summary: Downscaling doesn't release TaskManagers.
                 Key: FLINK-35594
                 URL: https://issues.apache.org/jira/browse/FLINK-35594
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.18.1
         Environment: * Flink 1.18.1 (Java 11, Temurin).
 * Kubernetes Operator 1.8
 * Kubernetes version v1.28.9-eks-036c24b (AWS EKS).

Autoscaling configuration:
{code:yaml}
jobmanager.scheduler: adaptive
job.autoscaler.enabled: "true"
job.autoscaler.metrics.window: 15m
job.autoscaler.stabilization.interval: 15m
job.autoscaler.scaling.effectiveness.threshold: 0.2
job.autoscaler.target.utilization: "0.75"
job.autoscaler.target.utilization.boundary: "0.25"
job.autoscaler.metrics.busy-time.aggregator: "AVG"
job.autoscaler.restart.time-tracking.enabled: "true"
{code}
            Reporter: Aviv Dozorets
         Attachments: Screenshot 2024-06-10 at 12.50.37 PM.png

(Follow-up to a Slack conversation in the #troubleshooting channel).

Recently I've observed a behavior that should be improved:

A Flink DataStream job that runs with the autoscaler (backed by the 
Kubernetes Operator) and the Adaptive scheduler doesn't release a node 
(TaskManager) when scaling down. In my example the job started with an 
initial parallelism of 64 on 4 TaskManagers with 16 cores each (1:1 
core:slot), and was later scaled down to a parallelism of 16.
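
Under a 1:1 core:slot layout the expected footprint is plain ceiling 
division; here is a minimal sketch of that arithmetic (class and method 
names are just illustrative):
{code:java}
// Minimal sketch of the expected TaskManager footprint, assuming the
// 1:1 core:slot layout described above (16 slots per TaskManager).
public class ExpectedFootprint {

    static int taskManagersFor(int parallelism, int slotsPerTm) {
        // ceiling division: enough TMs to host all parallel subtasks
        return (parallelism + slotsPerTm - 1) / slotsPerTm;
    }

    public static void main(String[] args) {
        System.out.println(taskManagersFor(64, 16)); // 4 -> initial footprint
        System.out.println(taskManagersFor(16, 16)); // 1 -> expected after scale-down
    }
}
{code}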

My expectation: only 1 TaskManager should remain up and running.

Reality: all 4 initial TaskManagers are still running, each holding a 
different number of free slots (the sketch below shows one way to confirm 
the slot distribution).

I didn't find an existing configuration option to change this behavior.
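
For anyone reproducing this: the per-TaskManager slot distribution can be 
read from the JobManager REST API (GET /taskmanagers lists each TaskManager 
with its total and free slots). Below is a minimal sketch using Java 11's 
built-in HttpClient; the JobManager address is a placeholder assumption:
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SlotCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder address: substitute the REST endpoint of the deployment.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/taskmanagers"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Each entry reports the TaskManager's total and free slots, which
        // is where the unequal distribution after scale-down shows up.
        System.out.println(response.body());
    }
}
{code}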


