Trystan created FLINK-35285:
-------------------------------

             Summary: Autoscaler key group optimization can interfere with 
scale-down.max-factor
                 Key: FLINK-35285
                 URL: https://issues.apache.org/jira/browse/FLINK-35285
             Project: Flink
          Issue Type: Bug
            Reporter: Trystan


When a less aggressive scale-down limit is configured, the key group 
optimization can prevent a vertex from scaling down at all. It searches from 
the target parallelism upwards to maxParallelism/2 for a value that divides 
maxParallelism, and it can land on the current parallelism again every time.

 

A simple test that reproduces this by trying to scale down from a parallelism 
of 60 with a scale-down.max-factor of 0.2:
{code:java}
assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360));
{code}
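
The arithmetic behind that test can be traced by hand. The sketch below is a simplified stand-in for the key group loop, not the operator's actual code: the class and method names are hypothetical, and the fallback to the unadjusted target is an assumption. With a target of 48 (60 * 0.8) and a maxParallelism of 360, no value from 48 through 59 divides 360, so the first match is 60, which is the current parallelism.

```java
// Simplified, hypothetical stand-in for the key group alignment loop.
// It is not the code in JobVertexScaler; it only mirrors the search described above.
class KeyGroupHuntSketch {

    /** Searches upward from the target for a parallelism that divides maxParallelism. */
    static int alignUpward(int newParallelism, int maxParallelism, int upperBound) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        // Assumption: fall back to the unadjusted target when no divisor is found.
        return newParallelism;
    }

    public static void main(String[] args) {
        int target = 48; // 60 * 0.8, the scale-down limit from the test
        // 48..59 leave remainders when dividing 360; 60 is the first exact divisor.
        System.out.println(alignUpward(target, 360, 360)); // prints 60
    }
}
```

Under these assumptions the search returns the parallelism the vertex already has, so every scale-down attempt from 60 ends where it started.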
 

It seems reasonable to make a good attempt to spread data across subtasks, but 
not at the expense of total deadlock. The problem is that during scale down the 
optimization doesn't actually ensure that newParallelism will be < 
currentParallelism.

 

Something like the following is clunky, but it would ensure that scaling can 
make at least some progress. Another test now fails with this change, so it is 
only meant to illustrate the point:
{code:java}
for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
    if ((scaleFactor < 1 && p < currentParallelism)
            || (scaleFactor > 1 && p > currentParallelism)) {
        if (maxParallelism % p == 0) {
            return p;
        }
    }
} {code}
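
For experimenting with the guarded loop outside the operator, a minimal self-contained sketch might look like the following. The class and method names are hypothetical, the parameters are passed explicitly rather than read from the autoscaler's context, and the fallback to the originally computed parallelism when no divisor qualifies is an assumption.

```java
// Hypothetical standalone version of the guarded search; not the operator's code.
class GuardedKeyGroupSketch {

    /**
     * Searches upward from newParallelism for a divisor of maxParallelism, but only
     * accepts candidates that move away from currentParallelism in the scaling direction.
     */
    static int alignWithProgress(
            int currentParallelism,
            int newParallelism,
            double scaleFactor,
            int maxParallelism,
            int upperBound) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            boolean makesProgress =
                    (scaleFactor < 1 && p < currentParallelism)
                            || (scaleFactor > 1 && p > currentParallelism);
            if (makesProgress && maxParallelism % p == 0) {
                return p;
            }
        }
        // Assumption: with no acceptable divisor, keep the originally computed target.
        return newParallelism;
    }

    public static void main(String[] args) {
        // Scale down from 60 with a target of 48: no divisor of 360 lies in 48..59,
        // so the unadjusted target is returned instead of the current parallelism.
        System.out.println(alignWithProgress(60, 48, 0.8, 360, 360)); // prints 48
        // Scale down from 60 with a target of 42: 45 divides 360 and is below 60.
        System.out.println(alignWithProgress(60, 42, 0.7, 360, 360)); // prints 45
    }
}
```

In this sketch the vertex gives up the even key group split when no divisor lies below the current parallelism, which trades balanced distribution for the ability to scale down.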
 

Perhaps this is by design and not a bug, but total failure to scale down in 
order to keep optimized key groups does not seem ideal.

 

Key group optimization block:

https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
