You could also check out the Autoscaler logic in the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/). On the current main and in the upcoming 1.5.0 release the mechanism is pretty nice and solid :)
It works with the native integration, so you can also set standby TMs with a simple config.

Cheers,
Gyula

On Fri, Apr 28, 2023 at 7:31 AM Wei Hou <wei....@airbnb.com> wrote:

> Thank you for all your responses! I think Gyula is right: simply doing
> MAX - some_offset is not ideal, as it can make the standby TM useless.
> It is difficult for the scheduler to determine whether a pod has been lost
> or scaled down when we enable autoscaling, which affects its decision to
> utilize standby TMs. We probably need to monitor the HPA events in order to
> get this information.
> I will wait to see if there is a solution for this problem in the future.
>
> On Wed, Apr 26, 2023 at 7:20 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> I think the behaviour is going to get a little weird because this would
>> actually defeat the purpose of the standby TM.
>> MAX - some_offset will decrease once you lose a TM, so in this case we
>> would scale down to again have a spare (which we never actually use).
>>
>> Gyula
>>
>> On Wed, Apr 26, 2023 at 4:02 PM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> Reactive mode doesn't support standby taskmanagers. As you said, it
>>> always uses all available resources in the cluster.
>>>
>>> I can see it being useful though to not always scale to MAX but to
>>> (MAX - some_offset).
>>>
>>> I'd suggest filing a ticket.
>>>
>>> On 26/04/2023 00:17, Wei Hou via user wrote:
>>> > Hi Flink community,
>>> >
>>> > We are trying to use Flink’s reactive mode with Kubernetes HPA for
>>> > autoscaling. However, since reactive mode always uses all available
>>> > resources, this causes a problem when we need standby task managers
>>> > for fast failure recovery: the job will always use these extra
>>> > standby task managers as active task managers to process data.
>>> >
>>> > I wonder if you have any suggestions on this. Should we avoid using
>>> > Flink reactive mode together with standby task managers?
>>> >
>>> > Best,
>>> > Wei
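
To make the MAX - some_offset concern from the thread concrete, here is a minimal Python sketch. It assumes a purely hypothetical scaling rule, `reactive_target`, that always uses every registered TaskManager except `offset` of them. This is not a Flink API or an existing reactive-mode option, only arithmetic illustrating the point raised above.

```python
def reactive_target(available_tms: int, offset: int) -> int:
    """Hypothetical 'MAX - some_offset' rule (not a real Flink setting):
    run the job on all registered TaskManagers except `offset` spares."""
    return max(available_tms - offset, 0)


def simulate_tm_loss(available_tms: int, offset: int) -> tuple[int, int]:
    """Return the number of active TMs before and after losing one TM."""
    before = reactive_target(available_tms, offset)
    # Losing a TM lowers MAX by one, so the target drops by one as well.
    after = reactive_target(available_tms - 1, offset)
    return before, after


before, after = simulate_tm_loss(available_tms=5, offset=1)
print(f"active TMs before loss: {before}, after loss: {after}")
```

Under this rule the job would scale down by one TM after the failure rather than moving onto the spare, so the standby TM never absorbs the loss.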