Re: [Question] How to scale application based on 'reactive' mode

Dennis Jung Mon, 23 Oct 2023 00:45:36 -0700

Hello,
Thanks for the feedback. I'll start with these.

Regards


On Thu, Sep 7, 2023 at 7:08 PM, Gyula Fóra <[email protected]> wrote:

> Jung,
> I don't want to sound unhelpful, but I think the best thing for you to do
> is simply to try these different models in your local env.
> It should be very easy to get started with the Kubernetes Operator on
> Kind/Minikube (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
> )
>
> It's very difficult to answer these questions fully here. Try the
> different modes, observe what happens, read the docs and you will get all
> the answers.
>
> Gyula
>
> On Thu, Sep 7, 2023 at 10:11 AM Dennis Jung <[email protected]> wrote:
>
>> Hello Chen,
>> Thanks for your reply! I have some further questions:
>>
>> 1. In non-reactive mode with Flink 1.18, if the autoscaler already adjusts
>> parallelism, what difference does using 'reactive' mode make?
>> 2. If I use Flink 1.15~1.17 without the autoscaler, is the difference
>> with 'reactive' mode that parallelism changes dynamically when the number
>> of TMs changes (manually, or by a custom scaler)?
>>
>> Regards,
>> Jung
>>
>>
>> On Tue, Sep 5, 2023 at 3:59 PM, Chen Zhanghao <[email protected]> wrote:
>>
>>> Hi Dennis,
>>>
>>>
>>>    1. In Flink 1.18 + non-reactive mode, the autoscaler adjusts the job's
>>>    parallelism. The job requests extra TMs if the current ones cannot
>>>    satisfy its needs, and redundant TMs are released automatically
>>>    later once they are idle. In other words, parallelism changes cause the
>>>    number of TMs to change.
>>>    2. The core metric used is busy time (the amount of time spent on
>>>    task processing per second = 1 s - backpressured time - idle time). It
>>>    is considered superior because it takes I/O cost etc. into account as
>>>    well. Also, the metric has per-task granularity, which allows us to
>>>    identify bottleneck tasks.
>>>    3. The autoscaler feature currently only works with the K8s operator +
>>>    native K8s mode.
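
The busy-time definition in point 2 can be illustrated with a short Python sketch. This is not the operator's code; the `TaskTimes` container, the function names, and the sample numbers are all hypothetical, and times are assumed to be in milliseconds per second of wall-clock time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskTimes:
    """Hypothetical per-task time breakdown, in ms per second of wall-clock time."""
    name: str
    backpressured_ms: float
    idle_ms: float


def busy_ms_per_second(t: TaskTimes) -> float:
    """Busy time = 1 s - backpressured time - idle time, clamped to [0, 1000] ms."""
    busy = 1000.0 - t.backpressured_ms - t.idle_ms
    return max(0.0, min(1000.0, busy))


def bottleneck(tasks: list[TaskTimes]) -> TaskTimes:
    """Return the task with the highest busy time (the likely bottleneck)."""
    return max(tasks, key=busy_ms_per_second)


# Made-up sample values for illustration only.
tasks = [
    TaskTimes("source", backpressured_ms=700.0, idle_ms=100.0),
    TaskTimes("window-agg", backpressured_ms=0.0, idle_ms=50.0),
    TaskTimes("sink", backpressured_ms=0.0, idle_ms=600.0),
]

for t in tasks:
    print(f"{t.name}: busy {busy_ms_per_second(t):.0f} ms/s")
print("bottleneck:", bottleneck(tasks).name)
```

In this made-up example the source is mostly backpressured rather than busy, which points at the downstream task as the one to scale.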
>>>
>>>
>>> Best,
>>> Zhanghao Chen
>>> ------------------------------
>>> *From:* Dennis Jung <[email protected]>
>>> *Sent:* September 2, 2023 12:58
>>> *To:* Gyula Fóra <[email protected]>
>>> *Cc:* [email protected] <[email protected]>
>>> *Subject:* Re: [Question] How to scale application based on 'reactive' mode
>>>
>>> Hello,
>>> Thanks for your reply.
>>>
>>> 1. In "Flink 1.18 + non-reactive", is parallelism changed by the
>>> number of TMs?
>>> 2. The document (
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.6/docs/custom-resource/autoscaler/)
>>> says "we are not using any container memory / CPU utilization metrics
>>> directly here". Which metrics does it use internally?
>>> 3. I'm using standalone k8s (
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/)
>>> for deployment. Is the autoscaler feature only available when using the
>>> "flink k8s operator" (sorry, I don't understand this clearly yet...)?
>>>
>>> Regards
>>>
>>>
>>> On Fri, Sep 1, 2023 at 10:20 PM, Gyula Fóra <[email protected]> wrote:
>>>
>>> Pretty much, except that with Flink 1.18 the autoscaler can scale the job
>>> in place without restarting the JM (even without reactive mode).
>>>
>>> So actually the best option is the autoscaler with Flink 1.18 native mode
>>> (no reactive).
>>>
>>> Gyula
>>>
>>> On Fri, 1 Sep 2023 at 13:54, Dennis Jung <[email protected]> wrote:
>>>
>>> Thanks for the feedback.
>>> Could you check whether I understand correctly?
>>>
>>> *Only using 'reactive' mode:*
>>> By manually adding a TaskManager (TM) (such as by using
>>> './bin/taskmanager.sh start'), parallelism will be increased. For example,
>>> when job parallelism is 1 with 1 TM, adding 1 new TM restarts the
>>> JobManager and raises parallelism to 2.
>>> But the number of TMs is not controlled automatically.
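
The 1 TM to 2 TMs example above can be written as a small Python sketch. It assumes one slot per TaskManager and that parallelism is capped by a maximum parallelism; the function name and the cap of 128 are illustrative assumptions, not Flink APIs.

```python
def reactive_parallelism(num_taskmanagers: int, slots_per_tm: int = 1,
                         max_parallelism: int = 128) -> int:
    """Parallelism a job would get if it always used every available slot.

    Hypothetical helper: available slots = TMs * slots per TM, capped by an
    assumed maximum parallelism.
    """
    if num_taskmanagers < 0 or slots_per_tm < 1 or max_parallelism < 1:
        raise ValueError("counts must be non-negative and limits positive")
    return min(num_taskmanagers * slots_per_tm, max_parallelism)


# One TM with one slot, then a second TM is added manually.
before = reactive_parallelism(num_taskmanagers=1)
after = reactive_parallelism(num_taskmanagers=2)
print(f"parallelism before: {before}, after adding a TM: {after}")
```

Nothing in this sketch decides when to add the second TM; that decision has to come from outside.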
>>>
>>> *Autoscaler + non-reactive:*
>>> It can flexibly control the number of TMs based on several metrics (CPU
>>> usage, throughput, ...), and the JobManager will be restarted when scaling.
>>> But job parallelism stays the same after the number of TMs has changed.
>>>
>>> *Autoscaler + 'reactive' mode*:
>>> It can control the number of TMs based on metrics, and increase/decrease
>>> job parallelism by changing the number of TMs.
>>>
>>> Regards,
>>> Jung
>>>
>>> On Fri, Sep 1, 2023 at 8:16 PM, Gyula Fóra <[email protected]> wrote:
>>>
>>> I would look at reactive scaling as a way to increase / decrease
>>> parallelism.
>>>
>>> It’s not a way to automatically decide when to actually do it, as you
>>> need to create new TMs.
>>>
>>> The autoscaler could use reactive mode to change the parallelism, but you
>>> need the autoscaler itself to decide when new resources should be added.
>>>
>>> On Fri, 1 Sep 2023 at 13:09, Dennis Jung <[email protected]> wrote:
>>>
>>> So far, what I've found about 'reactive' mode is that it
>>> automatically adjusts 'job parallelism' when the number of TaskManagers
>>> is increased/decreased.
>>>
>>>
>>> https://www.slideshare.net/FlinkForward/autoscaling-flink-with-reactive-mode
>>>
>>> Is there some other feature that only 'reactive' mode offers for scaling?
>>>
>>> Thanks.
>>> Regards.
>>>
>>>
>>>
>>> On Fri, Sep 1, 2023 at 4:56 PM, Dennis Jung <[email protected]> wrote:
>>>
>>> Hello,
>>> Thank you for your response. I have a few more questions about the following:
>>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/
>>>
>>> *Reactive Mode configures a job so that it always uses all resources
>>> available in the cluster. Adding a TaskManager will scale up your job,
>>> removing resources will scale it down. Flink will manage the parallelism of
>>> the job, always setting it to the highest possible values.*
>>> => Does this mean that when I add/remove a TaskManager in 'non-reactive'
>>> mode, the resources (CPU/memory/etc.) of the cluster are not changed?
>>>
>>> *Reactive Mode restarts a job on a rescaling event, restoring it from
>>> the latest completed checkpoint. This means that there is no overhead of
>>> creating a savepoint (which is needed for manually rescaling a job). Also,
>>> the amount of data that is reprocessed after rescaling depends on the
>>> checkpointing interval, and the restore time depends on the state size.*
>>> => As far as I know, 'rescaling' also works in non-reactive mode by
>>> restoring from a checkpoint. What difference does 'reactive' make here?
>>>
>>> *The Reactive Mode allows Flink users to implement a powerful
>>> autoscaling mechanism, by having an external service monitor certain
>>> metrics, such as consumer lag, aggregate CPU utilization, throughput or
>>> latency. As soon as these metrics are above or below a certain threshold,
>>> additional TaskManagers can be added or removed from the Flink cluster.*
>>> => Why is this only possible in 'reactive' mode? It seems more
>>> related to the 'autoscaler'. Are there specific features/APIs that can
>>> control TaskManagers/parallelism only in 'reactive' mode?
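
The "external service" described in the quoted documentation can be sketched as a threshold rule in Python. This is only an illustration under assumptions: consumer lag is the monitored metric, the thresholds and TM limits are made up, and actually adding or removing TaskManagers (for example by changing a replica count) is left out.

```python
def desired_taskmanagers(current: int, consumer_lag: int,
                         scale_up_lag: int, scale_down_lag: int,
                         min_tms: int = 1, max_tms: int = 10) -> int:
    """Decide how many TaskManagers the cluster should have.

    Hypothetical policy: add one TM when lag is above `scale_up_lag`,
    remove one when lag is below `scale_down_lag`, otherwise keep the count.
    """
    if scale_down_lag >= scale_up_lag:
        raise ValueError("scale_down_lag must be below scale_up_lag")
    if consumer_lag > scale_up_lag:
        target = current + 1
    elif consumer_lag < scale_down_lag:
        target = current - 1
    else:
        target = current
    # Keep the result inside the allowed range.
    return max(min_tms, min(max_tms, target))


# Made-up lag readings against thresholds of 1,000 (down) and 50,000 (up).
for lag in (120_000, 20_000, 300):
    print(lag, "->", desired_taskmanagers(3, lag, 50_000, 1_000))
```

The monitor only changes the TM count. Under reactive mode, the job's parallelism would then follow the available resources without a separate rescale call.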
>>>
>>> Thank you.
>>>
>>> On Fri, Sep 1, 2023 at 3:30 PM, Gyula Fóra <[email protected]> wrote:
>>>
>>> The reactive mode reacts to available resources. The autoscaler reacts
>>> to changing load and processing capacity and adjusts resources.
>>>
>>> Completely different concepts and applicability.
>>> Most people want the autoscaler, but this is a recent feature and is
>>> specific to the k8s operator at the moment.
>>>
>>> Gyula
>>>
>>> On Fri, 1 Sep 2023 at 04:50, Dennis Jung <[email protected]> wrote:
>>>
>>> Hello,
>>> Thanks for your reply.
>>>
>>> Then what is the purpose of using 'reactive' mode if it doesn't do
>>> anything by itself?
>>> What is the difference if I use the autoscaler without 'reactive' mode?
>>>
>>> Regards,
>>> Jung
>>>
>>>
>>>
>>> On Fri, Aug 18, 2023 at 7:51 PM, Gyula Fóra <[email protected]> wrote:
>>>
>>> Hi!
>>>
>>> I think what you need is probably not the reactive mode but a proper
>>> autoscaler. The reactive mode, as you say, doesn't do anything in itself;
>>> you need to build a lot of logic around it.
>>>
>>> Check this instead:
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>>>
>>> The Kubernetes Operator has a built-in autoscaler that can scale jobs
>>> based on Kafka data rate / processing throughput. It also doesn't rely on
>>> the reactive mode.
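
To make "scale based on data rate / processing throughput" concrete, here is a rough Python sketch of a rate-based parallelism estimate. It is not the operator's actual algorithm; the function, its parameters, the target utilization of 0.8, and the sample rates are assumptions for illustration.

```python
import math


def required_parallelism(incoming_rate: float, per_subtask_rate: float,
                         target_utilization: float = 0.8,
                         max_parallelism: int = 128) -> int:
    """Estimate the parallelism needed to keep up with the incoming rate.

    Hypothetical formula: each subtask is assumed to handle
    `per_subtask_rate` records/s at full load and should run at
    `target_utilization` of that capacity.
    """
    if per_subtask_rate <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rates and utilization must be positive")
    needed = math.ceil(incoming_rate / (per_subtask_rate * target_utilization))
    # Always keep at least one subtask and respect the upper bound.
    return max(1, min(max_parallelism, needed))


# Made-up rates: 10,000 records/s arriving, 2,000 records/s per subtask.
print(required_parallelism(incoming_rate=10_000, per_subtask_rate=2_000))
```

Unlike the reactive-mode case, the output here is a target parallelism, from which the needed resources follow, rather than the other way around.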
>>>
>>> Cheers,
>>> Gyula
>>>
>>> On Fri, Aug 18, 2023 at 12:43 PM Dennis Jung <[email protected]>
>>> wrote:
>>>
>>> Hello,
>>> Sorry for the frequent questions. This is a question about 'reactive' mode.
>>>
>>> 1. As far as I understand, even though I've set up `scheduler-mode: reactive`,
>>> it will not change parallelism automatically by itself based on CPU usage
>>> or Kafka consumer rate. It needs an additional resource-monitoring feature
>>> (such as the Horizontal Pod Autoscaler or something similar). Is this correct?
>>> 2. Is it possible to create a custom resource monitor
>>> application? For example, if I want to increase/decrease parallelism based
>>> on the Kafka consumer rate, do I need to call a specific API from outside
>>> to trigger rescaling?
>>> 3. If 2 is correct, what difference does using 'reactive' mode make?
>>> As far as I can tell, calling a specific API would rescale the job whether
>>> or not 'reactive' mode is used... (or does the API only work in this mode)?
>>>
>>> Thanks.
>>>
>>> Regards
>>>
>>>
