Hi Yang,
I understand the issue, and yes, if Flink memory must be specified in the
configuration anyway, it’s probably better to leave memory configuration in the
templates empty.
For the CPU case I still think the template’s requests/limits should have
priority if they are specified. The factor could still be used if the template
doesn’t specify anything. I’m not sure if it would be entirely intuitive, but
the logic could be something like this:
1. To choose CPU request
* Read pod template first
* If template doesn’t have anything, read from kubernetes.taskmanager.cpu
* If configuration is not specified, fall back to default
2. To choose CPU limit
* Read from template first
* If template doesn’t have anything, apply factor to what was chosen in
step 1, where the default factor is 1.
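Expressed as a rough sketch (the method and parameter names below are purely
illustrative, not actual Flink code), the precedence would be:

// Illustrative sketch of the precedence described above; not real Flink classes.
double resolveCpuRequest(Double templateRequest, Double configuredCpu, double defaultCpu) {
    if (templateRequest != null) {
        return templateRequest;          // 1a. pod template wins if present
    }
    if (configuredCpu != null) {
        return configuredCpu;            // 1b. otherwise kubernetes.taskmanager.cpu
    }
    return defaultCpu;                   // 1c. otherwise the built-in default
}

double resolveCpuLimit(Double templateLimit, double chosenRequest, double limitFactor) {
    if (templateLimit != null) {
        return templateLimit;            // 2a. pod template wins if present
    }
    return chosenRequest * limitFactor;  // 2b. otherwise request * factor (factor defaults to 1)
}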
Regards,
Alexis.
From: Yang Wang <[email protected]>
Sent: Friday, 3 September 2021 08:09
To: Alexis Sarda-Espinosa <[email protected]>
Cc: spoon_lz <[email protected]>; Denis Cosmin NUTIU <[email protected]>;
[email protected]; [email protected]
Subject: Re: Deploying Flink on Kubernetes with fractional CPU and different
limits and requests
Hi Alexis
Thanks for your valuable inputs.
First, I want to share why Flink has to overwrite the resources which are
defined in the pod template. You can find the fields that will be
overwritten by Flink here[1]. I think the major reason is that Flink needs to
ensure consistency between the Flink configuration
(taskmanager.memory.process.size, kubernetes.taskmanager.cpu)
and the pod template resource settings. Since users can specify either the total process
memory or the detailed memory[2], Flink calculates the
pod resources internally. If we allowed users to specify the resources via the pod
template, then the users would have to guarantee the configuration
consistency themselves, especially when they specify the detailed memory (e.g. heap,
managed, off-heap, etc.). I believe that would be a new burden for them.
For the limit-factor, you are right that factors aren’t linear. But I think the
factor is more flexible than an absolute value: a bigger pod can usually
make use of more burst resources. Moreover, I do not suggest setting a limit-factor
for memory, since it does not bring much benefit. In contrast,
burst CPU resources can help performance a lot.
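To make the semantics concrete (this is just an illustrative calculation, not a
reference to an existing option): with kubernetes.taskmanager.cpu: 2 and a CPU
limit-factor of 1.5, the TaskManager container would request 2 CPUs and be
limited to 2 * 1.5 = 3 CPUs, so it is scheduled against 2 CPUs but may burst up
to 3 when the node has spare capacity.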
[1].
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template
[2].
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
@spoon_lz<mailto:[email protected]> You are right. The limit-factor should be
greater than or equal to 1. And the default value is 1.
Best,
Yang
Alexis Sarda-Espinosa <[email protected]> wrote on Thursday, September 2, 2021
at 8:20 PM:
Just to provide my opinion, I find the idea of factors unintuitive for this
specific case. When I’m working with Kubernetes resources and sizing, I have to
think in absolute terms for all pods and define requests and limits with
concrete values. Using factors for Flink means that I have to think differently
for my Flink resources, and if I’m using pod templates, the switch is even
more jarring because I define what is essentially another Kubernetes resource
that I’m familiar with, yet some of the values in my template are ignored.
Additionally, if I understand correctly, factors aren’t linear, right? If
someone specifies a 1GiB request with a factor of 1.5, they only get 500MiB on
top, but if they specify 10GiB, suddenly the limit goes all the way up to 15GiB.
Regards,
Alexis.
From: spoon_lz <[email protected]>
Sent: Thursday, 2 September 2021 14:12
To: Yang Wang <[email protected]>
Cc: Denis Cosmin NUTIU <[email protected]>; Alexis Sarda-Espinosa
<[email protected]>; [email protected]; [email protected]
Subject: Re: Deploying Flink on Kubernetes with fractional CPU and different
limits and requests
Hi Yang,
I agree with you, but I think the limit-factor should be greater than or equal
to 1, and defaulting to 1 is the better choice.
If the default value were 1.5, the memory limit may exceed the actual physical
memory of a node, which may result in OOM, machine downtime, or random pod
deaths if the node runs full.
For jobs that really need it, this value can be increased appropriately.
Best,
Zhuo
On 09/02/2021 11:50, Yang Wang <[email protected]> wrote:
Given that the limit-factor should be greater than 1, then using the
limit-factor could also work for memory.
> Why do we need a larger memory resource limit than request?
A typical use case I could imagine is the page cache. Having more page cache
might improve the performance.
And they could be reclaimed when the Kubernetes node does not have enough
memory.
I still believe that it is the user's responsibility to configure proper
resources (memory and CPU) that are not too big, and to use the
limit-factor so that the Flink job can benefit from burst
resources.
Best,
Yang
spoon_lz <[email protected]> wrote on Wednesday, September 1, 2021 at 8:12 PM:
Yes, shrinking the requested memory will result in OOM. We do this because the
user-created job provides an initial value (for example, 2 CPUs and 4096 MB of
memory for the TaskManager). In most cases, the user will not change this value
unless the task fails or there is an exception such as data delay. This results
in a waste of memory for most simple ETL tasks. These isolated situations may
not apply to most other Flink users; we can adjust Kubernetes instead of Flink to
solve the resource waste problem.
Just adjusting the CPU value might be a more robust choice, and there are
probably scenarios for both decreasing the CPU request and increasing the
CPU limit.
Best,
Zhuo
On 09/01/2021 19:39, Yang Wang <[email protected]> wrote:
Hi Lz,
Thanks for sharing your ideas.
I have to admit that I prefer the limit factor for setting the resource limit,
rather than a percentage for setting the resource request.
Usually the resource request is configured or calculated by Flink, and
it indicates the resources the user requires.
It has the same semantics for all deployments (e.g. YARN, K8s). Especially for
the memory resource, giving a discount
on the resource request may cause OOM.
BTW, I am wondering why users do not simply allocate fewer resources if they do
not need them.
@Denis Cosmin NUTIU I really appreciate that
you want to work on this feature. Let's first reach a consensus
about the implementation, and then opening a PR is welcome.
Best,
Yang
spoon_lz <[email protected]> wrote on Wednesday, September 1, 2021 at 4:36 PM:
Hi everyone,
I have some other ideas about the Kubernetes resource settings. As described by
Yang Wang in [FLINK-15648], the CPU limit can be increased by a certain percentage
to provide more computational performance for jobs. Should we also consider the
alternative of shrinking the request so that more jobs can be started, which would
improve cluster resource utilization? For example, for some low-traffic tasks, we
could even set the CPU request to 0 in extreme cases. Both limit enlargement and
request shrinkage may be required.
Best,
Lz
On 09/01/2021 16:06, Denis Cosmin NUTIU <[email protected]> wrote:
Hi Yang,
I have limited Flink internals knowledge, but I can try to implement
FLINK-15648 and open up a PR on GitHub or send the patch via email. How does
that sound?
I'll sign the ICLA and switch to my personal address.
Sincerely,
Denis
On Wed, 2021-09-01 at 13:48 +0800, Yang Wang wrote:
Great. If no one wants to work on this ticket (FLINK-15648), I will try to get
this done in the next major release cycle (1.15).
Best,
Yang
Denis Cosmin NUTIU <[email protected]> wrote on Tuesday, August 31, 2021 at
4:59 PM:
Hi everyone,
Thanks for getting back to me!
> I think it would be nice if the task manager pods get their values from the
> configuration file only if the pod templates don’t specify any resources.
> That was the goal of supporting pod templates, right? Allowing more custom
> scenarios without letting the configuration options get bloated.
I think that's correct. With the current behavior, Flink will override the
resource settings: "The memory and cpu resources (including requests and limits)
will be overwritten by Flink configuration options. All other resources (e.g.
ephemeral-storage) will be retained."[1] After reading the comments on
FLINK-15648[2], I'm not sure that it can be done in a clean manner with pod
templates.
> I think it is a good improvement to support different resource requests and
> limits. And it is very useful especially for the CPU resource since it
> heavily depends on the upstream workloads.
I agree with you! I have limited knowledge of Flink internals, but the
kubernetes.jobmanager.limit-factor and kubernetes.taskmanager.limit-factor
options seem to be the right way to do it.
[1] Native Kubernetes | Apache
Flink<https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template>
[2] [FLINK-15648] Support to configure limit for CPU and memory requirement -
ASF JIRA (apache.org)<https://issues.apache.org/jira/browse/FLINK-15648>
________________________________
From: Yang Wang <[email protected]>
Sent: Tuesday, August 31, 2021 6:04 AM
To: Alexis Sarda-Espinosa <[email protected]>
Cc: Denis Cosmin NUTIU <[email protected]>; [email protected];
[email protected]
Subject: Re: Deploying Flink on Kubernetes with fractional CPU and different
limits and requests
Hi all,
I think it is a good improvement to support different resource requests and
limits. And it is very useful
especially for the CPU resource since it heavily depends on the upstream
workloads.
Actually, we (Alibaba) have introduced some internal config options to support
this feature. WDYT?
// The prefix of Kubernetes resource limit factor. It should not be less than
1. The resource
// could be cpu, memory, ephemeral-storage and all other types supported by
Kubernetes.
public static final String KUBERNETES_JOBMANAGER_RESOURCE_LIMIT_FACTOR_PREFIX =
"kubernetes.jobmanager.limit-factor.";
public static final String KUBERNETES_TASKMANAGER_RESOURCE_LIMIT_FACTOR_PREFIX =
"kubernetes.taskmanager.limit-factor.";
BTW, we already have an old ticket for this feature[1].
[1]. https://issues.apache.org/jira/browse/FLINK-15648
Best,
Yang
Alexis Sarda-Espinosa <[email protected]> wrote on Thursday, August 26, 2021
at 10:04 PM:
I think it would be nice if the task manager pods get their values from the
configuration file only if the pod templates don’t specify any resources. That
was the goal of supporting pod templates, right? Allowing more custom scenarios
without letting the configuration options get bloated.
Regards,
Alexis.
From: Denis Cosmin NUTIU <[email protected]>
Sent: Thursday, 26 August 2021 15:55
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: Deploying Flink on Kubernetes with fractional CPU and different
limits and requests
Hi Matthias,
Thanks for getting back to me and for your time!
We have some Flink jobs deployed on Kubernetes, and running kubectl top pod
gives the following result:
NAME                  CPU(cores)   MEMORY(bytes)
aa-78c8cb77d4-zlmpg   8m           1410Mi
aa-taskmanager-2-2    32m          1066Mi
bb-5f7b65f95c-jwb7t   7m           1445Mi
bb-taskmanager-2-2    32m          1016Mi
cc-54d967b55d-b567x   11m          514Mi
cc-taskmanager-4-1    11m          496Mi
dd-6fbc6b8666-krhlx   10m          535Mi
dd-taskmanager-2-2    12m          522Mi
xx-6845cf7986-p45lq   53m          526Mi
xx-taskmanager-5-2    11m          507Mi
During low workloads the jobs consume just about 100m CPU, and during high
workloads the CPU consumption increases to 500m-1000m. Having the ability to
specify requests and limits separately would give us more deployment flexibility.
Sincerely,
Denis
On Thu, 2021-08-26 at 14:22 +0200, Matthias Pohl wrote:
Hi Denis,
I did a bit of digging: It looks like there is no way to specify them
independently. You can find documentation about pod templates for TaskManager
and JobManager [1]. But even there it states that for cpu and memory, the
resource specs are overwritten by the Flink configuration. The code also
reveals that limits and requests are set using the same value [2].
I'm going to pull Yang Wang into this thread. I'm wondering whether there is a
reason for that, or whether it makes sense to create a Jira issue introducing
more specific configuration parameters for limits and requests.
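For reference, the relevant logic boils down to roughly the following (a
simplified sketch, not the verbatim code from [2], written against the fabric8
Kubernetes client API that Flink uses):

import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.ResourceRequirements;
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder;

// Requests and limits are both populated from the same configured values,
// so they cannot currently diverge.
ResourceRequirements requirements = new ResourceRequirementsBuilder()
        .addToRequests("cpu", new Quantity("0.5"))
        .addToRequests("memory", new Quantity("1728Mi"))
        .addToLimits("cpu", new Quantity("0.5"))
        .addToLimits("memory", new Quantity("1728Mi"))
        .build();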
Best,
Matthias
[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#fields-overwritten-by-flink
[2]
https://github.com/apache/flink/blob/f64261c91b195ecdcd99975b51de540db89a3f48/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L324-L332
On Thu, Aug 26, 2021 at 11:17 AM Denis Cosmin NUTIU
<[email protected]> wrote:
Hello,
I've developed a Flink job and I'm trying to deploy it on a Kubernetes
cluster using Flink Native.
Setting kubernetes.taskmanager.cpu=0.5 and
kubernetes.jobmanager.cpu=0.5 sets the requests and limits to 500m,
which is correct, but I'd like to set the requests and limits to
different values, something like:
resources:
  requests:
    memory: "1048Mi"
    cpu: "100m"
  limits:
    memory: "2096Mi"
    cpu: "1000m"
I've tried using pod templates from Flink 1.13 and manually patching
the Kubernetes deployment file; the jobmanager gets spawned with the
correct resource requests and limits, but the taskmanagers get spawned
with the defaults:
Limits:
  cpu:     1
  memory:  1728Mi
Requests:
  cpu:     1
  memory:  1728Mi
Is there any way I could set the requests/limits for the CPU/Memory to
different values when deploying Flink in Kubernetes? If not, would it
make sense to request this as a feature?
Thanks in advance!
Denis