Hi Nan,

Thanks for the candid response. I see where you are coming from regarding managed rollouts, but I think we are looking at this through two different lenses: "Internal Platform" vs. "General Open Source Product."
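Before getting into the details, here is a small, purely illustrative sketch (in Scala, only because that is where ResourceProfile lives) of the SPIP's formula G = O - min((H + O) × (B - 1), O) together with the floor I am asking for. To be clear about the assumptions: minGuaranteedRatio is a proposed knob, not an existing Spark config, the numbers are just the 100GB-heap example from earlier in this thread, and this is a sketch, not the actual implementation:

// Illustrative only. Assumes the formula from the SPIP's Appendix C;
// minGuaranteedRatio is my proposed addition and does not exist in Spark today.
object BurstyOverheadSketch {
  // Guaranteed memoryOverhead (MiB) for heap H, overhead O, bursty factor B.
  def guaranteedOverheadMiB(heapMiB: Long, overheadMiB: Long, burstyFactor: Double,
                            minGuaranteedRatio: Double = 0.0): Long = {
    val deduction = math.min(((heapMiB + overheadMiB) * (burstyFactor - 1)).toLong, overheadMiB)
    val g = overheadMiB - deduction                        // current SPIP formula: G = O - min((H+O)(B-1), O)
    math.max(g, (overheadMiB * minGuaranteedRatio).toLong) // proposed optional safety floor
  }

  def main(args: Array[String]): Unit = {
    val (h, o) = (100L * 1024, 5L * 1024)                  // 100 GB heap, 5 GB overhead
    println(guaranteedOverheadMiB(h, o, 1.06))             // 0   -> current formula yields G = 0 for this job
    println(guaranteedOverheadMiB(h, o, 1.06, 0.10))       // 512 -> a 10% floor keeps ~0.5 GB for thread stacks etc.
  }
}

The intent is that with the defaults (minGuaranteedRatio = 0.0 and no priorityClassName set) the generated executor pod spec stays exactly what the current design produces; only operators who explicitly opt in would see a non-zero guaranteed floor or a priorityClassName field.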
Here is why I am pushing for these two specific configuration hooks:

1. Re: "Imagined Reasons" & Zero Overhead

You mentioned that you have observed jobs running fine with zero memoryOverhead. That may be true for specific workloads in your environment, but the need for non-heap memory is not "imagined". The JVM requires it: thread stacks, CodeCache, and Netty DirectByteBuffer control structures must live outside the heap.

- *The Scenario:* If G = 0, then Pod Request == Heap. On a fully bin-packed node (Sum of Requests = Node Capacity), the executor is guaranteed *zero bytes* of non-heap memory unless it can borrow from the burst pool.
- *The Risk:* If the burst pool is temporarily exhausted by neighbors, even creating a thread fails with "OutOfMemoryError: unable to create new native thread".
- *The Fix:* I am not asking to change your default behavior. I am asking to *expose the config* (minGuaranteedRatio). At the default of 0.0 your behavior is unchanged, but those of us running high-concurrency environments who need a 5-10% safety buffer for thread stacks get the *capability* to set it without maintaining a fork or writing pre-submission wrappers.

2. Re: Kubelet Eviction Relevance

You asked how Disk/PID pressure is related. In Kubernetes, PriorityClass is the universal signal for pod importance during any node-pressure event, not just memory pressure.

- If a node runs out of ephemeral storage (common with Spark shuffle), the Kubelet evicts pods.
- Without a priorityClassName config, these Spark pods (now QoS-downgraded to Burstable) can be evicted *before* Best-Effort jobs that carry a higher priority class.
- This is a standard Kubernetes feature; there is no downside to exposing spark.kubernetes.executor.bursty.priorityClassName as an optional config.

*Proposal to Unblock*

We both want this feature merged, and I am not asking to change your formula's default behavior. Can we agree to simply *add these two parameters as optional configurations*?

1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
2. priorityClassName (default: null -> preserves your logic exactly).

This satisfies your design goals while making the feature robust enough for my production requirements.

Regards,

Viquar Khan
Sr Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/

On Tue, 30 Dec 2025 at 01:04, Nan Zhu <[email protected]> wrote: > > However, I maintain that for a general-purpose open-source feature > (which will be used by teams without dedicated platform engineers to manage > rollouts), we need structural safety guardrails. > > I am not sure we can roll out such a feature to all jobs without a managed > rollout process, it is an anti-pattern in any engineering org. this > feature is disabled by default which is already a guard to prevent users > silently getting into what they won't expect > > > > A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing up > the design"—it is mathematically necessary to prevent the formula from > collapsing to zero in valid production scenarios. > > this formula *IS* designed to output 0 in some cases, so it is *NOT* > collapsing to zero.. I observed that , even with 0 memoryoverhead in many > jobs, with a proper bursty factor saved tons of money in a real PROD > environment instead of from my imagination..
if you don't want any > memoryOverhead to be zero in your job for your imagined reasons, you can > just calculate your threshold for on-heap/memoryOverhead ratio for rolling > out > > step back... if your team doesn't know how to manage rollout, most likely > you are rolling out this feature for individual jobs without a centralized > feature rolling out point, right? then, you can just use the simple > arithmetics to calculate whether the resulting memoryOverhead is 0, if yes, > don't use this feature, that's it.... > > > > However, Kubelet eviction is the primary mechanism for other pressure > types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios > where memory.available crosses the eviction threshold before the kernel > panics. > > How are they related to this feature? > > > On Mon, Dec 29, 2025 at 10:37 PM vaquar khan <[email protected]> > wrote: > >> Hi Nan, >> >> Thanks for the detailed reply. I appreciate you sharing the specific >> context from the Pinterest implementation—it helps clarify the operational >> model you are using. >> >> However, I maintain that for a general-purpose open-source feature (which >> will be used by teams without dedicated platform engineers to manage >> rollouts), we need structural safety guardrails. >> >> *Here is my response to your points:* >> >> 1. Re: "Zero-Guarantee" & Safety (Critical) >> >> You suggested that "setting a conservative bursty factor" resolves the >> risk of zero-guaranteed overhead. >> >> Mathematically, this is incorrect for High-Heap jobs. The formula is >> structural: G = O - \min((H+O) \times (B-1), O). >> >> Consider a standard ETL job: Heap (H) = 100GB, Overhead (O) = 5GB. >> >> Even if we set a very conservative Bursty Factor (B) of 1.06 (only 6% >> burst): >> >> - >> >> Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 = 6.3GB. >> - >> >> Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*. >> >> Even with an extremely conservative factor, the design forces this pod to >> have zero guaranteed memory for OS/JVM threads. This is not a tuning issue; >> it is a formulaic edge case for high-memory jobs. >> >> * A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing up >> the design"—it is mathematically necessary to prevent the formula from >> collapsing to zero in valid production scenarios.* >> >> 2. Re: Kubelet Eviction vs. OOMKiller >> >> I concede that in sudden memory spikes, the Kernel OOMKiller often acts >> faster than Kubelet eviction. >> >> However, Kubelet eviction is the primary mechanism for other pressure >> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios >> where memory.available crosses the eviction threshold before the kernel >> panics. >> >> * Adding priorityClassName support to the Pod spec is a low-effort, >> zero-risk change that aligns with Kubernetes best practices for "Defense in >> Depth." It costs nothing to expose this config.* >> >> 3. Re: Native Support >> >> Fair point. To keep the scope tight, I am happy to drop the Native >> Support request for this SPIP. We can treat that as a separate follow-up. >> >> Path Forward >> >> I am happy to support if we can agree to: >> >> 1. >> >> Add the minGuaranteedRatio config (to handle the High-Heap math >> proven above). >> 2. >> >> Expose the priorityClassName config (standard K8S practice). 
>> >> >> Regards, >> >> Viquar Khan >> >> Sr Data Architect >> >> https://www.linkedin.com/in/vaquar-khan-b695577/ >> >> >> On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote: >> >>> > Kubelet Eviction is the first line of defense before the Kernel >>> OOMKiller strikes. >>> >>> This is *NOT* true. Eviction will be the first to kill some best effort >>> pod which doesn't make any difference on memory pressure in most cases and >>> before it takes action again, Kernel OOMKiller already killed some executor >>> pods. This is exactly the reason for me to say, we don't really worry about >>> eviction here, before eviction touches those executors, OOMKiller already >>> killed them. This behavior is consistently observed and we also had >>> discussions with other companies who had to modify Kernel code to mitigate >>> this behavior. >>> >>> > Re: "Zero-Guarantee" & Safety >>> >>> you basically want to tradeoff saving with system safety , then why not >>> just setting a conservative value of bursty factor? it is exactly what we >>> did in PINS, please check my earlier response in the thread ... key part as >>> following: >>> >>> "in PINS, we basically apply a set of strategies, setting conservative >>> bursty factor, progressive rollout, monitor the cluster metrics like Linux >>> Kernel OOMKiller occurrence to guide us to the optimal setup of bursty >>> factor... in usual, K8S operators will set a reserved space for daemon >>> processes on each host, we found it is sufficient to in our case and our >>> major tuning focuses on bursty factor value " >>> >>> If you really want, you can enable this feature only for jobs when >>> OnHeap/MemoryOverhead is smaller than a certain value... >>> >>> I just didn't see the value of bringing another configuration >>> >>> >>> > Re: Native Support >>> >>> I mean....this SPIP is NOT about native execution engine's memory >>> pattern at all..... why do we bother to bring it up.... >>> >>> >>> >>> On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]> >>> wrote: >>> >>>> Hi Nan, >>>> >>>> Thanks for the prompt response and for clarifying the design intent. >>>> >>>> I understand the goal is to maximize savings—and I agree we shouldn't >>>> block the current momentum even though you can see my vote is +1—but I want >>>> to ensure we aren't over-optimizing for specific internal environments at >>>> the cost of general community stability. >>>> >>>> *Here is my rejoinder on the technical points:* >>>> >>>> 1. Re: PriorityClass & OOMKiller (Defense in Depth) >>>> >>>> You mentioned that “priorityClassName is NOT the solution... What we >>>> worry about is the Linux Kernel OOMKiller.” >>>> >>>> I agree that the Kernel OOMKiller (cgroup) primarily looks at >>>> oom_score_adj (which is determined by QoS Class). However, Kubelet Eviction >>>> is the first line of defense before the Kernel OOMKiller strikes. >>>> >>>> When a node comes under memory pressure (e.g., memory.available drops >>>> below evictionHard), the *Kubelet* actively selects pods to evict to >>>> reclaim resources. Unlike the Kernel, the Kubelet *does* explicitly >>>> use PriorityClass when ranking candidates for eviction. >>>> >>>> - >>>> >>>> *The Risk:* Since we are downgrading these pods to *Burstable* >>>> (increasing their OOM risk), we lose the "Guaranteed" protection shield. >>>> - >>>> >>>> *The Fix:* By assigning a high PriorityClass, we ensure that if the >>>> Kubelet needs to free space, it evicts lower-priority batch jobs >>>> *before* these Spark executors. 
It is a necessary "Defense in >>>> Depth" strategy for multi-tenant clusters that prevents optimized Spark >>>> jobs from being the first victims of node pressure. >>>> >>>> 2. Re: "Zero-Guarantee" & Safety >>>> >>>> You noted that “savings come from these 0 memory overhead pods.” >>>> >>>> While G=0 maximizes "on-paper" savings, it is theoretically unsafe for >>>> a JVM. A JVM physically requires non-heap memory for Thread Stacks, >>>> CodeCache, and Metaspace just to run. >>>> >>>> - >>>> >>>> *The Reality:* If G=0, then Pod Request == Heap. If a node is fully >>>> packed (Sum of Requests ≈ Node Capacity), the pod relies *100%* on >>>> the burst pool for basic thread allocation. If neighbors are noisy, that >>>> pod cannot even spawn a thread. >>>> - >>>> >>>> *The Compromise:* I strongly suggest we add the configuration >>>> spark.executor.memoryOverhead.minGuaranteedRatio but set the *default >>>> to 0.0*. >>>> - >>>> >>>> This preserves your logic/savings by default. >>>> - >>>> >>>> But it gives platform admins a "safety knob" to turn (e.g., to >>>> 0.1) when they inevitably encounter instability in high-contention >>>> environments, without needing a code patch. >>>> >>>> 3. Re: Native Support >>>> >>>> Agreed. We can treat Off-Heap support as a follow-up item. I would just >>>> request that we add a "Known Limitation" note in the SPIP stating that this >>>> optimization does not yet apply to spark.memory.offHeap.size, so users of >>>> Gluten/Velox are aware. >>>> >>>> I am happy to support the PR moving forward if we can agree to include >>>> the *PriorityClass* config support and the *Safety Floor* config (even >>>> if disabled by default) ,Please update your SIP. This ensures the feature >>>> is robust enough for the wider user base. >>>> >>>> Regards, >>>> >>>> Viquar Khan >>>> >>>> Sr Data Architect >>>> >>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>> >>>> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote: >>>> >>>>> Hi, Vaquar >>>>> >>>>> thanks for the replies, >>>>> >>>>> 1. for Guaranteed QoS >>>>> >>>>> I may missed some words in the original doc, the idea I would like to >>>>> convey is that we essentially need to give up this idea due to what cannot >>>>> achieve Guaranteed QoS as we already have different values of memory and >>>>> even we only consider CPU request/limit values, it brings other risks to >>>>> us >>>>> >>>>> Additionally , priorityClassName is NOT the solution here. What we >>>>> really worry about is NOT eviction, but Linux Kernel OOMKiller where we >>>>> cannot pass pod priority information into. With burstable pods, the >>>>> only thing Linux Kernel OOMKiller considers is the memory request size >>>>> which not necessary maps to priority information >>>>> >>>>> 2. The "Zero-Guarantee" Edge Case >>>>> >>>>> Actually, a lot of savings are from these 0 memory overhead pods... I >>>>> am curious if you have adopted the PoC PR in prod as you have identified >>>>> it >>>>> is "unsafe"? >>>>> >>>>> Something like a minGuaranteedRatio is not a good idea , it will mess >>>>> up the original design idea of the formula (check Appendix C), the >>>>> simplest >>>>> thing you can do is to avoid rolling out features to the jobs which you >>>>> feel will be unsafe... >>>>> >>>>> 3. Native Execution Gap (Off-Heap) >>>>> >>>>> I am not sure the off-heap memory usage of gluten/comet shows the >>>>> same/similar pattern as memoryOverhead. 
No one has validated that in >>>>> production environment, but both PINS/Bytedance has validated >>>>> memoryOverhead part thoroughly in their clusters >>>>> >>>>> Additionally, the key design of the proposal is to capture the >>>>> relationship between on-heap and memoryOverhead sizes, in another word, >>>>> they co-exist.... offheap memory used by native engines are different >>>>> stories where , ideally, the on-heap usage should be minimum and most of >>>>> memory usage should come from off-heap part...so the formula here may not >>>>> work out of box >>>>> >>>>> my suggestion is, since the community has approved the original design >>>>> which have been tested by at least 2 companies in production environments, >>>>> we go with the current design and continue code review , in future, we can >>>>> add what have been found/tested in production as followups >>>>> >>>>> Thanks >>>>> >>>>> Nan >>>>> >>>>> >>>>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Yao, Nan, and Chao, >>>>>> >>>>>> Thank you for this proposal though I already approved . The >>>>>> cost-efficiency goals are very compelling, and the cited $6M annual >>>>>> savings >>>>>> at Pinterest clearly demonstrates the value of moving away from >>>>>> rigid peak provisioning. >>>>>> >>>>>> However, after modeling the proposed design against standard >>>>>> Kubernetes behavior and modern Spark workloads, I have identified *three >>>>>> critical stability risks* that need to be addressed before this is >>>>>> finalized. >>>>>> >>>>>> I have drafted a *Supplementary Design Amendment* (linked >>>>>> below/attached) that proposes fixes for these issues, but here is the >>>>>> summary: >>>>>> 1. The "Guaranteed QoS" Contradiction >>>>>> >>>>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for >>>>>> stability risks2. >>>>>> >>>>>> The Issue: Technically, this mitigation is impossible under your >>>>>> proposal. >>>>>> >>>>>> - >>>>>> >>>>>> In Kubernetes, a Pod is assigned the *Guaranteed* QoS class *only* >>>>>> if Request == Limit for both CPU and Memory. >>>>>> - >>>>>> >>>>>> Your proposal explicitly sets Memory Request < Memory Limit >>>>>> (specifically $H+G < H+O$)3. >>>>>> >>>>>> - >>>>>> >>>>>> *Consequence:* This configuration *automatically downgrades* the >>>>>> Pod to the *Burstable* QoS class. In a multi-tenant cluster, the >>>>>> Kubelet eviction manager will kill these "Burstable" Spark pods >>>>>> *before* any Guaranteed system pods during node pressure. >>>>>> - >>>>>> >>>>>> *Proposed Fix:* We cannot rely on Guaranteed QoS. We must >>>>>> introduce a priorityClassName configuration to offset this >>>>>> eviction risk. >>>>>> >>>>>> 2. The "Zero-Guarantee" Edge Case >>>>>> >>>>>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ 4 has a >>>>>> dangerous edge case for High-Heap/Low-Overhead jobs (common in ETL). >>>>>> >>>>>> >>>>>> - >>>>>> >>>>>> *Scenario:* If a job has a large Heap ($H$) relative to Overhead ( >>>>>> $O$), the calculated burst deduction often exceeds the total >>>>>> Overhead. >>>>>> - >>>>>> >>>>>> *Result:* The formula yields *$G = 0$*. >>>>>> - >>>>>> >>>>>> *Risk:* Allocating 0MB of guaranteed overhead is unsafe. >>>>>> Essential JVM operations (thread stacks, Netty control buffers) >>>>>> require a >>>>>> non-zero baseline. Relying 100% on a shared burst pool for basic >>>>>> functionality will lead to immediate container failures if the node is >>>>>> contended. 
>>>>>> - >>>>>> >>>>>> *Proposed Fix:* Implement a safety floor using a >>>>>> minGuaranteedRatio (e.g., max(Calculated_G, O * 0.1)). >>>>>> >>>>>> 3. Native Execution Gap (Off-Heap) >>>>>> >>>>>> The proposal focuses entirely on memoryOverhead5. >>>>>> >>>>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift >>>>>> execution memory to spark.memory.offHeap.size. This memory is equally >>>>>> "bursty" but is excluded from your optimization. >>>>>> >>>>>> *Proposed Fix: *The burst-aware logic should be extensible to >>>>>> include Off-Heap memory if enabled. >>>>>> >>>>>> >>>>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing >>>>>> >>>>>> >>>>>> I believe these changes are necessary to make the feature robust >>>>>> enough for general community adoption beyond specific controlled >>>>>> environments. >>>>>> >>>>>> >>>>>> Regards, >>>>>> >>>>>> Viquar Khan >>>>>> >>>>>> Sr Data Architect >>>>>> >>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>>>> >>>>>> >>>>>> >>>>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote: >>>>>> >>>>>>> +1 >>>>>>> >>>>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> +1 >>>>>>>> >>>>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> +1 from me. >>>>>>>>> I think it's well-scoped and takes advantage of Kubernetes' >>>>>>>>> features exactly for what they are designed for(as per my >>>>>>>>> understanding). >>>>>>>>> >>>>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the >>>>>>>>>> detailed and thoughtful discussion. >>>>>>>>>> >>>>>>>>>> Overall, this looks like a valuable addition for organizations >>>>>>>>>> running Spark on Kubernetes, especially given how bursty >>>>>>>>>> memoryOverhead usage tends to be in practice. I appreciate that >>>>>>>>>> the change is relatively small in scope and fully opt-in, which >>>>>>>>>> helps keep >>>>>>>>>> the risk low. >>>>>>>>>> >>>>>>>>>> From my perspective, the questions raised on the thread and in >>>>>>>>>> the SPIP have been addressed. If others feel the same, do we have >>>>>>>>>> consensus >>>>>>>>>> to move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya. 
>>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Chao >>>>>>>>>> >>>>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> this is a good question >>>>>>>>>>> >>>>>>>>>>> > a stage is bursty and consumes the shared portion and fails to >>>>>>>>>>> release it for subsequent stages >>>>>>>>>>> >>>>>>>>>>> in the scenario you described, since the memory-leaking stage >>>>>>>>>>> and the subsequence ones are from the same job , the pod will >>>>>>>>>>> likely be >>>>>>>>>>> killed by cgroup oomkiller >>>>>>>>>>> >>>>>>>>>>> taking the following as the example >>>>>>>>>>> >>>>>>>>>>> the usage pattern is G = 5GB S = 2GB, it uses G + S at max and >>>>>>>>>>> in theory, it should release all 7G and then claim 7G again in some >>>>>>>>>>> later >>>>>>>>>>> stages, however, due to the memory peak, it holds 2G forever and >>>>>>>>>>> ask for >>>>>>>>>>> another 7G, as a result, it hits the pod memory limit and cgroup >>>>>>>>>>> oomkiller will take action to terminate the pod >>>>>>>>>>> >>>>>>>>>>> so this should be safe to the system >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> however, we should be careful about the memory peak for sure, >>>>>>>>>>> because it essentially breaks the assumption that the usage of >>>>>>>>>>> memoryOverhead is bursty (memory peak ~= use memory forever)... >>>>>>>>>>> unfortunately, shared/guaranteed memory is managed by user >>>>>>>>>>> applications >>>>>>>>>>> instead of on cluster level , they, especially S, are just logical >>>>>>>>>>> concepts instead of a physical memory pool which pods can >>>>>>>>>>> explicitly claim >>>>>>>>>>> memory from... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the interesting proposal. >>>>>>>>>>>> The design seems to rely on memoryOverhead being transient. >>>>>>>>>>>> What happens when a stage is bursty and consumes the shared >>>>>>>>>>>> portion and fails to release it for subsequent stages (e.g., >>>>>>>>>>>> off-heap >>>>>>>>>>>> buffers and its not garbage collected since its off-heap)? Would >>>>>>>>>>>> this >>>>>>>>>>>> trigger the host-level OOM like described in Q6? or are there >>>>>>>>>>>> strategies to >>>>>>>>>>>> release the shared portion? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> yes, that's the worst case in the scenario, please check my >>>>>>>>>>>>> earlier response to Qiegang's question, we have a set of >>>>>>>>>>>>> strategies adopted >>>>>>>>>>>>> in prod to mitigate the issue >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for the explanation! So the executor is not guaranteed >>>>>>>>>>>>>> to get 50 GB physical memory, right? All pods on the same host >>>>>>>>>>>>>> may reach >>>>>>>>>>>>>> peak memory usage at the same time and cause paging/swapping >>>>>>>>>>>>>> which hurts >>>>>>>>>>>>>> performance? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> np, let me try to explain >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. Each executor container will be run in a pod together >>>>>>>>>>>>>>> with some other sidecar containers taking care of tasks like >>>>>>>>>>>>>>> authentication, etc. 
, for simplicity, we assume each pod has >>>>>>>>>>>>>>> only one >>>>>>>>>>>>>>> container which is the executor container >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2. Each container is assigned with two values, r >>>>>>>>>>>>>>> *equest&limit** (limit >= request),* for both of CPU/memory >>>>>>>>>>>>>>> resources (we only discuss memory here). Each pod will have >>>>>>>>>>>>>>> request/limit >>>>>>>>>>>>>>> values as the sum of all containers belonging to this pod >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on >>>>>>>>>>>>>>> *request* value, and cap the resource usage of each >>>>>>>>>>>>>>> container based on their *limit* value, e.g. if I have a >>>>>>>>>>>>>>> pod with a single container in it , and it has 1G/2G as request >>>>>>>>>>>>>>> and limit >>>>>>>>>>>>>>> value respectively, any machine with 1G free RAM space will be >>>>>>>>>>>>>>> a candidate >>>>>>>>>>>>>>> to host this pod, and when the container use more than 2G >>>>>>>>>>>>>>> memory, it will >>>>>>>>>>>>>>> be killed by cgroup oomkiller. Once a pod is scheduled to a >>>>>>>>>>>>>>> host, the >>>>>>>>>>>>>>> memory space sized at "sum of all its containers' request >>>>>>>>>>>>>>> values" will be >>>>>>>>>>>>>>> booked exclusively for this pod. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 4. By default, Spark *sets request/limit as the same value >>>>>>>>>>>>>>> for executors in k8s*, and this value is basically >>>>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most >>>>>>>>>>>>>>> cases . >>>>>>>>>>>>>>> However, spark.executor.memoryOverhead usage is very bursty, >>>>>>>>>>>>>>> the user >>>>>>>>>>>>>>> setting spark.executor.memoryOverhead as 10G usually means >>>>>>>>>>>>>>> each executor >>>>>>>>>>>>>>> only needs 10G in a very small portion of the executor's whole >>>>>>>>>>>>>>> lifecycle >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple >>>>>>>>>>>>>>> request/limit value in spark@k8s for executors in a safe >>>>>>>>>>>>>>> way (this idea is from the bytedance paper we refer to in SPIP >>>>>>>>>>>>>>> paper). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Using the aforementioned example , >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we >>>>>>>>>>>>>>> have two pods requesting 40G + 10G (on-heap + memoryOverhead) >>>>>>>>>>>>>>> and we set >>>>>>>>>>>>>>> bursty factor to 1.2, without the mechanism proposed in this >>>>>>>>>>>>>>> SPIP, we can >>>>>>>>>>>>>>> at most host 2 pods with this machine, and because of the >>>>>>>>>>>>>>> bursty usage of >>>>>>>>>>>>>>> that 10G space, the memory utilization would be compromised. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When applying the burst-aware memory allocation, we only >>>>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, >>>>>>>>>>>>>>> i.e. we >>>>>>>>>>>>>>> have 20G free memory space left in the machine which can be >>>>>>>>>>>>>>> used to host >>>>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the >>>>>>>>>>>>>>> limit value of >>>>>>>>>>>>>>> the executor pods, these executors can still use 50G at max. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does >>>>>>>>>>>>>>>> it work under the hood? 
The container will adjust its system >>>>>>>>>>>>>>>> memory size >>>>>>>>>>>>>>>> depending on the actual memory usage of the processes in this >>>>>>>>>>>>>>>> container? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> yeah, we have a few cases that we have significantly >>>>>>>>>>>>>>>>> larger O than H, the proposed algorithm is actually a great >>>>>>>>>>>>>>>>> fit >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed >>>>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure the safety >>>>>>>>>>>>>>>>> of running but >>>>>>>>>>>>>>>>> still cut a big chunk of memory (10s of GBs) and treat them >>>>>>>>>>>>>>>>> as S , saving >>>>>>>>>>>>>>>>> tons of money burnt by them >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> but regarding native accelerators, some native >>>>>>>>>>>>>>>>> acceleration engines do not use memoryOverhead but use >>>>>>>>>>>>>>>>> off-heap >>>>>>>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The >>>>>>>>>>>>>>>>> current >>>>>>>>>>>>>>>>> implementation does not cover this part , while that will be >>>>>>>>>>>>>>>>> an easy >>>>>>>>>>>>>>>>> extension >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for the reply. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than H? >>>>>>>>>>>>>>>>>> Wondering if the proposed algorithm would help more in those >>>>>>>>>>>>>>>>>> environments >>>>>>>>>>>>>>>>>> (eg. with >>>>>>>>>>>>>>>>>> native accelerators)? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> please check the following answer >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> yes, your understanding is correct >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the >>>>>>>>>>>>>>>>>>> total potential usage sum of H+G+S across all pods on a >>>>>>>>>>>>>>>>>>> node exceeds its >>>>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on >>>>>>>>>>>>>>>>>>> the cluster >>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer >>>>>>>>>>>>>>>>>>> exists on the node >>>>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies, setting >>>>>>>>>>>>>>>>>>> conservative bursty factor, progressive rollout, monitor >>>>>>>>>>>>>>>>>>> the cluster >>>>>>>>>>>>>>>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us >>>>>>>>>>>>>>>>>>> to the optimal >>>>>>>>>>>>>>>>>>> setup of bursty factor... 
in usual, K8S operators will set >>>>>>>>>>>>>>>>>>> a reserved space >>>>>>>>>>>>>>>>>>> for daemon processes on each host, we found it is >>>>>>>>>>>>>>>>>>> sufficient to in our case >>>>>>>>>>>>>>>>>>> and our major tuning focuses on bursty factor value >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure >>>>>>>>>>>>>>>>>>> a strategic mix of executors with large S and small S >>>>>>>>>>>>>>>>>>> values on a single >>>>>>>>>>>>>>>>>>> node? I am wondering if this would reduce the probability >>>>>>>>>>>>>>>>>>> of concurrent >>>>>>>>>>>>>>>>>>> bursting and host-level OOM. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Yes, when we work on this project, we put some attention >>>>>>>>>>>>>>>>>>> on the cluster scheduling policy/behavior... two things we >>>>>>>>>>>>>>>>>>> mostly care >>>>>>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have >>>>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have >>>>>>>>>>>>>>>>>>> enough candidates >>>>>>>>>>>>>>>>>>> to form a mixed set of executors with large S and >>>>>>>>>>>>>>>>>>> small S values >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which >>>>>>>>>>>>>>>>>>> tends to pack more pods from the same job to the same host, >>>>>>>>>>>>>>>>>>> which can >>>>>>>>>>>>>>>>>>> create troubles as they are more likely to ask for max >>>>>>>>>>>>>>>>>>> memory at the same >>>>>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. I >>>>>>>>>>>>>>>>>>>> have a few questions regarding the shared portion S: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when >>>>>>>>>>>>>>>>>>>> the total potential usage sum of H+G+S across all pods >>>>>>>>>>>>>>>>>>>> on a node exceeds >>>>>>>>>>>>>>>>>>>> its allocatable capacity? Does the proposal implicitly >>>>>>>>>>>>>>>>>>>> rely on the cluster >>>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory >>>>>>>>>>>>>>>>>>>> buffer exists on the node >>>>>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to >>>>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and >>>>>>>>>>>>>>>>>>>> small S values on a single node? I am wondering if >>>>>>>>>>>>>>>>>>>> this would reduce the probability of concurrent >>>>>>>>>>>>>>>>>>>> bursting and host-level OOM. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big picture: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formular >>>>>>>>>>>>>>>>>>>>> indicates a fixed heap size, and memory overhead can't >>>>>>>>>>>>>>>>>>>>> be dynamic if it's >>>>>>>>>>>>>>>>>>>>> on-heap. >>>>>>>>>>>>>>>>>>>>> - Do Spark applications have static profiles? 
When >>>>>>>>>>>>>>>>>>>>> we submit stages, the cluster is already allocated, >>>>>>>>>>>>>>>>>>>>> how can we change >>>>>>>>>>>>>>>>>>>>> anything? >>>>>>>>>>>>>>>>>>>>> - How do we assign the shared memory overhead? >>>>>>>>>>>>>>>>>>>>> Fairly among all applications on the same physical >>>>>>>>>>>>>>>>>>>>> node? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc >>>>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple... >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 >>>>>>>>>>>>>>>>>>>>>> of the SPIP doc >>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> it is calculated based on per profile (you can say it >>>>>>>>>>>>>>>>>>>>>> is based on per stage), when the cluster manager compose >>>>>>>>>>>>>>>>>>>>>> the pod spec, it >>>>>>>>>>>>>>>>>>>>>> calculates the new memory overhead based on what user >>>>>>>>>>>>>>>>>>>>>> asks for in that >>>>>>>>>>>>>>>>>>>>>> resource profile >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the >>>>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per >>>>>>>>>>>>>>>>>>>>>>> executor? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on >>>>>>>>>>>>>>>>>>>>>>>> the request/limit concept in K8S, ... >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming in >>>>>>>>>>>>>>>>>>>>>>>> future, as long as it has a similar concept , it can >>>>>>>>>>>>>>>>>>>>>>>> leverage this easily >>>>>>>>>>>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it >>>>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao < >>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Folks, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead >>>>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve >>>>>>>>>>>>>>>>>>>>>>>>>> memory utilization of spark clusters. >>>>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc >>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. >>>>>>>>>>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature. 
>>>>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original >>>>>>>>>>>>>>>>>>>>>>>>>> paper >>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from >>>>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected]) >>>>>>>>>>>>>>>>>>>>>>>>>> and Yixin([email protected]). >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thank you. >>>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>> -- >>>>>> Regards, >>>>>> Vaquar Khan >>>>>> >>>>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> >>>> >> >> -- >> Regards, >> Vaquar Khan >> >> -- Regards, Vaquar Khan
