Hi Nan,

Thanks for the prompt response and for clarifying the design intent.

I understand the goal is to maximize savings, and as my +1 on the vote
shows, I do not want to block the current momentum. At the same time, I
want to make sure we aren't over-optimizing for specific internal
environments at the cost of general community stability.

*Here is my rejoinder on the technical points:*

1. Re: PriorityClass & OOMKiller (Defense in Depth)

You mentioned that “priorityClassName is NOT the solution... What we worry
about is the Linux Kernel OOMKiller.”

I agree that the Kernel OOMKiller (cgroup) primarily looks at oom_score_adj
(which is determined by QoS Class). However, Kubelet Eviction is the first
line of defense before the Kernel OOMKiller strikes.

When a node comes under memory pressure (e.g., memory.available drops below
evictionHard), the *Kubelet* actively selects pods to evict to reclaim
resources. Unlike the Kernel, the Kubelet *does* explicitly use
PriorityClass when ranking candidates for eviction.

   - *The Risk:* Since we are downgrading these pods to *Burstable*
   (increasing their OOM risk), we lose the "Guaranteed" protection shield.

   - *The Fix:* By assigning a high PriorityClass, we ensure that if the
   Kubelet needs to free space, it evicts lower-priority batch jobs *before*
   these Spark executors. It is a necessary "Defense in Depth" strategy for
   multi-tenant clusters that prevents optimized Spark jobs from being the
   first victims of node pressure (a rough sketch of the ranking follows
   below).
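
For clarity, here is a minimal model of the ranking I am referring to. It
is not Kubelet source code, only a sketch of the ordering described in the
Kubernetes node-pressure eviction docs (usage exceeding requests, then pod
priority, then usage above requests); the names are illustrative:

   // Rough model of the documented Kubelet eviction ranking (illustrative
   // sketch, not Kubelet code): pods exceeding their requests go first,
   // ties broken by lower PriorityClass value, then by larger excess usage.
   case class PodStats(name: String, priority: Int, requestMiB: Long, usageMiB: Long)

   def evictionOrder(pods: Seq[PodStats]): Seq[PodStats] =
     pods.sortBy { p =>
       val exceedsRequest = p.usageMiB > p.requestMiB
       (!exceedsRequest,                   // exceeding requests sorts first
        p.priority,                        // lower priority sorts earlier
        -(p.usageMiB - p.requestMiB))      // larger excess sorts earlier
     }

The point is simply that, unlike the kernel OOMKiller path, this path does
see pod priority, so a high PriorityClass keeps the newly-Burstable Spark
executors away from the front of that queue.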

2. Re: "Zero-Guarantee" & Safety

You noted that “savings come from these 0 memory overhead pods.”

While G = 0 maximizes the "on-paper" savings, it is inherently unsafe for
a JVM: a JVM needs non-heap memory for thread stacks, the CodeCache, and
Metaspace just to run.
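
To make the edge case concrete with illustrative numbers (not taken from
the doc), applying the formula quoted further down the thread with
$H = 40$ GiB, $O = 4$ GiB and $B = 1.2$:

   $G = O - \min\{(H+O) \times (B-1), O\} = 4 - \min\{44 \times 0.2, 4\} = 4 - 4 = 0$

i.e. every byte of overhead becomes shared and nothing is guaranteed.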

   - *The Reality:* If G=0, then Pod Request == Heap. If a node is fully
   packed (Sum of Requests ≈ Node Capacity), the pod relies *100%* on the
   burst pool for basic thread allocation. If neighbors are noisy, that pod
   cannot even spawn a thread.

   - *The Compromise:* I strongly suggest we add the configuration
   spark.executor.memoryOverhead.minGuaranteedRatio but set the *default to
   0.0* (a sketch of how it would compose with the formula follows below).

      - This preserves your logic/savings by default.

      - But it gives platform admins a "safety knob" to turn (e.g., to 0.1)
      when they inevitably encounter instability in high-contention
      environments, without needing a code patch.
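
To be concrete about that compromise, here is a minimal sketch of how the
floor could sit on top of the existing formula. minGuaranteedRatio is the
knob I am proposing, not an existing config, and all names below are
purely illustrative:

   // Sketch only: the SPIP formula plus the proposed safety floor.
   def guaranteedOverheadMiB(
       heapMiB: Long,              // H: spark.executor.memory
       overheadMiB: Long,          // O: spark.executor.memoryOverhead
       burstFactor: Double,        // B: the SPIP's bursty factor
       minGuaranteedRatio: Double  // proposed knob; 0.0 = current behavior
   ): Long = {
     // Current proposal: G = O - min((H + O) * (B - 1), O)
     val deduction =
       math.min(((heapMiB + overheadMiB) * (burstFactor - 1)).toLong, overheadMiB)
     val g = overheadMiB - deduction
     // Proposed floor: never guarantee less than ratio * O
     math.max(g, (overheadMiB * minGuaranteedRatio).toLong)
   }

With the numbers from the earlier example (H = 40 GiB, O = 4 GiB, B = 1.2),
a ratio of 0.1 lifts G from 0 to roughly 0.4 GiB, while the default 0.0
reproduces today's result exactly.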

3. Re: Native Support

Agreed. We can treat Off-Heap support as a follow-up item. I would just
request that we add a "Known Limitation" note in the SPIP stating that this
optimization does not yet apply to spark.memory.offHeap.size, so users of
Gluten/Velox are aware.
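
For the wording of that note, the settings in question are the standard
off-heap configs that Gluten-style jobs set, e.g. (values purely
illustrative):

   spark.memory.offHeap.enabled=true
   spark.memory.offHeap.size=8g
   spark.executor.memoryOverhead=2g

where only the last line is what the SPIP's formula currently optimizes.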

I am happy to support the PR moving forward if we can agree to include the
*PriorityClass* config support and the *Safety Floor* config (even if
disabled by default). Please update the SPIP accordingly. This ensures the
feature is robust enough for the wider user base.

Regards,

Viquar Khan

Sr Data Architect

https://www.linkedin.com/in/vaquar-khan-b695577/

On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote:

> Hi, Vaquar
>
> thanks for the replies,
>
> 1. for Guaranteed QoS
>
> I may missed some words in the  original doc, the idea I would like to
> convey is that we essentially need to give up this idea due to what cannot
> achieve Guaranteed QoS as we already have different values of memory and
> even we only consider CPU request/limit values, it brings other risks to us
>
> Additionally , priorityClassName is NOT the solution here. What we really
> worry about is NOT eviction, but Linux Kernel OOMKiller where we cannot
> pass pod priority information into. With burstable pods, the only thing
> Linux Kernel OOMKiller considers is the memory request size which not
> necessary maps to priority information
>
> 2. The "Zero-Guarantee" Edge Case
>
> Actually, a lot of savings are from these 0 memory overhead pods... I am
> curious if you have adopted the PoC PR in prod as you have identified it is
> "unsafe"?
>
> Something like a minGuaranteedRatio is not a good idea , it will mess up
> the original design idea of the formula (check Appendix C), the simplest
> thing you can do is to avoid rolling out features to the jobs which you
> feel will be unsafe...
>
> 3. Native Execution Gap (Off-Heap)
>
> I am not sure the off-heap memory usage of gluten/comet shows the
> same/similar pattern as memoryOverhead. No one has validated that in
> production environment, but both PINS/Bytedance has validated
> memoryOverhead part thoroughly in their clusters
>
> Additionally, the key design of the proposal is to capture the
> relationship between on-heap and memoryOverhead sizes, in another word,
> they co-exist.... offheap memory used by native engines are different
> stories where , ideally, the on-heap usage should be minimum and most of
> memory usage should come from off-heap part...so the formula here may not
> work out of box
>
> my suggestion is, since the community has approved the original design
> which have been tested by at least 2 companies in production environments,
> we go with the current design and continue code review , in future, we can
> add what have been found/tested in production as followups
>
> Thanks
>
> Nan
>
>
> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]> wrote:
>
>> Hi Yao, Nan, and Chao,
>>
>> Thank you for this proposal though I already approved . The
>> cost-efficiency goals are very compelling, and the cited $6M annual savings
>> at Pinterest  clearly demonstrates the value of moving away from rigid
>> peak provisioning.
>>
>> However, after modeling the proposed design against standard Kubernetes
>> behavior and modern Spark workloads, I have identified *three critical
>> stability risks* that need to be addressed before this is finalized.
>>
>> I have drafted a *Supplementary Design Amendment* (linked
>> below/attached) that proposes fixes for these issues, but here is the
>> summary:
>> 1. The "Guaranteed QoS" Contradiction
>>
>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for stability
>> risks.
>>
>> The Issue: Technically, this mitigation is impossible under your proposal.
>>
>>    - In Kubernetes, a Pod is assigned the *Guaranteed* QoS class *only*
>>    if Request == Limit for both CPU and Memory.
>>
>>    - Your proposal explicitly sets Memory Request < Memory Limit
>>    (specifically $H+G < H+O$).
>>
>>    - *Consequence:* This configuration *automatically downgrades* the Pod
>>    to the *Burstable* QoS class. In a multi-tenant cluster, the Kubelet
>>    eviction manager will kill these "Burstable" Spark pods *before* any
>>    Guaranteed system pods during node pressure.
>>
>>    - *Proposed Fix:* We cannot rely on Guaranteed QoS. We must introduce a
>>    priorityClassName configuration to offset this eviction risk.
>>
>> 2. The "Zero-Guarantee" Edge Case
>>
>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ has a dangerous
>> edge case for High-Heap/Low-Overhead jobs (common in ETL).
>>
>>
>>    - *Scenario:* If a job has a large Heap ($H$) relative to Overhead
>>    ($O$), the calculated burst deduction often exceeds the total Overhead.
>>
>>    - *Result:* The formula yields *$G = 0$*.
>>
>>    - *Risk:* Allocating 0MB of guaranteed overhead is unsafe. Essential
>>    JVM operations (thread stacks, Netty control buffers) require a
>>    non-zero baseline. Relying 100% on a shared burst pool for basic
>>    functionality will lead to immediate container failures if the node is
>>    contended.
>>
>>    - *Proposed Fix:* Implement a safety floor using a minGuaranteedRatio
>>    (e.g., max(Calculated_G, O * 0.1)).
>>
>> 3. Native Execution Gap (Off-Heap)
>>
>> The proposal focuses entirely on memoryOverhead.
>>
>> The Issue: Modern native engines (Gluten, Velox, Photon) shift execution
>> memory to spark.memory.offHeap.size. This memory is equally "bursty" but is
>> excluded from your optimization.
>>
>> *Proposed Fix: *The burst-aware logic should be extensible to include
>> Off-Heap memory if enabled.
>>
>>
>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing
>>
>>
>> I believe these changes are necessary to make the feature robust enough
>> for general community adoption beyond specific controlled environments.
>>
>>
>> Regards,
>>
>> Viquar Khan
>>
>> Sr Data Architect
>>
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>>
>>
>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote:
>>
>>> +1
>>>
>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]>
>>>> wrote:
>>>>
>>>>> +1 from me.
>>>>> I think it's well-scoped and takes advantage of Kubernetes' features
>>>>> exactly for what they are designed for(as per my understanding).
>>>>>
>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:
>>>>>
>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the
>>>>>> detailed and thoughtful discussion.
>>>>>>
>>>>>> Overall, this looks like a valuable addition for organizations
>>>>>> running Spark on Kubernetes, especially given how bursty
>>>>>> memoryOverhead usage tends to be in practice. I appreciate that the
>>>>>> change is relatively small in scope and fully opt-in, which helps keep 
>>>>>> the
>>>>>> risk low.
>>>>>>
>>>>>> From my perspective, the questions raised on the thread and in the
>>>>>> SPIP have been addressed. If others feel the same, do we have consensus 
>>>>>> to
>>>>>> move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya.
>>>>>>
>>>>>> Best,
>>>>>> Chao
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> this is a good question
>>>>>>>
>>>>>>> > a stage is bursty and consumes the shared portion and fails to
>>>>>>> release it for subsequent stages
>>>>>>>
>>>>>>> in the scenario you described, since the memory-leaking stage and
>>>>>>> the subsequence ones are from the same job , the pod will likely be 
>>>>>>> killed
>>>>>>> by cgroup oomkiller
>>>>>>>
>>>>>>> taking the following as the example
>>>>>>>
>>>>>>> the usage pattern is  G = 5GB S = 2GB, it uses G + S at max and in
>>>>>>> theory, it should release all 7G and then claim 7G again in some later
>>>>>>> stages, however, due to the memory peak, it holds 2G forever and ask for
>>>>>>> another 7G, as a result,  it hits the pod memory limit  and cgroup
>>>>>>> oomkiller will take action to terminate the pod
>>>>>>>
>>>>>>> so this should be safe to the system
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> however, we should be careful about the memory peak for sure,
>>>>>>> because it essentially breaks the assumption that the usage of
>>>>>>> memoryOverhead is bursty (memory peak ~= use memory forever)...
>>>>>>> unfortunately, shared/guaranteed memory is managed by user applications
>>>>>>> instead of on cluster level , they, especially S, are just logical
>>>>>>> concepts  instead of a physical memory pool which pods can explicitly 
>>>>>>> claim
>>>>>>> memory from...
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the interesting proposal.
>>>>>>>> The design seems to rely on memoryOverhead being transient.
>>>>>>>> What happens when a stage is bursty and consumes the shared portion
>>>>>>>> and fails to release it for subsequent stages (e.g.,  off-heap buffers 
>>>>>>>> and
>>>>>>>> its not garbage collected since its off-heap)? Would this trigger the
>>>>>>>> host-level OOM like described in Q6? or are there strategies to 
>>>>>>>> release the
>>>>>>>> shared portion?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yes, that's the worst case in the scenario, please check my
>>>>>>>>> earlier response to Qiegang's question, we have a set of strategies 
>>>>>>>>> adopted
>>>>>>>>> in prod to mitigate the issue
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the explanation! So the executor is not guaranteed to
>>>>>>>>>> get 50 GB physical memory, right? All pods on the same host may 
>>>>>>>>>> reach peak
>>>>>>>>>> memory usage at the same time and cause paging/swapping which hurts
>>>>>>>>>> performance?
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> np, let me try to explain
>>>>>>>>>>>
>>>>>>>>>>> 1. Each executor container will be run in a pod together with
>>>>>>>>>>> some other sidecar containers taking care of tasks like 
>>>>>>>>>>> authentication,
>>>>>>>>>>> etc. , for simplicity, we assume each pod has only one container 
>>>>>>>>>>> which is
>>>>>>>>>>> the executor container
>>>>>>>>>>>
>>>>>>>>>>> 2. Each container is assigned with two values, *request & limit*
>>>>>>>>>>> (limit >= request), for both of CPU/memory resources (we only discuss
>>>>>>>>>>> >= request),* for both of CPU/memory resources (we only discuss
>>>>>>>>>>> memory here). Each pod will have request/limit values as the sum of 
>>>>>>>>>>> all
>>>>>>>>>>> containers belonging to this pod
>>>>>>>>>>>
>>>>>>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on
>>>>>>>>>>> *request* value, and cap the resource usage of each container
>>>>>>>>>>> based on their *limit* value, e.g. if I have a pod with a
>>>>>>>>>>> single container in it , and it has 1G/2G as request and limit value
>>>>>>>>>>> respectively, any machine with 1G free RAM space will be a 
>>>>>>>>>>> candidate to
>>>>>>>>>>> host this pod, and when the container use more than 2G memory, it 
>>>>>>>>>>> will be
>>>>>>>>>>> killed by cgroup oomkiller. Once a pod is scheduled to a host, the 
>>>>>>>>>>> memory
>>>>>>>>>>> space sized at "sum of all its containers' request values" will be 
>>>>>>>>>>> booked
>>>>>>>>>>> exclusively for this pod.
>>>>>>>>>>>
>>>>>>>>>>> 4. By default, Spark *sets request/limit as the same value for
>>>>>>>>>>> executors in k8s*, and this value is basically
>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most cases 
>>>>>>>>>>> .
>>>>>>>>>>> However,  spark.executor.memoryOverhead usage is very bursty, the 
>>>>>>>>>>> user
>>>>>>>>>>> setting  spark.executor.memoryOverhead as 10G usually means each 
>>>>>>>>>>> executor
>>>>>>>>>>> only needs 10G in a very small portion of the executor's whole 
>>>>>>>>>>> lifecycle
>>>>>>>>>>>
>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple request/limit
>>>>>>>>>>> value in spark@k8s for executors in a safe way (this idea is
>>>>>>>>>>> from the bytedance paper we refer to in SPIP paper).
>>>>>>>>>>>
>>>>>>>>>>> Using the aforementioned example ,
>>>>>>>>>>>
>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we have
>>>>>>>>>>> two pods requesting 40G + 10G (on-heap + memoryOverhead) and we set 
>>>>>>>>>>> bursty
>>>>>>>>>>> factor to 1.2, without the mechanism proposed in this SPIP, we can 
>>>>>>>>>>> at most
>>>>>>>>>>> host 2 pods with this machine, and because of the bursty usage of 
>>>>>>>>>>> that 10G
>>>>>>>>>>> space, the memory utilization would be compromised.
>>>>>>>>>>>
>>>>>>>>>>> When applying the burst-aware memory allocation, we only need 40
>>>>>>>>>>> + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we 
>>>>>>>>>>> have 20G
>>>>>>>>>>> free memory space left in the machine which can be used to host some
>>>>>>>>>>> smaller pods. At the same time, as we didn't change the limit value 
>>>>>>>>>>> of the
>>>>>>>>>>> executor pods, these executors can still use 50G at max.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does
>>>>>>>>>>>> it work under the hood? The container will adjust its system 
>>>>>>>>>>>> memory size
>>>>>>>>>>>> depending on the actual memory usage of the processes in this 
>>>>>>>>>>>> container?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> yeah, we have a few cases that we have significantly larger O
>>>>>>>>>>>>> than H, the proposed algorithm is actually a great fit
>>>>>>>>>>>>>
>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed algorithm
>>>>>>>>>>>>> will allocate a non-trivial G to ensure the safety of running but 
>>>>>>>>>>>>> still cut
>>>>>>>>>>>>> a big chunk of memory (10s of GBs) and treat them as S , saving 
>>>>>>>>>>>>> tons of
>>>>>>>>>>>>> money burnt by them
>>>>>>>>>>>>>
>>>>>>>>>>>>> but regarding native accelerators, some native acceleration
>>>>>>>>>>>>> engines do not use memoryOverhead but use off-heap
>>>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
>>>>>>>>>>>>> implementation does not cover this part , while that will be an 
>>>>>>>>>>>>> easy
>>>>>>>>>>>>> extension
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you tested in environments where O is bigger than H?
>>>>>>>>>>>>>> Wondering if the proposed algorithm would help more in those 
>>>>>>>>>>>>>> environments
>>>>>>>>>>>>>> (eg. with
>>>>>>>>>>>>>> native accelerators)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> please check the following answer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use the 
>>>>>>>>>>>>>>> > Executor
>>>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which
>>>>>>>>>>>>>>> allows for better resource packing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the total
>>>>>>>>>>>>>>> potential usage  sum of H+G+S across all pods on a node exceeds 
>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on the 
>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer exists 
>>>>>>>>>>>>>>> on the node
>>>>>>>>>>>>>>> to serve as the shared pool?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies, setting
>>>>>>>>>>>>>>> conservative bursty factor, progressive rollout, monitor the 
>>>>>>>>>>>>>>> cluster
>>>>>>>>>>>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us to 
>>>>>>>>>>>>>>> the optimal
>>>>>>>>>>>>>>> setup of bursty factor... in usual, K8S operators will set a 
>>>>>>>>>>>>>>> reserved space
>>>>>>>>>>>>>>> for daemon processes on each host, we found it is sufficient to 
>>>>>>>>>>>>>>> in our case
>>>>>>>>>>>>>>> and our major tuning focuses on bursty factor value
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure a
>>>>>>>>>>>>>>> strategic mix of executors with large S and small S values on a 
>>>>>>>>>>>>>>> single
>>>>>>>>>>>>>>> node?  I am wondering if this would reduce the probability of 
>>>>>>>>>>>>>>> concurrent
>>>>>>>>>>>>>>> bursting and host-level OOM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, when we work on this project, we put some attention on
>>>>>>>>>>>>>>> the cluster scheduling policy/behavior... two things we mostly 
>>>>>>>>>>>>>>> care about
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have
>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have enough 
>>>>>>>>>>>>>>> candidates
>>>>>>>>>>>>>>> to form a mixed set of executors with large S and
>>>>>>>>>>>>>>> small S values
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which tends
>>>>>>>>>>>>>>> to pack more pods from the same job to the same host, which can 
>>>>>>>>>>>>>>> create
>>>>>>>>>>>>>>> troubles as they are more likely to ask for max memory at the 
>>>>>>>>>>>>>>> same time
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use the 
>>>>>>>>>>>>>>>> Executor
>>>>>>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which
>>>>>>>>>>>>>>>> allows for better resource packing.  I have a few
>>>>>>>>>>>>>>>> questions regarding the shared portion S:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    1. How is the risk of host-level OOM mitigated when the
>>>>>>>>>>>>>>>>    total potential usage  sum of H+G+S across all pods on a 
>>>>>>>>>>>>>>>> node exceeds its
>>>>>>>>>>>>>>>>    allocatable capacity? Does the proposal implicitly rely on 
>>>>>>>>>>>>>>>> the cluster
>>>>>>>>>>>>>>>>    operator to manually ensure an unrequested memory buffer 
>>>>>>>>>>>>>>>> exists on the node
>>>>>>>>>>>>>>>>    to serve as the shared pool?
>>>>>>>>>>>>>>>>    2. Have you considered scheduling optimizations to
>>>>>>>>>>>>>>>>    ensure a strategic mix of executors with large S and
>>>>>>>>>>>>>>>>    small S values on a single node?  I am wondering if
>>>>>>>>>>>>>>>>    this would reduce the probability of concurrent bursting 
>>>>>>>>>>>>>>>> and host-level OOM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Is the memory overhead off-heap? The formular
>>>>>>>>>>>>>>>>>    indicates a fixed heap size, and memory overhead can't be 
>>>>>>>>>>>>>>>>> dynamic if it's
>>>>>>>>>>>>>>>>>    on-heap.
>>>>>>>>>>>>>>>>>    - Do Spark applications have static profiles? When we
>>>>>>>>>>>>>>>>>    submit stages, the cluster is already allocated, how can 
>>>>>>>>>>>>>>>>> we change anything?
>>>>>>>>>>>>>>>>>    - How do we assign the shared memory overhead? Fairly
>>>>>>>>>>>>>>>>>    among all applications on the same physical node?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc since the
>>>>>>>>>>>>>>>>>> main idea is relatively simple...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 of
>>>>>>>>>>>>>>>>>> the SPIP doc
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> it is calculated based on per profile (you can say it is
>>>>>>>>>>>>>>>>>> based on per stage), when the cluster manager compose the 
>>>>>>>>>>>>>>>>>> pod spec, it
>>>>>>>>>>>>>>>>>> calculates the new memory overhead based on what user asks 
>>>>>>>>>>>>>>>>>> for in that
>>>>>>>>>>>>>>>>>> resource profile
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the memory
>>>>>>>>>>>>>>>>>>> request and limit? Is it per stage or per executor?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the
>>>>>>>>>>>>>>>>>>>> request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming in
>>>>>>>>>>>>>>>>>>>> future,  as long as it has a similar concept , it can 
>>>>>>>>>>>>>>>>>>>> leverage this easily
>>>>>>>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it
>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead
>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve memory
>>>>>>>>>>>>>>>>>>>>>> utilization of spark clusters.
>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc
>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature.
>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original paper
>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected])
>>>>>>>>>>>>>>>>>>>>>> and Yixin([email protected]).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>
>> --
>> Regards,
>> Vaquar Khan
>>
>>

-- 
Regards,
Vaquar Khan
