1. Re: "Imagined Reasons" & Zero Overhead
When I said "imagined reasons", I meant that I have not seen the issue you
described appear in a prod environment running millions of jobs every
month, and I have also explained why it won't happen in PINS and other
normal cases: in a K8S cluster, each host reserves space for system
daemons, so even with many 0-memoryOverhead jobs, nodes won't be "fully
packed" as you imagined, since these 0-memoryOverhead jobs don't need much
memory overhead space anyway.

Let me repeat my earlier suggestion: if you don't want any job to have
0 memoryOverhead, you can calculate how much memoryOverhead is guaranteed
with simple arithmetic; if it is 0, do not use this feature for that job
(a sketch of that arithmetic is below).
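
As a purely illustrative sketch (not part of the proposal), assuming the
SPIP formula G = O - min((H + O) * (B - 1), O) and made-up job sizes, that
pre-rollout check is just:

object BurstyOverheadCheck {
  // Guaranteed overhead per the SPIP formula: G = O - min((H + O) * (B - 1), O)
  def guaranteedOverheadGb(heapGb: Double, overheadGb: Double, burstyFactor: Double): Double =
    overheadGb - math.min((heapGb + overheadGb) * (burstyFactor - 1.0), overheadGb)

  def main(args: Array[String]): Unit = {
    // Example job: 40G heap, 10G memoryOverhead, bursty factor 1.2
    val g = guaranteedOverheadGb(heapGb = 40, overheadGb = 10, burstyFactor = 1.2)
    // 10 - min(50 * 0.2, 10) = 0, so this job would get zero guaranteed
    // overhead; skip the feature for it if that is not acceptable to you
    if (g == 0.0) println("G = 0: keep the default memoryOverhead allocation")
    else println(s"G = $g GB: enable the bursty allocation for this job")
  }
}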

In general, I don't suggest using this feature if you cannot manage the
rollout process, just as no one should apply something like auto-tuning to
all of their jobs without a dedicated Spark platform team.

2. Kubelet Eviction Relevance

2.a My question is: how is PID/Disk pressure related to the memory-related
feature we are discussing here? Please don't expand the discussion scope
without limit.
2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far from
a reasonable design. The priority class name should be controlled at the
cluster level and then specified via something like the Spark operator, or
via the pod spec if you can provide one, instead of being embedded in a
memory-related feature (see the sketch below).
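
Just to illustrate (hypothetical paths and class names, not part of this
SPIP): with the existing spark.kubernetes.executor.podTemplateFile support,
a cluster-managed priority class can already ride along in the executor pod
template rather than in a memory-related config:

import org.apache.spark.SparkConf

// Hypothetical sketch: the platform team owns the template file, and its
// spec.priorityClassName field (a standard pod spec field) carries the
// cluster-level priority class.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.podTemplateFile",
    "/etc/spark/templates/executor-pod-template.yaml")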

3. Can we agree to simply *add these two parameters as optional
configurations*?

unfortunately no...

Some of the problems you raised will probably only happen in very, very
extreme cases, and I have provided solutions to them without the need for
additional configs. Other problems you raised are not related to what this
SPIP is about, e.g. PID exhaustion, etc. And some of your proposed design
doesn't make sense to me, e.g. specifying the executor's priority class via
such a memory-related feature.


On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:

> Hi Nan,
>
> Thanks for the candid response. I see where you are coming from regarding
> managed rollouts, but I think we are viewing this from two different
> lenses: "Internal Platform" vs. "General Open Source Product."
>
> Here is why I am pushing for these two specific configuration hooks:
>
> 1. Re: "Imagined Reasons" & Zero Overhead
>
> You mentioned that you have observed jobs running fine with zero
> memoryOverhead.
>
> While that may be true for specific workloads in your environment, the
> requirement for non-heap memory is not "imagined"—it is a JVM
> specification. Thread stacks, CodeCache, and Netty DirectByteBuffer control
> structures must live in non-heap memory.
>
>    -
>
>    *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully
>    bin-packed (Sum of Requests = Node Capacity), your executor is
>    mathematically guaranteed *zero bytes* of non-heap memory unless it
>    can steal from the burst pool.
>    -
>
>    *The Risk:* If the burst pool is temporarily exhausted by neighbors, a
>    simple thread creation will throw OutOfMemoryError: unable to create
>    new native thread.
>    -
>
>    *The Fix:* I am not asking to change your default behavior. I am
>    asking to *expose the config* (minGuaranteedRatio). If you set it to
>    0.0 (default), your behavior is unchanged. But for those of us running
>    high-concurrency environments who need a 5-10% safety buffer for thread
>    stacks, we need the *capability* to configure it without maintaining a
>    fork or writing complex pre-submission wrappers.
>
> 2. Re: Kubelet Eviction Relevance
>
> You asked how Disk/PID pressure is related.
>
> In Kubernetes, PriorityClass is the universal signal for pod importance
> during any node-pressure event (not just memory).
>
>    -
>
>    If a node runs out of Ephemeral Storage (common with Spark Shuffle),
>    the Kubelet evicts pods.
>    -
>
>    Without a priorityClassName config, these Spark pods (which are now
>    QoS-downgraded to Burstable) will be evicted *before* Best-Effort jobs
>    that might have a higher priority class.
>    -
>
>    Again, this is a standard Kubernetes spec feature. There is no
>    downside to exposing spark.kubernetes.executor.bursty.priorityClassName
>    as an optional config.
>
> *Proposal to Unblock*
>
> We both want this feature merged. I am not asking to change your formula's
> default behavior.
>
> Can we agree to simply *add these two parameters as optional
> configurations*?
>
>    1.
>
>    minGuaranteedRatio (Default: 0.0 -> preserves your logic exactly).
>    2.
>
>    priorityClassName (Default: null -> preserves your logic exactly).
>
> This satisfies your design goals while making the feature robust enough
> for my production requirements.
>
>
> Regards,
>
> Viquar Khan
>
> Sr Data Architect
>
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
>
>
> On Tue, 30 Dec 2025 at 01:04, Nan Zhu <[email protected]> wrote:
>
>> > However, I maintain that for a general-purpose open-source feature
>> (which will be used by teams without dedicated platform engineers to manage
>> rollouts), we need structural safety guardrails.
>>
>> I am not sure we can roll out such a feature to all jobs without a
>> managed rollout process; that is an anti-pattern in any engineering org.
>> This feature is disabled by default, which is already a guard to prevent
>> users from silently getting into something they don't expect.
>>
>>
>> > A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing
>> up the design"—it is mathematically necessary to prevent the formula from
>> collapsing to zero in valid production scenarios.
>>
>> This formula *IS* designed to output 0 in some cases, so it is *NOT*
>> collapsing to zero. I observed that, even with 0 memoryOverhead in many
>> jobs, a proper bursty factor saved tons of money in a real PROD
>> environment, not in my imagination. If you don't want memoryOverhead to
>> be zero in your jobs for your imagined reasons, you can just calculate
>> your threshold for the on-heap/memoryOverhead ratio when rolling out.
>>
>> Step back... if your team doesn't know how to manage a rollout, most
>> likely you are rolling out this feature for individual jobs without a
>> centralized rollout point, right? Then you can just use simple arithmetic
>> to calculate whether the resulting memoryOverhead is 0; if yes, don't use
>> this feature, that's it....
>>
>>
>> > However, Kubelet eviction is the primary mechanism for other pressure
>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios
>> where memory.available crosses the eviction threshold before the kernel
>> panics.
>>
>> How are they related to this feature?
>>
>>
>> On Mon, Dec 29, 2025 at 10:37 PM vaquar khan <[email protected]>
>> wrote:
>>
>>> Hi Nan,
>>>
>>> Thanks for the detailed reply. I appreciate you sharing the specific
>>> context from the Pinterest implementation—it helps clarify the operational
>>> model you are using.
>>>
>>> However, I maintain that for a general-purpose open-source feature
>>> (which will be used by teams without dedicated platform engineers to manage
>>> rollouts), we need structural safety guardrails.
>>>
>>> *Here is my response to your points:*
>>>
>>> 1. Re: "Zero-Guarantee" & Safety (Critical)
>>>
>>> You suggested that "setting a conservative bursty factor" resolves the
>>> risk of zero-guaranteed overhead.
>>>
>>> Mathematically, this is incorrect for High-Heap jobs. The formula is
>>> structural: $G = O - \min((H+O) \times (B-1), O)$.
>>>
>>> Consider a standard ETL job: Heap (H) = 100GB, Overhead (O) = 5GB.
>>>
>>> Even if we set a very conservative Bursty Factor (B) of 1.06 (only 6%
>>> burst):
>>>
>>>    -
>>>
>>>    Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 = 6.3$ GB.
>>>    -
>>>
>>>    Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*.
>>>
>>> Even with an extremely conservative factor, the design forces this pod
>>> to have zero guaranteed memory for OS/JVM threads. This is not a tuning
>>> issue; it is a formulaic edge case for high-memory jobs.
>>>
>>> * A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing
>>> up the design"—it is mathematically necessary to prevent the formula from
>>> collapsing to zero in valid production scenarios.*
>>>
>>> 2. Re: Kubelet Eviction vs. OOMKiller
>>>
>>> I concede that in sudden memory spikes, the Kernel OOMKiller often acts
>>> faster than Kubelet eviction.
>>>
>>> However, Kubelet eviction is the primary mechanism for other pressure
>>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios
>>> where memory.available crosses the eviction threshold before the kernel
>>> panics.
>>>
>>> * Adding priorityClassName support to the Pod spec is a low-effort,
>>> zero-risk change that aligns with Kubernetes best practices for "Defense in
>>> Depth." It costs nothing to expose this config.*
>>>
>>> 3. Re: Native Support
>>>
>>> Fair point. To keep the scope tight, I am happy to drop the Native
>>> Support request for this SPIP. We can treat that as a separate follow-up.
>>>
>>> Path Forward
>>>
>>> I am happy to support this if we can agree to:
>>>
>>>    1.
>>>
>>>    Add the minGuaranteedRatio config (to handle the High-Heap math
>>>    proven above).
>>>    2.
>>>
>>>    Expose the priorityClassName config (standard K8S practice).
>>>
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>>
>>> Sr Data Architect
>>>
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>>
>>> On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote:
>>>
>>>> > Kubelet Eviction is the first line of defense before the Kernel
>>>> OOMKiller strikes.
>>>>
>>>> This is *NOT* true. Eviction will first kill some best-effort pod,
>>>> which doesn't make any difference to memory pressure in most cases, and
>>>> before it takes action again, the Kernel OOMKiller has already killed
>>>> some executor pods. This is exactly why I say we don't really worry
>>>> about eviction here: before eviction touches those executors, the
>>>> OOMKiller has already killed them. This behavior is consistently
>>>> observed, and we also had discussions with other companies who had to
>>>> modify Kernel code to mitigate it.
>>>>
>>>> > Re: "Zero-Guarantee" & Safety
>>>>
>>>> You basically want to trade off savings against system safety, so why
>>>> not just set a conservative value for the bursty factor? That is exactly
>>>> what we did in PINS; please check my earlier response in the thread. Key
>>>> part as follows:
>>>>
>>>> "in PINS, we basically apply a set of strategies, setting conservative
>>>> bursty factor, progressive rollout, monitor the cluster metrics like Linux
>>>> Kernel OOMKiller occurrence to guide us to the optimal setup of bursty
>>>> factor... in usual, K8S operators will set a reserved space for daemon
>>>> processes on each host, we found it is sufficient to in our case and our
>>>> major tuning focuses on bursty factor value "
>>>>
>>>> If you really want, you can enable this feature only for jobs when
>>>> OnHeap/MemoryOverhead is smaller than a certain value...
>>>>
>>>> I just didn't see the value of bringing another configuration
>>>>
>>>>
>>>> > Re: Native Support
>>>>
>>>> I mean....this SPIP is NOT about native execution engine's memory
>>>> pattern at all..... why do we bother to bring it up....
>>>>
>>>>
>>>>
>>>> On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Nan,
>>>>>
>>>>> Thanks for the prompt response and for clarifying the design intent.
>>>>>
>>>>> I understand the goal is to maximize savings—and I agree we shouldn't
>>>>> block the current momentum even though you can see my vote is +1—but I 
>>>>> want
>>>>> to ensure we aren't over-optimizing for specific internal environments at
>>>>> the cost of general community stability.
>>>>>
>>>>> *Here is my rejoinder on the technical points:*
>>>>>
>>>>> 1. Re: PriorityClass & OOMKiller (Defense in Depth)
>>>>>
>>>>> You mentioned that “priorityClassName is NOT the solution... What we
>>>>> worry about is the Linux Kernel OOMKiller.”
>>>>>
>>>>> I agree that the Kernel OOMKiller (cgroup) primarily looks at
>>>>> oom_score_adj (which is determined by QoS Class). However, Kubelet 
>>>>> Eviction
>>>>> is the first line of defense before the Kernel OOMKiller strikes.
>>>>>
>>>>> When a node comes under memory pressure (e.g., memory.available drops
>>>>> below evictionHard), the *Kubelet* actively selects pods to evict to
>>>>> reclaim resources. Unlike the Kernel, the Kubelet *does* explicitly
>>>>> use PriorityClass when ranking candidates for eviction.
>>>>>
>>>>>    -
>>>>>
>>>>>    *The Risk:* Since we are downgrading these pods to *Burstable*
>>>>>    (increasing their OOM risk), we lose the "Guaranteed" protection 
>>>>> shield.
>>>>>    -
>>>>>
>>>>>    *The Fix:* By assigning a high PriorityClass, we ensure that if
>>>>>    the Kubelet needs to free space, it evicts lower-priority batch jobs
>>>>>    *before* these Spark executors. It is a necessary "Defense in
>>>>>    Depth" strategy for multi-tenant clusters that prevents optimized Spark
>>>>>    jobs from being the first victims of node pressure.
>>>>>
>>>>> 2. Re: "Zero-Guarantee" & Safety
>>>>>
>>>>> You noted that “savings come from these 0 memory overhead pods.”
>>>>>
>>>>> While G=0 maximizes "on-paper" savings, it is theoretically unsafe for
>>>>> a JVM. A JVM physically requires non-heap memory for Thread Stacks,
>>>>> CodeCache, and Metaspace just to run.
>>>>>
>>>>>    -
>>>>>
>>>>>    *The Reality:* If G=0, then Pod Request == Heap. If a node is
>>>>>    fully packed (Sum of Requests ≈ Node Capacity), the pod relies
>>>>>    *100%* on the burst pool for basic thread allocation. If neighbors
>>>>>    are noisy, that pod cannot even spawn a thread.
>>>>>    -
>>>>>
>>>>>    *The Compromise:* I strongly suggest we add the configuration
>>>>>    spark.executor.memoryOverhead.minGuaranteedRatio but set the *default
>>>>>    to 0.0*.
>>>>>    -
>>>>>
>>>>>       This preserves your logic/savings by default.
>>>>>       -
>>>>>
>>>>>       But it gives platform admins a "safety knob" to turn (e.g., to
>>>>>       0.1) when they inevitably encounter instability in high-contention
>>>>>       environments, without needing a code patch.
>>>>>
>>>>> 3. Re: Native Support
>>>>>
>>>>> Agreed. We can treat Off-Heap support as a follow-up item. I would
>>>>> just request that we add a "Known Limitation" note in the SPIP stating 
>>>>> that
>>>>> this optimization does not yet apply to spark.memory.offHeap.size, so 
>>>>> users
>>>>> of Gluten/Velox are aware.
>>>>>
>>>>> I am happy to support the PR moving forward if we can agree to include
>>>>> the *PriorityClass* config support and the *Safety Floor* config
>>>>> (even if disabled by default). Please update your SPIP. This ensures the
>>>>> feature is robust enough for the wider user base.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Viquar Khan
>>>>>
>>>>> Sr Data Architect
>>>>>
>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>
>>>>> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote:
>>>>>
>>>>>> Hi, Vaquar
>>>>>>
>>>>>> thanks for the replies,
>>>>>>
>>>>>> 1. for Guaranteed QoS
>>>>>>
>>>>>> I may have missed some words in the original doc. The idea I would
>>>>>> like to convey is that we essentially need to give up on Guaranteed
>>>>>> QoS: we cannot achieve it since we already have different request/limit
>>>>>> values for memory, and even if we only align the CPU request/limit
>>>>>> values, it brings other risks to us
>>>>>>
>>>>>> Additionally, priorityClassName is NOT the solution here. What we
>>>>>> really worry about is NOT eviction, but the Linux Kernel OOMKiller,
>>>>>> which we cannot pass pod priority information into. With burstable
>>>>>> pods, the only thing the Linux Kernel OOMKiller considers is the memory
>>>>>> request size, which does not necessarily map to priority information
>>>>>>
>>>>>> 2. The "Zero-Guarantee" Edge Case
>>>>>>
>>>>>> Actually, a lot of the savings are from these 0 memory overhead
>>>>>> pods... I am curious whether you have adopted the PoC PR in prod, given
>>>>>> that you have identified it as "unsafe"?
>>>>>>
>>>>>> Something like a minGuaranteedRatio is not a good idea; it will mess
>>>>>> up the original design idea of the formula (check Appendix C). The
>>>>>> simplest thing you can do is avoid rolling out the feature to the jobs
>>>>>> which you feel will be unsafe...
>>>>>>
>>>>>> 3. Native Execution Gap (Off-Heap)
>>>>>>
>>>>>> I am not sure the off-heap memory usage of gluten/comet shows the
>>>>>> same/similar pattern as memoryOverhead. No one has validated that in a
>>>>>> production environment, but both PINS and Bytedance have validated the
>>>>>> memoryOverhead part thoroughly in their clusters
>>>>>>
>>>>>> Additionally, the key design of the proposal is to capture the
>>>>>> relationship between on-heap and memoryOverhead sizes; in other words,
>>>>>> they co-exist.... Off-heap memory used by native engines is a different
>>>>>> story where, ideally, on-heap usage should be minimal and most of the
>>>>>> memory usage should come from the off-heap part... so the formula here
>>>>>> may not work out of the box
>>>>>>
>>>>>> my suggestion is, since the community has approved the original
>>>>>> design, which has been tested by at least 2 companies in production
>>>>>> environments, we go with the current design and continue code review;
>>>>>> in the future, we can add what has been found/tested in production as
>>>>>> follow-ups
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Nan
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Yao, Nan, and Chao,
>>>>>>>
>>>>>>> Thank you for this proposal, which I have already approved. The
>>>>>>> cost-efficiency goals are very compelling, and the cited $6M annual
>>>>>>> savings at Pinterest clearly demonstrate the value of moving away from
>>>>>>> rigid peak provisioning.
>>>>>>>
>>>>>>> However, after modeling the proposed design against standard
>>>>>>> Kubernetes behavior and modern Spark workloads, I have identified *three
>>>>>>> critical stability risks* that need to be addressed before this is
>>>>>>> finalized.
>>>>>>>
>>>>>>> I have drafted a *Supplementary Design Amendment* (linked
>>>>>>> below/attached) that proposes fixes for these issues, but here is the
>>>>>>> summary:
>>>>>>> 1. The "Guaranteed QoS" Contradiction
>>>>>>>
>>>>>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for
>>>>>>> stability risks.
>>>>>>>
>>>>>>> The Issue: Technically, this mitigation is impossible under your
>>>>>>> proposal.
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    In Kubernetes, a Pod is assigned the *Guaranteed* QoS class
>>>>>>>    *only* if Request == Limit for both CPU and Memory.
>>>>>>>    -
>>>>>>>
>>>>>>>    Your proposal explicitly sets Memory Request < Memory Limit
>>>>>>>    (specifically $H+G < H+O$).
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    *Consequence:* This configuration *automatically downgrades* the
>>>>>>>    Pod to the *Burstable* QoS class. In a multi-tenant cluster, the
>>>>>>>    Kubelet eviction manager will kill these "Burstable" Spark pods
>>>>>>>    *before* any Guaranteed system pods during node pressure.
>>>>>>>    -
>>>>>>>
>>>>>>>    *Proposed Fix:* We cannot rely on Guaranteed QoS. We must
>>>>>>>    introduce a priorityClassName configuration to offset this
>>>>>>>    eviction risk.
>>>>>>>
>>>>>>> 2. The "Zero-Guarantee" Edge Case
>>>>>>>
>>>>>>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ has a
>>>>>>> dangerous edge case for High-Heap/Low-Overhead jobs (common in ETL).
>>>>>>>
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    *Scenario:* If a job has a large Heap ($H$) relative to Overhead
>>>>>>>    ($O$), the calculated burst deduction often exceeds the total
>>>>>>>    Overhead.
>>>>>>>    -
>>>>>>>
>>>>>>>    *Result:* The formula yields *$G = 0$*.
>>>>>>>    -
>>>>>>>
>>>>>>>    *Risk:* Allocating 0MB of guaranteed overhead is unsafe.
>>>>>>>    Essential JVM operations (thread stacks, Netty control buffers) 
>>>>>>> require a
>>>>>>>    non-zero baseline. Relying 100% on a shared burst pool for basic
>>>>>>>    functionality will lead to immediate container failures if the node 
>>>>>>> is
>>>>>>>    contended.
>>>>>>>    -
>>>>>>>
>>>>>>>    *Proposed Fix:* Implement a safety floor using a
>>>>>>>    minGuaranteedRatio (e.g., max(Calculated_G, O * 0.1)).
>>>>>>>
>>>>>>> 3. Native Execution Gap (Off-Heap)
>>>>>>>
>>>>>>> The proposal focuses entirely on memoryOverhead.
>>>>>>>
>>>>>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift
>>>>>>> execution memory to spark.memory.offHeap.size. This memory is equally
>>>>>>> "bursty" but is excluded from your optimization.
>>>>>>>
>>>>>>> *Proposed Fix: *The burst-aware logic should be extensible to
>>>>>>> include Off-Heap memory if enabled.
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>> I believe these changes are necessary to make the feature robust
>>>>>>> enough for general community adoption beyond specific controlled
>>>>>>> environments.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Viquar Khan
>>>>>>>
>>>>>>> Sr Data Architect
>>>>>>>
>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 from me.
>>>>>>>>>> I think it's well-scoped and takes advantage of Kubernetes'
>>>>>>>>>> features exactly for what they are designed for (as per my
>>>>>>>>>> understanding).
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the
>>>>>>>>>>> detailed and thoughtful discussion.
>>>>>>>>>>>
>>>>>>>>>>> Overall, this looks like a valuable addition for organizations
>>>>>>>>>>> running Spark on Kubernetes, especially given how bursty
>>>>>>>>>>> memoryOverhead usage tends to be in practice. I appreciate that
>>>>>>>>>>> the change is relatively small in scope and fully opt-in, which 
>>>>>>>>>>> helps keep
>>>>>>>>>>> the risk low.
>>>>>>>>>>>
>>>>>>>>>>> From my perspective, the questions raised on the thread and in
>>>>>>>>>>> the SPIP have been addressed. If others feel the same, do we have 
>>>>>>>>>>> consensus
>>>>>>>>>>> to move forward with a vote? cc Wenchen, Qiegang, and Karuppayya.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Chao
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> this is a good question
>>>>>>>>>>>>
>>>>>>>>>>>> > a stage is bursty and consumes the shared portion and fails
>>>>>>>>>>>> to release it for subsequent stages
>>>>>>>>>>>>
>>>>>>>>>>>> in the scenario you described, since the memory-leaking stage
>>>>>>>>>>>> and the subsequent ones are from the same job, the pod will
>>>>>>>>>>>> likely be killed by the cgroup oomkiller
>>>>>>>>>>>>
>>>>>>>>>>>> taking the following as the example
>>>>>>>>>>>>
>>>>>>>>>>>> the usage pattern is G = 5GB, S = 2GB; it uses G + S at max, and
>>>>>>>>>>>> in theory it should release all 7G and then claim 7G again in
>>>>>>>>>>>> some later stage. However, due to the memory peak, it holds 2G
>>>>>>>>>>>> forever and asks for another 7G; as a result, it hits the pod
>>>>>>>>>>>> memory limit and the cgroup oomkiller will take action to
>>>>>>>>>>>> terminate the pod
>>>>>>>>>>>>
>>>>>>>>>>>> so this should be safe to the system
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> however, we should be careful about the memory peak for sure,
>>>>>>>>>>>> because it essentially breaks the assumption that the usage of
>>>>>>>>>>>> memoryOverhead is bursty (memory peak ~= using memory forever)...
>>>>>>>>>>>> unfortunately, shared/guaranteed memory is managed by user
>>>>>>>>>>>> applications instead of at the cluster level; they, especially S,
>>>>>>>>>>>> are just logical concepts rather than a physical memory pool
>>>>>>>>>>>> which pods can explicitly claim memory from...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the interesting proposal.
>>>>>>>>>>>>> The design seems to rely on memoryOverhead being transient.
>>>>>>>>>>>>> What happens when a stage is bursty, consumes the shared
>>>>>>>>>>>>> portion, and fails to release it for subsequent stages (e.g.,
>>>>>>>>>>>>> off-heap buffers that are not garbage collected since they are
>>>>>>>>>>>>> off-heap)? Would this trigger the host-level OOM as described in
>>>>>>>>>>>>> Q6? Or are there strategies to release the shared portion?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> yes, that's the worst case in the scenario, please check my
>>>>>>>>>>>>>> earlier response to Qiegang's question, we have a set of 
>>>>>>>>>>>>>> strategies adopted
>>>>>>>>>>>>>> in prod to mitigate the issue
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the explanation! So the executor is not
>>>>>>>>>>>>>>> guaranteed to get 50 GB physical memory, right? All pods on the 
>>>>>>>>>>>>>>> same host
>>>>>>>>>>>>>>> may reach peak memory usage at the same time and cause 
>>>>>>>>>>>>>>> paging/swapping
>>>>>>>>>>>>>>> which hurts performance?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> np, let me try to explain
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Each executor container will be run in a pod together
>>>>>>>>>>>>>>>> with some other sidecar containers taking care of tasks like
>>>>>>>>>>>>>>>> authentication, etc. , for simplicity, we assume each pod has 
>>>>>>>>>>>>>>>> only one
>>>>>>>>>>>>>>>> container which is the executor container
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. Each container is assigned two values, *request & limit*
>>>>>>>>>>>>>>>> (limit >= request), for both CPU and memory resources (we
>>>>>>>>>>>>>>>> only discuss memory here). Each pod's request/limit values
>>>>>>>>>>>>>>>> are the sum over all containers belonging to this pod.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3. The K8S scheduler chooses a machine to host a pod based
>>>>>>>>>>>>>>>> on the *request* value, and caps the resource usage of each
>>>>>>>>>>>>>>>> container based on its *limit* value. E.g., if I have a pod
>>>>>>>>>>>>>>>> with a single container in it, and it has 1G/2G as its
>>>>>>>>>>>>>>>> request and limit values respectively, any machine with 1G of
>>>>>>>>>>>>>>>> free RAM will be a candidate to host this pod, and when the
>>>>>>>>>>>>>>>> container uses more than 2G of memory, it will be killed by
>>>>>>>>>>>>>>>> the cgroup oomkiller. Once a pod is scheduled to a host,
>>>>>>>>>>>>>>>> memory space sized at "sum of all its containers' request
>>>>>>>>>>>>>>>> values" will be booked exclusively for this pod.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 4. By default, Spark *sets request/limit to the same value
>>>>>>>>>>>>>>>> for executors in k8s*, and this value is basically
>>>>>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most
>>>>>>>>>>>>>>>> cases. However, spark.executor.memoryOverhead usage is very
>>>>>>>>>>>>>>>> bursty; a user setting spark.executor.memoryOverhead to 10G
>>>>>>>>>>>>>>>> usually means each executor only needs 10G for a very small
>>>>>>>>>>>>>>>> portion of the executor's whole lifecycle.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 5. The proposed SPIP essentially decouples the request/limit
>>>>>>>>>>>>>>>> values in spark@k8s for executors in a safe way (this idea is
>>>>>>>>>>>>>>>> from the Bytedance paper we refer to in the SPIP doc).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Using the aforementioned example ,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we
>>>>>>>>>>>>>>>> have two pods requesting 40G + 10G (on-heap + memoryOverhead) 
>>>>>>>>>>>>>>>> and we set
>>>>>>>>>>>>>>>> bursty factor to 1.2, without the mechanism proposed in this 
>>>>>>>>>>>>>>>> SPIP, we can
>>>>>>>>>>>>>>>> at most host 2 pods with this machine, and because of the 
>>>>>>>>>>>>>>>> bursty usage of
>>>>>>>>>>>>>>>> that 10G space, the memory utilization would be compromised.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When applying the burst-aware memory allocation, we only
>>>>>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each 
>>>>>>>>>>>>>>>> pod, i.e. we
>>>>>>>>>>>>>>>> have 20G free memory space left in the machine which can be 
>>>>>>>>>>>>>>>> used to host
>>>>>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the 
>>>>>>>>>>>>>>>> limit value of
>>>>>>>>>>>>>>>> the executor pods, these executors can still use 50G at max.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does
>>>>>>>>>>>>>>>>> it work under the hood? The container will adjust its system 
>>>>>>>>>>>>>>>>> memory size
>>>>>>>>>>>>>>>>> depending on the actual memory usage of the processes in this 
>>>>>>>>>>>>>>>>> container?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> yeah, we have a few cases where we have significantly
>>>>>>>>>>>>>>>>>> larger O than H; the proposed algorithm is actually a great
>>>>>>>>>>>>>>>>>> fit
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed
>>>>>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure the safety 
>>>>>>>>>>>>>>>>>> of running but
>>>>>>>>>>>>>>>>>> still cut a big chunk of memory (10s of GBs) and treat them 
>>>>>>>>>>>>>>>>>> as S , saving
>>>>>>>>>>>>>>>>>> tons of money burnt by them
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> but regarding native accelerators, some native acceleration
>>>>>>>>>>>>>>>>>> engines do not use memoryOverhead but instead use off-heap
>>>>>>>>>>>>>>>>>> memory (spark.memory.offHeap.size) explicitly (e.g. Gluten).
>>>>>>>>>>>>>>>>>> The current implementation does not cover this part, though
>>>>>>>>>>>>>>>>>> that would be an easy extension
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than
>>>>>>>>>>>>>>>>>>> H? Wondering if the proposed algorithm would help more in 
>>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>> environments (eg. with
>>>>>>>>>>>>>>>>>>> native accelerators)?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> please check the following answer
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use
>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling
>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> yes, your understanding is correct
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the
>>>>>>>>>>>>>>>>>>>> total potential usage  sum of H+G+S across all pods on a 
>>>>>>>>>>>>>>>>>>>> node exceeds its
>>>>>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on 
>>>>>>>>>>>>>>>>>>>> the cluster
>>>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer 
>>>>>>>>>>>>>>>>>>>> exists on the node
>>>>>>>>>>>>>>>>>>>> to serve as the shared pool?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In PINS, we basically apply a set of strategies: setting
>>>>>>>>>>>>>>>>>>>> a conservative bursty factor, progressive rollout, and
>>>>>>>>>>>>>>>>>>>> monitoring cluster metrics like Linux Kernel OOMKiller
>>>>>>>>>>>>>>>>>>>> occurrences to guide us to the optimal bursty factor
>>>>>>>>>>>>>>>>>>>> setup... usually, K8S operators set reserved space for
>>>>>>>>>>>>>>>>>>>> daemon processes on each host; we found it sufficient in
>>>>>>>>>>>>>>>>>>>> our case, and our major tuning focuses on the bursty
>>>>>>>>>>>>>>>>>>>> factor value
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to
>>>>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and small 
>>>>>>>>>>>>>>>>>>>> S values on a
>>>>>>>>>>>>>>>>>>>> single node?  I am wondering if this would reduce the 
>>>>>>>>>>>>>>>>>>>> probability of
>>>>>>>>>>>>>>>>>>>> concurrent bursting and host-level OOM.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, when we worked on this project, we paid some
>>>>>>>>>>>>>>>>>>>> attention to the cluster scheduling policy/behavior...
>>>>>>>>>>>>>>>>>>>> two things we mostly care about:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a
>>>>>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have
>>>>>>>>>>>>>>>>>>>> enough candidates to form a mixed set of executors with
>>>>>>>>>>>>>>>>>>>> large S and small S values
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2. we avoid using a binpack scheduling algorithm, which
>>>>>>>>>>>>>>>>>>>> tends to pack more pods from the same job onto the same
>>>>>>>>>>>>>>>>>>>> host and can create trouble as they are more likely to
>>>>>>>>>>>>>>>>>>>> ask for max memory at the same time
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use
>>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling
>>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing.  I
>>>>>>>>>>>>>>>>>>>>> have a few questions regarding the shared portion S:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    1. How is the risk of host-level OOM mitigated
>>>>>>>>>>>>>>>>>>>>>    when the total potential usage  sum of H+G+S across 
>>>>>>>>>>>>>>>>>>>>> all pods on a node
>>>>>>>>>>>>>>>>>>>>>    exceeds its allocatable capacity? Does the proposal 
>>>>>>>>>>>>>>>>>>>>> implicitly rely on the
>>>>>>>>>>>>>>>>>>>>>    cluster operator to manually ensure an unrequested 
>>>>>>>>>>>>>>>>>>>>> memory buffer exists on
>>>>>>>>>>>>>>>>>>>>>    the node to serve as the shared pool?
>>>>>>>>>>>>>>>>>>>>>    2. Have you considered scheduling optimizations to
>>>>>>>>>>>>>>>>>>>>>    ensure a strategic mix of executors with large S and
>>>>>>>>>>>>>>>>>>>>>    small S values on a single node?  I am wondering
>>>>>>>>>>>>>>>>>>>>>    if this would reduce the probability of concurrent 
>>>>>>>>>>>>>>>>>>>>> bursting and host-level
>>>>>>>>>>>>>>>>>>>>>    OOM.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big
>>>>>>>>>>>>>>>>>>>>>> picture:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    - Is the memory overhead off-heap? The formula
>>>>>>>>>>>>>>>>>>>>>>    indicates a fixed heap size, and memory overhead
>>>>>>>>>>>>>>>>>>>>>>    can't be dynamic if it's on-heap.
>>>>>>>>>>>>>>>>>>>>>>    - Do Spark applications have static profiles?
>>>>>>>>>>>>>>>>>>>>>>    When we submit stages, the cluster is already 
>>>>>>>>>>>>>>>>>>>>>> allocated, how can we change
>>>>>>>>>>>>>>>>>>>>>>    anything?
>>>>>>>>>>>>>>>>>>>>>>    - How do we assign the shared memory overhead?
>>>>>>>>>>>>>>>>>>>>>>    Fairly among all applications on the same physical 
>>>>>>>>>>>>>>>>>>>>>> node?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc
>>>>>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4
>>>>>>>>>>>>>>>>>>>>>>> of the SPIP doc
>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> it is calculated per resource profile (you can say per
>>>>>>>>>>>>>>>>>>>>>>> stage); when the cluster manager composes the pod spec,
>>>>>>>>>>>>>>>>>>>>>>> it calculates the new memory overhead based on what the
>>>>>>>>>>>>>>>>>>>>>>> user asks for in that resource profile
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the
>>>>>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per 
>>>>>>>>>>>>>>>>>>>>>>>> executor?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on
>>>>>>>>>>>>>>>>>>>>>>>>> the request/limit concept in K8S, ...
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> but if any other cluster manager comes along in the
>>>>>>>>>>>>>>>>>>>>>>>>> future, as long as it has a similar concept, it can
>>>>>>>>>>>>>>>>>>>>>>>>> leverage this easily, as the main logic is
>>>>>>>>>>>>>>>>>>>>>>>>> implemented in ResourceProfile
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it
>>>>>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead
>>>>>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve
>>>>>>>>>>>>>>>>>>>>>>>>>>> memory utilization of spark clusters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc
>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original
>>>>>>>>>>>>>>>>>>>>>>>>>>> paper
>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected])
>>>>>>>>>>>>>>>>>>>>>>>>>>> and Yixin([email protected]).
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Vaquar Khan
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Vaquar Khan
>>>>>
>>>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>>
>>>
>
> --
> Regards,
> Vaquar Khan
>
>
