Hi Nan, Yao, and Chao,

I have done a deep dive into the underlying Linux and Kubernetes kernel behaviors to validate our respective positions. While I fully support the economic goal of reclaiming the estimated 30-50% of stranded memory in static clusters, the technical evidence suggests that the "Zero-Guarantee" configuration is not just an optimization choice: it is architecturally unsafe for standard Kubernetes environments because of how the kernel calculates OOM scores.
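So that the numbers are easy to reproduce, here is a small, self-contained Scala sketch. It is illustrative only, not part of the SPIP code or any existing Spark module, and the object name, node capacity, heap, overhead, and bursty-factor values in it are assumptions I picked for the example. It evaluates the two formulas I rely on below: the kubelet's oom_score_adj rule for Burstable pods and the SPIP's guaranteed-overhead formula G = O - min((H + O) * (B - 1), O).

    // Illustrative sketch only (not SPIP code): node capacity and job sizes
    // below are assumptions picked for the example.
    object BurstyMemoryCheck {

      // Kubelet rule for Burstable pods:
      //   oom_score_adj = 1000 - 1000 * request / nodeCapacity
      // (a higher score means the kernel OOM killer targets the pod earlier).
      def oomScoreAdj(requestGb: Double, nodeCapacityGb: Double): Long =
        math.round(1000 - 1000 * requestGb / nodeCapacityGb)

      // SPIP formula: G = O - min((H + O) * (B - 1), O)
      def guaranteedOverhead(heapGb: Double, overheadGb: Double, burstyFactor: Double): Double =
        overheadGb - math.min((heapGb + overheadGb) * (burstyFactor - 1), overheadGb)

      def main(args: Array[String]): Unit = {
        val (heap, overhead, nodeCapacity) = (100.0, 5.0, 256.0) // assumed sizes in GB
        val g = guaranteedOverhead(heap, overhead, burstyFactor = 1.06)
        println(f"guaranteed overhead G = $g%.1f GB")                                   // 0.0
        println(s"zero-guarantee pod oom_score_adj: ${oomScoreAdj(heap + g, nodeCapacity)}")        // ~609
        println(s"full-request pod oom_score_adj:   ${oomScoreAdj(heap + overhead, nodeCapacity)}")  // ~590
      }
    }

With these assumed inputs (100GB heap, 5GB overhead, bursty factor 1.06, 256GB node), G collapses to 0 and the Zero-Guarantee pod lands at oom_score_adj ~609 versus ~590 for the same pod keeping its full overhead in the Request; that gap is what the rest of this mail is about. The same helper also doubles as the "simple arithmetic" pre-check Nan suggested for deciding whether a given job should opt in.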
I am sharing these findings to explain why I have insisted on the *Safety Floor (minGuaranteedRatio)* as a necessary guardrail.

*1. The "Death Trap" of OOM Scores (The Math)*

Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's environment. However, in a standard environment the math works against us. For Burstable pods, the kubelet sets oom_score_adj inversely to the memory Request: 1000 - (1000 * Request / Capacity).

- *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the Request), we mathematically inflate the OOM score. On a standard node, a Zero-Guarantee pod ends up with a noticeably higher OOM score (i.e., it is more likely to be killed) than an otherwise identical pod that keeps its full overhead in the Request; the sketch above walks through the arithmetic.
- *The Consequence:* Under node memory pressure, the kernel will target these "optimized" Spark pods for termination *before* their neighbors, regardless of our intent.

*2. The "Smoking Gun": Kubelet Bug #131169*

There is a known defect in the Kubelet (Issue #131169) where *PriorityClass is ignored when calculating OOM scores for Burstable pods*.

- This invalidates the assumption that we can simply "manage" the risk with priorities later.
- Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is effectively treated like a "Best Effort" pod by the OOM killer.
- *Conclusion:* Ideally, we *should* enforce a minimum memory floor to keep the Request value high enough to secure a survivable OOM score.

*3. Silent Failures (Thread Exhaustion)*

The research confirms that "Zero-Guarantee" creates a vector for java.lang.OutOfMemoryError: unable to create new native thread.

- If a pod lands on a node with just enough RAM for the Heap (Request) but zero extra for the OS and JVM non-heap needs, the pthread_create call will fail immediately.
- This results in "silent" application crashes that do not trigger standard K8s OOM alerts, leading to un-debuggable support scenarios for general users.

*Final Proposal & Documentation Compromise*

My strong preference is to add the *Safety Floor (minGuaranteedRatio)* configuration to the code. However, if after reviewing this evidence you are *adamant* that no new configurations should be added to the code, I am willing to *unblock the vote* on one strict condition: *the SPIP and documentation must explicitly flag this risk.* We cannot simply leave this as an implementation detail. The documentation must contain a "Critical Warning" block stating:

*"Warning: High-Heap/Low-Overhead configurations may result in 0MB guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may bypass PriorityClass protections and lead to silent 'Native Thread' exhaustion failures on contended nodes. Users are responsible for validating stability."*

If you agree to either the code change (preferred) or this specific documentation warning, and the SPIP doc is updated accordingly, I am happy to support.

Regards,

Viquar Khan

Sr Data Architect

https://www.linkedin.com/in/vaquar-khan-b695577/

On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:

> 1. 
Re: "Imagined Reasons" & Zero Overhead > when I said "imagined reasons", I meant I didn't see the issue you > described appear in a prod environment running millions of jobs every > month, and I have also said that why it won't happen in PINS and other > normal case: as in a K8S cluster , there will be a reserved space for > system daemons in each host, even with many 0-memoryOverhead jobs, they > won't be "fully packed" as you imagined since these 0-memory overhead jobs > don't need much memory overhead space anyway > > let me bring my earlier suggestions again, if you don't want any job to > have 0 memoryOverhead, you can just calculate how much memoryOverhead is > guaranteed with simple arithmetic, if it is 0, do not use this feature > > In general, I don't really suggest you use this feature if you cannot > manage the rollout process, just like no one should apply something like > auto-tuning to all of their jobs without a dedicated Spark platform team . > > 2. Kubelet Eviction Relevance > > 2.a my question is , how PID/Disk pressure is related to the memory > related feature we are discussing here? please don't fan out the discussion > scope unlimitedly > 2.b exposing spark.kubernetes.executor.bursty.priorityClassName is far > away from a reasonable design, the priority class name should be controlled > in cluster level and then specified via something like spark operator or if > you can specify pod spec, instead of embedding it to a memory related > feature > > 3. Can we agree to simply *add these two parameters as optional > configurations*? > > unfortunately no... > > some of the problems you raised probably will happen in very very extreme > cases, I have provided solutions to them without the need to add additional > configs... Other problems you raised are not related to what this SPIP is > about, e.g. PID exhausting, etc. and some of your proposed design doesn't > make sense to me , e.g. specifying executor's priority class via such a > memory related feature.... > > > On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> > wrote: > >> Hi Nan, >> >> Thanks for the candid response. I see where you are coming from regarding >> managed rollouts, but I think we are viewing this from two different >> lenses: "Internal Platform" vs. "General Open Source Product." >> >> Here is why I am pushing for these two specific configuration hooks: >> >> 1. Re: "Imagined Reasons" & Zero Overhead >> >> You mentioned that you have observed jobs running fine with zero >> memoryOverhead. >> >> While that may be true for specific workloads in your environment, the >> requirement for non-heap memory is not "imagined"—it is a JVM >> specification. Thread stacks, CodeCache, and Netty DirectByteBuffer control >> structures must live in non-heap memory. >> >> - >> >> *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully >> bin-packed (Sum of Requests = Node Capacity), your executor is >> mathematically guaranteed *zero bytes* of non-heap memory unless it >> can steal from the burst pool. >> - >> >> *The Risk:* If the burst pool is temporarily exhausted by neighbors, >> a simple thread creation will throw OutOfMemoryError: unable to >> create new native thread. >> - >> >> *The Fix:* I am not asking to change your default behavior. I am >> asking to *expose the config* (minGuaranteedRatio). If you set it to >> 0.0 (default), your behavior is unchanged. 
But for those of us >> running high-concurrency environments who need a 5-10% safety buffer for >> thread stacks, we need the *capability* to configure it without >> maintaining a fork or writing complex pre-submission wrappers. >> >> 2. Re: Kubelet Eviction Relevance >> >> You asked how Disk/PID pressure is related. >> >> In Kubernetes, PriorityClass is the universal signal for pod importance >> during any node-pressure event (not just memory). >> >> - >> >> If a node runs out of Ephemeral Storage (common with Spark Shuffle), >> the Kubelet evicts pods. >> - >> >> Without a priorityClassName config, these Spark pods (which are now >> QoS-downgraded to Burstable) will be evicted *before* Best-Effort >> jobs that might have a higher priority class. >> - >> >> Again, this is a standard Kubernetes spec feature. There is no >> downside to exposing >> spark.kubernetes.executor.bursty.priorityClassName as an optional >> config. >> >> *Proposal to Unblock* >> >> We both want this feature merged. I am not asking to change your >> formula's default behavior. >> >> Can we agree to simply *add these two parameters as optional >> configurations*? >> >> 1. >> >> minGuaranteedRatio (Default: 0.0 -> preserves your logic exactly). >> 2. >> >> priorityClassName (Default: null -> preserves your logic exactly). >> >> This satisfies your design goals while making the feature robust enough >> for my production requirements. >> >> >> Regards, >> >> Viquar Khan >> >> Sr Data Architect >> >> https://www.linkedin.com/in/vaquar-khan-b695577/ >> >> >> >> On Tue, 30 Dec 2025 at 01:04, Nan Zhu <[email protected]> wrote: >> >>> > However, I maintain that for a general-purpose open-source feature >>> (which will be used by teams without dedicated platform engineers to manage >>> rollouts), we need structural safety guardrails. >>> >>> I am not sure we can roll out such a feature to all jobs without a >>> managed rollout process, it is an anti-pattern in any engineering org. this >>> feature is disabled by default which is already a guard to prevent users >>> silently getting into what they won't expect >>> >>> >>> > A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing >>> up the design"—it is mathematically necessary to prevent the formula from >>> collapsing to zero in valid production scenarios. >>> >>> this formula *IS* designed to output 0 in some cases, so it is *NOT* >>> collapsing to zero.. I observed that , even with 0 memoryoverhead in many >>> jobs, with a proper bursty factor saved tons of money in a real PROD >>> environment instead of from my imagination.. if you don't want any >>> memoryOverhead to be zero in your job for your imagined reasons, you can >>> just calculate your threshold for on-heap/memoryOverhead ratio for rolling >>> out >>> >>> step back... if your team doesn't know how to manage rollout, most >>> likely you are rolling out this feature for individual jobs without a >>> centralized feature rolling out point, right? then, you can just use the >>> simple arithmetics to calculate whether the resulting memoryOverhead is 0, >>> if yes, don't use this feature, that's it.... >>> >>> >>> > However, Kubelet eviction is the primary mechanism for other pressure >>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios >>> where memory.available crosses the eviction threshold before the kernel >>> panics. >>> >>> How are they related to this feature? 
>>> >>> >>> On Mon, Dec 29, 2025 at 10:37 PM vaquar khan <[email protected]> >>> wrote: >>> >>>> Hi Nan, >>>> >>>> Thanks for the detailed reply. I appreciate you sharing the specific >>>> context from the Pinterest implementation—it helps clarify the operational >>>> model you are using. >>>> >>>> However, I maintain that for a general-purpose open-source feature >>>> (which will be used by teams without dedicated platform engineers to manage >>>> rollouts), we need structural safety guardrails. >>>> >>>> *Here is my response to your points:* >>>> >>>> 1. Re: "Zero-Guarantee" & Safety (Critical) >>>> >>>> You suggested that "setting a conservative bursty factor" resolves the >>>> risk of zero-guaranteed overhead. >>>> >>>> Mathematically, this is incorrect for High-Heap jobs. The formula is >>>> structural: G = O - \min((H+O) \times (B-1), O). >>>> >>>> Consider a standard ETL job: Heap (H) = 100GB, Overhead (O) = 5GB. >>>> >>>> Even if we set a very conservative Bursty Factor (B) of 1.06 (only 6% >>>> burst): >>>> >>>> - >>>> >>>> Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 = 6.3GB. >>>> - >>>> >>>> Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*. >>>> >>>> Even with an extremely conservative factor, the design forces this pod >>>> to have zero guaranteed memory for OS/JVM threads. This is not a tuning >>>> issue; it is a formulaic edge case for high-memory jobs. >>>> >>>> * A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing >>>> up the design"—it is mathematically necessary to prevent the formula from >>>> collapsing to zero in valid production scenarios.* >>>> >>>> 2. Re: Kubelet Eviction vs. OOMKiller >>>> >>>> I concede that in sudden memory spikes, the Kernel OOMKiller often acts >>>> faster than Kubelet eviction. >>>> >>>> However, Kubelet eviction is the primary mechanism for other pressure >>>> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios >>>> where memory.available crosses the eviction threshold before the kernel >>>> panics. >>>> >>>> * Adding priorityClassName support to the Pod spec is a low-effort, >>>> zero-risk change that aligns with Kubernetes best practices for "Defense in >>>> Depth." It costs nothing to expose this config.* >>>> >>>> 3. Re: Native Support >>>> >>>> Fair point. To keep the scope tight, I am happy to drop the Native >>>> Support request for this SPIP. We can treat that as a separate follow-up. >>>> >>>> Path Forward >>>> >>>> I am happy to support if we can agree to: >>>> >>>> 1. >>>> >>>> Add the minGuaranteedRatio config (to handle the High-Heap math >>>> proven above). >>>> 2. >>>> >>>> Expose the priorityClassName config (standard K8S practice). >>>> >>>> >>>> Regards, >>>> >>>> Viquar Khan >>>> >>>> Sr Data Architect >>>> >>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>> >>>> >>>> On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote: >>>> >>>>> > Kubelet Eviction is the first line of defense before the Kernel >>>>> OOMKiller strikes. >>>>> >>>>> This is *NOT* true. Eviction will be the first to kill some best >>>>> effort pod which doesn't make any difference on memory pressure in most >>>>> cases and before it takes action again, Kernel OOMKiller already killed >>>>> some executor pods. This is exactly the reason for me to say, we don't >>>>> really worry about eviction here, before eviction touches those executors, >>>>> OOMKiller already killed them. 
This behavior is consistently observed and >>>>> we also had discussions with other companies who had to modify Kernel code >>>>> to mitigate this behavior. >>>>> >>>>> > Re: "Zero-Guarantee" & Safety >>>>> >>>>> you basically want to tradeoff saving with system safety , then why >>>>> not just setting a conservative value of bursty factor? it is exactly what >>>>> we did in PINS, please check my earlier response in the thread ... key >>>>> part >>>>> as following: >>>>> >>>>> "in PINS, we basically apply a set of strategies, setting >>>>> conservative bursty factor, progressive rollout, monitor the cluster >>>>> metrics like Linux Kernel OOMKiller occurrence to guide us to the optimal >>>>> setup of bursty factor... in usual, K8S operators will set a reserved >>>>> space >>>>> for daemon processes on each host, we found it is sufficient to in our >>>>> case >>>>> and our major tuning focuses on bursty factor value " >>>>> >>>>> If you really want, you can enable this feature only for jobs when >>>>> OnHeap/MemoryOverhead is smaller than a certain value... >>>>> >>>>> I just didn't see the value of bringing another configuration >>>>> >>>>> >>>>> > Re: Native Support >>>>> >>>>> I mean....this SPIP is NOT about native execution engine's memory >>>>> pattern at all..... why do we bother to bring it up.... >>>>> >>>>> >>>>> >>>>> On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Nan, >>>>>> >>>>>> Thanks for the prompt response and for clarifying the design intent. >>>>>> >>>>>> I understand the goal is to maximize savings—and I agree we shouldn't >>>>>> block the current momentum even though you can see my vote is +1—but I >>>>>> want >>>>>> to ensure we aren't over-optimizing for specific internal environments at >>>>>> the cost of general community stability. >>>>>> >>>>>> *Here is my rejoinder on the technical points:* >>>>>> >>>>>> 1. Re: PriorityClass & OOMKiller (Defense in Depth) >>>>>> >>>>>> You mentioned that “priorityClassName is NOT the solution... What we >>>>>> worry about is the Linux Kernel OOMKiller.” >>>>>> >>>>>> I agree that the Kernel OOMKiller (cgroup) primarily looks at >>>>>> oom_score_adj (which is determined by QoS Class). However, Kubelet >>>>>> Eviction >>>>>> is the first line of defense before the Kernel OOMKiller strikes. >>>>>> >>>>>> When a node comes under memory pressure (e.g., memory.available >>>>>> drops below evictionHard), the *Kubelet* actively selects pods to >>>>>> evict to reclaim resources. Unlike the Kernel, the Kubelet *does* >>>>>> explicitly use PriorityClass when ranking candidates for eviction. >>>>>> >>>>>> - >>>>>> >>>>>> *The Risk:* Since we are downgrading these pods to *Burstable* >>>>>> (increasing their OOM risk), we lose the "Guaranteed" protection >>>>>> shield. >>>>>> - >>>>>> >>>>>> *The Fix:* By assigning a high PriorityClass, we ensure that if >>>>>> the Kubelet needs to free space, it evicts lower-priority batch jobs >>>>>> *before* these Spark executors. It is a necessary "Defense in >>>>>> Depth" strategy for multi-tenant clusters that prevents optimized >>>>>> Spark >>>>>> jobs from being the first victims of node pressure. >>>>>> >>>>>> 2. Re: "Zero-Guarantee" & Safety >>>>>> >>>>>> You noted that “savings come from these 0 memory overhead pods.” >>>>>> >>>>>> While G=0 maximizes "on-paper" savings, it is theoretically unsafe >>>>>> for a JVM. A JVM physically requires non-heap memory for Thread Stacks, >>>>>> CodeCache, and Metaspace just to run. 
>>>>>> >>>>>> - >>>>>> >>>>>> *The Reality:* If G=0, then Pod Request == Heap. If a node is >>>>>> fully packed (Sum of Requests ≈ Node Capacity), the pod relies >>>>>> *100%* on the burst pool for basic thread allocation. If >>>>>> neighbors are noisy, that pod cannot even spawn a thread. >>>>>> - >>>>>> >>>>>> *The Compromise:* I strongly suggest we add the configuration >>>>>> spark.executor.memoryOverhead.minGuaranteedRatio but set the *default >>>>>> to 0.0*. >>>>>> - >>>>>> >>>>>> This preserves your logic/savings by default. >>>>>> - >>>>>> >>>>>> But it gives platform admins a "safety knob" to turn (e.g., to >>>>>> 0.1) when they inevitably encounter instability in high-contention >>>>>> environments, without needing a code patch. >>>>>> >>>>>> 3. Re: Native Support >>>>>> >>>>>> Agreed. We can treat Off-Heap support as a follow-up item. I would >>>>>> just request that we add a "Known Limitation" note in the SPIP stating >>>>>> that >>>>>> this optimization does not yet apply to spark.memory.offHeap.size, so >>>>>> users >>>>>> of Gluten/Velox are aware. >>>>>> >>>>>> I am happy to support the PR moving forward if we can agree to >>>>>> include the *PriorityClass* config support and the *Safety Floor* >>>>>> config (even if disabled by default) ,Please update your SIP. This >>>>>> ensures >>>>>> the feature is robust enough for the wider user base. >>>>>> >>>>>> Regards, >>>>>> >>>>>> Viquar Khan >>>>>> >>>>>> Sr Data Architect >>>>>> >>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>>>> >>>>>> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote: >>>>>> >>>>>>> Hi, Vaquar >>>>>>> >>>>>>> thanks for the replies, >>>>>>> >>>>>>> 1. for Guaranteed QoS >>>>>>> >>>>>>> I may missed some words in the original doc, the idea I would like >>>>>>> to convey is that we essentially need to give up this idea due to what >>>>>>> cannot achieve Guaranteed QoS as we already have different values of >>>>>>> memory >>>>>>> and even we only consider CPU request/limit values, it brings other >>>>>>> risks >>>>>>> to us >>>>>>> >>>>>>> Additionally , priorityClassName is NOT the solution here. What we >>>>>>> really worry about is NOT eviction, but Linux Kernel OOMKiller where we >>>>>>> cannot pass pod priority information into. With burstable pods, the >>>>>>> only thing Linux Kernel OOMKiller considers is the memory request size >>>>>>> which not necessary maps to priority information >>>>>>> >>>>>>> 2. The "Zero-Guarantee" Edge Case >>>>>>> >>>>>>> Actually, a lot of savings are from these 0 memory overhead pods... >>>>>>> I am curious if you have adopted the PoC PR in prod as you have >>>>>>> identified >>>>>>> it is "unsafe"? >>>>>>> >>>>>>> Something like a minGuaranteedRatio is not a good idea , it will >>>>>>> mess up the original design idea of the formula (check Appendix C), the >>>>>>> simplest thing you can do is to avoid rolling out features to the jobs >>>>>>> which you feel will be unsafe... >>>>>>> >>>>>>> 3. Native Execution Gap (Off-Heap) >>>>>>> >>>>>>> I am not sure the off-heap memory usage of gluten/comet shows the >>>>>>> same/similar pattern as memoryOverhead. No one has validated that in >>>>>>> production environment, but both PINS/Bytedance has validated >>>>>>> memoryOverhead part thoroughly in their clusters >>>>>>> >>>>>>> Additionally, the key design of the proposal is to capture the >>>>>>> relationship between on-heap and memoryOverhead sizes, in another word, >>>>>>> they co-exist.... 
offheap memory used by native engines are different >>>>>>> stories where , ideally, the on-heap usage should be minimum and most of >>>>>>> memory usage should come from off-heap part...so the formula here may >>>>>>> not >>>>>>> work out of box >>>>>>> >>>>>>> my suggestion is, since the community has approved the original >>>>>>> design which have been tested by at least 2 companies in production >>>>>>> environments, we go with the current design and continue code review , >>>>>>> in >>>>>>> future, we can add what have been found/tested in production as >>>>>>> followups >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> Nan >>>>>>> >>>>>>> >>>>>>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Yao, Nan, and Chao, >>>>>>>> >>>>>>>> Thank you for this proposal though I already approved . The >>>>>>>> cost-efficiency goals are very compelling, and the cited $6M annual >>>>>>>> savings >>>>>>>> at Pinterest clearly demonstrates the value of moving away from >>>>>>>> rigid peak provisioning. >>>>>>>> >>>>>>>> However, after modeling the proposed design against standard >>>>>>>> Kubernetes behavior and modern Spark workloads, I have identified >>>>>>>> *three >>>>>>>> critical stability risks* that need to be addressed before this is >>>>>>>> finalized. >>>>>>>> >>>>>>>> I have drafted a *Supplementary Design Amendment* (linked >>>>>>>> below/attached) that proposes fixes for these issues, but here is the >>>>>>>> summary: >>>>>>>> 1. The "Guaranteed QoS" Contradiction >>>>>>>> >>>>>>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for >>>>>>>> stability risks2. >>>>>>>> >>>>>>>> The Issue: Technically, this mitigation is impossible under your >>>>>>>> proposal. >>>>>>>> >>>>>>>> - >>>>>>>> >>>>>>>> In Kubernetes, a Pod is assigned the *Guaranteed* QoS class >>>>>>>> *only* if Request == Limit for both CPU and Memory. >>>>>>>> - >>>>>>>> >>>>>>>> Your proposal explicitly sets Memory Request < Memory Limit >>>>>>>> (specifically $H+G < H+O$)3. >>>>>>>> >>>>>>>> - >>>>>>>> >>>>>>>> *Consequence:* This configuration *automatically downgrades* >>>>>>>> the Pod to the *Burstable* QoS class. In a multi-tenant >>>>>>>> cluster, the Kubelet eviction manager will kill these "Burstable" >>>>>>>> Spark >>>>>>>> pods *before* any Guaranteed system pods during node pressure. >>>>>>>> - >>>>>>>> >>>>>>>> *Proposed Fix:* We cannot rely on Guaranteed QoS. We must >>>>>>>> introduce a priorityClassName configuration to offset this >>>>>>>> eviction risk. >>>>>>>> >>>>>>>> 2. The "Zero-Guarantee" Edge Case >>>>>>>> >>>>>>>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ 4 has a >>>>>>>> dangerous edge case for High-Heap/Low-Overhead jobs (common in ETL). >>>>>>>> >>>>>>>> >>>>>>>> - >>>>>>>> >>>>>>>> *Scenario:* If a job has a large Heap ($H$) relative to >>>>>>>> Overhead ($O$), the calculated burst deduction often exceeds >>>>>>>> the total Overhead. >>>>>>>> - >>>>>>>> >>>>>>>> *Result:* The formula yields *$G = 0$*. >>>>>>>> - >>>>>>>> >>>>>>>> *Risk:* Allocating 0MB of guaranteed overhead is unsafe. >>>>>>>> Essential JVM operations (thread stacks, Netty control buffers) >>>>>>>> require a >>>>>>>> non-zero baseline. Relying 100% on a shared burst pool for basic >>>>>>>> functionality will lead to immediate container failures if the node >>>>>>>> is >>>>>>>> contended. >>>>>>>> - >>>>>>>> >>>>>>>> *Proposed Fix:* Implement a safety floor using a >>>>>>>> minGuaranteedRatio (e.g., max(Calculated_G, O * 0.1)). >>>>>>>> >>>>>>>> 3. 
Native Execution Gap (Off-Heap) >>>>>>>> >>>>>>>> The proposal focuses entirely on memoryOverhead5. >>>>>>>> >>>>>>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift >>>>>>>> execution memory to spark.memory.offHeap.size. This memory is equally >>>>>>>> "bursty" but is excluded from your optimization. >>>>>>>> >>>>>>>> *Proposed Fix: *The burst-aware logic should be extensible to >>>>>>>> include Off-Heap memory if enabled. >>>>>>>> >>>>>>>> >>>>>>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing >>>>>>>> >>>>>>>> >>>>>>>> I believe these changes are necessary to make the feature robust >>>>>>>> enough for general community adoption beyond specific controlled >>>>>>>> environments. >>>>>>>> >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Viquar Khan >>>>>>>> >>>>>>>> Sr Data Architect >>>>>>>> >>>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> +1 >>>>>>>>> >>>>>>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> +1 >>>>>>>>>> >>>>>>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> +1 from me. >>>>>>>>>>> I think it's well-scoped and takes advantage of Kubernetes' >>>>>>>>>>> features exactly for what they are designed for(as per my >>>>>>>>>>> understanding). >>>>>>>>>>> >>>>>>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for >>>>>>>>>>>> the detailed and thoughtful discussion. >>>>>>>>>>>> >>>>>>>>>>>> Overall, this looks like a valuable addition for organizations >>>>>>>>>>>> running Spark on Kubernetes, especially given how bursty >>>>>>>>>>>> memoryOverhead usage tends to be in practice. I appreciate >>>>>>>>>>>> that the change is relatively small in scope and fully opt-in, >>>>>>>>>>>> which helps >>>>>>>>>>>> keep the risk low. >>>>>>>>>>>> >>>>>>>>>>>> From my perspective, the questions raised on the thread and in >>>>>>>>>>>> the SPIP have been addressed. If others feel the same, do we have >>>>>>>>>>>> consensus >>>>>>>>>>>> to move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya. 
>>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Chao >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> this is a good question >>>>>>>>>>>>> >>>>>>>>>>>>> > a stage is bursty and consumes the shared portion and fails >>>>>>>>>>>>> to release it for subsequent stages >>>>>>>>>>>>> >>>>>>>>>>>>> in the scenario you described, since the memory-leaking stage >>>>>>>>>>>>> and the subsequence ones are from the same job , the pod will >>>>>>>>>>>>> likely be >>>>>>>>>>>>> killed by cgroup oomkiller >>>>>>>>>>>>> >>>>>>>>>>>>> taking the following as the example >>>>>>>>>>>>> >>>>>>>>>>>>> the usage pattern is G = 5GB S = 2GB, it uses G + S at max >>>>>>>>>>>>> and in theory, it should release all 7G and then claim 7G again >>>>>>>>>>>>> in some >>>>>>>>>>>>> later stages, however, due to the memory peak, it holds 2G >>>>>>>>>>>>> forever and ask >>>>>>>>>>>>> for another 7G, as a result, it hits the pod memory limit and >>>>>>>>>>>>> cgroup >>>>>>>>>>>>> oomkiller will take action to terminate the pod >>>>>>>>>>>>> >>>>>>>>>>>>> so this should be safe to the system >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> however, we should be careful about the memory peak for sure, >>>>>>>>>>>>> because it essentially breaks the assumption that the usage of >>>>>>>>>>>>> memoryOverhead is bursty (memory peak ~= use memory forever)... >>>>>>>>>>>>> unfortunately, shared/guaranteed memory is managed by user >>>>>>>>>>>>> applications >>>>>>>>>>>>> instead of on cluster level , they, especially S, are just logical >>>>>>>>>>>>> concepts instead of a physical memory pool which pods can >>>>>>>>>>>>> explicitly claim >>>>>>>>>>>>> memory from... >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for the interesting proposal. >>>>>>>>>>>>>> The design seems to rely on memoryOverhead being transient. >>>>>>>>>>>>>> What happens when a stage is bursty and consumes the shared >>>>>>>>>>>>>> portion and fails to release it for subsequent stages (e.g., >>>>>>>>>>>>>> off-heap >>>>>>>>>>>>>> buffers and its not garbage collected since its off-heap)? Would >>>>>>>>>>>>>> this >>>>>>>>>>>>>> trigger the host-level OOM like described in Q6? or are there >>>>>>>>>>>>>> strategies to >>>>>>>>>>>>>> release the shared portion? >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> yes, that's the worst case in the scenario, please check my >>>>>>>>>>>>>>> earlier response to Qiegang's question, we have a set of >>>>>>>>>>>>>>> strategies adopted >>>>>>>>>>>>>>> in prod to mitigate the issue >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for the explanation! So the executor is not >>>>>>>>>>>>>>>> guaranteed to get 50 GB physical memory, right? All pods on >>>>>>>>>>>>>>>> the same host >>>>>>>>>>>>>>>> may reach peak memory usage at the same time and cause >>>>>>>>>>>>>>>> paging/swapping >>>>>>>>>>>>>>>> which hurts performance? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> np, let me try to explain >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. 
Each executor container will be run in a pod together >>>>>>>>>>>>>>>>> with some other sidecar containers taking care of tasks like >>>>>>>>>>>>>>>>> authentication, etc. , for simplicity, we assume each pod has >>>>>>>>>>>>>>>>> only one >>>>>>>>>>>>>>>>> container which is the executor container >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. Each container is assigned with two values, r >>>>>>>>>>>>>>>>> *equest&limit** (limit >= request),* for both of >>>>>>>>>>>>>>>>> CPU/memory resources (we only discuss memory here). Each pod >>>>>>>>>>>>>>>>> will have >>>>>>>>>>>>>>>>> request/limit values as the sum of all containers belonging >>>>>>>>>>>>>>>>> to this pod >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on >>>>>>>>>>>>>>>>> *request* value, and cap the resource usage of each >>>>>>>>>>>>>>>>> container based on their *limit* value, e.g. if I have a >>>>>>>>>>>>>>>>> pod with a single container in it , and it has 1G/2G as >>>>>>>>>>>>>>>>> request and limit >>>>>>>>>>>>>>>>> value respectively, any machine with 1G free RAM space will >>>>>>>>>>>>>>>>> be a candidate >>>>>>>>>>>>>>>>> to host this pod, and when the container use more than 2G >>>>>>>>>>>>>>>>> memory, it will >>>>>>>>>>>>>>>>> be killed by cgroup oomkiller. Once a pod is scheduled to a >>>>>>>>>>>>>>>>> host, the >>>>>>>>>>>>>>>>> memory space sized at "sum of all its containers' request >>>>>>>>>>>>>>>>> values" will be >>>>>>>>>>>>>>>>> booked exclusively for this pod. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 4. By default, Spark *sets request/limit as the same >>>>>>>>>>>>>>>>> value for executors in k8s*, and this value is basically >>>>>>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most >>>>>>>>>>>>>>>>> cases . >>>>>>>>>>>>>>>>> However, spark.executor.memoryOverhead usage is very bursty, >>>>>>>>>>>>>>>>> the user >>>>>>>>>>>>>>>>> setting spark.executor.memoryOverhead as 10G usually means >>>>>>>>>>>>>>>>> each executor >>>>>>>>>>>>>>>>> only needs 10G in a very small portion of the executor's >>>>>>>>>>>>>>>>> whole lifecycle >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple >>>>>>>>>>>>>>>>> request/limit value in spark@k8s for executors in a safe >>>>>>>>>>>>>>>>> way (this idea is from the bytedance paper we refer to in >>>>>>>>>>>>>>>>> SPIP paper). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Using the aforementioned example , >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we >>>>>>>>>>>>>>>>> have two pods requesting 40G + 10G (on-heap + memoryOverhead) >>>>>>>>>>>>>>>>> and we set >>>>>>>>>>>>>>>>> bursty factor to 1.2, without the mechanism proposed in this >>>>>>>>>>>>>>>>> SPIP, we can >>>>>>>>>>>>>>>>> at most host 2 pods with this machine, and because of the >>>>>>>>>>>>>>>>> bursty usage of >>>>>>>>>>>>>>>>> that 10G space, the memory utilization would be compromised. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When applying the burst-aware memory allocation, we only >>>>>>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each >>>>>>>>>>>>>>>>> pod, i.e. we >>>>>>>>>>>>>>>>> have 20G free memory space left in the machine which can be >>>>>>>>>>>>>>>>> used to host >>>>>>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the >>>>>>>>>>>>>>>>> limit value of >>>>>>>>>>>>>>>>> the executor pods, these executors can still use 50G at max. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does >>>>>>>>>>>>>>>>>> it work under the hood? The container will adjust its system >>>>>>>>>>>>>>>>>> memory size >>>>>>>>>>>>>>>>>> depending on the actual memory usage of the processes in >>>>>>>>>>>>>>>>>> this container? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> yeah, we have a few cases that we have significantly >>>>>>>>>>>>>>>>>>> larger O than H, the proposed algorithm is actually a great >>>>>>>>>>>>>>>>>>> fit >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed >>>>>>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure the >>>>>>>>>>>>>>>>>>> safety of running but >>>>>>>>>>>>>>>>>>> still cut a big chunk of memory (10s of GBs) and treat them >>>>>>>>>>>>>>>>>>> as S , saving >>>>>>>>>>>>>>>>>>> tons of money burnt by them >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> but regarding native accelerators, some native >>>>>>>>>>>>>>>>>>> acceleration engines do not use memoryOverhead but use >>>>>>>>>>>>>>>>>>> off-heap >>>>>>>>>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The >>>>>>>>>>>>>>>>>>> current >>>>>>>>>>>>>>>>>>> implementation does not cover this part , while that will >>>>>>>>>>>>>>>>>>> be an easy >>>>>>>>>>>>>>>>>>> extension >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for the reply. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than >>>>>>>>>>>>>>>>>>>> H? Wondering if the proposed algorithm would help more in >>>>>>>>>>>>>>>>>>>> those >>>>>>>>>>>>>>>>>>>> environments (eg. with >>>>>>>>>>>>>>>>>>>> native accelerators)? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> please check the following answer >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> yes, your understanding is correct >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the >>>>>>>>>>>>>>>>>>>>> total potential usage sum of H+G+S across all pods on a >>>>>>>>>>>>>>>>>>>>> node exceeds its >>>>>>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely >>>>>>>>>>>>>>>>>>>>> on the cluster >>>>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer >>>>>>>>>>>>>>>>>>>>> exists on the node >>>>>>>>>>>>>>>>>>>>> to serve as the shared pool? 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies, >>>>>>>>>>>>>>>>>>>>> setting conservative bursty factor, progressive rollout, >>>>>>>>>>>>>>>>>>>>> monitor the >>>>>>>>>>>>>>>>>>>>> cluster metrics like Linux Kernel OOMKiller occurrence to >>>>>>>>>>>>>>>>>>>>> guide us to the >>>>>>>>>>>>>>>>>>>>> optimal setup of bursty factor... in usual, K8S operators >>>>>>>>>>>>>>>>>>>>> will set a >>>>>>>>>>>>>>>>>>>>> reserved space for daemon processes on each host, we >>>>>>>>>>>>>>>>>>>>> found it is sufficient >>>>>>>>>>>>>>>>>>>>> to in our case and our major tuning focuses on bursty >>>>>>>>>>>>>>>>>>>>> factor value >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to >>>>>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and >>>>>>>>>>>>>>>>>>>>> small S values on a >>>>>>>>>>>>>>>>>>>>> single node? I am wondering if this would reduce the >>>>>>>>>>>>>>>>>>>>> probability of >>>>>>>>>>>>>>>>>>>>> concurrent bursting and host-level OOM. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Yes, when we work on this project, we put some >>>>>>>>>>>>>>>>>>>>> attention on the cluster scheduling policy/behavior... >>>>>>>>>>>>>>>>>>>>> two things we mostly >>>>>>>>>>>>>>>>>>>>> care about >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have >>>>>>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have >>>>>>>>>>>>>>>>>>>>> enough candidates >>>>>>>>>>>>>>>>>>>>> to form a mixed set of executors with large S and >>>>>>>>>>>>>>>>>>>>> small S values >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which >>>>>>>>>>>>>>>>>>>>> tends to pack more pods from the same job to the same >>>>>>>>>>>>>>>>>>>>> host, which can >>>>>>>>>>>>>>>>>>>>> create troubles as they are more likely to ask for max >>>>>>>>>>>>>>>>>>>>> memory at the same >>>>>>>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. I >>>>>>>>>>>>>>>>>>>>>> have a few questions regarding the shared portion S: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated >>>>>>>>>>>>>>>>>>>>>> when the total potential usage sum of H+G+S across >>>>>>>>>>>>>>>>>>>>>> all pods on a node >>>>>>>>>>>>>>>>>>>>>> exceeds its allocatable capacity? Does the proposal >>>>>>>>>>>>>>>>>>>>>> implicitly rely on the >>>>>>>>>>>>>>>>>>>>>> cluster operator to manually ensure an unrequested >>>>>>>>>>>>>>>>>>>>>> memory buffer exists on >>>>>>>>>>>>>>>>>>>>>> the node to serve as the shared pool? >>>>>>>>>>>>>>>>>>>>>> 2. Have you considered scheduling optimizations >>>>>>>>>>>>>>>>>>>>>> to ensure a strategic mix of executors with large S >>>>>>>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>>> small S values on a single node? I am wondering >>>>>>>>>>>>>>>>>>>>>> if this would reduce the probability of concurrent >>>>>>>>>>>>>>>>>>>>>> bursting and host-level >>>>>>>>>>>>>>>>>>>>>> OOM. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big >>>>>>>>>>>>>>>>>>>>>>> picture: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formular >>>>>>>>>>>>>>>>>>>>>>> indicates a fixed heap size, and memory overhead >>>>>>>>>>>>>>>>>>>>>>> can't be dynamic if it's >>>>>>>>>>>>>>>>>>>>>>> on-heap. >>>>>>>>>>>>>>>>>>>>>>> - Do Spark applications have static profiles? >>>>>>>>>>>>>>>>>>>>>>> When we submit stages, the cluster is already >>>>>>>>>>>>>>>>>>>>>>> allocated, how can we change >>>>>>>>>>>>>>>>>>>>>>> anything? >>>>>>>>>>>>>>>>>>>>>>> - How do we assign the shared memory overhead? >>>>>>>>>>>>>>>>>>>>>>> Fairly among all applications on the same physical >>>>>>>>>>>>>>>>>>>>>>> node? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc >>>>>>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple... >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 >>>>>>>>>>>>>>>>>>>>>>>> of the SPIP doc >>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0 >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> it is calculated based on per profile (you can say >>>>>>>>>>>>>>>>>>>>>>>> it is based on per stage), when the cluster manager >>>>>>>>>>>>>>>>>>>>>>>> compose the pod spec, >>>>>>>>>>>>>>>>>>>>>>>> it calculates the new memory overhead based on what >>>>>>>>>>>>>>>>>>>>>>>> user asks for in that >>>>>>>>>>>>>>>>>>>>>>>> resource profile >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the >>>>>>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per >>>>>>>>>>>>>>>>>>>>>>>>> executor? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on >>>>>>>>>>>>>>>>>>>>>>>>>> the request/limit concept in K8S, ... >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming >>>>>>>>>>>>>>>>>>>>>>>>>> in future, as long as it has a similar concept , it >>>>>>>>>>>>>>>>>>>>>>>>>> can leverage this >>>>>>>>>>>>>>>>>>>>>>>>>> easily as the main logic is implemented in >>>>>>>>>>>>>>>>>>>>>>>>>> ResourceProfile >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it >>>>>>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources? 
>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao < >>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Folks, >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead >>>>>>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve >>>>>>>>>>>>>>>>>>>>>>>>>>>> memory utilization of spark clusters. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc >>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original >>>>>>>>>>>>>>>>>>>>>>>>>>>> paper >>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> >>>>>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui( >>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]) and Yixin( >>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]). >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Regards, >>>>>>>> Vaquar Khan >>>>>>>> >>>>>>>> >>>>>> >>>>>> -- >>>>>> Regards, >>>>>> Vaquar Khan >>>>>> >>>>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> >>>> >> >> -- >> Regards, >> Vaquar Khan >> >> -- Regards, Vaquar Khan
