Hi Nan,

Thanks for the candid response. I see where you are coming from regarding managed rollouts, but I think we are looking at this through two different lenses: "Internal Platform" vs. "General Open Source Product."
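Before getting into the details, here is a small, purely illustrative sketch (in Scala, only because that is where ResourceProfile lives) of the SPIP's formula G = O - min((H + O) × (B - 1), O) together with the floor I am asking for. To be clear about the assumptions: minGuaranteedRatio is a proposed knob, not an existing Spark config, the numbers are just the 100GB-heap example from earlier in this thread, and this is a sketch, not the actual implementation:

// Illustrative only. Assumes the formula from the SPIP's Appendix C;
// minGuaranteedRatio is my proposed addition and does not exist in Spark today.
object BurstyOverheadSketch {
  // Guaranteed memoryOverhead (MiB) for heap H, overhead O, bursty factor B.
  def guaranteedOverheadMiB(heapMiB: Long, overheadMiB: Long, burstyFactor: Double,
                            minGuaranteedRatio: Double = 0.0): Long = {
    val deduction = math.min(((heapMiB + overheadMiB) * (burstyFactor - 1)).toLong, overheadMiB)
    val g = overheadMiB - deduction                        // current SPIP formula: G = O - min((H+O)(B-1), O)
    math.max(g, (overheadMiB * minGuaranteedRatio).toLong) // proposed optional safety floor
  }

  def main(args: Array[String]): Unit = {
    val (h, o) = (100L * 1024, 5L * 1024)                  // 100 GB heap, 5 GB overhead
    println(guaranteedOverheadMiB(h, o, 1.06))             // 0   -> current formula yields G = 0 for this job
    println(guaranteedOverheadMiB(h, o, 1.06, 0.10))       // 512 -> a 10% floor keeps ~0.5 GB for thread stacks etc.
  }
}

The intent is that with the defaults (minGuaranteedRatio = 0.0 and no priorityClassName set) the generated executor pod spec stays exactly what the current design produces; only operators who explicitly opt in would see a non-zero guaranteed floor or a priorityClassName field.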
Here is why I am pushing for these two specific configuration hooks:

1. Re: "Imagined Reasons" & Zero Overhead

You mentioned that you have observed jobs running fine with zero memoryOverhead. That may be true for specific workloads in your environment, but the need for non-heap memory is not "imagined". The JVM requires it: thread stacks, CodeCache, and Netty DirectByteBuffer control structures must live outside the heap.

- *The Scenario:* If G = 0, then Pod Request == Heap. On a fully bin-packed node (Sum of Requests = Node Capacity), the executor is guaranteed *zero bytes* of non-heap memory unless it can borrow from the burst pool.
- *The Risk:* If the burst pool is temporarily exhausted by neighbors, even creating a thread fails with "OutOfMemoryError: unable to create new native thread".
- *The Fix:* I am not asking to change your default behavior. I am asking to *expose the config* (minGuaranteedRatio). At the default of 0.0 your behavior is unchanged, but those of us running high-concurrency environments who need a 5-10% safety buffer for thread stacks get the *capability* to set it without maintaining a fork or writing pre-submission wrappers.

2. Re: Kubelet Eviction Relevance

You asked how Disk/PID pressure is related. In Kubernetes, PriorityClass is the universal signal for pod importance during any node-pressure event, not just memory pressure.

- If a node runs out of ephemeral storage (common with Spark shuffle), the Kubelet evicts pods.
- Without a priorityClassName config, these Spark pods (now QoS-downgraded to Burstable) can be evicted *before* Best-Effort jobs that carry a higher priority class.
- This is a standard Kubernetes feature; there is no downside to exposing spark.kubernetes.executor.bursty.priorityClassName as an optional config.

*Proposal to Unblock*

We both want this feature merged, and I am not asking to change your formula's default behavior. Can we agree to simply *add these two parameters as optional configurations*?

1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
2. priorityClassName (default: null -> preserves your logic exactly).

This satisfies your design goals while making the feature robust enough for my production requirements.

Regards,

Viquar Khan
Sr Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/

On Tue, 30 Dec 2025 at 01:04, Nan Zhu <[email protected]> wrote: > > However, I maintain that for a general-purpose open-source feature > (which will be used by teams without dedicated platform engineers to manage > rollouts), we need structural safety guardrails. > > I am not sure we can roll out such a feature to all jobs without a managed > rollout process, it is an anti-pattern in any engineering org. this > feature is disabled by default which is already a guard to prevent users > silently getting into what they won't expect > > > > A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing up > the design"—it is mathematically necessary to prevent the formula from > collapsing to zero in valid production scenarios. > > this formula *IS* designed to output 0 in some cases, so it is *NOT* > collapsing to zero.. I observed that , even with 0 memoryoverhead in many > jobs, with a proper bursty factor saved tons of money in a real PROD > environment instead of from my imagination..
if you don't want any > memoryOverhead to be zero in your job for your imagined reasons, you can > just calculate your threshold for on-heap/memoryOverhead ratio for rolling > out > > step back... if your team doesn't know how to manage rollout, most likely > you are rolling out this feature for individual jobs without a centralized > feature rolling out point, right? then, you can just use the simple > arithmetics to calculate whether the resulting memoryOverhead is 0, if yes, > don't use this feature, that's it.... > > > > However, Kubelet eviction is the primary mechanism for other pressure > types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios > where memory.available crosses the eviction threshold before the kernel > panics. > > How are they related to this feature? > > > On Mon, Dec 29, 2025 at 10:37 PM vaquar khan <[email protected]> > wrote: > >> Hi Nan, >> >> Thanks for the detailed reply. I appreciate you sharing the specific >> context from the Pinterest implementation—it helps clarify the operational >> model you are using. >> >> However, I maintain that for a general-purpose open-source feature (which >> will be used by teams without dedicated platform engineers to manage >> rollouts), we need structural safety guardrails. >> >> *Here is my response to your points:* >> >> 1. Re: "Zero-Guarantee" & Safety (Critical) >> >> You suggested that "setting a conservative bursty factor" resolves the >> risk of zero-guaranteed overhead. >> >> Mathematically, this is incorrect for High-Heap jobs. The formula is >> structural: G = O - \min((H+O) \times (B-1), O). >> >> Consider a standard ETL job: Heap (H) = 100GB, Overhead (O) = 5GB. >> >> Even if we set a very conservative Bursty Factor (B) of 1.06 (only 6% >> burst): >> >> - >> >> Calculation: $(100 + 5) \times (1.06 - 1) = 105 \times 0.06 = 6.3GB. >> - >> >> Since 6.3GB > 5GB, the formula sets *Guaranteed Overhead = 0GB*. >> >> Even with an extremely conservative factor, the design forces this pod to >> have zero guaranteed memory for OS/JVM threads. This is not a tuning issue; >> it is a formulaic edge case for high-memory jobs. >> >> * A minGuaranteedRatio (defaulting to 0 if you prefer) is not "messing up >> the design"—it is mathematically necessary to prevent the formula from >> collapsing to zero in valid production scenarios.* >> >> 2. Re: Kubelet Eviction vs. OOMKiller >> >> I concede that in sudden memory spikes, the Kernel OOMKiller often acts >> faster than Kubelet eviction. >> >> However, Kubelet eviction is the primary mechanism for other pressure >> types (DiskPressure, PIDPressure) and "slow leak" memory pressure scenarios >> where memory.available crosses the eviction threshold before the kernel >> panics. >> >> * Adding priorityClassName support to the Pod spec is a low-effort, >> zero-risk change that aligns with Kubernetes best practices for "Defense in >> Depth." It costs nothing to expose this config.* >> >> 3. Re: Native Support >> >> Fair point. To keep the scope tight, I am happy to drop the Native >> Support request for this SPIP. We can treat that as a separate follow-up. >> >> Path Forward >> >> I am happy to support if we can agree to: >> >> 1. >> >> Add the minGuaranteedRatio config (to handle the High-Heap math >> proven above). >> 2. >> >> Expose the priorityClassName config (standard K8S practice). 
>> >> >> Regards, >> >> Viquar Khan >> >> Sr Data Architect >> >> https://www.linkedin.com/in/vaquar-khan-b695577/ >> >> >> On Tue, 30 Dec 2025 at 00:16, Nan Zhu <[email protected]> wrote: >> >>> > Kubelet Eviction is the first line of defense before the Kernel >>> OOMKiller strikes. >>> >>> This is *NOT* true. Eviction will be the first to kill some best effort >>> pod which doesn't make any difference on memory pressure in most cases and >>> before it takes action again, Kernel OOMKiller already killed some executor >>> pods. This is exactly the reason for me to say, we don't really worry about >>> eviction here, before eviction touches those executors, OOMKiller already >>> killed them. This behavior is consistently observed and we also had >>> discussions with other companies who had to modify Kernel code to mitigate >>> this behavior. >>> >>> > Re: "Zero-Guarantee" & Safety >>> >>> you basically want to tradeoff saving with system safety , then why not >>> just setting a conservative value of bursty factor? it is exactly what we >>> did in PINS, please check my earlier response in the thread ... key part as >>> following: >>> >>> "in PINS, we basically apply a set of strategies, setting conservative >>> bursty factor, progressive rollout, monitor the cluster metrics like Linux >>> Kernel OOMKiller occurrence to guide us to the optimal setup of bursty >>> factor... in usual, K8S operators will set a reserved space for daemon >>> processes on each host, we found it is sufficient to in our case and our >>> major tuning focuses on bursty factor value " >>> >>> If you really want, you can enable this feature only for jobs when >>> OnHeap/MemoryOverhead is smaller than a certain value... >>> >>> I just didn't see the value of bringing another configuration >>> >>> >>> > Re: Native Support >>> >>> I mean....this SPIP is NOT about native execution engine's memory >>> pattern at all..... why do we bother to bring it up.... >>> >>> >>> >>> On Mon, Dec 29, 2025 at 9:42 PM vaquar khan <[email protected]> >>> wrote: >>> >>>> Hi Nan, >>>> >>>> Thanks for the prompt response and for clarifying the design intent. >>>> >>>> I understand the goal is to maximize savings—and I agree we shouldn't >>>> block the current momentum even though you can see my vote is +1—but I want >>>> to ensure we aren't over-optimizing for specific internal environments at >>>> the cost of general community stability. >>>> >>>> *Here is my rejoinder on the technical points:* >>>> >>>> 1. Re: PriorityClass & OOMKiller (Defense in Depth) >>>> >>>> You mentioned that “priorityClassName is NOT the solution... What we >>>> worry about is the Linux Kernel OOMKiller.” >>>> >>>> I agree that the Kernel OOMKiller (cgroup) primarily looks at >>>> oom_score_adj (which is determined by QoS Class). However, Kubelet Eviction >>>> is the first line of defense before the Kernel OOMKiller strikes. >>>> >>>> When a node comes under memory pressure (e.g., memory.available drops >>>> below evictionHard), the *Kubelet* actively selects pods to evict to >>>> reclaim resources. Unlike the Kernel, the Kubelet *does* explicitly >>>> use PriorityClass when ranking candidates for eviction. >>>> >>>> - >>>> >>>> *The Risk:* Since we are downgrading these pods to *Burstable* >>>> (increasing their OOM risk), we lose the "Guaranteed" protection shield. >>>> - >>>> >>>> *The Fix:* By assigning a high PriorityClass, we ensure that if the >>>> Kubelet needs to free space, it evicts lower-priority batch jobs >>>> *before* these Spark executors. 
It is a necessary "Defense in >>>> Depth" strategy for multi-tenant clusters that prevents optimized Spark >>>> jobs from being the first victims of node pressure. >>>> >>>> 2. Re: "Zero-Guarantee" & Safety >>>> >>>> You noted that “savings come from these 0 memory overhead pods.” >>>> >>>> While G=0 maximizes "on-paper" savings, it is theoretically unsafe for >>>> a JVM. A JVM physically requires non-heap memory for Thread Stacks, >>>> CodeCache, and Metaspace just to run. >>>> >>>> - >>>> >>>> *The Reality:* If G=0, then Pod Request == Heap. If a node is fully >>>> packed (Sum of Requests ≈ Node Capacity), the pod relies *100%* on >>>> the burst pool for basic thread allocation. If neighbors are noisy, that >>>> pod cannot even spawn a thread. >>>> - >>>> >>>> *The Compromise:* I strongly suggest we add the configuration >>>> spark.executor.memoryOverhead.minGuaranteedRatio but set the *default >>>> to 0.0*. >>>> - >>>> >>>> This preserves your logic/savings by default. >>>> - >>>> >>>> But it gives platform admins a "safety knob" to turn (e.g., to >>>> 0.1) when they inevitably encounter instability in high-contention >>>> environments, without needing a code patch. >>>> >>>> 3. Re: Native Support >>>> >>>> Agreed. We can treat Off-Heap support as a follow-up item. I would just >>>> request that we add a "Known Limitation" note in the SPIP stating that this >>>> optimization does not yet apply to spark.memory.offHeap.size, so users of >>>> Gluten/Velox are aware. >>>> >>>> I am happy to support the PR moving forward if we can agree to include >>>> the *PriorityClass* config support and the *Safety Floor* config (even >>>> if disabled by default) ,Please update your SIP. This ensures the feature >>>> is robust enough for the wider user base. >>>> >>>> Regards, >>>> >>>> Viquar Khan >>>> >>>> Sr Data Architect >>>> >>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>> >>>> On Mon, 29 Dec 2025 at 22:52, Nan Zhu <[email protected]> wrote: >>>> >>>>> Hi, Vaquar >>>>> >>>>> thanks for the replies, >>>>> >>>>> 1. for Guaranteed QoS >>>>> >>>>> I may missed some words in the original doc, the idea I would like to >>>>> convey is that we essentially need to give up this idea due to what cannot >>>>> achieve Guaranteed QoS as we already have different values of memory and >>>>> even we only consider CPU request/limit values, it brings other risks to >>>>> us >>>>> >>>>> Additionally , priorityClassName is NOT the solution here. What we >>>>> really worry about is NOT eviction, but Linux Kernel OOMKiller where we >>>>> cannot pass pod priority information into. With burstable pods, the >>>>> only thing Linux Kernel OOMKiller considers is the memory request size >>>>> which not necessary maps to priority information >>>>> >>>>> 2. The "Zero-Guarantee" Edge Case >>>>> >>>>> Actually, a lot of savings are from these 0 memory overhead pods... I >>>>> am curious if you have adopted the PoC PR in prod as you have identified >>>>> it >>>>> is "unsafe"? >>>>> >>>>> Something like a minGuaranteedRatio is not a good idea , it will mess >>>>> up the original design idea of the formula (check Appendix C), the >>>>> simplest >>>>> thing you can do is to avoid rolling out features to the jobs which you >>>>> feel will be unsafe... >>>>> >>>>> 3. Native Execution Gap (Off-Heap) >>>>> >>>>> I am not sure the off-heap memory usage of gluten/comet shows the >>>>> same/similar pattern as memoryOverhead. 
No one has validated that in >>>>> production environment, but both PINS/Bytedance has validated >>>>> memoryOverhead part thoroughly in their clusters >>>>> >>>>> Additionally, the key design of the proposal is to capture the >>>>> relationship between on-heap and memoryOverhead sizes, in another word, >>>>> they co-exist.... offheap memory used by native engines are different >>>>> stories where , ideally, the on-heap usage should be minimum and most of >>>>> memory usage should come from off-heap part...so the formula here may not >>>>> work out of box >>>>> >>>>> my suggestion is, since the community has approved the original design >>>>> which have been tested by at least 2 companies in production environments, >>>>> we go with the current design and continue code review , in future, we can >>>>> add what have been found/tested in production as followups >>>>> >>>>> Thanks >>>>> >>>>> Nan >>>>> >>>>> >>>>> On Mon, Dec 29, 2025 at 8:03 PM vaquar khan <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Yao, Nan, and Chao, >>>>>> >>>>>> Thank you for this proposal though I already approved . The >>>>>> cost-efficiency goals are very compelling, and the cited $6M annual >>>>>> savings >>>>>> at Pinterest clearly demonstrates the value of moving away from >>>>>> rigid peak provisioning. >>>>>> >>>>>> However, after modeling the proposed design against standard >>>>>> Kubernetes behavior and modern Spark workloads, I have identified *three >>>>>> critical stability risks* that need to be addressed before this is >>>>>> finalized. >>>>>> >>>>>> I have drafted a *Supplementary Design Amendment* (linked >>>>>> below/attached) that proposes fixes for these issues, but here is the >>>>>> summary: >>>>>> 1. The "Guaranteed QoS" Contradiction >>>>>> >>>>>> The SPIP lists "Use Guaranteed QoS class" as Mitigation #1 for >>>>>> stability risks2. >>>>>> >>>>>> The Issue: Technically, this mitigation is impossible under your >>>>>> proposal. >>>>>> >>>>>> - >>>>>> >>>>>> In Kubernetes, a Pod is assigned the *Guaranteed* QoS class *only* >>>>>> if Request == Limit for both CPU and Memory. >>>>>> - >>>>>> >>>>>> Your proposal explicitly sets Memory Request < Memory Limit >>>>>> (specifically $H+G < H+O$)3. >>>>>> >>>>>> - >>>>>> >>>>>> *Consequence:* This configuration *automatically downgrades* the >>>>>> Pod to the *Burstable* QoS class. In a multi-tenant cluster, the >>>>>> Kubelet eviction manager will kill these "Burstable" Spark pods >>>>>> *before* any Guaranteed system pods during node pressure. >>>>>> - >>>>>> >>>>>> *Proposed Fix:* We cannot rely on Guaranteed QoS. We must >>>>>> introduce a priorityClassName configuration to offset this >>>>>> eviction risk. >>>>>> >>>>>> 2. The "Zero-Guarantee" Edge Case >>>>>> >>>>>> The formula $G = O - \min\{(H+O) \times (B-1), O\}$ 4 has a >>>>>> dangerous edge case for High-Heap/Low-Overhead jobs (common in ETL). >>>>>> >>>>>> >>>>>> - >>>>>> >>>>>> *Scenario:* If a job has a large Heap ($H$) relative to Overhead ( >>>>>> $O$), the calculated burst deduction often exceeds the total >>>>>> Overhead. >>>>>> - >>>>>> >>>>>> *Result:* The formula yields *$G = 0$*. >>>>>> - >>>>>> >>>>>> *Risk:* Allocating 0MB of guaranteed overhead is unsafe. >>>>>> Essential JVM operations (thread stacks, Netty control buffers) >>>>>> require a >>>>>> non-zero baseline. Relying 100% on a shared burst pool for basic >>>>>> functionality will lead to immediate container failures if the node is >>>>>> contended. 
>>>>>> - >>>>>> >>>>>> *Proposed Fix:* Implement a safety floor using a >>>>>> minGuaranteedRatio (e.g., max(Calculated_G, O * 0.1)). >>>>>> >>>>>> 3. Native Execution Gap (Off-Heap) >>>>>> >>>>>> The proposal focuses entirely on memoryOverhead5. >>>>>> >>>>>> The Issue: Modern native engines (Gluten, Velox, Photon) shift >>>>>> execution memory to spark.memory.offHeap.size. This memory is equally >>>>>> "bursty" but is excluded from your optimization. >>>>>> >>>>>> *Proposed Fix: *The burst-aware logic should be extensible to >>>>>> include Off-Heap memory if enabled. >>>>>> >>>>>> >>>>>> https://docs.google.com/document/d/1l7KFkHcVBi1kr-9T4Rp7d52pTJT2TxuDMOlOsibD4wk/edit?usp=sharing >>>>>> >>>>>> >>>>>> I believe these changes are necessary to make the feature robust >>>>>> enough for general community adoption beyond specific controlled >>>>>> environments. >>>>>> >>>>>> >>>>>> Regards, >>>>>> >>>>>> Viquar Khan >>>>>> >>>>>> Sr Data Architect >>>>>> >>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>>>> >>>>>> >>>>>> >>>>>> On Wed, 17 Dec 2025 at 06:34, Qiegang Long <[email protected]> wrote: >>>>>> >>>>>>> +1 >>>>>>> >>>>>>> On Wed, Dec 17, 2025, 2:48 AM Wenchen Fan <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> +1 >>>>>>>> >>>>>>>> On Wed, Dec 17, 2025 at 6:41 AM karuppayya < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> +1 from me. >>>>>>>>> I think it's well-scoped and takes advantage of Kubernetes' >>>>>>>>> features exactly for what they are designed for(as per my >>>>>>>>> understanding). >>>>>>>>> >>>>>>>>> On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Yao and Nan for the proposal, and thanks everyone for the >>>>>>>>>> detailed and thoughtful discussion. >>>>>>>>>> >>>>>>>>>> Overall, this looks like a valuable addition for organizations >>>>>>>>>> running Spark on Kubernetes, especially given how bursty >>>>>>>>>> memoryOverhead usage tends to be in practice. I appreciate that >>>>>>>>>> the change is relatively small in scope and fully opt-in, which >>>>>>>>>> helps keep >>>>>>>>>> the risk low. >>>>>>>>>> >>>>>>>>>> From my perspective, the questions raised on the thread and in >>>>>>>>>> the SPIP have been addressed. If others feel the same, do we have >>>>>>>>>> consensus >>>>>>>>>> to move forward with a vote? cc Wenchen, Qieqiang, and Karuppayya. 
>>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Chao >>>>>>>>>> >>>>>>>>>> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> this is a good question >>>>>>>>>>> >>>>>>>>>>> > a stage is bursty and consumes the shared portion and fails to >>>>>>>>>>> release it for subsequent stages >>>>>>>>>>> >>>>>>>>>>> in the scenario you described, since the memory-leaking stage >>>>>>>>>>> and the subsequence ones are from the same job , the pod will >>>>>>>>>>> likely be >>>>>>>>>>> killed by cgroup oomkiller >>>>>>>>>>> >>>>>>>>>>> taking the following as the example >>>>>>>>>>> >>>>>>>>>>> the usage pattern is G = 5GB S = 2GB, it uses G + S at max and >>>>>>>>>>> in theory, it should release all 7G and then claim 7G again in some >>>>>>>>>>> later >>>>>>>>>>> stages, however, due to the memory peak, it holds 2G forever and >>>>>>>>>>> ask for >>>>>>>>>>> another 7G, as a result, it hits the pod memory limit and cgroup >>>>>>>>>>> oomkiller will take action to terminate the pod >>>>>>>>>>> >>>>>>>>>>> so this should be safe to the system >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> however, we should be careful about the memory peak for sure, >>>>>>>>>>> because it essentially breaks the assumption that the usage of >>>>>>>>>>> memoryOverhead is bursty (memory peak ~= use memory forever)... >>>>>>>>>>> unfortunately, shared/guaranteed memory is managed by user >>>>>>>>>>> applications >>>>>>>>>>> instead of on cluster level , they, especially S, are just logical >>>>>>>>>>> concepts instead of a physical memory pool which pods can >>>>>>>>>>> explicitly claim >>>>>>>>>>> memory from... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the interesting proposal. >>>>>>>>>>>> The design seems to rely on memoryOverhead being transient. >>>>>>>>>>>> What happens when a stage is bursty and consumes the shared >>>>>>>>>>>> portion and fails to release it for subsequent stages (e.g., >>>>>>>>>>>> off-heap >>>>>>>>>>>> buffers and its not garbage collected since its off-heap)? Would >>>>>>>>>>>> this >>>>>>>>>>>> trigger the host-level OOM like described in Q6? or are there >>>>>>>>>>>> strategies to >>>>>>>>>>>> release the shared portion? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> yes, that's the worst case in the scenario, please check my >>>>>>>>>>>>> earlier response to Qiegang's question, we have a set of >>>>>>>>>>>>> strategies adopted >>>>>>>>>>>>> in prod to mitigate the issue >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for the explanation! So the executor is not guaranteed >>>>>>>>>>>>>> to get 50 GB physical memory, right? All pods on the same host >>>>>>>>>>>>>> may reach >>>>>>>>>>>>>> peak memory usage at the same time and cause paging/swapping >>>>>>>>>>>>>> which hurts >>>>>>>>>>>>>> performance? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> np, let me try to explain >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. Each executor container will be run in a pod together >>>>>>>>>>>>>>> with some other sidecar containers taking care of tasks like >>>>>>>>>>>>>>> authentication, etc. 
, for simplicity, we assume each pod has >>>>>>>>>>>>>>> only one >>>>>>>>>>>>>>> container which is the executor container >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2. Each container is assigned with two values, r >>>>>>>>>>>>>>> *equest&limit** (limit >= request),* for both of CPU/memory >>>>>>>>>>>>>>> resources (we only discuss memory here). Each pod will have >>>>>>>>>>>>>>> request/limit >>>>>>>>>>>>>>> values as the sum of all containers belonging to this pod >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on >>>>>>>>>>>>>>> *request* value, and cap the resource usage of each >>>>>>>>>>>>>>> container based on their *limit* value, e.g. if I have a >>>>>>>>>>>>>>> pod with a single container in it , and it has 1G/2G as request >>>>>>>>>>>>>>> and limit >>>>>>>>>>>>>>> value respectively, any machine with 1G free RAM space will be >>>>>>>>>>>>>>> a candidate >>>>>>>>>>>>>>> to host this pod, and when the container use more than 2G >>>>>>>>>>>>>>> memory, it will >>>>>>>>>>>>>>> be killed by cgroup oomkiller. Once a pod is scheduled to a >>>>>>>>>>>>>>> host, the >>>>>>>>>>>>>>> memory space sized at "sum of all its containers' request >>>>>>>>>>>>>>> values" will be >>>>>>>>>>>>>>> booked exclusively for this pod. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 4. By default, Spark *sets request/limit as the same value >>>>>>>>>>>>>>> for executors in k8s*, and this value is basically >>>>>>>>>>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most >>>>>>>>>>>>>>> cases . >>>>>>>>>>>>>>> However, spark.executor.memoryOverhead usage is very bursty, >>>>>>>>>>>>>>> the user >>>>>>>>>>>>>>> setting spark.executor.memoryOverhead as 10G usually means >>>>>>>>>>>>>>> each executor >>>>>>>>>>>>>>> only needs 10G in a very small portion of the executor's whole >>>>>>>>>>>>>>> lifecycle >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 5. The proposed SPIP is essentially to decouple >>>>>>>>>>>>>>> request/limit value in spark@k8s for executors in a safe >>>>>>>>>>>>>>> way (this idea is from the bytedance paper we refer to in SPIP >>>>>>>>>>>>>>> paper). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Using the aforementioned example , >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> if we have a single node cluster with 100G RAM space, we >>>>>>>>>>>>>>> have two pods requesting 40G + 10G (on-heap + memoryOverhead) >>>>>>>>>>>>>>> and we set >>>>>>>>>>>>>>> bursty factor to 1.2, without the mechanism proposed in this >>>>>>>>>>>>>>> SPIP, we can >>>>>>>>>>>>>>> at most host 2 pods with this machine, and because of the >>>>>>>>>>>>>>> bursty usage of >>>>>>>>>>>>>>> that 10G space, the memory utilization would be compromised. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When applying the burst-aware memory allocation, we only >>>>>>>>>>>>>>> need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, >>>>>>>>>>>>>>> i.e. we >>>>>>>>>>>>>>> have 20G free memory space left in the machine which can be >>>>>>>>>>>>>>> used to host >>>>>>>>>>>>>>> some smaller pods. At the same time, as we didn't change the >>>>>>>>>>>>>>> limit value of >>>>>>>>>>>>>>> the executor pods, these executors can still use 50G at max. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sorry I'm not very familiar with the k8s infra, how does >>>>>>>>>>>>>>>> it work under the hood? 
The container will adjust its system >>>>>>>>>>>>>>>> memory size >>>>>>>>>>>>>>>> depending on the actual memory usage of the processes in this >>>>>>>>>>>>>>>> container? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> yeah, we have a few cases that we have significantly >>>>>>>>>>>>>>>>> larger O than H, the proposed algorithm is actually a great >>>>>>>>>>>>>>>>> fit >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> as I explained in SPIP doc Appendix C, the proposed >>>>>>>>>>>>>>>>> algorithm will allocate a non-trivial G to ensure the safety >>>>>>>>>>>>>>>>> of running but >>>>>>>>>>>>>>>>> still cut a big chunk of memory (10s of GBs) and treat them >>>>>>>>>>>>>>>>> as S , saving >>>>>>>>>>>>>>>>> tons of money burnt by them >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> but regarding native accelerators, some native >>>>>>>>>>>>>>>>> acceleration engines do not use memoryOverhead but use >>>>>>>>>>>>>>>>> off-heap >>>>>>>>>>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The >>>>>>>>>>>>>>>>> current >>>>>>>>>>>>>>>>> implementation does not cover this part , while that will be >>>>>>>>>>>>>>>>> an easy >>>>>>>>>>>>>>>>> extension >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks for the reply. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Have you tested in environments where O is bigger than H? >>>>>>>>>>>>>>>>>> Wondering if the proposed algorithm would help more in those >>>>>>>>>>>>>>>>>> environments >>>>>>>>>>>>>>>>>> (eg. with >>>>>>>>>>>>>>>>>> native accelerators)? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> please check the following answer >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> yes, your understanding is correct >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > How is the risk of host-level OOM mitigated when the >>>>>>>>>>>>>>>>>>> total potential usage sum of H+G+S across all pods on a >>>>>>>>>>>>>>>>>>> node exceeds its >>>>>>>>>>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on >>>>>>>>>>>>>>>>>>> the cluster >>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory buffer >>>>>>>>>>>>>>>>>>> exists on the node >>>>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> in PINS, we basically apply a set of strategies, setting >>>>>>>>>>>>>>>>>>> conservative bursty factor, progressive rollout, monitor >>>>>>>>>>>>>>>>>>> the cluster >>>>>>>>>>>>>>>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us >>>>>>>>>>>>>>>>>>> to the optimal >>>>>>>>>>>>>>>>>>> setup of bursty factor... 
in usual, K8S operators will set >>>>>>>>>>>>>>>>>>> a reserved space >>>>>>>>>>>>>>>>>>> for daemon processes on each host, we found it is >>>>>>>>>>>>>>>>>>> sufficient to in our case >>>>>>>>>>>>>>>>>>> and our major tuning focuses on bursty factor value >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> > Have you considered scheduling optimizations to ensure >>>>>>>>>>>>>>>>>>> a strategic mix of executors with large S and small S >>>>>>>>>>>>>>>>>>> values on a single >>>>>>>>>>>>>>>>>>> node? I am wondering if this would reduce the probability >>>>>>>>>>>>>>>>>>> of concurrent >>>>>>>>>>>>>>>>>>> bursting and host-level OOM. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Yes, when we work on this project, we put some attention >>>>>>>>>>>>>>>>>>> on the cluster scheduling policy/behavior... two things we >>>>>>>>>>>>>>>>>>> mostly care >>>>>>>>>>>>>>>>>>> about >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have >>>>>>>>>>>>>>>>>>> certain level of diversity of workloads so that we have >>>>>>>>>>>>>>>>>>> enough candidates >>>>>>>>>>>>>>>>>>> to form a mixed set of executors with large S and >>>>>>>>>>>>>>>>>>> small S values >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which >>>>>>>>>>>>>>>>>>> tends to pack more pods from the same job to the same host, >>>>>>>>>>>>>>>>>>> which can >>>>>>>>>>>>>>>>>>> create troubles as they are more likely to ask for max >>>>>>>>>>>>>>>>>>> memory at the same >>>>>>>>>>>>>>>>>>> time >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for sharing this interesting proposal. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> My initial understanding is that Kubernetes will use >>>>>>>>>>>>>>>>>>>> the Executor Memory Request (H + G) for scheduling >>>>>>>>>>>>>>>>>>>> decisions, which allows for better resource packing. I >>>>>>>>>>>>>>>>>>>> have a few questions regarding the shared portion S: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when >>>>>>>>>>>>>>>>>>>> the total potential usage sum of H+G+S across all pods >>>>>>>>>>>>>>>>>>>> on a node exceeds >>>>>>>>>>>>>>>>>>>> its allocatable capacity? Does the proposal implicitly >>>>>>>>>>>>>>>>>>>> rely on the cluster >>>>>>>>>>>>>>>>>>>> operator to manually ensure an unrequested memory >>>>>>>>>>>>>>>>>>>> buffer exists on the node >>>>>>>>>>>>>>>>>>>> to serve as the shared pool? >>>>>>>>>>>>>>>>>>>> 2. Have you considered scheduling optimizations to >>>>>>>>>>>>>>>>>>>> ensure a strategic mix of executors with large S and >>>>>>>>>>>>>>>>>>>> small S values on a single node? I am wondering if >>>>>>>>>>>>>>>>>>>> this would reduce the probability of concurrent >>>>>>>>>>>>>>>>>>>> bursting and host-level OOM. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I think I'm still missing something in the big picture: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> - Is the memory overhead off-heap? The formular >>>>>>>>>>>>>>>>>>>>> indicates a fixed heap size, and memory overhead can't >>>>>>>>>>>>>>>>>>>>> be dynamic if it's >>>>>>>>>>>>>>>>>>>>> on-heap. >>>>>>>>>>>>>>>>>>>>> - Do Spark applications have static profiles? 
When >>>>>>>>>>>>>>>>>>>>> we submit stages, the cluster is already allocated, >>>>>>>>>>>>>>>>>>>>> how can we change >>>>>>>>>>>>>>>>>>>>> anything? >>>>>>>>>>>>>>>>>>>>> - How do we assign the shared memory overhead? >>>>>>>>>>>>>>>>>>>>> Fairly among all applications on the same physical >>>>>>>>>>>>>>>>>>>>> node? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> we didn't separate the design into another doc >>>>>>>>>>>>>>>>>>>>>> since the main idea is relatively simple... >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 >>>>>>>>>>>>>>>>>>>>>> of the SPIP doc >>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> it is calculated based on per profile (you can say it >>>>>>>>>>>>>>>>>>>>>> is based on per stage), when the cluster manager compose >>>>>>>>>>>>>>>>>>>>>> the pod spec, it >>>>>>>>>>>>>>>>>>>>>> calculates the new memory overhead based on what user >>>>>>>>>>>>>>>>>>>>>> asks for in that >>>>>>>>>>>>>>>>>>>>>> resource profile >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Do we have a design sketch? How to determine the >>>>>>>>>>>>>>>>>>>>>>> memory request and limit? Is it per stage or per >>>>>>>>>>>>>>>>>>>>>>> executor? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> yeah, the implementation is basically relying on >>>>>>>>>>>>>>>>>>>>>>>> the request/limit concept in K8S, ... >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> but if there is any other cluster manager coming in >>>>>>>>>>>>>>>>>>>>>>>> future, as long as it has a similar concept , it can >>>>>>>>>>>>>>>>>>>>>>>> leverage this easily >>>>>>>>>>>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> This feature is only available on k8s because it >>>>>>>>>>>>>>>>>>>>>>>>> allows containers to have dynamic resources? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao < >>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Folks, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead >>>>>>>>>>>>>>>>>>>>>>>>>> allocation algorithm for Spark@K8S to improve >>>>>>>>>>>>>>>>>>>>>>>>>> memory utilization of spark clusters. >>>>>>>>>>>>>>>>>>>>>>>>>> Please see more details in SPIP doc >>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. >>>>>>>>>>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature. 
>>>>>>>>>>>>>>>>>>>>>>>>>> Also want to thank the authors of the original >>>>>>>>>>>>>>>>>>>>>>>>>> paper >>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from >>>>>>>>>>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected]) >>>>>>>>>>>>>>>>>>>>>>>>>> and Yixin([email protected]). >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thank you. >>>>>>>>>>>>>>>>>>>>>>>>>> Yao Wang >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>> >>>>>> -- >>>>>> Regards, >>>>>> Vaquar Khan >>>>>> >>>>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> >>>> >> >> -- >> Regards, >> Vaquar Khan >> >> -- Regards, Vaquar Khan
