+1 (non-binding). I have read through all the messages and agree with Nan that the risk Vaquar describes is quite rare and can mostly be avoided by setting a conservative value for the bursty factor (at least for the K8s environment my service runs on).
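
For anyone who wants to sanity-check that before enabling the feature, below is a rough sketch of the "simple arithmetic" Nan suggests further down the thread. The formula and the burstyFactor name are placeholders I made up for illustration, not the SPIP's actual definition, so substitute whatever the final design specifies.

object GuaranteedOverheadCheck {
  // Placeholder formula, NOT the SPIP's actual definition: assume the guaranteed
  // slice of memoryOverhead shrinks as the (hypothetical) bursty factor grows.
  def guaranteedOverheadMiB(overheadMiB: Long, burstyFactor: Double): Long =
    math.max(0L, (overheadMiB * (1.0 - burstyFactor)).toLong)

  def main(args: Array[String]): Unit = {
    val overheadMiB  = 1024L   // spark.executor.memoryOverhead for the job
    val conservative = 0.25    // the kind of bursty factor we actually run with
    val aggressive   = 1.0     // the "zero-guarantee" case debated below
    println(s"conservative factor -> ${guaranteedOverheadMiB(overheadMiB, conservative)} MiB guaranteed")
    println(s"aggressive factor   -> ${guaranteedOverheadMiB(overheadMiB, aggressive)} MiB guaranteed")
    // If this prints 0 MiB, follow Nan's advice below: don't enable the feature for that job.
  }
}
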
On Thu, Jan 1, 2026 at 10:50 AM vaquar khan <[email protected]> wrote:

> I am changing my vote from +1 to 0 (non-binding).
> *Reason*: I support the feature goal, but I cannot vote +1 because the safety documentation regarding Kubelet Bug #131169 (https://github.com/kubernetes/kubernetes/issues/131169) was not included. I stand by the technical risks highlighted in the discussion thread.
>
> Regards,
> Vaquar Khan
>
> On Tue, 30 Dec 2025 at 15:26, vaquar khan <[email protected]> wrote:
>
>> Hi Sean,
>>
>> I use drafting tools to help me communicate clearly, as English is not my first language.
>>
>> Since I have been part of this group since 2014, I value this community highly. It doesn't feel good to be accused of wasting time after all these years. I'll make an effort to keep things shorter moving forward, as I am also working on the same issues.
>>
>> Have you seen any issues in my proposals? Keeping the discussion technical would help the Spark community.
>>
>> Regards,
>> Vaquar Khan
>>
>> On Tue, Dec 30, 2025, 2:50 PM Sean Owen <[email protected]> wrote:
>>
>>> vaquar, I think I need to ask: how much of your messages is written by an AI? They have many of the stylistic characteristics of AI output. This is not by itself wrong.
>>> While well formed, the replies are verbose and repetitive, and seem to be talking past the responses you receive.
>>> There are thousands of subscribers here and I want to make sure we are spending everyone's time in good faith.
>>>
>>> On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Nan, Yao, and Chao,
>>>>
>>>> I have done a deep dive into the underlying Linux kernel and Kubernetes behaviors to validate our respective positions. While I fully support the economic goal of reclaiming the estimated 30-50% of stranded memory in static clusters, the technical evidence suggests that the "Zero-Guarantee" configuration is not just an optimization choice: it is architecturally unsafe for standard Kubernetes environments because of how OOM scores are calculated.
>>>>
>>>> I am sharing these findings to explain why I have insisted on the *Safety Floor (minGuaranteedRatio)* as a necessary guardrail.
>>>>
>>>> *1. The "Death Trap" of OOM Scores (The Math)*
>>>> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's environment. However, in a standard environment, the math works against us. For Burstable pods, the Kubelet sets oom_score_adj inversely to the Request size: 1000 - (1000 * Request / Capacity).
>>>>
>>>> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the Request), we mathematically inflate the OOM score. On a standard node, a Zero-Guarantee pod ends up with a significantly higher OOM score (more likely to be killed) than a standard pod.
>>>> - *The Consequence:* Under node memory pressure, the kernel will target these "optimized" Spark pods for termination *before* their neighbors, regardless of our intent.
>>>>
>>>> *2. The "Smoking Gun": Kubelet Bug #131169*
>>>> There is a known defect in the Kubelet (Issue #131169) where *PriorityClass is ignored when calculating OOM scores for Burstable pods*.
>>>>
>>>> - This invalidates the assumption that we can simply "manage" the risk with priorities later.
>>>> - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is effectively identical to a "Best Effort" pod in the eyes of the OOM killer.
>>>> - *Conclusion:* We ideally *should* enforce a minimum memory floor to keep the Request value high enough to secure a survivable OOM score.
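
To put numbers on the formula quoted in point 1, here is a small sketch of that Burstable heuristic. The clamping range reflects my reading of the kubelet's QoS policy, and the pod/node sizes are invented, so treat it as an illustration rather than a statement about any particular cluster:

object OomScoreSketch {
  // Sketch of the Burstable heuristic quoted above:
  //   oom_score_adj ~ 1000 - (1000 * memoryRequest / nodeCapacity)
  // The kubelet also clamps the result to roughly [3, 999]; Guaranteed pods get -997
  // and BestEffort pods get 1000. Higher means "kill me first".
  def burstableOomScoreAdj(requestBytes: Long, capacityBytes: Long): Int = {
    val raw = 1000L - (1000L * requestBytes) / capacityBytes
    math.min(999L, math.max(3L, raw)).toInt
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024 * 1024
    val nodeCapacity = 32 * gib
    val heapOnly     = 28 * gib           // "zero-guarantee": request == heap
    val heapPlusOvh  = 28 * gib + 3 * gib // same heap plus ~10% guaranteed overhead
    println(s"zero-guarantee pod     : ${burstableOomScoreAdj(heapOnly, nodeCapacity)}")
    println(s"pod with overhead kept : ${burstableOomScoreAdj(heapPlusOvh, nodeCapacity)}")
    // The smaller request yields the larger score, i.e. it is the first OOM-kill candidate.
  }
}
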
>>>> *3. Silent Failures (Thread Exhaustion)*
>>>> The research confirms that "Zero-Guarantee" creates a vector for java.lang.OutOfMemoryError: unable to create new native thread.
>>>>
>>>> - If a pod lands on a node with just enough RAM for the Heap (Request) but zero extra for the OS, the pthread_create call will fail immediately.
>>>> - This results in "silent" application crashes that do not trigger standard K8s OOM alerts, leading to hard-to-debug support scenarios for general users.
>>>>
>>>> *Final Proposal & Documentation Compromise*
>>>>
>>>> My strong preference is to add the *Safety Floor (minGuaranteedRatio)* configuration to the code.
>>>>
>>>> However, if after reviewing this evidence you are *adamant* that no new configurations should be added to the code, I am willing to *unblock the vote* on one strict condition:
>>>>
>>>> *The SPIP and documentation must explicitly flag this risk.* We cannot simply leave this as an implementation detail. The documentation must contain a "Critical Warning" block stating:
>>>>
>>>> *"Warning: High-Heap/Low-Overhead configurations may result in 0 MB guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may bypass PriorityClass protections and lead to silent 'Native Thread' exhaustion failures on contended nodes. Users are responsible for validating stability."*
>>>>
>>>> If you agree to either the code change (preferred) or this specific documentation warning, please update the SPIP doc, and I am happy to support.
>>>>
>>>> Regards,
>>>>
>>>> Vaquar Khan
>>>> Sr Data Architect
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>>>>
>>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>> When I said "imagined reasons", I meant that I have not seen the issue you describe in a prod environment running millions of jobs every month, and I have also explained why it won't happen at PINS or in other normal cases: in a K8s cluster there is reserved space for system daemons on each host, so even with many 0-memoryOverhead jobs the hosts won't be "fully packed" as you imagine, since these jobs don't need much memory overhead space anyway.
>>>>>
>>>>> Let me repeat my earlier suggestion: if you don't want any job to have 0 memoryOverhead, you can calculate how much memoryOverhead is guaranteed with simple arithmetic, and if it is 0, do not use this feature.
>>>>>
>>>>> In general, I don't suggest you use this feature if you cannot manage the rollout process, just as no one should apply something like auto-tuning to all of their jobs without a dedicated Spark platform team.
>>>>>
>>>>> 2. Kubelet Eviction Relevance
>>>>>
>>>>> 2.a My question is: how is PID/disk pressure related to the memory-related feature we are discussing here?
>>>>> Please don't fan out the discussion scope without limit.
>>>>>
>>>>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far from a reasonable design; the priority class name should be controlled at the cluster level and specified via something like the Spark operator, or via the pod spec if you can provide one, instead of being embedded in a memory-related feature.
>>>>>
>>>>> 3. Can we agree to simply *add these two parameters as optional configurations*?
>>>>>
>>>>> Unfortunately, no...
>>>>>
>>>>> Some of the problems you raised will probably only happen in very, very extreme cases, and I have provided solutions to them without the need for additional configs... Other problems you raised are not related to what this SPIP is about, e.g. PID exhaustion, and some of your proposed design doesn't make sense to me, e.g. specifying the executor's priority class via such a memory-related feature...
>>>>>
>>>>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi Nan,
>>>>>>
>>>>>> Thanks for the candid response. I see where you are coming from regarding managed rollouts, but I think we are viewing this through two different lenses: "Internal Platform" vs. "General Open Source Product."
>>>>>>
>>>>>> Here is why I am pushing for these two specific configuration hooks:
>>>>>>
>>>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>>>
>>>>>> You mentioned that you have observed jobs running fine with zero memoryOverhead.
>>>>>>
>>>>>> While that may be true for specific workloads in your environment, the requirement for non-heap memory is not "imagined"; it is inherent to the JVM. Thread stacks, the CodeCache, and Netty DirectByteBuffer control structures must live in non-heap memory.
>>>>>>
>>>>>> - *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully bin-packed (sum of Requests = node capacity), your executor is mathematically guaranteed *zero bytes* of non-heap memory unless it can steal from the burst pool.
>>>>>> - *The Risk:* If the burst pool is temporarily exhausted by neighbors, a simple thread creation will throw OutOfMemoryError: unable to create new native thread.
>>>>>> - *The Fix:* I am not asking to change your default behavior. I am asking to *expose the config* (minGuaranteedRatio). If you set it to 0.0 (the default), your behavior is unchanged. But those of us running high-concurrency environments who need a 5-10% safety buffer for thread stacks need the *capability* to configure it without maintaining a fork or writing complex pre-submission wrappers.
>>>>>>
>>>>>> 2. Re: Kubelet Eviction Relevance
>>>>>>
>>>>>> You asked how disk/PID pressure is related.
>>>>>>
>>>>>> In Kubernetes, PriorityClass is the universal signal for pod importance during any node-pressure event (not just memory).
>>>>>>
>>>>>> - If a node runs out of ephemeral storage (common with Spark shuffle), the Kubelet evicts pods.
>>>>>> - Without a priorityClassName config, these Spark pods (which are now QoS-downgraded to Burstable) will be evicted *before* Best-Effort jobs that might have a higher priority class.
>>>>>> - Again, this is a standard Kubernetes spec feature. There is no downside to exposing spark.kubernetes.executor.bursty.priorityClassName as an optional config.
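
My note on the priorityClassName point: as far as I can tell, a priority class can already be attached to executors today through Spark's existing pod template support, which is roughly the pod-spec route Nan mentions above. A minimal sketch; the template path and class name are examples, and it is worth verifying that your Spark version carries the field through:

import org.apache.spark.SparkConf

object PriorityClassViaPodTemplate {
  def main(args: Array[String]): Unit = {
    // spark.kubernetes.executor.podTemplateFile is an existing Spark-on-K8s setting;
    // the path and the PriorityClass name below are only examples.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.podTemplateFile",
           "/opt/spark/conf/executor-template.yaml")
    // executor-template.yaml would contain, among other fields:
    //   apiVersion: v1
    //   kind: Pod
    //   spec:
    //     priorityClassName: spark-batch   # a PriorityClass defined by the cluster admin
    println(conf.get("spark.kubernetes.executor.podTemplateFile"))
  }
}
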
>>>>>>
>>>>>> *Proposal to Unblock*
>>>>>>
>>>>>> We both want this feature merged. I am not asking to change your formula's default behavior.
>>>>>>
>>>>>> Can we agree to simply *add these two parameters as optional configurations*?
>>>>>>
>>>>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>>>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>>>>
>>>>>> This satisfies your design goals while making the feature robust enough for my production requirements.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Vaquar Khan
>>>>>> Sr Data Architect
>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> --
> Regards,
> Vaquar Khan
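
Last, a footnote from me on the "unable to create new native thread" failure mode discussed above, since it is easy to reason about with back-of-the-envelope numbers. The thread count, stack size, and cache sizes below are invented for illustration; real values depend on the workload and JVM flags:

object NonHeapHeadroomSketch {
  def main(args: Array[String]): Unit = {
    // Rough estimate of non-heap memory an executor needs even when the heap is fully
    // provisioned: thread stacks + JIT code cache + direct (Netty) buffers.
    val threads      = 300L  // task, shuffle, Netty and GC threads on a busy executor (made up)
    val stackSizeMiB = 1L    // default -Xss on 64-bit Linux is typically about 1 MiB
    val codeCacheMiB = 240L  // in the ballpark of the default -XX:ReservedCodeCacheSize
    val directBufMiB = 256L  // direct buffers; highly workload dependent (made up)
    val headroomMiB  = threads * stackSizeMiB + codeCacheMiB + directBufMiB
    println(s"non-heap headroom needed: ~$headroomMiB MiB")
    // If the pod request equals the heap and the node is fully bin-packed, this memory can
    // only come from the shared burst pool; when that pool is exhausted, pthread_create
    // fails and the JVM throws "java.lang.OutOfMemoryError: unable to create new native thread".
  }
}
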
