I am changing my vote +1 to 0 (Non-binding) *Reason*: I support the feature goal, but I cannot vote +1 because the safety documentation regarding Kubelet Bug #131169( https://github.com/kubernetes/kubernetes/issues/131169) was not included.I stand by the technical risks highlighted in the discussion thread.
Regards, Viquar khan On Tue, 30 Dec 2025 at 15:26, vaquar khan <[email protected]> wrote: > HI Sean , > > I use drafting tools to help me communicate clearly, as English is not my > first language . > > Since I have been part of this group since 2014, I value this community > highly. It doesn't feel good to be accused of wasting time after all these > years. I’ll make an effort to keep things shorter moving forward as i am > also working on same issues. > > Have you seen any issue in my proposals and if keep discussion techical > would help spark community. > > > Regards, > Vaquar khan > > > On Tue, Dec 30, 2025, 2:50 PM Sean Owen <[email protected]> wrote: > >> vaquar I think I need to ask, how much of your messages are written by an >> AI? It has many stylistic characteristics of this output. This is not by >> itself wrong. >> While well-formed, the replies are verbose and repetitive, and seem to be >> talking past the responses you receive. >> There are 1000s of subscribers here and I want to make sure we are >> spending everyone's time in good faith. >> >> On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> >> wrote: >> >>> Hi Nan, Yao, and Chao, >>> >>> I have done a deep dive into the underlying Linux and Kubernetes kernel >>> behaviors to validate our respective positions. While I fully support >>> the economic goal of reclaiming the estimated 30-50% of stranded memory in >>> static clusters, the technical evidence suggests that the >>> "Zero-Guarantee" configuration is not just an optimization choice—it is >>> architecturally unsafe for standard Kubernetes environments due to how the >>> kernel calculates OOM scores. >>> >>> I am sharing these findings to explain why I have insisted on the *Safety >>> Floor (minGuaranteedRatio)* as a necessary guardrail. >>> >>> *1. The "Death Trap" of OOM Scores (The Math)* Nan mentioned that >>> "Zero-Guarantee" pods work fine in Pinterest's environment. However, in a >>> standard environment, the math works against us. The Linux kernel >>> calculates oom_score_adj inversely to the Request size: 1000 - (1000 * >>> Request / Capacity). >>> >>> - >>> >>> *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the >>> Request), we are mathematically inflating the OOM score. For example, on >>> a >>> standard node, a Zero-Guarantee pod ends up with a significantly higher >>> OOM >>> score (more likely to be killed) compared to a standard pod. >>> - >>> >>> *The Consequence:* In a race condition, the kernel will >>> mathematically target these "optimized" Spark pods for termination >>> *before* their neighbors, regardless of our intent. >>> >>> *2. The "Smoking Gun": Kubelet Bug #131169* There is a known defect in >>> the Kubelet (Issue #131169) where *PriorityClass is ignored when >>> calculating OOM scores for Burstable pods*. >>> >>> - >>> >>> This invalidates the assumption that we can simply "manage" the risk >>> with priorities later. >>> - >>> >>> Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod >>> is statistically identical to a "Best Effort" pod in the eyes of the OOM >>> killer. >>> - >>> >>> *Conclusion:* We ideally *should* enforce a minimum memory floor to >>> keep the Request value high enough to secure a survivable OOM score. >>> >>> *3. Silent Failures (Thread Exhaustion)* The research confirms that >>> "Zero-Guarantee" creates a vector for java.lang.OutOfMemoryError: >>> unable to create new native thread. >>> >>> - >>> >>> If a pod lands on a node with just enough RAM for the Heap (Request) >>> but zero extra for the OS, the pthread_create call will fail >>> immediately. >>> - >>> >>> This results in "silent" application crashes that do not trigger >>> standard K8s OOM alerts, leading to un-debuggable support scenarios for >>> general users. >>> >>> *Final Proposal & Documentation Compromise* >>> >>> My strong preference is to add the *Safety Floor (minGuaranteedRatio)* >>> configuration to the code. >>> >>> However, if after reviewing this evidence you are *adamant* that no new >>> configurations should be added to the code, I am willing to *unblock >>> the vote* on one strict condition: >>> >>> *The SPIP and Documentation must explicitly flag this risk.* We cannot >>> simply leave this as an implementation detail. The documentation must >>> contain a "Critical Warning" block stating: >>> >>> *"Warning: High-Heap/Low-Overhead configurations may result in 0MB >>> guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may >>> bypass PriorityClass protections and lead to silent 'Native Thread' >>> exhaustion failures on contended nodes. Users are responsible for >>> validating stability."* >>> >>> If you agree to either the code change (preferred) or this specific >>> documentation warning please update SIP doc , I am happy to support.. >>> >>> >>> Regards, >>> >>> Viquar Khan >>> >>> Sr Data Architect >>> >>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>> >>> >>> >>> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote: >>> >>>> 1. Re: "Imagined Reasons" & Zero Overhead >>>> when I said "imagined reasons", I meant I didn't see the issue you >>>> described appear in a prod environment running millions of jobs every >>>> month, and I have also said that why it won't happen in PINS and other >>>> normal case: as in a K8S cluster , there will be a reserved space for >>>> system daemons in each host, even with many 0-memoryOverhead jobs, they >>>> won't be "fully packed" as you imagined since these 0-memory overhead jobs >>>> don't need much memory overhead space anyway >>>> >>>> let me bring my earlier suggestions again, if you don't want any job to >>>> have 0 memoryOverhead, you can just calculate how much memoryOverhead is >>>> guaranteed with simple arithmetic, if it is 0, do not use this feature >>>> >>>> In general, I don't really suggest you use this feature if you cannot >>>> manage the rollout process, just like no one should apply something like >>>> auto-tuning to all of their jobs without a dedicated Spark platform team . >>>> >>>> 2. Kubelet Eviction Relevance >>>> >>>> 2.a my question is , how PID/Disk pressure is related to the memory >>>> related feature we are discussing here? please don't fan out the discussion >>>> scope unlimitedly >>>> 2.b exposing spark.kubernetes.executor.bursty.priorityClassName is far >>>> away from a reasonable design, the priority class name should be controlled >>>> in cluster level and then specified via something like spark operator or if >>>> you can specify pod spec, instead of embedding it to a memory related >>>> feature >>>> >>>> 3. Can we agree to simply *add these two parameters as optional >>>> configurations*? >>>> >>>> unfortunately no... >>>> >>>> some of the problems you raised probably will happen in very very >>>> extreme cases, I have provided solutions to them without the need to add >>>> additional configs... Other problems you raised are not related to what >>>> this SPIP is about, e.g. PID exhausting, etc. and some of your proposed >>>> design doesn't make sense to me , e.g. specifying executor's priority >>>> class via such a memory related feature.... >>>> >>>> >>>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> >>>> wrote: >>>> >>>>> Hi Nan, >>>>> >>>>> Thanks for the candid response. I see where you are coming from >>>>> regarding managed rollouts, but I think we are viewing this from two >>>>> different lenses: "Internal Platform" vs. "General Open Source Product." >>>>> >>>>> Here is why I am pushing for these two specific configuration hooks: >>>>> >>>>> 1. Re: "Imagined Reasons" & Zero Overhead >>>>> >>>>> You mentioned that you have observed jobs running fine with zero >>>>> memoryOverhead. >>>>> >>>>> While that may be true for specific workloads in your environment, the >>>>> requirement for non-heap memory is not "imagined"—it is a JVM >>>>> specification. Thread stacks, CodeCache, and Netty DirectByteBuffer >>>>> control >>>>> structures must live in non-heap memory. >>>>> >>>>> - >>>>> >>>>> *The Scenario:* If G=0, then Pod Request == Heap. If a node is >>>>> fully bin-packed (Sum of Requests = Node Capacity), your executor is >>>>> mathematically guaranteed *zero bytes* of non-heap memory unless >>>>> it can steal from the burst pool. >>>>> - >>>>> >>>>> *The Risk:* If the burst pool is temporarily exhausted by >>>>> neighbors, a simple thread creation will throw OutOfMemoryError: >>>>> unable to create new native thread. >>>>> - >>>>> >>>>> *The Fix:* I am not asking to change your default behavior. I am >>>>> asking to *expose the config* (minGuaranteedRatio). If you set it >>>>> to 0.0 (default), your behavior is unchanged. But for those of us >>>>> running high-concurrency environments who need a 5-10% safety buffer >>>>> for >>>>> thread stacks, we need the *capability* to configure it without >>>>> maintaining a fork or writing complex pre-submission wrappers. >>>>> >>>>> 2. Re: Kubelet Eviction Relevance >>>>> >>>>> You asked how Disk/PID pressure is related. >>>>> >>>>> In Kubernetes, PriorityClass is the universal signal for pod >>>>> importance during any node-pressure event (not just memory). >>>>> >>>>> - >>>>> >>>>> If a node runs out of Ephemeral Storage (common with Spark >>>>> Shuffle), the Kubelet evicts pods. >>>>> - >>>>> >>>>> Without a priorityClassName config, these Spark pods (which are >>>>> now QoS-downgraded to Burstable) will be evicted *before* >>>>> Best-Effort jobs that might have a higher priority class. >>>>> - >>>>> >>>>> Again, this is a standard Kubernetes spec feature. There is no >>>>> downside to exposing >>>>> spark.kubernetes.executor.bursty.priorityClassName as an optional >>>>> config. >>>>> >>>>> *Proposal to Unblock* >>>>> >>>>> We both want this feature merged. I am not asking to change your >>>>> formula's default behavior. >>>>> >>>>> Can we agree to simply *add these two parameters as optional >>>>> configurations*? >>>>> >>>>> 1. >>>>> >>>>> minGuaranteedRatio (Default: 0.0 -> preserves your logic exactly). >>>>> 2. >>>>> >>>>> priorityClassName (Default: null -> preserves your logic exactly). >>>>> >>>>> This satisfies your design goals while making the feature robust >>>>> enough for my production requirements. >>>>> >>>>> >>>>> Regards, >>>>> >>>>> Viquar Khan >>>>> >>>>> Sr Data Architect >>>>> >>>>> https://www.linkedin.com/in/vaquar-khan-b695577/ >>>>> >>>> >>> >>> -- Regards, Vaquar Khan
