Hi Sean, I use drafting tools to help me communicate clearly, as English is not my first language.
Since I have been part of this group since 2014, I value this community
highly. It doesn't feel good to be accused of wasting time after all these
years. I'll make an effort to keep things shorter moving forward, as I am
also working on the same issues. Have you seen any issues in my proposals?
I think keeping the discussion technical would help the Spark community.

Regards,
Vaquar Khan

On Tue, Dec 30, 2025, 2:50 PM Sean Owen <[email protected]> wrote:

> Vaquar, I think I need to ask: how much of your messages is written by an
> AI? It has many stylistic characteristics of such output. This is not by
> itself wrong.
> While well-formed, the replies are verbose and repetitive, and seem to be
> talking past the responses you receive.
> There are 1000s of subscribers here and I want to make sure we are
> spending everyone's time in good faith.
>
> On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:
>
>> Hi Nan, Yao, and Chao,
>>
>> I have done a deep dive into the underlying Linux and Kubernetes kernel
>> behaviors to validate our respective positions. While I fully support
>> the economic goal of reclaiming the estimated 30-50% of stranded memory
>> in static clusters, the technical evidence suggests that the
>> "Zero-Guarantee" configuration is not just an optimization choice; it is
>> architecturally unsafe for standard Kubernetes environments due to how
>> the kernel calculates OOM scores.
>>
>> I am sharing these findings to explain why I have insisted on the
>> *Safety Floor (minGuaranteedRatio)* as a necessary guardrail.
>>
>> *1. The "Death Trap" of OOM Scores (The Math)*
>> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's
>> environment. However, in a standard environment, the math works against
>> us. The Linux kernel calculates oom_score_adj inversely to the Request
>> size: 1000 - (1000 * Request / Capacity).
>>
>> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the
>>   Request), we are mathematically inflating the OOM score. For example,
>>   on a standard node, a Zero-Guarantee pod ends up with a significantly
>>   higher OOM score (more likely to be killed) than a standard pod.
>> - *The Consequence:* In a race condition, the kernel will mathematically
>>   target these "optimized" Spark pods for termination *before* their
>>   neighbors, regardless of our intent.
>>
>> *2. The "Smoking Gun": Kubelet Bug #131169*
>> There is a known defect in the Kubelet (Issue #131169) where
>> *PriorityClass is ignored when calculating OOM scores for Burstable
>> pods*.
>>
>> - This invalidates the assumption that we can simply "manage" the risk
>>   with priorities later.
>> - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod
>>   is statistically identical to a "Best Effort" pod in the eyes of the
>>   OOM killer.
>> - *Conclusion:* We ideally *should* enforce a minimum memory floor to
>>   keep the Request value high enough to secure a survivable OOM score.
>>
>> *3. Silent Failures (Thread Exhaustion)*
>> The research confirms that "Zero-Guarantee" creates a vector for
>> java.lang.OutOfMemoryError: unable to create new native thread.
>>
>> - If a pod lands on a node with just enough RAM for the Heap (Request)
>>   but zero extra for the OS, the pthread_create call will fail
>>   immediately.
>> - This results in "silent" application crashes that do not trigger
>>   standard K8s OOM alerts, leading to un-debuggable support scenarios
>>   for general users.
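
To make the arithmetic in point 1 concrete, here is a rough sketch in
Python of the simplified formula quoted above. The node and executor
sizes are made up, and the real kubelet additionally clamps the result
to a bounded range:

    # Rough sketch of the Burstable-pod heuristic quoted in point 1:
    #   oom_score_adj ~= 1000 - (1000 * memory_request / node_capacity)
    # Illustrative only; the actual kubelet clamps the result.
    def oom_score_adj(request_gib: float, node_capacity_gib: float) -> int:
        return round(1000 - (1000 * request_gib / node_capacity_gib))

    NODE_CAPACITY_GIB = 32.0   # hypothetical worker node
    HEAP_GIB = 24.0            # hypothetical executor heap (spark.executor.memory)

    # "Zero-Guarantee": the pod request covers only the heap.
    zero_guarantee = oom_score_adj(HEAP_GIB, NODE_CAPACITY_GIB)        # 250

    # Same executor with a 10% overhead floor folded into the request.
    with_floor = oom_score_adj(HEAP_GIB * 1.10, NODE_CAPACITY_GIB)     # 175

    # A higher adjustment means the pod is killed earlier under memory pressure.
    print(zero_guarantee, with_floor)

The absolute numbers are not the point; the point is that lowering the
request mechanically raises the score.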
>>
>> *Final Proposal & Documentation Compromise*
>>
>> My strong preference is to add the *Safety Floor (minGuaranteedRatio)*
>> configuration to the code.
>>
>> However, if after reviewing this evidence you are *adamant* that no new
>> configurations should be added to the code, I am willing to *unblock the
>> vote* on one strict condition:
>>
>> *The SPIP and documentation must explicitly flag this risk.* We cannot
>> simply leave this as an implementation detail. The documentation must
>> contain a "Critical Warning" block stating:
>>
>> *"Warning: High-Heap/Low-Overhead configurations may result in 0 MB of
>> guaranteed overhead. Due to Kubelet limitations (Issue #131169), this
>> may bypass PriorityClass protections and lead to silent 'Native Thread'
>> exhaustion failures on contended nodes. Users are responsible for
>> validating stability."*
>>
>> If you agree to either the code change (preferred) or this specific
>> documentation warning, please update the SPIP doc; I am happy to
>> support it.
>>
>> Regards,
>>
>> Viquar Khan
>>
>> Sr Data Architect
>>
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>>
>> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>>
>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>> When I said "imagined reasons", I meant that I didn't see the issue you
>>> described appear in a prod environment running millions of jobs every
>>> month, and I have also said why it won't happen at PINS and in other
>>> normal cases: in a K8S cluster there will be reserved space for system
>>> daemons on each host, and even with many 0-memoryOverhead jobs, they
>>> won't be "fully packed" as you imagined, since these 0-memoryOverhead
>>> jobs don't need much memory overhead space anyway.
>>>
>>> Let me bring up my earlier suggestion again: if you don't want any job
>>> to have 0 memoryOverhead, you can just calculate how much
>>> memoryOverhead is guaranteed with simple arithmetic, and if it is 0, do
>>> not use this feature.
>>>
>>> In general, I don't really suggest you use this feature if you cannot
>>> manage the rollout process, just as no one should apply something like
>>> auto-tuning to all of their jobs without a dedicated Spark platform
>>> team.
>>>
>>> 2. Kubelet Eviction Relevance
>>>
>>> 2.a My question is, how is PID/disk pressure related to the memory
>>> related feature we are discussing here? Please don't fan out the
>>> discussion scope unlimitedly.
>>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far
>>> from a reasonable design; the priority class name should be controlled
>>> at the cluster level and then specified via something like the Spark
>>> operator, or via the pod spec if you can specify one, instead of being
>>> embedded in a memory related feature.
>>>
>>> 3. Can we agree to simply *add these two parameters as optional
>>> configurations*?
>>>
>>> Unfortunately, no...
>>>
>>> Some of the problems you raised will probably happen only in very, very
>>> extreme cases, and I have provided solutions to them without the need
>>> to add additional configs. Other problems you raised are not related to
>>> what this SPIP is about, e.g. PID exhaustion, and some of your proposed
>>> design doesn't make sense to me, e.g. specifying the executor's
>>> priority class via such a memory related feature.
>>>
>>>
>>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi Nan,
>>>>
>>>> Thanks for the candid response. I see where you are coming from
>>>> regarding managed rollouts, but I think we are viewing this from two
>>>> different lenses: "Internal Platform" vs. "General Open Source
>>>> Product."
>>>>
>>>> Here is why I am pushing for these two specific configuration hooks:
>>>>
>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>
>>>> You mentioned that you have observed jobs running fine with zero
>>>> memoryOverhead.
>>>>
>>>> While that may be true for specific workloads in your environment,
>>>> the requirement for non-heap memory is not "imagined"; it is a JVM
>>>> specification. Thread stacks, CodeCache, and Netty DirectByteBuffer
>>>> control structures must live in non-heap memory.
>>>>
>>>> - *The Scenario:* If G=0, then Pod Request == Heap. If a node is
>>>>   fully bin-packed (Sum of Requests = Node Capacity), your executor
>>>>   is mathematically guaranteed *zero bytes* of non-heap memory unless
>>>>   it can steal from the burst pool.
>>>> - *The Risk:* If the burst pool is temporarily exhausted by
>>>>   neighbors, a simple thread creation will throw OutOfMemoryError:
>>>>   unable to create new native thread.
>>>> - *The Fix:* I am not asking to change your default behavior. I am
>>>>   asking to *expose the config* (minGuaranteedRatio). If you set it
>>>>   to 0.0 (the default), your behavior is unchanged. But for those of
>>>>   us running high-concurrency environments who need a 5-10% safety
>>>>   buffer for thread stacks, we need the *capability* to configure it
>>>>   without maintaining a fork or writing complex pre-submission
>>>>   wrappers.
>>>>
>>>> 2. Re: Kubelet Eviction Relevance
>>>>
>>>> You asked how Disk/PID pressure is related.
>>>>
>>>> In Kubernetes, PriorityClass is the universal signal for pod
>>>> importance during any node-pressure event (not just memory).
>>>>
>>>> - If a node runs out of Ephemeral Storage (common with Spark
>>>>   Shuffle), the Kubelet evicts pods.
>>>> - Without a priorityClassName config, these Spark pods (which are now
>>>>   QoS-downgraded to Burstable) will be evicted *before* Best-Effort
>>>>   jobs that might have a higher priority class.
>>>> - Again, this is a standard Kubernetes spec feature. There is no
>>>>   downside to exposing
>>>>   spark.kubernetes.executor.bursty.priorityClassName as an optional
>>>>   config.
>>>>
>>>> *Proposal to Unblock*
>>>>
>>>> We both want this feature merged. I am not asking to change your
>>>> formula's default behavior.
>>>>
>>>> Can we agree to simply *add these two parameters as optional
>>>> configurations*?
>>>>
>>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>>
>>>> This satisfies your design goals while making the feature robust
>>>> enough for my production requirements.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Viquar Khan
>>>>
>>>> Sr Data Architect
>>>>
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
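
For what it is worth, the pre-submission check Nan describes and the floor
I am proposing can be expressed in a few lines. This is only a sketch: the
minGuaranteedRatio name is the configuration proposed in this thread, not
an existing Spark setting, and the SPIP's actual request formula is not
reproduced here:

    # Hypothetical pre-submission guard along the lines Nan suggests:
    # compute (with the SPIP's formula, not reproduced here) how much
    # memoryOverhead is actually guaranteed, and skip the feature when the
    # guarantee falls below a floor expressed as a fraction of the heap.
    def bursty_memory_is_safe(guaranteed_overhead_mib: float,
                              heap_mib: float,
                              min_guaranteed_ratio: float = 0.05) -> bool:
        """Mirrors the proposed minGuaranteedRatio floor (e.g. 5-10% of heap)."""
        return guaranteed_overhead_mib >= min_guaranteed_ratio * heap_mib

    # 8 GiB heap with 0 MiB guaranteed overhead -> do not enable the feature.
    print(bursty_memory_is_safe(guaranteed_overhead_mib=0, heap_mib=8192))    # False

    # 8 GiB heap with 512 MiB guaranteed overhead -> above a 5% floor (~410 MiB).
    print(bursty_memory_is_safe(guaranteed_overhead_mib=512, heap_mib=8192))  # True

With the proposed default of 0.0 the check always passes, which is what
"preserves your logic exactly" means above.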
