I am changing my vote from +1 to 0 (non-binding).

*Reason*: I support the feature goal, but I cannot vote +1 because the safety documentation regarding Kubelet bug #131169 (https://github.com/kubernetes/kubernetes/issues/131169) was not included. I stand by the technical risks highlighted in the discussion thread; a rough sketch of the arithmetic behind them follows.
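
For reference, a minimal sketch of the Burstable OOM-score arithmetic behind that risk, assuming a made-up 64 GiB node and an 8 GiB executor heap (the Kubelet applies extra clamping at the extremes, omitted here):

    // Core term of the oom_score_adj the Kubelet assigns to a Burstable pod.
    // Higher score means the kernel kills that pod first under memory pressure.
    def oomScoreAdj(requestMiB: Long, nodeCapacityMiB: Long): Long =
      1000L - (1000L * requestMiB) / nodeCapacityMiB

    val nodeMiB       = 64L * 1024                    // 64 GiB worker node
    val zeroGuarantee = oomScoreAdj(8192, nodeMiB)    // Request = heap only (8 GiB)      -> 875
    val withOverhead  = oomScoreAdj(9011, nodeMiB)    // heap + ~10% guaranteed overhead  -> 863

The Request is the only lever in that formula, so shrinking it to the heap alone always moves the executor up the kill list relative to an otherwise identical pod that keeps its overhead guaranteed.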
Regards,
Viquar Khan

On Tue, 30 Dec 2025 at 15:26, vaquar khan <[email protected]> wrote:

> Hi Sean,
>
> I use drafting tools to help me communicate clearly, as English is not my first language.
>
> I have been part of this group since 2014 and I value this community highly. It doesn't feel good to be accused of wasting time after all these years. I'll make an effort to keep things shorter going forward, as I am also working on the same issues.
>
> Have you seen any actual issue in my proposals? Keeping the discussion technical would help the community.
>
> Regards,
> Vaquar Khan
>
> On Tue, Dec 30, 2025, 2:50 PM Sean Owen <[email protected]> wrote:
>
>> Vaquar, I think I need to ask: how much of your messages is written by an AI? They have many of the stylistic characteristics of that output. This is not by itself wrong.
>> While well-formed, the replies are verbose and repetitive, and seem to be talking past the responses you receive.
>> There are thousands of subscribers here and I want to make sure we are spending everyone's time in good faith.
>>
>> On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:
>>
>>> Hi Nan, Yao, and Chao,
>>>
>>> I have done a deep dive into the underlying Linux kernel and Kubernetes behaviors to validate our respective positions. While I fully support the economic goal of reclaiming the estimated 30-50% of stranded memory in static clusters, the technical evidence suggests that the "Zero-Guarantee" configuration is not just an optimization choice: it is architecturally unsafe for standard Kubernetes environments because of how OOM scores are calculated.
>>>
>>> I am sharing these findings to explain why I have insisted on the *Safety Floor (minGuaranteedRatio)* as a necessary guardrail.
>>>
>>> *1. The "Death Trap" of OOM Scores (The Math)*
>>> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's environment. However, in a standard environment, the math works against us. For a Burstable pod, the Kubelet sets oom_score_adj inversely to the Request size: 1000 - (1000 * Request / Capacity).
>>>
>>> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the Request), we mathematically inflate the OOM score. On a standard node, a Zero-Guarantee pod ends up with a higher OOM score (more likely to be killed) than an otherwise identical pod that keeps its overhead in the Request.
>>> - *The Consequence:* Under memory pressure, the kernel will mathematically target these "optimized" Spark pods for termination *before* such neighbors, regardless of our intent.
>>>
>>> *2. The "Smoking Gun": Kubelet Bug #131169*
>>> There is a known defect in the Kubelet (issue #131169) where *PriorityClass is ignored when calculating OOM scores for Burstable pods*.
>>>
>>> - This invalidates the assumption that we can simply "manage" the risk with priorities later.
>>> - Until this is fixed in upstream Kubernetes (v1.30+), a "Zero-Guarantee" pod looks much like a Best-Effort pod in the eyes of the OOM killer.
>>> - *Conclusion:* Ideally we *should* enforce a minimum memory floor to keep the Request value high enough to secure a survivable OOM score.
>>>
>>> *3. Silent Failures (Thread Exhaustion)*
>>> The research confirms that "Zero-Guarantee" creates a vector for java.lang.OutOfMemoryError: unable to create new native thread.
>>>
>>> - If a pod lands on a node with just enough RAM for the heap (the Request) but zero headroom for anything else, the pthread_create call behind new Java threads will fail immediately.
>>> - This results in "silent" application crashes that do not trigger standard Kubernetes OOM alerts, leading to hard-to-debug support scenarios for general users.
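>>>
>>> To put rough numbers on this, here is a back-of-the-envelope sketch of one executor's non-heap needs; the figures are illustrative assumptions, not measurements, but none of these allocations can come out of the heap:
>>>
>>>     // Illustrative non-heap budget for a single executor (assumed figures, not measured)
>>>     val threadStacksMiB = 200 * 1   // ~200 live threads x 1 MiB default -Xss
>>>     val codeCacheMiB    = 240       // JIT code cache (JVM default ReservedCodeCacheSize)
>>>     val metaspaceMiB    = 128       // class metadata, grows with the classes a job loads
>>>     val nettyDirectMiB  = 256       // shuffle/RPC DirectByteBuffers
>>>     val minNonHeapMiB   = threadStacksMiB + codeCacheMiB + metaspaceMiB + nettyDirectMiB  // ~824 MiB
>>>
>>> With a zero-guarantee Request, every MiB of that budget depends on winning the race for the shared burst pool, and losing the race for a thread stack is exactly what surfaces as the native-thread OutOfMemoryError above.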
>>>
>>> *Final Proposal & Documentation Compromise*
>>>
>>> My strong preference is to add the *Safety Floor (minGuaranteedRatio)* configuration to the code.
>>>
>>> However, if after reviewing this evidence you are *adamant* that no new configurations should be added to the code, I am willing to *unblock the vote* on one strict condition:
>>>
>>> *The SPIP and documentation must explicitly flag this risk.* We cannot simply leave this as an implementation detail. The documentation must contain a "Critical Warning" block stating:
>>>
>>> *"Warning: High-heap/low-overhead configurations may result in 0 MB of guaranteed overhead. Due to Kubelet limitations (issue #131169), this may bypass PriorityClass protections and lead to silent 'native thread' exhaustion failures on contended nodes. Users are responsible for validating stability."*
>>>
>>> If you agree to either the code change (preferred) or this specific documentation warning, please update the SPIP doc and I am happy to support.
>>>
>>> Regards,
>>>
>>> Viquar Khan
>>> Sr Data Architect
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>>>
>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>> When I said "imagined reasons", I meant that I have not seen the issue you describe appear in a prod environment running millions of jobs every month. I have also explained why it won't happen at PINS or in other normal cases: in a K8S cluster there is reserved space for system daemons on each host, so even with many 0-memoryOverhead jobs, nodes won't be "fully packed" as you imagine, because these 0-memoryOverhead jobs don't need much memory overhead space anyway.
>>>>
>>>> Let me repeat my earlier suggestion: if you don't want any job to have 0 memoryOverhead, just calculate how much memoryOverhead is guaranteed with simple arithmetic, and if it is 0, do not use this feature.
>>>>
>>>> In general, I don't really suggest you use this feature if you cannot manage the rollout process, just like no one should apply something like auto-tuning to all of their jobs without a dedicated Spark platform team.
>>>>
>>>> 2. Kubelet Eviction Relevance
>>>>
>>>> 2.a My question is: how is PID/disk pressure related to the memory feature we are discussing here? Please don't fan out the discussion scope without limit.
>>>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far from a reasonable design. The priority class name should be controlled at the cluster level and then specified via something like the Spark operator, or via the pod spec if you can provide one, instead of being embedded in a memory-related feature.
>>>>
>>>> 3. Can we agree to simply *add these two parameters as optional configurations*?
>>>>
>>>> Unfortunately, no...
>>>>
>>>> Some of the problems you raised will probably only happen in very, very extreme cases, and I have provided solutions to them that do not require additional configs. Other problems you raised are not related to what this SPIP is about, e.g. PID exhaustion, and some of your proposed design doesn't make sense to me, e.g. specifying the executor's priority class via such a memory-related feature.
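>>>
>>> (For reference, the "simple arithmetic" described in point 1 could look like the sketch below. How the guaranteed share is derived under the SPIP is my assumption here; the point is only that it is a trivial per-job pre-flight check.)
>>>
>>>     // Pre-flight check: how much memoryOverhead would stay guaranteed for this job?
>>>     // burstFraction is an assumed knob standing in for the SPIP's actual split.
>>>     val heapMiB       = 8192L
>>>     val overheadMiB   = math.max(384L, (heapMiB * 0.10).toLong)        // Spark's default overhead
>>>     val burstFraction = 1.0                                            // all overhead moved to the burst pool
>>>     val guaranteedMiB = (overheadMiB * (1.0 - burstFraction)).toLong   // 0 here -> do not enable the feature for this job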
>>>>
>>>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:
>>>>
>>>>> Hi Nan,
>>>>>
>>>>> Thanks for the candid response. I see where you are coming from regarding managed rollouts, but I think we are viewing this through two different lenses: "Internal Platform" vs. "General Open Source Product."
>>>>>
>>>>> Here is why I am pushing for these two specific configuration hooks:
>>>>>
>>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>>
>>>>> You mentioned that you have observed jobs running fine with zero memoryOverhead.
>>>>>
>>>>> While that may be true for specific workloads in your environment, the requirement for non-heap memory is not "imagined": it is how the JVM works. Thread stacks, the CodeCache, and Netty DirectByteBuffer control structures must live in non-heap memory.
>>>>>
>>>>> - *The Scenario:* If G = 0, then Pod Request == Heap. If a node is fully bin-packed (sum of Requests = node capacity), your executor is mathematically guaranteed *zero bytes* of non-heap memory unless it can steal from the burst pool.
>>>>> - *The Risk:* If the burst pool is temporarily exhausted by neighbors, a simple thread creation will throw OutOfMemoryError: unable to create new native thread.
>>>>> - *The Fix:* I am not asking to change your default behavior. I am asking to *expose the config* (minGuaranteedRatio). If you set it to 0.0 (the default), your behavior is unchanged. But those of us running high-concurrency environments who need a 5-10% safety buffer for thread stacks need the *capability* to configure it without maintaining a fork or writing complex pre-submission wrappers.
>>>>>
>>>>> 2. Re: Kubelet Eviction Relevance
>>>>>
>>>>> You asked how disk/PID pressure is related.
>>>>>
>>>>> In Kubernetes, PriorityClass is the universal signal for pod importance during any node-pressure event (not just memory).
>>>>>
>>>>> - If a node runs out of ephemeral storage (common with Spark shuffle), the Kubelet evicts pods.
>>>>> - Without a priorityClassName config, these Spark pods (which are now QoS-downgraded to Burstable) can be evicted *before* Best-Effort jobs that happen to carry a higher priority class.
>>>>> - Again, this is a standard Kubernetes feature. There is no downside to exposing spark.kubernetes.executor.bursty.priorityClassName as an optional config.
>>>>>
>>>>> *Proposal to Unblock*
>>>>>
>>>>> We both want this feature merged. I am not asking to change your formula's default behavior.
>>>>>
>>>>> Can we agree to simply *add these two parameters as optional configurations*?
>>>>>
>>>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>>>
>>>>> This satisfies your design goals while making the feature robust enough for my production requirements; a rough sketch of how the floor would factor into pod sizing follows.
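>>>>>
>>>>> To be explicit about what I am proposing (the config name, default, and formula here are my proposal, not existing Spark behavior):
>>>>>
>>>>>     // Proposed safety floor: keep a slice of memory guaranteed inside the pod's Request.
>>>>>     val heapMiB            = 8192L   // spark.executor.memory
>>>>>     val overheadMiB        = 819L    // spark.executor.memoryOverhead, burstable under the SPIP
>>>>>     val minGuaranteedRatio = 0.05    // proposed spark.kubernetes.executor.minGuaranteedRatio (default 0.0)
>>>>>     val guaranteedMiB      = (heapMiB * minGuaranteedRatio).toLong  // 409 MiB kept in the Request
>>>>>     val podRequestMiB      = heapMiB + guaranteedMiB                // 8601 MiB requested
>>>>>     val podLimitMiB        = heapMiB + overheadMiB                  // 9011 MiB limit (burst headroom)
>>>>>
>>>>> With the default of 0.0 the Request collapses to the heap alone and the SPIP's behavior is untouched; priorityClassName would likewise be a simple pass-through onto the executor pod spec when set.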
>>>>>
>>>>> Regards,
>>>>>
>>>>> Viquar Khan
>>>>>
>>>>> Sr Data Architect
>>>>>
>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan