+1 (non-binding). I have read through all the messages and agree with Nan that the risk Vaquar describes is quite rare and can mostly be avoided by setting a conservative value for the bursty factor (at least for the K8s environment my service runs on).
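
For anyone who wants to sanity-check that before enabling the feature, below is a rough sketch of the "simple arithmetic" Nan suggests further down the thread. The formula and the burstyFactor name are placeholders I made up for illustration, not the SPIP's actual definition, so substitute whatever the final design specifies.

object GuaranteedOverheadCheck {
  // Placeholder formula, NOT the SPIP's actual definition: assume the guaranteed
  // slice of memoryOverhead shrinks as the (hypothetical) bursty factor grows.
  def guaranteedOverheadMiB(overheadMiB: Long, burstyFactor: Double): Long =
    math.max(0L, (overheadMiB * (1.0 - burstyFactor)).toLong)

  def main(args: Array[String]): Unit = {
    val overheadMiB  = 1024L   // spark.executor.memoryOverhead for the job
    val conservative = 0.25    // the kind of bursty factor we actually run with
    val aggressive   = 1.0     // the "zero-guarantee" case debated below
    println(s"conservative factor -> ${guaranteedOverheadMiB(overheadMiB, conservative)} MiB guaranteed")
    println(s"aggressive factor   -> ${guaranteedOverheadMiB(overheadMiB, aggressive)} MiB guaranteed")
    // If this prints 0 MiB, follow Nan's advice below: don't enable the feature for that job.
  }
}
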
On Thu, Jan 1, 2026 at 10:50 AM vaquar khan <[email protected]> wrote:

> I am changing my vote from +1 to 0 (non-binding).
> *Reason*: I support the feature goal, but I cannot vote +1 because the safety documentation regarding Kubelet Bug #131169 (https://github.com/kubernetes/kubernetes/issues/131169) was not included. I stand by the technical risks highlighted in the discussion thread.
>
> Regards,
> Vaquar Khan
>
> On Tue, 30 Dec 2025 at 15:26, vaquar khan <[email protected]> wrote:
>
>> Hi Sean,
>>
>> I use drafting tools to help me communicate clearly, as English is not my first language.
>>
>> Since I have been part of this group since 2014, I value this community highly. It doesn't feel good to be accused of wasting time after all these years. I'll make an effort to keep things shorter moving forward, as I am also working on the same issues.
>>
>> Have you seen any issues in my proposals? Keeping the discussion technical would help the Spark community.
>>
>> Regards,
>> Vaquar Khan
>>
>> On Tue, Dec 30, 2025, 2:50 PM Sean Owen <[email protected]> wrote:
>>
>>> vaquar, I think I need to ask: how much of your messages is written by an AI? They have many of the stylistic characteristics of AI output. This is not by itself wrong.
>>> While well formed, the replies are verbose and repetitive, and seem to be talking past the responses you receive.
>>> There are thousands of subscribers here and I want to make sure we are spending everyone's time in good faith.
>>>
>>> On Tue, Dec 30, 2025 at 9:29 AM vaquar khan <[email protected]> wrote:
>>>
>>>> Hi Nan, Yao, and Chao,
>>>>
>>>> I have done a deep dive into the underlying Linux kernel and Kubernetes behaviors to validate our respective positions. While I fully support the economic goal of reclaiming the estimated 30-50% of stranded memory in static clusters, the technical evidence suggests that the "Zero-Guarantee" configuration is not just an optimization choice: it is architecturally unsafe for standard Kubernetes environments because of how OOM scores are calculated.
>>>>
>>>> I am sharing these findings to explain why I have insisted on the *Safety Floor (minGuaranteedRatio)* as a necessary guardrail.
>>>>
>>>> *1. The "Death Trap" of OOM Scores (The Math)*
>>>> Nan mentioned that "Zero-Guarantee" pods work fine in Pinterest's environment. However, in a standard environment, the math works against us. For Burstable pods, the Kubelet sets oom_score_adj inversely to the Request size: 1000 - (1000 * Request / Capacity).
>>>>
>>>> - *The Risk:* By allowing memoryOverhead to drop to 0 (lowering the Request), we mathematically inflate the OOM score. On a standard node, a Zero-Guarantee pod ends up with a significantly higher OOM score (more likely to be killed) than a standard pod.
>>>> - *The Consequence:* Under node memory pressure, the kernel will target these "optimized" Spark pods for termination *before* their neighbors, regardless of our intent.
>>>>
>>>> *2. The "Smoking Gun": Kubelet Bug #131169*
>>>> There is a known defect in the Kubelet (Issue #131169) where *PriorityClass is ignored when calculating OOM scores for Burstable pods*.
>>>>
>>>> - This invalidates the assumption that we can simply "manage" the risk with priorities later.
>>>> - Until this is fixed in upstream K8s (v1.30+), a "Zero-Guarantee" pod is effectively identical to a "Best Effort" pod in the eyes of the OOM killer.
>>>> - *Conclusion:* We ideally *should* enforce a minimum memory floor to keep the Request value high enough to secure a survivable OOM score.
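
To put numbers on the formula quoted in point 1, here is a small sketch of that Burstable heuristic. The clamping range reflects my reading of the kubelet's QoS policy, and the pod/node sizes are invented, so treat it as an illustration rather than a statement about any particular cluster:

object OomScoreSketch {
  // Sketch of the Burstable heuristic quoted above:
  //   oom_score_adj ~ 1000 - (1000 * memoryRequest / nodeCapacity)
  // The kubelet also clamps the result to roughly [3, 999]; Guaranteed pods get -997
  // and BestEffort pods get 1000. Higher means "kill me first".
  def burstableOomScoreAdj(requestBytes: Long, capacityBytes: Long): Int = {
    val raw = 1000L - (1000L * requestBytes) / capacityBytes
    math.min(999L, math.max(3L, raw)).toInt
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024 * 1024
    val nodeCapacity = 32 * gib
    val heapOnly     = 28 * gib           // "zero-guarantee": request == heap
    val heapPlusOvh  = 28 * gib + 3 * gib // same heap plus ~10% guaranteed overhead
    println(s"zero-guarantee pod     : ${burstableOomScoreAdj(heapOnly, nodeCapacity)}")
    println(s"pod with overhead kept : ${burstableOomScoreAdj(heapPlusOvh, nodeCapacity)}")
    // The smaller request yields the larger score, i.e. it is the first OOM-kill candidate.
  }
}
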
>>>> *3. Silent Failures (Thread Exhaustion)*
>>>> The research confirms that "Zero-Guarantee" creates a vector for java.lang.OutOfMemoryError: unable to create new native thread.
>>>>
>>>> - If a pod lands on a node with just enough RAM for the Heap (Request) but zero extra for the OS, the pthread_create call will fail immediately.
>>>> - This results in "silent" application crashes that do not trigger standard K8s OOM alerts, leading to hard-to-debug support scenarios for general users.
>>>>
>>>> *Final Proposal & Documentation Compromise*
>>>>
>>>> My strong preference is to add the *Safety Floor (minGuaranteedRatio)* configuration to the code.
>>>>
>>>> However, if after reviewing this evidence you are *adamant* that no new configurations should be added to the code, I am willing to *unblock the vote* on one strict condition:
>>>>
>>>> *The SPIP and documentation must explicitly flag this risk.* We cannot simply leave this as an implementation detail. The documentation must contain a "Critical Warning" block stating:
>>>>
>>>> *"Warning: High-Heap/Low-Overhead configurations may result in 0 MB guaranteed overhead. Due to Kubelet limitations (Issue #131169), this may bypass PriorityClass protections and lead to silent 'Native Thread' exhaustion failures on contended nodes. Users are responsible for validating stability."*
>>>>
>>>> If you agree to either the code change (preferred) or this specific documentation warning, please update the SPIP doc, and I am happy to support.
>>>>
>>>> Regards,
>>>>
>>>> Vaquar Khan
>>>> Sr Data Architect
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>> On Tue, 30 Dec 2025 at 01:45, Nan Zhu <[email protected]> wrote:
>>>>
>>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>> When I said "imagined reasons", I meant that I have not seen the issue you describe in a prod environment running millions of jobs every month, and I have also explained why it won't happen at PINS or in other normal cases: in a K8s cluster there is reserved space for system daemons on each host, so even with many 0-memoryOverhead jobs the hosts won't be "fully packed" as you imagine, since these jobs don't need much memory overhead space anyway.
>>>>>
>>>>> Let me repeat my earlier suggestion: if you don't want any job to have 0 memoryOverhead, you can calculate how much memoryOverhead is guaranteed with simple arithmetic, and if it is 0, do not use this feature.
>>>>>
>>>>> In general, I don't suggest you use this feature if you cannot manage the rollout process, just as no one should apply something like auto-tuning to all of their jobs without a dedicated Spark platform team.
>>>>>
>>>>> 2. Kubelet Eviction Relevance
>>>>>
>>>>> 2.a My question is: how is PID/disk pressure related to the memory-related feature we are discussing here?
>>>>> Please don't fan out the discussion scope without limit.
>>>>>
>>>>> 2.b Exposing spark.kubernetes.executor.bursty.priorityClassName is far from a reasonable design; the priority class name should be controlled at the cluster level and specified via something like the Spark operator, or via the pod spec if you can provide one, instead of being embedded in a memory-related feature.
>>>>>
>>>>> 3. Can we agree to simply *add these two parameters as optional configurations*?
>>>>>
>>>>> Unfortunately, no...
>>>>>
>>>>> Some of the problems you raised will probably only happen in very, very extreme cases, and I have provided solutions to them without the need for additional configs... Other problems you raised are not related to what this SPIP is about, e.g. PID exhaustion, and some of your proposed design doesn't make sense to me, e.g. specifying the executor's priority class via such a memory-related feature...
>>>>>
>>>>> On Mon, Dec 29, 2025 at 11:16 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi Nan,
>>>>>>
>>>>>> Thanks for the candid response. I see where you are coming from regarding managed rollouts, but I think we are viewing this through two different lenses: "Internal Platform" vs. "General Open Source Product."
>>>>>>
>>>>>> Here is why I am pushing for these two specific configuration hooks:
>>>>>>
>>>>>> 1. Re: "Imagined Reasons" & Zero Overhead
>>>>>>
>>>>>> You mentioned that you have observed jobs running fine with zero memoryOverhead.
>>>>>>
>>>>>> While that may be true for specific workloads in your environment, the requirement for non-heap memory is not "imagined"; it is inherent to the JVM. Thread stacks, the CodeCache, and Netty DirectByteBuffer control structures must live in non-heap memory.
>>>>>>
>>>>>> - *The Scenario:* If G=0, then Pod Request == Heap. If a node is fully bin-packed (sum of Requests = node capacity), your executor is mathematically guaranteed *zero bytes* of non-heap memory unless it can steal from the burst pool.
>>>>>> - *The Risk:* If the burst pool is temporarily exhausted by neighbors, a simple thread creation will throw OutOfMemoryError: unable to create new native thread.
>>>>>> - *The Fix:* I am not asking to change your default behavior. I am asking to *expose the config* (minGuaranteedRatio). If you set it to 0.0 (the default), your behavior is unchanged. But those of us running high-concurrency environments who need a 5-10% safety buffer for thread stacks need the *capability* to configure it without maintaining a fork or writing complex pre-submission wrappers.
>>>>>>
>>>>>> 2. Re: Kubelet Eviction Relevance
>>>>>>
>>>>>> You asked how disk/PID pressure is related.
>>>>>>
>>>>>> In Kubernetes, PriorityClass is the universal signal for pod importance during any node-pressure event (not just memory).
>>>>>>
>>>>>> - If a node runs out of ephemeral storage (common with Spark shuffle), the Kubelet evicts pods.
>>>>>> - Without a priorityClassName config, these Spark pods (which are now QoS-downgraded to Burstable) will be evicted *before* Best-Effort jobs that might have a higher priority class.
>>>>>> - Again, this is a standard Kubernetes spec feature. There is no downside to exposing spark.kubernetes.executor.bursty.priorityClassName as an optional config.
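
My note on the priorityClassName point: as far as I can tell, a priority class can already be attached to executors today through Spark's existing pod template support, which is roughly the pod-spec route Nan mentions above. A minimal sketch; the template path and class name are examples, and it is worth verifying that your Spark version carries the field through:

import org.apache.spark.SparkConf

object PriorityClassViaPodTemplate {
  def main(args: Array[String]): Unit = {
    // spark.kubernetes.executor.podTemplateFile is an existing Spark-on-K8s setting;
    // the path and the PriorityClass name below are only examples.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.podTemplateFile",
           "/opt/spark/conf/executor-template.yaml")
    // executor-template.yaml would contain, among other fields:
    //   apiVersion: v1
    //   kind: Pod
    //   spec:
    //     priorityClassName: spark-batch   # a PriorityClass defined by the cluster admin
    println(conf.get("spark.kubernetes.executor.podTemplateFile"))
  }
}
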
>>>>>>
>>>>>> *Proposal to Unblock*
>>>>>>
>>>>>> We both want this feature merged. I am not asking to change your formula's default behavior.
>>>>>>
>>>>>> Can we agree to simply *add these two parameters as optional configurations*?
>>>>>>
>>>>>> 1. minGuaranteedRatio (default: 0.0 -> preserves your logic exactly).
>>>>>> 2. priorityClassName (default: null -> preserves your logic exactly).
>>>>>>
>>>>>> This satisfies your design goals while making the feature robust enough for my production requirements.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Vaquar Khan
>>>>>> Sr Data Architect
>>>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> --
> Regards,
> Vaquar Khan
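
Last, a footnote from me on the "unable to create new native thread" failure mode discussed above, since it is easy to reason about with back-of-the-envelope numbers. The thread count, stack size, and cache sizes below are invented for illustration; real values depend on the workload and JVM flags:

object NonHeapHeadroomSketch {
  def main(args: Array[String]): Unit = {
    // Rough estimate of non-heap memory an executor needs even when the heap is fully
    // provisioned: thread stacks + JIT code cache + direct (Netty) buffers.
    val threads      = 300L  // task, shuffle, Netty and GC threads on a busy executor (made up)
    val stackSizeMiB = 1L    // default -Xss on 64-bit Linux is typically about 1 MiB
    val codeCacheMiB = 240L  // in the ballpark of the default -XX:ReservedCodeCacheSize
    val directBufMiB = 256L  // direct buffers; highly workload dependent (made up)
    val headroomMiB  = threads * stackSizeMiB + codeCacheMiB + directBufMiB
    println(s"non-heap headroom needed: ~$headroomMiB MiB")
    // If the pod request equals the heap and the node is fully bin-packed, this memory can
    // only come from the shared burst pool; when that pool is exhausted, pthread_create
    // fails and the JVM throws "java.lang.OutOfMemoryError: unable to create new native thread".
  }
}
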
