Re: SPIP: Burst-aware memoryOverhead allocation algorithm for Spark@K8S

Kent Yao Sat, 03 Jan 2026 19:23:34 -0800

+1

Kent




On 2025/12/17 07:48:06 Wenchen Fan wrote:
> +1
> 
> On Wed, Dec 17, 2025 at 6:41 AM karuppayya <[email protected]> wrote:
> 
> > +1 from me.
> > I think it's well-scoped and takes advantage of Kubernetes' features
> > exactly for what they are designed for(as per my understanding).
> >
> > On Tue, Dec 16, 2025 at 8:17 AM Chao Sun <[email protected]> wrote:
> >
> >> Thanks Yao and Nan for the proposal, and thanks everyone for the detailed
> >> and thoughtful discussion.
> >>
> >> Overall, this looks like a valuable addition for organizations running
> >> Spark on Kubernetes, especially given how bursty memoryOverhead usage
> >> tends to be in practice. I appreciate that the change is relatively small
> >> in scope and fully opt-in, which helps keep the risk low.
> >>
> >> From my perspective, the questions raised on the thread and in the SPIP
> >> have been addressed. If others feel the same, do we have consensus to move
> >> forward with a vote? cc Wenchen, Qieqiang, and Karuppayya.
> >>
> >> Best,
> >> Chao
> >>
> >> On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:
> >>
> >>> this is a good question
> >>>
> >>> > a stage is bursty and consumes the shared portion and fails to release
> >>> it for subsequent stages
> >>>
> >>> in the scenario you described, since the memory-leaking stage and the
> >>> subsequence ones are from the same job , the pod will likely be killed by
> >>> cgroup oomkiller
> >>>
> >>> taking the following as the example
> >>>
> >>> the usage pattern is  G = 5GB S = 2GB, it uses G + S at max and in
> >>> theory, it should release all 7G and then claim 7G again in some later
> >>> stages, however, due to the memory peak, it holds 2G forever and ask for
> >>> another 7G, as a result,  it hits the pod memory limit  and cgroup
> >>> oomkiller will take action to terminate the pod
> >>>
> >>> so this should be safe to the system
> >>>
> >>>
> >>>
> >>> however, we should be careful about the memory peak for sure, because it
> >>> essentially breaks the assumption that the usage of memoryOverhead is
> >>> bursty (memory peak ~= use memory forever)... unfortunately,
> >>> shared/guaranteed memory is managed by user applications instead of on
> >>> cluster level , they, especially S, are just logical concepts  instead of 
> >>> a
> >>> physical memory pool which pods can explicitly claim memory from...
> >>>
> >>>
> >>> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]>
> >>> wrote:
> >>>
> >>>> Thanks for the interesting proposal.
> >>>> The design seems to rely on memoryOverhead being transient.
> >>>> What happens when a stage is bursty and consumes the shared portion and
> >>>> fails to release it for subsequent stages (e.g.,  off-heap buffers and 
> >>>> its
> >>>> not garbage collected since its off-heap)? Would this trigger the
> >>>> host-level OOM like described in Q6? or are there strategies to release 
> >>>> the
> >>>> shared portion?
> >>>>
> >>>>
> >>>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
> >>>>
> >>>>> yes, that's the worst case in the scenario, please check my earlier
> >>>>> response to Qiegang's question, we have a set of strategies adopted in 
> >>>>> prod
> >>>>> to mitigate the issue
> >>>>>
> >>>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks for the explanation! So the executor is not guaranteed to get
> >>>>>> 50 GB physical memory, right? All pods on the same host may reach peak
> >>>>>> memory usage at the same time and cause paging/swapping which hurts
> >>>>>> performance?
> >>>>>>
> >>>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> np, let me try to explain
> >>>>>>>
> >>>>>>> 1. Each executor container will be run in a pod together with some
> >>>>>>> other sidecar containers taking care of tasks like authentication, 
> >>>>>>> etc. ,
> >>>>>>> for simplicity, we assume each pod has only one container which is the
> >>>>>>> executor container
> >>>>>>>
> >>>>>>> 2. Each container is assigned with two values, r*equest&limit** (limit
> >>>>>>> >= request),* for both of CPU/memory resources (we only discuss
> >>>>>>> memory here). Each pod will have request/limit values as the sum of 
> >>>>>>> all
> >>>>>>> containers belonging to this pod
> >>>>>>>
> >>>>>>> 3. K8S Scheduler chooses a machine to host a pod based on *request*
> >>>>>>> value, and cap the resource usage of each container based on their
> >>>>>>> *limit* value, e.g. if I have a pod with a single container in it ,
> >>>>>>> and it has 1G/2G as request and limit value respectively, any machine 
> >>>>>>> with
> >>>>>>> 1G free RAM space will be a candidate to host this pod, and when the
> >>>>>>> container use more than 2G memory, it will be killed by cgroup
> >>>>>>> oomkiller. Once a pod is scheduled to a host, the memory space sized 
> >>>>>>> at
> >>>>>>> "sum of all its containers' request values" will be booked 
> >>>>>>> exclusively for
> >>>>>>> this pod.
> >>>>>>>
> >>>>>>> 4. By default, Spark *sets request/limit as the same value for
> >>>>>>> executors in k8s*, and this value is basically
> >>>>>>> spark.executor.memory + spark.executor.memoryOverhead in most cases .
> >>>>>>> However,  spark.executor.memoryOverhead usage is very bursty, the user
> >>>>>>> setting  spark.executor.memoryOverhead as 10G usually means each 
> >>>>>>> executor
> >>>>>>> only needs 10G in a very small portion of the executor's whole 
> >>>>>>> lifecycle
> >>>>>>>
> >>>>>>> 5. The proposed SPIP is essentially to decouple request/limit value
> >>>>>>> in spark@k8s for executors in a safe way (this idea is from the
> >>>>>>> bytedance paper we refer to in SPIP paper).
> >>>>>>>
> >>>>>>> Using the aforementioned example ,
> >>>>>>>
> >>>>>>> if we have a single node cluster with 100G RAM space, we have two
> >>>>>>> pods requesting 40G + 10G (on-heap + memoryOverhead) and we set bursty
> >>>>>>> factor to 1.2, without the mechanism proposed in this SPIP, we can at 
> >>>>>>> most
> >>>>>>> host 2 pods with this machine, and because of the bursty usage of 
> >>>>>>> that 10G
> >>>>>>> space, the memory utilization would be compromised.
> >>>>>>>
> >>>>>>> When applying the burst-aware memory allocation, we only need 40 +
> >>>>>>> 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 
> >>>>>>> 20G free
> >>>>>>> memory space left in the machine which can be used to host some 
> >>>>>>> smaller
> >>>>>>> pods. At the same time, as we didn't change the limit value of the 
> >>>>>>> executor
> >>>>>>> pods, these executors can still use 50G at max.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Sorry I'm not very familiar with the k8s infra, how does it work
> >>>>>>>> under the hood? The container will adjust its system memory size
> >>>>>>>> depending on the actual memory usage of the processes in this 
> >>>>>>>> container?
> >>>>>>>>
> >>>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> yeah, we have a few cases that we have significantly larger O than
> >>>>>>>>> H, the proposed algorithm is actually a great fit
> >>>>>>>>>
> >>>>>>>>> as I explained in SPIP doc Appendix C, the proposed algorithm will
> >>>>>>>>> allocate a non-trivial G to ensure the safety of running but still 
> >>>>>>>>> cut a
> >>>>>>>>> big chunk of memory (10s of GBs) and treat them as S , saving tons 
> >>>>>>>>> of money
> >>>>>>>>> burnt by them
> >>>>>>>>>
> >>>>>>>>> but regarding native accelerators, some native acceleration
> >>>>>>>>> engines do not use memoryOverhead but use off-heap
> >>>>>>>>> (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current
> >>>>>>>>> implementation does not cover this part , while that will be an easy
> >>>>>>>>> extension
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the reply.
> >>>>>>>>>>
> >>>>>>>>>> Have you tested in environments where O is bigger than H?
> >>>>>>>>>> Wondering if the proposed algorithm would help more in those 
> >>>>>>>>>> environments
> >>>>>>>>>> (eg. with
> >>>>>>>>>> native accelerators)?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi, Qiegang, thanks for the good questions as well
> >>>>>>>>>>>
> >>>>>>>>>>> please check the following answer
> >>>>>>>>>>>
> >>>>>>>>>>> > My initial understanding is that Kubernetes will use the 
> >>>>>>>>>>> > Executor
> >>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which allows
> >>>>>>>>>>> for better resource packing.
> >>>>>>>>>>>
> >>>>>>>>>>> yes, your understanding is correct
> >>>>>>>>>>>
> >>>>>>>>>>> > How is the risk of host-level OOM mitigated when the total
> >>>>>>>>>>> potential usage  sum of H+G+S across all pods on a node exceeds 
> >>>>>>>>>>> its
> >>>>>>>>>>> allocatable capacity? Does the proposal implicitly rely on the 
> >>>>>>>>>>> cluster
> >>>>>>>>>>> operator to manually ensure an unrequested memory buffer exists 
> >>>>>>>>>>> on the node
> >>>>>>>>>>> to serve as the shared pool?
> >>>>>>>>>>>
> >>>>>>>>>>> in PINS, we basically apply a set of strategies, setting
> >>>>>>>>>>> conservative bursty factor, progressive rollout, monitor the 
> >>>>>>>>>>> cluster
> >>>>>>>>>>> metrics like Linux Kernel OOMKiller occurrence to guide us to the 
> >>>>>>>>>>> optimal
> >>>>>>>>>>> setup of bursty factor... in usual, K8S operators will set a 
> >>>>>>>>>>> reserved space
> >>>>>>>>>>> for daemon processes on each host, we found it is sufficient to 
> >>>>>>>>>>> in our case
> >>>>>>>>>>> and our major tuning focuses on bursty factor value
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> > Have you considered scheduling optimizations to ensure a
> >>>>>>>>>>> strategic mix of executors with large S and small S values on a 
> >>>>>>>>>>> single
> >>>>>>>>>>> node?  I am wondering if this would reduce the probability of 
> >>>>>>>>>>> concurrent
> >>>>>>>>>>> bursting and host-level OOM.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, when we work on this project, we put some attention on the
> >>>>>>>>>>> cluster scheduling policy/behavior... two things we mostly care 
> >>>>>>>>>>> about
> >>>>>>>>>>>
> >>>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have certain
> >>>>>>>>>>> level of diversity of workloads so that we have enough candidates 
> >>>>>>>>>>> to form a
> >>>>>>>>>>> mixed set of executors with large S and small S values
> >>>>>>>>>>>
> >>>>>>>>>>> 2. we avoid using binpack scheduling algorithm which tends to
> >>>>>>>>>>> pack more pods from the same job to the same host, which can 
> >>>>>>>>>>> create
> >>>>>>>>>>> troubles as they are more likely to ask for max memory at the 
> >>>>>>>>>>> same time
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for sharing this interesting proposal.
> >>>>>>>>>>>>
> >>>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor
> >>>>>>>>>>>> Memory Request (H + G) for scheduling decisions, which allows
> >>>>>>>>>>>> for better resource packing.  I have a few questions regarding
> >>>>>>>>>>>> the shared portion S:
> >>>>>>>>>>>>
> >>>>>>>>>>>>    1. How is the risk of host-level OOM mitigated when the
> >>>>>>>>>>>>    total potential usage  sum of H+G+S across all pods on a node 
> >>>>>>>>>>>> exceeds its
> >>>>>>>>>>>>    allocatable capacity? Does the proposal implicitly rely on 
> >>>>>>>>>>>> the cluster
> >>>>>>>>>>>>    operator to manually ensure an unrequested memory buffer 
> >>>>>>>>>>>> exists on the node
> >>>>>>>>>>>>    to serve as the shared pool?
> >>>>>>>>>>>>    2. Have you considered scheduling optimizations to ensure a
> >>>>>>>>>>>>    strategic mix of executors with large S and small S values
> >>>>>>>>>>>>    on a single node?  I am wondering if this would reduce the 
> >>>>>>>>>>>> probability of
> >>>>>>>>>>>>    concurrent bursting and host-level OOM.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I think I'm still missing something in the big picture:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>    - Is the memory overhead off-heap? The formular indicates
> >>>>>>>>>>>>>    a fixed heap size, and memory overhead can't be dynamic if 
> >>>>>>>>>>>>> it's on-heap.
> >>>>>>>>>>>>>    - Do Spark applications have static profiles? When we
> >>>>>>>>>>>>>    submit stages, the cluster is already allocated, how can we 
> >>>>>>>>>>>>> change anything?
> >>>>>>>>>>>>>    - How do we assign the shared memory overhead? Fairly
> >>>>>>>>>>>>>    among all applications on the same physical node?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> we didn't separate the design into another doc since the main
> >>>>>>>>>>>>>> idea is relatively simple...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> for request/limit calculation, I described it in Q4 of the
> >>>>>>>>>>>>>> SPIP doc
> >>>>>>>>>>>>>> https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> it is calculated based on per profile (you can say it is
> >>>>>>>>>>>>>> based on per stage), when the cluster manager compose the pod 
> >>>>>>>>>>>>>> spec, it
> >>>>>>>>>>>>>> calculates the new memory overhead based on what user asks for 
> >>>>>>>>>>>>>> in that
> >>>>>>>>>>>>>> resource profile
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <
> >>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Do we have a design sketch? How to determine the memory
> >>>>>>>>>>>>>>> request and limit? Is it per stage or per executor?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <
> >>>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> yeah, the implementation is basically relying on the
> >>>>>>>>>>>>>>>> request/limit concept in K8S, ...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> but if there is any other cluster manager coming in
> >>>>>>>>>>>>>>>> future,  as long as it has a similar concept , it can 
> >>>>>>>>>>>>>>>> leverage this easily
> >>>>>>>>>>>>>>>> as the main logic is implemented in ResourceProfile
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <
> >>>>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This feature is only available on k8s because it allows
> >>>>>>>>>>>>>>>>> containers to have dynamic resources?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Folks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation
> >>>>>>>>>>>>>>>>>> algorithm for Spark@K8S to improve memory utilization of
> >>>>>>>>>>>>>>>>>> spark clusters.
> >>>>>>>>>>>>>>>>>> Please see more details in SPIP doc
> >>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>.
> >>>>>>>>>>>>>>>>>> Feedbacks and discussions are welcomed.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks Chao for being shepard of this feature.
> >>>>>>>>>>>>>>>>>> Also want to thank the authors of the original paper
> >>>>>>>>>>>>>>>>>> <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from
> >>>>>>>>>>>>>>>>>> ByteDance, specifically Rui([email protected])
> >>>>>>>>>>>>>>>>>> and Yixin([email protected]).
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>>>>>> Yao Wang
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> 

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

Re: SPIP: Burst-aware memoryOverhead allocation algorithm for Spark@K8S

Reply via email to