Thanks Yao and Nan for the proposal, and thanks everyone for the detailed and thoughtful discussion.
Overall, this looks like a valuable addition for organizations running Spark on Kubernetes, especially given how bursty memoryOverhead usage tends to be in practice. I appreciate that the change is relatively small in scope and fully opt-in, which helps keep the risk low. From my perspective, the questions raised on the thread and in the SPIP have been addressed. If others feel the same, do we have consensus to move forward with a vote?

cc Wenchen, Qiegang, and Karuppayya.

Best,
Chao

On Thu, Dec 11, 2025 at 11:32 PM Nan Zhu <[email protected]> wrote:

> this is a good question
>
> > a stage is bursty and consumes the shared portion and fails to release it for subsequent stages
>
> in the scenario you described, since the memory-leaking stage and the subsequent ones are from the same job, the pod will likely be killed by the cgroup OOM killer
>
> taking the following as an example:
>
> the usage pattern is G = 5GB, S = 2GB; the job uses G + S at max, and in theory it should release all 7G and then claim 7G again in some later stage. However, due to the memory leak, it holds 2G forever and asks for another 7G; as a result, it hits the pod memory limit and the cgroup OOM killer takes action to terminate the pod
>
> so this should be safe for the system
>
> however, we should be careful about memory leaks for sure, because they essentially break the assumption that memoryOverhead usage is bursty (a leak ~= using the memory forever)... unfortunately, shared/guaranteed memory is managed by user applications instead of at the cluster level; they, especially S, are just logical concepts instead of a physical memory pool which pods can explicitly claim memory from...
>
> On Thu, Dec 11, 2025 at 10:17 PM karuppayya <[email protected]> wrote:
>
>> Thanks for the interesting proposal.
>> The design seems to rely on memoryOverhead being transient.
>> What happens when a stage is bursty, consumes the shared portion, and fails to release it for subsequent stages (e.g., off-heap buffers, which are not garbage collected since they are off-heap)? Would this trigger the host-level OOM described in Q6? Or are there strategies to release the shared portion?
>>
>> On Thu, Dec 11, 2025 at 6:24 PM Nan Zhu <[email protected]> wrote:
>>
>>> yes, that's the worst case in this scenario; please check my earlier response to Qiegang's question, we have a set of strategies adopted in prod to mitigate the issue
>>>
>>> On Thu, Dec 11, 2025 at 6:21 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> Thanks for the explanation! So the executor is not guaranteed to get 50 GB of physical memory, right? All pods on the same host may reach peak memory usage at the same time and cause paging/swapping, which hurts performance?
>>>>
>>>> On Fri, Dec 12, 2025 at 10:12 AM Nan Zhu <[email protected]> wrote:
>>>>
>>>>> np, let me try to explain
>>>>>
>>>>> 1. Each executor container runs in a pod together with some other sidecar containers taking care of tasks like authentication, etc. For simplicity, we assume each pod has only one container, which is the executor container
>>>>>
>>>>> 2. Each container is assigned two values, request & limit (limit >= request), for both CPU and memory resources (we only discuss memory here). Each pod's request/limit values are the sums of those of all containers belonging to the pod
>>>>>
>>>>> 3. The K8s scheduler chooses a machine to host a pod based on the request value, and caps the resource usage of each container based on its limit value, e.g.
>>>>> if I have a pod with a single container in it, and the container has 1G/2G as its request and limit values respectively, then any machine with 1G of free RAM is a candidate to host this pod, and when the container uses more than 2G of memory, it will be killed by the cgroup OOM killer. Once a pod is scheduled to a host, a memory space sized at "the sum of all its containers' request values" is booked exclusively for this pod.
>>>>>
>>>>> 4. By default, Spark sets request and limit to the same value for executors on K8s, and this value is basically spark.executor.memory + spark.executor.memoryOverhead in most cases. However, spark.executor.memoryOverhead usage is very bursty: a user setting spark.executor.memoryOverhead to 10G usually means each executor needs the full 10G for only a very small portion of its lifecycle
>>>>>
>>>>> 5. The proposed SPIP essentially decouples the request and limit values for executors in Spark@K8s in a safe way (this idea is from the ByteDance paper we refer to in the SPIP doc).
>>>>>
>>>>> Using the aforementioned example:
>>>>>
>>>>> if we have a single-node cluster with 100G of RAM, two pods each requesting 40G + 10G (on-heap + memoryOverhead), and a bursty factor of 1.2, then without the mechanism proposed in this SPIP we can host at most 2 pods on this machine, and because of the bursty usage of that 10G space, memory utilization is compromised.
>>>>>
>>>>> When applying burst-aware memory allocation, we only need 40 + 10 - min((40 + 10) * 0.2, 10) = 40G to host each pod, i.e. we have 20G of free memory left on the machine, which can be used to host some smaller pods. At the same time, since we didn't change the limit value of the executor pods, these executors can still use 50G at max.
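For concreteness, the split described above can be sketched in a few lines of Python. This is a toy model, not Spark's implementation; the relation "shared fraction = bursty factor - 1" is inferred from the worked numbers in this thread (factor 1.2 with 0.2 in the formula), and all names are made up for the example:

```python
# Toy model of the burst-aware request/limit split described above.
# Not Spark's implementation; shared_fraction = bursty_factor - 1 is
# inferred from the worked numbers in this thread.

def pod_memory(heap_gb, overhead_gb, bursty_factor):
    total = heap_gb + overhead_gb
    # Shared portion S, capped at the full memoryOverhead O.
    shared = min(total * (bursty_factor - 1.0), overhead_gb)
    request = total - shared  # what the K8s scheduler books on the node
    limit = total             # what the cgroup OOM killer enforces
    return request, limit

# The example from the thread: 40G heap + 10G overhead, bursty factor 1.2.
request, limit = pod_memory(heap_gb=40, overhead_gb=10, bursty_factor=1.2)
print(round(request), limit)  # 40 50
```

On the thread's single 100G node, two such pods then book only 80G instead of 100G, leaving 20G for smaller pods, while each executor can still burst up to its unchanged 50G limit.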
>>>>>
>>>>> On Thu, Dec 11, 2025 at 5:42 PM Wenchen Fan <[email protected]> wrote:
>>>>>
>>>>>> Sorry, I'm not very familiar with the K8s infra; how does it work under the hood? Will the container adjust its system memory size depending on the actual memory usage of the processes in the container?
>>>>>>
>>>>>> On Fri, Dec 12, 2025 at 2:49 AM Nan Zhu <[email protected]> wrote:
>>>>>>
>>>>>>> yeah, we have a few cases with significantly larger O than H, and the proposed algorithm is actually a great fit
>>>>>>>
>>>>>>> as I explained in Appendix C of the SPIP doc, the proposed algorithm will allocate a non-trivial G to ensure safe execution, but still cut a big chunk of memory (10s of GBs) and treat it as S, saving the tons of money burnt by it
>>>>>>>
>>>>>>> regarding native accelerators, though, some native acceleration engines do not use memoryOverhead but use off-heap memory (spark.memory.offHeap.size) explicitly (e.g. Gluten). The current implementation does not cover this part, though that would be an easy extension
>>>>>>>
>>>>>>> On Thu, Dec 11, 2025 at 10:42 AM Qiegang Long <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> Have you tested in environments where O is bigger than H? Wondering if the proposed algorithm would help more in those environments (e.g. with native accelerators)?
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 12:48 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Qiegang, thanks for the good questions as well
>>>>>>>>>
>>>>>>>>> please check the following answers
>>>>>>>>>
>>>>>>>>> > My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing.
>>>>>>>>>
>>>>>>>>> yes, your understanding is correct
>>>>>>>>>
>>>>>>>>> > How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>
>>>>>>>>> in PINS, we basically apply a set of strategies: setting a conservative bursty factor, progressive rollout, and monitoring cluster metrics like Linux kernel OOM killer occurrences to guide us to the optimal bursty factor setup... usually, K8s operators set aside a reserved space for daemon processes on each host; we found it sufficient in our case, and our major tuning focuses on the bursty factor value
>>>>>>>>>
>>>>>>>>> > Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>
>>>>>>>>> Yes, when we worked on this project, we paid some attention to the cluster scheduling policy/behavior... two things we mostly care about:
>>>>>>>>>
>>>>>>>>> 1. as stated in the SPIP doc, the cluster should have a certain level of workload diversity so that we have enough candidates to form a mixed set of executors with large S and small S values
>>>>>>>>>
>>>>>>>>> 2. we avoid using a binpack scheduling algorithm, which tends to pack more pods from the same job onto the same host; this can create trouble, as those pods are more likely to ask for max memory at the same time
>>>>>>>>>
>>>>>>>>> On Tue, Dec 9, 2025 at 7:11 AM Qiegang Long <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for sharing this interesting proposal.
>>>>>>>>>>
>>>>>>>>>> My initial understanding is that Kubernetes will use the Executor Memory Request (H + G) for scheduling decisions, which allows for better resource packing. I have a few questions regarding the shared portion S:
>>>>>>>>>>
>>>>>>>>>> 1. How is the risk of host-level OOM mitigated when the total potential usage sum of H+G+S across all pods on a node exceeds its allocatable capacity? Does the proposal implicitly rely on the cluster operator to manually ensure an unrequested memory buffer exists on the node to serve as the shared pool?
>>>>>>>>>> 2. Have you considered scheduling optimizations to ensure a strategic mix of executors with large S and small S values on a single node? I am wondering if this would reduce the probability of concurrent bursting and host-level OOM.
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 9, 2025 at 2:49 AM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think I'm still missing something in the big picture:
>>>>>>>>>>>
>>>>>>>>>>> - Is the memory overhead off-heap? The formula indicates a fixed heap size, and memory overhead can't be dynamic if it's on-heap.
>>>>>>>>>>> - Do Spark applications have static profiles? When we submit stages, the cluster is already allocated, how can we change anything?
>>>>>>>>>>> - How do we assign the shared memory overhead?
>>>>>>>>>>> Fairly among all applications on the same physical node?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:15 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> we didn't separate the design into another doc since the main idea is relatively simple...
>>>>>>>>>>>>
>>>>>>>>>>>> for the request/limit calculation, I described it in Q4 of the SPIP doc: https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.q4vjslmnfuo0
>>>>>>>>>>>>
>>>>>>>>>>>> it is calculated per resource profile (you can say it is per stage); when the cluster manager composes the pod spec, it calculates the new memory overhead based on what the user asks for in that resource profile
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:49 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do we have a design sketch? How do we determine the memory request and limit? Is it per stage or per executor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:40 PM Nan Zhu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> yeah, the implementation basically relies on the request/limit concept in K8s...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but if any other cluster manager comes along in the future, as long as it has a similar concept, it can leverage this easily, as the main logic is implemented in ResourceProfile
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 9:34 PM Wenchen Fan <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This feature is only available on K8s because it allows containers to have dynamic resources?
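To connect this with the per-resource-profile calculation Nan describes in the thread, the container resources stanza the cluster manager composes might look roughly like the sketch below. This is a hedged illustration only: the helper name and parameters are invented, none of it is Spark's actual code, and the shared-fraction relation (bursty factor minus 1) is inferred from the worked example earlier in the thread.

```python
# Illustrative sketch only: composes a K8s-style container resources
# stanza per resource profile, decoupling request from limit.
# None of these names come from Spark's codebase.

def executor_resources(profile_heap_gb, profile_overhead_gb, bursty_factor):
    total = profile_heap_gb + profile_overhead_gb
    # Shared portion S, capped at the profile's full memoryOverhead.
    shared = min(total * (bursty_factor - 1.0), profile_overhead_gb)
    return {
        "requests": {"memory": f"{total - shared:.0f}Gi"},  # used for scheduling
        "limits": {"memory": f"{total:.0f}Gi"},             # enforced by cgroups
    }

# Two different resource profiles yield two different pod specs.
print(executor_resources(40, 10, 1.2))
print(executor_resources(8, 4, 1.2))
```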
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 12:46 PM Yao <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are proposing a burst-aware memoryOverhead allocation algorithm for Spark@K8s to improve the memory utilization of Spark clusters. Please see more details in the SPIP doc <https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0>. Feedback and discussion are welcome.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Chao for being the shepherd of this feature. We also want to thank the authors of the original paper <https://www.vldb.org/pvldb/vol17/p3759-shi.pdf> from ByteDance, specifically Rui ([email protected]) and Yixin ([email protected]).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>> Yao Wang
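Finally, the worst-case leak scenario discussed earlier in the thread (G = 5GB, S = 2GB) can be checked with a toy model of the cgroup limit. The names below are illustrative only; this is not K8s or Spark API, just the arithmetic behind Nan's reasoning:

```python
# Toy check of the leak scenario from the thread: the pod's limit stays
# at G + S = 7GB, so a stage that leaks 2GB plus a later stage claiming
# the full 7GB crosses the limit, and the cgroup OOM killer terminates
# the pod (keeping the rest of the host safe). Illustrative names only.

LIMIT_GB = 5 + 2  # G + S

def pod_survives(leaked_gb, claimed_gb):
    # cgroup semantics: usage above the limit gets the pod killed.
    return leaked_gb + claimed_gb <= LIMIT_GB

print(pod_survives(0, 7))  # True: healthy stage releases, then reclaims 7GB
print(pod_survives(2, 7))  # False: 2GB leaked + 7GB claimed > 7GB limit
```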
