[ 
https://issues.apache.org/jira/browse/SPARK-54596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-54596:
----------------------------
    Description: 
{{memoryOverhead}} is one of the largest contributors to memory consumption in 
Spark workloads. To keep executor pods from being killed by the OOM Killer in 
cluster environments, users typically reserve a large {{memoryOverhead}} budget. 
However, the actual usage of this space (serialization/deserialization buffers, 
compression/decompression workspaces, direct memory allocations, etc.) is highly 
bursty. As a result, users must provision for peak {{memoryOverhead}} even though 
most Spark tasks use only a small fraction of it for the majority of their 
lifespan.

To address this inefficiency in {*}Spark on Kubernetes{*}, we implemented the 
algorithm proposed in the paper _“Towards Resource Efficiency: Practical 
Insights into Large-Scale Spark Workloads at ByteDance”_ (VLDB 2024). Our 
approach introduces a configurable parameter that splits a user’s requested 
{{memoryOverhead}} into two segments (a sketch of the split follows this list):
 * *Guaranteed memoryOverhead* — exclusive to each Spark pod, ensuring baseline 
reliability.

 * *Shared memoryOverhead* — pooled and time-shared across multiple pods to 
absorb bursty usage.
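
A minimal sketch of the split, assuming a single fraction-style knob. The name 
{{sharedFraction}} and the helper below are illustrative placeholders for this 
example, not the actual configuration introduced by the change:
{code:scala}
// Illustrative sketch only: splits the user-requested memoryOverhead into a
// guaranteed part (exclusive to the pod) and a shared part (pooled and
// time-shared across pods). The fraction-style knob is an assumption here.
object MemoryOverheadSplit {

  def split(requestedOverheadMiB: Long, sharedFraction: Double): (Long, Long) = {
    require(sharedFraction >= 0.0 && sharedFraction <= 1.0,
      s"sharedFraction must be in [0, 1], got $sharedFraction")
    val sharedMiB = math.round(requestedOverheadMiB * sharedFraction)
    val guaranteedMiB = requestedOverheadMiB - sharedMiB
    (guaranteedMiB, sharedMiB)
  }
}

// Example: 4096 MiB of requested memoryOverhead with sharedFraction = 0.5
// yields (2048 MiB guaranteed, 2048 MiB shared).
{code}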

In the Kubernetes environment, this translates to the following resource 
settings for each Spark executor pod (see the sketch after the list):
 * *memory request* = user-requested on-heap memory + guaranteed 
{{memoryOverhead}}

 * *memory limit* = user-requested on-heap memory + guaranteed 
{{memoryOverhead}} + shared {{memoryOverhead}}
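
A rough illustration of how these two formulas map onto the executor pod's 
memory request and limit (values in MiB; the case class and helper are 
placeholders for illustration, not code from the patch):
{code:scala}
// Illustrative sketch only: derives the pod-level memory request and limit
// from on-heap memory plus the guaranteed/shared memoryOverhead split.
final case class ExecutorPodMemory(requestMiB: Long, limitMiB: Long)

object ExecutorPodMemory {
  def from(onHeapMiB: Long,
           guaranteedOverheadMiB: Long,
           sharedOverheadMiB: Long): ExecutorPodMemory =
    ExecutorPodMemory(
      // memory request = on-heap memory + guaranteed memoryOverhead
      requestMiB = onHeapMiB + guaranteedOverheadMiB,
      // memory limit = on-heap memory + guaranteed memoryOverhead + shared memoryOverhead
      limitMiB = onHeapMiB + guaranteedOverheadMiB + sharedOverheadMiB
    )
}

// Example: 8192 MiB on-heap, 2048 MiB guaranteed, 2048 MiB shared
//   => request = 10240 MiB, limit = 12288 MiB
{code}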

This design preserves reliability while substantially improving cluster-level 
memory utilization by avoiding the over-provisioning traditionally required for 
worst-case {{memoryOverhead}} peaks.

  was:
memoryOverhead is one of the most significant memory consumers for Spark 
workloads. Users tend to reserve a big chunk of this memory space to avoid 
triggering the OOMKiller in the cluster environment. 

However, the usage pattern of the memoryOverhead space, such as 
serialization/deserialization, compression/decompression, direct memory, etc., 
is usually bursty. As a result, users have to book the peak usage of 
memoryOverhead to ensure the reliability of their jobs, while leaving a big 
chunk of this space unused for the majority of the job lifecycle. 


To resolve this problem in the context of Spark@K8S, we implemented the 
algorithm presented in the paper "Towards Resource Efficiency: Practical 
Insights into Large-Scale Spark Workloads at ByteDance" from the ByteDance team 
([https://www.vldb.org/pvldb/vol17/p3759-shi.pdf]). We introduce a configurable 
parameter to split the user-requested memoryOverhead space into two parts, 
guaranteed and shared. The guaranteed part is dedicated to a pod started by a 
Spark job, and the shared part is used by multiple pods in a time-sharing 
fashion. Specifically, in the K8S environment, the memory request of a pod will 
be the user-requested on-heap space + guaranteed memoryOverhead, and the memory 
limit will be the user-requested on-heap space + guaranteed memoryOverhead + 
shared memoryOverhead.


> Burst-aware memoryOverhead Allocation Algorithm for Spark@K8S
> -------------------------------------------------------------
>
>                 Key: SPARK-54596
>                 URL: https://issues.apache.org/jira/browse/SPARK-54596
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes, Spark Core
>    Affects Versions: 4.2
>            Reporter: Nan Zhu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
