[
https://issues.apache.org/jira/browse/SPARK-54596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nan Zhu updated SPARK-54596:
----------------------------
Description:
{{memoryOverhead}} is one of the largest contributors to memory consumption in
Spark workloads. To avoid being killed by the OOM Killer in cluster
environments, users typically reserve a large {{memoryOverhead}} budget.
However, the actual usage pattern of this space—covering
serialization/deserialization buffers, compression/decompression workspaces,
direct memory allocations, etc.—is highly bursty. As a result, users must
provision for peak {{memoryOverhead}} even though most Spark tasks utilize only
a small fraction of it for the majority of their lifespan.
To address this inefficiency in {*}Spark on Kubernetes{*}, we implemented the
algorithm proposed in the paper _“Towards Resource Efficiency: Practical
Insights into Large-Scale Spark Workloads at ByteDance”_ (VLDB 2024). Our
approach introduces a configurable parameter that splits a user’s requested
{{memoryOverhead}} into two segments:
* *Guaranteed memoryOverhead* — exclusive to each Spark pod, ensuring baseline
reliability.
* *Shared memoryOverhead* — pooled and time-shared across multiple pods to
absorb bursty usage.
In the Kubernetes environment, this translates to the following resource
settings for each Spark executor pod:
* *memory request* = user-requested on-heap memory + guaranteed
{{memoryOverhead}}
* *memory limit* = user-requested on-heap memory + guaranteed
{{memoryOverhead}} + shared {{memoryOverhead}}
This design preserves reliability while substantially improving cluster-level
memory utilization by avoiding the over-provisioning traditionally required for
worst-case {{memoryOverhead}} peaks.
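For illustration, the sketch below shows how the guaranteed/shared split could
translate into pod memory request/limit values. It is a minimal sketch under the
description above, not the actual patch; the {{guaranteedFraction}} parameter and
helper names are assumptions made for this example.
{code:scala}
// Minimal sketch of the guaranteed/shared memoryOverhead split; NOT the actual
// SPARK-54596 patch. The config name and helpers are hypothetical.
object BurstAwareOverheadSketch {

  /** Pod-level memory settings derived from the guaranteed/shared split (MiB). */
  final case class PodMemorySpec(requestMiB: Long, limitMiB: Long)

  /**
   * @param onHeapMiB          user-requested executor on-heap memory (spark.executor.memory)
   * @param memoryOverheadMiB  user-requested memoryOverhead (spark.executor.memoryOverhead)
   * @param guaranteedFraction fraction of memoryOverhead kept exclusive to this pod;
   *                           the remainder is pooled and time-shared across pods
   */
  def podMemorySpec(
      onHeapMiB: Long,
      memoryOverheadMiB: Long,
      guaranteedFraction: Double): PodMemorySpec = {
    require(guaranteedFraction >= 0.0 && guaranteedFraction <= 1.0,
      "guaranteedFraction must be in [0, 1]")
    val guaranteedOverheadMiB = math.ceil(memoryOverheadMiB * guaranteedFraction).toLong
    val sharedOverheadMiB = memoryOverheadMiB - guaranteedOverheadMiB

    // memory request = on-heap + guaranteed overhead (reserved per pod by the scheduler)
    val requestMiB = onHeapMiB + guaranteedOverheadMiB
    // memory limit = request + shared overhead (bursts can dip into the time-shared pool)
    val limitMiB = requestMiB + sharedOverheadMiB
    PodMemorySpec(requestMiB, limitMiB)
  }

  def main(args: Array[String]): Unit = {
    // Example: 8 GiB on-heap, 4 GiB memoryOverhead, 25% guaranteed
    val spec = podMemorySpec(onHeapMiB = 8192, memoryOverheadMiB = 4096, guaranteedFraction = 0.25)
    println(s"request=${spec.requestMiB} MiB, limit=${spec.limitMiB} MiB")
    // prints: request=9216 MiB, limit=12288 MiB
  }
}
{code}
For example, with 8 GiB of on-heap memory, 4 GiB of {{memoryOverhead}}, and a
guaranteed fraction of 0.25, the pod requests 9216 MiB and is limited to 12288 MiB.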
was:
memoryOverhead is one of the most significant memory consumers for Spark
workloads. Users tend to reserve a big chunk of this memory space to avoid
triggering the OOM killer in the cluster environment.
However, the usage pattern of the memoryOverhead space, such as
serialization/deserialization, compression/decompression, direct memory, etc.,
is usually bursty. As a result, users have to provision for the peak usage of
memoryOverhead to ensure the reliability of their jobs, while leaving a big
chunk of this space unused for the majority of the job lifecycle.
To resolve this problem in the context of Spark@K8S, we implemented the
algorithm presented in the paper "Towards Resource Efficiency: Practical
Insights into Large-Scale Spark Workloads at ByteDance" from the ByteDance team
([https://www.vldb.org/pvldb/vol17/p3759-shi.pdf]). We introduce a
configurable parameter to split the user-requested memoryOverhead space into
two parts, guaranteed and shared. The guaranteed part is dedicated to a pod
started by a Spark job, and the shared part is used by multiple pods in a
time-sharing fashion. Specific to the K8S environment, the memory request of a
pod will be the user-requested on-heap space + guaranteed memoryOverhead, and
the memory limit will be the user-requested on-heap space + guaranteed
memoryOverhead + shared memoryOverhead.
> Burst-aware memoryOverhead Allocation Algorithm for Spark@K8S
> -------------------------------------------------------------
>
> Key: SPARK-54596
> URL: https://issues.apache.org/jira/browse/SPARK-54596
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Core
> Affects Versions: 4.2
> Reporter: Nan Zhu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]