Sundeep K created SPARK-47556:
---------------------------------

             Summary: [K8] Spark App ID collision resulting in deleting wrong resources
                 Key: SPARK-47556
                 URL: https://issues.apache.org/jira/browse/SPARK-47556
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Core
    Affects Versions: 3.5.1, 3.3.2
            Reporter: Sundeep K
h3. Issue

We noticed that K8s executor pods sometimes go into a crash loop with 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we found that two Spark jobs had launched with the same application ID, and when one of them finished first it deleted all of its resources, taking the other job's resources down with it.

-> The Spark application ID is generated in Spark core as "spark-application-" + System.currentTimeMillis. This means that if 2 applications launch in the same millisecond they end up with the same app ID.
-> The [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to every resource created by the driver, and its value is the application ID. The Kubernetes scheduler backend deletes all resources with the same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination. This deletes the config map and executor pods of the job that is still running; its driver tries to relaunch the executor pods, but since the config map is no longer present, they stay in a crash loop.

h3. Context

We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our Spark jobs using PySpark. We launch multiple Spark jobs within a given k8s namespace. Each Spark job can be launched from different pods, or from different processes in a pod. Every time a job is launched it has a unique app name.
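For illustration, here is a minimal Python sketch of the timestamp-only ID scheme described above (the real code is Scala; {{make_app_id}} is a hypothetical stand-in). Millisecond resolution is the only source of uniqueness, so two launches in the same millisecond collide:

{code:python}
import time

def make_app_id() -> str:
    # Mimics Spark's default scheme: "spark-application-" + System.currentTimeMillis.
    # The current time in milliseconds is the only distinguishing component.
    return f"spark-application-{int(time.time() * 1000)}"

# Many launches within the same millisecond produce identical IDs,
# so the spark-app-selector label no longer distinguishes the jobs.
ids = [make_app_id() for _ in range(1000)]
print(len(set(ids)), "unique IDs out of", len(ids))
{code}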
Here is how the job is launched (omitting irrelevant details):
{code:python}
# spark_conf has the settings required for Spark on k8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp') \
    .master(f'k8s://{kubernetes_host}')

session = sp.getOrCreate()
with session:
    session.sql('SELECT 1')
{code}

h3. Repro

Set the same app ID in the Spark config of 2 different jobs, one that finishes fast and one that runs slow. The slower job goes into a crash loop.
{code:java}
"spark.app.id": "<same Id for 2 spark jobs>"{code}

h3. Workaround

Set a unique spark.app.id for all the jobs that run on k8s, truncated to 63 characters (the maximum length of a Kubernetes label value), e.g.:
{code:python}
"spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}

h3. Fix

Add a unique hash at the end of the application ID: [https://github.com/apache/spark/pull/45712]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
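The workaround can be sketched as a small helper ({{unique_app_id}} is a hypothetical name; the 63-character cap matches the Kubernetes label-value limit):

{code:python}
import time
import uuid

def unique_app_id(app_name: str) -> str:
    # Append the current time in ms plus a random UUID so that concurrent
    # launches never collide, then truncate to 63 characters: the maximum
    # length of a Kubernetes label value (spark-app-selector).
    raw = f"{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    return raw[:63]

# Set it on the builder before getOrCreate(), e.g.:
# SparkSession.builder.config("spark.app.id", unique_app_id("testapp"))
{code}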