[ 
https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47556:
----------------------------------
    Fix Version/s:     (was: 3.3.0)

> [K8] Spark App ID collision resulting in deleting wrong resources
> -----------------------------------------------------------------
>
>                 Key: SPARK-47556
>                 URL: https://issues.apache.org/jira/browse/SPARK-47556
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 3.1
>            Reporter: Sundeep K
>            Priority: Major
>
> h3. Issue:
> We noticed that K8s executor pods sometimes go into a crash loop with 
> 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon 
> investigation we found that 2 Spark jobs had launched with the same 
> application ID, and when one of them finished first it deleted all of its 
> resources along with the resources of the other job.
> -> The Spark application ID is created using this 
> [code|https://github.com/apache/spark/blob/36126a5c1821b4418afd5788963a939ea7f64078/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L38]:
> "spark-application-" + System.currentTimeMillis
> This means that if 2 applications launch in the same millisecond they can end 
> up with the same app ID (see the sketch at the end of this section).
> -> The 
> [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23]
>  label is added to every resource created by the driver, and its value is the 
> application ID. The Kubernetes scheduler backend deletes all resources with the same 
> [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6]
>  upon termination.
> This results in deletion of the config map and executor pods of the job that is 
> still running. The driver tries to relaunch the executor pods, but because the 
> config map is no longer present, they stay in a crash loop.
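> To make the collision concrete, here is a minimal Python sketch (an 
> illustration mirroring the Scala expression above, not Spark's actual code 
> path) of how two submissions in the same millisecond produce the same ID:
> {code:python}
> import time
>
> def default_app_id():
>     # Same shape as "spark-application-" + System.currentTimeMillis
>     return f"spark-application-{int(time.time() * 1000)}"
>
> # Two drivers started within the same millisecond compute identical IDs, so
> # both label their K8s resources (executor pods, config map) with the same
> # spark-app-selector value. Whichever job finishes first deletes by that
> # label and takes the other job's resources with it.
> id_a = default_app_id()
> id_b = default_app_id()
> print(id_a == id_b)  # True whenever the two calls fall in the same millisecond
> {code}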
> h3. Context
> We are using [Spark on 
> Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] 
> and launch our Spark jobs using PySpark. We launch multiple Spark jobs within 
> a given K8s namespace. Each Spark job can be launched from a different pod or 
> from a different process within a pod. Every job is launched with a unique 
> app name. Here is how a job is launched (omitting irrelevant details):
> {code:python}
> # spark_conf has the settings required for Spark on K8s;
> # kubernetes_host is the API server host of the target cluster
> from pyspark.sql import SparkSession
>
> sp = SparkSession.builder \
>     .config(conf=spark_conf) \
>     .appName('testapp') \
>     .master(f'k8s://{kubernetes_host}')
> session = sp.getOrCreate()
> with session:
>     session.sql('SELECT 1')
> {code}
> h3. Repro
> Set the same app ID in the Spark config and run 2 different jobs, one that 
> finishes fast and one that runs slow. The slower job goes into a crash loop:
> {code:python}
> "spark.app.id": "<same ID for both Spark jobs>"{code}
> h3. Workaround
> Set a unique spark.app.id for every job that runs on K8s,
> e.g.:
> {code:python}
> "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}
> h3. Fix
> Add a unique hash at the end of the application ID: 
> [https://github.com/apache/spark/pull/45712] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
