[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-47556:
-----------------------------------
    Labels: pull-request-available  (was: )

> [K8] Spark App ID collision resulting in deleting wrong resources
> ------------------------------------------------------------------
>
>                 Key: SPARK-47556
>                 URL: https://issues.apache.org/jira/browse/SPARK-47556
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 3.1
>            Reporter: Sundeep K
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Issue:
> We noticed that K8s executor pods sometimes go into a crash loop with 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we found two Spark jobs that had launched with the same application ID; when the first of them finished, it deleted all of its resources and also deleted the resources of the other job.
> -> The Spark application ID is created by this [code|https://github.com/apache/spark/blob/36126a5c1821b4418afd5788963a939ea7f64078/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L38]: "spark-application-" + System.currentTimeMillis. This means that if two applications launch in the same millisecond, they can end up with the same app ID.
> -> The [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to every resource created by the driver, and its value is the application ID. The Kubernetes scheduler backend deletes all resources carrying the same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination.
> This results in the deletion of the config map and executor pods of the job that is still running. Its driver tries to relaunch the executor pods, but the config map is no longer present, so they stay in a crash loop.
> h3. Context
> We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our Spark jobs using PySpark. We launch multiple Spark jobs within a given K8s namespace. Each Spark job can be launched from different pods, or from different processes within a pod. Every job is launched with a unique app name. Here is how a job is launched (omitting irrelevant details):
> {code:java}
> # spark_conf holds the settings required for Spark on K8s
> from pyspark.sql import SparkSession
>
> sp = SparkSession.builder \
>     .config(conf=spark_conf) \
>     .appName('testapp')
> sp.master(f'k8s://{kubernetes_host}')
> session = sp.getOrCreate()
> with session:
>     session.sql('SELECT 1'){code}
> h3. Repro
> Set the same app ID in the Spark config and run two different jobs, one that finishes fast and one that runs slow. The slower job goes into a crash loop.
> {code:java}
> "spark.app.id": "<same ID for both Spark jobs>"{code}
> h3. Workaround
> Set a unique spark.app.id for every job that runs on K8s, e.g.:
> {code:java}
> "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code}
> h3. Fix
> Append a unique hash to the end of the application ID: [https://github.com/apache/spark/pull/45712]
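> For illustration, a minimal, self-contained sketch of the workaround above, following the launch pattern from the Context section. The kubernetes_host value and the extra Spark-on-K8s settings in spark_conf are placeholders, not taken from the actual setup.
> {code:java}
> # Sketch of the workaround: give every job an explicit, unique spark.app.id
> # so the spark-app-selector labels of two jobs can never collide.
> import time
> import uuid
>
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
>
> kubernetes_host = 'https://<k8s-apiserver-host>:<port>'  # placeholder
> app_name = 'testapp'
>
> spark_conf = SparkConf()  # plus the usual Spark-on-K8s settings
>
> # Unique app ID: app name + current millis + random UUID, truncated to 63
> # characters because the value is also used as a Kubernetes label value.
> unique_app_id = f'{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}'[:63]
> spark_conf.set('spark.app.id', unique_app_id)
>
> session = SparkSession.builder \
>     .config(conf=spark_conf) \
>     .appName(app_name) \
>     .master(f'k8s://{kubernetes_host}') \
>     .getOrCreate()
>
> with session:
>     session.sql('SELECT 1'){code}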