[ 
https://issues.apache.org/jira/browse/SPARK-47495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47495:
-------------------------------------

    Assignee: Jiale Tan

> Primary resource jar added to spark.jars twice under k8s cluster mode
> ---------------------------------------------------------------------
>
>                 Key: SPARK-47495
>                 URL: https://issues.apache.org/jira/browse/SPARK-47495
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Kubernetes, Spark Core
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Jiale Tan
>            Assignee: Jiale Tan
>            Priority: Minor
>              Labels: pull-request-available
>
> *Context*:
> To submit a Spark job to Kubernetes in cluster mode, {{spark-submit}} is
> invoked twice.
> The first time, {{SparkSubmit}} runs in k8s cluster mode. It appends the
> primary resource to {{spark.jars}} and calls
> {{KubernetesClientApplication::start}} to create a driver pod.
> The driver pod then runs {{spark-submit}} again with the same primary
> resource jar. This time, however, {{SparkSubmit}} runs in client mode with
> {{spark.kubernetes.submitInDriver}} set to {{true}} and with the updated
> {{spark.jars}}. In this mode, {{SparkSubmit}} downloads all the jars in
> {{spark.jars}} to the driver and replaces their URLs in {{spark.jars}} with
> the driver-local paths.
> {{SparkSubmit}} then appends the same primary resource to {{spark.jars}}
> again. As a result, {{spark.jars}} contains two entries for the primary
> resource: the original URL that the user submitted and the driver-local
> file path.
> Later, when the driver starts the SparkContext, it copies all the
> {{spark.jars}} entries to {{spark.app.initial.jar.urls}} and replaces the
> driver-local jar paths in {{spark.app.initial.jar.urls}} with driver file
> service paths.
> At this point, every jar passed via {{--jars}} or {{spark.jars}} in the
> original user submission has been replaced with a driver file service URL
> in {{spark.app.initial.jar.urls}}. The primary resource jar from the
> original submission, however, appears in {{spark.app.initial.jar.urls}}
> twice: once with the original path from the user submission and once with
> a driver file service URL.
> When executors start, they download all the jars listed in
> {{spark.app.initial.jar.urls}}.
> *Issue*:
> Each executor downloads two copies of the primary resource: one from the
> original URL that the user submitted, the other from the driver file
> service URL that replaced the driver-local file path. This wastes
> resources. The problem was also reported previously
> [here|https://github.com/apache/spark/pull/37417#issuecomment-1517797912].
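
The flow described above can be illustrated with a small toy model. This is only a sketch of the steps as the issue describes them, not Spark's actual implementation. The function names, the local directory, the file server address, and the example jar URLs are all hypothetical and chosen purely for illustration.

```python
import posixpath
from collections import Counter
from urllib.parse import urlparse

# Hypothetical locations, used only to make the example concrete.
DRIVER_LOCAL_DIR = "/tmp/spark-driver"
DRIVER_FILE_SERVER = "spark://driver-svc:7078/jars"


def jar_name(url):
    """Return the file name component of a jar URL or local path."""
    return posixpath.basename(urlparse(url).path)


def first_submit(primary, user_jars):
    """Cluster-mode submit: the primary resource is appended to spark.jars."""
    return list(user_jars) + [primary]


def driver_submit(primary, spark_jars):
    """Client-mode submit inside the driver pod (submitInDriver=true).

    Every entry in spark.jars is downloaded and replaced by a driver-local
    path, then the primary resource is appended once more.
    """
    localized = [posixpath.join(DRIVER_LOCAL_DIR, jar_name(j)) for j in spark_jars]
    return localized + [primary]


def initial_jar_urls(spark_jars):
    """SparkContext start: driver-local paths become file service URLs."""
    return [
        f"{DRIVER_FILE_SERVER}/{jar_name(j)}" if j.startswith(DRIVER_LOCAL_DIR) else j
        for j in spark_jars
    ]


def duplicated_jars(urls):
    """List jar file names that appear more than once in the URL list."""
    counts = Counter(jar_name(u) for u in urls)
    return sorted(name for name, n in counts.items() if n > 1)


primary = "s3a://bucket/app.jar"      # hypothetical primary resource
user_jars = ["s3a://bucket/dep.jar"]  # hypothetical --jars value

after_first = first_submit(primary, user_jars)
after_driver = driver_submit(primary, after_first)
urls = initial_jar_urls(after_driver)

print(urls)
print(duplicated_jars(urls))
```

In this model the dependency jar ends up in the final list once, as a file service URL, while the primary resource appears both as a file service URL and under its original URL, so an executor that fetches every entry would download it twice.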



