[ 
https://issues.apache.org/jira/browse/SPARK-47475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiale Tan updated SPARK-47475:
------------------------------
    Description: 
*Issues*:
- The executor downloads two duplicate copies of the primary resource: one via 
the original URL the user submitted with, the other via the driver-local file 
path. This wastes network and disk resources.
- When the jars are large and the application requests many executors, the 
massive concurrent jar downloads from the driver saturate the network. The 
executors' jar downloads then time out, causing the executors to be 
terminated. From the user's point of view, the application is trapped in a 
loop of massive executor loss and re-provisioning, and never gets as many live 
executors as requested, which leads to SLA breaches or sometimes outright 
failure.

*Root Cause*:
To submit a Spark job to Kubernetes in cluster mode, {{spark-submit}} is 
triggered twice. 
The first time, {{SparkSubmit}} runs in k8s cluster mode: it appends the 
primary resource to {{spark.jars}} and calls 
{{KubernetesClientApplication::start}} to create the driver pod. 
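A minimal sketch of this first step (the helper name and URLs below are 
illustrative, not the actual {{SparkSubmit}} source):

{code:scala}
// Simplified sketch: under k8s cluster mode, SparkSubmit merges the primary
// resource into spark.jars before the driver pod is created.
def mergePrimaryResource(sparkJars: Option[String], primaryResource: String): String = {
  val existing = sparkJars.toSeq.flatMap(_.split(",")).filter(_.nonEmpty)
  (existing :+ primaryResource).mkString(",")
}

// mergePrimaryResource(Some("s3a://bucket/dep.jar"), "s3a://bucket/app.jar")
//   => "s3a://bucket/dep.jar,s3a://bucket/app.jar"
{code}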
The driver pod then runs {{spark-submit}} again with the same primary resource 
jar. This time, however, {{SparkSubmit}} runs in client mode with 
{{spark.kubernetes.submitInDriver}} set to {{true}} and with the updated 
{{spark.jars}}. In this mode, all jars in {{spark.jars}} are downloaded to the 
driver, and their URLs are replaced with the driver-local file paths. 
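A rough sketch of that localization step, assuming a plain URL-stream fetch 
(Spark performs the actual download through its own file utilities; the helper 
below is hypothetical):

{code:scala}
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Fetch a remote jar into a driver-local directory and return the file: path
// that replaces the original URL in spark.jars.
def localizeJar(url: String, destDir: String): String = {
  val target = Paths.get(destDir, url.split("/").last)
  val in = new URL(url).openStream()
  try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
  "file:" + target.toAbsolutePath
}
{code}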
{{SparkSubmit}} then appends the same primary resource to {{spark.jars}} once 
more. As a result, {{spark.jars}} holds two paths to duplicate copies of the 
primary resource: one with the original URL the user submitted with, the other 
with the driver-local file path. 
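Concretely, the resulting {{spark.jars}} looks like this (paths hypothetical):

{code:scala}
// After localization, spark.jars holds the driver-local path...
val afterDownload = Seq("file:/tmp/spark-upload/app.jar")
// ...and the second append adds the original URL back, so the same jar
// appears twice and every executor fetches it twice.
val sparkJars = afterDownload :+ "s3a://bucket/app.jar"
// => Seq("file:/tmp/spark-upload/app.jar", "s3a://bucket/app.jar")
{code}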
Later, when the driver starts the {{SparkContext}}, it copies all of 
{{spark.jars}} into {{spark.app.initial.jar.urls}} and replaces the 
driver-local jar paths in {{spark.app.initial.jar.urls}} with driver 
file-service paths, from which the executors can download those driver-local 
jars.
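A sketch of that final rewrite, assuming driver-local jars are exposed through 
the driver's file server (the {{spark://}} URL shape here is illustrative):

{code:scala}
// Driver-local paths are rewritten to URLs served by the driver so that
// executors can fetch them; remote URLs pass through unchanged.
def toExecutorFetchableUrl(jarUrl: String, driverHost: String, driverPort: Int): String =
  if (jarUrl.startsWith("file:"))
    s"spark://$driverHost:$driverPort/jars/" + jarUrl.split("/").last
  else
    jarUrl
{code}

Because both duplicate entries point at the same bytes, each executor fetches 
the primary resource twice: once from the original remote URL and once from 
the driver's file service, which multiplies the load on the driver's network.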



> Jar Download Under K8s Cluster Mode Causes Executors Scaling Issues 
> --------------------------------------------------------------------
>
>                 Key: SPARK-47475
>                 URL: https://issues.apache.org/jira/browse/SPARK-47475
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Kubernetes
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Jiale Tan
>            Priority: Major
>


