[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-27368:
----------------------------------
    Description: 
Design draft:

Scenarios:
* In client mode, the worker might create one or more executor processes from 
different Spark applications.
* In cluster mode, the worker might create the driver process as well.
* In local-cluster mode, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, mainly for tests.
* Resource isolation is not considered here.

Because executor and driver processes on the same node will share the 
accelerator resources, the worker must take the role of allocating resources. 
So we will add a spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. Users need to match the resourceName in driver 
and executor requests. Besides CPU cores and memory, the worker now also 
considers resources when creating new executors or drivers.
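The allocation check described above could be sketched roughly as follows. This is an illustrative sketch only; the function shape, the free-resource map, and all names are assumptions, not part of the design draft.

```python
# Hypothetical sketch: a worker-side check that a new executor/driver request
# fits the remaining cores, memory, and discovered resources. Names are
# illustrative; nothing here is prescribed by the draft.

def can_launch(free_cores, free_memory_mb, free_resources,
               req_cores, req_memory_mb, req_resources):
    """Return True if the worker can satisfy the request."""
    if req_cores > free_cores or req_memory_mb > free_memory_mb:
        return False
    # Every requested resource name must have been discovered on this worker
    # with enough unallocated units remaining.
    return all(free_resources.get(name, 0) >= count
               for name, count in req_resources.items())
```

The point is only that the existing cores/memory check gains one extra clause: a per-name comparison against the resources the discovery script reported.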

Example conf:

{code}
spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
spark.driver.resource.gpu.count=4
spark.worker.resource.gpu.count=1
{code}
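With a conf like the above, the worker could invoke the discovery script and parse its output along these lines. The draft does not fix an output format, so this sketch assumes one resource address (e.g. a GPU index) per line; the function names are hypothetical.

```python
import subprocess

# Hypothetical sketch: run the configured discovery script and collect the
# resource addresses it reports. The one-address-per-line output format is an
# assumption of this sketch, not something specified by the draft.

def parse_addresses(script_output):
    """Return the non-empty, whitespace-trimmed lines of the script output."""
    return [line.strip() for line in script_output.splitlines() if line.strip()]

def discover(script_path):
    """Run the discovery script and return the addresses it printed."""
    out = subprocess.run(["bash", script_path], capture_output=True,
                         text=True, check=True).stdout
    return parse_addresses(out)
```

For the gpu example, `/path/to/list-gpus.sh` would print the visible GPU indices, and the worker would register that many gpu units for allocation.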

In client mode, the driver process is not launched by the worker, so the user 
can specify a driver resource discovery script. In cluster mode, if the user 
still specifies a driver resource discovery script, it is ignored with a 
warning.
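The client-mode/cluster-mode precedence rule could look like the following sketch. The config key name extends the pattern from the example conf, and the function shape is an assumption for illustration.

```python
import warnings

# Hypothetical sketch of the precedence rule: honor the user's driver-side
# discovery script in client mode; ignore it with a warning in cluster mode,
# where the worker launches the driver and allocates resources itself.

def resolve_driver_discovery_script(deploy_mode, driver_script):
    if driver_script is None:
        return None
    if deploy_mode == "cluster":
        warnings.warn("spark.driver.resource.[resourceName].discoveryScript "
                      "is ignored in cluster mode; the worker allocates "
                      "resources for the driver.")
        return None
    return driver_script
```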

Supporting resource isolation is tricky because the Spark worker doesn't know 
how to isolate resources unless we hardcode some resource names, like the GPU 
support in YARN, which is less than ideal. Supporting resource isolation for 
multiple resource types is even harder. In the first version, we will 
implement accelerator support without resource isolation.

  was:
Design draft:

Scenarios:
* In client mode, the worker might create one or more executor processes from 
different Spark applications.
* In cluster mode, the worker might create the driver process as well.
* In local-cluster mode, there could be multiple worker processes on the same 
node. This is an undocumented use of standalone mode, mainly for tests.

Because executor and driver processes on the same node will share the 
accelerator resources, the worker must take the role of allocating resources. 
So we will add a spark.worker.resource.[resourceName].discoveryScript conf for 
workers to discover resources. Users need to match the resourceName in driver 
and executor requests and do not need to specify discovery scripts separately.


> Design: Standalone supports GPU scheduling
> ------------------------------------------
>
>                 Key: SPARK-27368
>                 URL: https://issues.apache.org/jira/browse/SPARK-27368
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
