[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-27368.
-----------------------------------
    Resolution: Fixed

Resolving since the code has been committed.

> Design: Standalone supports GPU scheduling
> ------------------------------------------
>
>                 Key: SPARK-27368
>                 URL: https://issues.apache.org/jira/browse/SPARK-27368
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>
> Design draft:
> Scenarios:
> * In client mode, a worker might create one or more executor processes, 
> possibly from different Spark applications.
> * In cluster mode, a worker might create the driver process as well.
> * In local-cluster mode, there could be multiple worker processes on the same 
> node. This is an undocumented use of standalone mode, mainly for tests (see 
> the example master URL after this list).
> * Resource isolation is not considered here.
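> For reference, local-cluster mode is reached through a master URL of the form 
> local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB], e.g.:
> {code}
> spark-shell --master local-cluster[2,1,1024]
> {code}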
> Because executor and driver processes on the same node will share the 
> accelerator resources, the worker must take the role that allocates 
> resources. So we will add a spark.worker.resource.[resourceName].discoveryScript 
> conf for workers to discover resources. Users need to use a matching 
> resourceName in driver and executor requests. Besides CPU cores and memory, 
> the worker now also considers resources when creating new executors or 
> drivers.
> Example conf:
> {code}
> # static worker conf
> spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
> # application conf
> spark.driver.resource.gpu.amount=4
> spark.executor.resource.gpu.amount=2
> spark.task.resource.gpu.amount=1
> {code}
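> The draft doesn't pin down the discovery script's output format; as a hedged 
> sketch, assume the script prints a single JSON object naming the resource and 
> its addresses, e.g. {"name": "gpu", "addresses": ["0", "1"]}. Below is a 
> minimal Scala sketch of how a worker could run such a script and parse the 
> result. The class and helper names, and the naive regex parsing, are 
> illustrative, not the committed implementation:
> {code}
> import scala.sys.process._
> 
> // Illustrative stand-in for the worker's view of a discovered resource.
> case class ResourceInformation(name: String, addresses: Seq[String])
> 
> object DiscoverResource {
>   // Run the configured discovery script and parse its one-line JSON output.
>   // Parsing is deliberately naive to stay dependency-free; a real worker
>   // would use a JSON library.
>   def discover(script: String): ResourceInformation = {
>     val out = Seq("/bin/bash", script).!!.trim
>     val name = """"name"\s*:\s*"([^"]+)"""".r
>       .findFirstMatchIn(out).map(_.group(1)).getOrElse("")
>     val addrs = """"addresses"\s*:\s*\[([^\]]*)\]""".r
>       .findFirstMatchIn(out)
>       .map(_.group(1).split(",").toSeq
>         .map(_.trim.stripPrefix("\"").stripSuffix("\"")).filter(_.nonEmpty))
>       .getOrElse(Seq.empty)
>     ResourceInformation(name, addrs)
>   }
> 
>   def main(args: Array[String]): Unit =
>     println(discover("/path/to/list-gpus.sh"))
> }
> {code}
> Note also the arithmetic the example conf implies: with 
> spark.executor.resource.gpu.amount=2 and spark.task.resource.gpu.amount=1, at 
> most 2 GPU tasks run concurrently in each executor.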
> In client mode, the driver process is not launched by a worker, so the user 
> can specify a driver resource discovery script (see the conf sketch below). 
> In cluster mode, if the user still specifies a driver resource discovery 
> script, it is ignored with a warning.
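> A conf sketch for that client-mode case, assuming the driver-side key simply 
> mirrors the worker one (the exact key name is an assumption here, not settled 
> by this draft):
> {code}
> # client mode only: let the driver discover its own GPUs
> spark.driver.resource.gpu.discoveryScript=/path/to/list-gpus.sh
> {code}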
> Supporting resource isolation is tricky because the Spark worker doesn't know 
> how to isolate resources unless we hardcode some resource names, as YARN does 
> for GPU support, which is less ideal. Supporting resource isolation across 
> multiple resource types is even harder. In the first version, we will 
> implement accelerator support without resource isolation.
> Timeline:
> 1. Worker starts.
> 2. Worker loads the `spark.worker.resource.*` conf and runs the discovery 
> scripts to discover resources.
> 3. Worker registers with the master, reporting cores, memory, and (new) 
> resources.
> 4. An application starts.
> 5. Master finds workers with sufficient available resources and lets them 
> start executor or driver processes (see the sketch after this list).
> 6. Worker assigns executor/driver resources by passing the resource info on 
> the command line.
> 7. Application ends.
> 8. Master requests the worker to kill the driver/executor process.
> 9. Master updates its available resources.
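> A minimal Scala sketch of the bookkeeping in steps 3, 5, and 9, with 
> illustrative field names (not the actual master/worker data structures); it 
> can be pasted into the Scala REPL:
> {code}
> // What a worker advertises to the master at registration (step 3).
> case class WorkerState(freeCores: Int, freeMemoryMb: Int,
>                        freeResources: Map[String, Int])
> 
> // What an application asks for per executor (or driver).
> case class LaunchRequest(cores: Int, memoryMb: Int,
>                          resources: Map[String, Int])
> 
> // Step 5: a worker qualifies only if cores, memory, AND every named
> // resource fit.
> def canLaunch(w: WorkerState, req: LaunchRequest): Boolean =
>   w.freeCores >= req.cores &&
>     w.freeMemoryMb >= req.memoryMb &&
>     req.resources.forall { case (name, amt) =>
>       w.freeResources.getOrElse(name, 0) >= amt
>     }
> 
> // Steps 5 and 9: debit resources on launch; the inverse runs on kill.
> def allocate(w: WorkerState, req: LaunchRequest): WorkerState =
>   w.copy(
>     freeCores = w.freeCores - req.cores,
>     freeMemoryMb = w.freeMemoryMb - req.memoryMb,
>     freeResources = req.resources.foldLeft(w.freeResources) {
>       case (acc, (name, amt)) => acc.updated(name, acc(name) - amt)
>     })
> {code}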


