[ https://issues.apache.org/jira/browse/SPARK-27368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-27368.
-----------------------------------
    Resolution: Fixed

Resolving since the code has been committed.

> Design: Standalone supports GPU scheduling
> ------------------------------------------
>
>                 Key: SPARK-27368
>                 URL: https://issues.apache.org/jira/browse/SPARK-27368
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>
> Design draft:
> Scenarios:
> * client-mode: a worker might create one or more executor processes, from
> different Spark applications.
> * cluster-mode: a worker might create the driver process as well.
> * local-cluster mode: there could be multiple worker processes on the same
> node. This is an undocumented use of standalone mode, which is mainly for
> tests.
> * Resource isolation is not considered here.
> Because executor and driver processes on the same node will share the
> accelerator resources, the worker must take the role of allocating resources.
> So we will add a spark.worker.resource.[resourceName].discoveryScript conf for
> workers to discover resources. Users need to match the resourceName in driver
> and executor requests. Besides CPU cores and memory, the worker now also
> considers resources when creating new executors or drivers.
> Example conf:
> {code}
> # static worker conf
> spark.worker.resource.gpu.discoveryScript=/path/to/list-gpus.sh
> # application conf
> spark.driver.resource.gpu.amount=4
> spark.executor.resource.gpu.amount=2
> spark.task.resource.gpu.amount=1
> {code}
> In client mode, the driver process is not launched by the worker, so the user
> can specify a driver resource discovery script. In cluster mode, if the user
> still specifies a driver resource discovery script, it is ignored with a
> warning.
> Supporting resource isolation is tricky because the Spark worker doesn't know
> how to isolate resources unless we hardcode some resource names, like the GPU
> support in YARN, which is less than ideal.
> Supporting resource isolation for multiple resource types is even harder. In
> the first version, we will implement accelerator support without resource
> isolation.
> Timeline:
> 1. Worker starts.
> 2. Worker loads the `spark.worker.resource.*` conf and runs discovery scripts
> to discover resources.
> 3. Worker registers with the master and reports its cores, memory, and
> resources (new).
> 4. An application starts.
> 5. Master finds workers with sufficient available resources and lets a worker
> start the executor or driver process.
> 6. Worker assigns executor / driver resources by passing the resource info
> on the command line.
> 7. Application ends.
> 8. Master requests the worker to kill the driver/executor process.
> 9. Master updates available resources.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
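
The bookkeeping implied by the timeline in the design draft (steps 3, 5, 6, 8, and 9) can be illustrated with a small Python sketch. This is not Spark's implementation: the `Worker`, `Allocation`, and `Master` classes, their fields, and the first-fit placement policy are all hypothetical and chosen only for illustration. The sketch assumes each resource is tracked as a list of free addresses per resource name and, as in the draft, does no resource isolation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Worker:
    """Hypothetical view of one worker's free capacity, as tracked by the master."""
    name: str
    free_cores: int
    free_memory_mb: int
    # resourceName -> free addresses (assumed to come from a discovery script)
    free_resources: Dict[str, List[str]] = field(default_factory=dict)

    def can_fit(self, cores: int, memory_mb: int, amounts: Dict[str, int]) -> bool:
        # Cores, memory, and every requested resource must all be available.
        return (
            self.free_cores >= cores
            and self.free_memory_mb >= memory_mb
            and all(len(self.free_resources.get(r, [])) >= n
                    for r, n in amounts.items())
        )


@dataclass
class Allocation:
    """What a launched executor or driver holds until it is killed."""
    worker: Worker
    cores: int
    memory_mb: int
    resources: Dict[str, List[str]]


class Master:
    def __init__(self) -> None:
        self.workers: List[Worker] = []

    def register(self, worker: Worker) -> None:
        # Step 3: the worker reports cores, memory, and resources.
        self.workers.append(worker)

    def launch(self, cores: int, memory_mb: int,
               amounts: Dict[str, int]) -> Optional[Allocation]:
        # Steps 5-6: pick the first worker that fits (an assumed policy)
        # and hand out concrete resource addresses.
        for w in self.workers:
            if not w.can_fit(cores, memory_mb, amounts):
                continue
            assigned: Dict[str, List[str]] = {}
            for r, n in amounts.items():
                free = w.free_resources.get(r, [])
                assigned[r] = free[:n]
                w.free_resources[r] = free[n:]
            w.free_cores -= cores
            w.free_memory_mb -= memory_mb
            return Allocation(w, cores, memory_mb, assigned)
        return None  # no worker has sufficient available resources

    def release(self, alloc: Allocation) -> None:
        # Steps 8-9: after the process is killed, return its resources.
        w = alloc.worker
        w.free_cores += alloc.cores
        w.free_memory_mb += alloc.memory_mb
        for r, addrs in alloc.resources.items():
            w.free_resources.setdefault(r, []).extend(addrs)
```

With one worker registered with four GPU addresses, two requests for `{"gpu": 2}` each succeed and receive disjoint addresses, a third request returns `None`, and releasing an allocation makes its addresses available again.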