[
https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166627#comment-17166627
]
Thomas Graves edited comment on SPARK-32429 at 7/28/20, 6:47 PM:
-
Yes, for this first implementation we didn't really address users selecting
different types of GPUs, but I think the design is generic enough to handle it;
it just requires extra support from the cluster manager. Otherwise it's left
to the user to discover the details of the GPU.
So in the scenario you're describing, a Worker has multiple GPUs of different
types. For the Worker to discover them we would either have to explicitly add
support for a "type" (e.g. spark.executor.resource.gpu.type), or you would
define two custom resources (k80/v100). The custom-resource route would be
fine for standalone mode because you just supply the Worker with a different
discovery script for each type, and then, as you say, the application would
request one type of resource or the other. The application just needs to know
to request the custom resources rather than plain "gpu".
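As a sketch of that custom-resource option, the configuration might look
something like this (the resource names k80/v100, amounts, and script paths
are illustrative examples, not an agreed-upon scheme):

```properties
# Worker side: advertise two custom resource types, each with its own
# discovery script (paths are hypothetical)
spark.worker.resource.k80.amount             4
spark.worker.resource.k80.discoveryScript    /opt/spark/scripts/findK80Gpus.sh
spark.worker.resource.v100.amount            2
spark.worker.resource.v100.discoveryScript   /opt/spark/scripts/findV100Gpus.sh

# Application side: request the specific resource type instead of plain "gpu"
spark.executor.resource.v100.amount          1
spark.task.resource.v100.amount              1
```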
I think there are a few ways we could make this generic. One way to make it
completely generic is a plugin that runs before launching executors and python
processes, e.g. spark.worker.resource.XX.launchPlugin = someClass. You could
pass the env and resources into each plugin and it could set whatever it
needs. There are less generic ways if you want Spark to know more about CUDA.
What do you think of something like this?
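To make the plugin idea concrete, here is a minimal sketch of what such a
launch hook could do. The function names, the (env, resources) signature, and
the way the Worker would invoke the plugins are all hypothetical — nothing
like this exists in Spark today:

```python
# Hypothetical sketch of the proposed launch plugin: the Worker would call
# each configured plugin with the process environment and the allocated
# resources just before launching the executor (or python) process.

def cuda_visible_devices_plugin(env, resources):
    """Example plugin body: if GPUs were allocated, restrict the process to
    them via CUDA_VISIBLE_DEVICES. `resources` maps resource name to a list
    of addresses, e.g. {"gpu": ["2", "3"]}."""
    addrs = resources.get("gpu", [])
    if addrs:
        env = dict(env)  # don't mutate the caller's environment
        env["CUDA_VISIBLE_DEVICES"] = ",".join(addrs)
    return env

def apply_launch_plugins(plugins, env, resources):
    """Run every configured plugin in order, threading the env through each
    one (mirroring a spark.worker.resource.XX.launchPlugin setting)."""
    for plugin in plugins:
        env = plugin(env, resources)
    return env
```

For example, applying the plugin to an env with GPUs `["2", "3"]` allocated
would add `CUDA_VISIBLE_DEVICES=2,3` and leave everything else untouched.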
> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> ----------------------------------------------------------------------
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
>
> It would be nice if standalone mode could allow users to set
> CUDA_VISIBLE_DEVICES before launching an executor. This has multiple
> benefits.
> * It provides a kind of isolation, in that the executor can only see the
> GPUs set there.
> * If your GPU application doesn't support explicitly setting the GPU device
> id, setting this makes any GPU look like the default (id 0), and things
> generally just work without any explicit setting.
> * New features being added on newer GPUs, like MIG
> ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]), require
> explicitly setting CUDA_VISIBLE_DEVICES.
> The code changes to just set this are very small. Once we set it, we would
> also possibly need to remap the GPU addresses, since CUDA_VISIBLE_DEVICES
> renumbers the visible devices to start from device id 0 again.
> The easiest implementation would just specifically support this behind a
> config, setting the variable when the config is on and GPU resources are
> allocated.
> Note we probably want to set this same env when we launch a python process
> as well, so that it sees the same devices.
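For illustration, the address renumbering mentioned in the issue works like
this: once CUDA_VISIBLE_DEVICES is set to the allocated physical ids, CUDA
re-indexes the visible devices from 0, so the addresses Spark hands to tasks
would need the same remapping. A sketch (not the actual Spark code):

```python
def remap_gpu_addresses(allocated):
    """Given the physical GPU ids allocated to an executor (e.g. ["2", "3"]),
    return the CUDA_VISIBLE_DEVICES value to set and the addresses the
    process will actually see, renumbered from 0."""
    visible = ",".join(allocated)                       # e.g. "2,3"
    remapped = [str(i) for i in range(len(allocated))]  # e.g. ["0", "1"]
    return visible, remapped
```

So an executor allocated physical GPUs 2 and 3 would get
`CUDA_VISIBLE_DEVICES=2,3`, and inside the process those same devices show up
as addresses "0" and "1".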
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org