Minimal example w/ srun:

[bmooreii@gpu-interactive ~]$ salloc --gres=gpu:4 -n4
salloc: Granted job allocation 8868
salloc: Waiting for resource configuration
salloc: Nodes gpu-stage08 are ready for job
[bmooreii@gpu-interactive ~]$ cat gres_test.sh
#!/usr/bin/env bash

srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
wait
[bmooreii@gpu-interactive ~]$ cat get_cuda_vis.sh
#!/usr/bin/env bash

echo $CUDA_VISIBLE_DEVICES
[bmooreii@gpu-interactive ~]$ bash gres_test.sh
2
0
3
1
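
For the non-interactive case asked about further down the thread, a minimal
sbatch sketch along the same lines (it assumes the same get_cuda_vis.sh
helper and a 4-GPU node, so adjust the counts to your cluster):

#!/usr/bin/env bash
#SBATCH --gres=gpu:4
#SBATCH -n4

# Each step asks for 1 of the job's 4 GPUs; --exclusive dedicates resources
# to each step, so the four steps run on distinct GPUs concurrently.
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
srun --gres=gpu:1 -n1 --exclusive bash get_cuda_vis.sh &
wait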


On Sat, Sep 2, 2017 at 1:22 PM, charlie hemlock <charlieheml...@gmail.com> wrote:

> Barry,
> Thank you so much for the reply!
> I'm afraid I need more clarification on this comment:
>
> "CUDA_VISIBLE_DEVICES should only ever contain 1 integer between 0 and 3
> if you have 4 GPUs."
>
> We only ever get
> CUDA_VISIBLE_DEVICES = *0* (12 times)
> and devices 1,2,3 are *never* used.
>
> Eventually we want to be able to use MPI such that each rank/task can
> use 1 GPU, but the job can spread tasks/ranks among the 4 GPUs.
> Currently it appears we are limited to device 0 only.
>
> *In an MPI context,* I'm not certain about the wrapper-based method provided
> at the link.
> I'll need to consult with the developer.
>
> Thanks again!
> -C
>
> On Sat, Sep 2, 2017 at 10:49 AM, Barry Moore <moore0...@gmail.com> wrote:
>
>> Charlie,
>>
>> % salloc -n 12 -c 2 --gres=gpu:1
>>> % srun env | grep CUDA
>>> CUDA_VISIBLE_DEVICES=0
>>> (12 times)
>>> *Is this expected behavior if we have more than 1 gpu available (4
>>> total) for the 12 tasks?*
>>
>>
>> This is absolutely expected. You only ask for 1 GPU. CUDA_VISIBLE_DEVICES
>> should only ever contain 1 integer between 0 and 3 if you have 4 GPUs.
>>
>> This comment might help you: https://bugs.schedmd.com/show_bug.cgi?id=2626#c3
>>
>> Basically, loop over the tasks you want to run with an index, take the
>> index % NUM_GPUS, and use a wrapper like the one in that comment, as
>> sketched below.
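>>
>> A minimal sketch of that pattern (the wrapper file name and the ./my_task
>> command are placeholders; adapt them to your application):
>>
>> cat > gpu_wrap.sh <<'EOF'
>> #!/usr/bin/env bash
>> export CUDA_VISIBLE_DEVICES=$1   # GPU index chosen by the caller
>> shift
>> exec "$@"                        # run the real command on that GPU
>> EOF
>>
>> NUM_GPUS=4
>> for i in $(seq 0 11); do                               # 12 tasks, as in your salloc
>>     bash gpu_wrap.sh $(( i % NUM_GPUS )) ./my_task &   # index % NUM_GPUS picks the GPU
>> done
>> wait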
>>
>> - Barry
>>
>>
>> On Fri, Sep 1, 2017 at 1:29 PM, charlie hemlock <charlieheml...@gmail.com> wrote:
>>
>>> Hello,
>>> Can the Slurm forum help with these questions, or should we seek help
>>> elsewhere?
>>>
>>> We need help with salloc GPU allocation.  Hopefully the following
>>> clarifies things.  Given:
>>>
>>> % salloc -n 12 -c 2 --gres=gpu:1
>>> % srun env | grep CUDA
>>> CUDA_VISIBLE_DEVICES=0
>>> (12 times)
>>>
>>> *Is this expected behavior if we have more than 1 gpu available (4
>>> total) for the 12 tasks?*
>>>
>>> We desire different behavior.  *Is there a way to specify an
>>> salloc+srun combination to get:*
>>>
>>> CUDA_VISIBLE_DEVICES=0
>>> CUDA_VISIBLE_DEVICES=1
>>> CUDA_VISIBLE_DEVICES=2
>>> CUDA_VISIBLE_DEVICES=3
>>> And so on... (12 print statements total)?
>>>
>>> We want each task to get 1 GPU, but overall GPU usage to be spread out
>>> among the 4 available devices, not all on device 0.
>>>
>>> That way each task is not waiting on device 0 to free up from other
>>> tasks, as is currently the case.
>>>
>>> What are we missing or misunderstanding?
>>>
>>>    - salloc / srun parameter?
>>>    - slurm.conf or gres.conf setting?
>>>
>>> Thank you!
>>>
>>>
>>> On Tue, Aug 29, 2017 at 12:27 PM, charlie hemlock <charlieheml...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We're looking for any advice on an salloc/srun setup that uses 1 GPU per
>>>> task but where the job makes use of all available GPUs.
>>>>
>>>>
>>>> *Test #1:*
>>>>
>>>> We desire an salloc and srun such that each task gets 1 GPU, but the
>>>> GPU usage for the job is spread out among 4 available devices.  See
>>>> gres.conf below.
>>>>
>>>>
>>>>
>>>> % salloc -n 12 -c 2 --gres=gpu:1
>>>>
>>>> % srun env | grep CUDA
>>>>
>>>> CUDA_VISIBLE_DEVICES=0
>>>>
>>>> (12 times)
>>>>
>>>>
>>>>
>>>> Where we desire:
>>>>
>>>> CUDA_VISIBLE_DEVICES=0
>>>>
>>>> CUDA_VISIBLE_DEVICES=1
>>>>
>>>> CUDA_VISIBLE_DEVICES=2
>>>>
>>>> CUDA_VISIBLE_DEVICES=3
>>>>
>>>> And so on (12 times), such that each task still gets 1 GPU, but usage
>>>> is spread out among the 4 available devices (see gres.conf below), not
>>>> all on device 0.
>>>>
>>>> That way each task is not waiting on device 0 to free up, as is
>>>> currently the case.
>>>>
>>>>
>>>> What are we missing or misunderstanding?
>>>>
>>>>    - salloc / srun parameter?
>>>>    - slurm.conf or gres.conf setting?
>>>>
>>>>
>>>>
>>>> Also see the additional tests below that illustrate the current behavior:
>>>>
>>>>
>>>>
>>>> *Test #2*
>>>>
>>>> Here we believe each srun task will get all 4 GPUs.
>>>>
>>>> % salloc -n 12 -c 2 --gres=gpu:4
>>>>
>>>> % srun env | grep CUDA
>>>>
>>>> CUDA_VISIBLE_DEVICES=0,1,2,3
>>>>
>>>> (12 times)
>>>>
>>>>
>>>>
>>>> This matches expectation.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Test #3*
>>>>
>>>> Another test, where we submit multiple sruns in succession.
>>>>
>>>> Here we use a simple sleepCUDA.py script, which sleeps a few seconds
>>>> and then prints $CUDA_VISIBLE_DEVICES.
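>>>>
>>>> For reference, the script does essentially the same thing as this shell
>>>> one-liner (the sleep length here is illustrative):
>>>>
>>>> sleep 5 && echo "$CUDA_VISIBLE_DEVICES"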
>>>>
>>>>
>>>>
>>>> % salloc -n 12 -c 2 --gres=gpu:4
>>>>
>>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>>
>>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>>
>>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>>
>>>> % srun --gres=gpu:1 sleepCUDA.py &
>>>>
>>>>
>>>>
>>>> Result:
>>>>
>>>> CUDA_VISIBLE_DEVICES=0  (jobid 1)
>>>>
>>>> CUDA_VISIBLE_DEVICES=1  (jobid 2)
>>>>
>>>> CUDA_VISIBLE_DEVICES=2  (jobid 3)
>>>>
>>>> CUDA_VISIBLE_DEVICES=3  (jobid 4)
>>>>
>>>> And so on (though not necessarily in 0,1,2,3 order).
>>>>
>>>> A single srun submission would still only use 1 GPU (device 0), as
>>>> before and as expected.
>>>>
>>>> This seems like a step in the right direction, since multiple devices
>>>> were used, but it is not quite what we wanted.
>>>>
>>>>
>>>> And according to https://slurm.schedmd.com/archive/slurm-16.05.7/gres.html:
>>>>
>>>> *"By default, a job step will be allocated all of the generic resources
>>>> allocated to the job. [Test #2]*
>>>>
>>>> *If desired, the job step may explicitly specify a different generic
>>>> resource count than the job. [Test #3]"*
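>>>>
>>>> For concreteness, the two cases look like this from inside the same
>>>> salloc (the outputs are the ones we observed in Tests #2 and #3):
>>>>
>>>> % srun env | grep CUDA               # step inherits the job's GRES: CUDA_VISIBLE_DEVICES=0,1,2,3
>>>> % srun --gres=gpu:1 env | grep CUDA  # step requests its own GRES: one device, e.g. CUDA_VISIBLE_DEVICES=0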
>>>>
>>>>
>>>>
>>>> To run Test #3 non-interactively, should we look into creating an sbatch
>>>> script (with multiple sruns) instead of salloc?
>>>>
>>>>
>>>>
>>>>
>>>> *OS:* CentOS 7
>>>>
>>>> *Slurm version:* 16.05.6
>>>>
>>>>
>>>> *gres.conf*
>>>>
>>>> Name=gpu File=/dev/nvidia0
>>>>
>>>> Name=gpu File=/dev/nvidia1
>>>>
>>>> Name=gpu File=/dev/nvidia2
>>>>
>>>> Name=gpu File=/dev/nvidia3
>>>>
>>>>
>>>>
>>>> *slurm.conf (truncated/partial/simplified)*
>>>>
>>>> NodeName=node1 Gres=gpu:4
>>>>
>>>> NodeName=node2 Gres=gpu:4
>>>>
>>>> NodeName=node3 Gres=gpu:4
>>>>
>>>> NodeName=node4 Gres=gpu:4
>>>>
>>>> GresTypes=gpu
>>>>
>>>>
>>>>
>>>> No cgroup.conf
>>>>
>>>>
>>>>
>>>> Posting the actual .conf files is not practical due to firewalls.
>>>>
>>>>
>>>> Any advice will be greatly appreciated!
>>>>
>>>> Thank you!
>>>>
>>>
>>>
>>
>>
>> --
>> Barry E Moore II, PhD
>> E-mail: bmoor...@pitt.edu
>>
>> Assistant Research Professor
>> Center for Simulation and Modeling
>> University of Pittsburgh
>> Pittsburgh, PA 15260
>>
>
>


-- 
Barry E Moore II, PhD
E-mail: bmoor...@pitt.edu

Assistant Research Professor
Center for Simulation and Modeling
University of Pittsburgh
Pittsburgh, PA 15260
