Hi Pavan,

freegpus just sets CUDA_VISIBLE_DEVICES according to how many GPUs are
requested. It was created because all jobs were otherwise running on GPU ID 0.
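
For reference, here is a rough sketch of the kind of nvidia-smi check it
does. This is only an illustration of the approach, not the actual
/programs/bin/freegpus (which also takes the job ID as an argument), so the
names and details below are assumed:

#!/bin/bash
# Hypothetical sketch only -- not the real /programs/bin/freegpus.
# Usage: freegpus-sketch <num_gpus>
# Prints a comma-separated list of GPU indices that currently have no
# compute processes, suitable for CUDA_VISIBLE_DEVICES.

count=${1:-1}

# UUIDs of GPUs that currently have at least one compute process on them.
busy_uuids=$(nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader | sort -u)

# Walk all GPUs (index, uuid) and keep the ones whose UUID is not busy.
free=()
while IFS=', ' read -r idx uuid; do
    if ! grep -qF "$uuid" <<< "$busy_uuids"; then
        free+=("$idx")
    fi
done < <(nvidia-smi --query-gpu=index,uuid --format=csv,noheader)

# Emit the first <num_gpus> free indices, comma-separated.
IFS=,
echo "${free[*]:0:$count}"

The real script presumably does a bit more with the job ID, so treat this
only as an outline of the mechanism.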

Oliver

On Thu, Apr 6, 2017 at 9:13 PM, pavan tc <pavan...@gmail.com> wrote:

> Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I
> guess your program "freegpus" does a little more?
>
> On Thu, Apr 6, 2017 at 6:32 AM, Oliver Grant <olivercgr...@gmail.com>
> wrote:
>
>> Hi there,
>>
>> I use a bash script to submit multiple single-GPU jobs simultaneously to
>> a cluster of 18 nodes with 4 GPUs per node.
>>
>> #!/bin/bash
>> #SBATCH -J jobName
>> #SBATCH --partition=GPU
>> #SBATCH --get-user-env
>> #SBATCH --nodes=1
>> #SBATCH --tasks-per-node=1
>> #SBATCH --gres=gpu:1
>>
>> source /etc/profile.d/modules.sh
>> export pmemd="srun $AMBERHOME/bin/pmemd.cuda "
>> # freegpus uses nvidia-smi to figure out which GPUs are occupied.
>> export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)
>>
>> ${pmemd} -O \
>> -i eq2.in \
>> -o eq2.o \
>> -p CPLX_Neut_Sol.prmtop \
>> -c eq1.rst7 \
>> -r eq2.rst7 \
>> -x eq2.nc \
>> -ref eq1.rst7
>>
>>
>> We recently installed an extra 8 nodes, and I find that when I submit to
>> those nodes I get four jobs running on a single GPU while the other three
>> GPUs sit idle. If I wait 30 seconds between submissions, the jobs go on
>> separate GPUs (the behaviour I want). When I submit the same scripts to
>> the older nodes, everything works fine. I've reproduced this multiple
>> times. See a video of the problem here (the quality may be better if you
>> download it first):
>>
>> https://www.dropbox.com/s/ahc39mvsefnvnps/video1.ogv?dl=0
>>
>> In the video I show that the output of our "freegpus" program is fine,
>> but when I submit two jobs to node015, they both land on the GPU with
>> ID 0. When I submit two jobs to node003, they go on separate GPUs. I've
>> reproduced this behaviour ~10 times. Once in a while the jobs go
>> straight to running instead of sitting in "PD" for several seconds, and
>> when that happens they do actually go on separate GPUs on node015!
>>
>> It seems like a SLURM bug, so I thought I'd post here.
>> Any ideas?
>>
>> Oliver
>>
>
>
