Hi Pavan,

freegpus just sets CUDA_VISIBLE_DEVICES according to how many GPUs are requested. It was created because all jobs were landing on GPU ID 0.
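For anyone following along, a minimal sketch of what a freegpus-style helper might look like (the real /programs/bin/freegpus is not shown in this thread, and the "0% utilization means free" heuristic is my assumption). Note that any check-then-set scheme like this is inherently racy: two jobs that query nvidia-smi at the same moment both see the same "free" GPU, which matches the symptom described below when jobs start within seconds of each other.

```shell
#!/bin/sh
# Hypothetical freegpus-style sketch: print the first N GPU indices that
# nvidia-smi reports as idle, comma-separated, for CUDA_VISIBLE_DEVICES.
# Usage: freegpus_sketch.sh N

# Given "index, utilization" CSV lines on stdin (the format produced by
# nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,
# e.g. "0, 99 %"), print up to $1 indices whose utilization is 0.
pick_free() {
    awk -F', *' -v n="$1" '
        $2 + 0 == 0 { ids[c++] = $1 }   # 0 % utilization -> treat as free
        END {
            out = ""
            for (i = 0; i < c && i < n; i++)
                out = out (i ? "," : "") ids[i]
            print out
        }'
}

# Only query the driver when nvidia-smi is actually present.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader \
        | pick_free "${1:-1}"
fi
```

A job script would then do something like `export CUDA_VISIBLE_DEVICES=$(freegpus_sketch.sh 1)`; the window between that query and the job actually occupying the GPU is where simultaneous submissions collide.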
Oliver

On Thu, Apr 6, 2017 at 9:13 PM, pavan tc <pavan...@gmail.com> wrote:
> Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I
> guess your program "freegpus" does a little more?
>
> On Thu, Apr 6, 2017 at 6:32 AM, Oliver Grant <olivercgr...@gmail.com> wrote:
>
>> Hi there,
>>
>> I use a bash script to simultaneously submit multiple single-GPU jobs to
>> a cluster containing 18 nodes with 4 GPUs per node.
>>
>> #!/bin/bash
>> #SBATCH -J jobName
>> #SBATCH --partition=GPU
>> #SBATCH --get-user-env
>> #SBATCH --nodes=1
>> #SBATCH --tasks-per-node=1
>> #SBATCH --gres=gpu:1
>>
>> source /etc/profile.d/modules.sh
>> export pmemd="srun $AMBERHOME/bin/pmemd.cuda "
>> # freegpus uses nvidia-smi to figure out which GPUs are occupied.
>> export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)
>>
>> ${pmemd} -O \
>>     -i eq2.in \
>>     -o eq2.o \
>>     -p CPLX_Neut_Sol.prmtop \
>>     -c eq1.rst7 \
>>     -r eq2.rst7 \
>>     -x eq2.nc \
>>     -ref eq1.rst7
>>
>> We recently installed an extra 8 nodes, and when submitting to those
>> nodes I get four jobs running on a single GPU while the other three GPUs
>> sit idle. If I wait 30 seconds between submissions, the jobs go on
>> separate GPUs (the behaviour I want). When submitting the same scripts to
>> the older nodes, everything works fine. I've reproduced this multiple
>> times. See a video of the problem here (the quality may be better if you
>> download it first):
>>
>> https://www.dropbox.com/s/ahc39mvsefnvnps/video1.ogv?dl=0
>>
>> It shows that the output of our program "freegpus" is fine, but when I
>> submit two jobs to node015, they both go on the GPU with ID 0. When I
>> submit two jobs to node003, they go on separate GPUs. I've repeated this
>> behaviour ~10 times. Once in a while the jobs go straight to running
>> instead of sitting as "PD" for several seconds. When that happens they do
>> actually go on separate GPUs on node015!
>>
>> It seems like a SLURM bug, so I thought I'd post here.
>> Any ideas?
>>
>> Oliver
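Pavan's suggestion of letting Slurm itself manage CUDA_VISIBLE_DEVICES would avoid the check-then-set race entirely, since the controller hands each job a distinct device. A sketch of the configuration that enables this (node names and device paths are illustrative, not taken from Oliver's cluster):

```
# slurm.conf (cluster-wide): declare the GRES type and per-node GPU count
GresTypes=gpu
NodeName=node[001-018] Gres=gpu:4

# gres.conf (on each GPU node): map the gres entries to device files
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
```

With this in place, a job submitted with `--gres=gpu:1` has CUDA_VISIBLE_DEVICES set by Slurm to a device no concurrent job on the node holds, so no external helper or delay between submissions is needed.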