Does anyone here running Slurm with GRES-based GPUs check that the GPU is actually usable before launching the job? If so, can you detail how you're doing it?
My cluster is currently using Slurm, but we run HTCondor on the nodes in the background. When a node isn't currently allocated through Slurm, it's made available to HTCondor for use. In general this works pretty well. The issue that arises is that Condor can't detect a Slurm-allocated node fast enough, or halt the job it's running quickly enough. When a user sruns a job, it usually errors out with some irrelevant message about not being able to use the GPU, and users generally can't decipher it to tell what actually happened.

I've tried setting up a prolog on the nodes to kick the jobs off, but I've seen issues in the past where users rapidly issuing srun commands will hork up the nodes. And if the srun takes too long they'll just kill it and try again, which hastens the problem. Whether it's the node, Slurm, Condor, or the combination of all three, I haven't nailed down yet. It might come down to: I'm doing it correctly, but my script is just too slow. Before I spend a bunch of hours tuning, I'd like to double-check I'm going down the right path and/or incorporate some other ideas. Thanks.
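For context, here's a rough sketch of the kind of prolog-side check I have in mind, not my actual script. It assumes nvidia-smi is on the PATH, that the script is wired in as the slurmd Prolog, and that a non-zero exit (which Slurm treats as a prolog failure and will drain the node) is an acceptable outcome. The 5-second timeout and the decision to fail when any compute process is still holding a GPU are assumptions on my part, not settled policy.

```python
#!/usr/bin/env python3
# Sketch of a per-node Prolog GPU check (assumed setup, not my real script).
# Assumes nvidia-smi is on PATH; a non-zero exit tells slurmd the prolog
# failed, which typically drains the node rather than handing a broken
# GPU to the user's job.
import subprocess
import sys

TIMEOUT = 5  # seconds; keep the check short so srun doesn't appear to hang


def nvidia_smi(args):
    """Run nvidia-smi with a hard timeout; any failure means 'GPU not usable'."""
    return subprocess.run(
        ["nvidia-smi"] + args,
        capture_output=True, text=True, timeout=TIMEOUT, check=True,
    ).stdout


def main():
    try:
        # 1. Can the driver enumerate the GPUs at all?
        gpus = [line for line in nvidia_smi(["-L"]).splitlines() if line.strip()]
        if not gpus:
            print("prolog: no GPUs visible to nvidia-smi", file=sys.stderr)
            return 1

        # 2. Is anything (e.g. a lingering Condor job) still holding a GPU?
        procs = nvidia_smi(
            ["--query-compute-apps=pid,process_name", "--format=csv,noheader"]
        ).strip()
        if procs:
            print(f"prolog: GPU still in use:\n{procs}", file=sys.stderr)
            return 1
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
        print(f"prolog: GPU check failed: {exc}", file=sys.stderr)
        return 1

    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The main design question I'm unsure about is whether the prolog should also be the thing that evicts the Condor job (e.g. by calling condor_vacate and then waiting), or whether it should only verify the GPU is free and fail fast, leaving the eviction to something asynchronous, since a prolog that blocks waiting on Condor is exactly what tempts users to kill and re-issue their sruns.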
