do people here running slurm with gres-based gpus check that the gpu
is actually usable before launching the job?  if so, can you detail
how you're doing it?

my cluster is currently using slurm, but we run htcondor on the nodes
in the background.  when a node isn't currently allocated through
slurm it's made available to htcondor for use.  in general this works
pretty well.

however, the issue that arises is that condor can't detect a
slurm-allocated node fast enough, or halt the job it's running quickly
enough.  when a user sruns a job, it usually errors out with some
irrelevant error about not being able to use the gpu, and generally the
user can't decipher it to tell what actually happened.

i've tried setting up a prolog on the nodes to kick the condor jobs
off, but i've seen issues in the past where users quickly issuing srun
commands will hork up the nodes.  and if the srun takes too long
they'll just kill it and try again, which makes the problem worse.
whether that's the fault of the node, slurm, condor, or the combination
of all three, i haven't nailed down yet.
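
for reference, the core of the check i have in mind looks roughly like
the sketch below (python just for readability).  it's a minimal sketch,
assuming nvidia-smi is on the PATH and that slurm drains the node when
the prolog exits non-zero; the condor vacate/retry logic the real
script needs is only hinted at in a comment.

#!/usr/bin/env python3
# rough prolog-style check: verify the gpu responds and isn't already
# occupied by a lingering condor job before letting the slurm job run.
# assumes nvidia-smi is on PATH; a non-zero exit from a slurm prolog
# drains the node instead of handing the user a cryptic CUDA error.
import subprocess
import sys

def run(cmd, timeout=10):
    """run a command, return (returncode, stdout); treat timeouts as failure."""
    try:
        p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return p.returncode, p.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        return 1, ""

def main():
    # 1. does the driver even answer?  a wedged or busy gpu often hangs here.
    rc, _ = run(["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"])
    if rc != 0:
        print("prolog: nvidia-smi failed or timed out; gpu not usable",
              file=sys.stderr)
        return 1

    # 2. is anything (e.g. a condor job that hasn't been evicted yet)
    #    still holding the gpu?
    rc, out = run(["nvidia-smi", "--query-compute-apps=pid,process_name",
                   "--format=csv,noheader"])
    if rc != 0:
        print("prolog: could not query compute apps", file=sys.stderr)
        return 1
    if out:
        print(f"prolog: gpu still in use by: {out}", file=sys.stderr)
        # this is where the condor vacate / wait / retry step would go
        # before giving up; omitted to keep the sketch short.
        return 1

    return 0

if __name__ == "__main__":
    sys.exit(main())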

it may come down to this: i'm doing it correctly, but my script is just
too slow.  before i spend a bunch of hours tuning it, i'd like to
double-check that i'm going down the right path and/or incorporate some
other ideas.

thanks
