You can get a count of GRES using "scontrol show job $SLURM_JOBID", but that identifies only the number of GPUs on each node, not the specific devices. The information about the specific GPUs allocated to a job is in the job credential, which slurmd uses to set the CUDA_VISIBLE_DEVICES environment variable, so it could probably be made available to the prolog relatively easily.
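As a rough sketch, a prolog could pull the GPU count out of the scontrol output. The sample line below stands in for real "scontrol show job" output (which of course is only available on a running cluster), and the Gres=gpu:N format is the usual one, but verify it against your Slurm version:

```shell
#!/bin/sh
# Sketch: extract the GPU count from a "scontrol show job" style line.
# In a real prolog you would instead capture:
#   scontrol show job "$SLURM_JOBID"
# The sample below is a stand-in, since scontrol needs a live slurmctld.
sample='JobId=1234 Gres=gpu:2 Nodes=node01'
gpu_count=$(printf '%s\n' "$sample" | sed -n 's/.*Gres=gpu:\([0-9]*\).*/\1/p')
echo "$gpu_count"
```

This only recovers the count, which is exactly the limitation described above: it says nothing about which physical devices were assigned.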
Quoting Carles Fenoy <[email protected]>:

> Hi all,
>
> Is there any way to get the GRES (GPUs) allocated to a job on each node?
> We have detected a problem with some devices that need to be rebooted
> from time to time, and I would prefer to restart the device in the
> prolog of the job. The problem is that I don't know which device has
> been allocated to the job, and I cannot restart all the devices on a
> node without affecting already allocated jobs.
>
> Regards,
>
> --
> Carles Fenoy
> Barcelona Supercomputing Center
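If the per-job device list were exposed to the prolog (for example in the same comma-separated form slurmd uses for CUDA_VISIBLE_DEVICES), the selective reset would be straightforward. The variable value and the reset command in the comment are assumptions for illustration, not something Slurm currently hands to the prolog:

```shell
#!/bin/sh
# Hypothetical prolog fragment: reset only the GPUs assigned to this job.
# CUDA_VISIBLE_DEVICES is hard-coded here purely for illustration; the
# whole point of the thread is that the prolog does not yet receive it.
CUDA_VISIBLE_DEVICES="0,2"
resets=""
for dev in $(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' ' '); do
    # A real prolog might run something like: nvidia-smi --gpu-reset -i "$dev"
    resets="$resets gpu$dev"
done
echo "would reset:$resets"
```

Untouched devices belonging to other jobs on the node are never listed, which is the property Carles is after.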
