In the message dated: Tue, 02 Jan 2018 09:11:51 +0000,
The pithy ruminations from William Hay on
<Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged> were:

=> On Fri, Dec 22, 2017 at 05:55:26PM -0500, [email protected] wrote:
=> > True, but even with that info, there doesn't seem to be any universal
=> > way to tell an arbitrary GPU job which GPU to use -- they all default
=> > to device 0.
=>
=> With Nvidia GPUs we use a prolog script that manipulates lock files
=> to select a GPU then chgrp's the selected /dev/nvidia? file so the group is
=> the group associated with the job. An epilog script undoes all of this.

Can you provide a copy of the scripts?

I understand the part about the chgrp, but how does the prolog tell an
arbitrary program which GPU to use? My understanding was that software
defaults to GPU #0, and some packages may use a different GPU #, if they
are aware of multiple GPUs and if they accept an option to use a
specified device. I'm unclear on how the prolog restricts the GPU
software (theano, tensorflow, caffe, FSL, locally-developed code, etc.)
to use of a particular device.

=> The /dev/nvidia? files permissions are set to be inaccessible to anyone
=> other than owner(root) and the group. However you have to pass
=> a magic option to the kernel to prevent permissions from being reset
=> whenever anyone tries to access the device.

Details? Does this affect things like "nvidia-smi" (user-land, accesses
all GPUs, but does not run jobs)?
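
For what it's worth, here's the kind of thing I'm imagining, just so
we're talking about the same mechanism. This is an untested sketch, not
your script -- the lock directory, the use of SGE's per-job addgrpid
file, and the exit-code handling are all my guesses:

#!/bin/sh
# Sketch of a GPU-allocating prolog.  Assumes the queue's prolog is
# configured to run as root (root@/path/to/prolog.sh), since it has to
# chgrp/chmod root-owned device files.
LOCKDIR=/var/lock/gpu        # guessed location for the lock files
mkdir -p "$LOCKDIR"

# SGE gives every job an extra group id (from gid_range); one place
# it typically shows up is the job's spool directory.
ADDGRPID=$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")

for dev in /dev/nvidia[0-9]*; do
    gpu=${dev#/dev/nvidia}
    # mkdir is atomic, so concurrent prologs can't claim the same GPU
    if mkdir "$LOCKDIR/gpu$gpu" 2>/dev/null; then
        echo "$JOB_ID" > "$LOCKDIR/gpu$gpu/jobid"
        chgrp "$ADDGRPID" "$dev"   # GNU chgrp accepts a numeric gid
        chmod 660 "$dev"
        exit 0
    fi
done

echo "prolog: no free GPU on $(hostname)" >&2
exit 1   # non-zero; check the SGE docs for whether this errors the job or the queue

The matching epilog would presumably just reverse this: find the lock
directory whose jobid file matches $JOB_ID, chown the device back to
root:root, chmod it 600, and rmdir the lock.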
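
And is the "magic option" the nvidia module parameter that disables the
driver's own device-file management? That's only a guess on my part,
based on the default behaviour of recreating /dev/nvidia* with 0666
permissions whenever something touches a missing or changed node.
Something like:

# /etc/modprobe.d/nvidia.conf
# Guess at the "magic option": stop the nvidia driver from recreating
# or re-permissioning /dev/nvidia* behind our backs.
options nvidia NVreg_ModifyDeviceFiles=0

If that's the one, I assume you then have to create the device nodes
yourselves at boot, and that nvidia-smi run by a user outside the job's
group would fail on the locked-down devices -- hence my question above.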
Thanks,

Mark

=>
=> This seems to be a fairly bullet proof way of restricting jobs to
=> their assigned GPU.
=>
=>
=> William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users