In the message dated: Tue, 02 Jan 2018 09:11:51 +0000,
The pithy ruminations from William Hay on
<Re: [gridengine users] resource types -- changing BOOL to INT but keeping qsub unchanged> were:

=> On Fri, Dec 22, 2017 at 05:55:26PM -0500, [email protected] wrote:
=> > True, but even with that info, there doesn't seem to be any universal
=> > way to tell an arbitrary GPU job which GPU to use -- they all default
=> > to device 0.
=>
=> With Nvidia GPUs we use a prolog script that manipulates lock files
=> to select a GPU then chgrp's the selected /dev/nvidia? file so the group is
=> the group associated with the job. An epilog script undoes all of this.

Can you provide a copy of the scripts?

I understand the part about the chgrp, but how does the prolog tell an
arbitrary program which GPU to use? My understanding was that software
defaults to GPU #0, and some packages may use a different GPU #, if they
are aware of multiple GPUs and if they accept an option to use a
specified device. I'm unclear on how the prolog restricts the GPU
software (theano, tensorflow, caffe, FSL, locally-developed code, etc.)
to use of a particular device.

=> The /dev/nvidia? files permissions are set to be inaccessible to anyone
=> other than owner(root) and the group. However you have to pass
=> a magic option to the kernel to prevent permissions from being reset
=> whenever anyone tries to access the device.

Details? Does this affect things like "nvidia-smi" (user-land, accesses
all GPUs, but does not run jobs)?
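
For what it's worth, here's the kind of thing I'm imagining, just so
we're talking about the same mechanism. This is an untested sketch, not
your script -- the lock directory, the use of SGE's per-job addgrpid
file, and the exit-code handling are all my guesses:

#!/bin/sh
# Sketch of a GPU-allocating prolog.  Assumes the queue's prolog is
# configured to run as root (root@/path/to/prolog.sh), since it has to
# chgrp/chmod root-owned device files.
LOCKDIR=/var/lock/gpu        # guessed location for the lock files
mkdir -p "$LOCKDIR"

# SGE gives every job an extra group id (from gid_range); one place
# it typically shows up is the job's spool directory.
ADDGRPID=$(cat "$SGE_JOB_SPOOL_DIR/addgrpid")

for dev in /dev/nvidia[0-9]*; do
    gpu=${dev#/dev/nvidia}
    # mkdir is atomic, so concurrent prologs can't claim the same GPU
    if mkdir "$LOCKDIR/gpu$gpu" 2>/dev/null; then
        echo "$JOB_ID" > "$LOCKDIR/gpu$gpu/jobid"
        chgrp "$ADDGRPID" "$dev"   # GNU chgrp accepts a numeric gid
        chmod 660 "$dev"
        exit 0
    fi
done

echo "prolog: no free GPU on $(hostname)" >&2
exit 1   # non-zero; check the SGE docs for whether this errors the job or the queue

The matching epilog would presumably just reverse this: find the lock
directory whose jobid file matches $JOB_ID, chown the device back to
root:root, chmod it 600, and rmdir the lock.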
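
And is the "magic option" the nvidia module parameter that disables the
driver's own device-file management? That's only a guess on my part,
based on the default behaviour of recreating /dev/nvidia* with 0666
permissions whenever something touches a missing or changed node.
Something like:

# /etc/modprobe.d/nvidia.conf
# Guess at the "magic option": stop the nvidia driver from recreating
# or re-permissioning /dev/nvidia* behind our backs.
options nvidia NVreg_ModifyDeviceFiles=0

If that's the one, I assume you then have to create the device nodes
yourselves at boot, and that nvidia-smi run by a user outside the job's
group would fail on the locked-down devices -- hence my question above.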
Thanks,

Mark

=>
=> This seems to be a fairly bullet proof way of restricting jobs to
=> their assigned GPU.
=>
=>
=> William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users