On Tue, 17 Apr 2018, Joshua Baker-LePain wrote:
> As an alternative to fixing our current setup, I'd be most interested to
> hear if/how other folks are handling GPUs in their SoGE setups. I was
> considering changing the slot count in gpu.q to match the number of GPUs
> in a host (rather than CPU cores) and have users request slots rather than
> the gpu complex, but that seems like it would run afoul of USE_CGROUPS for
> GPU jobs that want to use more CPU cores than GPUs.
>
> So, what are other people doing for GPUs? Thanks.
Hi Joshua,
We have a daemon that keeps track of which GPUs are installed and hands
them out when asked, altering file permissions so that only the GID
assigned to the job can access them; it returns them to the pool when it
notices that the job has ended.
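Mark's daemon isn't public, so the following is only an illustrative sketch of the permission trick: the device paths (`/dev/nvidiaN`), the pool group name, and the command shapes are all assumptions. The idea is to flip the group owner of each allocated device node to the job's GID and restrict permissions so other users can't open it.

```python
# Hypothetical sketch (not the actual daemon): compute the shell commands a
# GPU-handout daemon might run. Device paths and group names are assumed.

def grant_commands(gpu_ids, job_gid):
    """Commands to hand the given GPUs to a job identified by its GID."""
    cmds = []
    for n in gpu_ids:
        dev = "/dev/nvidia%d" % n
        cmds.append(["chgrp", str(job_gid), dev])  # job's GID now owns the device
        cmds.append(["chmod", "0660", dev])        # owner+group only; others locked out
    return cmds

def release_commands(gpu_ids, pool_group="video"):
    """Commands to return devices to the shared pool when the job ends."""
    cmds = []
    for n in gpu_ids:
        dev = "/dev/nvidia%d" % n
        cmds.append(["chgrp", pool_group, dev])    # back to the default device group
        cmds.append(["chmod", "0660", dev])
    return cmds
```

In a real daemon these would be executed (and the job's end detected, e.g. by polling the scheduler) rather than just returned as lists.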
It allocates the 'nearest' (from a NUMA perspective) GPUs to the job that
are free. It can handle treating multi-GPU cards as one allocatable unit.
It can also keep track of and allocate network port numbers.
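The nearest-free allocation can be sketched as follows. This is not Mark's implementation; it assumes a NUMA distance table (which could in practice come from something like `nvidia-smi topo` or hwloc) and treats each card, possibly carrying several GPUs, as one allocatable unit.

```python
# Illustrative NUMA-aware allocator. cards maps a card id to the GPU ids it
# carries (a multi-GPU card is handed out whole); busy is the set of cards
# already allocated; distance[(numa_node, card_id)] is an assumed hop cost.

def allocate(cards, busy, job_node, distance, want=1):
    """Return the GPU ids for `want` free cards nearest the job's NUMA node,
    or None if not enough cards are free (caller leaves the job queued)."""
    free = [c for c in cards if c not in busy]
    if len(free) < want:
        return None
    # nearest-first: sort free cards by NUMA distance from the job's node
    free.sort(key=lambda c: distance[(job_node, c)])
    chosen = free[:want]
    busy.update(chosen)  # mark as handed out until the job ends
    gpus = []
    for c in chosen:
        gpus.extend(cards[c])  # a multi-GPU card comes as one unit
    return gpus
```

Port-number allocation would follow the same pattern, minus the distance sort: a pool of free ports, handed out and reclaimed per job.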
This is combined with complexes, plus some starter-method and JSV magic,
which means that you can do this:
qrsh -l coproc_p100=1,h_rt=1:0:0 -pty y bash
And it gives you an NVIDIA P100 card for an hour, plus a quarter of the
node's cores and RAM (you can ask for up to 4 cards).
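The proportional scaling a JSV could apply might look like the sketch below. The node sizes are made-up examples, not Mark's hardware; the only fact carried over from the post is that each card buys a quarter of the node, up to 4 cards.

```python
# Hypothetical JSV-style scaling rule: each requested card carries a
# proportional slice of the node (a quarter here, for 4-GPU nodes).
# node_cores and node_ram_gb are invented example values.

def cpu_ram_for_gpus(ngpus, node_cores=24, node_ram_gb=256, gpus_per_node=4):
    """Cores and RAM (GB) to attach to a job asking for ngpus cards."""
    if not 1 <= ngpus <= gpus_per_node:
        raise ValueError("can ask for 1..%d cards" % gpus_per_node)
    share = ngpus / gpus_per_node
    return int(node_cores * share), int(node_ram_gb * share)
```

A real JSV would read the `coproc_p100` request and rewrite the job's slot and memory requests accordingly before the scheduler sees them.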
Don't have multi-node GPU jobs sorted yet.
Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users