On Tue, 17 Apr 2018, Joshua Baker-LePain wrote:
> As an alternative to fixing our current setup, I'd be most interested to
> hear if/how other folks are handling GPUs in their SoGE setups. I was
> considering changing the slot count in gpu.q to match the number of GPUs
> in a host (rather than CPU cores) and have users request slots rather than
> the gpu complex, but that seems like it would run afoul of USE_CGROUPS for
> GPU jobs that want to use more CPU cores than GPUs.
>
> So, what are other people doing for GPUs? Thanks.
Hi Joshua,
We have a daemon that keeps track of which GPUs are installed and hands
them out when asked, altering file permissions so that only the GID
assigned to the job can access them; it returns them to the pool when it
notices that the job has ended.
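Mark's daemon isn't public, so the following is only an illustrative sketch of the permission trick: the device paths (`/dev/nvidiaN`), the pool group name, and the command shapes are all assumptions. The idea is to flip the group owner of each allocated device node to the job's GID and restrict permissions so other users can't open it.

```python
# Hypothetical sketch (not the actual daemon): compute the shell commands a
# GPU-handout daemon might run. Device paths and group names are assumed.

def grant_commands(gpu_ids, job_gid):
    """Commands to hand the given GPUs to a job identified by its GID."""
    cmds = []
    for n in gpu_ids:
        dev = "/dev/nvidia%d" % n
        cmds.append(["chgrp", str(job_gid), dev])  # job's GID now owns the device
        cmds.append(["chmod", "0660", dev])        # owner+group only; others locked out
    return cmds

def release_commands(gpu_ids, pool_group="video"):
    """Commands to return devices to the shared pool when the job ends."""
    cmds = []
    for n in gpu_ids:
        dev = "/dev/nvidia%d" % n
        cmds.append(["chgrp", pool_group, dev])    # back to the default device group
        cmds.append(["chmod", "0660", dev])
    return cmds
```

In a real daemon these would be executed (and the job's end detected, e.g. by polling the scheduler) rather than just returned as lists.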
It allocates the 'nearest' (from a NUMA perspective) GPUs to the job that
are free. It can handle treating multi-GPU cards as one allocatable unit.
It can also keep track of and allocate network port numbers.
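The nearest-free allocation can be sketched as follows. This is not Mark's implementation; it assumes a NUMA distance table (which could in practice come from something like `nvidia-smi topo` or hwloc) and treats each card, possibly carrying several GPUs, as one allocatable unit.

```python
# Illustrative NUMA-aware allocator. cards maps a card id to the GPU ids it
# carries (a multi-GPU card is handed out whole); busy is the set of cards
# already allocated; distance[(numa_node, card_id)] is an assumed hop cost.

def allocate(cards, busy, job_node, distance, want=1):
    """Return the GPU ids for `want` free cards nearest the job's NUMA node,
    or None if not enough cards are free (caller leaves the job queued)."""
    free = [c for c in cards if c not in busy]
    if len(free) < want:
        return None
    # nearest-first: sort free cards by NUMA distance from the job's node
    free.sort(key=lambda c: distance[(job_node, c)])
    chosen = free[:want]
    busy.update(chosen)  # mark as handed out until the job ends
    gpus = []
    for c in chosen:
        gpus.extend(cards[c])  # a multi-GPU card comes as one unit
    return gpus
```

Port-number allocation would follow the same pattern, minus the distance sort: a pool of free ports, handed out and reclaimed per job.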
This is combined with complexes, plus some starter-method and JSV magic,
which means that you can do this:
qrsh -l coproc_p100=1,h_rt=1:0:0 -pty y bash
And it gives you an NVIDIA P100 card for an hour, plus a quarter of the
node's cores and RAM (you can ask for up to 4 cards).
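The proportional scaling a JSV could apply might look like the sketch below. The node sizes are made-up examples, not Mark's hardware; the only fact carried over from the post is that each card buys a quarter of the node, up to 4 cards.

```python
# Hypothetical JSV-style scaling rule: each requested card carries a
# proportional slice of the node (a quarter here, for 4-GPU nodes).
# node_cores and node_ram_gb are invented example values.

def cpu_ram_for_gpus(ngpus, node_cores=24, node_ram_gb=256, gpus_per_node=4):
    """Cores and RAM (GB) to attach to a job asking for ngpus cards."""
    if not 1 <= ngpus <= gpus_per_node:
        raise ValueError("can ask for 1..%d cards" % gpus_per_node)
    share = ngpus / gpus_per_node
    return int(node_cores * share), int(node_ram_gb * share)
```

A real JSV would read the `coproc_p100` request and rewrite the job's slot and memory requests accordingly before the scheduler sees them.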
Don't have multi-node GPU jobs sorted yet.
Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users