Made a little bit of progress by running sinfo:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain n[011-013]
defq* up infinite 1 alloc n010
Not sure why n[011-013] are in the drain state; that needs to be fixed.
After some searching, I ran:
s
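For anyone hitting the same drain state: a minimal sketch, not the command the poster ran (that part of the message is cut off). It assumes the default six-column sinfo layout shown above, picks out the drained node lists, and prints two standard Slurm commands to try: `sinfo -R` to see the recorded drain reason, and `scontrol update ... State=RESUME` to return the nodes to service once the cause is dealt with. The function name `drained_nodes` and the `SAMPLE` text are made up for illustration.

```python
def drained_nodes(sinfo_text):
    """Return the NODELIST entries whose STATE is a drain state.

    Assumes the default sinfo columns:
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    """
    result = []
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a data row in the assumed layout
        state, nodelist = fields[4], fields[5]
        # sinfo may append flags such as '*' to the state; 'drng' is the
        # compact form of 'draining'.
        if state.rstrip("*").startswith(("drain", "drng")):
            result.append(nodelist)
    return result


# Sample text copied from the message above (hypothetical stand-in for
# the output of a live `sinfo` call).
SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain n[011-013]
defq* up infinite 1 alloc n010
"""

if __name__ == "__main__":
    for nodes in drained_nodes(SAMPLE):
        # Show why the nodes were drained.
        print(f"sinfo -R --nodes={nodes}")
        # Resume them only after the underlying cause is fixed.
        print(f"scontrol update NodeName={nodes} State=RESUME")
```

The script only prints the commands rather than running them, since resuming a node before checking the reason can hide a real hardware or configuration problem.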
Hi list, we have a new cluster setup with Bright cluster manager. Looking into
a support contract there, but trying to get community support in the meantime.
I'm sure things were working when the cluster was delivered, but I provisioned
an additional node and now the scheduler isn't quite wo
Yeah, I don't build against NVML either at the moment (it's filed under
'try when you've got some spare time'). I'm pretty much 'autodetecting'
what my gres.conf file needs to look like on nodes via my config
management, and that all seems to work just fine.
CUDA_VISIBLE_DEVICES and cgroup dev
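As a rough illustration of generating gres.conf without NVML: the poster does not say how their config management does it, so this is only one possible approach. It assumes GPUs appear as /dev/nvidia0, /dev/nvidia1, ... on the node, and emits one `Name=gpu File=...` line per device. The function name `gres_lines` is hypothetical.

```python
import glob
import re


def gres_lines(device_paths, name="gpu"):
    """Build one gres.conf line per NVIDIA GPU device file.

    Only paths of the form /dev/nvidia<N> are kept; control devices such
    as /dev/nvidiactl or /dev/nvidia-uvm are ignored.
    """
    prefix = "/dev/nvidia"
    gpus = sorted(
        (p for p in device_paths if re.fullmatch(r"/dev/nvidia\d+", p)),
        key=lambda p: int(p[len(prefix):]),  # numeric order, so 10 follows 9
    )
    return [f"Name={name} File={p}" for p in gpus]


if __name__ == "__main__":
    # On a GPU node this prints lines suitable for gres.conf; on a node
    # without NVIDIA devices it prints nothing.
    for line in gres_lines(glob.glob("/dev/nvidia*")):
        print(line)
```

A config-management template could write this output to gres.conf on each node; the matching `Gres=gpu:N` entry in slurm.conf still has to agree with the number of lines produced.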
I've definitely been there with the minimum cost issue. One thing I have
done personally is start attending SLUG. Now I can give back and learn
more in the process. That may be an option to pitch, iterating the value
you receive from open source software as part of the ROI.
Interestingly, I ha
Same here - $10k for less than 200 nodes. That's an order of magnitude
which makes the finance people ask what we are getting for the money.
As we don't have any special requirements which would require
customisation, that's not easy to answer, so currently we don't have a
support contract. How