Hi Ole,

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> Hi Loris,
>
> On 9/29/22 09:26, Loris Bennett wrote:
>> Has anyone already come up with a good way to identify non-MPI jobs which
>> request multiple cores but don't restrict themselves to a single node,
>> leaving cores idle on all but the first node?
>> I can see that this is potentially not easy, since an MPI job might
>> still have phases where only one core is actually being used.
>
> Just an idea: The "pestat -F" tool[1] will tell you if any nodes have an
> "unexpected" CPU load.  If you see the same JobID running on multiple
> nodes with a too low CPU load, that might point to a job such as you
> describe.
>
> /Ole
>
> [1] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I do already use 'pestat -F', although this flags over 100 of our 170
nodes, so it results in a bit of information overload.  I guess it would
be nice if the sensitivity of the flagging could be tweaked on the
command line, so that only the worst nodes are shown.

I also use some wrappers around 'sueff' from

  https://github.com/ubccr/stubl

to generate part of an ASCII dashboard (a dasciiboard?), which looks
like

  Username  Mem_Request  Max_Mem_Use  CPU_Efficiency  Number_of_CPUs_In_Use
  alpha     42000M       0.03Gn       48.80%          (0.98 of 2)
  beta      10500M       11.01Gn      99.55%          (3.98 of 4)
  gamma     8000M        8.39Gn       99.64%          (63.77 of 64)
  ...
  chi       varied       3.96Gn       83.65%          (248.44 of 297)
  phi       1800M        1.01Gn       98.79%          (248.95 of 252)
  omega     16G          4.61Gn       99.69%          (127.60 of 128)

  == Above data from: Thu 29 Sep 15:26:29 CEST 2022 =========================

and just loops every 30 seconds.  This is what I use to spot users with
badly configured jobs.  However, I'd really like to be able to identify
non-MPI jobs on multiple nodes automatically.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
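P.S.  Roughly the kind of automatic check I have in mind, as a minimal
sketch: combine the job-to-nodes mapping (e.g. from `squeue -h -t R -o
"%i %D %N"`) with a per-node load figure (e.g. parsed from pestat
output), and flag multi-node jobs whose nodes beyond the first are all
near-idle.  All the job IDs, node names, and load numbers below are
made-up sample data, and the nodelist expander only handles the simple
'prefix[lo-hi]' case:

```python
# Sketch: flag multi-node jobs whose non-first nodes show ~zero load.
# SQUEUE_SAMPLE stands in for `squeue -h -t R -o "%i %D %N"` output;
# NODE_LOAD stands in for a per-node load source such as pestat.

SQUEUE_SAMPLE = """\
1001 1 node001
1002 3 node[002-004]
1003 2 node[010-011]
"""

# Hypothetical 1-minute load per node.
NODE_LOAD = {
    "node001": 15.8,
    "node002": 31.9, "node003": 0.1, "node004": 0.0,
    "node010": 15.7, "node011": 16.2,
}

def expand(nodelist):
    """Expand a simple Slurm nodelist like 'node[002-004]' or 'node001'."""
    if "[" not in nodelist:
        return [nodelist]
    prefix, rng = nodelist.rstrip("]").split("[")
    lo, hi = rng.split("-")
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

def suspect_jobs(squeue_text, load, threshold=0.5):
    """Return IDs of multi-node jobs whose nodes beyond the first are all
    below the load threshold, i.e. candidates for misconfigured non-MPI jobs."""
    suspects = []
    for line in squeue_text.splitlines():
        jobid, nnodes, nodelist = line.split()
        nodes = expand(nodelist)
        if int(nnodes) > 1 and all(load[n] < threshold for n in nodes[1:]):
            suspects.append(jobid)
    return suspects

print(suspect_jobs(SQUEUE_SAMPLE, NODE_LOAD))  # → ['1002']
```

As you point out, a real version would need to tolerate MPI jobs that
have single-core phases, e.g. by only flagging jobs that stay on the
list across several sampling intervals.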