Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron,

byron  writes:

> Hi Loris - about a second

What is the use-case for that?  Are these individual jobs or is it a job
array?  Either way it sounds to me like a very bad idea.  On our system,
jobs which can start immediately because resources are available still
take a few seconds to start running (I'm looking at the values for
'submit' and 'start' from 'sacct').  If a one-second job has to wait for
just a minute, the ratio of wait-time to run-time is already
disproportionately large.

Why doesn't the user bundle these individual jobs together?  Depending
on your maximum run-time and the degree to which jobs can make use of
backfill, I would tell the user to bundle them into something between a
single job and maybe 100 jobs.  I certainly wouldn't allow one-second
jobs in any significant numbers on our system.
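
For illustration, a minimal sketch of what bundling could look like (the
task command, count and time limit are just placeholders for whatever the
user actually runs):

  #!/bin/bash
  #SBATCH --job-name=bundled_short_tasks
  #SBATCH --ntasks=1
  #SBATCH --time=01:00:00

  # Run 1000 of the ~1-second tasks inside a single job instead of
  # submitting 1000 separate jobs.
  for i in $(seq 1 1000); do
      ./short_task "$i"
  done

That way the scheduler and slurmdbd see one job record instead of
thousands.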

I think having a job starting every second is causing your slurmdbd to
time out, and that is the error you are seeing.

Regards

Loris

> On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett  
> wrote:
>
>  Hi Byron,
>
>  byron  writes:
>
>  > Hi 
>  >
>  > We recently upgraded Slurm from 19.05.7 to 20.11.9, and now we occasionally
> (3 times in 2 months) have slurmctld hanging, so we get the following message
> when running sinfo:
>  >
>  > “slurm_load_jobs error: Socket timed out on send/recv operation”
>  >
>  > It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day).  Although
> that could be a red herring.
>
>  What's your definition of a 'short-lived job'?
>
>  > There is nothing to be found in the slurmctld log.
>  >
>  > Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
>  >
>  > Thanks
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Herr/Mr)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] (no subject)

2022-07-28 Thread GRANGER Nicolas
I have no experience with this, but based on my understanding of the docs, the
shutdown command should be something like "ssh ${node} systemctl poweroff", and
the resume command something like "ipmitool -I lan -H ${node}-bmc -U <username>
-f password_file.txt chassis power on".
If you use libvirt for your virtual cluster, you can test waking nodes up via
IPMI using VirtualBMC (https://github.com/openstack/virtualbmc); the
documentation is a bit terse, unfortunately.
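
For reference, a minimal sketch of what such SuspendProgram/ResumeProgram
scripts might look like.  Everything here is a placeholder (script paths,
BMC hostname pattern, IPMI user, password file), and slurm.conf would need
to point at the scripts, e.g. SuspendProgram=/etc/slurm/suspend.sh and
ResumeProgram=/etc/slurm/resume.sh, plus a suitable SuspendTime:

  #!/bin/bash
  # /etc/slurm/suspend.sh - slurmctld passes a hostlist expression in $1,
  # e.g. "node[01-04]"; expand it and power each node off over ssh.
  # (Assumes the invoking user is allowed to power the node off.)
  for host in $(scontrol show hostnames "$1"); do
      ssh "$host" systemctl poweroff
  done

  #!/bin/bash
  # /etc/slurm/resume.sh - power the nodes back on through their BMCs.
  for host in $(scontrol show hostnames "$1"); do
      ipmitool -I lan -H "${host}-bmc" -U admin \
          -f /etc/slurm/ipmi_password chassis power on
  done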

Best,
Nicolas Granger

On Thursday, 28 July 2022 at 11:49 -0400, Djamil Lakhdar-Hamina wrote:
I am helping set up a 16-node cluster computing system. I am not a system admin,
but I work for a small firm and unfortunately have to pick up needed skills fast
in things I have little experience in. I am running Rocky Linux 8 on Intel Xeon
Knights Landing nodes donated by the TAAC center. We are operating in Uganda,
where we have limited resources and where power is quite expensive.

What are some good ways to implement power-saving? I have already tried power
saving as per Slurm's power saving guide, but 1) I am not quite sure what it
does, and 2) in implementing a version in my virtual dev environment I was able
to get the power saving to stand down nodes, but I was not able to get the power
saving mechanism to spin them back up when needed. I put power saving in the
slurm.conf file, and I also specified a SuspendProgram and a ResumeProgram
similar to the ones in https://slurm.schedmd.com/power_save.html.

So: 1) How do I get this power-saving mechanism to work, and what exactly will
it do? I see that it stands nodes down; will it spin them back up when those
resources are requested? 2) Are there better techniques for power saving, say
using ipmitool or something?

Sincerely,
Djamil Lakhdar-Hamina



Re: [slurm-users] (no subject)

2022-07-28 Thread Benson Muite

On 7/28/22 18:49, Djamil Lakhdar-Hamina wrote:
> I am helping set up a 16-node cluster computing system. I am not a
> system admin, but I work for a small firm and unfortunately have to pick
> up needed skills fast in things I have little experience in. I am
> running Rocky Linux 8 on Intel Xeon Knights Landing nodes donated by
> the TAAC center. We are operating in Uganda, where we have limited
> resources and where power is quite expensive.
It may be helpful to check whether data center co-location is a
solution.  Uganda generates a lot of hydroelectric power; distribution
is what increases the cost.

https://en.wikipedia.org/wiki/List_of_power_stations_in_Uganda


> What are some good ways to implement power-saving?

Do you have the exact specifications of the host chips and accelerators?

> I have already tried power saving as per Slurm's power saving guide,
> but 1) I am not quite sure what it does, and 2) in implementing a
> version in my virtual dev environment I was able to get the power
> saving to stand down nodes, but I was not able to get the power saving
> mechanism to spin them back up when needed. I put power saving in the
> slurm.conf file, and I also specified a SuspendProgram and a
> ResumeProgram similar to the ones in
> https://slurm.schedmd.com/power_save.html.


> So: 1) How do I get this power-saving mechanism to work, and what
> exactly will it do? I see that it stands nodes down; will it spin them
> back up when those resources are requested? 2) Are there better
> techniques for power saving, say using ipmitool or something?

See https://www.icl.utk.edu/files/publications/2017/icl-utk-979-2017.pdf
It may be helpful to measure power use directly for the most common
applications.  You might also check whether the system will be fully
utilized, and if not, enable jobs to run at off-peak times when energy
costs are lower.


Based on a price of $0.13 per kWh, full utilization 5 days a week, 8
hours a day, 52 weeks per year, and roughly 500 W per node, electricity
for the 16 nodes comes to about $2,000 per year.
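
Spelling that out (same assumptions, 0.5 kW per node):

  8 h/day x 5 days/week x 52 weeks  ~ 2,080 h/year
  0.5 kW x 2,080 h                  ~ 1,040 kWh per node per year
  1,040 kWh x $0.13/kWh             ~ $135 per node, or roughly $2,160
                                      for all 16 nodes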


> Sincerely,
> Djamil Lakhdar-Hamina





[slurm-users] (no subject)

2022-07-28 Thread Djamil Lakhdar-Hamina
I am helping set up a 16-node cluster computing system. I am not a
system admin, but I work for a small firm and unfortunately have to pick up
needed skills fast in things I have little experience in. I am running
Rocky Linux 8 on Intel Xeon Knights Landing nodes donated by the TAAC
center. We are operating in Uganda, where we have limited resources and
where power is quite expensive.

What are some good ways to implement power-saving? I have already tried
power saving as per Slurm's power saving guide, but 1) I am not quite sure
what it does, and 2) in implementing a version in my virtual dev environment
I was able to get the power saving to stand down nodes, but I was not able
to get the power saving mechanism to spin them back up when needed. I put
power saving in the slurm.conf file, and I also specified a SuspendProgram
and a ResumeProgram similar to the ones in
https://slurm.schedmd.com/power_save.html.

So: 1) How do I get this power-saving mechanism to work, and what exactly
will it do? I see that it stands nodes down; will it spin them back up when
those resources are requested? 2) Are there better techniques for power
saving, say using ipmitool or something?

Sincerely,
Djamil Lakhdar-Hamina


Re: [slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi Loris - about a second

On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett 
wrote:

> Hi Byron,
>
> byron  writes:
>
> > Hi
> >
> > We recently upgraded Slurm from 19.05.7 to 20.11.9, and now we
> occasionally (3 times in 2 months) have slurmctld hanging, so we get the
> following message when running sinfo:
> >
> > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >
> > It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day).  Although
> that could be a red herring.
>
> What's your definition of a 'short-lived job'?
>
> > There is nothing to be found in the slurmctld log.
> >
> > Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
> >
> > Thanks
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>


Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Fulcomer, Samuel
Hi Byron,

We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:

net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192


# prevent neighbour (aka ARP) table overflow...

net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.default.mcast_solicit = 9
net.ipv4.neigh.default.ucast_solicit = 9
net.ipv4.neigh.default.gc_stale_time = 86400
net.ipv4.neigh.eth0.mcast_solicit = 9
net.ipv4.neigh.eth0.ucast_solicit = 9
net.ipv4.neigh.eth0.gc_stale_time = 86400

# enable selective ack algorithm
net.ipv4.tcp_sack = 1

# workaround TIME_WAIT
net.ipv4.tcp_tw_reuse = 1
# and since all traffic is local
net.ipv4.tcp_fin_timeout = 20
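
After editing /etc/sysctl.conf, these can be applied without a reboot with
something like:

  sysctl -p /etc/sysctl.conf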


We have a 16-bit (/16) cluster network, so the ARP settings date to that.
tcp_sack is more of a legacy setting from the days when some kernels didn't
set it.

You would likely see tons of connections in TIME_WAIT if you ran "netstat
-a" during the periods when you're seeing the hangs.  Our workaround
settings seem to have mitigated that (see the quick check below).
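
For example, a quick way to count those connections on the controller
(either command should work; ss reports the state as TIME-WAIT, netstat as
TIME_WAIT):

  ss -tan | grep -c TIME-WAIT
  # or, with net-tools installed:
  netstat -ant | grep -c TIME_WAIT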



On Thu, Jul 28, 2022 at 9:29 AM byron  wrote:

> Hi
>
> We recently upgraded Slurm from 19.05.7 to 20.11.9, and now we occasionally
> (3 times in 2 months) have slurmctld hanging, so we get the following
> message when running sinfo:
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day).  Although
> that could be a red herring.
>
> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
>
> Thanks
>
>


Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron,

byron  writes:

> Hi 
>
> We recently upgraded Slurm from 19.05.7 to 20.11.9, and now we occasionally
> (3 times in 2 months) have slurmctld hanging, so we get the following message
> when running sinfo:
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day).  Although
> that could be a red herring.

What's your definition of a 'short-lived job'?

> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this?  Without anything
> in the logs I don't know where to start.
>
> Thanks

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi

We recently upgraded Slurm from 19.05.7 to 20.11.9, and now we occasionally
(3 times in 2 months) have slurmctld hanging, so we get the following
message when running sinfo:

“slurm_load_jobs error: Socket timed out on send/recv operation”

It only seems to happen when one of our users runs a job that submits a
short-lived job every second for 5 days (up to 90,000 in a day).  Although
that could be a red herring.

There is nothing to be found in the slurmctld log.

Can anyone suggest how to even start troubleshooting this?  Without
anything in the logs I don't know where to start.

Thanks