Re: [slurm-users] [External] Munge thinks clocks aren't synced

2020-10-27 Thread Barbara Krašovec
Rewound credential error means that the credential appears to have been encoded more than TTL seconds in the future (the default munge TTL is 5 minutes). So the clock on the decoding host is behind the clock on the encoding host. You can try to run munge with a different TTL (munge -t) just to verify if it i
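
The time check described above can be sketched as follows. This is a minimal illustration of the idea, not munge's actual code: the function name, the enum, and the exact boundary comparisons are assumptions made for this example. Only the 5-minute default TTL and the meaning of "rewound" come from the message.

```python
from enum import Enum

# 5 minutes, the default munge TTL mentioned in the message above.
DEFAULT_TTL_SECONDS = 300


class CredentialStatus(Enum):
    VALID = "valid"
    REWOUND = "rewound"   # encoded too far in the decoder's future
    EXPIRED = "expired"   # encoded too far in the decoder's past


def classify_credential(encode_time, decode_time, ttl=DEFAULT_TTL_SECONDS):
    """Classify a credential from two Unix timestamps (hypothetical helper).

    encode_time is the encoding host's clock when the credential was made;
    decode_time is the decoding host's clock when it is checked.
    """
    if encode_time - decode_time > ttl:
        # The decoder's clock is behind the encoder's by more than the TTL.
        return CredentialStatus.REWOUND
    if decode_time - encode_time > ttl:
        # The credential is older than the TTL from the decoder's viewpoint.
        return CredentialStatus.EXPIRED
    return CredentialStatus.VALID


if __name__ == "__main__":
    # Decoder clock is 400 s behind the encoder clock: rewound with a 300 s TTL.
    print(classify_credential(encode_time=1000, decode_time=600))
    # The same skew is tolerated with a larger TTL, as when testing with munge -t.
    print(classify_credential(encode_time=1000, decode_time=600, ttl=600))
```

A larger TTL only hides the skew for testing; the underlying fix is still to synchronize the clocks on both hosts.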

Re: [slurm-users] X11 forwarding issues

2020-11-16 Thread Barbara Krašovec
Hello, check sshd settings (here are ours): X11Forwarding yes X11DisplayOffset 10 *X11UseLocalhost no* Add PrologFlags in slurm.conf: PrologFlags=x11 Cheers, Barbara On 11/16/20 7:20 PM, Russell Jones wrote: Here's some debug logs from the compute node after launching an interactive shel

Re: [slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

2021-06-23 Thread Barbara Krašovec
Just in case, increase SlurmdTimeout in slurm.conf (so that when the controller is back, you have time to fix any communication issues between slurmd and slurmctld). Otherwise it should not affect running and pending jobs. First stop controller, then slur

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
I was struggling like crazy with this one a while ago. Then I saw this in the slurm.conf man page: AccountingStoragePass The password used to gain access to the database to store the accounting data. Only used for database type storage plugins, ignored otherwise. In the case of

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
Hello, does munge work? Check whether decoding works locally: munge -n | unmunge Check whether decoding works remotely: munge -n | ssh <host> unmunge It seems that the munge keys do not match... See comments inline.. > On 29 Nov 2017, at 14:40, Bruno Santos wrote: > > I actually just managed to figure that one out. > > T

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Barbara Krašovec
correctly. Please check your database > connection and try again. > > The problem seems to somehow be related to slurmdbd? > I am a bit lost at this point, to be honest. > > Best, > Bruno > > On 29 November 2017 at 14:06, Barbara Krašovec <mailto:barbara.kraso.

Re: [slurm-users] Slurm resource limits on jobs

2019-05-07 Thread Barbara Krašovec
Resources are limited with cgroups in SLURM. Check the documentation: https://slurm.schedmd.com/cgroups.html You simply specify ProctrackType=proctrack/cgroup and/or TaskPlugin=task/cgroup in slurm.conf and then configure which resources are limited, and by how much, in cgroup.conf: https://slurm

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Barbara Krašovec
It could be a problem with the ARP cache. If the number of devices approaches 512, a kernel limit on the size of the dynamic ARP cache can result in the loss of connectivity between nodes. By default, the garbage collector will not run if the number of entries in the cache is less than 128: *g
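
To see how close a node is to these limits, something like the following sketch could help. It assumes a Linux host where the thresholds are exposed as gc_thresh1, gc_thresh2 and gc_thresh3 under /proc/sys/net/ipv4/neigh/default and the IPv4 neighbour table is readable from /proc/net/arp. The helper names and the wording of the status messages are made up for this example, and the boundary comparisons are a simplification of what the kernel does.

```python
from pathlib import Path

# Assumed locations on a Linux host; both can be overridden for testing.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")
THRESHOLD_NAMES = ("gc_thresh1", "gc_thresh2", "gc_thresh3")


def read_thresholds(neigh_dir=NEIGH_DIR):
    """Return the three ARP cache thresholds as a dict of integers."""
    return {
        name: int((Path(neigh_dir) / name).read_text().split()[0])
        for name in THRESHOLD_NAMES
    }


def count_arp_entries(arp_table=ARP_TABLE):
    """Count entries in the ARP table, skipping the one-line header."""
    lines = Path(arp_table).read_text().splitlines()
    return sum(1 for line in lines[1:] if line.strip())


def describe_pressure(entries, thresholds):
    """Give a rough, human-readable status for the current entry count."""
    if entries >= thresholds["gc_thresh3"]:
        return "at hard limit: new entries may be refused"
    if entries >= thresholds["gc_thresh2"]:
        return "above soft limit: entries are collected aggressively"
    if entries >= thresholds["gc_thresh1"]:
        return "garbage collector may run"
    return "below gc_thresh1: garbage collector does not run"


if __name__ == "__main__":
    limits = read_thresholds()
    current = count_arp_entries()
    print(f"{current} entries, thresholds {limits}: "
          f"{describe_pressure(current, limits)}")
```

If the count on a busy node sits near the soft limit, raising the thresholds with sysctl is the usual next step; the right values depend on the size of the cluster network.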

Re: [slurm-users] Backfill CPU jobs on GPU nodes

2019-07-19 Thread Barbara Krašovec
You could limit the resources with a QOS. The limits are not per node, but you have some options: https://slurm.schedmd.com/qos.html#limits Otherwise you could just enforce the limits per partition and set weights on the nodes, so that the CPU nodes are allocated before the GPU nodes. Have you checked t

Re: [slurm-users] Trouble installing slurm-19.05.1-2.el7.centos.x86_64

2019-08-11 Thread Barbara Krašovec
What if you try to run ldconfig manually before building the rpm? Cheers, Barbara On 8/8/19 5:57 PM, Lou Nicotra wrote: > I am running into an error while trying to > install slurm-19.05.1-2.el7.centos.x86_64... Error is as follows: > root@panther02 x86_64# rpm -Uvh slurm-19.05.1-2.el7.centos.x8

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-08-13 Thread Barbara Krašovec
We have SLURM 19.05 and implemented the cons_tres scheduling type. It works only when --gpus-per-node is specified at job submission. And there are many more options. I found this presentation to be quite informative: https://slurm.schedmd.com/SLUG18/cons_tres.pdf We still have the gr

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Barbara Krašovec
I solved the problem by creating a symlink: ln -s /usr/lib64/libmariadbclient.a /usr/lib64/libmariadb.a Cheers, Barbara > On 11 Nov 2019, at 21:23, William Brown wrote: > > I have in fact found the answer by looking harder. > > The config.log clearly showed that the build of the test MySQL pro

Re: [slurm-users] [EXTERNAL] problems with OpenMPI 4.0.3

2020-06-02 Thread Barbara Krašovec
Afaik, there were some problems with certain versions of UCX, where UCX expected OPAL memory hooks from OMPI, but they were disabled and the physical pages became out-of-sync. But I don't know if this is the case. Maybe you could run dynamic debug to see if there is something useful in dmesg: ech