Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Mark Hahn
> I found a great mini script to disable hyperthreading without reboot. I did get the following warning but
Offlining half the hyperthreads is not the same thing as disabling HT. HT is a fundamental mode of the CPU, and enabling it will statically
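
A minimal sketch of what such an offlining script typically does, using standard Linux sysfs paths and run as root (this is not the exact script referenced above):

    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        # thread_siblings_list is e.g. "0,24"; keep the first sibling, offline the rest
        siblings=$(cat "$cpu/topology/thread_siblings_list")
        first=${siblings%%[,-]*}
        [ "${cpu##*/cpu}" != "$first" ] && echo 0 > "$cpu/online"
    done

As the reply points out, this only takes half the logical CPUs offline at runtime; the BIOS-level HT setting is unchanged.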

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Michael Robbert
I’m pretty sure that you should only need to restart slurmd on the node that was reporting the problem. If it put the node into a drained state, you may need to manually undrain it using scontrol. Testing job performance is not the job of the scheduler; it just schedules the jobs that you
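
For reference, the restart and undrain steps are one-liners (node name taken from this thread's example):

    systemctl restart slurmd                          # on node003 itself
    scontrol update NodeName=node003 State=RESUME     # from a host with scontrol access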

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
On Thu, Apr 23, 2020 at 1:43 PM Michael Robbert wrote: > It looks like you have hyper-threading turned on, but haven’t defined > ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS > or change the definition of ThreadsPerCore in slurm.conf. Nice find. node003 has

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Michael Robbert
It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf. Mike
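
A sketch of what the node definition in slurm.conf might look like for this hardware, assuming 2 sockets x 12 cores x 2 threads = 48 logical CPUs (the core count is inferred from the 24:48 figures in the subject, not stated in the thread):

    NodeName=node003 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 ...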

Re: [slurm-users] Limit the number of GPUS per user per partition

2020-04-23 Thread Killian Murphy
Hi Thomas. We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. I.e., we have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit the total number of
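
A sketch of the setup being described, with an illustrative partition name ("gpu") and the sacctmgr TRES syntax as I understand it:

    sacctmgr add qos gpujobs
    sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
    # slurm.conf: attach the QOS to the GPU partition
    #   PartitionName=gpu Nodes=gpu[01-10] QOS=gpujobs ...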

[slurm-users] Limit the number of GPUS per user per partition

2020-04-23 Thread Theis, Thomas
Hi everyone, First message: I am trying to find a good way, or multiple ways, to limit the usage of jobs per node or the use of GPUs per node, without blocking a user from submitting them. Example: We have 10 nodes, each with 4 GPUs, in a partition. We allow a team of 6 people to submit jobs to any or

Re: [slurm-users] Munge decode failing on new node

2020-04-23 Thread Dean Schulze
I went through the exercise of making the other user the same on the slurmctld node as on the slurmd nodes, but that had no effect. I still have 3 nodes that have connectivity and one node where slurmd cannot contact slurmctld. That node has ssh connectivity to and from the slurmctld node, but no slurm
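
The standard MUNGE cross-node check (from the MUNGE documentation) is to encode a credential on one host and decode it on the other, in both directions; a failure here usually points at a mismatched /etc/munge/munge.key or clock skew (the node name is a placeholder):

    munge -n | ssh node04 unmunge
    ssh node04 munge -n | unmunge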

[slurm-users] Require help in setting up Priority in slurm

2020-04-23 Thread Sudeep Narayan Banerjee
Dear All: I want to set up priority queuing in Slurm (slurm-18.08.7). Say one user, userA, from group USER1-grp has submitted 4 jobs that are running; this same user userA has also submitted 4 more jobs that are in PD status. Now userB from User2-grp wants to submit a job that should get top priority
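
One common approach (a sketch, not an answer from this thread) is to give selected jobs a higher-priority QOS and weight QOS in the multifactor priority plugin:

    sacctmgr add qos highprio
    sacctmgr modify qos highprio set Priority=1000
    # slurm.conf:
    #   PriorityType=priority/multifactor
    #   PriorityWeightQOS=10000
    # userB would then submit with --qos=highprio to sort ahead of pending jobs.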

[slurm-users] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Robert Kudyba
Running Slurm 20.02 on CentOS 7.7 on Bright Cluster 8.2. slurm.conf is on the head node. I don't see these errors on the other 2 nodes. After restarting slurmd on node003 I see this: slurmd[400766]: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw)
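
Running "slurmd -C" on the node prints the hardware layout slurmd actually detects, in slurm.conf syntax, which makes it easy to compare against the configured NodeName line (the output below is illustrative, not from the thread):

    $ slurmd -C
    NodeName=node003 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=...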

Re: [slurm-users] floating condo partition, , no pre-emption, guarantee a max pend time?

2020-04-23 Thread Paul Edmon
You could probably satisfy this by using a combination of fairshare and QoSes. You could also tier partitions, with a priority partition above a normal partition, and then set a QoS on the priority partition limiting its maximum size. You would naturally want to turn off preemption so that only
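
A sketch of the tiered-partition idea with a size-limited QOS (all names and numbers are made up):

    # slurm.conf
    #   PreemptMode=OFF
    #   PartitionName=condo  Nodes=node[001-010] PriorityTier=10 QOS=condo
    #   PartitionName=normal Nodes=node[001-010] PriorityTier=1  Default=YES
    sacctmgr add qos condo
    sacctmgr modify qos condo set GrpTRES=cpu=480    # cap the floating condo share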

Re: [slurm-users] Show detailed information from a finished job

2020-04-23 Thread E.M. Dragowsky
Hi, everyone -- Our take on using epilog is likely familiar to many, but perhaps not all. Here is an extract from the epilog (/usr/local/slurm/epilogctld): /usr/bin/scontrol show job=$SLURM_JOB_ID --oneliner >> /usr/local/slurm/slurmrecord/$((SLURM_JOB_ID/1)).record The file size may be adjusted.
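
A minimal sketch of an EpilogSlurmctld-style wrapper around that line (paths are illustrative; the naming of the record file is site-specific):

    #!/bin/bash
    RECDIR=/usr/local/slurm/slurmrecord
    mkdir -p "$RECDIR"
    /usr/bin/scontrol show job="$SLURM_JOB_ID" --oneliner >> "$RECDIR/$SLURM_JOB_ID.record"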

Re: [slurm-users] Show detailed information from a finished job

2020-04-23 Thread mercan
Sorry, I mistakenly cropped the "mkdir" line below: mkdir -p $JDIR. It should come after the "JDIR=/okyanus/..." line. Regards, Ahmet M. On 23.04.2020 12:31, mercan wrote: Hi; I prefer to use an epilog script to store the job information in a top directory owned by the slurm user. To avoid a

Re: [slurm-users] Show detailed information from a finished job

2020-04-23 Thread mercan
Hi; I prefer to use an epilog script to store the job information in a top directory owned by the slurm user. To avoid a directory with a lot of files, it creates a sub-directory per thousand jobs. For a job whose jobid is 230988, it creates a directory named 230XXX. Also the
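
A sketch of the "one sub-directory per thousand jobs" layout described above (the top-level path is illustrative):

    TOPDIR=/var/spool/slurm/jobrecords
    JDIR="$TOPDIR/$((SLURM_JOB_ID / 1000))XXX"     # e.g. jobid 230988 -> 230XXX
    mkdir -p "$JDIR"
    scontrol show job="$SLURM_JOB_ID" --oneliner > "$JDIR/$SLURM_JOB_ID"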

[slurm-users] Reading which GPUs were assigned to which job

2020-04-23 Thread Holtgrewe, Manuel
Dear all, is it possible to find out which GPU was assigned to which job through squeue or sacct? My motivation is as follows: some users write jobs with bad resource usage (e.g., 1h CPU to precompute, followed by 1h GPU to process, and so on). I don't care so much about CPUs at the moment as
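
For a running job, the detailed job view shows the allocated GPU indices; for finished jobs, sacct only reports GPU counts through the TRES fields (exact output varies by Slurm version; the job id is a placeholder):

    scontrol -d show job <jobid>             # look for the GRES=...(IDX:...) part
    sacct -j <jobid> -o JobID,AllocTRES%40   # counts only, e.g. gres/gpu=2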

Re: [slurm-users] Show detailed information from a finished job

2020-04-23 Thread Marcus Wagner
How about sacct -o ALL? On 23.04.2020 at 09:33, Gestió Servidors wrote: Hello, When a job is “pending” or “running”, with “scontrol show jobid=#jobnumber” I can get some useful information, but when the job has finished, that command doesn’t return anything. For example, if I run a
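
A few commonly useful fields for a finished job (the full field list can be printed with "sacct -e"; the job id is a placeholder):

    sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS,NodeList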

[slurm-users] Slurm unlink error messages -- what do they mean?

2020-04-23 Thread David Baker
Hello, We have, rather belatedly, just upgraded to Slurm v19.05.5. On the whole, so far so good -- no major problems. One user has complained that his job now crashes and reports an unlink error. That is: slurmstepd: error: get_exit_code task 0 died by signal: 9 slurmstepd: error:

[slurm-users] Show detailed information from a finished job

2020-04-23 Thread Gestió Servidors
Hello, When a job is "pending" or "running", with "scontrol show jobid=#jobnumber" I can get some useful information, but when the job has finished, that command doesn't return anything. For example, if I run a "sacct" and I see that some jobs have finished with state "FAILED", how can I get

Re: [slurm-users] Munge decode failing on new node

2020-04-23 Thread Gennaro Oliva
Hi Dean, On Wed, Apr 22, 2020 at 07:28:15PM -0600, dean.w.schu...@gmail.com wrote: > Even for users other than slurm and munge? It seems strange that 3 of > 4 worker nodes work with the same UIDs/GIDs as the non-working nodes. As in: https://slurm.schedmd.com/quickstart_admin.html Super Quick

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-23 Thread Ole Holm Nielsen
Hi Michael, Thanks for your insightful explanation of the Slurm RPM build process! This clarified the topic a lot for me. I have updated my Slurm installation Wiki page based upon your information: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms /Ole On 21-04-2020
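
For reference, the core of the RPM build described on that page is the documented one-liner (the version matches the subject line; the resulting RPMs land under ~/rpmbuild/RPMS/ by default):

    rpmbuild -ta slurm-20.02.1.tar.bz2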