Re: [slurm-users] Wrong hwloc detected?

2021-11-07 Thread Ole Holm Nielsen
/Slurm_installation#install-prerequisites /Ole On 05-11-2021 15:38, Diego Zuccato wrote: They aren't using modules so it must be something system-wide :( But not all jobs are impacted. And it seems it's a bit random (doesn't happen always). I'm out of ideas, currently :( On 05/11/2021 13:10, Ole Holm

Re: [slurm-users] Wrong hwloc detected?

2021-11-05 Thread Ole Holm Nielsen
On 11/5/21 12:47, Diego Zuccato wrote: Some users are reporting this error: slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5 slurmstepd-str957-mtx-01: error: task[0] unable to set taskset

Re: [slurm-users] Possible to get cluster utilization by partition?

2021-11-05 Thread Ole Holm Nielsen
Hi Dave, On 11/4/21 21:47, Chin,David wrote: I am running Slurm 20.02.7. I would like to generate cluster utilization report based on the billing TRES, but separated by partition. I can get full cluster utilization using:     sreport cluster utilization -T billing start=2021-01-01

Re: [slurm-users] Additional feature request/ need of how to

2021-10-22 Thread Ole Holm Nielsen
On 22-10-2021 14:45, BELLENCONTRE, FREDERIC wrote: I want to know, on each node, the job that will terminate last and get its planned completion date (it corresponds to the date when we could pass from draining status to drained status if no new jobs and no premature end) -eventually

Re: [slurm-users] Missing data in sreport for a time period in slurm

2021-10-18 Thread Ole Holm Nielsen
On 10/18/21 12:41 PM, mshubham wrote: Dear all, I am facing an issue in Slurm (v19.05.1) in which data from 26 May 2020 to 14 Sept 2021 is missing in sreport, but the same data is present through the sacct command. It was working fine a few days ago. Right now, we have to get data utilization

Re: [slurm-users] How to look for free nodes of a certain constraint efficiently

2021-10-14 Thread Ole Holm Nielsen
Hi Matt, How about this sinfo command:
    $ sinfo -O NodeList:30,Features:30,StateLong
    NODELIST            AVAIL_FEATURES                 STATE
    i023                xeon2650v2,infiniband,xeon16   draining@
    i[004-022,024-050]  xeon2650v2,infiniband,xeon16   allocated
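A minimal Python sketch of how the output of the sinfo command above could be filtered for free nodes with a given feature. It is not part of the original reply: the function name `free_nodes`, the 30-character column widths (taken from the `:30` suffixes in the command), the exact match on the state `idle`, and the idle sample row are all assumptions for illustration.

```python
def free_nodes(sinfo_text, feature, width=30):
    """Return node-list strings that are idle and advertise `feature`.

    Assumes fixed-width columns as produced by
    `sinfo -O NodeList:30,Features:30,StateLong` (an assumption; adjust
    `width` if other field widths are used).
    """
    matches = []
    for line in sinfo_text.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue
        nodelist = line[:width].strip()
        features = line[width:2 * width].strip().split(",")
        state = line[2 * width:].strip()
        # Exact match only: states with suffixes such as "draining@" are skipped.
        if state == "idle" and feature in features:
            matches.append(nodelist)
    return matches


# Sample text: the first two data rows come from the message above,
# the idle row is hypothetical.
SAMPLE = (
    f"{'NODELIST':<30}{'AVAIL_FEATURES':<30}{'STATE':<20}\n"
    f"{'i023':<30}{'xeon2650v2,infiniband,xeon16':<30}{'draining@':<20}\n"
    f"{'i[004-022,024-050]':<30}{'xeon2650v2,infiniband,xeon16':<30}{'allocated':<20}\n"
    f"{'i[051-060]':<30}{'xeon2650v2,infiniband,xeon16':<30}{'idle':<20}\n"
)

print(free_nodes(SAMPLE, "xeon16"))
```

On a real cluster the text would come from running the sinfo command itself; the sketch only handles the parsing step.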

Re: [slurm-users] job is pending but resources are available

2021-10-13 Thread Ole Holm Nielsen
On 10/13/21 9:59 AM, Adam Xu wrote: On 2021/10/13 9:22, Brian Andrus wrote: Something is very odd when you have the node reporting: RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1 What do you get when you run ‘slurmd -C’ on the node? # slurmd -C NodeName=apollo CPUs=36 Boards=1

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-07 Thread Ole Holm Nielsen
al Message----- From: Ole Holm Nielsen Sent: 07 October 2021 11:34 To: Slurm User Community List Cc: pankajd Subject: Re: [slurm-users] Equivalent command for showbf command of maui in slurm The "showbf --help" does not explain the meaning of the output of "showbf -S"

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-07 Thread Ole Holm Nielsen
4322 1 64322 3:16:11:01 . . and so on On October 6, 2021 at 11:42 PM Ole Holm Nielsen wrote: > On 06-10-2021 18:42, Pankaj Dorlikar wrote: > > We would like to know the free / available resources (cpu and GPUs) in > > slurm. In torque/maui, showbf –S command gives the similar

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-06 Thread Ole Holm Nielsen
On 06-10-2021 18:42, Pankaj Dorlikar wrote: We would like to know the free / available resources (CPUs and GPUs) in Slurm. In torque/maui, the showbf –S command gives similar output. What command / command line should be used to check available / free resources including node number and its

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-05 Thread Ole Holm Nielsen
On 10/5/21 8:05 AM, Diego Zuccato wrote: I already tried multiple times, both RESUME and IDLE, and it didn't work: it just returned to "IDLE+DRAIN" with 'Reason="low realmem"'. :( I just tried again (after an unplanned shutdown of the frontend) and it What is a "frontend"? Do you mean the

Re: [slurm-users] Error when upgrading to 21.08.1

2021-09-23 Thread Ole Holm Nielsen
On 23-09-2021 16:01, Hoot Thompson wrote: In upgrading to 21.08.1, slurmctld status reports: Sep 23 13:49:52 ip-10-10-7-17 systemd[1]: Started Slurm controller daemon. Sep 23 13:49:52 ip-10-10-7-17 slurmctld[1323]: fatal: Unable to find plugin: serializer/json Sep 23 13:49:52 ip-10-10-7-17

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Ole Holm Nielsen
On 9/21/21 9:11 AM, Amjad Syed wrote: We have users who have defined a secondary Unix group ID on our login nodes. [vas20xhu@login01 ~]$ groups BIO_pg BIO_AFMAKAY_LAB_USERS But when we run an interactive job and go to a compute node, the user does not have the secondary group BIO_AFMAKAY_LAB_USERS

Re: [slurm-users] free resources

2021-08-26 Thread Ole Holm Nielsen
On 26-08-2021 08:01, Pankaj Dorlikar wrote: We are using slurm-20.11.7 on an Ubuntu system with GPUs. What is the equivalent of “showbf –S” in maui, or any command in Slurm for checking the free resources? Maybe "sinfo -t idle"? The showbf manual doesn't document any -S flag:

Re: [slurm-users] free resources

2021-08-26 Thread Ole Holm Nielsen
documentation for it? /Ole -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: 26 August 2021 12:41 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] free resources On 26-08-2021 08:01, Pankaj Dorlikar wrote: We are using slurm-20.11.7 on an Ubuntu system havin

Re: [slurm-users] Slurm does not start after (stupid) upgrade from 16.05.9 to 20.11.7

2021-08-25 Thread Ole Holm Nielsen
On 8/25/21 10:48 AM, Julien Tailleur wrote: We have been running a computing cluster using slurm since 2016, that I installed back then, with some help from others. I was pretty late on upgrades and decided to upgrade the cluster up to debian Bullseye, which runs slurm 20.11.7, starting from

Re: [slurm-users] sacct output in tabular form

2021-08-25 Thread Ole Holm Nielsen
Hi Sven, On 8/25/21 7:41 AM, Sternberger, Sven wrote: this is a simple wrapper for sacct which prints the output from sacct as a table. So you can make a "sacctml -j foo --long" even without two 8k displays ;-) This script works nicely, thanks! However, instead of an extremely wide display

Re: [slurm-users] Submit time instead of Start time for sacct

2021-08-09 Thread Ole Holm Nielsen
On 09-08-2021 17:24, Amjad Syed wrote: I am trying to filter the number of jobs submitted in a month, not jobs that started. If I use
    sacct -S 2021-07-07 -E 2021-08-07 --format=jobID,Submit -D
    JobID        Submit
    ------------ -------------------
    7274903      2021-06-09T11:30:46
I get jobs that were
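One way to select jobs by submit time rather than start time is to post-filter sacct output on the Submit field. The Python sketch below is an illustration, not something from the thread: it assumes pipe-delimited text as produced by `sacct -P --format=JobID,Submit`, the function name `submitted_between` is hypothetical, and the second sample row is made up.

```python
from datetime import datetime


def submitted_between(sacct_text, start, end):
    """Return JobIDs whose Submit time satisfies start <= Submit < end.

    Assumes pipe-delimited input with a header line and the columns
    JobID|Submit (an assumption about how sacct was invoked).
    """
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    jobs = []
    for line in sacct_text.strip().splitlines()[1:]:  # skip the header line
        jobid, submit = line.split("|")[:2]
        if lo <= datetime.fromisoformat(submit) < hi:
            jobs.append(jobid)
    return jobs


# First row: the job shown in the message above (submitted in June).
# Second row: a hypothetical job submitted inside the requested window.
SAMPLE = (
    "JobID|Submit\n"
    "7274903|2021-06-09T11:30:46\n"
    "7300001|2021-07-15T08:00:00\n"
)

print(submitted_between(SAMPLE, "2021-07-07", "2021-08-07"))
```

The filter drops the job that was submitted in June even though sacct listed it for the July-August window.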

Re: [slurm-users] History of pending jobs

2021-07-30 Thread Ole Holm Nielsen
On 30-07-2021 20:42, Glenn (Gedaliah) Wolosh wrote: I'm interested in getting an idea of how long jobs were pending in a particular partition. Is there any magic in sreport or sacct that can generate this info? I could also use something like: "sreport cluster utilization" broken down by

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:24 PM, Diego Zuccato wrote: On 23/07/2021 13:15, Ole Holm Nielsen wrote: But it's not showing jobIDs nor users :( That is really strange! The pestat obtains username and jobid from the squeue command. Do you get this information from "squeue -t running"? $

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:15 PM, Ole Holm Nielsen wrote: On 7/23/21 1:07 PM, Diego Zuccato wrote: Well, Slurm reports the 15-minute load average.  I guess users will have to learn that, because we can't print help information every time. They'd probably omit reading it anyway... Actually, I found a bit

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:07 PM, Diego Zuccato wrote: Well, Slurm reports the 15-minute load average.  I guess users will have to learn that, because we can't print help information every time. They'd probably omit reading it anyway... Actually, I found a bit of unused space below the CPUload heading, so I

Re: [slurm-users] slumctld don't start at boot

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:00 PM, Diego Zuccato wrote: We answered in parallel :) I usually prefer to avoid modifying system-managed files because system updates could reset 'em. Since systemd allows overrides, I chose to use 'em :) I agree with you! The permanent fix will change those Systemd files in

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 12:43 PM, Ole Holm Nielsen wrote: On 7/23/21 12:36 PM, Diego Zuccato wrote: I believe that slurmd reports the 15 minute CPU load average to the slurmctld, only.  So you got this information already. Yup. It's just unexpected: if you don't know, you run pestat and see that an idle

Re: [slurm-users] slumctld don't start at boot

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 12:29 PM, Riccardo Sucapane wrote: I am using Slurm as a workload manager on a system with a master and 3 nodes. The operating system is the recent Rocky Linux 8.4, and Slurm is version 20.11.8 taken from the EPEL repository. Everything works correctly and when the

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
Hi Diego, On 7/23/21 12:36 PM, Diego Zuccato wrote: I believe that slurmd reports the 15 minute CPU load average to the slurmctld, only.  So you got this information already. Yup. It's just unexpected: if you don't know, you run pestat and see that an idle node does have a very high load :)

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
Hi Loris, On 7/23/21 9:05 AM, Loris Bennett wrote: We use both Zabbix and pestat. Zabbix gives us general information on the state of the nodes and file systems, and we have added some Slurm metrics, such as number of jobs pending, amount of memory pending, number of GPUs pending, etc. This

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
Hi Diego, On 7/23/21 8:16 AM, Diego Zuccato wrote: The Configless Slurm (https://slurm.schedmd.com/configless_slurm.html) from 20.02 makes distribution of slurm.conf really simple. Eager to see it in Debian :) IMHO, there ought to be a community effort to provide up-to-date Slurm packages

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Ole Holm Nielsen
Hi Diego, On 21-07-2021 11:56, Diego Zuccato wrote: I suspended testing config changes to update another machine. In the last test I added "CPUs=192" to the node definition, restarted slurmctld and nothing changed. When I returned, I checked again and slurm reported 192 CPUs! Magic? I now

Re: [slurm-users] 4 sockets but "

2021-07-20 Thread Ole Holm Nielsen
Hi Diego, 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting?  Then each physical socket would show up as two sockets (memory controllers), for a total of 8 "sockets" in your 4-socket system. I don't think so. Unless that's the default, I didn't change anything in the BIOS. Just checked

Re: [slurm-users] 4 sockets but "

2021-07-20 Thread Ole Holm Nielsen
Hi Diego, The Xeon Platinum 8268 is a 24-core CPU: https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html Questions: 1. So you have 4 physical sockets in each node? 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting?

Re: [slurm-users] Minimum requirements for Slurm daemons?

2021-07-14 Thread Ole Holm Nielsen
On 7/14/21 3:26 PM, Heitor wrote: On Mon, 12 Jul 2021 21:00:45 +0200 Ole Holm Nielsen wrote: SchedMD recommends that the slurmctld server should have only a few, but very fast CPU cores, in order to ensure the best responsiveness. The database server should preferably run on a physical

Re: [slurm-users] problem building pam_slurm_adopt

2021-07-14 Thread Ole Holm Nielsen
For CentOS, the list of all prerequisites for building Slurm is here: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites /Ole On 7/14/21 11:51 AM, Sean Crosby wrote: Hi Mike, To build pam_slurm_adopt, you need the pam-devel package installed on the node you're

Re: [slurm-users] Minimum requirements for Slurm daemons?

2021-07-12 Thread Ole Holm Nielsen
On 12-07-2021 20:17, Heitor wrote: Hello, I'm trying to find the minimum requirements (mainly CPU and RAM) for the slurmctld, slurmdbd, and slurmrestd daemons, but I did not find it in the docs. Maybe I missed some page? SchedMD recommends that the slurmctld server should have only a few, but

[slurm-users] Configless Slurm: DNS SRV record does not work without FQDN on EL8 systems

2021-07-12 Thread Ole Holm Nielsen
the lookup of SRV records? This issue is tracked in Slurm bug https://bugs.schedmd.com/show_bug.cgi?id=11878#c2 Thanks, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-02 Thread Ole Holm Nielsen
#upgrading-slurm Best regards, Ole

[slurm-users] Updated "showuserjobs" tool for summaries of Slurm node and batch job status

2021-06-28 Thread Ole Holm Nielsen
b.com/OleHolmNielsen/Slurm_tools/tree/master/showuserjobs Best regards, Ole

Re: [slurm-users] Information about finished jobs

2021-06-15 Thread Ole Holm Nielsen
On 6/15/21 12:07 PM, Peter Kjellström wrote: On Mon, 14 Jun 2021 09:33:02 +0200 (CEST) Arthur Gilly wrote: Hi all, A related question, on my setup, scontrol show job displays the standard output, standard error redirections as well as the wd, whereas this info is lost after completion when

Re: [slurm-users] monitor draining/drain nodes

2021-06-14 Thread Ole Holm Nielsen
On 6/14/21 7:50 AM, Marcus Boden wrote: Slurm provides the strigger[1] utility for that. You can set it up to automatically send mails when nodes go into drain. I provide some Slurm triggers examples in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers On 12.06.21 22:29,

Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Ole Holm Nielsen
On 6/14/21 9:33 AM, Arthur Gilly wrote: A related question, on my setup, scontrol show job displays the standard output, standard error redirections as well as the wd, whereas this info is lost after completion when sacct is required. Is this something that's configurable so that this info is

Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Ole Holm Nielsen
On 6/14/21 8:26 AM, Gestió Servidors wrote: How can I get all information about a finished job in the same way as “scontrol show jobid=” when job is pending or running? Some minutes after job completion, you can only get the information which is stored in the Slurm database. My script

Re: [slurm-users] Slurm stats in JSON format

2021-06-07 Thread Ole Holm Nielsen
On 6/8/21 12:27 AM, Sid Young wrote: Is there a tool that will extract the job counts in JSON format? Such as #running, #pending, #on hold, etc. I am trying to build some custom dashboards for our new cluster and this would be a really useful set of metrics to gather and display. We have
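As a rough illustration of the kind of metric asked for here, the Python sketch below turns a list of job states into a JSON object of counts. It is not one of the tools discussed in the thread: it assumes one state name per line, as `squeue -h -o "%T"` would print, and the function name `job_state_counts` is hypothetical. Held jobs appear as PENDING in this view, so separating them out would need the reason field as well.

```python
import json
from collections import Counter


def job_state_counts(squeue_text):
    """Count job states (one state per input line) and return a JSON string."""
    states = [s.strip() for s in squeue_text.splitlines() if s.strip()]
    return json.dumps(Counter(states), sort_keys=True)


# Hypothetical sample: three running jobs and two pending ones.
SAMPLE = "RUNNING\nPENDING\nRUNNING\nRUNNING\nPENDING\n"

print(job_state_counts(SAMPLE))
```

The resulting string can be fed directly to a dashboard that accepts JSON key/value metrics.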

Re: [slurm-users] Parent accounts

2021-05-31 Thread Ole Holm Nielsen
Hi Stefan, On 5/28/21 3:31 PM, Stefan Staeglich wrote: for our monitoring system I want to query the account hierarchy. Is there a better approach than to parse the output of sacctmgr list account withasso -nP ? Something like sacctmgr list account parent=bla withasso -nP doesn't work.

Re: [slurm-users] Parent accounts

2021-05-28 Thread Ole Holm Nielsen
Hi Stefan, On 5/28/21 3:31 PM, Stefan Staeglich wrote: for our monitoring system I want to query the account hierarchy. Is there a better approach than to parse the output of sacctmgr list account withasso -nP One approach is to use the Slurm sreport tool which displays the account

Re: [slurm-users] Building SLURM with X11 support

2021-05-27 Thread Ole Holm Nielsen
On 5/27/21 2:07 PM, Thekla Loizou wrote: I am trying to use X11 forwarding in SLURM with no success. We are installing SLURM using RPMs that we generate with the command "rpmbuild -ta slurm*.tar.bz2" as per the documentation. I am currently working with SLURM version 20.11.7-1. What I am

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Ole Holm Nielsen
Hi Loris, On 5/27/21 8:19 AM, Loris Bennett wrote: Regarding keys vs. host-based SSH, I see that host-based would be more elegant, but would involve more configuration. What exactly are the simplification gains you see? I just have a single cluster and naively I would think dropping a script

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Ole Holm Nielsen
On 26-05-2021 20:23, Will Dennis wrote: About to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm// which is then symlinked to /opt/slurm/current/ for the “in-use” one…) This is a new cluster, running 20.11.5 (which we now know has a CVE that was

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
On 25-05-2021 18:07, Loris Bennett wrote: PS Am I wrong to be surprised that this is something one needs to roll oneself? It seems to me that most clusters would want to implement something similar. Is that incorrect? If not, are people doing something else? Or did some vendor setting things

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
On 25-05-2021 19:03, Patrick Goetz wrote: On 5/25/21 11:07 AM, Loris Bennett wrote: PS Am I wrong to be surprised that this is something one needs to roll oneself?  It seems to me that most clusters would want to implement something similar.  Is that incorrect?  If not, are people doing

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
Hi Loris, I think you need, as pointed out by others, either of: * SSH keys, see https://wiki.fysik.dtu.dk/niflheim/SLURM#ssh-keys-for-password-less-access-to-cluster-nodes * SSH host-based authentication, see https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication /Ole On

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-21 Thread Ole Holm Nielsen
Hi Loris, I don't know if this would solve your problem, but I think that node SSH keys should be gathered and distributed. See my notes in https://wiki.fysik.dtu.dk/niflheim/SLURM#ssh-keys-for-password-less-access-to-cluster-nodes /Ole On 21-05-2021 14:53, Loris Bennett wrote: Hi, We

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Ole Holm Nielsen
On 5/17/21 8:59 AM, Diego Zuccato wrote: On 15/05/21 00:43, Christopher Samuel wrote: It just doesn't recognize 'ALL'. It works if I specify the resources. That's odd, what does this say? sreport --version slurm-wlm 18.08.5-2 That's the package from Debian stable (we don't have the

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Ole Holm Nielsen
On 14-05-2021 08:52, Diego Zuccato wrote: On 14/05/2021 08:19, Christopher Samuel wrote: sreport -t percent -T ALL cluster utilization "sreport: fatal: No valid TRES given" :( This works correctly on our cluster: $ sreport -t percent -T ALL cluster utilization

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Ole Holm Nielsen
On 5/11/21 11:06 AM, Diego Zuccato wrote: Is it possible to extract a "partition usage summary", like the one generated by "sreport cluster usage" but limited to a single partition (or a partition set)? I believe that sreport can't make per-partition reports. Alternatively, is there some
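Since sreport does not split its report by partition, one workaround is to aggregate per-job accounting records instead. The Python sketch below sums CPU time per partition; it is an assumption-laden illustration rather than the approach proposed in the thread. It expects pipe-delimited text such as `sacct -a -X -P --format=Partition,CPUTimeRAW` would produce, the function name `cpu_hours_by_partition` and the partition names in the sample are hypothetical, and the result is consumed CPU time only, not the allocated/idle/down breakdown that sreport prints.

```python
from collections import defaultdict


def cpu_hours_by_partition(sacct_text):
    """Sum CPUTimeRAW (seconds) per partition and return CPU-hours.

    Assumes a header line followed by Partition|CPUTimeRAW rows.
    """
    totals = defaultdict(int)
    for line in sacct_text.strip().splitlines()[1:]:  # skip the header line
        partition, cpu_seconds = line.split("|")[:2]
        totals[partition] += int(cpu_seconds)
    return {name: seconds / 3600 for name, seconds in totals.items()}


# Hypothetical sample records for two partitions.
SAMPLE = (
    "Partition|CPUTimeRAW\n"
    "xeon16|7200\n"
    "xeon16|3600\n"
    "gpu|1800\n"
)

print(cpu_hours_by_partition(SAMPLE))
```

Dividing each total by the partition's available CPU-hours for the same period would give a utilization percentage, but that denominator has to come from the node configuration.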

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Ole Holm Nielsen
The task of adding or removing nodes from Slurm is well documented and discussed in SchedMD presentations, please see my Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes /Ole On 04-05-2021 14:47, Tina Friedrich wrote: Not sure if that's changed but aren't there cases

Re: [slurm-users] [External] Re: PropagateResourceLimits

2021-04-29 Thread Ole Holm Nielsen
On 29-04-2021 18:54, Ryan Novosielski wrote: It may not for specifically PropagateResourceLimits – as I said, the docs are a little sparse on the “how” this actually works – but you’re not correct that PAM doesn’t come into play re: user jobs. If you have “UsePam = 1” set, and have an

Re: [slurm-users] [External] slurmd -C vs lscpu - which do I use to populate slurm.conf?

2021-04-29 Thread Ole Holm Nielsen
On 4/29/21 1:06 AM, Michael Robbert wrote: I think that you want to use the output of slurmd -C, but if that isn’t telling you the truth then you may not have built slurm with the correct libraries. I believe that you need to build with hwloc in order to get the most accurate details of the

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-04-28 Thread Ole Holm Nielsen
On 4/28/21 2:48 AM, Sid Young wrote: I use SaltStack to push out the slurm.conf file to all nodes and do a "scontrol reconfigure" of the slurmd, this makes management much easier across the cluster. You can also do service restarts from one point etc. Avoid NFS mounts for the config, if the

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-27 Thread Ole Holm Nielsen
On 4/16/21 4:21 PM, Ole Holm Nielsen wrote: I'm thinking of a reservation something like this: scontrol create reservation starttime=...  duration=12:00:00 ReservationName=migrate_physics nodes=ALL Accounts=-physics For the record: The idea of creating a Slurm reservation for excluding

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-04-24 Thread Ole Holm Nielsen
On 24-04-2021 04:37, Cristóbal Navarro wrote: Hi Community, I have a set of users still not so familiar with slurm, and yesterday they bypassed srun/sbatch and just ran their CPU program directly on the head/login node thinking it would still run on the compute node. I am aware that I will

Re: [slurm-users] In high availability scenario, what is the best way to synchronize state files with scontrol takeover command?

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, I think it is safer to get some experience with Slurm *without* initially using a High Availability setup for the slurmctld server. I highly recommend that you study the SchedMD presentations available in the page https://slurm.schedmd.com/publications.html. In

Re: [slurm-users] configless in Slurm, can not find the ip of ctld

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, What is your full DNS domain name, and is /etc/resolv.conf consistent with your DNS? It seems to me that your DNS server is named "slurmctld-source": NS slurmctld-source. so you may have an error in the DNS setup. The DNS SRV record can be looked up by: $ host -t

[slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
can rsync the home directories from the old NFS server to the new NFS server and update the NFS automounter links. Question: Does anyone have experiences with this type of scenario? Any good ideas or suggestions for other methods for data migration? Thanks, Ole -- Ole Holm Nielsen PhD

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
. I'm thinking of a reservation something like this: scontrol create reservation starttime=... duration=12:00:00 ReservationName=migrate_physics nodes=ALL Accounts=-physics Would this work as expected? Best regards, Ole On 16/04/2021 14.23, Ole Holm Nielsen wrote: I need to migrate several sets

Re: [slurm-users] derived counters

2021-04-16 Thread Ole Holm Nielsen
Hi Jürgen, On 4/13/21 6:29 PM, Juergen Salk wrote: * Heckes, Frank [210413 12:04]: This results from a management question: how long do jobs have to wait (in s, min, h, day) before they get executed, and how many jobs are waiting (are queued) for each partition in a certain time interval? The

Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-15 Thread Ole Holm Nielsen
Hi Thomas, I wonder if your problem is related to that reported in this list thread? https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html You could try to restart the slurmctld service, and also make sure your configuration (slurm.conf etc.) has been pushed correctly to the

Re: [slurm-users] derived counters

2021-04-13 Thread Ole Holm Nielsen
Hi Frank, On 4/12/21 9:53 AM, Heckes, Frank wrote: Hello Ole, many thanks for sharing your scripts, they cover most of the topics I was looking for. (My apologies, I had noticed them already, but didn't check them carefully enough.) The scripts are very cleanly coded and documented. Great work.

Re: [slurm-users] derived counters

2021-04-12 Thread Ole Holm Nielsen
On 4/11/21 6:17 PM, Heckes, Frank wrote: Sorry, if this has been asked and answered before. Has someone created a script/SQL query, or can someone provide a combination of command-line flags, to create a ‘report’ for: I'm not sure my Slurm tools do what you want, but maybe you can get partial

Re: [slurm-users] slurmrestd configuration

2021-04-08 Thread Ole Holm Nielsen
On 4/8/21 9:50 AM, Simone Riggi wrote: I write you about how to properly setup slurmrestd. ... 2) Installed slurm with: rpmbuild -ta slurm-20.11.5.tar.bz2 --with mysql --with slurmrestd --with jwt I don't see this "--with jwt" in the slurm.spec file: [slurm-20.11.5]# grep "# --with"

[slurm-users] Updated "pestat" tool for printing Slurm nodes status with 1 line per node including job info

2021-04-06 Thread Ole Holm Nielsen
ptions. If you use pestat, could you kindly download the latest master version and test it on your system? The output of "squeue -O" and "sinfo -O" can be challenging to parse correctly, so if you find a bug in pestat, please open an issue on GitHub or send E-mail to me. Than

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Ole Holm Nielsen
Hi Ioannis, On 06-04-2021 07:56, Ioannis Botsis wrote: slurmctld is active and running but on system reboot doesn’t start automatically…..I have to start it manually Maybe you will find my Slurm Wiki pages of use for setting up your Slurm system: https://wiki.fysik.dtu.dk/niflheim/SLURM

Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-10 Thread Ole Holm Nielsen
On 3/10/21 12:06 PM, Reuti wrote: On 09.03.2021 at 13:37, Marcus Boden wrote: Then I have good news for you! There is the --delimiter option: https://slurm.schedmd.com/sacct.html#OPT_delimiter= Aha, perfect – thx. Maybe it should be noted in the man page for the "-p"/"-P". Good idea. I

Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-08 Thread Ole Holm Nielsen
On 3/9/21 5:06 AM, xiaojingh...@163.com wrote: Hi, guys I am doing a parsing job on Slurm fields. Sometimes when one field is too long, Slurm will limit the length with a “+”. But I prefer to get the complete value of that field. Do you know how I can achieve that? I do not want to specify the

Re: [slurm-users] Get original script of a job

2021-03-05 Thread Ole Holm Nielsen
On 05-03-2021 11:29, Alberto Morillas, Angelines wrote: I would like to know if it is possible to get the script that was used to send a job. I know that when I send a job, with scontrol I can get the path and the name of the script used to send this job, but normally the users change

Re: [slurm-users] sreport triggers "slurm_pack_list: size limit exceeded"

2021-03-03 Thread Ole Holm Nielsen
On 3/3/21 10:52 AM, Ulf Markwardt wrote: Dear all, at the moment, a more detailed sreport breaks like... sreport job SizesByAccount flatview grouping=1,2,4,8,24,32,48,64,128,250,500,1500,2000,2500,3000,3500,4000,4500,5000,7500 start=09/01/20 end=11/01/20 -t hours --parsable2 sreport: error:

Re: [slurm-users] GPU exclusively for one account

2021-02-26 Thread Ole Holm Nielsen
On 2/26/21 8:44 AM, Baldauf, Sebastian Martin wrote: I just want to ask if someone has an idea how to give a GPU and some CPUs of a node to one account exclusively but keep the remaining CPUs of this node available for all users. For me it looks like that using partitions is only working for

Re: [slurm-users] MaxTime only for a user

2021-02-25 Thread Ole Holm Nielsen
On 2/25/21 11:17 AM, Diego Zuccato wrote: On 25/02/21 11:00, Ole Holm Nielsen wrote: I think so, please see https://slurm.schedmd.com/resource_limits.html and look for the MaxWallDurationPerJob limit. You have to set that limit on the user's association. IIUC the limit in the assoc

Re: [slurm-users] MaxTime only for a user

2021-02-25 Thread Ole Holm Nielsen
On 2/25/21 10:11 AM, Gestió Servidors wrote: I need to configure a SLURM partition to allow jobs that need more than an hour, but only for a specific user. By default, that partition allows jobs with a “MaxTime=10:00” but, now, a user needs to run some tests in the same partition that will last

Re: [slurm-users] Slurmdbd purge settings

2021-02-23 Thread Ole Holm Nielsen
On 23-02-2021 15:19, mercan wrote: Hi; Maybe the database no longer fits in the InnoDB buffer. If there is enough room to increase this value (innodb_buffer_pool_size), you can try increasing it to find the reason. The details of modifying the InnoDB parameters are described in

Re: [slurm-users] Slurmdbd purge settings

2021-02-23 Thread Ole Holm Nielsen
On 2/23/21 1:25 PM, Luke Sudbery wrote: We have suddenly got bad performance from sreport, querying a 1 hour period (in the last 24 hours) for TopUsage went from taking under a minute to timing out after the 15 minutes max slurmdbd query time – although the SQL query on the DB server continued

Re: [slurm-users] Job flexibility with cons_tres

2021-02-12 Thread Ole Holm Nielsen
On 2/12/21 9:24 AM, Ansgar Esztermann-Kirchner wrote: After scouring the docs once more, I've noticed DefaultCpusPerGpu, which seems to be exactly what I was looking for: jobs request a number of GPUs, but no CPUs; and Slurm will assign an appropriate number of CPUs. The only disadvantage is the

Re: [slurm-users] Fairshare tree after SLURM upgrade

2021-01-28 Thread Ole Holm Nielsen
On 1/29/21 8:03 AM, Gestió Servidors wrote: I’m going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this is not the last version, but I manage another cluster that is running, also, this version. My question is: during the process, I need to upgrade “slurmdbd”. All the fairshare

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
Message- From: Ole Holm Nielsen Sent: 29 January 2021 0:14 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] how do array jobs stored in slurmdb database? On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote: From query command such as ‘sacct -j 123456’ I can see a series of jobs named 123456_1

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
to set the charging cost (money) of specific jobs in the database to zero. I will post a separate message about that. /Ole -Original Message-> -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyn

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Ole Holm Nielsen
On 28-01-2021 21:21, Chandler wrote: Brian Andrus wrote on 1/28/21 12:07: scontrol update state=resume nodename=n[011-013] I tried that but got, slurm_update error: Invalid node state specified As Chris Samuel said, you must restart the Slurm daemons when adding (or removing) nodes! See

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote: From query command such as ‘sacct -j 123456’ I can see a series of jobs named 123456_1, 123456_2, etc. And I need to delete these job records from mysql database for some reason. But in job_table of slurmdb, there is only one record with

[slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Ole Holm Nielsen
In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote: Personally, I think it's good that Slurm RPMs are now available through EPEL, although I won't be able to use them, and I'm sure many people on the list won't be able to either, since licensing issues prevent them from providing

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Ole Holm Nielsen
On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote: > In another thread, On 26-01-2021 17:44, Prentice Bisbal wro

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Ole Holm Nielsen
Thanks Paul! On 26-01-2021 21:11, Paul Raines wrote: You should check your jobs that allocated GPUs and make sure CUDA_VISIBLE_DEVICES is being set in the environment.  If it is not set, that is a sign your GPU support is not really there but SLURM is just doing "generic" resource assignment. Could you
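The check Paul describes can be sketched as a small shell helper to call from inside a job that was allocated GPUs. This is only an illustration: the function name check_gpu_env is hypothetical, and the only thing it inspects is the CUDA_VISIBLE_DEVICES variable named in the message.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): report whether Slurm exported
# CUDA_VISIBLE_DEVICES into the job environment.
check_gpu_env() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
        # Variable is set and non-empty: print the device list.
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
    else
        # Unset or empty: GPU support is probably only "generic" GRES.
        echo "CUDA_VISIBLE_DEVICES is not set"
        return 1
    fi
}
```

Called from a batch script or an interactive job step that requested a GPU, a non-zero return status would suggest that Slurm is only doing "generic" resource assignment on that node.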

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Ole Holm Nielsen
or broken without the libraries? SchedMD's slurm.spec file doesn't mention any "--with nvidia" (or similar) build options, so I'm really puzzled. Most of our nodes don't have GPUs, so I wouldn't like to install libraries on those nodes needlessly. Thanks, Ole On 1/26/2021 2:29

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Ole Holm Nielsen
of the equation. I think my points quoted below deserve careful consideration by the EPEL volunteer, because the results could be potentially harmful. Thanks, Ole Andy Riebs On 1/25/2021 2:47 AM, Ole Holm Nielsen wrote: On 1/23/21 9:43 PM, Philip Kovacs wrote: I can assure you it was easier for

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-24 Thread Ole Holm Nielsen
ades or double Slurm installations can take place. Thanks, Ole On Saturday, January 23, 2021, 07:03:08 AM EST, Ole Holm Nielsen wrote: We use the EPEL yum repository on our CentOS 7 nodes.  Today EPEL surprisingly delivers Slurm 20.11.2 RPMs, and the daily yum updates (luckily) f

[slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-23 Thread Ole Holm Nielsen
We use the EPEL yum repository on our CentOS 7 nodes. Today EPEL surprisingly delivers Slurm 20.11.2 RPMs, and the daily yum updates (luckily) fail with some errors: --> Running transaction check ---> Package slurm.x86_64 0:20.02.6-1.el7 will be updated --> Processing Dependency:

[slurm-users] GPU process accounting information

2021-01-15 Thread Ole Holm Nielsen
Hi, We have installed some new GPU nodes, and now users are asking for some sort of monitoring of GPU utilisation and GPU memory utilisation at the end of a job, like what Slurm already provides for CPU and memory usage. I haven't found any pages describing how to perform GPU accounting

Re: [slurm-users] How to get CPU tres report by group without getting multiple lines by users?

2021-01-14 Thread Ole Holm Nielsen
On 1/14/21 4:20 PM, ichebo...@univ.haifa.ac.il wrote: I am trying to collect information for our cluster CPU hours usage formatted by groups, but I always get multiple lines per group. For example: sreport cluster UserUtilizationByAccount start=01/01/20 end=12/31/20 -t hours -T cpu

Re: [slurm-users] [EXTERNAL] Possible to copy sacctmgr info from one cluster to another?

2021-01-14 Thread Ole Holm Nielsen
On 1/13/21 7:19 PM, Mando Rodriguez wrote: You can ‘dump’ the info in the slurm database to a file and reload the file (here named cluster.cfg). Dump the info with: sacctmgr dump slurm_cluster file=cluster.cfg You can load the info with: sacctmgr load file=cluster.cfg It saves all
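The two sacctmgr commands quoted above can be wrapped in a small script, shown here as a minimal sketch. The function names and the DRY_RUN switch are hypothetical additions for illustration; the cluster name slurm_cluster and the file name cluster.cfg are taken from the message. With DRY_RUN=1 (the default in this sketch) the commands are only printed, so they can be reviewed before touching the accounting database.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on the source cluster: write the cluster's accounting info to a file.
dump_cluster_info() {
    cluster="$1"
    file="$2"
    run sacctmgr dump "$cluster" file="$file"
}

# Run on the target cluster: load the previously dumped file.
load_cluster_info() {
    file="$1"
    run sacctmgr load file="$file"
}
```

For example, `dump_cluster_info slurm_cluster cluster.cfg` prints the dump command from the message, and setting DRY_RUN=0 would execute it instead.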

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ole Holm Nielsen
Hi Sean, On 1/14/21 9:19 AM, Sean Crosby wrote: Hi Abhiram, You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html My cgroup.conf is CgroupAutomount=yes

Re: [slurm-users] slurm/munge problem: invalid credentials

2020-12-16 Thread Ole Holm Nielsen
Hi Olaf, Since you are testing Slurm, perhaps my Slurm Wiki page may be of interest to you: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation There is a discussion about the setup of Munge. Best regards, Ole On 12/15/20 5:48 PM, Olaf Gellert wrote: Hi all, we are setting up a new test

[slurm-users] Database backup best practices

2020-12-10 Thread Ole Holm Nielsen
tps://bugs.schedmd.com/show_bug.cgi?id=10295 2. https://serversforhackers.com/c/mysqldump-with-modern-mysql 3. https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-and-restore-of-database -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-05 Thread Ole Holm Nielsen
ri, Dec 4, 2020 at 3:13 PM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: Hi Jason, Slurm upgrading should be pretty simple, IMHO.  I've been through this multiple times, and my Slurm Wiki has detailed upgrade documentation: https://wiki.fysik.dtu.dk/niflhei
