[slurm-dev] How to apply changes in the slurm.conf
Dear Slurm users,

Maybe these are dummy questions, but I can't find the answers in the manual. We recently installed Slurm 14.03 on a cluster running a Red Hat / Scientific Linux environment. In order to tune the configuration, we want to test different parameters in slurm.conf, but there are several users running important jobs that take several days. How can I change the Slurm configuration and restart slurmctld without affecting the users and their jobs? Is it also necessary to restart the slurmd daemons? Is it possible to upgrade or change the Slurm version while there are jobs running?

Thanks in advance.
[slurm-dev] Re: How to apply changes in the slurm.conf
On 06/10/2014 08:24 AM, José Manuel Molero wrote:

Dear Slurm users, maybe these are dummy questions, but I can't find the answers in the manual. We recently installed Slurm 14.03 on a cluster running a Red Hat / Scientific Linux environment. In order to tune the configuration, we want to test different parameters in slurm.conf, but there are several users running important jobs that take several days. How can I change the Slurm configuration and restart slurmctld without affecting the users and their jobs? Is it also necessary to restart the slurmd daemons? Is it possible to upgrade or change the Slurm version while there are jobs running? Thanks in advance.

Hello!

We apply new configuration parameters with "scontrol reconfigure" (after first placing the new slurm.conf on all nodes).

Upgrading Slurm: in my experience, when upgrading to a minor release (e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a running cluster; jobs are preserved. But when upgrading to a major release (e.g. from 2.5 to 2.6), the cluster has to be drained first, otherwise jobs are killed.

Cheers,
Barbara
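As a rough illustration of that workflow (the pdcp command and the node names below are only examples, not part of Barbara's setup; any file-distribution method or a shared filesystem works just as well):

# Place the updated slurm.conf on every node, then ask the running daemons
# to re-read it. Running jobs are not affected by a reconfigure.
pdcp -w node[01-10] /etc/slurm/slurm.conf /etc/slurm/slurm.conf
scontrol reconfigure

Note that a few parameters (listed in the slurm.conf man page) only take effect on a full daemon restart rather than a reconfigure, so it is worth checking the man page for the specific options being tuned.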
[slurm-dev] Enforce to use srun and application logger
Hi,

We are using the Snoopy library (https://github.com/a2o/snoopy) to monitor and collect statistics about the applications used on our HPC resources. Since more than 30% of the jobs in our database have no information in this regard, it seems that Snoopy is not able to track everything. Some other tools like PerfMiner or monitor (http://web.eecs.utk.edu/~mucci/monitor/) are used at several sites, but since monitor relies on PapiEx (http://icl.cs.utk.edu/~mucci/papiex/), and that project is no longer supported, I would like to know if there is some other approach to collecting this data.

In addition, I would like to know whether it is possible to enforce the use of srun in the submit script. I used an sbatch wrapper before, but maybe there is now a better way to do it.

Thanks!

Regards,
Jordi
[slurm-dev] Re: How to apply changes in the slurm.conf
Pending and running jobs should be preserved across major releases too.

Quoting Barbara Krasovec barba...@arnes.si:

Upgrading Slurm: in my experience, when upgrading to a minor release (e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a running cluster; jobs are preserved. But when upgrading to a major release (e.g. from 2.5 to 2.6), the cluster has to be drained first, otherwise jobs are killed.
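For context, the upgrade sequence recommended in the SchedMD documentation looks roughly like the sketch below (the paths, database name and slurmdbd step are assumptions that depend on the site, and the version jump must stay within the range supported for rolling upgrades):

# 1. Back up the controller state directory and, if slurmdbd is used, its database.
cp -a /var/spool/slurmctld /var/spool/slurmctld.backup   # StateSaveLocation (example path)
mysqldump slurm_acct_db > slurm_acct_db.sql              # only if accounting uses MySQL

# 2. Optionally raise SlurmctldTimeout and SlurmdTimeout in slurm.conf so nodes
#    are not marked down while the daemons are briefly stopped.

# 3. Upgrade and restart the daemons in order: slurmdbd first, then slurmctld,
#    then slurmd on the compute nodes. Pending and running jobs are preserved.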
[slurm-dev] Re: More odd scheduler reservation behavior
No thoughts on this from the list? I wouldn't have thought we were the only ones encountering this issue.

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 6/5/14 3:09 PM, Bill Barth bba...@tacc.utexas.edu wrote:

All,

I'm experiencing the following unexpected behavior with SLURM reservations. If I create a reservation on some nodes and forget to point it to a specific partition, and I later update the reservation to point at the correct partition, it doesn't remove the nodes reserved in the wrong partition and replace them with nodes from the specified partition. Here are the details, beginning with some info about the relevant defined partitions:

PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
SB2.7*    up    2-00:00:00     2 down* c3-[401,421]
SB2.7*    up    2-00:00:00    26 idle  c3-[402-420,422-428]
IB2.2     up    2-00:00:00    12 idle  c3-[501-512]

Create the reservation:

-bash-4.2$ sudo scontrol create reservation StartTime=2014-06-06T08:00:00 Duration=1:00:00 NodeCnt=4 Users=bbarth
Reservation created: bbarth_3
-bash-4.2$ scontrol show res
ReservationName=bbarth_3 StartTime=2014-06-06T08:00:00 EndTime=2014-06-06T09:00:00 Duration=01:00:00
   Nodes=c3-[402-405] NodeCnt=4 CoreCnt=64 Features=(null) PartitionName=SB2.7 Flags=
   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE

Observe that the nodes happen to come from the SB2.7 partition. If I update the partition on the reservation to be IB2.2, we see that the nodes from SB2.7 are still the ones reserved:

-bash-4.2$ sudo scontrol update ReservationName=bbarth_3 Partition=IB2.2
Reservation updated.
-bash-4.2$ scontrol show res
ReservationName=bbarth_3 StartTime=2014-06-06T08:00:00 EndTime=2014-06-06T09:00:00 Duration=01:00:00
   Nodes=c3-[402-405] NodeCnt=4 CoreCnt=64 Features=(null) PartitionName=IB2.2 Flags=
   Users=bbarth Accounts=(null) Licenses=(null) State=INACTIVE

Is this the expected behavior? I also notice that if I drain a node it doesn't get replaced in the reservation, and if I stop SLURM on the node (/etc/init.d/slurm stop) it doesn't get replaced either. I would have sworn up and down that at least the latter worked. Can anyone provide some feedback?

Thanks,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
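One workaround, sketched here rather than taken from the thread: delete the misplaced reservation and recreate it with the partition stated up front, so that node selection happens inside IB2.2 from the start (syntax per the scontrol man page; adjust times and counts as needed):

# Remove the reservation that landed on SB2.7 nodes and rebuild it in IB2.2:
sudo scontrol delete ReservationName=bbarth_3
sudo scontrol create reservation StartTime=2014-06-06T08:00:00 Duration=1:00:00 NodeCnt=4 Users=bbarth PartitionName=IB2.2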
[slurm-dev] jobs killed on controller restart
We've had some trouble with curious job failures: the jobs aren't even assigned nodes:

JobID    NodeList       State   ExitCode
-------  -------------  ------  --------
7229124  None assigned  FAILED  0:1

We finally got some better log data (I'd had the log level turned way too low), which suggests that restarting and/or reconfiguring the controller is at the root of it. After some preliminaries (purging job records, recovering active jobs) there will be these sorts of messages:

[2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in partition full
[2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable: Requested node configuration is not available

The indicated job specified --mem and --tmp, but the values are within the capacities of all nodes in that "full" partition. Typically, if a user requests resources exceeding those available on nodes in this partition, the submission fails. It appears that this failure only occurs for jobs with memory and/or disk constraints. Worse yet, it's not consistent; it only seems to happen sometimes. I also cannot reproduce this in our test environment.

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd

though I've got FastSchedule=0.

Honestly, it *feels* like there's a moment where the node data isn't fully loaded from the slurmd and thus the scheduler doesn't see any nodes that satisfy the requirements.

Thanks all... Michael
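If that race between controller restart and slurmd registration really is the cause (an assumption, not something established in this thread), one thing to try is letting the scheduler trust the configured node definitions instead of whatever the nodes have reported so far. A minimal slurm.conf sketch, reusing the node line above with a hypothetical TmpDisk value added so that --tmp requests can be checked against configured data:

# With FastSchedule=1 the scheduler validates --mem/--tmp against these
# configured values rather than against live slurmd reports, so jobs are not
# rejected while node data is still being re-collected after a restart.
FastSchedule=1
NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 TmpDisk=100000 Weight=10 Feature=full,restart,rx200,ssd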
[slurm-dev] Re: Create preemptable QOS
Hi,

I did figure out the issue with my setup and thought I'd post the fix in case anyone was curious. I had neglected to add the newly created QOS as a possibility for the account association. So for me I needed to do:

sacctmgr modify account name=normal set qos=normal,free

That way a normal account could request the free QOS. Hope this helps someone.

On 6/3/14, 2:18 PM, Christopher B Coffey chris.cof...@nau.edu wrote:

Hi,

I'm trying to create a QOS that, when specified in a job script, makes a job preemptable by jobs in the normal account. Another goal is to have this QOS, when in use, not subtract fairshare points from the user. I tried the following in slurm.conf:

PreemptType = preempt/qos
PreemptMode = REQUEUE

And:

sacctmgr add qos free PreemptMode=cluster usagefactor=0 description="Preemptable QOS, no fairshare use"
sacctmgr modify qos name=normal set preempt=free

Jobs using the free QOS are correctly preempted, but I get this in the logs when jobs are submitted and running:

Jun 3 11:33:47 head slurmctld[5078]: sched: JobId=275 has invalid QOS
Jun 3 11:33:47 head slurmctld[5078]: sched: JobId=276 has invalid QOS
Jun 3 11:33:47 head slurmctld[5078]: sched: JobId=277 has invalid QOS

Any ideas? Hopefully it's not just a Monday detail, thank you!

Chris
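Putting the pieces of this thread together, the complete recipe looks roughly as follows (the account and QOS names "normal" and "free" come from Chris's example and are of course site-specific):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE

# Create the preemptable QOS that accrues no fairshare usage, let the normal
# QOS preempt it, and -- the step that was missing -- allow the account to
# request the new QOS:
sacctmgr add qos free PreemptMode=cluster UsageFactor=0 Description="Preemptable QOS, no fairshare use"
sacctmgr modify qos name=normal set preempt=free
sacctmgr modify account name=normal set qos=normal,free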
[slurm-dev] Re: Enforce to use srun and application logger
Quoting Jordi Blasco jbllis...@gmail.com:

In addition, I would like to know whether it is possible to enforce the use of srun in the submit script. I used an sbatch wrapper before, but maybe there is now a better way to do it.

A job submit plugin may be your best option for that. See: http://slurm.schedmd.com/job_submit_plugins.html
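As a very rough sketch of that approach, assuming the Lua variant of the plugin is enabled with JobSubmitPlugins=lua (field names and return codes should be checked against the page above for your version, and, as the next reply points out, a textual check like this is easy to defeat):

-- job_submit.lua (sketch): reject batch jobs whose script never mentions srun.
-- This only inspects the top-level script, not anything it calls.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.script ~= nil and string.find(job_desc.script, "srun") == nil then
        slurm.log_user("Please launch your application with srun inside the batch script")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end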
[slurm-dev] Re: Enforce to use srun and application logger
Jordi,

It's basically impossible to force people to call srun somewhere in their batch script. If you only want to allow the very simplest of batch scripts, then you can grep them at job submit time with a job submit plugin, but if their script calls a script which calls a script (etc.) which calls srun, you'll never detect that they've done what you wanted. Worse, you'll raise false positives all the time even though the users have done what you wanted, just some levels down.

We have a wrapper around the MPI job starters that we support (MVAPICH2 and Intel MPI) that calls the right startup mechanisms with the right arguments. But we haven't tried to force our users to use this script. The vast majority of them do what we want because a) we train them on it and document it well, and b) our method is generally easier to use than the other options.

For monitoring, you might check out the project that I work on called TACC Stats, which provides accounting and performance monitoring for HPC jobs. Some parts of the project are in a state of flux as we are adding new features, but things should begin to stabilize this summer. TACC Stats will also be working with a sister project called XALT, which will have its first release this summer and will provide information about the executables and libraries used by HPC jobs. More information and source code for TACC Stats can be found on GitHub, and XALT should be available on GitHub later this summer.

git clone g...@github.com:rtevans/tacc_stats.git

(This will eventually move to the main TACC GitHub, but that's a work in progress.)

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 6/10/14 6:19 AM, Jordi Blasco jbllis...@gmail.com wrote:

Hi,

We are using the Snoopy library (https://github.com/a2o/snoopy) to monitor and collect statistics about the applications used on our HPC resources. Since more than 30% of the jobs in our database have no information in this regard, it seems that Snoopy is not able to track everything. Some other tools like PerfMiner or monitor (http://web.eecs.utk.edu/~mucci/monitor/) are used at several sites, but since monitor relies on PapiEx (http://icl.cs.utk.edu/~mucci/papiex/), and that project is no longer supported, I would like to know if there is some other approach to collecting this data.

In addition, I would like to know whether it is possible to enforce the use of srun in the submit script. I used an sbatch wrapper before, but maybe there is now a better way to do it.

Thanks!

Regards,
Jordi
[slurm-dev] Fairshare=parent on an account: What should it do?
We're trying to figure out what the intended behavior of Fairshare=parent is when set on an account (http://bugs.schedmd.com/show_bug.cgi?id=864). We know what the actual behavior is, but we're wondering if anyone actually likes the current behavior. There could be some use case out there that we don't know about.

For example, you can end up with a scenario like the following:

           acctProf
           /   |   \
          /    |    \
acctTA(parent) uD(5) uE(5)
    /  |  \
   /   |   \
uA(5) uB(5) uC(5)

The number in parentheses is Fairshare according to sacctmgr. We incorrectly thought that Fairshare=parent would essentially collapse the tree so that uA through uE would all be on the same level. Thus, all five users would each get 5 / 25 shares.

What actually happens is you get the following shares at the user level:

shares(uA) = 5 / 15 = .333
shares(uB) = 5 / 15 = .333
shares(uC) = 5 / 15 = .333
shares(uD) = 5 / 10 = .5
shares(uE) = 5 / 10 = .5

That's pretty far off from each other, but not as far as it would be if one account had two users and the other had forty. Assuming this demonstration value of 5 shares, that would be:

user_in_small_account = 5 / (2*5)  = .5
user_in_large_account = 5 / (40*5) = .025

Is that actually useful to someone? We want to use subaccounts below a faculty account to hold, for example, a grad student or postdoc who teaches a class. It would be nice for the grad student to have administrative control over the subaccount, since he actually knows the students, but not have it affect priority calculations.

Ryan

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University