Re: [slurm-users] Areas for improvement on our site's cluster scheduling
We've been using a backfill priority partition for people doing HTC work. We have requeue set so that jobs from the high-priority partitions can take over. You can do this for your interactive nodes as well if you want. We dedicate hardware to interactive work and use partition-based QoS's to limit usage.

-Paul Edmon-

On 05/08/2018 10:08 AM, Renfro, Michael wrote:
That's the first limit I placed on our cluster, and it has generally worked out well (never used a job limit). A single account can get 1000 CPU-days in whatever distribution they want. I've just added a root-only 'expedited' QOS for times when the cluster is mostly idle, but a few users have jobs that run past the TRES limit. But I really like the idea of a preemptable QOS that the users can put their extra jobs into on their own.
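The limits described above can be sketched with sacctmgr. All names and values here are illustrative assumptions, not either site's actual configuration:

```shell
# Cap an account at 1000 CPU-days of concurrently running work:
# 1000 CPU-days = 1000 * 1440 = 1,440,000 CPU-minutes.
sacctmgr -i modify account where name=somelab set GrpTRESRunMins=cpu=1440000

# A separate "expedited" QOS, attached only to root's association so
# ordinary users cannot select it:
sacctmgr -i add qos expedited
sacctmgr -i modify user where name=root set qos+=expedited
```

A user-selectable preemptable QOS would be set up the same way, but added to everyone's association and named in the Preempt list of the higher QOS's.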
Re: [slurm-users] Jobs in pending state
It sounds like your second partition is getting scheduled primarily by the backfill scheduler. I would try the partition_job_depth option, as otherwise the main loop only looks at priority order and not by partition.

-Paul Edmon-

On 4/29/2018 5:32 AM, Zohar Roe MLM wrote:
Hello. I have two clusters in my slurm.conf:
CLUS_WORK1: server1 server2 server3
CLUS_WORK2: pc1 pc2 pc3
When I send 10,000 jobs to CLUS_WORK1 they start running fine, while a few are in a pending state (which is OK). But if I send new jobs to CLUS_WORK2, which is idle, those jobs are also pending, and it takes them about 20 minutes to start running. I didn't find any setting/configuration that could cause this. Is there some log I can check to see why they are pending? Thanks.
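For reference, partition_job_depth is set through SchedulerParameters in slurm.conf; the value below is illustrative only:

```
# slurm.conf fragment: make the main scheduling loop consider up to
# 10 jobs from *each* partition, rather than stopping after the highest
# priority jobs, which may all belong to one busy partition.
SchedulerParameters=partition_job_depth=10
```

A change like this takes effect after `scontrol reconfigure` or a slurmctld restart.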
Re: [slurm-users] Job still running after process completed
I would recommend putting a clean-up process in your epilog script. We have a check here that sees if the job completed; if so, it terminates all the user's processes with kill -9 to clean up any residuals. If that fails, it closes off the node so we can reboot it.

-Paul Edmon-

On 04/23/2018 08:10 AM, John Hearns wrote:
Nicolo, I cannot say what your problem is. However, in the past with problems like this I would:
a) Look at ps -eaf --forest and try to see what the parent processes of these job processes are. Clearly if the parent PID is 1 then --forest is not much help, but the --forest option is my 'goto' option.
b) Look closely at the Slurm logs. Do not fool yourself - force yourself to read the logs line by line, around the timestamp when the job ends.

Being a bit more helpful: in my last job we had endless problems with Matlab jobs leaving orphaned processes. To be fair to Matlab, they have a utility which 'properly' starts parallel jobs under the control of the batch system (OK, it was PBSPro). But users can easily start a job and 'fire off' processes in Matlab which are not under the direct control of the batch daemon, leaving orphaned processes when the job ends.

Actually, if you think about it, this is how a batch system works. The batch system daemon starts running processes on your behalf. When the job is killed, all the daughter processes of that daemon should die. It is instructive to run ps -eaf --forest sometimes on a compute node during a normal job run. Get to know how things are being created, and what their parents are (two dashes in front of the forest argument). Now think of users who start a batch job and get a list of compute hosts. They MAY use a mechanism such as ssh or indeed pbsdsh to start running job processes on those nodes. You will then have trouble with orphaned processes when the job ends.
Techniques for dealing with this:
a) Use the PAM module which stops ssh logins (actually, this probably still allows ssh login during the job's run, while the user has a node allocated).
b) My favourite: CPU sets - actually this won't stop ssh logins either.
c) Shouting, much shouting. Screaming.

Regarding users behaving like this, I have seen several cases of such behaviour for understandable reasons. On a system which I did not manage, but was asked for advice on, the vendor had provided a sample script for running Ansys. The user wanted to run Abaqus on the compute nodes (or some such - a different application anyway). So he started an empty Ansys job, which sat doing nothing, then took the list of hosts provided by the batch system and fired up an interactive Abaqus session from his terminal. I honestly hesitate to label this behaviour 'wrong'. I have also seen similar when running a CFD job.

On 23 April 2018 at 11:50, Nicolò Parmiggiani <nicolo.parmiggi...@gmail.com <mailto:nicolo.parmiggi...@gmail.com>> wrote:
Hi, I have a job that keeps running even though the internal process is finished. What could be the problem? Thank you.
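A minimal sketch of the epilog clean-up Paul describes; the squeue guard, the drain policy, and the two-second grace period are assumptions, not his actual script. Slurm runs the epilog as root with SLURM_JOB_USER set for the finished job:

```shell
#!/bin/bash
# Epilog sketch: kill leftover user processes after a job ends.

# Skip clean-up if the user still has other work on this node
# (note: the finishing job itself may still show as COMPLETING;
# a real script would filter it out by job ID).
if squeue -h -w "$(hostname -s)" -u "$SLURM_JOB_USER" | grep -q .; then
    exit 0
fi

# Kill any residual processes owned by the job's user.
pkill -9 -u "$SLURM_JOB_USER"

# If anything survives, drain the node so it can be rebooted.
sleep 2
if pgrep -u "$SLURM_JOB_USER" > /dev/null; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="epilog: unkillable user processes"
fi
exit 0
```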
Re: [slurm-users] Time-based partitions
You could probably accomplish this using a job submit lua script and some crafted QoS's. It would take some doing, but I imagine it could work.

-Paul Edmon-

On 03/12/2018 02:46 PM, Keith Ball wrote:
Hi All, we are looking to have time-based partitions, e.g. a "day" and a "night" partition (using the same group of compute nodes).
1.) For a "night" partition, jobs will only be allocated resources once the "night-time" window is reached (e.g. 6pm-7am). Ideally, the jobs in the "night" partition would also have higher priority during this window (so that they would preempt jobs in the "day" partition that were still running, if there were resource contention).
2.) During the "day-time" window (7am-6pm), jobs in the "day" queue can be allocated resources, and have higher priority than jobs in the "night" partition (that way, preemptive scheduling can occur if there is resource contention).
I have so far not seen a way to define a run or allocation time window for partitions. Are there such options? What is the best (and hopefully least convoluted) way to achieve the scheduling behavior described above in Slurm? Thanks, Keith
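One blunt alternative to the lua/QoS route (a swapped-in technique, not from the reply) is to toggle partition state from cron; the partition name is hypothetical, the times are from Keith's description:

```
# root's crontab: open the "night" partition at 18:00, close it at 07:00.
0 18 * * * scontrol update PartitionName=night State=UP
0 7  * * * scontrol update PartitionName=night State=DOWN
```

Note that State=DOWN only stops new allocations; already-running night jobs keep running, and the day/night preemption Keith wants would still need PriorityTier and PreemptMode configured on the partitions.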
Re: [slurm-users] ntasks and cpus-per-task
Yeah, I've found that in those situations it helps to have people wrap their threaded programs in srun inside of sbatch. That way the scheduler knows which process specifically gets the threading.

-Paul Edmon-

On 02/22/2018 10:39 AM, Loris Bennett wrote:
Hi Paul, Paul Edmon <ped...@cfa.harvard.edu> writes:
At least from my experience, wonky things can happen with Slurm (especially if you have thread affinity on) if you don't rightly divide between -n and -c. In general I've been telling our users that -c is for threaded applications and -n is for rank-based parallelism. This way the thread affinity works out properly.
Actually, we do have an issue with some applications not respecting the CPU mask. I always assumed it was something to do with the way the multithreading was programmed in certain applications, but maybe we should indeed be getting the users to use multiple CPUs with a single task. Thanks for the info. Cheers, Loris
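As an illustration of the -n/-c split and the srun wrapping Paul mentions (the program name is hypothetical):

```shell
#!/bin/bash
# Illustrative batch script: 4 ranks, each running 8 threads.
#SBATCH --ntasks=4          # -n: rank-based parallelism
#SBATCH --cpus-per-task=8   # -c: threads within each rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Wrapping the threaded program in srun tells the scheduler which
# process gets the threads, so the CPU affinity masks come out right.
srun ./my_threaded_mpi_app
```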
Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3
Typically the long db upgrades only happen for major version upgrades; minor versions usually don't take nearly as long. At least with our upgrade from 17.02.9 to 17.11.3, the upgrade only took 1.5 hours with 6 months' worth of jobs (about 10 million jobs). We don't track energy usage, though, so perhaps we avoided that particular query. From past experience, these major upgrades can take quite a bit of time, as they typically change a lot about the DB structure between major versions.

-Paul Edmon-

On 02/22/2018 06:17 AM, Malte Thoma wrote:
FYI:
* We broke off our upgrade from 17.02.1-2 to 17.11.2 after about 18 h.
* Dropped the job table ("truncate xyz_job_table;")
* Executed the everlasting SQL command by hand on a backup database
* Meanwhile we did the Slurm upgrade (fast)
* Reset the first job ID to a high number
* Inserted the converted data table into the real database again.
It took two experts for this task, and we would very much appreciate a better upgrade concept! In fact, we hesitate to upgrade from 17.11.2 to 17.11.3, because we are afraid of similar problems. Does anyone have experience with this? It would be good to know whether there is any chance that future upgrades will cause the same problems, or whether this will get better. Regards, Malte

On 22.02.2018 at 01:30, Christopher Benjamin Coffey wrote:
This is great to know, Kurt. We can't be the only folks running into this. I wonder if the mysql update code gets into a deadlock or something. I'm hoping a slurm dev will chime in ... Kurt, out of band if need be, I'd be interested in the details of what you ended up doing.
Best, Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" <slurm-users-boun...@lists.schedmd.com on behalf of k...@sciops.net> wrote:
On Wed, Feb 21, 2018 at 11:56:38PM +, Christopher Benjamin Coffey wrote:
> Hello,
> We have been trying to upgrade slurm on our cluster from 16.05.6 to 17.11.3. I'm thinking this should be doable? Past upgrades have been a breeze, and I believe during the last one, the db upgrade took like 25 minutes. Well now, the db upgrade process is taking far too long. We previously attempted the upgrade during a maintenance window and the upgrade process did not complete after 24 hrs. I gave up on the upgrade and reverted the slurm version back by restoring a backup db.

We hit this on our try as well, upgrading from 17.02.9 to 17.11.3. We truncated our job history for the upgrade, and then did the rest of the conversion out-of-band and re-imported it after the fact. It took us almost sixteen hours to convert a 1.5 million-job store. We got hung up on precisely the same query you did, on a similarly hefty machine. It caused us to roll back an upgrade and try again during our subsequent maintenance window with the above approach. khm
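A cautious upgrade sequence along the lines discussed in this thread might look like the following sketch; the database and service names are the usual defaults, not necessarily yours:

```shell
# 1. Stop the daemon and back up the accounting DB before touching anything.
systemctl stop slurmdbd
mysqldump --single-transaction slurm_acct_db > slurm_acct_db.backup.sql

# 2. Install the new Slurm packages, then run the conversion in the
#    foreground so progress is visible and systemd cannot time it out.
slurmdbd -D -vvv

# 3. If the conversion stalls for hours, roll back and retry out-of-band
#    on a copy of the database, as Malte and Kurt describe above:
# mysql slurm_acct_db < slurm_acct_db.backup.sql
```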
Re: [slurm-users] restrict application to a given partition
This sounds like a solution for Singularity: http://singularity.lbl.gov/ You could use the Lua script to restrict what is permitted to run by barring anything that isn't a specific Singularity script. Alternatively, you could use prolog scripts as an emergency fallback in case the Lua script doesn't catch it.

-Paul Edmon-

On 1/15/2018 8:31 AM, John Hearns wrote:
Juan, my knee-jerk reaction is to say 'containerisation' here. However, I guess that means that Slurm would have to be able to inspect the contents of a container, and I do not think that is possible. I may be very wrong here. Anyone? However, have a look at the Xalt stuff from TACC:
https://www.tacc.utexas.edu/research-development/tacc-projects/xalt
https://github.com/Fahey-McLay/xalt
Xalt is intended to instrument your cluster and collect information on what software is being run and exactly what libraries are being used. I do not think it has any options for "Nope! You may not run this executable on this partition", but it might be worth contacting the authors and discussing this.

On 15 January 2018 at 14:20, Juan A. Cordero Varelaq <bioinformatica-i...@us.es <mailto:bioinformatica-i...@us.es>> wrote:
But what if the user knows the path to such an application (let's say the python command) and executes it on the partition he/she should not be allowed to? Is it possible through Lua scripts to set constraints on software usage, such as a limited shell, for instance? In fact, what I'd like to implement is something like a limited shell, on a particular node, for a particular partition, and a particular program.

On 12/01/18 17:39, Paul Edmon wrote:
You could do this using a job_submit.lua script that inspects for that application and routes jobs properly.

-Paul Edmon-

On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote:
Dear Community, I have a node (20 cores) on my HPC system with two different partitions: big (16 cores) and small (4 cores).
I have installed software X on this node, but I want only one partition to have rights to run it. Is it then possible to restrict the execution of a specific application to a given partition on a given node? Thanks
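As a sketch of the prolog fallback Paul mentions, everything here is an assumption: the binary path, the partition name, and the availability of SLURM_JOB_PARTITION in the prolog environment. The job_submit.lua route is cleaner, since it rejects the job at submission time:

```shell
#!/bin/bash
# Prolog sketch: cancel jobs that mention the restricted binary
# outside the "big" partition (names are hypothetical).
RESTRICTED=/opt/X/bin/xrun

if [ "$SLURM_JOB_PARTITION" != "big" ]; then
    # Inspect the job's recorded command line for the restricted binary.
    if scontrol show job "$SLURM_JOB_ID" | grep -q "$RESTRICTED"; then
        scancel "$SLURM_JOB_ID"
        exit 1
    fi
fi
exit 0
```

This only catches the binary named on the command line, not invocations buried inside the user's script, which is why the submit-filter approach is preferable.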
Re: [slurm-users] Changing resource limits while running jobs
Typically changes like this only impact pending or newly submitted jobs. Running jobs are usually not impacted, though they will count against any new restrictions that you put in place.

-Paul Edmon-

On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote:
Hi, a couple of jobs have been running for almost one month, and I would like to change resource limits to prevent users from running for so much time. Besides, I'd like to set AccountingStorageEnforce to qos,safe. If I make such changes, would the running jobs be stopped (the user running the jobs still has no account and therefore should not be allowed to run anything if AccountingStorageEnforce is set)? Thanks
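For illustration, limit changes of this kind are made with sacctmgr plus a slurm.conf edit; the QOS name and wall-time value below are examples only:

```shell
# Cap future jobs at a 7-day wall time (running jobs are not requeued,
# per the reply above; they only count against the new limits).
sacctmgr -i modify qos where name=normal set MaxWall=7-00:00:00

# In slurm.conf (then restart slurmctld or run `scontrol reconfigure`):
#   AccountingStorageEnforce=qos,safe
# "safe" implies limit and association enforcement, so users without
# an association cannot submit new jobs.
```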
Re: [slurm-users] Intermittent "Not responding" status
I've seen this happen when there are inter-node communication issues which disrupt the tree that Slurm uses to talk to the nodes and do heartbeats. We have this happen occasionally in our environment, as we have nodes at two geographically separate facilities and the latency is substantial, so the lag crossing back and forth can add up. I would check whether all your nodes can talk to each other and to the master, and whether your timeouts are set high enough.

-Paul Edmon-

On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
I have a number of nodes that have, after our transition to CentOS 7.3/Slurm 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after going back online. Has anyone on this list seen similar behavior? I have increased logging to debug/verbose, but have seen no errors worth noting. Cheers, Alden
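The timeouts Paul refers to live in slurm.conf; the values below are illustrative, not a recommendation for any particular site:

```
# slurm.conf fragment for a high-latency or multi-site setup:
SlurmdTimeout=600    # seconds before slurmctld marks a silent node not-responding
MessageTimeout=60    # per-RPC message timeout
TreeWidth=16         # fanout of the communication tree used for node messages
```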
Re: [slurm-users] PMIx and Slurm
Okay, I didn't see any note on the PMIx 2.1 page about which versions of Slurm it was compatible with, so I assumed all of them. My bad. Thanks for the correction and the help. I just naively used the RPM spec that was packaged with PMIx, which does enable the legacy support. It seems best, then, to let PMIx handle pmix solely and let Slurm handle the rest. Thanks! Am I right in reading that you don't have to build Slurm against PMIx? So it just interoperates with it fine if you have it installed and specify pmix as the launch option? That's neat.

-Paul Edmon-

On 11/28/2017 6:11 PM, Philip Kovacs wrote:
Actually, if you're set on installing pmix/pmix-devel from the RPMs and then configuring Slurm manually, you could just move the pmix-installed versions of libpmi.so* and libpmi2.so* to a safe place, configure and install Slurm (which will drop in its versions of those libs), and then either use the Slurm versions or move the pmix versions of libpmi and libpmi2 back into place in /usr/lib64.

On Tuesday, November 28, 2017 5:32 PM, Philip Kovacs <pkde...@yahoo.com> wrote:
The issue is that pmix 2.0+ provides a "backward compatibility" feature, enabled by default, which installs both libpmi.so and libpmi2.so in addition to libpmix.so. The route with the least friction for you would probably be to uninstall pmix, then install Slurm normally, letting it install its libpmi and libpmi2. Next, configure and compile a custom pmix with that backward-compatibility feature _disabled_, so it only installs libpmix.so. Slurm will "see" the pmix library after you install it and load it via its plugin when you use --mpi=pmix. Again, just use the Slurm pmi and pmi2 and install pmix separately with the backward-compatible option disabled. There is a packaging issue here in which two packages are trying to install their own versions of the same files. That should be brought to the attention of the packagers. Meantime, you can work around it. For PMIx:
./configure --disable-pmi-backward-compatibility

On Tuesday, November 28, 2017 4:44 PM, Artem Polyakov <artpo...@gmail.com> wrote:
Hello, Paul. Please see below.
2017-11-28 13:13 GMT-08:00 Paul Edmon <ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>>:
So in an effort to future-proof ourselves we are trying to build Slurm against PMIx, but when I tried to do so I got the following:
Transaction check error:
file /usr/lib64/libpmi.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
file /usr/lib64/libpmi2.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
This is with compiling Slurm with the --with-pmix=/usr option. A few things:
1. I'm surprised that when I tell it to use PMIx it still builds its own versions of libpmi and pmi2, given that PMIx handles that now.
[Artem] PMIx is a plugin, and from multiple perspectives it makes sense to keep the other versions available (i.e. for backward compatibility or performance comparison).
2. Does this mean I have to install PMIx in a non-default location? If so, how does that work with user-built codes? I'd rather not have multiple versions of PMI around for people to build against.
[Artem] When we introduced PMIx it was in the beta stage and we didn't want to build against it by default. Now it probably makes sense to assume --with-pmix by default. I'm also thinking that we might need to solve it at the packaging level by distributing a "slurm-pmix" package that is built against, and depends on, the pmix package that is currently shipped with the particular Linux distro.
3. What is the right way of building PMIx and Slurm such that they interoperate properly?
[Artem] For now it is better to have PMIx installed in a well-known location, and then build your MPIs or other apps against this PMIx installation.
Starting (I think) from PMIx v2.1 we will have cross-version support that will give some flexibility about which installation to use with an application.
[Paul] Suffice it to say, little to no documentation exists on how to do this properly, so any guidance would be much appreciated.
[Artem] Indeed, we have some problems with the documentation, as the PMIx technology is relatively new. Hopefully we can fix this in the near future. Being the original developer of the PMIx plugin, I'll be happy to answer any questions and help resolve the issues.

-Paul Edmon-

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
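The build order suggested in this thread might look like the following sketch; the prefix, versions, and job launch are examples, not a tested recipe:

```shell
# 1. Build PMIx without the pmi/pmi2 compatibility libs, so it only
#    installs libpmix.so and cannot collide with Slurm's libpmi/libpmi2.
cd pmix-2.0.2
./configure --prefix=/usr --disable-pmi-backward-compatibility
make -j && make install

# 2. Build Slurm against that PMIx installation; Slurm supplies its own
#    libpmi.so and libpmi2.so without conflict.
cd ../slurm-17.02.9
./configure --with-pmix=/usr
make -j && make install

# 3. Launch jobs with the pmix plugin:
# srun --mpi=pmix ./app
```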
[slurm-users] PMIx and Slurm
So in an effort to future-proof ourselves we are trying to build Slurm against PMIx, but when I tried to do so I got the following:
Transaction check error:
file /usr/lib64/libpmi.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
file /usr/lib64/libpmi2.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
This is with compiling Slurm with the --with-pmix=/usr option. A few things:
1. I'm surprised that when I tell it to use PMIx it still builds its own versions of libpmi and pmi2, given that PMIx handles that now.
2. Does this mean I have to install PMIx in a non-default location? If so, how does that work with user-built codes? I'd rather not have multiple versions of PMI around for people to build against.
3. What is the right way of building PMIx and Slurm such that they interoperate properly?
Suffice it to say, little to no documentation exists on how to do this properly, so any guidance would be much appreciated.

-Paul Edmon-