Re: [slurm-users] Weird one - deleting a user

2021-07-27 Thread Douglas Jacobsen
Try running `sacctmgr show runawayjobs` (or similar; see the manual to be sure). My bet is that the user has a job that is still running according to the database, and this will at least tell you about it. Doug Jacobsen, Ph.D. NERSC Senior Computing Engineer Group Lead, Computational Systems Group Na

Re: [slurm-users] slurm- not allow a user submmit jobs

2020-01-08 Thread Douglas Jacobsen
The approach we use is to create a QOS called “batchdisable” that is disallowed on all partitions, then set the user's association QOS list to only batchdisable. This deauthorizes the user from submitting to any partition. On Wed, Jan 8, 2020 at 05:52 Angelines wrote: > Hello, > > I need to forbid a
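A minimal dry-run sketch of the approach described above. It only prints the `sacctmgr` commands rather than running them. The user name `alice` is hypothetical, and the post does not show how the QOS is made "disallowed on all partitions"; a `DenyQos=batchdisable` entry on each partition in slurm.conf is one possible way and is an assumption here.

```shell
#!/bin/sh
# Dry-run sketch: print the accounting commands instead of executing them.
# "batchdisable" is the QOS name from the post; the user name is a placeholder.
disable_user_cmds() {
    user=$1
    # Create the QOS once (-i commits without the confirmation prompt).
    echo "sacctmgr -i add qos batchdisable"
    # Restrict the user's associations to that QOS only.
    echo "sacctmgr -i modify user where name=$user set qos=batchdisable"
    # Assumption, not shown in the post: each partition in slurm.conf
    # would also need something like DenyQos=batchdisable.
}

disable_user_cmds alice
```

Review the printed commands before running them by hand on a host with accounting administrator rights.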

Re: [slurm-users] Forcibly end "zombie" jobs?

2020-01-08 Thread Douglas Jacobsen
Try running `sacctmgr show runawayjobs`; it should give you the list of running/pending jobs (from slurmdbd's perspective) that are unknown to slurmctld. It will give you the option to "fix" them; note, however, that fixing will set the end time of the job to the start time, so the accounting will be
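As a rough sketch, the check can be wrapped so that it fails clearly on a host without the Slurm accounting tools. The function name `check_runaway` and the exit code 127 are choices made for this example, not anything Slurm defines.

```shell
#!/bin/sh
# List runaway jobs if sacctmgr is available; otherwise report why not.
check_runaway() {
    if command -v sacctmgr >/dev/null 2>&1; then
        # sacctmgr lists the jobs and then offers interactively to fix them.
        sacctmgr show runawayjobs
    else
        echo "sacctmgr not found: run this on a host with the Slurm accounting tools" >&2
        return 127
    fi
}
```

Call `check_runaway` from an interactive shell so the prompt can be declined if the accounting consequences of the fix are not acceptable.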

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Douglas Jacobsen
> run. So there was no remote file system or MPI involvement. > > Andy > > -Original Message- > From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of > Douglas Jacobsen > Sent: Friday, April 26, 2019 9:24 AM > To: Slurm User Community List > Su

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Douglas Jacobsen
How large is very large? Where is the executable being started? In the parallel filesystem/NFS? If that is the case you may be able to trim start times by using sbcast to transfer the executable (and its dependencies if dynamically linked) into a node-local resource, such as /tmp or /dev/shm dep
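A sketch of what staging the executable with `sbcast` could look like inside a batch job. The paths `./my_app` and `/tmp/my_app` and the node count are hypothetical, and shared-library dependencies of a dynamically linked binary would need to be staged the same way (not shown).

```shell
#!/bin/sh
# Print an illustrative batch script that stages the binary with sbcast.
# Paths and node count are placeholders, not values from the thread.
staged_job_script() {
    cat <<'EOF'
#!/bin/bash
#SBATCH --nodes=64
# Copy the executable to node-local storage on every allocated node.
sbcast ./my_app /tmp/my_app
# Launch the node-local copy instead of the one on the shared filesystem.
srun /tmp/my_app
EOF
}

staged_job_script
```

For example, `staged_job_script > job.sh` followed by `sbatch job.sh` would submit it.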

Re: [slurm-users] Backfill advice

2019-03-23 Thread Douglas Jacobsen
Hello, At first blush bf_continue and bf_interval as well as bf_maxjobs (if I remembered the parameter correctly) are critical first steps in tuning. Setting DebugFlags=backfill is essential to getting the needed data to make tuning decisions. Use of per user/account settings if they are too low
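A sketch of the slurm.conf settings this refers to, printed from a small shell function. All values are illustrative rather than recommendations. The post names "bf_maxjobs" with some uncertainty; the example assumes the intended option is `bf_max_job_test`.

```shell
#!/bin/sh
# Print an illustrative slurm.conf fragment for backfill tuning.
backfill_conf_fragment() {
    cat <<'EOF'
# slurm.conf fragment (values are placeholders, tune from the backfill logs)
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=1000
DebugFlags=Backfill
EOF
}

backfill_conf_fragment
```

With the debug flag set, the slurmctld log should show how many jobs each backfill cycle examined, which is the data needed to judge whether the limits are too low.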

Re: [slurm-users] salloc --no-shell question

2019-01-24 Thread Douglas Jacobsen
Hmmm, I can't quite replicate that: dmj@cori11:~> salloc -C knl -q interactive -N 2 --no-shell salloc: Granted job allocation 18219715 salloc: Waiting for resource configuration salloc: Nodes nid0[2318-2319] are ready for job dmj@cori11:~> srun --jobid=18219715 /bin/false srun: error: nid02318: t

Re: [slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Douglas Jacobsen
There were several related commits last week: https://github.com/SchedMD/slurm/commits/slurm-18.08 On Tue, Jan 22, 2019 at 06:28 Douglas Jacobsen wrote: > Hello, > > Yes it's a bug in the way the reboot rpcs are handled. A fix was recently > committed which we have yet to tes

Re: [slurm-users] Apparent scontrol reboot bug

2019-01-22 Thread Douglas Jacobsen
Hello, Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently committed which we have yet to test, but 18.08.5 is meant to repair this (among other things). Doug On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten wrote: > Hi, > > We encounter a strange issue on our system (Slurm

Re: [slurm-users] Lua Job Submit - Setting Features/Constraints

2018-12-19 Thread Douglas Jacobsen
Hello, We do this; it works like most of the other string-based fields, e.g., function job_submit(job_request, partinfo, submit_uid) job_request['features'] = 'special' return slurm.SUCCESS end Is there something detailed you are looking for? -Doug Doug Jacobsen, Ph.D. NERSC Comp

Re: [slurm-users] constraints question

2018-11-11 Thread Douglas Jacobsen
I think you'll need to update to 18.08 to get this working; constraint arithmetic and KNL were not compatible until that release. Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer Acting Group Lead, Computational Systems Group National Energy Research Scientific Computing Center

Re: [slurm-users] Slurm missing non primary group memberships

2018-11-10 Thread Douglas Jacobsen
We've had issues getting sssd to work reliably on compute nodes (at least at scale). The reason is not fully understood, but basically if the connection times out with sssd, it'll blacklist the server for 60s, which then causes those kinds of issues. Setting LaunchParameters=send_gids will sideste

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-14 Thread Douglas Jacobsen
sreport shows data that is summarized hourly. Restarting slurmdbd can delay this process. If some jobs are missing end records, it can massively slow the process because it may need to pick a much earlier start time in the past to summarize. `sacctmgr show runawayjobs` can help identify if you are i

Re: [slurm-users] Create users

2018-09-13 Thread Douglas Jacobsen
At one point in time we would also use the job_submit.lua to add users; however, I cannot recommend it in general since job_submit runs while locks are held within slurmctld, which could have dramatic performance or even functionality impacts if there are delays in adding the user. Doug Jacobs

Re: [slurm-users] some way to make oversubscribe jobs packed before spread

2018-08-08 Thread Douglas Jacobsen
One thing you could consider doing is setting a higher weight on the long nodes (cluster[37-100] in your example). This would cause jobs submitted to the batch partition to attempt to schedule on low-weight nodes first, then the higher-weight nodes. So "long" would only get used if a job requ
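A sketch of the node weighting, printed as a slurm.conf fragment. Only `cluster[37-100]` comes from the thread; the range `cluster[1-36]` for the remaining nodes and the weight values are assumptions, and real NodeName lines would also carry the hardware description.

```shell
#!/bin/sh
# Print an illustrative slurm.conf fragment that weights the "long" nodes higher.
# Lower-weight nodes are allocated first when a job can run on either set.
node_weight_fragment() {
    cat <<'EOF'
# cluster[1-36] is an assumed range for the non-"long" nodes
NodeName=cluster[1-36] Weight=1
NodeName=cluster[37-100] Weight=10
EOF
}

node_weight_fragment
```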

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-11 Thread Douglas Jacobsen
Applying patches d52d8f4f0 and f07f53fc13 to a slurm 17.11.7 source tree fixes this issue in my experience. Only requires restarting slurmctld. Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer Acting Group Lead, Computational Systems Group National Energy Research Scientific Computing C

Re: [slurm-users] Why SlurmUser is set to slurm by default?

2018-05-25 Thread Douglas Jacobsen
SlurmUser == root also has implications for strigger. It allows any user to set slurmctld-executed striggers. This can be OK, or not, depending on your use cases and user community. User-specified strigger commands would run on the same node as the slurmctld process, and so the user-specified sc

Re: [slurm-users] How to check if there's a reservation

2018-05-11 Thread Douglas Jacobsen
A feature that many slurm users might like is sbatch --time-min. Using both --time-min and --time, a user can specify the range of acceptable wall time limits. This can make it much easier to keep jobs running right up to the maintenance reservation. e.g.: sbatch --time-min=30:00 --time=48:00:

Re: [slurm-users] Slurm setup question

2018-04-11 Thread Douglas Jacobsen
It looks like your slurm.conf is specifying /var/spool as your state save directory, and `fatal: Incorrect permissions on state save loc: /var/spool` indicates that SlurmUser (another configuration in slurm.conf) does not have access to write to it. It might be good to make a directory dedicated
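A dry-run sketch of setting up a dedicated state save directory. It prints the commands rather than running them. The path `/var/spool/slurmctld`, the owner `slurm`, and the mode 700 are assumptions; use whatever SlurmUser is set to in your slurm.conf.

```shell
#!/bin/sh
# Dry-run sketch: print the steps for a dedicated state save directory.
# $1 = directory (placeholder path), $2 = owner (should match SlurmUser).
state_dir_setup_cmds() {
    dir=$1
    owner=$2
    echo "mkdir -p $dir"
    echo "chown $owner $dir"
    echo "chmod 700 $dir"
    echo "# then in slurm.conf: StateSaveLocation=$dir"
}

state_dir_setup_cmds /var/spool/slurmctld slurm
```

After running the printed commands as root and updating slurm.conf, slurmctld would need to be restarted to pick up the new location.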

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Douglas Jacobsen
m is due to hostname resolution" > > > On 15 January 2018 at 16:30, Elisabetta Falivene > wrote: > >> slurmd -Dvvv says >> >> slurmd: fatal: Unable to determine this slurmd's NodeName >> >> b >> >> 2018-01-15 15:58 GMT+01:00 Douglas

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Douglas Jacobsen
The fact that sinfo is responding shows that at least slurmctld is running. slurmd, on the other hand, is not. Please also get the output of the slurmd log or of running "slurmd -Dvvv" On Jan 15, 2018 06:42, "Elisabetta Falivene" wrote: > > Anyway I suggest to update the operating system to stretch and fix

Re: [slurm-users] Priority wait

2017-11-13 Thread Douglas Jacobsen
Assuming you are using backfill, I suspect this is caused by using the default SchedulerParameters, specifically the bf maxjobs or other similar limits that would prevent jobs from being reviewed. Setting DebugFlags=backfill will help greatly in debugging these issues. There are analogous parameters

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Douglas Jacobsen
omputing Center <http://www.nersc.gov> dmjacob...@lbl.gov - __o -- _ '\<,_ --(_)/ (_)__ On Wed, Nov 8, 2017 at 8:33 AM, Benjamin Redling wrote: > On 11/8/17 3:01 PM, Douglas Jacobsen wrote: > >> Also please make sure

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-08 Thread Douglas Jacobsen
Also please make sure you have the slurm-munge package installed (at least for the RPMs this is the name of the package; I'm unsure if that packaging layout was conserved for Debian) Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center