Re: [slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Bill Broadley
On 05/08/2018 05:33 PM, Christopher Samuel wrote:
> On 09/05/18 10:23, Bill Broadley wrote:
>> It's possible of course that it's entirely an openmpi problem, I'll
>> be investigating and posting there if I can't find a solution.
>
> One of the changes in OMPI 3.1.0 was:
>
> - Update PMIx to version 2.1.1.

Re: [slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Christopher Samuel
On 09/05/18 10:23, Bill Broadley wrote:
> It's possible of course that it's entirely an openmpi problem, I'll
> be investigating and posting there if I can't find a solution.

One of the changes in OMPI 3.1.0 was:

- Update PMIx to version 2.1.1.

So I'm wondering if previous versions were falling

[slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Bill Broadley
Greetings all,

I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters. I find srun handy for things like:

bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size=     1,  16384 hops,  2 nodes in   0.03 sec ( 2.00 us/hop)  1953 KB/sec

Building was st
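[A small sketch related to this thread, assuming a Slurm build that includes the PMIx MPI plugin; the relay invocation mirrors the one above, but forcing the plugin explicitly is an editor's illustration, not Bill's command:]

```
# List the MPI plugins this Slurm installation actually supports:
srun --mpi=list

# Re-run the same relay test while selecting the PMIx plugin explicitly,
# which helps isolate plugin-version mismatches like the one discussed here:
srun --mpi=pmix -N 2 -n 2 -t 1 ./relay 1
```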

Re: [slurm-users] Accounting not recording jobs

2018-05-08 Thread Christopher Samuel
On 08/05/18 18:31, sysadmin.caos wrote:
> My last job appears in the file /var/log/slurm/accounting... but I don't
> understand why the job appears there while I have configured accounting
> with "AccountingStorageType=accounting_storage/slurmdbd"

What does:

sacctmgr list clusters

say on the machine where
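[Editor's sketch of the check Chris is suggesting; the cluster name "mycluster" is a placeholder, not from the thread. If slurmdbd does not know the cluster, jobs are written only to the local text accounting log, and registering the cluster is the usual remedy:]

```
# Is this cluster registered with slurmdbd at all?
sacctmgr list clusters

# If it is missing from the output, register it (placeholder name):
sacctmgr add cluster mycluster
```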

Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc

2018-05-08 Thread Benjamin Matthews
I think this should already be fixed in the upcoming release. See:

https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72

On 5/8/18 12:08 PM, a.vita...@bioc.uzh.ch wrote:
> Dear all,
>
> I tried to debug this with some apparent success (for now).
>
> If anyone cares:
> Wi

Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc

2018-05-08 Thread a . vitalis
Dear all,

I tried to debug this with some apparent success (for now). If anyone cares: with the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp. I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:

static void _setup_env_

Re: [slurm-users] scancel a list of jobs

2018-05-08 Thread Michael Jennings
On Tuesday, 08 May 2018, at 17:00:33 (+), Chester Langin wrote:
> Is there no way to scancel a list of jobs? Like from job 120 to job
> 150? I see cancelling by user, by pending, and by job name. --Chet

If you're using BASH, you can just do:

scancel {120..150}

In other POSIX-compatible s
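[Editor's demonstration of the bash-only approach Michael describes: brace expansion builds the whole job-ID list before scancel runs, so cancelling a range is one command. echo stands in for scancel here so the expansion can be shown without a cluster:]

```shell
# On a real cluster this would simply be:
#
#     scancel {120..150}
#
# Show what the shell actually hands to the command:
echo scancel {120..150}
```

Note this is a bash feature; in a plain POSIX sh the braces are passed through literally.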

Re: [slurm-users] scancel a list of jobs

2018-05-08 Thread Ryan Novosielski
I've used loops before to do something like that. This is something off the top of my head, but something like:

seq 120 150 | while read jobid; do scancel $jobid; done

You can test it with squeue -j $jobid instead.

On 05/08/2018 01:00 PM, Chester La
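[Editor's runnable version of Ryan's loop: it works in any POSIX shell, issuing one command per job ID. echo again stands in for scancel so the loop can be exercised anywhere:]

```shell
# One invocation per job ID, portable to any POSIX shell:
seq 120 150 | while read jobid; do
    echo "scancel $jobid"    # on a real cluster: scancel "$jobid"
done
```

As Ryan notes, substituting `squeue -j $jobid` for the cancel command is a safe dry run.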

[slurm-users] scancel a list of jobs

2018-05-08 Thread Chester Langin
Is there no way to scancel a list of jobs? Like from job 120 to job 150? I see cancelling by user, by pending, and by job name. --Chet

Re: [slurm-users] "Low socket*core*thre" - solution?

2018-05-08 Thread Mahmood Naderan
I think yes! But I tried many commands and restarting the services before. At the moment I don't know why that happened but the last three commands by Werner fixed that. If the state is not persistent after a reboot, I have to dig more. Regards, Mahmood On Tue, May 8, 2018 at 3:00 AM, Chris

Re: [slurm-users] slurmdbd: mysql/accounting errors on 17.11.5 upgrade

2018-05-08 Thread Tina Fora
Ole, slurmdb has been running on the same EL7 mariadb since slurm 14. All upgrades up to 17.02 were ok until 17.11. Nothing has changed on the database. I did see a note on the news page below about 17.11 upgrade and mysql 5.1 issues but no remedy was mentioned. I'm on mariadb 5.5. https://www.sc

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Paul Edmon
We've been using a backfill priority partition for people doing HTC work.  We have requeue set so that jobs from the high priority partitions can take over. You can do this for your interactive nodes as well if you want. We dedicate hardware to interactive work and use Partition based QoS's to

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Renfro, Michael
That’s the first limit I placed on our cluster, and it has generally worked out well (never used a job limit). A single account can get 1000 CPU-days in whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS for times when the cluster is mostly idle, but a few users have jo
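[Editor's sketch of one way to express a "1000 CPU-days of running work, in whatever distribution" cap in Slurm: a GrpTRESRunMins limit on a QOS (1000 days × 1440 min = 1,440,000 CPU-minutes). The QOS and account names are hypothetical; this is not necessarily how Michael's site configured it:]

```
# Cap the aggregate remaining cpu-minutes of running jobs at 1000 CPU-days:
sacctmgr add qos capped set GrpTRESRunMins=cpu=1440000

# Attach the QOS to an account (placeholder name):
sacctmgr modify account someaccount set qos=capped
```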

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 09:49 AM, John Hearns wrote:
> Actually what IS bad is users not putting cluster resources to good use.
> You can often see jobs which are 'stalled' - i.e. the nodes are reserved
> for the job, but the internal logic of the job has failed and the
> executables have not launched. Or maybe s

[slurm-users] Accounting not recording jobs

2018-05-08 Thread sysadmin.caos
Hello,

after configuring SLURM-17.11.5 with accounting/mysql, it seems the database is not recording any job. If I run "sacct -", I get this output:

sacct: Jobs eligible from Tue May 08 00:00:00 2018 - Now
sacct: debug:  Options selected:     opt_co
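[Editor's note: a minimal sketch of the two configuration halves involved in database accounting, with placeholder hostnames; a missing piece on either side (or an unregistered cluster, as asked below) leaves sacct empty:]

```
# slurm.conf on the controller:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost
JobAcctGatherType=jobacct_gather/linux

# slurmdbd.conf on dbhost, pointing at MySQL/MariaDB:
StorageType=accounting_storage/mysql
StorageHost=localhost
```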

Re: [slurm-users] sacct: error

2018-05-08 Thread Marcel Sommer
Thanks for the hint, Chris!

Best regards,
Marcel

On 04.05.2018 at 16:06, Chris Samuel wrote:
> On Friday, 4 May 2018 4:25:04 PM AEST Marcel Sommer wrote:
>> Does anyone have an explanation for this?
>
> I think you're asking for functionality that is only supported with slurmdbd.
>
> All the b

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
"Otherwise a user can have a single job that takes the entire cluster, and inside split it up the way he wants to."

Yair, I agree. That is what I was referring to regarding interactive jobs. Perhaps not a user reserving the entire cluster, but a user reserving a lot of compute nodes and not making s

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
> Eventually the job aging makes the jobs so high-priority,

Guess I should look in the manual, but could you increase the job ageing time parameters? I guess it is also worth saying that this is the scheduler doing its job - it is supposed to keep jobs ready and waiting to go, to keep the cluster

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Yair Yarom
Hi,

This is what we did, not sure those are the best solutions :)

## Queue stuffing

We have set PriorityWeightAge several magnitudes lower than PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the age priority of older jobs. As I see it, fairshare is far more important than age. Besides t
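[Editor's illustration of Yair's weighting scheme in slurm.conf terms; the specific values are placeholders chosen to show the "several magnitudes" relationship, not Yair's actual settings:]

```
PriorityType=priority/multifactor
PriorityWeightFairshare=100000   # fairshare dominates
PriorityWeightAge=100            # age is several magnitudes lower
PriorityMaxAge=7-0               # age factor stops growing after 7 days
```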

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote:
> Jonathon A Anderson writes:
>> ## Queue stuffing
>
> There is the bf_max_job_user SchedulerParameter, which is sort of the
> "poor man's MAXIJOB"; it limits the number of jobs from each user the
> backfiller will try to start on each run. It doesn't do
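[Editor's sketch of the parameter Bjørn-Helge mentions; the value 10 is an assumption for illustration. It limits how many jobs per user the backfill scheduler will try to start on each pass, so one user's long queue cannot monopolize backfill:]

```
# slurm.conf fragment:
SchedulerParameters=bf_max_job_user=10
```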