On 05/08/2018 05:33 PM, Christopher Samuel wrote:
> On 09/05/18 10:23, Bill Broadley wrote:
>
>> It's possible of course that it's entirely an openmpi problem, I'll
>> be investigating and posting there if I can't find a solution.
>
> One of the changes in OMPI 3.1.0 was:
>
> - Update PMIx to version 2.1.1.
>
> So I'm wondering if previous versions were falling
Greetings all,
I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters.
I find srun handy for things like:
bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec
Building was st
On 08/05/18 18:31, sysadmin.caos wrote:
In file /var/log/slurm/accounting my last job appears... but I don't
understand why the job appears there when I have configured accounting
with "AccountingStorageType=accounting_storage/slurmdbd"
What does:
sacctmgr list clusters
say on the machine where
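For reference, a hedged sketch of that check; the cluster name "mycluster" below is illustrative, not taken from the thread:

```shell
# On the slurmdbd host: list the clusters registered in the accounting database.
# The ClusterName from slurm.conf should appear here.
sacctmgr list clusters

# If it is missing, it can be registered (name "mycluster" is an assumption):
sacctmgr add cluster mycluster
```

If the cluster is not registered with slurmdbd, job records have nowhere to land even though slurmctld is configured to use accounting_storage/slurmdbd.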
I think this should already be fixed in the upcoming release. See:
https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72
On 5/8/18 12:08 PM, a.vita...@bioc.uzh.ch wrote:
> Dear all,
>
> I tried to debug this with some apparent success (for now).
>
> If anyone cares:
> Wi
Dear all,
I tried to debug this with some apparent success (for now).
If anyone cares:
With the help of gdb inside sbatch, I tracked down the immediate seg fault to
strcmp.
I then hacked src/srun/srun.c with some info statements and isolated this
function as the culprit:
static void _setup_env_
On Tuesday, 08 May 2018, at 17:00:33 (+),
Chester Langin wrote:
> Is there no way to scancel a list of jobs? Like from job 120 to job
> 150? I see cancelling by user, by pending, and by job name. --Chet
If you're using BASH, you can just do: scancel {120..150}
In other POSIX-compatible s
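The two approaches can be sketched as follows; scancel accepts multiple job IDs in one invocation, so command substitution works in shells without brace expansion (the job-ID range 120-150 is illustrative, and echo is used here just to make the expansion visible):

```shell
# bash brace expansion builds the job-ID list before the command runs;
# shown with echo so the expansion is visible without a live cluster:
echo scancel {120..150}

# POSIX-compatible equivalent using seq, for shells without brace expansion:
echo scancel $(seq 120 150)
```

Either way, scancel receives one argument per job ID.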
I've used loops before to do something like that. This is something
off the top of my head, but something like:
seq 120 150 | while read jobid; do scancel $jobid; done
You can test it with squeue -j $jobid instead.
On 05/08/2018 01:00 PM, Chester La
Is there no way to scancel a list of jobs? Like from job 120 to job 150? I
see cancelling by user, by pending, and by job name. --Chet
I think yes!
But I had tried many commands and restarted the services before. At the
moment I don't know why that happened, but the last three commands from
Werner fixed it. If the state is not persistent after a reboot, I will
have to dig more.
Regards,
Mahmood
On Tue, May 8, 2018 at 3:00 AM, Chris
Ole,
slurmdb has been running on the same EL7 mariadb since slurm 14. All
upgrades through 17.02 went fine; the problem appeared with 17.11.
Nothing has changed on the database. I did see a note on the news page
below about the 17.11 upgrade and mysql 5.1 issues, but no remedy was
mentioned. I'm on mariadb 5.5.
https://www.sc
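For what it's worth, SchedMD's accounting documentation recommends raising a few InnoDB settings before large slurmdbd upgrades, since the schema conversion runs big transactions. A hedged my.cnf sketch (values are the commonly cited starting points; size the buffer pool to your database):

```ini
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```

Restart mariadb after changing these, before starting the upgraded slurmdbd.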
We've been using a backfill priority partition for people doing HTC
work. We have requeue set so that jobs from the high priority
partitions can take over.
You can do this for your interactive nodes as well if you want. We
dedicate hardware to interactive work and use Partition based QoS's to
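A minimal slurm.conf sketch of that kind of setup, assuming partition-priority preemption with requeue (partition names, node names, and priorities are illustrative):

```ini
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# Low-priority backfill partition for HTC work; jobs here get requeued
PartitionName=htc  Nodes=node[01-10] Priority=10  PreemptMode=REQUEUE
# High-priority partition whose jobs can take over the same nodes
PartitionName=high Nodes=node[01-10] Priority=100 PreemptMode=OFF
```

HTC jobs must be requeue-safe (use sbatch --requeue or restartable workloads) for this to be transparent to users.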
That’s the first limit I placed on our cluster, and it has generally worked out
well (never used a job limit). A single account can get 1000 CPU-days in
whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS
for times when the cluster is mostly idle, but a few users have jo
On 05/08/2018 09:49 AM, John Hearns wrote:
Actually what IS bad is users not putting cluster resources to good use.
You can often see jobs which are 'stalled', i.e. the nodes are reserved
for the job, but the internal logic of the job has failed and the
executables have not launched. Or maybe s
Hello,
after configuring SLURM-17.11.5 with accounting/mysql, it seems the
database is not recording any job. If I run "sacct -", I get
this output:
sacct: Jobs eligible from Tue May 08 00:00:00 2018 - Now
sacct: debug: Options selected:
opt_co
Thanks for the hint, Chris!
Best regards,
Marcel
Am 04.05.2018 um 16:06 schrieb Chris Samuel:
> On Friday, 4 May 2018 4:25:04 PM AEST Marcel Sommer wrote:
>
>> Does anyone have an explanation for this?
>
> I think you're asking for functionality that is only supported with
> slurmdbd.
>
> All the b
"Otherwise a user can have a single job that takes the entire cluster,
and inside split it up the way he wants to."
Yair, I agree. That is what I was referring to regarding interactive jobs.
Perhaps not a user reserving the entire cluster,
but a user reserving a lot of compute nodes and not making s
> Eventually the job aging makes the jobs so high-priority,
Guess I should look in the manual, but could you increase the job ageing
time parameters?
I guess it is also worth saying that this is the scheduler doing its job -
it is supposed to keep jobs ready and waiting to go, to keep the cluster
Hi,
This is what we did, not sure those are the best solutions :)
## Queue stuffing
We have set PriorityWeightAge several magnitudes lower than
PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the
age factor of older jobs. As I see it, fairshare is far more important
than age.
Besides t
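As a concrete (purely illustrative) sketch of weighting fairshare several magnitudes above age in slurm.conf:

```ini
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=100
# Age factor saturates after 7 days; value is an example, not a recommendation
PriorityMaxAge=7-0
```

With weights like these, waiting time still breaks ties, but it can never outvote a user's fairshare standing.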
On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote:
Jonathon A Anderson writes:
## Queue stuffing
There is the bf_max_job_user SchedulerParameter, which is sort of the
"poor man's MAXIJOB"; it limits the number of jobs from each user the
backfiller will try to start on each run. It doesn't do
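In slurm.conf this would look something like the following (the value 10 is illustrative):

```ini
# Backfill scheduler will consider at most 10 jobs per user per cycle
SchedulerParameters=bf_max_job_user=10
```

This only throttles what the backfiller attempts each run; it is not a hard per-user job limit like an association MaxJobs.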