On 16/7/19 11:43 am, Will Dennis wrote:
[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140),
discarding DBD_STEP_START:1442 request
So it looks like your slurmdbd cannot keep up with the rate of these
incoming steps and is having to throw away messages.
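(One way to watch whether the controller is falling behind is sdiag, which reports the current depth of that agent queue; a quick sketch:)

$ sdiag | grep -i 'DBD Agent queue size'
# A value that keeps climbing toward the limit means slurmdbd is not keeping up.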
[2019-07-16T09:40:
A few more things to note:
- (Should have mentioned this earlier) running Slurm 17.11.7 (via https://launchpad.net/~jonathonf/+archive/ubuntu/slurm)
- Restarted slurmctld and slurmdbd, but still getting the slurmdbd errors as
before in slurmctld.log
- Ran "mysqlcheck --databases slurm_acct_db
Philip, Thanks for trying 18.08.8 for me. I finally got a system built with
18.08.8 and I’m having much better success running heterogeneous jobs with
PMIX. I haven’t seen the intermittent problem you have - but I’ve just started
testing. I wonder if there is a bug in 19.05.1?
$ sinfo -V
slu
I believe you are right, Mark!
Thanks so much. I put in #SBATCH --mem=16000 and now the jobs run fine. I
can get both GPU jobs running.
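(For anyone hitting the same thing, the working script presumably just adds that memory line to the job header quoted below; a sketch:)

#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
# Request 16000 MB for the job instead of the implicit --mem=0 ("all of the
# node's memory"), so other GPU jobs can still be scheduled on the node.
#SBATCH --mem=16000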
Ben Wong
On Tue, Jul 16, 2019 at 3:22 PM Mark Hahn wrote:
> > #!/bin/bash
> > #SBATCH -c 2
> > #SBATCH -o slurm-gpu-job.out
> > #SBATCH -p gpu.q
> > #SBATCH
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
could it be that sbatch is defaulting to --mem=0, meaning "all the node's
memory"?
regards, mark hahn.
Hi everyone,
I have a Slurm node named mk-gpu-1, with eight GPUs, to which I've been
sending GPU-based container jobs for testing. For whatever reason, it will
only run a single GPU job at a time. All other GPU jobs submitted through
Slurm sit in the pending (PD) state with reason "(Resources)".
[ztang@mk-gpu-1 ~]$ squeue
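(One way to see why the remaining jobs are stuck on Resources is to look at what the running job has actually been allocated on the node; a sketch:)

$ scontrol show node mk-gpu-1 | grep -i -E 'CfgTRES|AllocTRES'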
After coming out of maintenance, I have a large number of nodes with the
"maint" flag still set after deleting the maintenance reservation. I have
attempted to clear it using scontrol a variety of ways, but to no avail. Has
anyone seen this? Has anyone a solution short of mass node reboots?
Tha
Hi all,
Was looking at the running jobs on one group's cluster, and saw there was an
insane number of "running" jobs when I did a sacct -X -s R; then looked at the
output of squeue, and found a much more reasonable number...
root@slurm-controller1:/ # sacct -X -p -s R | wc -l
8895
root@slurm-contro
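(A common cause of this kind of mismatch is "runaway" jobs that were never closed out in the accounting database; a sketch of how to compare like with like and check for them:)

# Jobs the controller actually considers running:
$ squeue -h -t R | wc -l
# Jobs left open in the accounting database (sacctmgr offers to fix them):
$ sacctmgr show runawayjobs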
Well, it looks like it does fail as often as it works.
srun --mpi=pmix -n1 -wporthos : -n1 -wathos ./hello
srun: job 681 queued and waiting for resources
srun: job 681 has been allocated resources
slurmstepd: error: athos [0] pmixp_coll_ring.c:613 [pmixp_coll_ring_check] mpi/pmix: ERROR: 0x153ab
Works here on slurm 18.08.8, pmix 3.1.2. The MPI world ranks are unified as
they should be.
$ srun --mpi=pmix -n2 -wathos ./hello : -n8 -wporthos ./hello
srun: job 586 queued and waiting for resources
srun: job 586 has been allocated resources
Hello world from processor athos, rank 1 out of 10 pr
Hi,
Does a regular MPI program run on two nodes? For example helloworld:
https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c
https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py
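(If it helps, a minimal way to build the C version and launch it across the two nodes from this thread might be the following; the compiler wrapper is an assumption:)

$ mpicc hello_mpi.c -o hello_mpi
$ srun --mpi=pmix -N2 -n2 ./hello_mpi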
Benson
On 7/16/19 4:30 PM, Pär Lundö wrote:
Hi,
Thank you for your quick answer!
I’ll l
Has anyone been able to run an MPI job using PMIX and heterogeneous jobs
successfully with 19.05 (or even 18.08)? I can run without heterogeneous jobs
but get all sorts of errors when I try and split the job up.
I haven't used MPI/PMIX much so maybe I'm missing something? Any ideas?
[slurm@tre
Hi,
Thank you for your quick answer!
I’ll look into that, but they share the same hosts file and the DHCP server
sets their hostname.
However, I came across a setting in the slurm.conf file, "TmpFS", and there was
a note regarding it in the MPI guide on the Slurm website. I implemented the
pr
srun: error: Application launch failed: Invalid node name specified
Hearns Law. All batch system problems are DNS problems.
Seriously though - check out your name resolution both on the head node and
the compute nodes.
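(A short sketch of what to check, using the node names from this thread:)

# Run on the head node and on each compute node:
$ getent hosts athos porthos
$ scontrol show node athos | grep -i -E 'NodeAddr|NodeHostName'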
On Tue, 16 Jul 2019 at 08:49, Pär Lundö wrote:
> Hi,
>
> I have now had th
Hi,
I have now had the time to look at some of your suggestions.
First I tried running "srun -N1 hostname" via an sbatch script, while
having two nodes up and running.
"sinfo" shows that the two nodes are up and idle prior to submitting the
sbatch script.
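(For context, a minimal version of such a batch script might look like the following; the output file name is an assumption:)

#!/bin/bash
#SBATCH -N1
#SBATCH -o hostname-test.out
# Launch the task and print the hostname of the allocated node
srun -N1 hostname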
After submitting the job, I receive an er