Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-16 Thread Chris Samuel
On 16/7/19 11:43 am, Will Dennis wrote:
[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request
So it looks like your slurmdbd cannot keep up with the rate of these incoming steps and is having to throw away messages.
[2019-07-16T09:40:
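A quick way to see whether the controller's queue to slurmdbd is backing up is sdiag, which reports a DBD agent queue size; a minimal sketch, run on the slurmctld host:

$ sdiag | grep -i 'dbd agent'
(a queue size climbing toward the internal limit means slurmdbd is not keeping up with incoming records)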

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-16 Thread Will Dennis
A few more things to note:
- (Should have mentioned this earlier) running Slurm 17.11.7 (via https://launchpad.net/~jonathonf/+archive/ubuntu/slurm )
- Restarted slurmctld and slurmdbd, but still getting the slurmdbd errors as before in slurmctld.log
- Ran "mysqlcheck --databases slurm_acct_db
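For reference, a hedged sketch of the kind of consistency check being run here; the user name and password prompt are assumptions about the local MySQL/MariaDB setup:

$ mysqlcheck --check --databases slurm_acct_db -u slurm -p
$ mysqlcheck --analyze --databases slurm_acct_db -u slurm -p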

Re: [slurm-users] PMIX with heterogeneous jobs

2019-07-16 Thread Mehlberg, Steve
Philip, Thanks for trying 18.08.8 for me. I finally got a system built with 18.08.8 and I’m having much better success running heterogeneous jobs with PMIX. I haven’t seen the intermittent problem you have - but I’ve just started testing. I wonder if there is a bug in 19.05.1?
$ sinfo -V slu

Re: [slurm-users] GPU machines only run a single GPU job despite resources being available.

2019-07-16 Thread Benjamin Wong
I believe you are right, Mark! Thanks so much. I put in #SBATCH --mem=16000 and now the jobs run fine. I can get both GPU jobs running. Ben Wong
On Tue, Jul 16, 2019 at 3:22 PM Mark Hahn wrote:
> > #!/bin/bash
> > #SBATCH -c 2
> > #SBATCH -o slurm-gpu-job.out
> > #SBATCH -p gpu.q
> > #SBATCH

Re: [slurm-users] GPU machines only run a single GPU job despite resources being available.

2019-07-16 Thread Mark Hahn
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1

could it be that sbatch is defaulting to --mem=0, meaning "all the node's memory"? regards, mark hahn.
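For comparison, a sketch of the same script with an explicit memory request, so a single job no longer claims the whole node; the 16000 MB figure comes from the follow-up above and the srun payload is a placeholder:

#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
#SBATCH --mem=16000          # request 16 GB rather than all of the node's memory, leaving room for other GPU jobs

srun ./gpu_container_job     # placeholder for the actual container launch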

[slurm-users] GPU machines only run a single GPU job despite resources being available.

2019-07-16 Thread Benjamin Wong
Hi everyone, I have a Slurm node named mk-gpu-1, with eight GPUs, which I've been testing by sending GPU-based container jobs to it. For whatever reason, it will only run a single GPU job at a time. All other GPU jobs sent to Slurm have a pending (PD) state due to "(Resources)".
[ztang@mk-gpu-1 ~]$ squeue
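When jobs sit pending with reason (Resources) on a node that still has free GPUs, it is worth comparing the node's configured TRES against what is already allocated; a minimal sketch, using the node name from the message:

$ scontrol show node mk-gpu-1 | grep -E 'CfgTRES|AllocTRES'
$ squeue -w mk-gpu-1 -o '%.10i %.9T %.10m %.15b'

If AllocTRES shows one job holding all of the node's memory, memory rather than GPUs is what blocks the remaining jobs.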

[slurm-users] Clearing the "maint" flag

2019-07-16 Thread Stradling, Alden Reid (ars9ac)
After coming out of maintenance, I have a large number of nodes with the "maint" flag still set after deleting the maintenance reservation. I have attempted to clear it using scontrol a variety of ways, but to no avail. Has anyone seen this? Has anyone a solution short of mass node reboots? Tha
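One thing that is sometimes tried for a leftover state flag is resuming the affected nodes explicitly; a hedged sketch with placeholder node names, not a guaranteed fix:

# scontrol show node node001 | grep ' State='       # e.g. State=IDLE+MAINT means the flag is still set
# scontrol update NodeName=node[001-100] State=RESUME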

[slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-16 Thread Will Dennis
Hi all, Was looking at the running jobs on one group's cluster, and saw there was an insane number of "running" jobs when I did a sacct -X -s R; then looked at the output of squeue, and found a much more reasonable number...
root@slurm-controller1:/ # sacct -X -p -s R | wc -l
8895
root@slurm-contro
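When sacct reports far more RUNNING jobs than squeue, the usual cause is job records left open in the accounting database after the controller lost track of them; sacctmgr can list and close those. A small sketch, run with Slurm admin privileges:

$ sacctmgr show runawayjobs
(sacctmgr lists the orphaned job records and offers to fix them so they no longer show as running)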

Re: [slurm-users] PMIX with heterogeneous jobs

2019-07-16 Thread Philip Kovacs
Well, it looks like it does fail as often as it works.
srun --mpi=pmix -n1 -wporthos : -n1 -wathos ./hello
srun: job 681 queued and waiting for resources
srun: job 681 has been allocated resources
slurmstepd: error: athos [0] pmixp_coll_ring.c:613 [pmixp_coll_ring_check] mpi/pmix: ERROR: 0x153ab

Re: [slurm-users] PMIX with heterogeneous jobs

2019-07-16 Thread Philip Kovacs
Works here on slurm 18.08.8, pmix 3.1.2. The mpi world ranks are unified as they should be.
$ srun --mpi=pmix -n2 -wathos ./hello : -n8 -wporthos ./hello
srun: job 586 queued and waiting for resources
srun: job 586 has been allocated resources
Hello world from processor athos, rank 1 out of 10 pr

Re: [slurm-users] Running pyMPI on several nodes

2019-07-16 Thread Benson Muite
Hi, Does a regular MPI program run on two nodes? For example helloworld: https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py Benson On 7/16/19 4:30 PM, Pär Lundö wrote: Hi, Thank you for your quick answer! I’ll l
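A minimal way to exercise that C hello-world across two nodes under Slurm, as a sanity check; the compiler wrapper and any --mpi option are assumptions about the local MPI installation:

$ mpicc hello_mpi.c -o hello_mpi
$ srun -N2 -n2 ./hello_mpi
(add --mpi=pmix or --mpi=pmi2 to the srun line if that is how the local Slurm/MPI stack is wired up)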

[slurm-users] PMIX with heterogeneous jobs

2019-07-16 Thread Mehlberg, Steve
Has anyone been able to run an MPI job using PMIX and heterogeneous jobs successfully with 19.05 (or even 18.08)? I can run without heterogeneous jobs but get all sorts of errors when I try to split the job up. I haven't used MPI/PMIX much so maybe I'm missing something? Any ideas?
[slurm@tre
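One quick check before digging further is to confirm which MPI plugin types the installed Slurm actually supports; a small sketch:

$ srun --mpi=list
(pmix should appear in the output if Slurm was built against a PMIx installation)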

Re: [slurm-users] Running pyMPI on several nodes

2019-07-16 Thread Pär Lundö
Hi, Thank you for your quick answer! I’ll look into that, but they share the same hosts file and the DHCP server sets their hostnames. However, I came across a setting in the slurm.conf file, "Tmpfs", and there was a note regarding it in the MPI guide on the Slurm website. I implemented the pr
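For reference, the parameter in question is set in slurm.conf roughly like this; the path is a placeholder for whatever temporary file system the nodes use:

TmpFS=/tmp    # file system Slurm uses when reporting a node's TmpDisk space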

Re: [slurm-users] Running pyMPI on several nodes

2019-07-16 Thread John Hearns
srun: error: Application launch failed: Invalid node name specified
Hearns Law. All batch system problems are DNS problems. Seriously though - check out your name resolution both on the head node and the compute nodes.
On Tue, 16 Jul 2019 at 08:49, Pär Lundö wrote:
> Hi,
> > I have now had th
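A few quick name-resolution checks, to be run on both the head node and a compute node; the node names here are placeholders:

$ getent hosts mk-node01                 # does the name resolve to the expected address?
$ hostname -s                            # does the short hostname match the NodeName in slurm.conf?
$ scontrol show hostnames node[01-02]    # expand the configured hostlist to compare spellings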

Re: [slurm-users] Running pyMPI on several nodes

2019-07-16 Thread Pär Lundö
Hi, I have now had the time to look at some of your suggestions. First I tried running "srun -N1 hostname" via an sbatch script, while having two nodes up and running. "sinfo" shows that two nodes are up and idle prior to submitting the sbatch script. After submitting the job, I receive an er
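When srun reports an invalid node name, it can help to compare the names the controller has registered with what the job actually requests; a small sketch:

$ scontrol show nodes | grep NodeName    # node names as the controller knows them
$ sinfo -N -l                            # per-node state, in case a node is down, drained or unknown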