[slurm-users] Tuning the backfill scheduler

2018-10-10 Thread Richard Feltstykket
Hello list, My cluster usually has a pretty heterogeneous job load and spends a lot of the time memory bound. Occasionally I have users that submit 100k+ short, low-resource jobs. Despite having several thousand free cores and enough RAM to run the jobs, the backfill scheduler would never backfill
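
A minimal sketch of the SchedulerParameters knobs usually in play when backfill stalls on a huge queue of small jobs (the values are illustrative, not from this thread):

    # slurm.conf -- backfill tuning for many short jobs (illustrative values)
    SchedulerParameters=bf_continue,bf_interval=30,bf_window=1440,bf_resolution=300,bf_max_job_test=5000,bf_max_job_user=100

bf_max_job_test is the usual culprit: by default the backfill loop only considers on the order of the first 100 queued jobs per cycle, which a 100k-job queue exhausts immediately.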

[slurm-users] Rebooted Nodes & Jobs Stuck in Cleaning State

2018-10-10 Thread Roberts, John E.
Hi, Hopefully this isn't an obvious fix I'm missing. We have a large number of KNL nodes that can get rebooted when their memory or cluster modes are changed by users. I never heard any complaints when running Slurm v16.05.10, but I've seen a number of issues since our upgrade a couple of months ago
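
A sketch of the usual triage when nodes stay stuck in a completing/cleaning state after a reboot (node name hypothetical):

    # inspect the node and any step still "completing" on it
    scontrol show node knl0123
    squeue -w knl0123 -t completing
    # once the epilog/cleanup is confirmed dead, clear the node
    scontrol update NodeName=knl0123 State=RESUME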

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
On Oct 10, 2018, at 12:07 PM, Noam Bernstein wrote: slurmd -C confirms that indeed slurm understands the architecture, so that’s good. However, removing the CPUs entry from the node list doesn’t change anything. It still drains the node. If I just remove _everything_ having to

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Chris Samuel
On 11/10/18 01:27, Christopher Benjamin Coffey wrote: That is interesting. It is disabled in 17.11.10: Yeah, I seem to remember seeing a commit that disabled it in 17.11.x. I don't think it's meant to work before 18.08.x (which is what the website will be talking about). All the best, Chris

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
On Oct 10, 2018, at 11:40 AM, Eli V wrote: Don't think you need CPUs in slurm.conf for the node def, just Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 for example, and the slurmctld does the math for # cpus. Also slurmd -C on the nodes is very useful to see what's being autodetected.

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Eli V
Don't think you need CPUs in slurm.conf for the node def, just Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 for example, and the slurmctld does the math for # cpus. Also slurmd -C on the nodes is very useful to see what's being autodetected. On Wed, Oct 10, 2018 at 11:34 AM Noam Bernstein wrote:
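
A minimal sketch of the node definition being suggested here, with slurmd -C used to cross-check (hostname and memory are placeholders):

    # on the node: print the autodetected hardware line
    slurmd -C
    # slurm.conf: give the topology and let slurmctld derive the CPU count
    NodeName=node01 Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64000 State=UNKNOWN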

[slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
Hi all - I’m new to slurm, and in many ways it’s been very nice to work with, but I’m having an issue trying to properly set up thread/core/socket counts on nodes. Basically, if I don’t specify anything except CPUs, the node is available, but doesn’t appear to know about cores and hyperthreading

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Christopher Benjamin Coffey
That is interesting. It is disabled in 17.11.10:

    static bool _enable_pack_steps(void)
    {
        bool enabled = false;
        char *sched_params = slurm_get_sched_params();
        if (sched_params && strstr(sched_params, "disable_hetero_steps"))
            enabled = false;
        else if (
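
Going by the strstr() test above, the toggle lives in SchedulerParameters; a minimal sketch of the opt-in for versions where heterogeneous steps are off by default (assuming the enable_hetero_steps keyword, the counterpart of the disable string in the code):

    # slurm.conf -- opt in to steps spanning heterogeneous job components
    SchedulerParameters=enable_hetero_steps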

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Mehlberg, Steve
I got this same error when testing on older updates (17.11?). Try the Slurm-18.08 branch or master. I'm testing 18.08 now and get this: [slurm@trek6 mpihello]$ srun -phyper -n3 --mpi=pmi2 --pack-group=0-2 ./mpihello-ompi2-rhel7 | sort srun: job 643 queued and waiting for resources srun: job 64
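
For reference, a sketch of the three-component test being described, using the heterogeneous allocation syntax from 17.11+/18.08 (partition and binary names taken from the output above; the one-task-per-component layout is an assumption):

    # request a heterogeneous allocation with three components
    salloc -p hyper -n1 : -n1 : -n1
    # launch one step, and one MPI_COMM_WORLD, across all three
    srun --pack-group=0-2 --mpi=pmi2 ./mpihello-ompi2-rhel7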

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Pritchard Jr., Howard
Hi Christopher, We hit some problems at LANL trying to use this Slurm feature. At the time, I think SchedMD said there would need to be fixes to the Slurm PMI2 library to get this to work. What version of Slurm are you using? Howard

Re: [slurm-users] Help with developing a lua job submit script

2018-10-10 Thread Baker D . J .
Hello, Thank you for your useful replies. It's certainly not anywhere near as difficult as I initially thought. We should be able to start some tests later this week. Best regards, David

[slurm-users] slurm 18.08 - x11 forwarding issue

2018-10-10 Thread Olivier Sallou
Hi, I have set up Slurm and enabled X11 forwarding (native). I connect to a node from a submission node: srun --ntasks-per-node=1 --mem 100 --x11 --pty bash I am connected to the node. In the debug logs, I can see that the X11 setup is OK: ... [2018-10-10T09:27:48.142] [131.extern] X11 forwarding
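
One common prerequisite for the native (non-SPANK) implementation in 18.08, as a sketch; this is the documented requirement, not necessarily the cause of this particular issue:

    # slurm.conf -- required for the built-in X11 forwarding code path
    PrologFlags=x11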

Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

2018-10-10 Thread Chris Samuel
On 10/10/18 05:07, Christopher Benjamin Coffey wrote: Yet, we get an error: "srun: fatal: Job steps that span multiple components of a heterogeneous job are not currently supported". But the docs seem to indicate it should work? Which version of Slurm are you on? It was disabled by default in 17.11

Re: [slurm-users] Help with developing a lua job submit script

2018-10-10 Thread Roche Ewan
Hello David, for this use case we have two partitions - serial and parallel (the default). Our lua looks like:

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- As the default partition is set later by SLURM we need to set it here using the same logic
        if job_desc.partitio
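
The preview cuts off above; a self-contained sketch of the same pattern (partition names are from the thread, but the routing test and NO_VAL guard are assumptions, not Ewan's actual logic):

    -- job_submit.lua: route jobs that did not name a partition.
    -- Unset uint32 fields read back as NO_VAL in the lua plugin.
    local NO_VAL = 4294967294

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.partition == nil then
            -- multi-node requests go to 'parallel', the rest to 'serial'
            if job_desc.min_nodes ~= NO_VAL and job_desc.min_nodes > 1 then
                job_desc.partition = "parallel"
            else
                job_desc.partition = "serial"
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end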