Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Eli V
Look for libmariadb-client. That's needed for slurmdbd on Debian. On Wed, Dec 11, 2019 at 11:43 AM Dean Schulze wrote: > > Turns out I've already got libmariadb-dev installed: > > $ dpkg -l | grep maria > ii libmariadb-dev 3.0.3-1build1 >
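
A quick way to check which MariaDB client packages are installed and whether slurmdbd resolves the client library (a sketch assuming Debian/Ubuntu; package names and the slurmdbd path vary by release):

    $ dpkg -l | grep -i mariadb                  # list installed MariaDB packages
    $ sudo apt-get install libmariadb-dev        # or libmariadbclient-dev on older releases
    $ ldd /usr/sbin/slurmdbd | grep -i maria     # confirm the slurmdbd binary finds the client lib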

Re: [slurm-users] good practices

2019-11-26 Thread Eli V
Inline below On Tue, Nov 26, 2019 at 5:50 AM Loris Bennett wrote: > > Hi Nigella, > > Nigella Sanders writes: > > > Thank you all for such interesting replies. > > > > The --dependency option is quite useful but in practice it has some > > inconveniences. Firstly, all 20 jobs are instantly queued

Re: [slurm-users] gpu count

2019-06-27 Thread Eli V
gres has to be specified in both slurm.conf and gres.conf, and gres.conf must be present on the node with the gres. I keep a single cluster-wide gres.conf and copy it to all nodes just like slurm.conf. Also, after adding a new gres I think both the slurmctld and the slurmd need to be restarted. On
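
A minimal sketch of the matching entries, assuming two GPUs per node and hypothetical node names:

    # slurm.conf (cluster-wide)
    GresTypes=gpu
    NodeName=node[01-04] Gres=gpu:2

    # gres.conf (same file copied to every node, as described above)
    NodeName=node[01-04] Name=gpu File=/dev/nvidia[0-1]

    # after adding a new gres, restart the daemons
    systemctl restart slurmctld    # on the controller
    systemctl restart slurmd       # on the compute nodes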

Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Eli V
Just FYI, I tried the shared state on NFS once, and it didn't work well. Switched to native client glusterfs shared between the 2 controller nodes and haven't had a problem with it since. On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan wrote: > > Is there a way to diagnose if the I/O to the > /cm
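
For reference, the shared state in question is StateSaveLocation; a sketch of the 17.02-era configuration with both controllers pointed at a shared (here hypothetically GlusterFS-backed) path:

    # slurm.conf -- both controllers must see the same StateSaveLocation
    ControlMachine=ctl1
    BackupController=ctl2
    StateSaveLocation=/gluster/slurm/state   # hypothetical mount point on the shared filesystem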

Re: [slurm-users] What means this error ?

2019-06-25 Thread Eli V
My first guess would be that the host is not listed as one of the two controllers in the slurm.conf. Also, keep in mind that munge, and thus Slurm, is very sensitive to lack of clock synchronization between nodes. FYI, I run a hand-built slurm 18.08.07 on Debian 8 & 9 without issues. Haven't tried 10 yet

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Eli V
On Tue, Mar 19, 2019 at 8:34 AM Peter Steinbach wrote: > > Hi, > > we are struggling with a slurm 18.08.5 installation of ours. We are in a > situation, where our GPU nodes have a considerable number of cores but > "only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs > can ente

Re: [slurm-users] weight setting not working

2019-03-12 Thread Eli V
On Tue, Mar 12, 2019 at 1:14 AM Andy Leung Yin Sui wrote: > > Hi, > > I am new to slurm and want to use weight option to schedule the jobs. > I have some machine with same hardware configuration with GPU cards. I > use QoS to force user at least required 1 gpu gres when submitting > jobs. > The ma

Re: [slurm-users] How to get the CPU usage of history jobs at each compute node?

2019-02-15 Thread Eli V
sacct. Though, of course, accounting has to be turned on and working. On Fri, Feb 15, 2019 at 5:08 AM hu...@sugon.com wrote: > Dear there, > How to view the cpu usage of history jobs at each compute node? > However, this command (scontrol show jobs jobid --detail) can only get the > cpu usage of t
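
A sketch of pulling CPU usage for finished jobs with sacct (the job ID, node name, and date range are hypothetical; field names come from sacct --helpformat):

    $ sacct -j 1234567 --format=JobID,NodeList,AllocCPUS,TotalCPU,CPUTime,Elapsed

    # or all jobs that ran on a given node in a time window
    $ sacct -a -S 2019-02-01 -E 2019-02-15 --nodelist=node01 \
        --format=JobID,User,NodeList,AllocCPUS,TotalCPU,Elapsed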

Re: [slurm-users] Federated Clusters

2019-02-12 Thread Eli V
Reminds me of a follow-up question I've been meaning to ask: is it just the slurmctlds that need access to the shared slurmdbd, or do all the slurmds on all the nodes need access? On Tue, Feb 12, 2019 at 7:16 AM Antony Cleave wrote: > > You will need to be able to connect both clusters to the sa
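
In a typical setup only the slurmctlds (plus any host where sacct/sacctmgr is run) contact slurmdbd directly; a sketch of the controller-side configuration (hostname hypothetical):

    # slurm.conf on each cluster -- the slurmctld talks to the shared slurmdbd
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbdhost.example.org
    AccountingStoragePort=6819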

Re: [slurm-users] Slurm forgetting about job dependencies

2018-12-19 Thread Eli V
Looking through the slurm.conf docs and grepping around the source code, it looks like MinJobAge might be what I need to adjust. I changed it by three orders of magnitude, from 300 to 300,000, on our dev cluster. I'll see how things go. On Wed, Dec 19, 2018 at 1:14 PM Eli V wrote: > > Does sl
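
The setting referred to above, as it would look in slurm.conf (the number of seconds slurmctld keeps completed job records in memory):

    # slurm.conf
    MinJobAge=300000   # default is 300; raised so dependency info outlives completed predecessors longer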

[slurm-users] Slurm forgetting about job dependencies

2018-12-19 Thread Eli V
Does slurm remove job completion info from its memory after a while? It might explain why I'm seeing jobs getting canceled when their dependent predecessor step finished OK. Below is the egrep '352209(1|2)_11' from slurmctld.log. The 3522092 job array was created with -d aftercorr:3522091. Looks li
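
For reference, the submission pattern in question (the job IDs match the log above; script names are hypothetical):

    jid=$(sbatch --parsable --array=1-100 step1.sh)     # e.g. 3522091
    sbatch --array=1-100 -d aftercorr:$jid step2.sh     # e.g. 3522092; task N waits for task N of $jid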

Re: [slurm-users] How to implement job arrays with distribution=cyclic

2018-12-12 Thread Eli V
SelectTypeParameters=CR_LLN will do this automatically for all jobs submitted to the cluster. Not sure if that's an acceptable solution for you. On Wed, Dec 12, 2018 at 11:54 AM Roger Moye wrote: > > > I have a user who wants to control how job arrays are allocated to > nodes. He wants to mimi
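
The setting mentioned above, as a slurm.conf sketch (the CR_Core_Memory part is an assumption; combine CR_LLN with whatever consumable-resource options the cluster already uses):

    # slurm.conf
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory,CR_LLN   # CR_LLN: schedule onto the least-loaded node first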

Re: [slurm-users] How to allocate SMT cores

2018-12-07 Thread Eli V
On Fri, Dec 7, 2018 at 7:53 AM Maik Schmidt wrote: > > I have found --hint=multithread, but this only works with task/affinity. > We use task/cgroup. Are there any downsides to activating both task > plugins at the same time? > > Best, Maik > > Am 07.12.18 um 13:33 schrieb Maik Schmidt: > > Hi all
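
Running both task plugins together is a common combination; a sketch of the slurm.conf line (cgroup for resource enforcement, affinity so --hint=multithread takes effect):

    # slurm.conf
    TaskPlugin=task/cgroup,task/affinity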

Re: [slurm-users] possible to set memory slack space before killing jobs?

2018-12-06 Thread Eli V
On Wed, Dec 5, 2018 at 5:04 PM Bjørn-Helge Mevik wrote: > > I don't think Slurm has any facility for soft memory limits. > > But you could emulate it by simply configure the nodes in slurm.conf > with, e.g., 15% higher RealMemory value than what is actually available > on the node. Then a node wi
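
A sketch of the suggested over-commit, assuming a node with roughly 128 GB of RAM (node name and values hypothetical):

    # slurm.conf -- advertise ~15% more memory than is physically present
    NodeName=node01 RealMemory=147000   # actual RAM ~128000 MB; jobs may collectively exceed physical RAM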

Re: [slurm-users] possible to set memory slack space before killing jobs?

2018-12-06 Thread Eli V
On Thu, Dec 6, 2018 at 2:08 AM Loris Bennett wrote: > > Eli V writes: > > > We run our cluster using select parms CR_Core_Memory and always > > require a user to set the memory used when submitting a job to avoid > > swapping our nodes to uselessness. However, since sl

[slurm-users] possible to set memory slack space before killing jobs?

2018-12-05 Thread Eli V
We run our cluster with SelectTypeParameters=CR_Core_Memory and always require a user to set the memory needed when submitting a job, to avoid swapping our nodes to uselessness. However, since slurmd is pretty vigilant about killing jobs that exceed their request, we end up with jobs requesting more memory th

Re: [slurm-users] Can't find an address

2018-10-25 Thread Eli V
In addition to these other suggestions, keep in mind that the slurmds will talk to each other if you have more than 50 nodes (see TreeWidth in slurm.conf), so the nodes need to be able to do DNS lookups for, and communicate with, all the other nodes as well as the slurmctlds. I tried adding in some nod
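
The relevant setting; a sketch that effectively disables slurmd-to-slurmd forwarding so each slurmd only needs to reach the controllers (at the cost of slurmctld opening more direct connections):

    # slurm.conf -- default TreeWidth is 50, so beyond 50 nodes slurmds fan messages out to each other
    TreeWidth=65533   # documented maximum; flattens the message-forwarding tree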

Re: [slurm-users] Can frequent hold-release adversely affect slurm?

2018-10-18 Thread Eli V
On Thu, Oct 18, 2018 at 1:03 PM Daniel Letai wrote: > > > Hello all, > > > To solve a requirement where a large number of job arrays (~10k arrays, each > with at most 8M elements) with same priority should be executed with minimal > starvation of any array - we don't want to wait for each array

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Eli V
I don't think you need CPUs in slurm.conf for the node definition; just Sockets=4 CoresPerSocket=4 ThreadsPerCore=1, for example, and slurmctld does the math for the CPU count. Also, slurmd -C on the nodes is very useful to see what's being autodetected. On Wed, Oct 10, 2018 at 11:34 AM Noam Bernstein wrote: >
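
A sketch of the node definition described above, plus the autodetection check (node name hypothetical):

    # slurm.conf -- slurmctld computes CPUs = Sockets x CoresPerSocket x ThreadsPerCore
    NodeName=node01 Sockets=4 CoresPerSocket=4 ThreadsPerCore=1

    # run on the node itself to print what slurmd detects from the hardware
    slurmd -C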

Re: [slurm-users] "cannot find auth plugin for auth/munge" with slurm-llnl

2018-09-28 Thread Eli V
Have you started the munge service? The order should be, roughly: start munge, start mysql/mariadb, start slurmdbd, start slurmctld, then start slurmd. You didn't mention which distribution you're using. On recent Debian versions the three Slurm daemons have been split into separate packages and you'll probably b
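
The same start order spelled out as systemd units (unit names assumed; mariadb vs. mysql and the exact package split vary by distribution and release):

    systemctl start munge        # on every host running a Slurm daemon
    systemctl start mariadb      # on the database host
    systemctl start slurmdbd     # on the database host
    systemctl start slurmctld    # on the controller(s)
    systemctl start slurmd       # on every compute node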

Re: [slurm-users] Defining new Gres types on nodes

2018-09-24 Thread Eli V
On Mon, Sep 24, 2018 at 12:27 PM Will Dennis wrote: > > Hi all, > > We want to add in some Gres resource types pertaining to GPUs (amount of GPU > memory and CUDA cores) on some of our nodes. So we added the following params > into the 'gres.conf' on the nodes that have GPUs: > > Name=gpu_mem Co

[slurm-users] Anyone see odd job array dependency issues?

2018-09-17 Thread Eli V
I'm seeing a weird issue (originally with 17.02 and still after upgrading to 18.08) where occasionally job arrays created with -d aftercorr seem to be getting mixed up in the slurm controller and the wrong jobs are getting started and cancelled. Just created a bug for it: https://bugs.schedmd.com/sh

Re: [slurm-users] Elastic Compute

2018-09-12 Thread Eli V
Sounds like you figured it out, but I misremembered and had it backwards on CR_LLN. Setting it spreads the jobs out across the nodes, not filling one up first. Also, I believe it can be set per partition as well. On Tue, Sep 11, 2018 at 5:24 PM Felix Wolfheimer wrote: > > Thanks for the input! I

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Eli V
I think you probably want CR_LLN set in your SelectTypeParameters in slurm.conf. This makes it fill up a node before moving on to the next instead of "striping" the jobs across the nodes. On Mon, Sep 10, 2018 at 8:29 AM Felix Wolfheimer wrote: > > No this happens without the "Oversubscribe" parame

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-12 Thread Eli V
Yes, I saw the same issue. The default for an unset DefMemPerCPU changed from unlimited in earlier versions to 0. I just set it to 384 in slurm.conf so simple things run fine, and make sure users always set a sane value on submission. On Mon, Jun 11, 2018 at 6:40 PM, Roberts, John E. wrote: > I see this
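
The workaround described above, as a slurm.conf sketch:

    # slurm.conf -- give jobs that specify no memory a small per-CPU default instead of 0
    DefMemPerCPU=384   # MB per allocated CPU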