[slurm-users] Help with preemption based on licenses

2019-06-19 Thread Eric Wittmayer
Hi Slurm experts, I'm new to SLURM and could really use some help getting preemption working. The limiting factor in our cluster is licenses and I want to have high and low priority jobs where submitting a high priority job will preempt (suspend) a low priority job if all the licenses are

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
Hi Paul, Thanks. Your setup is interesting. I see that you have your processor types segregated in their own partitions (with the exception of the requeue partition), and that's how you get at the weighting mechanism. Do you have your users explicitly specify multiple partitions in the batch

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
Hi Alex, Thanks. The issue is that we don't know where they'll end up running in the heterogeneous environment. In addition, because the limit is applied by GrpTRES=cpu=N, someone buying 100 cores today shouldn't get access to 130 of today's cores. Regards, Sam On Wed, Jun 19, 2019 at 3:41 PM

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Paul Edmon
We do a similar thing here at Harvard: https://www.rc.fas.harvard.edu/fairshare/ We simply weight all the partitions based on their core type and then we allocate Shares for each account based on what they have purchased.  We don't use QoS at all, so we just rely purely on fairshare weighting

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Alex Chekholko
Hey Samuel, Can't you just adjust the existing "cpu" limit numbers using those same multipliers? Someone bought 100 CPUs 5 years ago, now that's ~70 CPUs. Or vice versa, someone buys 100 CPUs today, they get a setting of 130 CPUs because the CPUs are normalized to the old performance. Since it
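The normalization Alex describes can be sketched as a small calculation. This is only an illustration: the function name `normalized_cpu_limit` and the relative performance factors (1.0 for the older cores, 1.3 for today's cores) are assumptions chosen to match the rough "130" figure in the message, not measured values.

```python
def normalized_cpu_limit(purchased_cores: int, core_perf: float, reference_perf: float) -> int:
    """Express a purchase of cores in units of a reference core type.

    core_perf and reference_perf are hypothetical relative performance
    factors; only their ratio matters.
    """
    if core_perf <= 0 or reference_perf <= 0:
        raise ValueError("performance factors must be positive")
    return round(purchased_cores * core_perf / reference_perf)


# Assumed factors for illustration only.
OLD_PERF = 1.0   # cores bought five years ago
NEW_PERF = 1.3   # cores bought today

# 100 of today's cores, expressed in old-core units.
new_in_old_units = normalized_cpu_limit(100, NEW_PERF, OLD_PERF)

# 100 old cores, expressed in today's-core units.
old_in_new_units = normalized_cpu_limit(100, OLD_PERF, NEW_PERF)

print(new_in_old_units, old_in_new_units)
```

With these assumed factors, 100 old cores come out at 77 of today's cores, in the neighbourhood of the "~70" quoted above. The resulting number would be what goes into a limit such as GrpTRES=cpu=N; as Samuel's reply points out, this only works if the limit is counted in the same units as the cores the job actually lands on.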

[slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Fulcomer, Samuel
(...and yes, the name is inspired by a certain OEM's software licensing schemes...) At Brown we run a ~400 node cluster containing nodes of multiple architectures (Sandy/Ivy, Haswell/Broadwell, and Sky/Cascade) purchased in some cases by University funds and in others by investigator funding

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Christopher Samuel
On 6/18/19 11:29 PM, nathan norton wrote: Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on? Not quite, they exist internally but are not exposed until in use:

[slurm-users] MinJobAge not honored

2019-06-19 Thread Brian Andrus
Using slurm 19.05.0-1. MinJobAge is set to 300. MaxJobCount is set to 1. There are only about 30 jobs running. However, when a job completes, it vanishes immediately from the output of 'squeue'. Shouldn't it be staying there for 5 minutes? Brian Andrus

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Brian Andrus
Can you give the exact command/output you have from this? I suspect a typo in your slurm.conf for nodenames or what you are typing. Brian Andrus On 6/18/2019 11:29 PM, nathan norton wrote: Hi, It just shows "Node $NODE not found" Whereas others all work as expected (ie, they are running)

Re: [slurm-users] Rename account or move user from one account to another

2019-06-19 Thread Christoph Brüning
Hi, yes, modifying the database directly seems to be the only way. Part of the story is, I think, that the account name is used as the primary key instead of some account ID... which would at least make it possible to rename an account. Associations are however referenced by an ID, so ...

[slurm-users] Deadlocks in slurmdbd logs

2019-06-19 Thread David Baker
Hello, Every day we see several deadlocks in our slurmdbd log file. Together with the deadlock we always see a failed "roll up" operation. Please see below for an example. We are running slurm 18.08.0 on our cluster. As far as we know these deadlocks are not adversely affecting the operation

Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-19 Thread mercan
Hi; Using the noki user, would you try to read the /var/run/slurm-llnl/slurmd.pid and /var/run/slurm-llnl/slurmctld.pid files? Are these files present, readable and writeable? Maybe the parent directories don't have read/execute permission. Regards; Ahmet M. On 19.06.2019 07:26,
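The checks Ahmet suggests can be run in one go with a short script. This is a minimal sketch, not part of Slurm: the helper name `check_pid_file` is hypothetical, and it has to be run as the user in question, because `os.access` tests the permissions of the calling user.

```python
import os
from pathlib import Path


def check_pid_file(path: str) -> dict:
    """Report whether a file is present, readable and writable for the
    current user, and which parent directories cannot be traversed."""
    p = Path(path)
    return {
        "exists": os.path.exists(p),
        "readable": os.access(p, os.R_OK),
        "writable": os.access(p, os.W_OK),
        # A directory without execute (search) permission hides everything below it.
        "blocked_parents": [
            str(d) for d in p.parents
            if os.path.isdir(d) and not os.access(d, os.X_OK)
        ],
    }


if __name__ == "__main__":
    # Paths taken from the message above.
    for pid_file in ("/var/run/slurm-llnl/slurmd.pid",
                     "/var/run/slurm-llnl/slurmctld.pid"):
        print(pid_file, check_pid_file(pid_file))
```

A non-empty `blocked_parents` list would point at the directory-permission case mentioned in the message.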

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread nathan norton
Hi, It just shows "Node $NODE not found", whereas others all work as expected (i.e., they are running). Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on? Any other ideas? Thanks Nathan On Wed., 19 Jun. 2019,

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Chris Samuel
On Tuesday, 18 June 2019 9:36:56 PM PDT nathan norton wrote: > Just tried running that command, but it only shows nodes that are up and > running, doesn’t tell me about any nodes that are down and turned off, as > an example please see below. There is a job running that should be using > the 100