[slurm-users] Allow certain users to run over partition limit

2020-07-07 Thread Matthew BETTINGER
Hello, We have a Slurm system with partitions set for a max runtime of 24 hours. What would be the proper way to allow a certain set of users to run jobs on the current partitions beyond the partition limits? In the past we would isolate some nodes based on their job requirements, make a new pa

Re: [slurm-users] Allow certain users to run over partition limit

2020-07-08 Thread Matthew BETTINGER
ttps://slurm.schedmd.com/resource_limits. ____ From: slurm-users on behalf of Matthew BETTINGER Sent: Tuesday, July 7, 2020 9:40 AM To: slurm-users@lists.schedmd.com Subject: [slurm-users] Allow certain users to run over partition limit Hello, We have a slurm sy

[slurm-users] Slurmstepd errors

2020-07-28 Thread Matthew BETTINGER
Hello, Running Slurm 17.02.6 on a Cray system, and all of a sudden we have been receiving these error messages from slurmstepd. Not sure what triggers this. srun -N 4 -n 4 hostname nid00031 slurmstepd: error: task/cgroup: unable to add task[pid=903] to memory cg '(null)' nid00029 nid00030 slurm

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-03 Thread Matthew BETTINGER
Hello, Not sure what your setup is, but check the compute nodes' route table. You also might need to turn on IPv4 forwarding on whatever is their default gateway. Firewalls can come into play too. Pretty sure this isn't a Slurm issue! Matt On 8/2/20, 7:53 AM, "slurm-users on behalf of Mahmo

[slurm-users] Update users partitions

2020-08-21 Thread Matthew BETTINGER
Maybe it's Friday, but I cannot for the life of me figure out how to update a user's partitions. Just trying to give a user access to another partition. sacctmgr modify user where name=foo set partition=par1,par2,par3 Use keyword 'where' to modify condition Tried pretty much all the permutation

Re: [slurm-users] [External] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"

2021-03-22 Thread Matthew BETTINGER
Also check the NodeAddr settings in your slurm.conf. On 3/22/21, 2:48 PM, "slurm-users on behalf of Michael Robbert" wrote: I haven't tried configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a dns lookup of n26 from the login node? The w

Re: [slurm-users] derived counters

2021-04-14 Thread Matthew BETTINGER
Before you get all excited about it, we have had a terrible time trying to get GPU metrics. We finally abandoned it and switched to Grafana, Prometheus, influx. Good luck to you though. From: slurm-users on behalf of "Heckes, Frank" Reply-To: Slurm User Community List Date: Wednesday, April 14,

[slurm-users] Reservation to exceed time limit on a partition for a user

2019-01-03 Thread Matthew BETTINGER
Hello, We are running Slurm 17.02.6 with accounting on a Cray CLE system. We currently have a 24 hour job run limit on our partitions and a user needs to run a job which will exceed 24 hours of runtime. I tried to make a reservation as seen below allocating the user 36 hours to run his job but it

Re: [slurm-users] Reservation to exceed time limit on a partition for a user

2019-01-03 Thread Matthew BETTINGER
MaxMemPerNode=UNLIMITED I can run jobs in there but if I set it to just a user (myself) then the job does not run. I may have to just make this partition like this until I can figure out the correct way since we need this to run today. On 1/3/19, 8:41 AM, "Matthew BETTINGER" wrote:

[slurm-users] Report on gres usage

2019-01-15 Thread Matthew BETTINGER
Hello, We are trying to find a way to gather information about jobs assigned to GPUs. I'm not really finding anything we want using sreport for some reason. We would like to find CPU hours for our GPUs. They are defined in gres and need to be passed to srun when users request GPUs. It lo

Re: [slurm-users] SlurmDBD setup with mysql

2019-01-17 Thread Matthew BETTINGER
Not sure if this is related, but we ran into an issue configuring accounting because our cluster name had a '-' in it. This is an illegal character for table names in MariaDB, or used to be. On 1/17/19, 11:07 AM, "slurm-users on behalf of Sajesh Singh" wrote: Trying to setup acco
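
A minimal sketch of the kind of check that would have caught the cluster name problem above before accounting was configured. It assumes the cluster name ends up as part of unquoted table names in the database, and that unquoted MySQL/MariaDB identifiers are limited to letters, digits, '$' and '_'. The function names and sample cluster names are hypothetical, not part of Slurm.

```python
import re

# Assumption: characters permitted in an unquoted MySQL/MariaDB identifier.
_ALLOWED_CHAR = re.compile(r"[0-9A-Za-z$_]")


def invalid_characters(cluster_name: str) -> list:
    """Return the sorted, de-duplicated characters that are not allowed."""
    return sorted({ch for ch in cluster_name if not _ALLOWED_CHAR.fullmatch(ch)})


def is_safe_cluster_name(cluster_name: str) -> bool:
    """True if the name is non-empty and contains only allowed characters."""
    return bool(cluster_name) and not invalid_characters(cluster_name)


if __name__ == "__main__":
    # Hypothetical cluster names, for illustration only.
    for name in ("cray_xc", "cray-xc"):
        status = "ok" if is_safe_cluster_name(name) else "rejected"
        print(name, status, invalid_characters(name))
```

Whether newer Slurm or MariaDB versions still reject such names is not something this sketch settles; it only flags names that would need quoting.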

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread Matthew BETTINGER
One of the main guys, Panos, left Bright, so no answer to your specific question, but I hope you can get some support with it. We dumped our BC PoC; the sysadmin working on the PoC still has nightmares. On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns" wrote: Yugendra, the Brigh

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Matthew BETTINGER
We stuck Avere between Isilon and a cluster to get us over the hump until the next budget cycle ... then we replaced it with spectrascale for mid-level storage. Still use Lustre of course as scratch. On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" wrote: (replies inline)

[slurm-users] How to enable QOS correctly?

2019-03-05 Thread Matthew BETTINGER
Hey Slurm gurus. We have been trying to enable Slurm QOS on a Cray system here off and on for quite a while but can never get it working. Every time we try to enable QOS we disrupt the cluster and users and have to fall back. I'm not sure what we are doing wrong. We run a pretty open system

Re: [slurm-users] How to enable QOS correctly?

2019-03-05 Thread Matthew BETTINGER
ould be different if that were the case. Look at that, maybe send the QOS and partition config. - Michael On Tue, Mar 5, 2019 at 7:40 AM Matthew BETTINGER wrote: Hey slurm gurus. We have been trying to enable slurm

Re: [slurm-users] How to enable QOS correctly?

2019-03-05 Thread Matthew BETTINGER
qos works fine. Just do not know enough about this and how to test again without causing disruption. Inch high a mile wide over here with 3-4 different schedulers. On 3/5/19, 11:29 AM, "slurm-users on behalf of Christopher Samuel" wrote: On 3/5/19 7:37 AM, Matthew BETTINGER wrote:

[slurm-users] strigger on CG, completing state

2019-05-28 Thread Matthew BETTINGER
We use triggers for the obvious alerts, but is there a way to make a trigger for nodes stuck in CG (completing) state? Some user jobs, mostly Julia notebooks, can get hung in completing state if the user kills the running job or cancels it with Ctrl. When this happens we can have many many nodes

Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Matthew BETTINGER
se the epilog script, you can set the epilog script to clean up all residues from the finished jobs: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts Ahmet M. On 28.05.2019 at 19:03, Matthew BETTINGER wrote: > We

[slurm-users] Weekend Partition

2019-07-23 Thread Matthew BETTINGER
Hello, We run LSF and Slurm here. For LSF we have a weekend queue with no limit, and jobs get killed after Sunday. What is the best way to do something similar for Slurm? A reservation? We would like to have any running jobs killed after Sunday if possible too. Thanks.

Re: [slurm-users] Weekend Partition

2019-07-23 Thread Matthew BETTINGER
works best for you. HTH --Dani_L. On 7/23/19 7:36 PM, Matthew BETTINGER wrote: Hello, We run lsf and slurm here. For LSF we have a weekend queue with no limit and jobs get killed after Sunday. What is the best way to do something similar for

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-09 Thread Matthew BETTINGER
Just curious if this option or the OOM setting (which we use) can leave the nodes in CG "completing" state. We have CG states quite often and the only way out is to reboot the node. I believe it occurs when the parent process dies, gets killed, or goes to Z state? Thanks. MB On 10/8/19, 6:11 AM, "slurm-users on behal