Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:

> Since this happens on a fresh new database, I just don't understand how I
> can get back to a basic functional state. This is exceedingly frustrating.

I have to say that if you're seeing this with 17.11, 18.08 and 19.05, and it only started when your colleague upgraded MySQL, then this sounds like MySQL is triggering the problem. We're running with MariaDB 10.x (from SLES15) without issues, and our database is huge.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Job Step Resource Requests are Ignored
On Tuesday, 5 May 2020 4:47:12 PM PDT Maria Semple wrote:

> I'd like to set different resource limits for different steps of my job. A
> sample script might look like this (e.g. job.sh):
>
> #!/bin/bash
> srun --cpus-per-task=1 --mem=1 echo "Starting..."
> srun --cpus-per-task=4 --mem=250 --exclusive
> srun --cpus-per-task=1 --mem=1 echo "Finished."
>
> Then I would run the script from the command line using the following
> command: sbatch --ntasks=1 job.sh.

You shouldn't ask for more resources with "srun" than have been allocated with "sbatch" - so if you want the job to be able to use up to 4 cores at once & that amount of memory you'll need to use:

sbatch -c 4 --mem=250 --ntasks=1 job.sh

I'd also suggest using suffixes for memory to disambiguate the values.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
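To make that concrete, a minimal sketch of what the reworked job script could look like, using the 4-CPU/250M figures from this thread; the ./heavy_step.sh command in the middle step is a placeholder for whatever was intended there, not part of the original post:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # job-level allocation covers the largest step
#SBATCH --mem=250M             # memory suffix makes the unit explicit

# small bookkeeping steps request only a slice of the job's allocation
srun --ntasks=1 --cpus-per-task=1 --mem=1M echo "Starting..."

# the big step may use everything the job owns, but no more than that
srun --ntasks=1 --cpus-per-task=4 --mem=250M --exclusive ./heavy_step.sh

srun --ntasks=1 --cpus-per-task=1 --mem=1M echo "Finished."

With the requests in the header, the script would then be submitted simply as "sbatch job.sh".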
Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
On Tuesday, 5 May 2020 3:48:22 PM PDT Sean Crosby wrote:

> sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Also don't forget you need to tell Slurm to enforce QOS limits with:

AccountingStorageEnforce=safe,qos

in your Slurm configuration ("safe" is good to set, and turns on enforcement of other restrictions around associations too).

See: https://slurm.schedmd.com/resource_limits.html

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
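For reference, a hedged sketch of how the pieces from this thread fit together; the partition name, node list and QOS name are illustrative only, and the exact behaviour should be checked against the resource_limits page above:

# slurm.conf: enforce QOS/association limits and attach the QOS to the GPU partition
AccountingStorageEnforce=safe,qos
PartitionName=gpu Nodes=gpu[01-10] QOS=gpujobs State=UP

# one-time accounting setup, run by a Slurm administrator
sacctmgr add qos gpujobs
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

# pick up the slurm.conf change afterwards
scontrol reconfigure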
[slurm-users] Job Step Resource Requests are Ignored
Hi!

I'd like to set different resource limits for different steps of my job. A sample script might look like this (e.g. job.sh):

#!/bin/bash
srun --cpus-per-task=1 --mem=1 echo "Starting..."
srun --cpus-per-task=4 --mem=250 --exclusive
srun --cpus-per-task=1 --mem=1 echo "Finished."

Then I would run the script from the command line using the following command: sbatch --ntasks=1 job.sh.

I have observed that none of the steps appear to have limited memory (which I'm pretty sure has to do with my proctrack plugin type), and that although the second step runs and scontrol show step .1 shows the step as having been allocated 4 CPUs, in reality the step is only able to use 1. I have also observed the opposite. Running the following command, I can see that the job step is able to use all CPUs allocated to the job, rather than the one it was allocated itself:

sbatch --ntasks=1 --cpus-per-task=2 << EOF
#!/bin/bash
srun --cpus-per-task=1
EOF

My goal here is to be able to run a single job with 3 steps where the first and last step are always executed, even if the second would not be run because too many resources were requested.

Here is my slurm.conf, with commented out lines removed (this is just a small test cluster with a single node on the same machine as the controller):

SlurmctldHost=ubuntu
CredType=cred/munge
AuthType=auth/munge
EnforcePartLimits=ALL
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/spool/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/spool/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=3600
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompHost=localhost
JobCompLoc=slurm_db
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurm/slurmd/slurmd.log
NodeName=ubuntu CPUs=4 RealMemory=500 State=UNKNOWN
PartitionName=main Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP AllowGroups=maria

Any advice would be greatly appreciated! Thanks in advance!

--
Thanks,
Maria
Re: [slurm-users] Major newbie - Slurm/jupyterhub
Hi Michael,

I get the gist of everything you mentioned but now I feel even more overwhelmed. Can I not get jupyterhub up and running without all those modules pieces? I was hoping to have a base kernel in jupyterhub that contained a lot of the data science packages. I'm going to have some developers that are very adept and others that are just going to want to drop in code and have it run because it ran on their local machine.

In an attempt to decouple jupyterhub from the environments I tried to follow the instructions on this page: https://jupyterhub.readthedocs.io/en/stable/installation-guide-hard.html

These were their steps I'm trying to follow:

* We will create an installation of JupyterHub and JupyterLab using a virtualenv under /opt using the system Python.
* We will install conda globally.
* We will create a shared conda environment which can be used (but not modified) by all users.
* We will show how users can create their own private conda environments, where they can install whatever they like.

That's for Ubuntu but I was able to make it work as far as getting jupyterhub up and running. I used the system python3.6 that seemed to be there from the epel repository (AdvancedHPC preinstalled CentOS for us). I struggled with part 2 when I tried to use conda to install a base environment. I tried to just use the conda from the python36 and make a directory in /opt/conda/envs like they mentioned. I added some modules and then tried to link it to jupyterhub like they say, but it doesn't pick it up. It's still only seeing the default, which is what's in the virtualenv in /opt/jupyterhub. If I add a module with /opt/jupyterhub/bin/pip install numpy, that works and I'll see it in my notebook.

I don't want to end up with a bunch of pythons floating all over the place and messing everything up. It seems like not many set this up on CentOS? It seems the virtualenv part worked successfully using the system-installed python36, but should I be doing something different for conda? I was trying to avoid each user installing a full Anaconda because that will eat up space fast. If I want miniconda on CentOS, does anyone have a set of installation instructions they recommend?

***
Lisa Weihl Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116 | Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu
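For what it's worth, a rough sketch of the shared-Miniconda layout that guide describes, translated into shell commands; the /opt/conda prefix, Python version, package list and the /opt/jupyterhub kernel prefix are placeholders rather than a tested CentOS recipe:

# 1. install Miniconda once, system-wide, instead of a full Anaconda per user
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

# 2. create one shared, root-owned environment with the common data science stack
sudo /opt/conda/bin/conda create --prefix /opt/conda/envs/datasci python=3.6 numpy pandas scikit-learn

# 3. expose that environment to JupyterHub as a selectable kernel
sudo /opt/conda/envs/datasci/bin/python -m pip install ipykernel
sudo /opt/conda/envs/datasci/bin/python -m ipykernel install --prefix=/opt/jupyterhub --name datasci --display-name "Python (datasci)"

# users who need extra packages create their own environments under $HOME
/opt/conda/bin/conda create --prefix ~/envs/myenv python=3.6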
Message: 1
Date: Tue, 5 May 2020 16:22:47 +0000
From: "Renfro, Michael"
To: Slurm User Community List
Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub

Aside from any Slurm configuration, I'd recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job basis. Ours includes a basic Miniconda installation, and the users can make their own environments from there [3]. I very rarely install a system-wide Python module.

[1] http://modules.sourceforge.net/
[2] https://lmod.readthedocs.io/
[3]
Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
Hi Thomas, That value should be sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4 Sean -- Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead Research Computing Services | Business Services The University of Melbourne, Victoria 3010 Australia On Wed, 6 May 2020 at 04:53, Theis, Thomas wrote: > *UoM notice: External email. Be cautious of links, attachments, or > impersonation attempts.* > -- > > Hey Killian, > > > > I tried to limit the number of gpus a user can run on at a time by adding > MaxTRESPerUser = gres:gpu4 to both the user and the qos.. I restarted slurm > control daemon and unfortunately I am still able to run on all the gpus in > the partition. Any other ideas? > > > > *Thomas Theis* > > > > *From:* slurm-users *On Behalf Of > *Killian Murphy > *Sent:* Thursday, April 23, 2020 1:33 PM > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] Limit the number of GPUS per user per > partition > > > > External Email > > Hi Thomas. > > > > We limit the maximum number of GPUs a user can have allocated in a > partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is > set as the partition QoS on our GPU partition. I.E: > > > > We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit > total number of allocated GPUs to 4, and set the GPU partition QoS to the > `gpujobs` QoS. > > > > There is a section in the Slurm documentation on the 'Resource Limits' > page entitled 'QOS specific limits supported ( > https://slurm.schedmd.com/resource_limits.html) that details some care > needed when using this kind of limit setting with typed GRES. Although it > seems like you are trying to do something with generic GRES, it's worth a > read! > > > > Killian > > > > > > > > On Thu, 23 Apr 2020 at 18:19, Theis, Thomas > wrote: > > Hi everyone, > > First message, I am trying find a good way or multiple ways to limit the > usage of jobs per node or use of gpus per node, without blocking a user > from submitting them. > > > > Example. We have 10 nodes each with 4 gpus in a partition. We allow a team > of 6 people to submit jobs to any or all of the nodes. One job per gpu; > thus we can hold a total of 40 jobs concurrently in the partition. > > At the moment: each user usually submit 50- 100 jobs at once. Taking up > all gpus, and all other users have to wait in pending.. > > > > What I am trying to setup is allow all users to submit as many jobs as > they wish but only run on 1 out of the 4 gpus per node, or some number out > of the total 40 gpus across the entire partition. Using slurm 18.08.3.. > > > > This is roughly our slurm scripts. > > > > #SBATCH --job-name=Name # Job name > > #SBATCH --mem=5gb # Job memory request > > #SBATCH --ntasks=1 > > #SBATCH --gres=gpu:1 > > #SBATCH --partition=PART1 > > #SBATCH --time=200:00:00 # Time limit hrs:min:sec > > #SBATCH --output=job _%j.log # Standard output and error log > > #SBATCH --nodes=1 > > #SBATCH --qos=high > > > > srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c > "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e > SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name > $SLURM_JOB_ID do_job.sh" > > > > *Thomas Theis* > > > > > > > -- > > Killian Murphy > > Research Software Engineer > > > > Wolfson Atmospheric Chemistry Laboratories > University of York > Heslington > York > YO10 5DD > +44 (0)1904 32 4753 > > e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm >
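Spelled out as a hedged sketch, using the QOS name from Killian's message (the format fields and <jobid> below are illustrative and should be checked against your sacctmgr version):

# per-user GPU cap on the QOS -- note the gres/gpu=4 form, not gres:gpu4
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

# confirm the limit is actually stored
sacctmgr show qos gpujobs format=Name,MaxTRESPU

# and check what an individual job was granted
scontrol show job <jobid> | grep -i TRES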
[slurm-users] "sacctmgr add cluster" crashing slurmdbd
Hi,

I've just upgraded to slurm 19.05.5. With either my old database, OR creating an entirely new database, I am unable to create a new 'cluster' entry in the database -- slurmdbd is segfaulting!

# sacctmgr add cluster test3
 Adding Cluster(s)
  Name = test3
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to mn001:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ADD_CLUSTERS
 Problem adding clusters: Unspecified error
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused

Meanwhile, running "slurmdbd -D -v -v -v -v -v", I see

[2020-05-05T18:17:19.503] debug4: 10(as_mysql_cluster.c:405) query
insert into txn_table (timestamp, action, name, actor, info) values (1588717037, 1405, 'test3', 'root', 'mod_time=1588717037, shares=1, grp_jobs=NULL, grp_jobs_accrue=NULL, grp_submit_jobs=NULL, grp_wall=NULL, max_jobs=NULL, max_jobs_accrue=NULL, min_prio_thresh=NULL, max_submit_jobs=NULL, max_wall_pj=NULL, priority=NULL, def_qos_id=NULL, qos=\',1,\', federation=\'\', fed_id=0, fed_state=0, features=\'\'');
slurmdbd: debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and acct='root';
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and acct='root';
slurmdbd: debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
Segmentation fault (core dumped)

Since this happens on a fresh new database, I just don't understand how I can get back to a basic functional state. This is exceedingly frustrating.

Thanks for any hints.

--dustin
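For a crash like this, a hedged sketch of how to capture a usable backtrace out of slurmdbd (assuming the binary was built with debugging symbols; the daemon flags and the sacctmgr command are the same ones shown above, and the core path is a placeholder):

# run slurmdbd in the foreground under gdb and reproduce the crash
gdb --args slurmdbd -D -v -v -v -v -v
(gdb) run
# ... in another shell: sacctmgr add cluster test3 ...
(gdb) bt full

# or, if a core file was dumped
gdb slurmdbd /path/to/core
(gdb) bt full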
Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS
I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation Fault! On Tue, May 5, 2020 at 2:39 PM Dustin Lang wrote: > Hi, > > Apparently my colleague upgraded the mysql client and server, but, as far > as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql > release notes I don't see anything that looks suspicious there... > > cheers, > --dustin > > > On Tue, May 5, 2020 at 1:37 PM Dustin Lang wrote: > >> Hi, >> >> We're running Slurm 17.11.12. Everything has been working fine, and then >> suddenly slurmctld is crashing and slurmdbd is crashing. >> >> We use fair-share as part of the queuing policy, and previously set up >> accounts with sacctmgr; that has been working fine for months. >> >> If I run slurmdbd in debug mode, >> >> slurmdbd -D -v -v -v -v -v >> >> it eventually (after being contacted by slurmctld) segfaults with: >> >> ... >> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) >> TIME:1588695584 >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) >> TIME:1588695584 >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug2: DBD_GET_TRES: called >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug2: DBD_GET_QOS: called >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug2: DBD_GET_USERS: called >> slurmdbd: debug4: got 0 commits >> slurmdbd: debug2: DBD_GET_ASSOCS: called >> slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query >> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select >> @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, >> @delta_qos; >> Segmentation fault (core dumped) >> >> >> It looks (running slurmdbd in gdb) like that segfault is coming from >> >> >> https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073 >> >> and If I connect to the mysql database directly and call that stored >> procedure, I get >> >> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); >> >> +-+-+-+--+---+-+-+-+-+--+-+-+ >> | @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj >> := max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos := >> REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj, >> if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn := >> CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn) >> | @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', >> ''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' && >> max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new := >> parent_acct | >> >> +-+-+-+--+---+-+-+-+-+--+-+-+ >> | 1 |NULL |NULL | >> NULL | NULL | ,1, | NULL >>| NULL >> | NULL >>| NULL >> >> | NULL >> | | >> >> +-+-+-+--+---+-+-+-+-+--+-+-+ >> >> and if I run >> >> mysql> call
Re: [slurm-users] Limit the number of GPUS per user per partition
Hey Killian, I tried to limit the number of gpus a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the qos.. I restarted slurm control daemon and unfortunately I am still able to run on all the gpus in the partition. Any other ideas? Thomas Theis From: slurm-users On Behalf Of Killian Murphy Sent: Thursday, April 23, 2020 1:33 PM To: Slurm User Community List Subject: Re: [slurm-users] Limit the number of GPUS per user per partition External Email Hi Thomas. We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. I.E: We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit total number of allocated GPUs to 4, and set the GPU partition QoS to the `gpujobs` QoS. There is a section in the Slurm documentation on the 'Resource Limits' page entitled 'QOS specific limits supported (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit setting with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read! Killian On Thu, 23 Apr 2020 at 18:19, Theis, Thomas mailto:thomas.th...@teledyne.com>> wrote: Hi everyone, First message, I am trying find a good way or multiple ways to limit the usage of jobs per node or use of gpus per node, without blocking a user from submitting them. Example. We have 10 nodes each with 4 gpus in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. One job per gpu; thus we can hold a total of 40 jobs concurrently in the partition. At the moment: each user usually submit 50- 100 jobs at once. Taking up all gpus, and all other users have to wait in pending.. What I am trying to setup is allow all users to submit as many jobs as they wish but only run on 1 out of the 4 gpus per node, or some number out of the total 40 gpus across the entire partition. Using slurm 18.08.3.. This is roughly our slurm scripts. #SBATCH --job-name=Name # Job name #SBATCH --mem=5gb # Job memory request #SBATCH --ntasks=1 #SBATCH --gres=gpu:1 #SBATCH --partition=PART1 #SBATCH --time=200:00:00 # Time limit hrs:min:sec #SBATCH --output=job _%j.log # Standard output and error log #SBATCH --nodes=1 #SBATCH --qos=high srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh" Thomas Theis -- Killian Murphy Research Software Engineer Wolfson Atmospheric Chemistry Laboratories University of York Heslington York YO10 5DD +44 (0)1904 32 4753 e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS
Hi, Apparently my colleague upgraded the mysql client and server, but, as far as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql release notes I don't see anything that looks suspicious there... cheers, --dustin On Tue, May 5, 2020 at 1:37 PM Dustin Lang wrote: > Hi, > > We're running Slurm 17.11.12. Everything has been working fine, and then > suddenly slurmctld is crashing and slurmdbd is crashing. > > We use fair-share as part of the queuing policy, and previously set up > accounts with sacctmgr; that has been working fine for months. > > If I run slurmdbd in debug mode, > > slurmdbd -D -v -v -v -v -v > > it eventually (after being contacted by slurmctld) segfaults with: > > ... > slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) > TIME:1588695584 > slurmdbd: debug4: got 0 commits > slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) > TIME:1588695584 > slurmdbd: debug4: got 0 commits > slurmdbd: debug4: got 0 commits > slurmdbd: debug2: DBD_GET_TRES: called > slurmdbd: debug4: got 0 commits > slurmdbd: debug2: DBD_GET_QOS: called > slurmdbd: debug4: got 0 commits > slurmdbd: debug2: DBD_GET_USERS: called > slurmdbd: debug4: got 0 commits > slurmdbd: debug2: DBD_GET_ASSOCS: called > slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query > call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select > @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, > @delta_qos; > Segmentation fault (core dumped) > > > It looks (running slurmdbd in gdb) like that segfault is coming from > > > https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073 > > and If I connect to the mysql database directly and call that stored > procedure, I get > > mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); > > +-+-+-+--+---+-+-+-+-+--+-+-+ > | @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj > := max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos := > REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj, > if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn := > CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn) > | @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', > ''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' && > max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new := > parent_acct | > > +-+-+-+--+---+-+-+-+-+--+-+-+ > | 1 |NULL |NULL | > NULL | NULL | ,1, | NULL >| NULL > | NULL >| NULL > > | NULL > | | > > +-+-+-+--+---+-+-+-+-+--+-+-+ > > and if I run > > mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); > select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, > @qos, @delta_qos; > > I get > > >
[slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS
Hi,

We're running Slurm 17.11.12. Everything has been working fine, and then suddenly slurmctld is crashing and slurmdbd is crashing.

We use fair-share as part of the queuing policy, and previously set up accounts with sacctmgr; that has been working fine for months.

If I run slurmdbd in debug mode,

slurmdbd -D -v -v -v -v -v

it eventually (after being contacted by slurmctld) segfaults with:

...
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null) TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null) TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_TRES: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_QOS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_USERS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_ASSOCS: called
slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;
Segmentation fault (core dumped)

It looks (running slurmdbd in gdb) like that segfault is coming from

https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073

and if I connect to the mysql database directly and call that stored procedure, I get one row back, with these column expressions and values:

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);

@par_id := id_assoc                                                : 1
@mj := max_jobs                                                    : NULL
@msj := max_submit_jobs                                            : NULL
@mwpj := max_wall_pj                                               : NULL
@def_qos_id := def_qos_id                                          : NULL
@qos := qos                                                        : ,1,
@delta_qos := REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',')    : NULL
@mtpj := CONCAT(@mtpj, if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj)             : NULL
@mtpn := CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn)             : NULL
@mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',', ''), max_tres_mins_pj) : NULL
@mtrm := CONCAT(@mtrm, if (@mtrm != '' && max_tres_run_mins != '', ',', ''), max_tres_run_mins)  : NULL
@my_acct_new := parent_acct                                        : (empty)

and if I run

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;

I get

+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj  | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|       1 | NULL | NULL | NULL  | NULL  | NULL  | NULL   | NULL  | NULL        | ,1,  | NULL       |
+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
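A few hedged checks one could run directly in MySQL while digging into this; the database name below is a placeholder for whatever StorageLoc in slurmdbd.conf points at, and the statements are plain MySQL:

mysql> SELECT VERSION();
mysql> USE slurm_acct_db;
mysql> SHOW CREATE PROCEDURE get_parent_limits\G
mysql> SHOW VARIABLES LIKE 'thread_stack';

The last one is only interesting because stored-procedure behaviour, including how much per-thread stack a call consumes, is one of the things that can differ between server versions.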
Re: [slurm-users] Major newbie - Slurm/jupyterhub
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] folder structure for CUDA and other third-party software. That handles LD_LIBRARY_PATH and other similar variables, reduces the chances for library conflicts, and lets users decide their environment on a per-job basis. Ours includes a basic Miniconda installation, and the users can make their own environments from there [3]. I very rarely install a system-wide Python module. [1] http://modules.sourceforge.net [2] https://lmod.readthedocs.io/ [3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook > On May 5, 2020, at 9:37 AM, Lisa Kay Weihl wrote: > > Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my > home directory. That enabled me to find out that it could not find the path > for batchspawner-singleuser. > > > So I added this to jupyter_config.py > export PATH=/opt/rh/rh-python36/root/bin:$PATH > > > That seemed to now allow the server to launch for my user that I use for all > the configuration work. I get errors (see below) but the notebook loads. The > problem is I'm not sure how to kill the job in the Slurm queue or the > notebook server if I finish before the job times out and kills it. Logout > doesn't seem to do it. > > It still doesn't work for a regular user (see below) > > I think my problems all have to do with Slurm/jupyterhub finding python. So I > have some questions about the best way to set it up for multiple users and > make it work for this. > > I use CentOS distribution so that if the university admins will ever have to > take over it will match their RedHat setups they use. I know on all Linux > distros you need to leave the python 2 system install alone. It looks like as > of CentOS 7.7 there is now a python3 in the repository. I didn't go that > route because in the past I installed the python from RedHat Software > Collection which is what I did this time. > I don't know if that's the best route for this use case. They also say don't > sudo pip3 to try to install global packages but does that mean sudo to root > and then using pip3 is okay? > > When I test and faculty don't give me code I go to the web and try to find > examples. I know I also wanted to try to test the GPUs from within the > notebook. I have 2 examples: > > Example 1 uses these modules: > import numpy as np > import xgboost as xgb > from sklearn import datasets > from sklearn.model_selection import train_test_split > from sklearn.datasets import dump_svmlight_file > from sklearn.externals import joblib > from sklearn.metrics import precision_score > > It gives error: cannot load library > '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': > libcudart.so.9.2: cannot open shared object file: No such file or directory > > libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib > > Does this mean I need LD_LIBRARY_PATH set also? Cuda was installed with > typical NVIDIA instructions using their repo. > > Example 2 uses these modules: > import numpy as np > from numba import vectorize > > And gives error: NvvmSupportError: libNVVM cannot be found. Do `conda > install cudatoolkit`: > library nvvm not found > > I don't have conda installed. Will that interfere with pip3? > > Part II - using jupyterhub with regular user gives different error > > I'm assuming this is a python path issue? 
> > File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in
> > __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution
> was not found and is required by the application
>
> Thanks again for any help especially if you can help clear up python
> configuration.
>
> ***
> Lisa Weihl Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116 | Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu
Re: [slurm-users] Major newbie - Slurm/jupyterhub
Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my home directory. That enabled me to find out that it could not find the path for batchspawner-singleuser.

So I added this to jupyter_config.py:

export PATH=/opt/rh/rh-python36/root/bin:$PATH

That seemed to now allow the server to launch for my user that I use for all the configuration work. I get errors (see below) but the notebook loads. The problem is I'm not sure how to kill the job in the Slurm queue or the notebook server if I finish before the job times out and kills it. Logout doesn't seem to do it.

It still doesn't work for a regular user (see below).

I think my problems all have to do with Slurm/jupyterhub finding python. So I have some questions about the best way to set it up for multiple users and make it work for this.

I use the CentOS distribution so that if the university admins ever have to take over, it will match the RedHat setups they use. I know on all Linux distros you need to leave the python 2 system install alone. It looks like as of CentOS 7.7 there is now a python3 in the repository. I didn't go that route because in the past I installed the python from the RedHat Software Collection, which is what I did this time. I don't know if that's the best route for this use case. They also say don't sudo pip3 to try to install global packages, but does that mean sudo to root and then using pip3 is okay?

When I test and faculty don't give me code, I go to the web and try to find examples. I also wanted to try to test the GPUs from within the notebook. I have 2 examples:

Example 1 uses these modules:

import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score

It gives error: cannot load library '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: cannot open shared object file: No such file or directory

libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib

Does this mean I need LD_LIBRARY_PATH set also? CUDA was installed with the typical NVIDIA instructions using their repo.

Example 2 uses these modules:

import numpy as np
from numba import vectorize

And gives error: NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`: library nvvm not found

I don't have conda installed. Will that interfere with pip3?

Part II - using jupyterhub with a regular user gives a different error

I'm assuming this is a python path issue?

File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in <module>
    __import__('pkg_resources').require('batchspawner==1.0.0rc0')

and later

pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution was not found and is required by the application

Thanks again for any help especially if you can help clear up python configuration.
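On the libcudart error above, a hedged sketch of how to see which CUDA runtime the extension actually wants, using only the paths quoted in this message:

# show the exact CUDA runtime soname the extension was linked against
ldd /home/csadmin/.local/lib/python3.6/site-packages/librmm.so | grep -i cuda

# list the runtimes that are actually installed
ls /usr/local/cuda*/targets/x86_64-linux/lib/libcudart.so.*

# if the matching version exists but is simply not on the search path,
# exporting it (for example in the job/spawner environment) makes it visible
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

Note that a library linked against libcudart.so.9.2 will not be satisfied by the 10.2 runtime through LD_LIBRARY_PATH alone; the soname has to match what ldd reports.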
***
Lisa Weihl Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116 | Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu

Message: 1
Date: Tue, 5 May 2020 09:59:01 +0100
From: Guy Coates
To: Slurm User Community List
Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub

Hi Lisa,

Below is my jupyterhub slurm config. It uses the profiles, which allows you to spawn different sized jobs. I found the most useful thing for debugging is to make sure that the --output option is being honoured; any jupyter python errors will end up there, and to to explicitly set the python environment at the start of the script. (The example below uses conda, replace
Re: [slurm-users] how to restrict jobs
Haven’t done it yet myself, but it’s on my todo list. But I’d assume that if you use the FlexLM or RLM parts of that documentation, that Slurm would query the remote license server periodically and hold the job until the necessary licenses were available. > On May 5, 2020, at 8:37 AM, navin srivastava wrote: > > External Email Warning > This email originated from outside the university. Please use caution when > opening attachments, clicking links, or responding to requests. > Thanks Michael, > > yes i have gone through but the licenses are remote license and it will be > used by outside as well not only in slurm. > so basically i am interested to know how we can update the database > dynamically to get the exact value at that point of time. > i mean query the license server and update the database accordingly. does > slurm automatically updated the value based on usage? > > > Regards > Navin. > > > On Tue, May 5, 2020 at 7:00 PM Renfro, Michael wrote: > Have you seen https://slurm.schedmd.com/licenses.html already? If the > software is just for use inside the cluster, one Licenses= line in slurm.conf > plus users submitting with the -L flag should suffice. Should be able to set > that license value is 4 if it’s licensed per node and you can run up to 4 > jobs simultaneously, or 4*NCPUS if it’s licensed per CPU, or 1 if it’s a > single license good for one run from 1-4 nodes. > > There are also options to query a FlexLM or RLM server for license management. > > -- > Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services > 931 372-3601 / Tennessee Tech University > > > On May 5, 2020, at 7:54 AM, navin srivastava wrote: > > > > Hi Team, > > > > we have an application whose licenses is limited .it scales upto 4 > > nodes(~80 cores). > > so if 4 nodes are full, in 5th node job used to get fail. > > we want to put a restriction so that the application can't go for the > > execution beyond the 4 nodes and fail it should be in queue state. > > i do not want to keep a separate partition to achieve this config.is there > > a way to achieve this scenario using some dynamic resource which can call > > the license variable on the fly and if it is reached it should keep the job > > in queue. > > > > Regards > > Navin. > > > > > > >
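For the remote-server case, a hedged sketch of the "remote licenses" flow from that page; every name and count below is made up for illustration, and the exact fields should be checked against licenses.html:

# define the license as a resource in the Slurm database (requires slurmdbd)
sacctmgr add resource name=appname server=licserver1 servertype=flexlm count=80 type=license percentallowed=100 cluster=mycluster

# jobs then request tokens from it at submit time
sbatch -L appname@licserver1:4 job.sh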
Re: [slurm-users] how to restrict jobs
Thanks Michael,

Yes, I have gone through it, but the licenses are remote licenses and they will be used outside as well, not only in Slurm. So basically I am interested to know how we can update the database dynamically to get the exact value at that point of time, I mean query the license server and update the database accordingly. Does Slurm automatically update the value based on usage?

Regards
Navin.

On Tue, May 5, 2020 at 7:00 PM Renfro, Michael wrote:

> Have you seen https://slurm.schedmd.com/licenses.html already? If the software is just for use inside the cluster, one Licenses= line in slurm.conf plus users submitting with the -L flag should suffice. Should be able to set that license value is 4 if it's licensed per node and you can run up to 4 jobs simultaneously, or 4*NCPUS if it's licensed per CPU, or 1 if it's a single license good for one run from 1-4 nodes.
>
> There are also options to query a FlexLM or RLM server for license management.
>
> --
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
> 931 372-3601 / Tennessee Tech University
>
> > On May 5, 2020, at 7:54 AM, navin srivastava wrote:
> >
> > Hi Team,
> >
> > we have an application whose licenses is limited .it scales upto 4 nodes(~80 cores).
> > so if 4 nodes are full, in 5th node job used to get fail.
> > we want to put a restriction so that the application can't go for the execution beyond the 4 nodes and fail it should be in queue state.
> > i do not want to keep a separate partition to achieve this config.is there a way to achieve this scenario using some dynamic resource which can call the license variable on the fly and if it is reached it should keep the job in queue.
> >
> > Regards
> > Navin.
Re: [slurm-users] how to restrict jobs
Have you seen https://slurm.schedmd.com/licenses.html already? If the software is just for use inside the cluster, one Licenses= line in slurm.conf plus users submitting with the -L flag should suffice. You should be able to set that license value to 4 if it's licensed per node and you can run up to 4 jobs simultaneously, or 4*NCPUS if it's licensed per CPU, or 1 if it's a single license good for one run from 1-4 nodes.

There are also options to query a FlexLM or RLM server for license management.

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 5, 2020, at 7:54 AM, navin srivastava wrote:
>
> Hi Team,
>
> we have an application whose licenses is limited .it scales upto 4 nodes(~80 cores).
> so if 4 nodes are full, in 5th node job used to get fail.
> we want to put a restriction so that the application can't go for the execution beyond the 4 nodes and fail it should be in queue state.
> i do not want to keep a separate partition to achieve this config.is there a way to achieve this scenario using some dynamic resource which can call the license variable on the fly and if it is reached it should keep the job in queue.
>
> Regards
> Navin.
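As a concrete illustration of the local-license case described above (the application name and counts are placeholders; licenses.html has the authoritative details):

# slurm.conf: define 4 cluster-wide license tokens for the application
Licenses=appname:4

# each job requests the tokens it needs and queues until they are free,
# instead of failing once the first 4 nodes are busy
sbatch -L appname:1 job.sh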
[slurm-users] how to restrict jobs
Hi Team,

we have an application whose licenses are limited. It scales up to 4 nodes (~80 cores), so if 4 nodes are full, a job on a 5th node used to fail. We want to put a restriction so that the application can't go for execution beyond the 4 nodes and fail; instead it should stay in the queue. I do not want to keep a separate partition to achieve this config. Is there a way to achieve this scenario using some dynamic resource which can call the license variable on the fly and, if the limit is reached, keep the job in the queue?

Regards
Navin.
Re: [slurm-users] Major newbie - Slurm/jupyterhub
Hi,

Please also post the stdout/stderr of the job 7117.

What I don't see in your config, and I do have there, is:

c.SlurmSpawner.hub_connect_ip = '192.168.1.1' # the IP where the slurm job will try to connect to jupyterhub

Also check if port 8081 is reachable from compute nodes.

--
josef

On 05. 05. 20 2:24, Lisa Kay Weihl wrote:
..

--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669
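A hedged sketch of that check, using the address and port from Josef's message; the node name is a placeholder, and nc is provided by the nmap-ncat package on CentOS:

# in jupyterhub_config.py (value from Josef's example)
c.SlurmSpawner.hub_connect_ip = '192.168.1.1'

# from a compute node (here via srun), verify the hub API port answers
srun -w <nodename> nc -zv 192.168.1.1 8081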
Re: [slurm-users] Major newbie - Slurm/jupyterhub
Hi Lisa,

Below is my jupyterhub slurm config. It uses the profiles, which allows you to spawn different sized jobs. I found the most useful thing for debugging is to make sure that the --output option is being honoured (any jupyter python errors will end up there), and to explicitly set the python environment at the start of the script. (The example below uses conda; replace with whatever makes sense in your environment.)

Hope that helps,

Guy

#Extend timeouts to deal with slow job launch
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.start_timeout=120
c.Spawner.term_timeout=20
c.Spawner.http_timeout = 120

# Set up the various sizes of job
c.ProfilesSpawner.profiles = [
    ("Local server: (Run on local machine)", "local",
     "jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
    ("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
    ("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
    ("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M")),
    ("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1", "batchspawner.SlurmSpawner",
     dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
]

#Configure the batch job. Make sure --output is set and explicitly set up
#the jupyterhub python environment
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=L
#SBATCH {options}
trap 'echo SIGTERM received' TERM
. /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate /usr/local/jupyterhub/jupyterhub
which jupyterhub-singleuser
{cmd}
echo "jupyterhub-singleuser ended gracefully"
"""

On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl wrote:

> I have a single server with 2 cpu, 384gb memory and 4 gpu (GeForce RTX
> 2080 Ti).
>
> Use is to be for GPU ML computing and python based data science.
>
> One faculty wants jupyter notebooks, other faculty member is used to using
> CUDA for GPU but has only done it on a workstation in his lab with a GUI.
> New faculty member coming in has used nvidia-docker container for GPU (I
> think on a large cluster, we are just getting started)
>
> I'm charged with making all this work and hopefully all at once. Right now
> I'll take one thing working.
>
> So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7 (SE
> Linux enabled). I posted once before about having trouble getting that
> combination correct and I finally worked that out. Most of the tests in the
> test suite seem to run okay. I'm trying to start with very basic Slurm
> configuration so I haven't enabled accounting.
>
> *For reference here is my slurm.conf*
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmctldHost=cs-host
>
> #authentication
> AuthType=auth/munge
> CacheGroups = 0
> CryptoType=crypto/munge
>
> #Add GPU support
> GresTypes=gpu
>
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
>
> #service
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> SlurmdTimeout=1800
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=3-0
> PriorityMaxAge=7-0
> PriorityFavorSmall=YES
> PriorityWeightAge=1000
> PriorityWeightFairshare=0
> PriorityWeightJobSize=125
> PriorityWeightPartition=1000
> PriorityWeightQOS=0
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cs-host
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=info
> SlurmdLogFile=/var/log/slurmd.log
> #
> # COMPUTE NODES
> NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
>
> #PARTITIONS
> PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> PartitionName=faculty Priority=10 Default=YES
>
> I have jupyterhub running as part of RedHat SCL. It