Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-05 Thread Chris Samuel
On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:

> Since this happens on a fresh new database, I just don't understand how I
> can get back to a basic functional state.  This is exceedingly frustrating.

I have to say that if you're seeing this with 17.11, 18.08 and 19.05, and it 
only started when your colleague upgraded MySQL, then it sounds like MySQL is 
triggering this problem.

We're running with MariaDB 10.x (from SLES15) without issues (our database is 
huge).

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Job Step Resource Requests are Ignored

2020-05-05 Thread Chris Samuel
On Tuesday, 5 May 2020 4:47:12 PM PDT Maria Semple wrote:

> I'd like to set different resource limits for different steps of my job. A
> sample script might look like this (e.g. job.sh):
> 
> #!/bin/bash
> srun --cpus-per-task=1 --mem=1 echo "Starting..."
> srun --cpus-per-task=4 --mem=250 --exclusive 
> srun --cpus-per-task=1 --mem=1 echo "Finished."
> 
> Then I would run the script from the command line using the following
> command: sbatch --ntasks=1 job.sh.

You shouldn't ask for more resources with "srun" than have been allocated with 
"sbatch" - so if you want the job to be able to use up to 4 cores at once & 
that amount of memory you'll need to use:

sbatch -c 4 --mem=250 --ntasks=1 job.sh

I'd also suggest using suffixes for memory to disambiguate the values.
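
For example, a sketch of the whole thing with explicit units (I'm assuming the
250 above was meant as megabytes, and ./heavy_step is just a stand-in for your
real command):

#!/bin/bash
srun --cpus-per-task=1 --mem=1M echo "Starting..."
srun --cpus-per-task=4 --mem=250M --exclusive ./heavy_step
srun --cpus-per-task=1 --mem=1M echo "Finished."

submitted with an allocation at least as large as the largest step:

sbatch --ntasks=1 -c 4 --mem=250M job.sh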

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-05 Thread Chris Samuel
On Tuesday, 5 May 2020 3:48:22 PM PDT Sean Crosby wrote:

> sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Also don't forget you need to tell Slurm to enforce QOS limits with:

AccountingStorageEnforce=safe,qos

in your Slurm configuration ("safe" is good to set, and turns on enforcement of 
other restrictions around associations too).  See:

https://slurm.schedmd.com/resource_limits.html

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] Job Step Resource Requests are Ignored

2020-05-05 Thread Maria Semple
Hi!

I'd like to set different resource limits for different steps of my job. A
sample script might look like this (e.g. job.sh):

#!/bin/bash
srun --cpus-per-task=1 --mem=1 echo "Starting..."
srun --cpus-per-task=4 --mem=250 --exclusive 
srun --cpus-per-task=1 --mem=1 echo "Finished."

Then I would run the script from the command line using the following
command: sbatch --ntasks=1 job.sh. I have observed that while none of the
steps appear to have limited memory (which I'm pretty sure has to do with
my proctrack plugin type), the second step runs, and scontrol show step
.1 shows the step as having been allocated 4 CPUs, but in reality the step
is only able to use 1.

I have also observed the opposite. Running the following command, I can see
that the job step is able to use all CPUs allocated to the job, rather than
the one it was allocated itself:

sbatch --ntasks=1 --cpus-per-task=2 << EOF
#!/bin/bash
srun --cpus-per-task=1 
EOF

My goal here is to be able to run a single job with 3 steps where the first
and last step are always executed, even if the second would not be run
because too many resources were requested.
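
In shell terms, what I'm picturing is roughly the following (./heavy_step is
just a stand-in for the real middle step):

#!/bin/bash
srun --cpus-per-task=1 --mem=1 echo "Starting..."
# this step may be rejected if it asks for more than the job was allocated
srun --cpus-per-task=4 --mem=250 --exclusive ./heavy_step
# this still runs afterwards, since the script doesn't use 'set -e'
srun --cpus-per-task=1 --mem=1 echo "Finished."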

Here is my slurm.conf, with commented out lines removed (this is just a
small test cluster with a single node on the same machine as the
controller):

SlurmctldHost=ubuntu
CredType=cred/munge
AuthType=auth/munge
EnforcePartLimits=ALL
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/spool/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/spool/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=3600
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompHost=localhost
JobCompLoc=slurm_db
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/spool/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/spool/slurm/slurmd/slurmd.log
NodeName=ubuntu CPUs=4 RealMemory=500 State=UNKNOWN
PartitionName=main Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP
AllowGroups=maria

Any advice would be greatly appreciated! Thanks in advance!

-- 
Thanks,
Maria


Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Lisa Kay Weihl
Hi Michael,

I get the gist of everything you mentioned but now I feel even more 
overwhelmed. Can I not get jupyterhub up and running without all those modules 
pieces? I was hoping to have a base kernel in jupyterhub that contained a lot 
of the data science packages. I'm going to have some developers that are very 
adept and others that are just going to want to drop in code and have it run 
because it ran on their local machine.

In an attempt to decouple jupyterhub from the environments I tried to follow 
the instructions on this page: 
https://jupyterhub.readthedocs.io/en/stable/installation-guide-hard.html

These were their steps I'm trying to follow:

  *   We will create an installation of JupyterHub and JupyterLab using a 
virtualenv under /opt using the system Python.

  *   We will install conda globally.

  *   We will create a shared conda environment which can be used (but not 
modified) by all users.

  *   We will show how users can create their own private conda environments, 
where they can install whatever they like.


That's for Ubuntu but I was able to make it work as far as getting jupyterhub 
up and running. I used the system python3.6 that seemed to be there from the 
epel repository (AdvancedHPC preinstalled CentOS for us).

I struggled with part 2 when I tried to use conda to install a base 
environment.  I tried to just use the conda from the python36 install and make a 
directory in /opt/conda/envs like they mentioned. I added some packages and then 
tried to link the environment to jupyterhub like they say, but it doesn't pick 
it up. It's still only seeing the default environment, i.e. what's in the 
virtualenv in /opt/jupyterhub.  If I add a package with 
/opt/jupyterhub/bin/pip install numpy, that works and I'll see it in my notebook.

I don't want to end up with a bunch of pythons floating all over the place and 
messing everything up. It seems like not many set this up on CentOS?

It seems the virtualenv part worked successfully using the system-installed 
python36, but should I be doing something different for conda?

I was trying to avoid each user installing a full Anaconda because that will 
eat up space fast.

If I want miniconda on CentOS does anyone have a set of installation 
instructions they recommend?
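
The plain-installer route I had in mind is roughly the following, with
/opt/miniconda3 just as an example prefix, though I don't know whether that
plays nicely with the rest of this setup:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda3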

***

Lisa Weihl Systems Administrator

Computer Science, Bowling Green State University
Tel: (419) 372-0116   |Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu​



Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-05 Thread Sean Crosby
Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
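
If the QOS doesn't exist yet, the full sequence is roughly (I'm writing these
from memory, so check the field names against the sacctmgr man page):

sacctmgr add qos gpujobs
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
sacctmgr show qos gpujobs format=Name,MaxTRESPU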

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Wed, 6 May 2020 at 04:53, Theis, Thomas 
wrote:

> *UoM notice: External email. Be cautious of links, attachments, or
> impersonation attempts.*
> --
>
> Hey Killian,
>
>
>
> I tried to limit the number of gpus a user can run on at a time by adding
> MaxTRESPerUser = gres:gpu4 to both the user and the qos.. I restarted slurm
> control daemon and unfortunately I am still able to run on all the gpus in
> the partition. Any other ideas?
>
>
>
> *Thomas Theis*
>
>
>
> *From:* slurm-users  *On Behalf Of
> *Killian Murphy
> *Sent:* Thursday, April 23, 2020 1:33 PM
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] Limit the number of GPUS per user per
> partition
>
>
>
> External Email
>
> Hi Thomas.
>
>
>
> We limit the maximum number of GPUs a user can have allocated in a
> partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is
> set as the partition QoS on our GPU partition. I.E:
>
>
>
> We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit
> total number of allocated GPUs to 4, and set the GPU partition QoS to the
> `gpujobs` QoS.
>
>
>
> There is a section in the Slurm documentation on the 'Resource Limits'
> page entitled 'QOS specific limits supported (
> https://slurm.schedmd.com/resource_limits.html) that details some care
> needed when using this kind of limit setting with typed GRES. Although it
> seems like you are trying to do something with generic GRES, it's worth a
> read!
>
>
>
> Killian
>
>
>
>
>
>
>
> On Thu, 23 Apr 2020 at 18:19, Theis, Thomas 
> wrote:
>
> Hi everyone,
>
> First message, I am trying find a good way or multiple ways to limit the
> usage of jobs per node or use of gpus per node, without blocking a user
> from submitting them.
>
>
>
> Example. We have 10 nodes each with 4 gpus in a partition. We allow a team
> of 6 people to submit jobs to any or all of the nodes. One job per gpu;
> thus we can hold a total of 40 jobs concurrently in the partition.
>
> At the moment: each user usually submit 50- 100 jobs at once. Taking up
> all gpus, and all other users have to wait in pending..
>
>
>
> What I am trying to setup is allow all users to submit as many jobs as
> they wish but only run on 1 out of the 4 gpus per node, or some number out
> of the total 40 gpus across the entire partition. Using slurm 18.08.3..
>
>
>
> This is roughly our slurm scripts.
>
>
>
> #SBATCH --job-name=Name # Job name
>
> #SBATCH --mem=5gb # Job memory request
>
> #SBATCH --ntasks=1
>
> #SBATCH --gres=gpu:1
>
> #SBATCH --partition=PART1
>
> #SBATCH --time=200:00:00   # Time limit hrs:min:sec
>
> #SBATCH --output=job _%j.log # Standard output and error log
>
> #SBATCH --nodes=1
>
> #SBATCH --qos=high
>
>
>
> srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c
> "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e
> SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name
> $SLURM_JOB_ID do_job.sh"
>
>
>
> *Thomas Theis*
>
>
>
>
>
>
> --
>
> Killian Murphy
>
> Research Software Engineer
>
>
>
> Wolfson Atmospheric Chemistry Laboratories
> University of York
> Heslington
> York
> YO10 5DD
> +44 (0)1904 32 4753
>
> e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
>


[slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-05 Thread Dustin Lang
Hi,

I've just upgraded to slurm 19.05.5.

With either my old database, OR creating an entirely new database, I am
unable to create a new 'cluster' entry in the database -- slurmdbd is
segfaulting!

# sacctmgr add cluster test3
 Adding Cluster(s)
  Name   = test3
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open
persistent connection to mn001:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type:
DBD_ADD_CLUSTERS
 Problem adding clusters: Unspecified error
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused

Meanwhile, running "slurmdbd -D -v -v -v -v -v", I see

[2020-05-05T18:17:19.503] debug4: 10(as_mysql_cluster.c:405) query
insert into txn_table (timestamp, action, name, actor, info) values
(1588717037, 1405, 'test3', 'root', 'mod_time=1588717037, shares=1,
grp_jobs=NULL, grp_jobs_accrue=NULL, grp_submit_jobs=NULL, grp_wall=NULL,
max_jobs=NULL, max_jobs_accrue=NULL, min_prio_thresh=NULL,
max_submit_jobs=NULL, max_wall_pj=NULL, priority=NULL, def_qos_id=NULL,
qos=\',1,\', federation=\'\', fed_id=0, fed_state=0, features=\'\'');
slurmdbd: debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
slurmdbd: debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
Segmentation fault (core dumped)


Since this happens on a fresh new database, I just don't understand how I
can get back to a basic functional state.  This is exceedingly frustrating.

Thanks for any hints.

--dustin


Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
I tried upgrading Slurm to 18.08.9 and I am still getting this Segmentation
Fault!
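
In case it's useful, this is roughly how I'm capturing the crash (standard gdb
usage, nothing Slurm-specific):

gdb --args slurmdbd -D -v -v -v -v -v
(gdb) run
  ... wait for the segfault, then:
(gdb) bt full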



On Tue, May 5, 2020 at 2:39 PM Dustin Lang  wrote:

> Hi,
>
> Apparently my colleague upgraded the mysql client and server, but, as far
> as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql
> release notes I  don't see anything that looks suspicious there...
>
> cheers,
> --dustin
>
>
> On Tue, May 5, 2020 at 1:37 PM Dustin Lang  wrote:
>
>> Hi,
>>
>> We're running Slurm 17.11.12.  Everything has been working fine, and then
>> suddenly slurmctld is crashing and slurmdbd is crashing.
>>
>> We use fair-share as part of the queuing policy, and previously set up
>> accounts with sacctmgr; that has been working fine for months.
>>
>> If I run slurmdbd in debug mode,
>>
>>  slurmdbd -D -v -v -v -v -v
>>
>> it eventually (after being contacted by slurmctld) segfaults with:
>>
>> ...
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null)
>> TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null)
>> TIME:1588695584
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_TRES: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_QOS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_USERS: called
>> slurmdbd: debug4: got 0 commits
>> slurmdbd: debug2: DBD_GET_ASSOCS: called
>> slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
>> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select
>> @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos,
>> @delta_qos;
>> Segmentation fault (core dumped)
>>
>>
>> It looks (running slurmdbd in gdb) like that segfault is coming from
>>
>>
>> https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073
>>
>> and If I connect to the mysql database directly and call that stored
>> procedure, I get
>>
>> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
>>
>> +-+-+-+--+---+-+-+-+-+--+-+-+
>> | @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj
>> := max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos :=
>> REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj,
>> if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn :=
>> CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn)
>> | @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',',
>> ''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' &&
>> max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new :=
>> parent_acct |
>>
>> +-+-+-+--+---+-+-+-+-+--+-+-+
>> |   1 |NULL |NULL |
>>   NULL |  NULL | ,1, | NULL
>>| NULL
>>  | NULL
>>| NULL
>>
>>   | NULL
>>  | |
>>
>> +-+-+-+--+---+-+-+-+-+--+-+-+
>>
>> and if I run
>>
>> mysql> call 

Re: [slurm-users] Limit the number of GPUS per user per partition

2020-05-05 Thread Theis, Thomas
Hey Killian,

I tried to limit the number of gpus a user can run on at a time by adding 
MaxTRESPerUser = gres:gpu4 to both the user and the qos. I restarted the slurm 
control daemon and unfortunately I am still able to run on all the gpus in the 
partition. Any other ideas?

Thomas Theis

From: slurm-users  On Behalf Of Killian 
Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition

External Email
Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a partition 
through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the 
partition QoS on our GPU partition. I.E:

We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit total 
number of allocated GPUs to 4, and set the GPU partition QoS to the `gpujobs` 
QoS.

There is a section in the Slurm documentation on the 'Resource Limits' page 
(https://slurm.schedmd.com/resource_limits.html) entitled 'QOS specific limits 
supported' that details some care needed when using this kind of limit setting 
with typed GRES. Although it seems like you are trying to do something with 
generic GRES, it's worth a read!
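
For concreteness, the slurm.conf side of it is just the QOS= option on the
partition line, something like this (the node list is made up):

PartitionName=gpu Nodes=gpu[01-10] QOS=gpujobs MaxTime=INFINITE State=UP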

Killian



On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <thomas.th...@teledyne.com> wrote:
Hi everyone,
First message: I am trying to find a good way, or multiple ways, to limit the 
usage of jobs per node or the use of gpus per node, without blocking a user from 
submitting them.

Example. We have 10 nodes each with 4 gpus in a partition. We allow a team of 6 
people to submit jobs to any or all of the nodes. One job per gpu; thus we can 
hold a total of 40 jobs concurrently in the partition.
At the moment each user usually submits 50-100 jobs at once, taking up all 
gpus, and all other users have to wait in pending.

What I am trying to set up is to allow all users to submit as many jobs as they 
wish but only run on 1 out of the 4 gpus per node, or some number out of the 
total 40 gpus across the entire partition. We are using Slurm 18.08.3.

This is roughly our slurm scripts.

#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00   # Time limit hrs:min:sec
#SBATCH --output=job _%j.log # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high

srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS 
nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e 
SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"

Thomas Theis



--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm


Re: [slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
Hi,

Apparently my colleague upgraded the mysql client and server, but, as far
as I can tell, this was only 5.7.29 to 5.7.30, and checking the mysql
release notes I  don't see anything that looks suspicious there...

cheers,
--dustin


On Tue, May 5, 2020 at 1:37 PM Dustin Lang  wrote:

> Hi,
>
> We're running Slurm 17.11.12.  Everything has been working fine, and then
> suddenly slurmctld is crashing and slurmdbd is crashing.
>
> We use fair-share as part of the queuing policy, and previously set up
> accounts with sacctmgr; that has been working fine for months.
>
> If I run slurmdbd in debug mode,
>
>  slurmdbd -D -v -v -v -v -v
>
> it eventually (after being contacted by slurmctld) segfaults with:
>
> ...
> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null)
> TIME:1588695584
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null)
> TIME:1588695584
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: DBD_GET_TRES: called
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: DBD_GET_QOS: called
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: DBD_GET_USERS: called
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: DBD_GET_ASSOCS: called
> slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select
> @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos,
> @delta_qos;
> Segmentation fault (core dumped)
>
>
> It looks (running slurmdbd in gdb) like that segfault is coming from
>
>
> https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073
>
> and If I connect to the mysql database directly and call that stored
> procedure, I get
>
> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
>
> +-+-+-+--+---+-+-+-+-+--+-+-+
> | @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj
> := max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos :=
> REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj,
> if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn :=
> CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn)
> | @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',',
> ''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' &&
> max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new :=
> parent_acct |
>
> +-+-+-+--+---+-+-+-+-+--+-+-+
> |   1 |NULL |NULL |
>   NULL |  NULL | ,1, | NULL
>| NULL
>  | NULL
>| NULL
>
>   | NULL
>  | |
>
> +-+-+-+--+---+-+-+-+-+--+-+-+
>
> and if I run
>
> mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
> select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
> @qos, @delta_qos;
>
> I get
>
>
> 

[slurm-users] slurmdbd crashes with segmentation fault following DBD_GET_ASSOCS

2020-05-05 Thread Dustin Lang
Hi,

We're running Slurm 17.11.12.  Everything has been working fine, and then
suddenly slurmctld is crashing and slurmdbd is crashing.

We use fair-share as part of the queuing policy, and previously set up
accounts with sacctmgr; that has been working fine for months.

If I run slurmdbd in debug mode,

 slurmdbd -D -v -v -v -v -v

it eventually (after being contacted by slurmctld) segfaults with:

...
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn049 STATE:UP REASON:(null)
TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_NODE_STATE: NODE:cn050 STATE:UP REASON:(null)
TIME:1588695584
slurmdbd: debug4: got 0 commits
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_TRES: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_QOS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_USERS: called
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: DBD_GET_ASSOCS: called
slurmdbd: debug4: 10(as_mysql_assoc.c:2033) query
call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0); select
@par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos,
@delta_qos;
Segmentation fault (core dumped)


It looks (running slurmdbd in gdb) like that segfault is coming from

https://github.com/SchedMD/slurm/blob/slurm-17-11-12-1/src/plugins/accounting_storage/mysql/as_mysql_assoc.c#L2073

and if I connect to the mysql database directly and call that stored
procedure, I get

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
+-+-+-+--+---+-+-+-+-+--+-+-+
| @par_id := id_assoc | @mj := max_jobs | @msj := max_submit_jobs | @mwpj
:= max_wall_pj | @def_qos_id := def_qos_id | @qos := qos | @delta_qos :=
REPLACE(CONCAT(delta_qos, @delta_qos), ',,', ',') | @mtpj := CONCAT(@mtpj,
if (@mtpj != '' && max_tres_pj != '', ',', ''), max_tres_pj) | @mtpn :=
CONCAT(@mtpn, if (@mtpn != '' && max_tres_pn != '', ',', ''), max_tres_pn)
| @mtmpj := CONCAT(@mtmpj, if (@mtmpj != '' && max_tres_mins_pj != '', ',',
''), max_tres_mins_pj) | @mtrm := CONCAT(@mtrm, if (@mtrm != '' &&
max_tres_run_mins != '', ',', ''), max_tres_run_mins) | @my_acct_new :=
parent_acct |
+-+-+-+--+---+-+-+-+-+--+-+-+
|   1 |NULL |NULL |
NULL |  NULL | ,1, | NULL
 | NULL
   | NULL
 | NULL

| NULL
   | |
+-+-+-+--+---+-+-+-+-+--+-+-+

and if I run

mysql> call get_parent_limits('assoc_table', 'root', 'slurm_cluster', 0);
select @par_id, @mj, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos;

I get

+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj  | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|       1 | NULL | NULL |  NULL | NULL  | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Renfro, Michael
Aside from any Slurm configuration, I’d recommend setting up a modules [1 or 2] 
folder structure for CUDA and other third-party software. That handles 
LD_LIBRARY_PATH and other similar variables, reduces the chances for library 
conflicts, and lets users decide their environment on a per-job basis. Ours 
includes a basic Miniconda installation, and the users can make their own 
environments from there [3]. I very rarely install a system-wide Python module.

[1] http://modules.sourceforge.net
[2] https://lmod.readthedocs.io/
[3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook
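
As a rough sketch of what that looks like from the user side (the module and
environment names below are made up, and it assumes the miniconda module makes
the 'conda' shell function available), a job script ends up along the lines of:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
module load cuda/10.2      # sets PATH/LD_LIBRARY_PATH for this job only
module load miniconda3
conda activate myproject   # a user-created environment
python train.py            # placeholder for the user's actual work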

> On May 5, 2020, at 9:37 AM, Lisa Kay Weihl  wrote:
> 
> Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my 
> home directory.  That enabled me to find out that it could not find the path 
> for batchspawner-singleuser. 
> 
> 
> So I added this to jupyter_config.py
> export PATH=/opt/rh/rh-python36/root/bin:$PATH
> 
> 
> That seemed to now allow the server to launch for my user that I use for all 
> the configuration work. I get errors (see below) but the notebook loads. The 
> problem is I'm not sure how to kill the job in the Slurm queue or the 
> notebook server if I finish before the job times out and kills it. Logout 
> doesn't seem to do it.
> 
> It still doesn't work for a regular user (see below)
> 
> I think my problems all have to do with Slurm/jupyterhub finding python. So I 
> have some questions about the best way to set it up for multiple users and 
> make it work for this.
> 
> I use CentOS distribution so that if the university admins will ever have to 
> take over it will match their RedHat setups they use. I know on all Linux 
> distros you need to leave the python 2 system install alone. It looks like as 
> of CentOS 7.7 there is now a python3 in the repository. I didn't go that 
> route because in the past I installed the python from RedHat Software 
> Collection which is what I did this time.
> I don't know if that's the best route for this use case. They also say don't 
> sudo pip3 to try to install global packages but does that mean sudo to root 
> and then using pip3 is okay?
> 
> When I test and faculty don't give me code I go to the web and try to find 
> examples. I know I also wanted to try to test the GPUs from within the 
> notebook. I have 2 examples:
> 
> Example 1 uses these modules:
> import numpy as np
> import xgboost as xgb
> from sklearn import datasets
> from sklearn.model_selection import train_test_split
> from sklearn.datasets import dump_svmlight_file
> from sklearn.externals import joblib
> from sklearn.metrics import precision_score
> 
> It gives error: cannot load library 
> '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': 
> libcudart.so.9.2: cannot open shared object file: No such file or directory
> 
> libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib
> 
> Does this mean I need LD_LIBRARY_PATH  set also? Cuda was installed with 
> typical NVIDIA instructions using their repo.
> 
> Example 2 uses these modules:
> import numpy as np
> from numba import vectorize
> 
> And gives error:  NvvmSupportError: libNVVM cannot be found. Do `conda 
> install cudatoolkit`:
> library nvvm not found
> 
> I don't have conda installed. Will that interfere with pip3?
> 
> Part II - using jupyterhub with regular user gives different error
> 
> I'm assuming this is a python path issue?
> 
>  File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in 
> 
> __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution 
> was not found and is required by the application
> 
> Thanks again for any help especially if you can help clear up python 
> configuration.
> 
> 
> ***
> Lisa Weihl Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116   |Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu​
> 

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Lisa Kay Weihl
Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my home 
directory.  That enabled me to find out that it could not find the path for 
batchspawner-singleuser.


So I added this to jupyter_config.py

export PATH=/opt/rh/rh-python36/root/bin:$PATH


That seemed to now allow the server to launch for my user that I use for all 
the configuration work. I get errors (see below) but the notebook loads. The 
problem is I'm not sure how to kill the job in the Slurm queue or the notebook 
server if I finish before the job times out and kills it. Logout doesn't seem 
to do it.
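
Presumably I can just find and cancel it by hand with something like:

squeue -u $USER
scancel <jobid>

but I don't know if that's the intended way to shut the notebook server down.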

It still doesn't work for a regular user (see below)

I think my problems all have to do with Slurm/jupyterhub finding python. So I 
have some questions about the best way to set it up for multiple users and make 
it work for this.

I use the CentOS distribution so that if the university admins ever have to 
take over, it will match the RedHat setups they use. I know that on all Linux 
distros you need to leave the Python 2 system install alone. It looks like as 
of CentOS 7.7 there is now a python3 in the repository. I didn't go that route 
because in the past I installed Python from the RedHat Software Collections, 
which is what I did again this time.
I don't know if that's the best route for this use case. They also say don't 
use sudo pip3 to install global packages, but does that mean sudo to root and 
then using pip3 is okay?

When I test and faculty don't give me code I go to the web and try to find 
examples. I know I also wanted to try to test the GPUs from within the 
notebook. I have 2 examples:

Example 1 uses these modules:
import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score

It gives error: cannot load library 
'/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: 
cannot open shared object file: No such file or directory

libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib

Does this mean I need LD_LIBRARY_PATH set as well? CUDA was installed with the 
typical NVIDIA instructions using their repo.

Example 2 uses these modules:
import numpy as np
from numba import vectorize

And gives error:  NvvmSupportError: libNVVM cannot be found. Do `conda install 
cudatoolkit`:
library nvvm not found

I don't have conda installed. Will that interfere with pip3?

Part II - using jupyterhub with regular user gives different error

I'm assuming this is a python path issue?


 File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in 


__import__('pkg_resources').require('batchspawner==1.0.0rc0')

and later

pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution 
was not found and is required by the application

Thanks again for any help especially if you can help clear up python 
configuration.


***

Lisa Weihl Systems Administrator

Computer Science, Bowling Green State University
Tel: (419) 372-0116   |Fax: (419) 372-8061
lwe...@bgsu.edu
www.bgsu.edu​



Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Haven’t done it yet myself, but it’s on my todo list.

But I’d assume that if you use the FlexLM or RLM parts of that documentation, 
Slurm would query the remote license server periodically and hold the job 
until the necessary licenses were available.

> On May 5, 2020, at 8:37 AM, navin srivastava  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Thanks Michael,
> 
> yes i have gone through but the licenses are remote license and it will be 
> used by outside as well not only in slurm.
> so basically i am interested to know how we can update the database 
> dynamically to get the exact value at that point of time.
> i mean query the license server and update the database accordingly. does 
> slurm automatically updated the value based on usage?
> 
> 
> Regards
> Navin.
> 
> 
> On Tue, May 5, 2020 at 7:00 PM Renfro, Michael  wrote:
> Have you seen https://slurm.schedmd.com/licenses.html already? If the 
> software is just for use inside the cluster, one Licenses= line in slurm.conf 
> plus users submitting with the -L flag should suffice. Should be able to set 
> that license value is 4 if it’s licensed per node and you can run up to 4 
> jobs simultaneously, or 4*NCPUS if it’s licensed per CPU, or 1 if it’s a 
> single license good for one run from 1-4 nodes.
> 
> There are also options to query a FlexLM or RLM server for license management.
> 
> -- 
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
> 931 372-3601 / Tennessee Tech University
> 
> > On May 5, 2020, at 7:54 AM, navin srivastava  wrote:
> > 
> > Hi Team,
> > 
> > we have an application whose licenses is limited .it scales upto 4 
> > nodes(~80 cores).
> > so if 4 nodes are full, in 5th node job used to get fail.
> > we want to put a restriction so that the application can't go for the 
> > execution beyond the 4 nodes and fail it should be in queue state.
> > i do not want to keep a separate partition to achieve this config.is there 
> > a way to achieve this scenario using some dynamic resource which can call 
> > the license variable on the fly and if it is reached it should keep the job 
> > in queue.
> > 
> > Regards
> > Navin.
> > 
> > 
> > 
> 



Re: [slurm-users] how to restrict jobs

2020-05-05 Thread navin srivastava
Thanks Michael,

Yes, I have gone through it, but the licenses are remote licenses and they will
be used outside of Slurm as well, not only within Slurm.
So basically I am interested to know how we can update the database
dynamically to get the exact value at that point in time,
i.e. query the license server and update the database accordingly. Does
Slurm automatically update the value based on usage?


Regards
Navin.


On Tue, May 5, 2020 at 7:00 PM Renfro, Michael  wrote:

> Have you seen https://slurm.schedmd.com/licenses.html already? If the
> software is just for use inside the cluster, one Licenses= line in
> slurm.conf plus users submitting with the -L flag should suffice. Should be
> able to set that license value is 4 if it’s licensed per node and you can
> run up to 4 jobs simultaneously, or 4*NCPUS if it’s licensed per CPU, or 1
> if it’s a single license good for one run from 1-4 nodes.
>
> There are also options to query a FlexLM or RLM server for license
> management.
>
> --
> Mike Renfro, PhD / HPC Systems Administrator, Information Technology
> Services
> 931 372-3601 / Tennessee Tech University
>
> > On May 5, 2020, at 7:54 AM, navin srivastava 
> wrote:
> >
> > Hi Team,
> >
> > we have an application whose licenses is limited .it scales upto 4
> nodes(~80 cores).
> > so if 4 nodes are full, in 5th node job used to get fail.
> > we want to put a restriction so that the application can't go for the
> execution beyond the 4 nodes and fail it should be in queue state.
> > i do not want to keep a separate partition to achieve this config.is
> there a way to achieve this scenario using some dynamic resource which can
> call the license variable on the fly and if it is reached it should keep
> the job in queue.
> >
> > Regards
> > Navin.
> >
> >
> >
>
>


Re: [slurm-users] how to restrict jobs

2020-05-05 Thread Renfro, Michael
Have you seen https://slurm.schedmd.com/licenses.html already? If the software 
is just for use inside the cluster, one Licenses= line in slurm.conf plus users 
submitting with the -L flag should suffice. You should be able to set that 
license value to 4 if it’s licensed per node and you can run up to 4 jobs 
simultaneously, or to 4*NCPUS if it’s licensed per CPU, or to 1 if it’s a single 
license good for one run on 1-4 nodes.

There are also options to query a FlexLM or RLM server for license management.
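
A sketch of the cluster-local case (the license name is just an example):

# slurm.conf
Licenses=appfoo:4

# user submission
sbatch -L appfoo:1 job.sh

# current counts
scontrol show licenses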

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 5, 2020, at 7:54 AM, navin srivastava  wrote:
> 
> Hi Team,
> 
> we have an application whose licenses is limited .it scales upto 4 nodes(~80 
> cores).
> so if 4 nodes are full, in 5th node job used to get fail.
> we want to put a restriction so that the application can't go for the 
> execution beyond the 4 nodes and fail it should be in queue state.
> i do not want to keep a separate partition to achieve this config.is there a 
> way to achieve this scenario using some dynamic resource which can call the 
> license variable on the fly and if it is reached it should keep the job in 
> queue.
> 
> Regards
> Navin.
> 
> 
> 



[slurm-users] how to restrict jobs

2020-05-05 Thread navin srivastava
Hi Team,

We have an application whose licenses are limited; it scales up to 4
nodes (~80 cores).
So if 4 nodes are full, a job on a 5th node used to fail.
We want to put a restriction in place so that the application can't execute
beyond the 4 nodes; instead of failing, the job should stay in the queue.
I do not want to keep a separate partition to achieve this config. Is there
a way to achieve this scenario using some dynamic resource which can check
the license count on the fly and, if the limit is reached, keep the job
in the queue?

Regards
Navin.


Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Josef Dvoracek

Hi,

Please also post the stdout/stderr of job 7117.

What I don't see in your config, and that I do have there, is:

c.SlurmSpawner.hub_connect_ip = '192.168.1.1' #- the IP where slurm job 
will try to connect to jupyterhub.


Also check if port 8081 is reachable from compute nodes.
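
e.g. from a compute node, something like:

nc -zv 192.168.1.1 8081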

--

josef


On 05. 05. 20 2:24, Lisa Kay Weihl wrote:
..

--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669



Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Guy Coates
Hi Lisa,

Below is my jupyterhub slurm config. It uses the profiles, which allows you
to spawn different sized jobs.  I found the most useful things for debugging
are to make sure that the --output option is being honoured (any jupyter
python errors will end up there), and to explicitly set the python
environment at the start of the script. (The example below uses conda;
replace it with whatever makes sense in your environment).

Hope that helps,

Guy


#Extend timeouts to deal with slow job launch
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.start_timeout=120
c.Spawner.term_timeout=20
c.Spawner.http_timeout = 120

# Set up the various sizes of job
c.ProfilesSpawner.profiles = [
("Local server: (Run on local machine)", "local",
"jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
 dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M")),
("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1",
"batchspawner.SlurmSpawner",
 dict(req_options=" -n 32 -N 1  -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
]

#Configure the batch job. Make sure --output is set and explicitly set up
#the jupyterhub python environment
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=L
#SBATCH {options}
trap 'echo SIGTERM received' TERM
 . /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate /usr/local/jupyterhub/jupyterhub
which jupyterhub-singleuser
{cmd}
echo "jupyterhub-singleuser ended gracefully"
"""

On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl  wrote:

> I have a single server with 2 cpu, 384gb memory and 4 gpu (GeForce RTX
> 2080 Ti).
>
> Use is to be for GPU ML computing and python based data science.
>
> One faculty wants jupyter notebooks, other faculty member is used to using
> CUDA for GPU but has only done it on a workstation in his lab with a GUI.
> New faculty member coming in has used nvidia-docker container for GPU (I
> think on a large cluster, we are just getting started)
>
> I'm charged with making all this work and hopefully all at once. Right now
> I'll take one thing working.
>
> So I managed to get Slurm-20.02.1 installed with CUDA-10.2 on CentOS 7 (SE
> Linux enabled). I posted once before about having trouble getting that
> combination correct and I finally worked that out. Most of the tests in the
> test suite seem to run okay. I'm trying to start with very basic Slurm
> configuration so I haven't enabled accounting.
>
> *For reference here is my slurm.conf*
>
> # slurm.conf file generated by configurator easy.html.
>
> # Put this file on all nodes of your cluster.
>
> # See the slurm.conf man page for more information.
>
> #
>
> SlurmctldHost=cs-host
>
>
> #authentication
>
> AuthType=auth/munge
>
> CacheGroups = 0
>
> CryptoType=crypto/munge
>
>
> #Add GPU support
>
> GresTypes=gpu
>
>
> #
>
> #MailProg=/bin/mail
>
> MpiDefault=none
>
> #MpiParams=ports=#-#
>
>
> #service
>
> ProctrackType=proctrack/cgroup
>
> ReturnToService=1
>
> SlurmctldPidFile=/var/run/slurmctld.pid
>
> #SlurmctldPort=6817
>
> SlurmdPidFile=/var/run/slurmd.pid
>
> #SlurmdPort=6818
>
> SlurmdSpoolDir=/var/spool/slurmd
>
> SlurmUser=slurm
>
> #SlurmdUser=root
>
> StateSaveLocation=/var/spool/slurmctld
>
> SwitchType=switch/none
>
> TaskPlugin=task/affinity
>
> #
>
> #
>
> # TIMERS
>
> #KillWait=30
>
> #MinJobAge=300
>
> #SlurmctldTimeout=120
>
> SlurmdTimeout=1800
>
> #
>
> #
>
> # SCHEDULING
>
> SchedulerType=sched/backfill
>
> SelectType=select/cons_tres
>
> SelectTypeParameters=CR_Core_Memory
>
> PriorityType=priority/multifactor
>
> PriorityDecayHalfLife=3-0
>
> PriorityMaxAge=7-0
>
> PriorityFavorSmall=YES
>
> PriorityWeightAge=1000
>
> PriorityWeightFairshare=0
>
> PriorityWeightJobSize=125
>
> PriorityWeightPartition=1000
>
> PriorityWeightQOS=0
>
> #
>
> #
>
> # LOGGING AND ACCOUNTING
>
> AccountingStorageType=accounting_storage/none
>
> ClusterName=cs-host
>
> #JobAcctGatherFrequency=30
>
> JobAcctGatherType=jobacct_gather/none
>
> SlurmctldDebug=info
>
> SlurmctldLogFile=/var/log/slurmctld.log
>
> #SlurmdDebug=info
>
> SlurmdLogFile=/var/log/slurmd.log
>
> #
>
> #
>
> # COMPUTE NODES
>
> NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
>
>
> #PARTITIONS
>
> PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES
> MaxTime=INFINITE State=UP
>
> PartitionName=faculty  Priority=10 Default=YES
>
>
> I have jupyterhub running as part of RedHat SCL. It