Re: [slurm-users] Multi-node job failure

2019-12-12 Thread Chris Samuel

On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote:

Partial progress. The scientist who developed the model took a look at 
the output and found that instead of one model run being run in parallel, 
srun had run multiple instances of the model, one per thread, which for 
this test was 110 threads.


This sounds like MVAPICH isn't built to support Slurm. Per the Slurm 
MPI guide you need to build it with this to enable Slurm support (and of 
course add any other options you were using):


./configure --with-pmi=pmi2 --with-pm=slurm
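
With MVAPICH rebuilt that way, a launch along these lines should give you a 
single parallel run rather than one copy per task (the binary name here is 
just a placeholder):

srun --mpi=pmi2 -n 110 ./model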

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-12 Thread Chris Samuel

On 12/12/19 7:38 am, Ryan Cox wrote:

Be careful with this approach.  You also need the same munge key 
installed everywhere.  If the developers have root on their own system, 
they can submit jobs and run Slurm commands as any user.


I would echo Ryan's caution on this and add that as root they will be 
able to run admin commands on the box too, create reservations, shut 
Slurm down, cancel other users' jobs, etc.
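
A quick sanity check that a remote machine really shares the cluster's munge 
key is to pipe a credential across, for instance (the hostname is just a 
placeholder):

munge -n | ssh devbox unmunge

If the keys match, unmunge decodes the credential successfully; if not it 
fails immediately.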


At the Slurm User Group this year Tim Wickberg foreshadowed (and demo'd 
with a very neat "pay-for-priority" box) a REST API planned for the 
Slurm 20.02 release.  It has its own auth system separate to munge and 
would make this a lot safer.


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Chris Samuel

On 12/12/19 8:14 am, Dean Schulze wrote:

configure:5021: gcc -o conftest -I/usr/include/mysql -g -O2   conftest.c 
-L/usr/lib/x86_64-linux-gnu -lmysqlclient -lpthread -lz -lm -lrt 
-latomic -lssl -lcrypto -ldl  >&5

/usr/bin/ld: cannot find -lssl
/usr/bin/ld: cannot find -lcrypto
collect2: error: ld returned 1 exit status


That looks like your failure: you're missing the package that provides 
those libraries it's trying to link against - in this case for Debian/Ubuntu I 
suspect it's libssl-dev.
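
On Debian/Ubuntu, installing it would be something like:

sudo apt-get install libssl-dev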


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Chris Samuel

Hi Chris,

On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote:


What am I missing?


It's just a setting on the QOS, not the user:

csamuel@cori01:~> sacctmgr show qos where name=regular_1 
format=MaxJobsAccruePerUser

MaxJobsAccruePU
---
  2

So any user in that QOS can only have 2 jobs ageing at any one time.
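
For reference, setting the limit looks something like this (substitute your 
own QOS name and value):

sacctmgr modify qos regular_1 set MaxJobsAccruePerUser=2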

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Hmm, after trying this out I'm confused. I don't see the limit placed on the 
qos. In fact, I see that the qos header is missing some other options that are 
available in the man page. Maybe I'm missing an option that enables some of 
them.

[ddd@siris /home/ddd]$ sacctmgr update qos name=billybob set 
maxjobsaccrueperuser=8 -i
 Modified qos...
  billybob
[ddd@siris /home/ddd ]$ sacctmgr list qos -p|grep billybob
billybob|0|00:00:00|exploratory,free||cluster|||1.00||

[ddd@siris /home/ddd ]$ sacctmgr list qos -p|grep Name
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|

What am I missing?

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:23 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Ahh hah! Thanks Kilian!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti" 
 wrote:

Hi Chris,

On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey
 wrote:
> I believe I heard recently that you could limit the number of users 
jobs that accrue age priority points. Yet, I cannot find this option in the man 
pages. Anyone have an idea? Thank you!

It's the *JobsAccrue* options in 
https://slurm.schedmd.com/sacctmgr.html

Cheers,
-- 
Kilian







Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Ahh hah! Thanks Kilian!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 3:03 PM, "slurm-users on behalf of Kilian Cavalotti" 
 wrote:

Hi Chris,

On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey
 wrote:
> I believe I heard recently that you could limit the number of users jobs 
that accrue age priority points. Yet, I cannot find this option in the man 
pages. Anyone have an idea? Thank you!

It's the *JobsAccrue* options in 
https://slurm.schedmd.com/sacctmgr.html

Cheers,
-- 
Kilian





Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Kilian Cavalotti
Hi Chris,

On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey
 wrote:
> I believe I heard recently that you could limit the number of users jobs that 
> accrue age priority points. Yet, I cannot find this option in the man pages. 
> Anyone have an idea? Thank you!

It's the *JobsAccrue* options in https://slurm.schedmd.com/sacctmgr.html

Cheers,
-- 
Kilian



Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Dean Schulze
Thanks for mentioning the config.log file.  It has dozens of errors in it,
yet ./configure completes and doesn't report any errors.

Here's what got me past the problem with the mysql plugin.  A test program
that needed -lssl and -lcrypto on the make command line was failing.  The
solution was

sudo apt-get install libssl-dev

I also added

sudo apt-get install g++
sudo apt install build-essential

to eliminate some other failures.  Thanks to all who responded here, I now
have slurmctld and slurmdbd running.

The config.log still has dozens of errors in it, almost all due to failed
include statements.  I'll open another thread about those.



On Tue, Dec 10, 2019 at 2:05 PM Dean Schulze 
wrote:

> I'm trying to set up my first slurm installation following these
> instructions:
>
> https://github.com/nateGeorge/slurm_gpu_ubuntu
>
> I've had to deviate a little bit because I'm using virtual machines that
> don't have GPUs, so I don't have a gres.conf file and in
> /etc/slurm/slurm.conf I don't have an entry like Gres=gpu:2 on the last
> line.
>
> On my controller vm I get errors when trying to do simple commands:
>
> $ sinfo
> slurm_load_partitions: Unable to contact slurm controller (connect failure)
>
> $ sudo sacctmgr add cluster compute-cluster
> sacctmgr: error: slurm_persist_conn_open_without_init: failed to open
> persistent connection to localhost:6819: Connection refused
> sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused
> sacctmgr: error: Problem talking to the database: Connection refused
>
>
> Something is supposed to be running on port 6819, but netstat shows
> nothing using that port.  What is supposed to be running on 6819?
>
> My database (Maria) is running.  I can connect to it with `sudo mysql -U
> root`.
>
> When I boot my controller which services are supposed to be running and on
> which ports?
>
> Thanks.
>
>


[slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Christopher Benjamin Coffey
Hi,

I believe I heard recently that you could limit the number of a user's jobs that 
accrue age priority points. Yet, I cannot find this option in the man pages. 
Anyone have an idea? Thank you!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] sched

2019-12-12 Thread Alex Chekholko
Hey Steve,

I think it doesn't just "power down" the nodes but deletes the instances.
So then when you need a new node, it creates one, then provisions the
config, then updates the slurm cluster config...

That's how I understand it, but I haven't tried running it myself.

Regards,
Alex

On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier  wrote:

> Hi, I'm hoping someone can shed some light on the SchedMD-provided example
> here https://github.com/SchedMD/slurm-gcp for an autoscaling cluster on
> Google Cloud Plaform (GCP).
>
> I understand that slurm autoscaling uses the power saving interface to
> create/remove nodes and the example suspend.py and resume.py scripts in the
> repo seem pretty clear and in line with the slurm docs. However I don't
> understand why the additional slurm-gcp-sync.py script is required. It
> seems to compare the states of nodes as seen by google compute and slurm
> and then on the GCP side either start instances or shut them down, and on
> the slurm side mark them as in RESUME or DOWN states. I don't see why this
> is necessary though; my understanding from the slurm docs is that e.g. the
> suspend script simply has to "power down" the nodes, and slurmctld will
> then mark them as in power saving mode - marking nodes down would seem to
> prevent jobs being scheduled on them, which isn't what we want. Similarly,
> I would have thought the resume.py script could mark nodes as in RESUME
> state itself, (once it's tested that the node is up and slurmd is running
> etc).
>
> thanks for any help
> Steve
>


Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Dean Schulze
There's a mysql test failure in config.log.  It looks like a couple of
missing libraries.  The config.log also shows errors because g++ isn't
present, and dozens of errors because of failed includes.  I must need g++
packages on my Ubuntu instance.

But ./configure completes successfully in spite of dozens of failures.


configure:4890: checking for mysql_config
configure:4908: found /usr/bin/mysql_config
configure:4920: result: /usr/bin/mysql_config
configure:5021: gcc -o conftest -I/usr/include/mysql -g -O2   conftest.c
-L/usr/lib/x86_64-linux-gnu -lmysqlclient -lpthread -lz -lm -lrt -latomic
-lssl -lcrypto -ldl  >&5
/usr/bin/ld: cannot find -lssl
/usr/bin/ld: cannot find -lcrypto
collect2: error: ld returned 1 exit status
configure:5021: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "slurm"
| #define PACKAGE_TARNAME "slurm"
| #define PACKAGE_VERSION "19.05"
| #define PACKAGE_STRING "slurm 19.05"
| #define PACKAGE_BUGREPORT ""
| #define PACKAGE_URL "https://slurm.schedmd.com"
| #define PROJECT "slurm"
| #define SLURM_API_VERSION 0x22
| #define SLURM_API_CURRENT 34
| #define SLURM_API_MAJOR 34
| #define SLURM_API_AGE 0
| #define SLURM_API_REVISION 0
| #define VERSION "19.05.4"
| #define SLURM_VERSION_NUMBER 0x130504
| #define SLURM_MAJOR "19"
| #define SLURM_MINOR "05"
| #define SLURM_MICRO "4"
| #define RELEASE "1"
| #define SLURM_VERSION_STRING "19.05.4"
| /* end confdefs.h.  */
| #include <mysql.h>
| int
| main ()
| {
|
| MYSQL mysql;
| (void) mysql_init(&mysql);
| (void) mysql_close(&mysql);
|
|   ;
|   return 0;
| }
configure:5041: WARNING: *** MySQL test program execution failed. A
thread-safe MySQL library is required.


On Wed, Dec 11, 2019 at 6:33 PM Kurt H Maier  wrote:

> On Wed, Dec 11, 2019 at 04:04:44PM -0700, Dean Schulze wrote:
> > I tried again with a completely new system (virtual machine).  I used the
> > latest source, I used mysql instead of mariadb, and I installed all the
> > client and dev libs (below).  I still get the same error.  It doesn't
> > build the /usr/lib/slurm/accounting_storage_mysql.so file.
> >
> > Could the ./configure command be the problem?  Here's how I run it:
>
> It's going to be extremely difficult to diagnose this without the output
> from the build process.  Perhaps you could attach this to the bug report
> you opened about this issue.
>
> khm
>
>


[slurm-users] pkgconfig conflict

2019-12-12 Thread William Brown
Version 19.05.3-2
CentOS 7.7

I wanted to install the slurm-devel RPM that I had built, but I get
this transaction check error:

$ sudo yum localinstall
/home/apps/slurm/19.05/RPMS/slurm-devel-19.05.3-2.el7.x86_64.rpm
.
.
Transaction check error:
  file /usr/lib64/pkgconfig from install of
slurm-devel-19.05.3-2.el7.x86_64 conflicts with file from package
pkgconfig-1:0.27.1-4.el7.x86_64
  file /usr/lib64/pkgconfig from install of
slurm-devel-19.05.3-2.el7.x86_64 conflicts with file from package
MariaDB-devel-10.4.10-1.el7.centos.x86_64

Reading elsewhere on the Internet seems to suggest that the RPM shouldn't
include the directory itself:

$ rpm -qlp /home/apps/slurm/19.05/RPMS/slurm-devel-19.05.3-2.el7.x86_64.rpm
/usr/include/slurm
/usr/include/slurm/pmi.h
/usr/include/slurm/pmi2.h
/usr/include/slurm/slurm.h
/usr/include/slurm/slurm_errno.h
/usr/include/slurm/slurmdb.h
/usr/include/slurm/smd_ns.h
/usr/include/slurm/spank.h
/usr/lib64/pkgconfig  <<<
/usr/lib64/pkgconfig/slurm.pc

Anyone else seen this?

I am not very familiar with building RPMs, but it sounds as if it is
possible, when building an RPM, to tag some files (and I guess directories)
as 'noreplace' in the spec file.
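
For example (just a sketch of my understanding, not the actual slurm.spec), 
listing the file rather than the bare directory in %files would avoid the 
package claiming /usr/lib64/pkgconfig itself:

%files devel
%{_includedir}/slurm/*.h
%{_libdir}/pkgconfig/slurm.pc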

William


Re: [slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-12 Thread Ryan Cox
Be careful with this approach.  You also need the same munge key 
installed everywhere.  If the developers have root on their own system, 
they can submit jobs and run Slurm commands as any user.


ssh sounds significantly safer.  A quick and easy way to make sure that 
users don't abuse the system is to set limits using pam_limits.so, 
usually in /etc/security/limits.conf.  A cputime limit of one minute 
should prevent users from running their work there.  If I'm reading it 
right, it sounds like you do want jobs running on that system but do not 
want people launching work over ssh.  In that case, you would need to 
make sure that pam_limits.so is enabled for ssh but not Slurm.
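
For example, a single line like this in limits.conf would cap ssh sessions 
for a group at one minute of CPU time (the group name is only an example):

@devs  hard  cpu  1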


Ryan

On 12/12/19 2:01 AM, Nguyen Dai Quy wrote:
On Thu, Dec 12, 2019 at 5:53 AM Ryan Novosielski wrote:


Sure; they’ll need to have the appropriate part of SLURM installed
and the config file. This is similar to having just one login node
per user. Typically login nodes don’t run either daemon.


Hi,
It's interesting ! Do you have any link/tutorial for this kind of setup?
Thanks,




On Dec 11, 2019, at 22:41, Victor (Weikai) Xie <xiewei...@gmail.com> wrote:


Hi,

We are trying to setup a tiny Slurm cluster to manage shared
access to the GPU server in our team. Both slurmctld and slumrd
are going to run on this GPU server. But here is a problem. On
one hand, we don't want to give developers ssh access to that
box, because otherwise they might bypass Slurm job queue and
launch jobs directly on the box. On the other hand, if developers
don't have ssh access to the box, how can they run 'sbatch'
command to submit jobs?

Does Slurm provide an option to allow developers submit jobs
right from their own PCs?

Regards,

Victor (Weikai)  Xie






Re: [slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

2019-12-12 Thread Marcus Wagner

Hi Beatrice and Bjørn-Helge,

I can confirm that it works with 18.08.7. We additionally use 
TRESBillingWeights together with PriorityFlags=MAX_TRES. For example:

TRESBillingWeights="CPU=1.0,Mem=0.1875G,gres/gpu=12.0"
We use the billing factor for our external accounting, so that nodes are 
accounted for fairly. But we do see a similar effect due to --exclusive.

In Beatrice's case, the billing weight would be:
TRESBillingWeights="CPU=1.0,Mem=0.21875G"
So, a 10-CPU job with 1 GB per CPU would be billed 10.
A 1-CPU job with 10 GB would be billed 2 (0.21875 * 10, floored).
An exclusive 10-CPU job with 1 GB per CPU would be billed 28 (all 28 
cores are held by the job).
An exclusive 1-CPU job with 30 GB per CPU (Beatrice's example) would be billed 
183 (28 cores * 30 GB * 0.21875 = 183.75, floored).


Best
Marcus

On 12/12/19 9:47 AM, Bjørn-Helge Mevik wrote:

Beatrice Charton  writes:


Hi,

We have a strange behaviour of Slurm after updating from 18.08.7 to
18.08.8, for jobs using --exclusive and --mem-per-cpu.

Our nodes have 128GB of memory, 28 cores.
$ srun  --mem-per-cpu=30000 -n 1  --exclusive  hostname
=> works in 18.08.7
=> doesn’t work in 18.08.8

I'm actually surprised it _worked_ in 18.08.7.  At one time - long before
v 18.08, the behaviour was changed when using --exclusive: In order to
account the job for all cpus on the node, the number of
cpus asked for with --ntasks would simply be multiplied by
"#cpus-on-node / --ntasks" (so in your case: 28).  Unfortunately, that
also means that the memory the job requires per node is "#cpus-on-node /
--ntasks" multiplied by --mem-per-cpu (in your case 28 * 30000 MiB ~=
820 GiB).  For this reason, we tend to ban --exclusive on our clusters
(or at least warn about it).

I haven't looked at the code for a long time, so I don't know whether
this is still the current behaviour, but every time I've tested, I've
seen the same problem.  I believe I've tested on 19.05 (but I might
remember wrong).



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de




Re: [slurm-users] Need help with controller issues

2019-12-12 Thread Gennaro Oliva
Hi Dean,

On Wed, Dec 11, 2019 at 04:04:44PM -0700, Dean Schulze wrote:
> I tried again with a completely new system (virtual machine).  I used the
> latest source, I used mysql instead of mariadb, and I installed all the
> client and dev libs (below).  I still get the same error.  It doesn't
> build the /usr/lib/slurm/accounting_storage_mysql.so file.

On Debian it builds fine with default-libmysqlclient-dev installed.
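
That is, something like:

sudo apt-get install default-libmysqlclient-dev

before running ./configure.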

Best regards,
-- 
Gennaro Oliva



[slurm-users] slurm cpu allocation

2019-12-12 Thread Ricardo Gregorio
Hi all,

Wondering whether someone could help me with the following:

I am new and struggling a bit with some slurm concepts.

We are running an old version, 17.02.11, soon to be upgraded to 18.X.
In our slurm.conf we have "SelectType=select/cons_res".
https://slurm.schedmd.com/cpu_management.html#Step1

21x compute nodes * 32 cores = 672
partition=standard

A user was submitting a job requesting 168 CPUs. The job was getting allocated 
6 nodes [5*32 + 1*8], which means 5 nodes were fully committed to this single 
job. If he submitted 4 such jobs, one would go pending because each job 
"required" 6 nodes.

Would it be possible instead to get 21 nodes providing 8 cores each, i.e. the 
same total of 168, and therefore still be able to run the 4th job while maxing 
out the resources? Would it mean specifying --nodes=21?

Or does what I am saying not make sense?
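
For what it's worth, I was imagining a submission along these lines (untested, 
just to illustrate what I mean):

sbatch --nodes=21 --ntasks-per-node=8 myjob.sh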

Thanks in advance
Ricardo



Rothamsted Research is a company limited by guarantee, registered in England at 
Harpenden, Hertfordshire, AL5 2JQ under the registration number 2393175 and a 
not for profit charity number 802038.


[slurm-users] sched

2019-12-12 Thread Steve Brasier
Hi, I'm hoping someone can shed some light on the SchedMD-provided example
here https://github.com/SchedMD/slurm-gcp for an autoscaling cluster on
Google Cloud Plaform (GCP).

I understand that slurm autoscaling uses the power saving interface to
create/remove nodes and the example suspend.py and resume.py scripts in the
repo seem pretty clear and in line with the slurm docs. However I don't
understand why the additional slurm-gcp-sync.py script is required. It
seems to compare the states of nodes as seen by google compute and slurm
and then on the GCP side either start instances or shut them down, and on
the slurm side mark them as in RESUME or DOWN states. I don't see why this
is necessary though; my understanding from the slurm docs is that e.g. the
suspend script simply has to "power down" the nodes, and slurmctld will
then mark them as in power saving mode - marking nodes down would seem to
prevent jobs being scheduled on them, which isn't what we want. Similarly,
I would have thought the resume.py script could mark nodes as in RESUME
state itself, (once it's tested that the node is up and slurmd is running
etc).
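
For context, my understanding of the relevant slurm.conf knobs is roughly the
following (the paths are placeholders for wherever slurm-gcp installs its
scripts):

SuspendProgram=/opt/slurm/scripts/suspend.py
ResumeProgram=/opt/slurm/scripts/resume.py
SuspendTime=300
ResumeTimeout=600
SuspendExcNodes=controller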

thanks for any help
Steve


Re: [slurm-users] Need help with controller issues

2019-12-12 Thread William Brown
I looked back in the list to November when I had the same problem
building with MariaDB:
 On 11-11-2019 21:23, William Brown wrote:
> I have in fact found the answer by looking harder.
>
> The config.log clearly showed that the build of the test MySQL
> program failed, which is why it was set to be excluded.
>
> It failed to link against '-lmariadb'.  It turns out that library is
> no longer in MariaDB or MariaDB-devel, it is separately packaged in
> MariaDB-shared.  That may of course be because I have built MariaDB
> 10.4 from the mariadb.org site, because CentOS 7 only ships with the
> extremely old version 5.5.
>
> Once I installed the missing package it built the RPMs just fine.
> However it would be easier to use it linked to static MariaDB
> libraries, as I now have to installed MariaDB-shared on every server
> that will run slurmd, i.e. all compute nodes.  I expect that if I
> looked harder at the build options there may be a way to do this,
> perhaps with linker flags.

I think that even if you are building with MySQL on a clean VM you need to
look at the detailed log of the build of the accounting components.

The configure script looks for either the mysql_config or mariadb_config
command and, if it finds one, assumes that you need to build with MySQL
support.  In your case you actually want it, whereas I really didn't, as I
use slurmdbd rather than a direct connection to MySQL.  It was then a matter
of having the right RPMs
installed, and for MariaDB the missing bit was MariaDB-shared (this is for
RHEL/CentOS).  As soon as I had installed that I was able to get the
accounting_storage_mysql to build.
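
Concretely, with the mariadb.org 10.4 packages on RHEL/CentOS the missing
piece was just:

sudo yum install MariaDB-shared

after which the accounting_storage_mysql plugin built.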

On Thu, 12 Dec 2019 at 03:54,  wrote:

> Is that logged somewhere or do I need to capture the output from the make
> command to a file?
>
> -Original Message-
> From: slurm-users  On Behalf Of
> Kurt
> H Maier
> Sent: Wednesday, December 11, 2019 6:32 PM
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Need help with controller issues
>
> On Wed, Dec 11, 2019 at 04:04:44PM -0700, Dean Schulze wrote:
> > I tried again with a completely new system (virtual machine).  I used
> > the latest source, I used mysql instead of mariadb, and I installed
> > all the client and dev libs (below).  I still get the same error.  It
> > doesn't build the /usr/lib/slurm/accounting_storage_mysql.so file.
> >
> > Could the ./configure command be the problem?  Here's how I run it:
>
> It's going to be extremely difficult to diagnose this without the output
> from the build process.  Perhaps you could attach this to the bug report
> you
> opened about this issue.
>
> khm
>
>
>
>


Re: [slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-12 Thread Nguyen Dai Quy
On Thu, Dec 12, 2019 at 5:53 AM Ryan Novosielski 
wrote:

> Sure; they’ll need to have the appropriate part of SLURM installed and the
> config file. This is similar to having just one login node per user.
> Typically login nodes don’t run either daemon.
>
>
Hi,
It's interesting ! Do you have any link/tutorial for this kind of setup?
Thanks,



> On Dec 11, 2019, at 22:41, Victor (Weikai) Xie 
> wrote:
>
> 
> Hi,
>
> We are trying to setup a tiny Slurm cluster to manage shared access to the
> GPU server in our team. Both slurmctld and slumrd are going to run on this
> GPU server. But here is a problem. On one hand, we don't want to give
> developers ssh access to that box, because otherwise they might bypass
> Slurm job queue and launch jobs directly on the box. On the other hand, if
> developers don't have ssh access to the box, how can they run 'sbatch'
> command to submit jobs?
>
> Does Slurm provide an option to allow developers submit jobs right from
> their own PCs?
>
> Regards,
>
> Victor (Weikai)  Xie
>
>


Re: [slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

2019-12-12 Thread Bjørn-Helge Mevik
Beatrice Charton  writes:

> Hi,
>
> We have a strange behaviour of Slurm after updating from 18.08.7 to
> 18.08.8, for jobs using --exclusive and --mem-per-cpu.
>
> Our nodes have 128GB of memory, 28 cores.
>   $ srun  --mem-per-cpu=30000 -n 1  --exclusive  hostname
> => works in 18.08.7 
> => doesn’t work in 18.08.8

I'm actually surprised it _worked_ in 18.08.7.  At one time - long before
v 18.08, the behaviour was changed when using --exclusive: In order to
account the job for all cpus on the node, the number of
cpus asked for with --ntasks would simply be multiplied by
"#cpus-on-node / --ntasks" (so in your case: 28).  Unfortunately, that
also means that the memory the job requires per node is "#cpus-on-node /
--ntasks" multiplied by --mem-per-cpu (in your case 28 * 30000 MiB ~=
820 GiB).  For this reason, we tend to ban --exclusive on our clusters
(or at least warn about it).

I haven't looked at the code for a long time, so I don't know whether
this is still the current behaviour, but every time I've tested, I've
seen the same problem.  I believe I've tested on 19.05 (but I might
remember wrong).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] cleanup script after timeout

2019-12-12 Thread Reuti
Hi,

Am 12.12.2019 um 03:06 schrieb Brian Andrus:

> You prompted me to dig even deeper into my epilog. I was trying to access a 
> semaphore file in the user's home directory.
> 
> It seems that when the epilogue is run the ~ is not expanded in any way. So I 
> can't even use ~${SLURM_JOB_USER} to access their semaphore file.

To use ~${SLURM_JOB_USER} you would need to use `eval`, as it needs to 
be evaluated twice. More promising might be:

$ getent passwd ${SLURM_JOB_USER} | cut -d: -f6
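
In an epilog script that could look like this (the semaphore file name is 
only an example):

USER_HOME=$(getent passwd "${SLURM_JOB_USER}" | cut -d: -f6)
rm -f "${USER_HOME}/job.semaphore"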

-- Reuti


> Potentially problematic for any sites with homes in different locations or 
> accounts with non-standard homes, but at least what I needed to do can work.
> 
> Brian
> 
> 
> On 12/11/2019 3:44 PM, Juergen Salk wrote:
>> Hi Brian,
>> 
>> can you maybe elaborate on how exactly you verified that your epilog
>> does not run when a job exceeds it's walltime limit? Does it run when
>> the jobs end normally or when a running job is cancelled by the user?
>> 
>> I am asking because in our environment the epilog also runs when a job
>> hits the walltime limit or is cancelled and I think this is actually
>> how it is supposed to work.
>> 
>> Best regards
>> Jürgen
>> 
>