Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-17 Thread Alessandro Federico
Hi John

thanks for the info.
We are investigating the sssd slowdown, and I found some bug reports
regarding slow sssd queries with almost the same backtrace. Hopefully an
update of sssd will solve this issue.

We'll let you know if we find a solution.
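
In case it helps, the quick checks we are running in the meantime are roughly
the following (just a sketch; the group and user names are examples from our
test config):

  rpm -q sssd                    # confirm which sssd version is installed
  sss_cache -E                   # invalidate the sssd cache before re-testing
  time getent group g2           # time a group lookup, as slurmctld would do it
  time getent passwd g2bottin    # time a single user lookup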

thanks
ale

- Original Message -
> From: "John DeSantis" 
> To: "Alessandro Federico" 
> Cc: "Slurm User Community List" , "Isabella 
> Baccarelli" ,
> hpc-sysmgt-i...@cineca.it
> Sent: Wednesday, January 17, 2018 3:30:43 PM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv 
> operation
> 
> Ale,
> 
> > As Matthieu said it seems something related to SSS daemon.
> 
> That was a great catch by Matthieu.
> 
> > Moreover, only 3 SLURM partitions have the AllowGroups ACL
> 
> Correct, which may seem negligible, but after each `scontrol
> reconfigure`, slurmctld restart, and/or AllowGroups= partition update,
> the mapping of UIDs for each group will be updated.
> 
> > So why does the UID-GID mapping take so long?
> 
> We attempted to use "AllowGroups" previously, but we found (even with
> sssd.conf tuning regarding enumeration) that unless the group was
> local
> (/etc/group), we were experiencing delays before the AllowGroups
> parameter was respected.  This is why we opted to use SLURM's
> AllowQOS/AllowAccounts instead.
> 
> You would have to enable debugging on your remote authentication
> software to see where the hang-up is occurring (if it is that at all,
> and not just a delay with the slurmctld).
> 
> Given the direction that this is going - why not replace the
> "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?
> 
> > @John: we defined many partitions on the same nodes but in the
> > production cluster they will be more or less split across the 6K
> > nodes.
> 
> Ok, that makes sense.  Looking initially at your partition
> definitions,
> I immediately thought of being DRY, especially since the "finer"
> tuning
> between the partitions could easily be controlled via the QOS'
> allowed
> to access the resources.
> 
> John DeSantis
> 
> On Wed, 17 Jan 2018 13:20:49 +0100
> Alessandro Federico  wrote:
> 
> > Hi Matthieu & John
> > 
> > this is the backtrace of slurmctld during the slowdown
> > 
> > (gdb) bt
> > #0  0x7fb0e8b1e69d in poll () from /lib64/libc.so.6
> > #1  0x7fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> > #2  0x7fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
> > #3  0x7fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
> > #4  0x7fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> > #5  0x7fb0e9360986 in _getpwnam_r (result=, bufsiz=, buf=, pwd=, name=) at uid.c:73
> > #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
> > #7  0x0043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
> > #8  0x0047525a in _get_groups_members (group_names=) at partition_mgr.c:2006
> > #9  0x00475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
> > #10 0x7fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
> > #11 0x0047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
> > #12 0x00428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
> > #13 main (argc=, argv=) at controller.c:601
> > 
> > As Matthieu said it seems something related to SSS daemon.
> > However we don't notice any slowdown due to SSSd in our
> > environment.
> > As I told you before, we are just testing SLURM on a small 100-node
> > cluster before going into production with about 6000 nodes next
> > Wednesday. At present the other nodes are managed by PBSPro, and the 2
> > PBS servers are running on the same nodes as the SLURM controllers.
> > PBS queues are also configured with users/groups ACLs and we never
> > noticed any similar slowdown.
> > 
> > Moreover, only 3 SLURM partitions have the AllowGroups ACL
> > 
> > [root@mgmt01 slurm]# grep AllowGroups slurm.conf
> > PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00  State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
> > PartitionName=bdw_fua_gw    Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00  State=UP QOS=bdw_fua_gw    DenyQos=bdw_qos_special AllowGroups=g2
> > PartitionName=bdw_fua_gwg2  Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2  DenyQos=bdw_qos_special AllowGroups=g2
> > 
> > So why does the UID-GID mapping take so long?
> > 
> > @John: 

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Christopher Samuel

On 18/01/18 02:53, Loris Bennett wrote:


> This is all very OT, so it might be better to discuss it on, say, the
> OpenHPC mailing list, since as far as I can tell Spack, EasyBuild and
> Lmod (but not old or new 'environment-modules') are part of OpenHPC.


Another place might be the Beowulf list, all about Linux HPC (started by
Don Becker many moons ago), now maintained by yours truly.

http://www.beowulf.org/

Happy to add people to the list if they wish, just email me directly.

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Loris Bennett
Hi Ole,

Ole Holm Nielsen  writes:

> John: I would refrain from installing the old default package
> "environment-modules" from the Linux distribution, since it doesn't
> seem to be maintained any more.

Is this still true? Here

  http://modules.sourceforge.net/

there is a version 4.1.0 which is two days old.  Does anyone have any
experience of this and how it compares to the old version and/or Lmod?

> Lmod, on the other hand, is actively maintained and solves some
> problems with the old "environment-modules" software.
>
> There's an excellent review paper on different module tools: "Modern
> Scientific Software Management Using EasyBuild and Lmod",
> http://dl.acm.org/citation.cfm?id=2691141

Thanks for the link.  I would also be interested in how EasyBuild and
Spack compare in practice.

This is all very OT, so it might be better to discuss it on, say, the
OpenHPC mailing list, since as far as I can tell Spack, EasyBuild and
Lmod (but not old or new 'environment-modules') are part of OpenHPC.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?

2018-01-17 Thread Christopher Samuel

On 18/01/18 01:52, Paul Edmon wrote:

> We've been typically taking 4G off the top for memory in our slurm.conf
> for the system and other processes.  This seems to work pretty well.


Where I was working previously we'd discount the memory by the amount
of GPFS page cache too, plus a little for system processes.

Not sure if Greg (hi Greg!) is running GPFS there, but if so it's
worth keeping in mind.
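
As a very rough sketch (the node names and sizes below are made up), the
slurm.conf entry for a 128 GiB node then ends up looking something like:

  # reserve ~4 GiB for the OS plus the GPFS page pool (say another 8 GiB)
  NodeName=node[001-100] CPUs=32 RealMemory=118784 State=UNKNOWN
  # MemSpecLimit can also be used to mark memory as reserved for system use
  # MemSpecLimit=12288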

cheers,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-17 Thread John DeSantis
Ale,

> As Matthieu said it seems something related to SSS daemon.

That was a great catch by Matthieu.

> Moreover, only 3 SLURM partitions have the AllowGroups ACL

Correct, which may seem negligible, but after each `scontrol
reconfigure`, slurmctld restart, and/or AllowGroups= partition update,
the mapping of UIDs for each group will be updated.

> So why does the UID-GID mapping take so long?

We attempted to use "AllowGroups" previously, but we found (even with
sssd.conf tuning regarding enumeration) that unless the group was local
(/etc/group), we were experiencing delays before the AllowGroups
parameter was respected.  This is why we opted to use SLURM's
AllowQOS/AllowAccounts instead.

You would have to enable debugging on your remote authentication
software to see where the hang-up is occurring (if it is that at all,
and not just a delay with the slurmctld).

Given the direction that this is going - why not replace the
"AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?

> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.

Ok, that makes sense.  Looking initially at your partition definitions,
I immediately thought of being DRY, especially since the "finer" tuning
between the partitions could easily be controlled via the QOS' allowed
to access the resources.

John DeSantis

On Wed, 17 Jan 2018 13:20:49 +0100
Alessandro Federico  wrote:

> Hi Matthieu & John
> 
> this is the backtrace of slurmctld during the slowdown
> 
> (gdb) bt
> #0  0x7fb0e8b1e69d in poll () from /lib64/libc.so.6
> #1  0x7fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> #2  0x7fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
> #3  0x7fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
> #4  0x7fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> #5  0x7fb0e9360986 in _getpwnam_r (result=, bufsiz=, buf=, pwd=, name=) at uid.c:73
> #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
> #7  0x0043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
> #8  0x0047525a in _get_groups_members (group_names=) at partition_mgr.c:2006
> #9  0x00475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
> #10 0x7fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
> #11 0x0047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
> #12 0x00428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
> #13 main (argc=, argv=) at controller.c:601
> 
> As Matthieu said it seems something related to SSS daemon.
> However we don't notice any slowdown due to SSSd in our environment. 
> As I told you before, we are just testing SLURM on a small 100-node
> cluster before going into production with about 6000 nodes next
> Wednesday. At present the other nodes are managed by PBSPro, and the 2
> PBS servers are running on the same nodes as the SLURM controllers.
> PBS queues are also configured with users/groups ACLs and we never
> noticed any similar slowdown.
> 
> Moreover, only 3 SLURM partitions have the AllowGroups ACL
> 
> [root@mgmt01 slurm]# grep AllowGroups slurm.conf 
> PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00  State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gw    Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00  State=UP QOS=bdw_fua_gw    DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gwg2  Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2  DenyQos=bdw_qos_special AllowGroups=g2
> 
> So why does the UID-GID mapping take so long?
> 
> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.
> 
> thank you very much
> ale
> 
> - Original Message -
> > From: "John DeSantis" 
> > To: "Matthieu Hautreux" 
> > Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List"
> > , "Isabella Baccarelli"
> >  Sent: Tuesday, January 16, 2018 8:20:20 PM
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > send/recv operation
> > 
> > Matthieu,
> > 
> > > I would bet on something like LDAP requests taking too much time
> > > because of a missing sssd cache.
> > 
> > Good point!  It's easy to forget to check something as "simple" as
> > user
> > look-up when something is taking "too long".
> > 
> > John DeSantis
> > 
> > On Tue, 16 

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Vanzo, Davide
Hi Bill!

Always glad to contribute to the Lmod cause! ;)

Back to the discussion, I simply gave my contribution based on how we set up
our system. In no way did I intend to say that it is the only way to deploy
software. Yours is definitely a valid alternative, although it requires
deeper experience in software packaging and deployment.

To solve the problem of users overloading the login nodes, we are experimenting
with cgroups, but here we are going a little too far off topic.
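
For what it is worth, what we are testing is roughly per-user caps via systemd
slices on the login nodes; just a sketch (the uid and the limits are only
examples), not our final setup:

  # cap an individual user's CPU and memory on a login node
  systemctl set-property user-1234.slice CPUQuota=400% MemoryLimit=16G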

PS: Now that I am in San Antonio I have no more excuses to come and visit you 
guys at TACC.

--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2018-01-17 08:01:10-06:00 slurm-users wrote:

I’d go slightly further, though I do appreciate the Lmod shout-out!: In some 
cases, you may not even want the software on the frontend nodes (hear me out 
before I retract it).

If it’s a library that requires linking against before it can be used, then you 
probably have to have it unless you require users to submit interactive jobs to 
some dedicated build nodes to do their compilation. You’ll find that when users 
have all their software needs in one place on the frontend nodes, that 
sometimes they try to run it there, taking away resources from others. Now, a 
quick test run to make sure that their build is correct is probably no big 
deal, but some users will run their full-on science experiments (or pre- and 
post-processing steps) on the login nodes! We like to encourage those folks to 
submit jobs to the compute nodes. You could, but they probably wouldn’t like, 
cripple or not install some libraries on the login nodes to prevent this, but 
we just watch those systems like a hawk, given that we do want users to be able 
to build their programs on the login nodes.

We don’t use EB, but we do collaborate with them to make it and Lmod 
compatible. We use something like OpenHPC to push RPMs we build in-house to 
manage software on our login and compute nodes. Sometimes, we also just install 
a binary package (like an ISV code like ANSYS or MATLAB) into a shared 
filesystem (one of our Lustre filesystems, usually) when making our own RPM is 
too cumbersome, and then use Lmod to make it available and visible to our 
users. There are more strategies for this than you can imagine, so settle on a 
few and keep it simple for you!

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445



On 1/17/18, 7:48 AM, "slurm-users on behalf of Vanzo, Davide"
slurm-users-boun...@lists.schedmd.com on behalf of davide.va...@vanderbilt.edu wrote:

Ciao Elisabetta,

I second John's reply.
On our cluster we install software on the shared parallel filesystem with 
EasyBuild and use Lmod as a module front-end. Then users will simply load 
software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.


--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137

www.accre.vanderbilt.edu


On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires
a particular library installed on the system, like numpy.
If the library is not installed on the system, is it necessary to install
it on the master AND the nodes? Does this have to be done on each machine
separately, or is there a way to install it once for all machines (master
and nodes)?
Elisabetta









Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Bill Barth
I’d go slightly further, though I do appreciate the Lmod shout-out!: In some 
cases, you may not even want the software on the frontend nodes (hear me out 
before I retract it). 

If it’s a library that requires linking against before it can be used, then you 
probably have to have it unless you require users to submit interactive jobs to 
some dedicated build nodes to do their compilation. You’ll find that when users 
have all their software needs in one place on the frontend nodes, that 
sometimes they try to run it there, taking away resources from others. Now, a 
quick test run to make sure that their build is correct is probably no big 
deal, but some users will run their full-on science experiments (or pre- and 
post-processing steps) on the login nodes! We like to encourage those folks to 
submit jobs to the compute nodes. You could, but they probably wouldn’t like, 
cripple or not install some libraries on the login nodes to prevent this, but 
we just watch those systems like a hawk, given that we do want users to be able 
to build their programs on the login nodes.

We don’t use EB, but we do collaborate with them to make it and Lmod 
compatible. We use something like OpenHPC to push RPMs we build in-house to 
manage software on our login and compute nodes. Sometimes, we also just install 
a binary package (like an ISV code like ANSYS or MATLAB) into a shared 
filesystem (one of our Lustre filesystems, usually) when making our own RPM is 
too cumbersome, and then use Lmod to make it available and visible to our 
users. There are more strategies for this than you can imagine, so settle on a 
few and keep it simple for you!

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
 
 

On 1/17/18, 7:48 AM, "slurm-users on behalf of Vanzo, Davide" 
 wrote:

Ciao Elisabetta,

I second John's reply.
On our cluster we install software on the shared parallel filesystem with 
EasyBuild and use Lmod as a module front-end. Then users will simply load 
software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.
 

-- 
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

 
On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires
a particular library installed on the system, like numpy.
If the library is not installed on the system, is it necessary to install
it on the master AND the nodes? Does this have to be done on each machine
separately, or is there a way to install it once for all machines (master
and nodes)?
Elisabetta









Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread John Hearns
I should also say that Modules should be easy to install on Ubuntu. It will
be the package named "environment-modules".

You probably will have to edit the configuration file a little bit, since
the default install will assume all Modules files are local.
You need to set your MODULEPATH to include a shared directory where you
will keep all your Modules files.
This really is a lot easier than it sounds.
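
A minimal sketch of what I mean (the paths and the numpy version are just
examples):

  # make a shared modulefiles directory visible to the module command
  export MODULEPATH=/shared/modulefiles:$MODULEPATH
  # or, per shell / per job script:
  module use /shared/modulefiles

  # /shared/modulefiles/numpy/1.13  -- a tiny Tcl modulefile
  #%Module1.0
  prepend-path PYTHONPATH /shared/apps/numpy/1.13/lib/python2.7/site-packages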

On 17 January 2018 at 14:48, Vanzo, Davide 
wrote:

> Ciao Elisabetta,
>
> I second John's reply.
> On our cluster we install software on the shared parallel filesystem with
> EasyBuild and use Lmod as a module front-end. Then users will simply load
> software in the job's environment by using the module command.
>
> Feel free to ping me directly if you need specific help.
>
> --
> *Davide Vanzo, PhD*
> Application Developer
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> (615)-875-9137
> www.accre.vanderbilt.edu
>
>
> On 2018-01-17 07:28:31-06:00 slurm-users wrote:
>
> Hi,
> let's say I need to execute a python script with slurm. The script requires
> a particular library installed on the system, like numpy.
> If the library is not installed on the system, is it necessary to install
> it on the master AND the nodes? Does this have to be done on each machine
> separately, or is there a way to install it once for all machines (master
> and nodes)?
> Elisabetta
>
>


Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Ole Holm Nielsen
I can highly recommend EasyBuild as an easy way to provide software 
packages as "modules" to your cluster.  We have been very pleased with 
EasyBuild in our cluster.


I made some notes about installing EasyBuild in a Wiki page:
  https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules
We use CentOS 7 Linux.
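
As a very rough sketch of the workflow (the easyconfig name below is only an
example and may not match what current EasyBuild releases ship):

  pip install --user easybuild       # or use the bootstrap script from the EasyBuild docs
  eb --search numpy                  # find an available easyconfig
  eb numpy-1.13.3-foss-2017b-Python-2.7.14.eb --robot   # build it and its dependencies
  module use $HOME/.local/easybuild/modules/all         # EasyBuild's default install prefix
  module load numpy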

Also, if you want information about Slurm setup, I have written another 
set of Wiki pages:

  https://wiki.fysik.dtu.dk/niflheim/SLURM

/Ole

On 01/17/2018 02:39 PM, John Hearns wrote:
Hi Elisabetta.  No, you normally do not need to install software on all 
the compute nodes separately.


It is quite common to use the 'modules' environment to manage software 
like this

http://www.admin-magazine.com/HPC/Articles/Environment-Modules

Once you have numpy installed on a shared drive on the cluster, and have 
a Modules file in place, your users put this at the start of their job 
scripts:

module load numpy

You might also want to look at Easybuild 
http://easybuild.readthedocs.io/en/latest/Introduction.html

There are Easybuild 'recipes' for numpy. We use them where I work.



On 17 January 2018 at 14:28, Elisabetta Falivene 
> wrote:


Hi,
let's say I need to execute a python script with slurm. The script
requires a particular library installed on the system, like numpy.
If the library is not installed on the system, is it necessary to
install it on the master AND the nodes? Does this have to be done
on each machine separately, or is there a way to install it once
for all machines (master and nodes)?

Elisabetta




Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Vanzo, Davide
Ciao Elisabetta,

I second John's reply.
On our cluster we install software on the shared parallel filesystem with 
EasyBuild and use Lmod as a module front-end. Then users will simply load 
software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.

--
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu


On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires a
particular library installed on the system, like numpy.
If the library is not installed on the system, is it necessary to install it on
the master AND the nodes? Does this have to be done on each machine separately,
or is there a way to install it once for all machines (master and nodes)?
Elisabetta


Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread John Hearns
Hi Elisabetta.  No, you normally do not need to install software on all the
compute nodes separately.

It is quite common to use the 'modules' environment to manage software like
this
http://www.admin-magazine.com/HPC/Articles/Environment-Modules

Once you have numpy installed on a shared drive on the cluster, and have a
Modules file in place, your users put this at the start of their job
scripts:
module load numpy

You might also want to look at Easybuild
http://easybuild.readthedocs.io/en/latest/Introduction.html
There are Easybuild 'recipes' for numpy. We use them where I work.
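
A minimal job script then looks roughly like this (the module name and the
script are just examples):

  #!/bin/bash
  #SBATCH --job-name=numpy-test
  #SBATCH --ntasks=1
  #SBATCH --time=00:10:00

  # pick numpy up from the shared install rather than the node's system python
  module load numpy
  python my_script.py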



On 17 January 2018 at 14:28, Elisabetta Falivene 
wrote:

> Hi,
> let's say I need to execute a python script with slurm. The script requires
> a particular library installed on the system, like numpy.
> If the library is not installed on the system, is it necessary to install
> it on the master AND the nodes? Does this have to be done on each machine
> separately, or is there a way to install it once for all machines (master
> and nodes)?
>
> Elisabetta
>


[slurm-users] Slurm and available libraries

2018-01-17 Thread Elisabetta Falivene
Hi,
let's say I need to execute a python script with slurm. The script requires
a particular library installed on the system, like numpy.
If the library is not installed on the system, is it necessary to install
it on the master AND the nodes? Does this have to be done on each machine
separately, or is there a way to install it once for all machines (master
and nodes)?

Elisabetta


Re: [slurm-users] Slurm not starting

2018-01-17 Thread Elisabetta Falivene
Ciao Gennaro!


> > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN*
> > to
> > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
> >
> > Now, slurm works and the nodes are running. There is only one minor
> problem
> >
> > *error: Node node04 has low real_memory size (7984 < 15999)*
> > *error: Node node02 has low real_memory size (3944 < 15999)*
> >
> > Two nodes are still put to drain state. The nodes suffered a physical
> > damage to some rams and I had to physically remove them, so slurm think
> it
> > is not a good idea to use them.
> > It is possibile to make slurm use the node anyway?
>
> I think you can specify their properties on separate lines:
>
> NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
> NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
> NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*
>
>
It was indeed possible! It only required typing "UNKNOWN" instead of
"UNKNOWN*".
Problem fully solved!
Thank you very much!
Elisabetta


Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?

2018-01-17 Thread Bjørn-Helge Mevik
I tend to run a test program on an otherwise idle node, allocating (and
actually using!) more and more memory, and then seeing when it starts
swapping.  I typically end up with between 1 and 1.5 GiB less than what
"free" reports as the total memory.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

