Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
Hi John, thanks for the info. We are investigating the slowdown of sssd, and I found some bug reports regarding slow sssd queries with almost the same backtrace. Hopefully an update of sssd will solve this issue. We'll let you know if we find a solution.

thanks
ale

----- Original Message -----
> From: "John DeSantis"
> To: "Alessandro Federico"
> Cc: "Slurm User Community List", "Isabella Baccarelli", hpc-sysmgt-i...@cineca.it
> Sent: Wednesday, January 17, 2018 3:30:43 PM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
>
> Ale,
>
> > As Matthieu said it seems something related to SSS daemon.
>
> That was a great catch by Matthieu.
>
> > Moreover, only 3 SLURM partitions have the AllowGroups ACL
>
> Correct, which may seem negligent, but after each `scontrol reconfigure`,
> slurmctld restart, and/or AllowGroups= partition update, the mapping of
> UIDs for each group will be updated.
>
> > So why does the UID-GID mapping take so long?
>
> We attempted to use "AllowGroups" previously, but we found (even with
> sssd.conf tuning regarding enumeration) that unless the group was local
> (/etc/group), we were experiencing delays before the AllowGroups
> parameter was respected. This is why we opted to use SLURM's
> AllowQOS/AllowAccounts instead.
>
> You would have to enable debugging on your remote authentication
> software to see where the hang-up is occurring (if it is that at all,
> and not just a delay with the slurmctld).
>
> Given the direction that this is going - why not replace the
> "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?
>
> > @John: we defined many partitions on the same nodes but in the
> > production cluster they will be more or less split across the 6K
> > nodes.
>
> Ok, that makes sense. Looking initially at your partition definitions,
> I immediately thought of being DRY, especially since the "finer" tuning
> between the partitions could easily be controlled via the QOS' allowed
> to access the resources.
>
> John DeSantis
>
> On Wed, 17 Jan 2018 13:20:49 +0100 Alessandro Federico wrote:
>
> > Hi Matthieu & John
> >
> > this is the backtrace of slurmctld during the slowdown
> >
> > (gdb) bt
> > #0  0x7fb0e8b1e69d in poll () from /lib64/libc.so.6
> > #1  0x7fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> > #2  0x7fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
> > #3  0x7fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
> > #4  0x7fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> > #5  0x7fb0e9360986 in _getpwnam_r (result=, bufsiz=, buf=, pwd=, name=) at uid.c:73
> > #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
> > #7  0x0043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
> > #8  0x0047525a in _get_groups_members (group_names=) at partition_mgr.c:2006
> > #9  0x00475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
> > #10 0x7fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
> > #11 0x0047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
> > #12 0x00428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
> > #13 main (argc=, argv=) at controller.c:601
> >
> > As Matthieu said it seems something related to SSS daemon.
> > However we don't notice any slowdown due to SSSd in our environment.
> > As I told you before, we are just testing SLURM on a small 100 node
> > cluster before going into production with about 6000 nodes next
> > Wednesday. At present the other nodes are managed by PBSPro and the 2
> > PBS servers are running on the same nodes as the SLURM controllers.
> > PBS queues are also configured with users/groups ACLs and we never
> > noticed any similar slowdown.
> >
> > Moreover, only 3 SLURM partitions have the AllowGroups ACL
> >
> > [root@mgmt01 slurm]# grep AllowGroups slurm.conf
> > PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00 State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
> > PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00 State=UP QOS=bdw_fua_gw DenyQos=bdw_qos_special AllowGroups=g2
> > PartitionName=bdw_fua_gwg2 Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2 DenyQos=bdw_qos_special AllowGroups=g2
> >
> > So why does the UID-GID mapping take so long?
> >
> > @John:
Re: [slurm-users] Slurm and available libraries
On 18/01/18 02:53, Loris Bennett wrote:

> This is all very OT, so it might be better to discuss it on, say, the
> OpenHPC mailing list, since as far as I can tell Spack, EasyBuild and
> Lmod (but not old or new 'environment-modules') are part of OpenHPC.

Another place might be the Beowulf list, all about Linux HPC (started by Don Becker many moons ago), now maintained by yours truly. http://www.beowulf.org/

Happy to add people to the list if they wish, just email me directly.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [slurm-users] Slurm and available libraries
Hi Ole,

Ole Holm Nielsen writes:

> John: I would refrain from installing the old default package
> "environment-modules" from the Linux distribution, since it doesn't
> seem to be maintained any more.

Is this still true? Here http://modules.sourceforge.net/ there is a version 4.1.0 which is two days old. Does anyone have any experience of this and how it compares to the old version and/or Lmod?

> Lmod, on the other hand, is actively maintained and solves some
> problems with the old "environment-modules" software.
>
> There's an excellent review paper on different module tools: "Modern
> Scientific Software Management Using EasyBuild and Lmod",
> http://dl.acm.org/citation.cfm?id=2691141

Thanks for the link. I would also be interested in how EasyBuild and Spack compare in practice.

This is all very OT, so it might be better to discuss it on, say, the OpenHPC mailing list, since as far as I can tell Spack, EasyBuild and Lmod (but not old or new 'environment-modules') are part of OpenHPC.

Cheers,
Loris
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?
On 18/01/18 01:52, Paul Edmon wrote:

> We've been typically taking 4G off the top for memory in our slurm.conf
> for the system and other processes. This seems to work pretty well.

Where I was working previously we'd discount the memory by the amount of GPFS page cache too, plus a little for system processes. Not sure if Greg (hi Greg!) is running GPFS there, but if so it's worth keeping in mind.

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
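The "take 4G off the top" rule of thumb is easy to automate when generating slurm.conf node lines. A minimal sketch (the 4096 MiB reserve is this thread's convention, not a Slurm default, and the GPFS adjustment is only relevant on GPFS clients):

```python
def real_memory_mib(total_mib: int, reserve_mib: int = 4096) -> int:
    """Return the RealMemory value (MiB) to put in slurm.conf.

    Reserves `reserve_mib` for the OS and daemons; per the thread, you
    might also add the GPFS page pool size to the reserve on GPFS nodes.
    """
    if total_mib <= reserve_mib:
        raise ValueError("node has less memory than the reserve")
    return total_mib - reserve_mib

# Example: a 128 GiB node
print(real_memory_mib(131072))  # → 126976
```

The resulting number then goes into the node definition, e.g. `NodeName=node01 RealMemory=126976`.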
Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
Ale,

> As Matthieu said it seems something related to SSS daemon.

That was a great catch by Matthieu.

> Moreover, only 3 SLURM partitions have the AllowGroups ACL

Correct, which may seem negligent, but after each `scontrol reconfigure`, slurmctld restart, and/or AllowGroups= partition update, the mapping of UIDs for each group will be updated.

> So why does the UID-GID mapping take so long?

We attempted to use "AllowGroups" previously, but we found (even with sssd.conf tuning regarding enumeration) that unless the group was local (/etc/group), we were experiencing delays before the AllowGroups parameter was respected. This is why we opted to use SLURM's AllowQOS/AllowAccounts instead.

You would have to enable debugging on your remote authentication software to see where the hang-up is occurring (if it is that at all, and not just a delay with the slurmctld).

Given the direction that this is going - why not replace the "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="?

> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.

Ok, that makes sense. Looking initially at your partition definitions, I immediately thought of being DRY, especially since the "finer" tuning between the partitions could easily be controlled via the QOS' allowed to access the resources.

John DeSantis

On Wed, 17 Jan 2018 13:20:49 +0100 Alessandro Federico wrote:

> Hi Matthieu & John
>
> this is the backtrace of slurmctld during the slowdown
>
> (gdb) bt
> #0  0x7fb0e8b1e69d in poll () from /lib64/libc.so.6
> #1  0x7fb0e8617bfa in sss_cli_make_request_nochecks () from /lib64/libnss_sss.so.2
> #2  0x7fb0e86185a3 in sss_nss_make_request () from /lib64/libnss_sss.so.2
> #3  0x7fb0e8619104 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
> #4  0x7fb0e8aef07d in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
> #5  0x7fb0e9360986 in _getpwnam_r (result=, bufsiz=, buf=, pwd=, name=) at uid.c:73
> #6  uid_from_string (name=0x1820e41 "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111
> #7  0x0043587d in get_group_members (group_name=0x10ac500 "g2") at groups.c:139
> #8  0x0047525a in _get_groups_members (group_names=) at partition_mgr.c:2006
> #9  0x00475505 in _update_part_uid_access_list (x=0x7fb03401e650, arg=0x7fff07f13bf4) at partition_mgr.c:1930
> #10 0x7fb0e92ab675 in list_for_each (l=0x1763e50, f=f@entry=0x4754d8 <_update_part_uid_access_list>, arg=arg@entry=0x7fff07f13bf4) at list.c:420
> #11 0x0047911a in load_part_uid_allow_list (force=1) at partition_mgr.c:1971
> #12 0x00428e5c in _slurmctld_background (no_data=0x0) at controller.c:1911
> #13 main (argc=, argv=) at controller.c:601
>
> As Matthieu said it seems something related to SSS daemon.
> However we don't notice any slowdown due to SSSd in our environment.
> As I told you before, we are just testing SLURM on a small 100 node
> cluster before going into production with about 6000 nodes next
> Wednesday. At present the other nodes are managed by PBSPro and the 2
> PBS servers are running on the same nodes as the SLURM controllers.
> PBS queues are also configured with users/groups ACLs and we never
> noticed any similar slowdown.
>
> Moreover, only 3 SLURM partitions have the AllowGroups ACL
>
> [root@mgmt01 slurm]# grep AllowGroups slurm.conf
> PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00 State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gw Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=48:00:00 State=UP QOS=bdw_fua_gw DenyQos=bdw_qos_special AllowGroups=g2
> PartitionName=bdw_fua_gwg2 Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=168:00:00 State=UP QOS=bdw_fua_gwg2 DenyQos=bdw_qos_special AllowGroups=g2
>
> So why does the UID-GID mapping take so long?
>
> @John: we defined many partitions on the same nodes but in the
> production cluster they will be more or less split across the 6K
> nodes.
>
> thank you very much
> ale
>
> ----- Original Message -----
> > From: "John DeSantis"
> > To: "Matthieu Hautreux"
> > Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List", "Isabella Baccarelli"
> > Sent: Tuesday, January 16, 2018 8:20:20 PM
> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on
> > send/recv operation
> >
> > Matthieu,
> >
> > > I would bet on something like LDAP requests taking too much time
> > > because of a missing sssd cache.
> >
> > Good point! It's easy to forget to check something as "simple" as
> > user look-up when something is taking "too long".
> >
> > John DeSantis
> >
> > On Tue, 16
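John's suggestion of swapping AllowGroups for account-based ACLs would look roughly like the sketch below. This is an illustration only: the account name `g2` (mirroring the UNIX group) and the `sacctmgr` association lines are assumptions, not the poster's actual configuration.

```
# Once, in the accounting database:
#   sacctmgr add account g2
#   sacctmgr add user g2bottin Account=g2
#
# Then restrict the partition by account instead of by group, so slurmctld
# no longer has to enumerate group members through NSS/sssd:
PartitionName=bdw_fua_gwdbg Nodes=r040c03s0[1,2] Default=NO DefMemPerCPU=3000 DefaultTime=00:30:00 MaxTime=00:30:00 State=UP QOS=bdw_fua_gwdbg DenyQos=bdw_qos_special AllowAccounts=g2
```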
Re: [slurm-users] Slurm and available libraries
Hi Bill!

Always glad to contribute to the Lmod cause! ;)

Back to the discussion, I simply gave my contribution based on how we set up our system. In no way did I intend to say that that is the only way to deploy software. Yours is definitely a valid alternative, although it requires deeper experience in software packaging and deployment. To solve the problem of users overloading the login nodes we are experimenting with cgroups, but here we are going a little too much off topic.

PS: Now that I am in San Antonio I have no more excuses not to come and visit you guys at TACC.

-- 
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

On 2018-01-17 08:01:10-06:00 slurm-users wrote:

I'd go slightly further, though I do appreciate the Lmod shout-out!

In some cases, you may not even want the software on the frontend nodes (hear me out before I retract it). If it's a library that requires linking against before it can be used, then you probably have to have it, unless you require users to submit interactive jobs to some dedicated build nodes to do their compilation. You'll find that when users have all their software needs in one place on the frontend nodes, they sometimes try to run it there, taking away resources from others. Now, a quick test run to make sure that their build is correct is probably no big deal, but some users will run their full-on science experiments (or pre- and post-processing steps) on the login nodes! We like to encourage those folks to submit jobs to the compute nodes. You could, but they probably wouldn't like it, cripple or not install some libraries on the login nodes to prevent this, but we just watch those systems like a hawk, given that we do want users to be able to build their programs on the login nodes.
We don't use EB, but we do collaborate with them to make it and Lmod compatible. We use something like OpenHPC to push RPMs we build in-house to manage software on our login and compute nodes. Sometimes, we also just install a binary package (like an ISV code such as ANSYS or MATLAB) into a shared filesystem (one of our Lustre filesystems, usually) when making our own RPM is too cumbersome, and then use Lmod to make it available and visible to our users. There are more strategies for this than you can imagine, so settle on a few and keep it simple for you!

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 1/17/18, 7:48 AM, "slurm-users on behalf of Vanzo, Davide" wrote:

Ciao Elisabetta,

I second John's reply. On our cluster we install software on the shared parallel filesystem with EasyBuild and use Lmod as a module front-end. Then users will simply load software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.

-- 
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires a particular library installed on the system, like numpy.
If the library is not installed on the system, it is necessary to install it on the master AND the nodes, right?
This has to be done on each machine separately, or is there a way to install it one time for all the machines (master and nodes)?

Elisabetta
Re: [slurm-users] Slurm and available libraries
I'd go slightly further, though I do appreciate the Lmod shout-out!

In some cases, you may not even want the software on the frontend nodes (hear me out before I retract it). If it's a library that requires linking against before it can be used, then you probably have to have it, unless you require users to submit interactive jobs to some dedicated build nodes to do their compilation. You'll find that when users have all their software needs in one place on the frontend nodes, they sometimes try to run it there, taking away resources from others. Now, a quick test run to make sure that their build is correct is probably no big deal, but some users will run their full-on science experiments (or pre- and post-processing steps) on the login nodes! We like to encourage those folks to submit jobs to the compute nodes. You could, but they probably wouldn't like it, cripple or not install some libraries on the login nodes to prevent this, but we just watch those systems like a hawk, given that we do want users to be able to build their programs on the login nodes.

We don't use EB, but we do collaborate with them to make it and Lmod compatible. We use something like OpenHPC to push RPMs we build in-house to manage software on our login and compute nodes. Sometimes, we also just install a binary package (like an ISV code such as ANSYS or MATLAB) into a shared filesystem (one of our Lustre filesystems, usually) when making our own RPM is too cumbersome, and then use Lmod to make it available and visible to our users. There are more strategies for this than you can imagine, so settle on a few and keep it simple for you!

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 1/17/18, 7:48 AM, "slurm-users on behalf of Vanzo, Davide" wrote:

Ciao Elisabetta,

I second John's reply. On our cluster we install software on the shared parallel filesystem with EasyBuild and use Lmod as a module front-end. Then users will simply load software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.

-- 
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires a particular library installed on the system, like numpy.
If the library is not installed on the system, it is necessary to install it on the master AND the nodes, right? This has to be done on each machine separately, or is there a way to install it one time for all the machines (master and nodes)?

Elisabetta
Re: [slurm-users] Slurm and available libraries
I should also say that Modules should be easy to install on Ubuntu. It will be the package named "environment-modules". You probably will have to edit the configuration file a little bit, since the default install will assume all Modules files are local. You need to set your MODULEPATH to include a shared directory where you will keep all your Modules files. This really is a lot easier than it sounds.

On 17 January 2018 at 14:48, Vanzo, Davide wrote:

> Ciao Elisabetta,
>
> I second John's reply.
> On our cluster we install software on the shared parallel filesystem with
> EasyBuild and use Lmod as a module front-end. Then users will simply load
> software in the job's environment by using the module command.
>
> Feel free to ping me directly if you need specific help.
>
> -- 
> Davide Vanzo, PhD
> Application Developer
> Adjunct Assistant Professor of Chemical and Biomolecular Engineering
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> (615)-875-9137
> www.accre.vanderbilt.edu
>
> On 2018-01-17 07:28:31-06:00 slurm-users wrote:
>
> Hi,
> let's say I need to execute a python script with slurm. The script requires
> a particular library installed on the system, like numpy.
> If the library is not installed on the system, it is necessary to install
> it on the master AND the nodes, right? This has to be done on each machine
> separately, or is there a way to install it one time for all the machines
> (master and nodes)?
>
> Elisabetta
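Pointing Modules at a shared directory boils down to extending MODULEPATH; a minimal sketch (the /shared/apps/modulefiles path is an example only, adjust to your cluster):

```shell
# Make modulefiles kept on a shared filesystem visible to environment-modules.
# /shared/apps/modulefiles is a placeholder path for your shared directory.
export MODULEPATH=/shared/apps/modulefiles:${MODULEPATH:-}
```

To make this permanent for all users, the usual approach is a small snippet in /etc/profile.d on every node containing that export line.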
Re: [slurm-users] Slurm and available libraries
I can highly recommend EasyBuild as an easy way to provide software packages as "modules" to your cluster. We have been very pleased with EasyBuild on our cluster. I made some notes about installing EasyBuild in a Wiki page: https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules
We use CentOS 7 Linux.

Also, if you want information about Slurm setup, I have written another set of Wiki pages: https://wiki.fysik.dtu.dk/niflheim/SLURM

/Ole

On 01/17/2018 02:39 PM, John Hearns wrote:

> Hi Elisabetta. No, you normally do not need to install software on all the
> compute nodes separately. It is quite common to use the 'modules'
> environment to manage software like this:
> http://www.admin-magazine.com/HPC/Articles/Environment-Modules
>
> Once you have numpy installed on a shared drive on the cluster, and have a
> Modules file in place, your users put this at the start of their job
> scripts:
>
> module load numpy
>
> You might also want to look at Easybuild:
> http://easybuild.readthedocs.io/en/latest/Introduction.html
> There are Easybuild 'recipes' for numpy. We use them where I work.
>
> On 17 January 2018 at 14:28, Elisabetta Falivene wrote:
>
> > Hi,
> > let's say I need to execute a python script with slurm. The script requires
> > a particular library installed on the system, like numpy.
> > If the library is not installed on the system, it is necessary to install
> > it on the master AND the nodes, right? This has to be done on each machine
> > separately, or is there a way to install it one time for all the machines
> > (master and nodes)?
> >
> > Elisabetta
Re: [slurm-users] Slurm and available libraries
Ciao Elisabetta,

I second John's reply. On our cluster we install software on the shared parallel filesystem with EasyBuild and use Lmod as a module front-end. Then users will simply load software in the job's environment by using the module command.

Feel free to ping me directly if you need specific help.

-- 
Davide Vanzo, PhD
Application Developer
Adjunct Assistant Professor of Chemical and Biomolecular Engineering
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
(615)-875-9137
www.accre.vanderbilt.edu

On 2018-01-17 07:28:31-06:00 slurm-users wrote:

Hi,
let's say I need to execute a python script with slurm. The script requires a particular library installed on the system, like numpy.
If the library is not installed on the system, it is necessary to install it on the master AND the nodes, right? This has to be done on each machine separately, or is there a way to install it one time for all the machines (master and nodes)?

Elisabetta
Re: [slurm-users] Slurm and available libraries
Hi Elisabetta.

No, you normally do not need to install software on all the compute nodes separately. It is quite common to use the 'modules' environment to manage software like this: http://www.admin-magazine.com/HPC/Articles/Environment-Modules

Once you have numpy installed on a shared drive on the cluster, and have a Modules file in place, your users put this at the start of their job scripts:

module load numpy

You might also want to look at Easybuild: http://easybuild.readthedocs.io/en/latest/Introduction.html
There are Easybuild 'recipes' for numpy. We use them where I work.

On 17 January 2018 at 14:28, Elisabetta Falivene wrote:

> Hi,
> let's say I need to execute a python script with slurm. The script requires
> a particular library installed on the system, like numpy.
> If the library is not installed on the system, it is necessary to install
> it on the master AND the nodes, right? This has to be done on each machine
> separately, or is there a way to install it one time for all the machines
> (master and nodes)?
>
> Elisabetta
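The "Modules file" John mentions is a small Tcl script. A minimal sketch for a numpy installed under a shared prefix might look like the following; the /shared/apps path, version, and python2.7 site-packages layout are hypothetical examples, not anything from this thread:

```tcl
#%Module1.0
## Hypothetical modulefile for a numpy build kept on a shared filesystem.
## Adjust the prefix and Python version to your actual installation.
proc ModulesHelp { } {
    puts stderr "Adds a shared numpy build to PYTHONPATH"
}
module-whatis "numpy on the shared filesystem"
prepend-path PYTHONPATH /shared/apps/numpy/1.14/lib/python2.7/site-packages
```

Dropped into a directory on MODULEPATH (e.g. as numpy/1.14), this makes `module load numpy` work from any node that mounts the share.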
[slurm-users] Slurm and available libraries
Hi,
let's say I need to execute a python script with slurm. The script requires a particular library installed on the system, like numpy.
If the library is not installed on the system, it is necessary to install it on the master AND the nodes, right? Does this have to be done on each machine separately, or is there a way to install it one time for all the machines (master and nodes)?

Elisabetta
Re: [slurm-users] Slurm not starting
Ciao Gennaro!

> > NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> > to
> > NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
> >
> > Now, slurm works and the nodes are running. There is only one minor problem:
> >
> > error: Node node04 has low real_memory size (7984 < 15999)
> > error: Node node02 has low real_memory size (3944 < 15999)
> >
> > Two nodes are still put into drain state. The nodes suffered physical
> > damage to some RAM and I had to physically remove it, so slurm thinks it
> > is not a good idea to use them.
> > Is it possible to make slurm use the nodes anyway?
>
> I think you can specify their properties on separate lines:
>
> NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
> NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
> NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*

It was possible indeed! It only required typing "UNKNOWN" instead of "UNKNOWN*".
Problem fully solved! Thank you very much!

Elisabetta
Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?
I tend to run a test program on an otherwise idle node, allocating (and actually using!) more and more memory, and then see when it starts swapping. I typically end up with between 1 and 1.5 GiB less than what "free" reports as the total memory.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
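A probe of that kind can be sketched in a few lines of Python: grab memory in fixed chunks and touch every page so the allocation is actually backed by RAM, while watching free/vmstat on the node for the onset of swapping. The chunk size and the demo limit below are arbitrary choices, not anything specific from the thread:

```python
def grab_chunks(limit_mib, chunk_mib=256):
    """Allocate and touch memory in chunk_mib pieces until limit_mib is held.

    Returns the list of chunks (kept alive so the memory stays allocated).
    Touching one byte per page matters: untouched pages may never be
    backed by physical RAM, so the probe would measure nothing.
    """
    chunks = []
    held = 0
    while held < limit_mib:
        size = min(chunk_mib, limit_mib - held) * 1024 * 1024
        buf = bytearray(size)
        for i in range(0, size, 4096):  # touch one byte per 4 KiB page
            buf[i] = 1
        chunks.append(buf)
        held += size // (1024 * 1024)
        print(f"holding {held} MiB")
    return chunks

if __name__ == "__main__":
    # Tiny demo limit; on a real node you would walk this up toward the
    # node's total RAM and note where swapping begins.
    grab_chunks(8, chunk_mib=4)
```

The MiB value where swapping starts, minus a safety margin, is then a reasonable RealMemory setting for that node type.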