[slurm-dev] Re: question about federation

2017-11-01 Thread Ole Holm Nielsen
about the configuration. I guess the two slurmctld should be configured to use the same slurmdbd. Is that right? Or which is the right way? Thanks, regards zhangtao102...@126.com *From:* Ole Holm Nielsen <mailto:ole.h.niel..
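For 17.11 federation, a rough sketch of how two clusters might be tied together through one shared slurmdbd (cluster, federation and host names here are purely illustrative):
   # in slurm.conf on both clusters: point accounting at the same slurmdbd host
   AccountingStorageType=accounting_storage/slurmdbd
   AccountingStorageHost=dbserver
   # then create the federation in the shared accounting database
   sacctmgr add federation myfed clusters=clusterA,clusterB
   sacctmgr show federation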

[slurm-dev] Re: question about federation

2017-10-31 Thread Ole Holm Nielsen
On 10/31/2017 09:34 AM, zhangtao102...@126.com wrote: > I have noticed that slurm v17.11 will support federated clusters, but I can't find detailed documentation about it. Now, I have 2 questions about federated clusters: (1) When configuring a federated cluster, should I configure the two slur

[slurm-dev] Re: SLURM 17.02.8 not optimally scheduling jobs/utilizing resources

2017-10-25 Thread Ole Holm Nielsen
On 10/25/2017 01:52 PM, Holger Naundorf wrote: I'd really appreciate any help the SLURM wizards can provide! We suspect it's something to do with how we've set up QoS, or maybe we need to tweak the scheduler configuration in 17.02.8; however, there's no single clear path forward. Just let me know

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
10/23/2017 03:07 PM, Ole Holm Nielsen wrote: Hi Jin, I think that I always do your steps 3,4 in the opposite order: Restart slurmctld, then slurmd on nodes: > 3. Restart the slurmd on all nodes > 4. Restart the slurmctld Since you run a very old Slurm 15.08, perhaps you should up

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
15.08.7 I've included the slurm.conf rather than slurmdbd.conf. Cheers, Jin On Mon, Oct 23, 2017 at 8:25 AM Ole Holm Nielsen <Ole.H.Niel...@fysik.dtu.dk> (https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm) wrote: Hi Jin, Your slurmctld.log s

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
Hi Jin, Your slurmctld.log says "Node compute004 appears to have a different slurm.conf than the slurmctld" etc. This will happen if you didn't copy correctly the slurm.conf to the nodes. Please correct this potential error. Also, please specify which version of Slurm you're running. /Ole

[slurm-dev] Re: Running jobs are stopped and reqeued when adding new nodes

2017-10-23 Thread Ole Holm Nielsen
I have added nodes to an existing partition several times using the same procedure which you describe, and no bad side effects have been noticed. This is a very normal kind of operation in a cluster, where hardware may be added or retired from time to time, while the cluster of course contin

[slurm-dev] SC17 Slurm BOF session on Nov. 16

2017-10-05 Thread Ole Holm Nielsen
FYI: For Slurm users participating in the Supercomputing SC17 conference in Denver, Colorado, USA: SchedMD will present a Birds of a Feather (BOF) session: Time: Thursday, November 16th 12:15pm - 1:15pm Location: 201-203 http://sc17.supercomputing.org/presentation/?id=bof105&sess=sess312

[slurm-dev] Re: Setting up Environment Modules package

2017-10-05 Thread Ole Holm Nielsen
On 10/04/2017 06:11 PM, Mike Cammilleri wrote: I'm in search of a best practice for setting up Environment Modules for our Slurm 16.05.6 installation (we have not had the time to upgrade to 17.02 yet). We're a small group and had no explicit need for this in the beginning, but as we are growi

[slurm-dev] Re: Setting up Environment Modules package

2017-10-05 Thread Ole Holm Nielsen
On 10/05/2017 08:38 AM, Blomqvist Janne wrote: what we do is, roughly, a combination of your options #2 and #3. To start with, however, I'd like to point out that we're using Lmod instead of the old Tcl environment-modules. I'd really recommend you to do the same. So basically, we have our mo
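For anyone new to this, the day-to-day use of a shared module tree boils down to something like the following (the tree path and module name are made up; Lmod or the Tcl modules package must already be installed so the 'module' command exists):
   # e.g. added via a script in /etc/profile.d/ (path is illustrative)
   module use /cluster/modulefiles      # add the site-wide modulefile tree on shared storage
   module avail                         # users list what is available
   module load foo/1.0                  # and load it interactively or in job scripts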

[slurm-dev] Re: Upgrading Slurm

2017-10-04 Thread Ole Holm Nielsen
On 10/04/2017 09:38 AM, Elisabetta Falivene wrote: PS: if you know some good source of information about how to set up a cluster and Slurm besides the official docs, I would be grateful if you could share. It is difficult to find good material. I agree about the lack of availability of HowTo guides.

[slurm-dev] Re: Upgrading Slurm

2017-10-03 Thread Ole Holm Nielsen
On 10/03/2017 03:29 PM, Elisabetta Falivene wrote: I've been asked to upgrade our slurm installation. I have a slurm 2.3.4 on a Debian 7.0 wheezy cluster (1 master + 8 nodes). I've not installed it so I'm a bit confused about how to do this and how to proceed without destroying anything. I w

[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-20 Thread Ole Holm Nielsen
My Wiki page summarizes what's known about the pam_slurm_adopt setup. May I remind you of my previous answer: There is now a better understanding of how to use slurm-pam_slurm with Slurm 17.02.2 or later for limiting SSH access to nodes, see: https://bugs.schedmd.com/show_bug.cgi?id=4098
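The setup referred to essentially adds pam_slurm_adopt to the SSH account stack on each compute node, roughly like this (exact PAM file layout varies by distribution); SSH logins are then only accepted from users with a job on the node, and the session is adopted into that job's cgroup:
   # /etc/pam.d/sshd on each compute node
   account    required     pam_slurm_adopt.so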

[slurm-dev] Re: systemd slurm not starting on boot

2017-09-19 Thread Ole Holm Nielsen
using Ubuntu 16.04 on each head/compute node, and have installed slurm-wlm from the apt repositories. It is slurm 15.08.7. On Tue, Sep 19, 2017 at 11:07 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: If your OS is CentOS/RHEL 7, you may want to consult my Wiki pag

[slurm-dev] Re: systemd slurm not starting on boot

2017-09-19 Thread Ole Holm Nielsen
If your OS is CentOS/RHEL 7, you may want to consult my Wiki page about setting up Slurm: https://wiki.fysik.dtu.dk/niflheim/SLURM. If you do things correctly, there should be no problems :-) /Ole On 09/19/2017 05:02 PM, Kyle Mills wrote: Hello, I'm trying to get SLURM set up on a small cl

[slurm-dev] Re: Limiting SSH sessions to cgroups?

2017-09-19 Thread Ole Holm Nielsen
On 09/19/2017 03:25 PM, Jacob Chappell wrote: I found an old mailing list discussion about this. I'm curious if any progress has been made since and if there is a solution now? There is now a better understanding of how to use slurm-pam_slurm with Slurm 17.02.2 or later for limiting SSH acces

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated

2017-09-11 Thread Ole Holm Nielsen
I'm announcing an updated version of the node status tool "pestat" for Slurm. The job list for each node may now optionally include the (expected) job EndTime using the -E option. This information is very useful when you are waiting for a draining node to be cleared of jobs. For example, it

[slurm-dev] Re: An issue about slurm on CentOS 7.3

2017-08-28 Thread Ole Holm Nielsen
On 08/25/2017 06:19 PM, Nicholas McCollum wrote: I like your documentation but I would add a few things: I highly recommend not having the slurmctld start automatically upon reboot. If for some reason the slurm spool directory isn't available (on a shared folder) it will cause all the jobs to

[slurm-dev] Re: An issue about slurm on CentOS 7.3

2017-08-25 Thread Ole Holm Nielsen
On 08/25/2017 01:37 PM, Huijun HJ1 Ni wrote: > I installed slurm on my cluster whose OS is CentOS 7.3. After I completed the configuration, I found that it would hang while executing 'systemctl start slurm' on compute nodes (but it is OK on the control node where slurmctld runs

[slurm-dev] Re: how to configure 2 servers

2017-08-17 Thread Ole Holm Nielsen
On 08/17/2017 01:58 PM, Shlomit Afgin wrote: I followed https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/ to install Slurm. In the instructions, it uses the command '/usr/sbin/create-munge-key -r' to build munge.key on the server. Then the key file needs to be copied to each one
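The copying step might look roughly like this (node name is an example; the key must be identical on every node, owned by munge and readable only by munge):
   scp /etc/munge/munge.key node01:/etc/munge/munge.key
   ssh node01 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
   ssh node01 'systemctl restart munge'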

[slurm-dev] Re: Slurmd v15 to v17 stopped working (slurmd: fatal: Unable to determine this slurmd's NodeName) on ControlMachine

2017-08-14 Thread Ole Holm Nielsen
Hi Olivier, You might also want to consult my HowTo wiki for Slurm on CentOS 7: https://wiki.fysik.dtu.dk/niflheim/SLURM Lots of little details are discussed in this wiki. /Ole On 08/10/2017 03:04 PM, LAHAYE Olivier wrote: how stupid I am, you're perfectly right! How the hell was I unable to se

[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen
De Boelelaan 1083 1081 HV Amsterdam, The Netherlands T +31 20 598 7620 (at the office Tuesday and Friday From 9:30am to 1:30pm) On 20 Jul 2017, at 15:57, Ole Holm Nielsen <mailto:ole.h.niel...@fysik.dtu.dk>> wrote: As a small contribution to the Slurm community, I've moved my collec

[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen
On 07/21/2017 12:00 PM, Loris Bennett wrote: Thanks for sharing your tools. Here are some brief comments I've updated the following tools on https://github.com/OleHolmNielsen/Slurm_tools, see the changes below. - psjob/psnode - The USERLIST variable makes the commands a bit brittle, si

[slurm-dev] Re: ANNOUNCE: A collection of Slurm tools

2017-07-21 Thread Ole Holm Nielsen
Hi Loris, Thanks so much for your relevant comments! On 07/21/2017 12:00 PM, Loris Bennett wrote: Hi Ole, Ole Holm Nielsen writes: As a small contribution to the Slurm community, I've moved my collection of Slurm tools to GitHub at https://github.com/OleHolmNielsen/Slurm_tools.

[slurm-dev] ANNOUNCE: A collection of Slurm tools

2017-07-20 Thread Ole Holm Nielsen
As a small contribution to the Slurm community, I've moved my collection of Slurm tools to GitHub at https://github.com/OleHolmNielsen/Slurm_tools. These are tools which I feel make the daily cluster monitoring and management a little easier. The following Slurm tools are available: * pes

[slurm-dev] Re: How to set 'future' node state?

2017-07-15 Thread Ole Holm Nielsen
On 14-07-2017 23:26, Robbert Eggermont wrote: We're adding some nodes to our cluster (17.02.5). In preparation, we've defined the nodes in our slurm.conf with "State=FUTURE" (as described in the man page). But it doesn't work like this, because when we start the slurmd on the nodes, the nodes i

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen
On 07/06/2017 04:31 PM, Uwe Sauter wrote: Alternatively you can systemctl disable firewalld.service systemctl mask firewalld.service yum install iptables-services systemctl enable iptables.service ip6tables.service and configure iptables in /etc/sysconfig/iptables and

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen
s. I turned off the firewall on both machines but still no luck. I can confirm that no managed switch is preventing the nodes from communicating. If you check the log file, there is communication for about 4 mins and then the node state

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen
I can confirm that no managed switch is preventing the nodes from communicating. If you check the log file, there is communication for about 4 mins and then the node state goes down. Any other idea?

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2017-07-06 Thread Ole Holm Nielsen
I'd like a second revival of this thread! The full thread is available at https://groups.google.com/forum/#!msg/slurm-devel/oDoHPoAbiPQ/q9pQL2Uw3y0J We're in the process of upgrading Slurm from 16.05 to 17.02. I'd like to be certain that our MPI libraries don't require a specific library

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen
On 07/05/2017 11:40 AM, Felix Willenborg wrote: in my network I encountered that managed switches were preventing necessary network communication between the nodes, on which SLURM relies. You should check if you're using managed switches to connect nodes to the network and if so, if they're bloc

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen
On 07/05/2017 11:25 AM, Ole Holm Nielsen wrote: Could it be that you have enabled the firewall on the compute nodes? The firewall must be turned off (this requirement isn't documented anywhere). You may want to go through my Slurm deployment Wiki at https://wiki.fysik.dtu.dk/nif

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread Ole Holm Nielsen
Your help will be greatly appreciated, Sincerely, Said. -- Ole Holm Nielsen PhD, Manager of IT services Department of Physics, Technical University of Denmark, Building 307, DK-2800 Kongens Lyngby, Denmark E-mail: ole.h.niel...@fysik.dtu.dk Homepage: http://dcwww.fysik.dtu.dk/~ohnielse

[slurm-dev] Re: Topology for message aggregation

2017-07-03 Thread Ole Holm Nielsen
On 07/03/2017 01:18 PM, Ulf Markwardt wrote: is there a chance to explicitly assign nodes (e.g. machines outside the HPC machine) for message aggregation? All I see at the moment is that Slurm uses the (high-speed interconnect) topology for this. But I do not want to put communication load (noi

[slurm-dev] Re: Multifactor Priority Plugin for Small clusters

2017-07-03 Thread Ole Holm Nielsen
On 07/03/2017 08:11 AM, Christopher Samuel wrote: On 03/07/17 16:02, Loris Bennett wrote: I don't think you can achieve what you want with Fairshare and Multifactor Priority. Fairshare looks at distributing resources fairly between users over a *period* of time. At any *point* in time it is
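For context, the multifactor plugin combines weighted factors roughly as follows (per the priority_multifactor documentation; each factor is a value between 0.0 and 1.0):
   Job_priority = PriorityWeightAge       * age_factor
                + PriorityWeightFairshare * fairshare_factor
                + PriorityWeightJobSize   * job_size_factor
                + PriorityWeightPartition * partition_factor
                + PriorityWeightQOS       * QOS_factor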

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.52

2017-06-28 Thread Ole Holm Nielsen
8 88.07 23900 181* 91683 user01 a083 xeon8*alloc 8 88.06 23900 172* 91683 user01 The -s option is useful for checking on possibly unusual node states, for example: # pestat -s mixed -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: slurm.conf for single node

2017-06-28 Thread Ole Holm Nielsen
On 06/28/2017 11:27 AM, Sofiane Bendoukha wrote: what the right configuration for a single node (just for testing)? How should the slurm.conf configured? Perhaps you can find some inspiration in our Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration ? Our main Slurm Wiki page i
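As a rough illustration of how small a single-node test slurm.conf can be (hostname, CPU count and paths are placeholders; generate a real file with the online configurator or see the Wiki):
   ClusterName=test
   ControlMachine=testhost
   SlurmUser=slurm
   StateSaveLocation=/var/spool/slurmctld
   SelectType=select/cons_res
   SelectTypeParameters=CR_Core
   NodeName=testhost CPUs=4 State=UNKNOWN
   PartitionName=debug Nodes=testhost Default=YES MaxTime=INFINITE State=UP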

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.51

2017-06-28 Thread Ole Holm Nielsen
91683 user01 a067 xeon8*alloc 8 88.07 23900 181* 91683 user01 a083 xeon8*alloc 8 88.06 23900 172* 91683 user01 The -s option is useful for checking on possibly unusual node states, for example: # pestat -s mixed -- Ole Holm

[slurm-dev] Re: slurm-dev Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-27 Thread Ole Holm Nielsen
On 26-06-2017 17:20, Adrian Sevcenco wrote: On 06/22/2017 01:34 PM, Ole Holm Nielsen wrote: I'm announcing an updated version 0.50 of the node status tool "pestat" for Slurm. I discovered how to obtain the node Free Memory with sinfo, so now we can do nice things with mem

[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen
t for that purpose to vet slurm upgrades before we roll them on the production cluster. Thus far no problems. However, paranoia is usually a good thing for cases like this. -Paul Edmon- On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote: On 06/26/2017 01:24 PM, Loris Bennett wrote: We

[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen
ion for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams." - Patrisse Cullors, Black Lives Matter founder On 26 June 2017 at 20:04, Ole Holm Nielsen wrote: We're planning to upgrade

[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-26 Thread Ole Holm Nielsen
te Usage: clush [options] command clush: error: No node to run on. Could you kindly explain this (and perhaps add examples to the documentation)? > Cheers, -- Ole Holm Nielsen PhD, Manager of IT services Department of Physics, Technical University of Denmark, Building 307, DK-2800 Kongens Lyng

[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-26 Thread Ole Holm Nielsen
On 23-06-2017 17:20, Belgin, Mehmet wrote: One thing I noticed is that pestat reports zero Freemem until a job is allocated on nodes. I’d expect it to report the same value as Memsize if no jobs are running. I wanted to offer this as a suggestion since zero free memory on idle nodes may be a b

[slurm-dev] Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen
We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step seems to me to be the upgrade of the slurmdbd database, which may also take tens of minutes. I thought it's a good idea to test the slurmdbd database upgrade locally on a drained compute node in order to verify both
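The dry run described here could look roughly like this (slurm_acct_db is the default database name; paths are illustrative, and the test node needs MariaDB plus the new slurmdbd and a slurmdbd.conf pointing at its local database):
   # on the production database host
   mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db.sql
   # on the drained test node
   mysql -e 'CREATE DATABASE slurm_acct_db'
   mysql slurm_acct_db < slurm_acct_db.sql
   time slurmdbd -D -vvv     # run the new slurmdbd in the foreground and watch the conversion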

[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-23 Thread Ole Holm Nielsen
On 06/22/2017 06:22 PM, Kilian Cavalotti wrote:> ClusterShell is incredibly useful, it provides not only a parallel shell for remote execution (and file distribution, output aggregation or diff'ing...), but also an event-driven Python library that can be used in your Python scripts, and CLI too

[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Ole Holm Nielsen
On 06/22/2017 06:39 PM, Michael Jennings wrote: On Thursday, 22 June 2017, at 04:19:04 (-0600), Loris Bennett wrote: rpmbuild --rebuild --with=slurm --without=torque pdsh-2.26-4.el6.src.rpm Remove the equals signs. I have no problems building pdsh 2.29 via: rpmbuild --rebuild --with
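That is, dropping the equals signs gives something like:
   rpmbuild --rebuild --with slurm --without torque pdsh-2.26-4.el6.src.rpm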

[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Ole Holm Nielsen
On 06/22/2017 12:18 PM, Loris Bennett wrote:> I have just realised that pdsh, which was what I wanted the consolidated list for, has a Slurm module, which knows about Slurm jobs. I followed your instructions here: https://wiki.fysik.dtu.dk/niflheim/SLURM#pdsh-parallel-distributed-shell wi
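With pdsh built with the slurm module, targeting the nodes of a running job is then as simple as (job id is an example):
   pdsh -j 12345 uptime      # run the command on all nodes allocated to job 12345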

[slurm-dev] slurm-dev Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-22 Thread Ole Holm Nielsen
possibly unusual node states, for example: # pestat -s mixed -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-22 Thread Ole Holm Nielsen
easier manipulation of node lists without one having to google the appropriate sed magic. -- Ole Holm Nielsen PhD, Manager of IT services Department of Physics, Technical University of Denmark, Building 307, DK-2800 Kongens Lyngby, Denmark E-mail: ole.h.niel...@fysik.dtu.dk Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/ Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620
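For reference, the two directions of the conversion are (node names are examples):
   scontrol show hostnames n[001-003],n007     # expand a hostlist to one hostname per line
   scontrol show hostlist n001,n002,n003,n007  # fold a comma-separated list back into ranges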

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Ole Holm Nielsen
On 06/20/2017 04:32 PM, Loris Bennett wrote: We do our upgrades while full production is up and running. We just stop the Slurm daemons, dump the database and copy the statesave directory just in case. We then do the update, and finally restart the Slurm daemons. We only lost jobs once during
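Spelled out as commands, that procedure is roughly the following (database name is the default; the statesave path must match your StateSaveLocation and is site-specific):
   systemctl stop slurmctld slurmdbd
   mysqldump slurm_acct_db > /root/slurm_acct_db-backup.sql
   cp -a /var/spool/slurmctld /root/statesave-backup
   # ... install the new Slurm packages ...
   systemctl start slurmdbd
   systemctl start slurmctld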

[slurm-dev] Re: Can't get formatted sinfo to work...

2017-06-20 Thread Ole Holm Nielsen
Hi Mehmet, Perhaps you need to configure NHC to use the short hostname, see the example in https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check /Ole On 06/19/2017 05:09 PM, Belgin, Mehmet wrote: Thank you Loris, it was my bad. I should have used the short hostname, whic

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.41

2017-06-16 Thread Ole Holm Nielsen
# pestat -s mixed -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: sinfo

2017-05-24 Thread Ole Holm Nielsen
On 24-05-2017 21:03, A wrote: Couldn't find this in the man page. What's the syntax for listing which nodes are free (not allocated/mixed) with sinfo? sinfo -t idle
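A slightly more verbose variant prints one line per idle node with its partition and state:
   sinfo -N -t idle -o "%N %P %T"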

[slurm-dev] Tools for using strigger to monitor nodes?

2017-05-23 Thread Ole Holm Nielsen
I'd like to configure E-mail notifications of failing nodes. I already use the LBL NHC (Node Health Check) on the compute nodes to send alerts, but one may also use the Slurm strigger mechanism on the slurmctld host. The examples in http://slurm.schedmd.com/strigger.html are quite rudimenta
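A minimal sketch of the strigger approach (the script path is made up; the trigger program receives the affected node list as its argument, and triggers fire only once, so the script typically re-registers itself, much like the example in the strigger man page):
   # register the trigger (as a suitably privileged user):
   strigger --set --node --down --program=/usr/local/sbin/slurm_node_down
   # where /usr/local/sbin/slurm_node_down is something like:
   #   #!/bin/bash
   #   echo "Slurm reports node(s) DOWN: $*" | mail -s "Slurm nodes DOWN: $*" root
   #   strigger --set --node --down --program=$0    # re-arm the one-shot trigger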

[slurm-dev] Re: How to get pids of a job

2017-05-11 Thread Ole Holm Nielsen
I have written a small tool to display the user processes in a job: https://ftp.fysik.dtu.dk/Slurm/sshjob An example output is: # sshjob 57811 Nodelist for job-id 57811: a128 Node usage: NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:* a128 P

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.40

2017-05-09 Thread Ole Holm Nielsen
use "pestat -f" all the time because it prints and flags (in color) only the nodes which have an unexpected CPU load or node status. The -s option is useful for checking on possibly unusual node states, for example "pestat -s mix". -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" updated version 0.21

2017-05-09 Thread Ole Holm Nielsen
ibsw7 Nodes=g[041-044,049-052,057-060,065-068,073-076] SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036] It would be great if other sites could test this tool on their Infiniband network and report bugs or suggest improvements. -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Ole Holm Nielsen
On 05/09/2017 09:14 AM, Janne Blomqvist wrote: On 2017-05-07 15:29, Ole Holm Nielsen wrote: I'm announcing an initial version 0.1 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. I have also created one, at https://github.com/jabl/ibtopotool You need the

[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-08 Thread Ole Holm Nielsen
-- cat < Sincerely damien On 07 May 2017, at 14:29, Ole Holm Nielsen wrote: I'm announcing an initial version 0.1 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. I had to create a Slurm topology.conf file and needed an automated way to get the

[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-08 Thread Ole Holm Nielsen
NOT OK 26c26 < cat <&2 --- cat < Sincerely damien On 07 May 2017, at 14:29, Ole Holm Nielsen wrote: I'm announcing an initial version 0.1 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. I had to create a Slurm topology.conf file and needed an

[slurm-dev] Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-07 Thread Ole Holm Nielsen
SwitchName=ibsw8 Nodes=g[001-004,009-012,017-020,025-028,033-036] # Merging all switches in a top-level spine switch SwitchName=spineswitch Switches=ibsw[1-8] It would be great if other sites could test this tool on their Infiniband network and report bugs or suggest improvements. -- Ole Holm Niels

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen
Sorry, the HTTP URL is http://ftp.fysik.dtu.dk/Slurm/pestat On 05/03/2017 05:53 PM, Ole Holm Nielsen wrote: On 05/03/2017 04:44 PM, Andrej Prsa wrote: I'll be expanding the functionality of pestat over time, so please send me comments and bug reports. Thanks for sharing! I had to c

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen
On 05/03/2017 04:44 PM, Andrej Prsa wrote: I'll be expanding the functionality of pestat over time, so please send me comments and bug reports. Thanks for sharing! I had to change the hardcoded paths, so perhaps you should make the paths variables at the top of the script or look for the sin

[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread Ole Holm Nielsen
t the 'pestat' utility for Slurm and for PBS on a site which uses http? The reason is many (most?) corporate networks block ftp access. Thank you On 3 May 2017 at 09:06, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: I'm announcing an updated version 0.

[slurm-dev] Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread Ole Holm Nielsen
-V: Version information I use "pestat -f" all the time because it prints and flags (in color) only the nodes which have an unexpected CPU load or node status. -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-03 Thread Ole Holm Nielsen
On 05/03/2017 08:47 AM, Bjørn-Helge Mevik wrote: Ole Holm Nielsen writes: I'm announcing an initial version 0.1 of the node status tool "pestat" for Slurm. Interesting tool! Thanks! This tool needs to expand Slurm hostlists like a[095,097-098] into a095,a097,a098, so I

[slurm-dev] Announce: Node status tool "pestat" version 0.1 for Slurm

2017-05-02 Thread Ole Holm Nielsen
mple to install this as an RPM package, see https://wiki.fysik.dtu.dk/niflheim/SLURM#expanding-host-lists I'll be expanding the functionality of pestat over time, so please send me comments and bug reports. My ToDo list would include: 1. Report current node memory usage of jobs. 2. Flag j

[slurm-dev] Re: New slurm user question.

2017-05-01 Thread Ole Holm Nielsen
I've been missing such a node status tool for Slurm for a long time! For Torque clusters I wrote the tool "pestat" (available in ftp://ftp.fysik.dtu.dk/pub/Torque/) and we use it all the time. Here's my quick stab at writing a "pestat" tool for Slurm: ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat

[slurm-dev] Re: Nodes in state 'down*' despite slurmd running

2017-04-05 Thread Ole Holm Nielsen
On 04/05/2017 03:59 PM, Loris Bennett wrote: We are running 16.05.10-2 with power-saving. However, we have noticed a problem recently when nodes are woken up in order to start a job. The node will go from 'idle~' to, say, 'mixed#', but then the job will fail and the node will be put in 'down*'

[slurm-dev] TaskProlog script examples?

2017-03-28 Thread Ole Holm Nielsen
in /etc/profile.d/ but this is ignored in tasks started by slurmd. Other useful TaskProlog tasks could be to set up scratch directories for jobs and wipe them again in TaskEpilog. Does anyone have good scripts for this? Thanks a lot, Ole -- Ole Holm Nielsen Department of Physics, Technical University of Denmark
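For what it's worth, the TaskProlog stdout protocol makes the environment case fairly simple; a sketch (variable name and scratch path are purely illustrative):
   #!/bin/bash
   # TaskProlog runs as the job user; slurmd interprets its stdout:
   #   "export NAME=value"  -> set NAME in the task's environment
   #   "print some text"    -> write the text to the task's stdout
   echo "export SCRATCH_DIR=/scratch/${SLURM_JOB_ID}"
   echo "print TaskProlog: scratch directory is /scratch/${SLURM_JOB_ID}"
Creating and wiping the scratch directories themselves is usually better placed in the root-run Prolog/Epilog.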

[slurm-dev] slurm-dev Re: check_fs_mount error on nodes

2017-02-24 Thread Ole Holm Nielsen
tmpfs" -f "/sys/fs/cgroup" testnode1 || check_fs_mount_rw -t "pstore" -s "pstore" -f "/sys/fs/pstore" testnode1 || check_fs_mount_rw -t "configfs" -s "configfs" -f "/sys/kernel/config" testnode1 || check_fs_mount_rw -t "

[slurm-dev] Re: check_fs_mount error on nodes

2017-02-23 Thread Ole Holm Nielsen
for your Linux version. You do need to configure NHC appropriately for your servers, however, so check the file /etc/nhc/nhc.conf. Which 'check_fs_mount' lines are in your nhc.conf? Which Linux OS do you use? Which NHC version do you use? /Ole -- Ole Holm Nielsen Department of Physics, Technical University of Denmark,
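For comparison, a typical check_fs_mount line in nhc.conf looks like this (server and mount point are examples):
   * || check_fs_mount_rw -t nfs -s "fileserver:/export/home" -f /home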

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ole Holm Nielsen
We limit the cpu times in /etc/security/limits.conf so that user processes have a maximum of 10 minutes. It doesn't eliminate the problem completely, but it's fairly effective on users who misunderstood the role of login nodes. On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon" mailto:bacon4
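A sketch of that limits.conf approach on the login nodes (the cpu limit is in minutes; exemptions for admin accounts or groups can be added with further lines):
   # /etc/security/limits.conf
   *        hard    cpu     10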

[slurm-dev] Slurm daemons started incorrectly on CentOS/RHEL 7 (Systemd systems)

2017-01-12 Thread Ole Holm Nielsen
gi?id=3371. It is expected to be resolved in Slurm 17.11. Ole Holm Nielsen Technical University of Denmark

[slurm-dev] Re: mail job status to user

2017-01-09 Thread Ole Holm Nielsen
On 01/10/2017 12:46 AM, Christopher Samuel wrote: On 10/01/17 09:36, Steven Lo wrote: Torque/Maui has the ability to mail user about the job status automatically when job exit. Does Slurm has the same feature without using the SBATCH command in the job submission? Unless Torque has changed
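For completeness, the built-in mechanism in Slurm is per-job mail options rather than a global default, e.g. in the batch script (address is an example; MailProg in slurm.conf controls which mail command is used):
   #SBATCH --mail-type=END,FAIL
   #SBATCH --mail-user=user@example.com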

[slurm-dev] Re: preemptive fair share scheduling

2017-01-06 Thread Ole Holm Nielsen
On 01/06/2017 01:07 PM, Sophana Kok wrote: Hi again, is there someone from schedmd to respond? On the contact form, it is written to contact the mailing list for technical questions. If the Slurm community doesn't get you going, you need to buy commercial support, see https://www.schedmd.com/

[slurm-dev] Re: Two partitions with same compute nodes

2016-11-29 Thread Ole Holm Nielsen
On 11/29/2016 12:27 PM, Daniel Ruiz Molina wrote: I would like to know if it would be possible in SLURM to configure two partitions, composed of the same nodes, but one for use with GPUs and the other one only for OpenMPI. This configuration was allowed in Sun Grid Engine because GPU resource was
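Overlapping partitions are straightforward in slurm.conf; a sketch (node names, counts and GRES are illustrative), with the GPUs handled as a GRES rather than by partition, so jobs request them with e.g. sbatch -p gpu --gres=gpu:1:
   GresTypes=gpu
   NodeName=node[01-08] CPUs=32 Gres=gpu:2 State=UNKNOWN
   PartitionName=gpu Nodes=node[01-08] MaxTime=INFINITE State=UP
   PartitionName=mpi Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
   # each node also needs a matching gres.conf describing its GPU devices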

[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Ole Holm Nielsen
From: Ole Holm Nielsen [mailto:ole.h.niel...@fysik.dtu.dk] Sent: Sunday, November 27, 2016 11:12 PM To: slurm-dev Subject: [slurm-dev] Re: PIDfile on CentOS7 and compute nodes Hi Brian, Did you build and install the Slurm RPMs on CentOS 7, or is it a manual install? Which Slurm and CentO

[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Ole Holm Nielsen
On 11/28/2016 09:36 AM, Janne Blomqvist wrote: On 2016-11-25 18:03, Andrus, Brian Contractor wrote: All, I have been having an issue where if I try to run the slurm daemon under systemd, it hangs for some time and then errors out with: If you're using rpm's built using the rpm spec file in
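One frequent cause of this hang-then-error symptom (offered only as a hint, not necessarily Brian's issue) is that the PIDFile in the systemd unit shipped by the spec file and SlurmdPidFile in slurm.conf do not point at the same file; they have to agree, e.g.:
   # slurm.conf
   SlurmdPidFile=/var/run/slurmd.pid
   # /usr/lib/systemd/system/slurmd.service (or a drop-in override)
   [Service]
   PIDFile=/var/run/slurmd.pid
   # then: systemctl daemon-reload && systemctl restart slurmd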

[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-27 Thread Ole Holm Nielsen
Hi Brian, Did you build and install the Slurm RPMs on CentOS 7, or is it a manual install? Which Slurm and CentOS versions do you run? We run Slurm 16.05 on CentOS 7, see instructions in our Wiki https://wiki.fysik.dtu.dk/niflheim/SLURM /Ole On 11/25/2016 05:04 PM, Andrus, Brian Contract

[slurm-dev] Re: start munge again after boot?

2016-11-07 Thread Ole Holm Nielsen
On 11/07/2016 11:21 PM, Lachlan Musicman wrote: Peixin, What operating system are you using? I found on CentOS 7 I needed to create a tmpfiles.d entry to make sure that /var/run/munge was created correctly on boot every time. https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html
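Such a tmpfiles.d entry can be as small as the following (ownership assumes the usual munge user/group); it is applied at the next boot or immediately with systemd-tmpfiles --create munge.conf:
   # /etc/tmpfiles.d/munge.conf
   d /var/run/munge 0755 munge munge -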

[slurm-dev] Re: Requirement of no firewall on compute nodes?

2016-10-28 Thread Ole Holm Nielsen
Hi Neile, I agree that you can run a firewall to block off all non-cluster nodes. The requirement is that between compute nodes, all ports must be opened in the firewall (in case you use one). /Ole On 10/27/2016 05:11 PM, Neile Havens wrote: Can anyone confirm that Moe's statement is stil
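With firewalld, one way to keep a firewall on the nodes while leaving intra-cluster traffic unrestricted is to put the cluster-internal interface into the trusted zone (the interface name is site-specific):
   firewall-cmd --permanent --zone=trusted --change-interface=eth1
   firewall-cmd --reload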

[slurm-dev] Re: Requirement of no firewall on compute nodes?

2016-10-28 Thread Ole Holm Nielsen
offey High-Performance Computing Northern Arizona University 928-523-1167 On 10/27/16, 5:58 AM, "Ole Holm Nielsen" wrote: In the process of developing our new cluster using Slurm, I've been bitten by the firewall settings on the compute nodes preventing MPI jobs from

[slurm-dev] Re: slurm network address problem ?

2016-10-27 Thread Ole Holm Nielsen
You might want to check out my Wiki-page for setting up Slurm on CentOS 7.2: https://wiki.fysik.dtu.dk/niflheim/SLURM. Perhaps you'll solve the problem using this information? On 10/27/2016 04:14 PM, Mikhail Kuzminsky wrote: I worked w/PBS and SGE; now I'm beginner w/slurm, and installed slur

[slurm-dev] Requirement of no firewall on compute nodes?

2016-10-27 Thread Ole Holm Nielsen
In the process of developing our new cluster using Slurm, I've been bitten by the firewall settings on the compute nodes preventing MPI jobs from spawning tasks on remote nodes. I now believe that Slurm actually has a requirement that compute nodes must have their Linux firewall disabled. I

[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-27 Thread Ole Holm Nielsen
On 10/27/2016 09:42 AM, Loris Bennett wrote: So is restarting slurmctld the only way to let it pick up changes in slurm.conf? No. You can also do scontrol reconfigure This does not restart slurmctld. Question: How are the slurmd daemons notified about the changes in slurm.conf? Will s
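On the question above: 'scontrol reconfigure' instructs all Slurm daemons, slurmd included, to re-read their configuration file, so the usual sequence is roughly (copy tool and paths are examples; slurm.conf must be identical everywhere):
   # push the identical slurm.conf to every node first, e.g.:
   pdcp -w node[01-99] /etc/slurm/slurm.conf /etc/slurm/
   # then tell all daemons to re-read it:
   scontrol reconfigure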

[slurm-dev] Re: Packaging for fedora (and EPEL)

2016-10-17 Thread Ole Holm Nielsen
FWIW, I've documented how to install Slurm 16.05 on CentOS 7.2 in this Wiki page: https://wiki.fysik.dtu.dk/niflheim/SLURM /Ole On 10/17/2016 09:48 AM, Andrew Elwell wrote: I see from https://bugzilla.redhat.com/show_bug.cgi?id=1149566 that there have been a few unsuccessful attempts to get

[slurm-dev] Re: rpm dependencies in 16.05.5

2016-10-13 Thread Ole Holm Nielsen
I have a Wiki page describing how to install Munge and Slurm on CentOS 7: https://wiki.fysik.dtu.dk/niflheim/SLURM I hope this may help. /Ole On 10/13/2016 02:38 PM, Andrew Elwell wrote: Hi folks, I've just built 16.05.5 into rpms (using the rpmbuild -ta slurm*.tar.bz2 method) to update a C

[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Ole Holm Nielsen
On 08/17/2016 03:49 PM, Barbara Krasovec wrote: I upgraded SLURM from 15.08 to 16.05 without draining the nodes and without losing any jobs, this was my procedure: I increased timeouts in slurm.conf: SlurmctldTimeout=3600 SlurmdTimeout=3600 Question: When you change parameters in slurm.conf

[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-17 Thread Ole Holm Nielsen
On 08/03/2016 03:04 AM, Christopher Samuel wrote: So you always go in the order of upgrading: * slurmdbd * slurmctld [recompile all plugins, MPI stacks, etc that link against Slurm] * slurmd We use a health check script that defines the version of Slurm that is considered production so we can

[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-16 Thread Ole Holm Nielsen
On 03-08-2016 03:04, Christopher Samuel wrote: On 03/08/16 03:13, Balaji Deivam wrote: Right now we are using Slurm 14.11.3 and planning to upgrade to the latest version 16.5.3. Could you please share the latest upgrade steps document link? Google is your friend: http://slurm.schedmd.com/q

[slurm-dev] How to configure PAM with SLURM on CentOS 7?

2016-07-18 Thread Ole Holm Nielsen
SLURM access restrictions on compute nodes? Thanks, Ole -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: Trying to get a simple slurm cluster going

2016-07-17 Thread Ole Holm Nielsen
Perhaps my SLURM HowTo Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM could help you getting started. We're using CentOS 7.2, but most of the setup will be the same or similar for CentOS 6. /Ole On 07/18/2016 12:52 AM, P. Larry Nelson wrote: While I am in search of real hardware on wh

[slurm-dev] srun jobs fail on login node (with compute nodes on a private network)

2016-07-14 Thread Ole Holm Nielsen
empirical evidence that SLURM and Torque have completely different designs for interactive jobs. Can anyone confirm this? FYI: I'm writing a SLURM HowTo document with my experiences in https://wiki.fysik.dtu.dk/niflheim/SLURM. Comments are welcome. -- Ole Holm Nielsen Department of Physics, Technical University of Denmark

[slurm-dev] Re: NHC and disk / dell server health

2016-01-27 Thread Ole Holm Nielsen
useful for catching mostly memory errors. Use this NHC check in nhc.conf: # Check Machine Check Exception (MCE, mcelog) errors (Intel only, not AMD) * || check_hw_mcelog You'll need to have the mcelogd daemon running. Make a manual test by: mcelog --client -- Ole Holm Nielsen PhD, Manager o

[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ole Holm Nielsen
On 01/06/2016 02:54 PM, Novosielski, Ryan wrote: Sorry, was referring to: "running RHEL 7 there is a bug systemctl start/stop does not work on RHEL 7 ." I should have clicked the link. In any case, I've not noticed this problem so perhaps it was fixe

[slurm-dev] Re: more detailed installation guide

2016-01-06 Thread Ole Holm Nielsen
On 01/06/2016 06:03 AM, Novosielski, Ryan wrote: I haven't gotten all the way through it, but this is really good so far! I am curious though -- not seen this problem -- what is the thing about RHEL7 and systemd? I've not seen any problem there. I use CentOS 7.1 and 7.2. Maybe not an issue anymo

[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ole Holm Nielsen
01/05/2016 01:26 PM, Ole Holm Nielsen wrote: On 01/05/2016 12:12 PM, Randy Bin Lin wrote: I was wondering if anyone has a more detailed installation guide than the official guide below: http://slurm.schedmd.com/quickstart_admin.html I got the general idea how to install slurm on a local

[slurm-dev] Re: more detailed installation guide

2016-01-05 Thread Ole Holm Nielsen
(and RHEL 7) configurations, but a number of points should be valid for other Linuxes as well. If there are errors or missing points in this Wiki page, please write to me. Thanks, Ole -- Ole Holm Nielsen PhD, Manager of IT services Department of Physics, Technical University of Denmark

[slurm-dev] srun: error: Unable to allocate resources: Requested node configuration is not available

2015-12-17 Thread Ole Holm Nielsen
es=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s I believe that the nodes are configured identically, except for their hardware differences. -- Ole Holm Nielsen Department of Physics, Technical University of Denmark
