Thank you, this helped me fix one problem:
root@our-slurm-master:~# sacctmgr -vvvv add cluster our-slurm
sacctmgr: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/accounting_storage_slurmdbd.so
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacctmgr: debug3: Success.
sacctmgr: debug2: slurm_connect failed: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.0.1:6819: Connection refused
sacctmgr: debug: slurmdbd: slurm_open_msg_conn to localhost:6819: Connection refused
sacctmgr: error: Problem talking to the database: Connection refused
After some searching, I found the mistake: I had copied DbdPort=7031 from the
previous slurmdbd.conf, while sacctmgr was connecting to port 6819. After fixing
this, the database connection works smoothly.
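For anyone hitting the same thing, here is a minimal sketch of how the two port
settings could be compared. conf_value and ports_match are hypothetical helpers,
not Slurm tools; they only handle plain Key=Value lines (one setting per line)
and fall back to 6819, the default for both DbdPort and AccountingStoragePort.

```shell
#!/bin/sh
# conf_value FILE KEY DEFAULT
# Hypothetical helper: print the value of the last uncommented "KEY=value" line
# in FILE, or DEFAULT when the key is absent. Keys are compared case-insensitively.
conf_value() {
    awk -F= -v key="$2" -v def="$3" '
        /^[ \t]*#/ { next }                       # skip comment lines
        {
            k = $1
            gsub(/[ \t\r]/, "", k)
            if (tolower(k) == tolower(key)) {
                v = $2
                sub(/[ \t]*#.*$/, "", v)          # drop a trailing comment
                gsub(/[ \t\r]/, "", v)
                val = v
            }
        }
        END { print (val != "" ? val : def) }
    ' "$1"
}

# ports_match SLURM_CONF SLURMDBD_CONF
# Succeeds when AccountingStoragePort (slurm.conf) equals DbdPort (slurmdbd.conf).
ports_match() {
    client=$(conf_value "$1" AccountingStoragePort 6819)
    server=$(conf_value "$2" DbdPort 6819)
    echo "slurm.conf AccountingStoragePort=$client, slurmdbd.conf DbdPort=$server"
    [ "$client" = "$server" ]
}
```

Usage, assuming the config files live under /etc/slurm-llnl/ (adjust to your
install):
ports_match /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurmdbd.conf || echo "port mismatch"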
I'm still looking into the "fatal: Unable to determine this slurmd's NodeName" error.
Do we absolutely need to declare the master node in slurm.conf as
NodeName=our-slurm-master?
I'm pretty sure this wasn't necessary previously, and it is unexpected in our
architecture.
Or maybe I misunderstood:
* the slurmdbd and slurmctld services have to run on the our-slurm-master node
* the slurmd service has to run on each of the nodes?
Regards,
Samuel
On 24. 08. 16 12:53, Carlos Fenoy wrote:
Re: [slurm-dev] Re: setup Slurm on Ubuntu 16.04 server
Have you added the cluster to the database?
Something like: "sacctmgr add cluster CLUSTER_NAME"
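A minimal sketch of that step, assuming CLUSTER_NAME is the ClusterName value
from slurm.conf. ensure_cluster is a hypothetical wrapper, not part of Slurm; it
relies on sacctmgr's -n (no header), -P (parsable output) and -i (no
confirmation prompt) options.

```shell
#!/bin/sh
# ensure_cluster NAME
# Hypothetical wrapper: register NAME with slurmdbd unless it is already listed.
# NAME is assumed to match ClusterName in slurm.conf.
ensure_cluster() {
    name=$1
    # -n: no header line, -P: '|'-delimited output without a trailing delimiter
    if sacctmgr -n -P show cluster format=cluster | grep -qixF -- "$name"; then
        echo "cluster '$name' already registered"
    else
        # -i: commit the change without the interactive confirmation prompt
        sacctmgr -i add cluster "$name"
    fi
}
```

Usage: ensure_cluster our-slurm, then restart slurmctld. If slurmdbd is not
reachable, both sacctmgr calls will fail with the connection error.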
On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel <samuel.ban...@epfl.ch> wrote:
Hi,
Thanks for your quick answer.
In fact, NodeName=DEFAULT is not the server's hostname; it sets default values
for all node definitions that follow it (see
http://slurm.schedmd.com/slurm.conf.html).
The server's hostname is "our-slurm-master". Here is the /etc/hosts (which
I think is correct):
root@our-slurm-master:~# cat /etc/hosts
127.0.0.1 localhost
123.234.1.2 our-slurm-master.epfl.ch our-slurm-master
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I checked /var/run/slurm-llnl/ ... it was automatically created and belongs
to slurm:slurm 755.
Also /var/log/slurm-llnl/ was automatically created and belongs to
slurm:slurm 755.
The NodeName and PartitionName part of the slurm.conf is an exact copy of
the previous install (slurm 2.6.1), in which we didn't declare the master
node as a compute node. Is that still possible?
The complete error message is :
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST; 1h 30min ago
Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 46746 (code=exited, status=1/FAILURE)
Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit entered failed state.
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Sorry for cutting it off.
The thing is that the DB is up:
root@our-slurm-master:~# netstat -ntaupe | grep 3306
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 115 1223659 40389/mysqld
I even tried:
root@our-slurm-master:~# mysql -u slurm -p slurm_acct_db
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 38
Server version: 10.0.25-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04
Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [slurm_acct_db]> show tables;
+-------------------------+
| Tables_in_slurm_acct_db |
+-------------------------+
| acct_coord_table        |
| acct_table              |
| clus_res_table          |
| cluster_table           |
| qos_table               |
| res_table               |
| table_defs_table        |
| tres_table              |
| txn_table               |
| user_table              |
+-------------------------+
10 rows in set (0.00 sec)
MariaDB [slurm_acct_db]> Bye
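Note that the checks above only show that mysqld answers on port 3306. The
slurmctld error is about the accounting database as seen through slurmdbd, so
the slurmdbd port may be worth checking too. A small sketch, assuming
netstat-style output on stdin and slurmdbd on its default port 6819;
port_listening is a hypothetical helper.

```shell
#!/bin/sh
# port_listening PORT
# Hypothetical helper: reads `netstat -ntl` style lines on stdin and succeeds
# when some socket in state LISTEN has a local address ending in :PORT.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")     # local address is the 4th column
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}
```

Usage: netstat -ntl | port_listening 6819 || echo "nothing listening on 6819"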
Regards,
Samuel
On 24. 08. 16 10:06, Raymond Wan wrote:
Hi Samuel,
On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <samuel.ban...@epfl.ch> wrote:
We're trying to set up Slurm on an Ubuntu 16.04 server.
I've attached the steps we followed for the setup.
It's been a long time since I installed Slurm on our Ubuntu 16.04
system. I'm not sure if I remember everything I did...
Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine this slurmd's NodeName
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process exited, code=exited status=1
Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node daemon.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered failed state.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with result 'exit-code'.
In your attachment, I noticed a line that says:
NodeName=Default ..
Is that correct? I'm just surprised someone would call their server
"default". (Of course, you can do that... :-) )
One thing I remember is that the node names you have in the COMPUTE
NODES section should match the names in your /etc/hosts file. When I
had the error above, I think this was the problem that I had...
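One way to check this, sketched under the assumption that the nodes are listed
as plain comma-separated names (bracket ranges such as node[01-10] would need
expanding first). unresolved_nodes is a hypothetical helper: it prints every
NodeName from a slurm.conf that does not appear as a name in a hosts file,
skipping the special NodeName=DEFAULT entry.

```shell
#!/bin/sh
# unresolved_nodes SLURM_CONF HOSTS_FILE
# Hypothetical helper: print each NodeName (plain names only) from SLURM_CONF
# that is missing from HOSTS_FILE. NodeName=DEFAULT is skipped because it only
# sets defaults for the node lines that follow it.
unresolved_nodes() {
    awk '
        FILENAME == ARGV[1] {                         # first file: hosts
            sub(/#.*/, "")                            # strip comments
            for (i = 2; i <= NF; i++) known[$i] = 1   # field 1 is the address
            next
        }
        /^[ \t]*#/ { next }                           # second file: slurm.conf
        {
            for (i = 1; i <= NF; i++) {
                if (tolower($i) !~ /^nodename=/) continue
                names = substr($i, 10)                # text after "NodeName="
                n = split(names, list, ",")
                for (j = 1; j <= n; j++)
                    if (toupper(list[j]) != "DEFAULT" && !(list[j] in known))
                        print list[j]
            }
        }
    ' "$2" "$1"
}
```

Usage, assuming the Ubuntu package location for slurm.conf:
unresolved_nodes /etc/slurm-llnl/slurm.conf /etc/hosts
Empty output means every listed node name was found in the hosts file.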
Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
Despite using the standard Slurm packages for Ubuntu, I had to do a
few things manually. One of them was creating directories such as
/var/run/slurm-llnl and making sure that the permissions of the
directory were correct. I think the owner had to be the Slurm user
("slurm" in my case).
I went through a loop a few times: I tried to start Slurm, it
complained about the permissions or even the existence of the PID or log
directory, I created it, and then it moved on to the next error. A bit
tedious...but eventually, it did stop complaining.
Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. Thi
I don't know what this error means, but perhaps you can copy the rest
of it and maybe I (or someone else) might have an idea.
Good luck!
Ray
--
Samuel Bancal
ENAC-IT
EPFL
--
--
Carles Fenoy