Hi,
Thanks for your quick answer.
In fact NodeName=DEFAULT is not the server's hostname; it sets default values
for all subsequently defined nodes ( http://slurm.schedmd.com/slurm.conf.html ).
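To illustrate, here is a minimal sketch of how a DEFAULT line combines with real node definitions. The node names (node01, node02), the partition name and all values are made up for illustration, and the awk filter is only a rough way to list the real node entries; it is not something Slurm provides.

```shell
#!/bin/sh
# Hypothetical slurm.conf fragment: node names and values are illustrative only.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# DEFAULT is not a host; it sets values inherited by the NodeName lines below it.
NodeName=DEFAULT CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node01
NodeName=node02 RealMemory=128000
PartitionName=batch Nodes=node01,node02 Default=YES State=UP
EOF

# List the node definitions that name real hosts, skipping the DEFAULT template line.
list_real_nodes() {
    awk '/^NodeName=/ {
        split($1, kv, "=")
        if (toupper(kv[2]) != "DEFAULT") print kv[2]
    }' "$1"
}

list_real_nodes "$conf"
```

In this sketch only node01 and node02 are actual nodes; the DEFAULT line never has to match any hostname.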
The server's hostname is "our-slurm-master". Here is the /etc/hosts (which I
think is correct) :
root@our-slurm-master:~# cat /etc/hosts
127.0.0.1 localhost
123.234.1.2 our-slurm-master.epfl.ch our-slurm-master
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I checked /var/run/slurm-llnl/ ... it was automatically created and belongs to
slurm:slurm 755.
Also /var/log/slurm-llnl/ was automatically created and belongs to slurm:slurm
755.
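For reference, this is roughly how the ownership and mode can be checked in one go. It is a small sketch that assumes GNU stat (as shipped on Ubuntu); the helper names dir_perms and check_dir are my own.

```shell
#!/bin/sh
# Print "owner:group mode" for a path (GNU stat format string).
dir_perms() {
    stat -c '%U:%G %a' "$1"
}

# Compare a directory's owner, group and mode against an expected string.
check_dir() {
    dir=$1
    expected=$2
    actual=$(dir_perms "$dir") || return 1
    if [ "$actual" = "$expected" ]; then
        echo "OK   $dir ($actual)"
    else
        echo "DIFF $dir (got $actual, expected $expected)"
        return 1
    fi
}

# The two directories mentioned above, with the owner and mode we expect.
for d in /var/run/slurm-llnl /var/log/slurm-llnl; do
    check_dir "$d" "slurm:slurm 755" || echo "check failed for $d"
done
```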
The NodeName and PartitionName part of slurm.conf is an exact copy of the
previous install (Slurm 2.6.1), in which we did not declare the master node
as a compute node. Is that still possible?
The complete error message is :
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor
preset: enabled)
Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST; 1h
30min ago
Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited,
status=0/SUCCESS)
Main PID: 46746 (code=exited, status=1/FAILURE)
Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file
/var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with
a database but for some reason we have no TRES from it. This should only
happen if the database is down and you don't have any state files.
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main process
exited, code=exited, status=1/FAILURE
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit entered
failed state.
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed with
result 'exit-code'.
Sorry for cutting it off.
The thing is that the DB is up :
root@our-slurm-master:~# netstat -ntaupe | grep 3306
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 115 1223659 40389/mysqld
I even tried connecting to it:
root@our-slurm-master:~# mysql -u slurm -p slurm_acct_db
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 38
Server version: 10.0.25-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04
Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [slurm_acct_db]> show tables;
+-------------------------+
| Tables_in_slurm_acct_db |
+-------------------------+
| acct_coord_table |
| acct_table |
| clus_res_table |
| cluster_table |
| qos_table |
| res_table |
| table_defs_table |
| tres_table |
| txn_table |
| user_table |
+-------------------------+
10 rows in set (0.00 sec)
MariaDB [slurm_acct_db]> Bye
Regards,
Samuel
On 24. 08. 16 10:06, Raymond Wan wrote:
Hi Samuel,
On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <[email protected]> wrote:
We're trying to set up Slurm on an Ubuntu 16.04 server.
I attach the steps we did for the setup.
It's been a long time since I installed Slurm on our Ubuntu 16.04
system. I'm not sure if I remember everything I did...
Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine
this slurmd's NodeName
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process
exited, code=exited status=1
Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node
daemon.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered
failed state.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with
result 'exit-code'.
In your attachment, I noticed a line that says:
NodeName=Default ..
Is that correct? I'm just surprised someone would call their server
"default". (Of course, you can do that... :-) )
One thing I remember is that the node names you have in the COMPUTE
NODES section should match the names in your /etc/hosts file. When I
had the error above, I think this was the problem that I had...
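A rough sketch of that check, in case it helps. The helper name_in_hosts is my own, not a Slurm tool, and it only looks at a hosts file; it does not read slurm.conf. It assumes the name to look for is the machine's short hostname, but a node name from the COMPUTE NODES section can be passed instead.

```shell
#!/bin/sh
# Succeed if NAME appears as a hostname or alias in HOSTS_FILE (comments ignored).
name_in_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields afterwards
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}

# Short hostname of this machine (everything before the first dot).
short=$(uname -n)
short=${short%%.*}

if name_in_hosts "$short"; then
    echo "$short is listed in /etc/hosts"
else
    echo "$short is NOT listed in /etc/hosts"
fi
```

Field 1 of each line is the IP address, so the loop starts at field 2 and only compares names and aliases.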
Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller
daemon...
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file
/var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
Despite using the standard Slurm packages for Ubuntu, I had to do a
few things manually. One of them was creating directories such as
/var/run/slurm-llnl and making sure that the permissions of the
directory were correct. I think the owner had to be the Slurm user
("slurm" in my case).
I did go through a loop a few times: I tried to start Slurm, it
complained about the permissions or even the existence of the PID or log
directory, I created it, and then it moved on to the next error. A bit
tedious... but eventually it did stop complaining.
Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller
daemon.
Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running
with a database but for some reason we have no TRES from it. Thi
I don't know what this error means, but perhaps you can copy the rest
of it and maybe I (or someone else) might have an idea.
Good luck!
Ray
--
Samuel Bancal
ENAC-IT
EPFL