Hi Samuel,

On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <[email protected]> wrote:
> We're trying to setup Slurm on a Ubuntu 16.04 server.
> I attach the steps we did for the setup.

It's been a long time since I installed Slurm on our Ubuntu 16.04 system, so I'm not sure I remember everything I did...

> Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
> Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine this slurmd's NodeName
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process exited, code=exited status=1
> Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node daemon.
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered failed state.
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with result 'exit-code'.

In your attachment, I noticed a line that says:

NodeName=Default ..

Is that correct? I'm just surprised someone would call their server "default". (Of course, you can do that... :-) )

One thing I remember is that the node names you have in the COMPUTE NODES section should match the names in your /etc/hosts file. When I had the error above, I think this was the problem I had...

> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:

Despite using the standard Slurm packages for Ubuntu, I had to do a few things manually. One of them was creating directories such as /var/run/slurm-llnl and making sure that the permissions on the directory were correct. I think the owner had to be the Slurm user ("slurm" in my case).

I went through a loop a few times: I tried to start Slurm, and it complained about the permissions, or even the existence, of the PID or log directory. I created the directory, and then it moved on to the next error. A bit tedious... but eventually, it did stop complaining.

> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. Thi

I don't know what this error means, but perhaps you can copy the rest of it, and maybe I (or someone else) will have an idea.

Good luck!

Ray
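
P.S. In case it helps, here is a rough Python sketch of the /etc/hosts check mentioned above: it lists the NodeName entries in a slurm.conf that have no matching name in a hosts file. This is only an illustration, not anything Slurm ships. The function names and the sample text are made up for the example, it skips DEFAULT entries on the assumption that they only set default values rather than name a real node, and it does not expand bracketed ranges such as node[01-04].

```python
import re


def node_names(slurm_conf_text):
    """Collect NodeName values from slurm.conf text (hypothetical helper)."""
    names = []
    for line in slurm_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"NodeName=(\S+)", line, re.IGNORECASE)
        # Assumption: a DEFAULT entry sets defaults and is not a real host.
        if match and match.group(1).upper() != "DEFAULT":
            # Simple comma-separated lists only; bracket ranges are not expanded.
            names.extend(match.group(1).split(","))
    return names


def hosts_names(hosts_text):
    """Collect every hostname and alias from /etc/hosts-style text."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])  # the first field is the IP address
    return names


def missing_nodes(slurm_conf_text, hosts_text):
    """Return the node names that do not appear in the hosts text."""
    known = hosts_names(hosts_text)
    return [name for name in node_names(slurm_conf_text) if name not in known]


# Made-up sample input, not taken from the attachment.
sample_conf = (
    "# COMPUTE NODES\n"
    "NodeName=DEFAULT CPUs=4\n"
    "NodeName=node01,node02 State=UNKNOWN\n"
)
sample_hosts = "127.0.0.1 localhost\n10.0.0.1 node01\n"

print(missing_nodes(sample_conf, sample_hosts))  # prints ['node02']
```

To try it on a real machine, read your own slurm.conf and /etc/hosts into strings and pass them to missing_nodes(); an empty list means every listed node name was found.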
