Hi Samuel,

On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <[email protected]> wrote:
> We're trying to setup Slurm on a Ubuntu 16.04 server.
> I attach the steps we did for the setup.


It's been a long time since I installed Slurm on our Ubuntu 16.04
system.  I'm not sure if I remember everything I did...


> Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
> Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine
> this slurmd's NodeName
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process
> exited, code=exited status=1
> Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node
> daemon.
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered
> failed state.
> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with
> result 'exit-code'.


In your attachment, I noticed a line that says:

NodeName=Default ..

Is that correct?  I'm just surprised someone would call their server
"default".  (Of course, you can do that...  :-) )

One thing I remember is that the node names in the COMPUTE NODES
section of slurm.conf should match the names in your /etc/hosts file.
When I had the error above, I think this was the cause...
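
For what it's worth, here is a rough Python sketch of the kind of check
I mean: it pulls the NodeName values out of slurm.conf-style text and
tests whether a given host name is among them.  The function names and
the sample text are made up for illustration, it does not expand
bracketed ranges such as node[01-04], and it skips DEFAULT, which as far
as I know Slurm uses to set defaults for later node lines rather than to
name a real node.

```python
import socket


def configured_node_names(conf_text):
    """Return the literal NodeName values found in slurm.conf-style text."""
    names = []
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep and key.lower() == "nodename":
                if "[" in value:
                    # Bracketed host ranges are kept as-is, not expanded.
                    names.append(value)
                else:
                    names.extend(v for v in value.split(",") if v)
    # DEFAULT only sets defaults for later node lines; it is not a node.
    return [n for n in names if n.upper() != "DEFAULT"]


def hostname_is_configured(conf_text, hostname):
    """True if the short form of hostname appears as a NodeName."""
    short = hostname.split(".", 1)[0]
    return short in configured_node_names(conf_text)


if __name__ == "__main__":
    # Hypothetical sample; in practice you would read your own slurm.conf.
    sample = "NodeName=DEFAULT CPUs=4\nNodeName=our-slurm-master State=UNKNOWN\n"
    print(hostname_is_configured(sample, socket.gethostname()))
```

If this prints False for your real configuration, the mismatch between
the host name and the NodeName entries would be my first suspect.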


> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller
> daemon...
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file
> /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:


Despite using the standard Slurm packages for Ubuntu, I had to do a
few things manually.  One of them was creating directories such as
/var/run/slurm-llnl and making sure that the permissions of the
directory were correct.  I think the owner had to be the Slurm user
("slurm" in my case).

I did go through a loop a few times: I tried to start Slurm, it
complained about the permissions or even the existence of the PID or
log directory, I created or fixed the directory, and then it moved on
to the next error.  A bit tedious... but eventually it did stop
complaining.
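
To shorten that loop, something like the following Python sketch could
report on the directories up front.  It only checks that a directory
exists and who owns it.  The helper names are my own, the log directory
path is an assumption about where your configuration points, and
"slurm" is simply the user name from my setup.

```python
import os
import pwd


def owner_name(uid):
    """Map a numeric uid to a user name, falling back to the number."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def check_dir(path, expected_owner):
    """Return 'missing', 'owned by <user>', or 'ok' for one directory."""
    if not os.path.isdir(path):
        return "missing"
    actual = owner_name(os.stat(path).st_uid)
    if actual != expected_owner:
        return "owned by " + actual
    return "ok"


if __name__ == "__main__":
    # /var/run/slurm-llnl is from the log above; the log directory is
    # an assumed path, so adjust both to match your slurm.conf.
    for path in ("/var/run/slurm-llnl", "/var/log/slurm-llnl"):
        print(path, "->", check_dir(path, "slurm"))
```

Anything other than "ok" is a directory to create or chown before
starting the daemons again.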


> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller
> daemon.
> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running
> with a database but for some reason we have no TRES from it.  Thi


I don't know what this error means, but if you can post the rest of
it, I (or someone else) might have an idea.

Good luck!

Ray
