[slurm-dev] Re: setup Slurm on Ubuntu 16.04 server

Bancal Samuel Thu, 25 Aug 2016 02:41:13 -0700

Thank you, this helped me fix one problem:

root@our-slurm-master:~# sacctmgr -vvvv add cluster our-slurm
sacctmgr: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/accounting_storage_slurmdbd.so
sacctmgr: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacctmgr: debug3: Success.
sacctmgr: debug2: slurm_connect failed: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.0.1:6819: Connection refused
sacctmgr: debug:  slurmdbd: slurm_open_msg_conn to localhost:6819: Connection refused
sacctmgr: error: Problem talking to the database: Connection refused


After some searching, I found the mistake: I had copied DbdPort=7031 from the 
previous slurmdbd.conf, so slurmdbd was not listening on port 6819, where 
sacctmgr was trying to reach it. After fixing this, the database connection 
works smoothly.
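
A mismatch like this can be spotted by comparing the two configuration files directly. The sketch below is only an illustration: `conf_value` and `ports_match` are hypothetical helper names (not part of Slurm), it assumes one `Key=Value` pair per line, and it assumes 6819 (the port sacctmgr tried in the log above) is the default when neither DbdPort in slurmdbd.conf nor AccountingStoragePort in slurm.conf is set.

```shell
# Hypothetical helpers for illustration; not part of Slurm.

# conf_value KEY FILE DEFAULT
# Print the last value assigned to KEY in a "Key=Value" style file,
# or DEFAULT if KEY is not set. Assumes one key per line.
conf_value() {
    cv_found=$(awk -F= -v k="$1" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            name = $1
            gsub(/[ \t]/, "", name)
            if (tolower(name) == tolower(k)) {   # compare keys case-insensitively
                v = $2
                sub(/[ \t]*#.*$/, "", v)         # drop a trailing comment
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim surrounding blanks
                found = v
            }
        }
        END { print found }
    ' "$2")
    echo "${cv_found:-$3}"
}

# ports_match SLURMDBD_CONF SLURM_CONF
# Compare the port slurmdbd listens on with the port slurm.conf points at.
# The 6819 fallback is an assumption about the compiled-in default.
ports_match() {
    pm_dbd=$(conf_value DbdPort "$1" 6819)
    pm_acct=$(conf_value AccountingStoragePort "$2" 6819)
    if [ "$pm_dbd" = "$pm_acct" ]; then
        echo "ok: both sides use port $pm_dbd"
    else
        echo "mismatch: slurmdbd listens on $pm_dbd, slurm.conf expects $pm_acct"
        return 1
    fi
}
```

Assuming the configuration lives under /etc/slurm-llnl (the usual location with the Ubuntu packages, but check your install), it could be called as `ports_match /etc/slurm-llnl/slurmdbd.conf /etc/slurm-llnl/slurm.conf`.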

I'm still looking into the "fatal: Unable to determine this slurmd's NodeName" 
error.

Do we absolutely need to declare the master node in slurm.conf with 
NodeName=our-slurm-master?
I'm pretty sure this wasn't necessary previously, and it doesn't fit our 
architecture.

Or ... maybe I misunderstood ...

 * the slurmdbd and slurmctld services have to run on the our-slurm-master node
 * the slurmd service has to run on each of the nodes ?
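
For reference, the usual division can be sketched as follows: the controller host runs slurmctld (and, in this setup, slurmdbd), while slurmd only runs on hosts that appear in a NodeName line. Under that reading, a master that is not a compute node needs no NodeName entry as long as slurmd is not started there. The function name `daemons_for_host` and the node names `node01` and `node02` are hypothetical, and the sketch takes an already expanded list of node names rather than parsing slurm.conf.

```shell
# Hypothetical helper for illustration; not part of Slurm.
# daemons_for_host HOST CONTROLLER "NODE1 NODE2 ..."
# Print which Slurm daemons HOST is expected to run, assuming slurmdbd
# lives on the same machine as slurmctld (as in this thread).
daemons_for_host() {
    dh_out=""
    if [ "$1" = "$2" ]; then
        dh_out="slurmdbd slurmctld"
    fi
    for dh_node in $3; do
        if [ "$dh_node" = "$1" ]; then
            # Only hosts declared as compute nodes run slurmd.
            dh_out="${dh_out:+$dh_out }slurmd"
        fi
    done
    echo "${dh_out:-none}"
}

# Example calls (node names are made up):
#   daemons_for_host our-slurm-master our-slurm-master "node01 node02"
#   daemons_for_host node01 our-slurm-master "node01 node02"
```

If the master should also execute jobs, adding it to the node list makes the same function report all three daemons for it.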

Regards,
Samuel


On 24. 08. 16 12:53, Carlos Fenoy wrote:

Have you added the cluster to the database?

something like: "sacctmgr add cluster CLUSTER_NAME"
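
Whether the cluster is already registered can be checked before adding it. The sketch below is a minimal illustration: `cluster_status` is a hypothetical helper that reads one cluster name per line on standard input, so it can be fed from `sacctmgr -n -P list cluster format=cluster` (no header, parsable output). It does not talk to slurmdbd itself.

```shell
# Hypothetical helper for illustration; not part of Slurm.
# cluster_status NAME
# Read cluster names (one per line) on stdin and report whether NAME,
# matched as a whole line, is among them.
cluster_status() {
    if grep -Fxq -- "$1"; then
        echo "registered"
    else
        echo "missing"
    fi
}

# Intended use on the controller, once slurmdbd is reachable:
#   sacctmgr -n -P list cluster format=cluster | cluster_status our-slurm
# and, if that prints "missing":
#   sacctmgr add cluster our-slurm
```

The name passed here should be the same as ClusterName in slurm.conf, since that is the name slurmctld registers under.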

On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel <samuel.ban...@epfl.ch> wrote:


    Hi,

    Thanks for your quick answer.

    In fact NodeName=DEFAULT is not the server's hostname; it sets default
    values for all node definitions that follow it
    ( http://slurm.schedmd.com/slurm.conf.html ).
    The server's hostname is "our-slurm-master". Here is the /etc/hosts (which
    I think is correct):

    root@our-slurm-master:~# cat /etc/hosts
    127.0.0.1    localhost
    123.234.1.2 our-slurm-master.epfl.ch our-slurm-master

    # The following lines are desirable for IPv6 capable hosts
    ::1     localhost ip6-localhost ip6-loopback
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters

    I checked /var/run/slurm-llnl/ ... it was created automatically and is
    owned by slurm:slurm with mode 755.
    /var/log/slurm-llnl/ was also created automatically and is owned by
    slurm:slurm with mode 755.

    The NodeName and PartitionName part of the slurm.conf is an exact copy of
    the previous install (slurm 2.6.1) ... in which we didn't declare the master
    node as a compute node. Is that still possible?

    The complete error message is :

    ● slurmctld.service - Slurm controller daemon
       Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
       Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST; 1h 30min ago
      Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
     Main PID: 46746 (code=exited, status=1/FAILURE)

    Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
    Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
    Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
    Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files.
    Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
    Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit entered failed state.
    Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed with result 'exit-code'.

    Sorry for cutting it off.

    The thing is, the DB is up:

    root@our-slurm-master:~# netstat -ntaupe | grep 3306
    tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      115        1223659    40389/mysqld

    I even tried connecting to it directly:

    root@our-slurm-master:~# mysql -u slurm -p slurm_acct_db
    Enter password:
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 38
    Server version: 10.0.25-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04

    Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    MariaDB [slurm_acct_db]> show tables;
    +-------------------------+
    | Tables_in_slurm_acct_db |
    +-------------------------+
    | acct_coord_table        |
    | acct_table              |
    | clus_res_table          |
    | cluster_table           |
    | qos_table               |
    | res_table               |
    | table_defs_table        |
    | tres_table              |
    | txn_table               |
    | user_table              |
    +-------------------------+
    10 rows in set (0.00 sec)

    MariaDB [slurm_acct_db]> Bye

    Regards,
    Samuel



    On 24. 08. 16 10:06, Raymond Wan wrote:

        Hi Samuel,


        On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <samuel.ban...@epfl.ch> wrote:

            We're trying to set up Slurm on an Ubuntu 16.04 server.
            I attach the steps we did for the setup.


        It's been a long time since I installed Slurm on our Ubuntu 16.04
        system.  I'm not sure if I remember everything I did...


            Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
            Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine this slurmd's NodeName
            Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process exited, code=exited status=1
            Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node daemon.
            Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered failed state.
            Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with result 'exit-code'.


        In your attachment, I noticed a line that says:

        NodeName=Default ..

        Is that correct?  I'm just surprised someone would call their server
        "default".  (Of course, you can do that...  :-) )

        One thing I remember is that the node names you have in the COMPUTE
        NODES section should match the names in your /etc/hosts file.  When I
        had the error above, I think this was the problem that I had...
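
The check described above can be sketched as a small shell function that reports which node names have no entry in a hosts file. This is only an illustration: `missing_from_hosts` is a hypothetical helper, `node01` is a made-up node name, and the sketch looks at the file alone, so names that resolve through DNS instead would be reported as missing (`getent hosts NAME` covers both sources).

```shell
# Hypothetical helper for illustration; not part of Slurm.
# missing_from_hosts HOSTS_FILE NAME...
# Print each NAME that does not appear as a hostname or alias
# (any column after the address) in HOSTS_FILE.
missing_from_hosts() {
    mh_file=$1
    shift
    for mh_name in "$@"; do
        if ! awk -v n="$mh_name" '
            { sub(/#.*/, "") }                                  # ignore comments
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$mh_file"; then
            echo "$mh_name"
        fi
    done
}

# Example call (node01 is a made-up compute node name):
#   missing_from_hosts /etc/hosts our-slurm-master node01
```

An empty result means every name passed in was found in the file.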


            Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
            Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:


        Despite using the standard Slurm packages for Ubuntu, I had to do a
        few things manually.  One of them was creating directories such as
        /var/run/slurm-llnl and making sure that the permissions of the
        directory were correct.  I think the owner had to be the Slurm user
        ("slurm" in my case).
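
A quick way to run that check before starting the daemons is sketched below. `dir_owned_by` is a hypothetical helper, not a Slurm tool; it only reports problems and leaves the fix (for example `mkdir -p` followed by `chown slurm:slurm` as root) to the administrator.

```shell
# Hypothetical helper for illustration; not part of Slurm.
# dir_owned_by DIR USER
# Print "ok" if DIR exists and its owner (as shown by ls -ld) is USER,
# otherwise print a short description of the problem.
dir_owned_by() {
    if [ ! -d "$1" ]; then
        echo "missing: $1"
        return
    fi
    do_owner=$(ls -ld -- "$1" | awk '{ print $3 }')
    if [ "$do_owner" = "$2" ]; then
        echo "ok"
    else
        echo "wrong owner: $1 belongs to $do_owner, expected $2"
    fi
}

# Example calls for the directories mentioned in this thread:
#   dir_owned_by /var/run/slurm-llnl slurm
#   dir_owned_by /var/log/slurm-llnl slurm
```

Running it for both the PID and log directories in one go avoids the start, fail, fix, restart loop for this class of error.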

        I did go through a loop a few times where I tried to start Slurm, it
        complained about permissions or even the existence of the PID or log
        directory.  I created it, and then it moved to the next error.  A bit
        tedious...but eventually, it did stop complaining.


            Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
            Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it.  Thi


        I don't know what this error means, but perhaps you can copy the rest
        of it and maybe I (or someone else) might have an idea.

        Good luck!

        Ray

    --
    Samuel Bancal
    ENAC-IT
    EPFL




--
Carles Fenoy


--
Samuel Bancal
ENAC-IT
EPFL
