Hi,

We're trying to set up Slurm on an Ubuntu 16.04 server. I attach the steps we followed for the setup.
When starting slurmdbd, everything seems OK. Then when starting slurmd and slurmctld, we get:

root@our-slurm-master:~# systemctl start slurmd
Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
root@our-slurm-master:~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2016-08-24 09:17:38 CEST; 5s ago
  Process: 46263 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine this slurmd's NodeName
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process exited, code=exited status=1
Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node daemon.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered failed state.
Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with result 'exit-code'.

root@our-slurm-master:~# systemctl start slurmctld
root@our-slurm-master:~# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST; 24s ago
  Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 46746 (code=exited, status=1/FAILURE)

Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. Thi
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit entered failed state.
Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed with result 'exit-code'.

It looks like we have two different errors:

- slurmd speaks about the NodeName, but we don't expect the slurm-master server to be one of the compute nodes; it should only be the master.
- slurmctld speaks about the DB, but when we look at /var/log/slurm-llnl/slurmdbd.log, we see:

[2016-08-24T08:52:09.923] Accounting storage MYSQL plugin loaded
[2016-08-24T08:52:09.932] slurmdbd version 15.08.7 started

Any idea?

Thanks!
Samuel

Oh, and we're using the versions offered by Ubuntu's official repos:

root@our-slurm-master:~# dpkg -l | grep -e slurm -e munge
ii  libmunge2                0.5.11-3         amd64  authentication service for credential -- library package
ii  munge                    0.5.11-3         amd64  authentication service to create and validate credentials
ii  slurm                    0.4.3-2build1    amd64  Realtime network interface monitor
ii  slurm-client             15.08.7-1build1  amd64  SLURM client side commands
ii  slurm-wlm                15.08.7-1build1  amd64  Simple Linux Utility for Resource Management
ii  slurm-wlm-basic-plugins  15.08.7-1build1  amd64  SLURM basic plugins
ii  slurmctld                15.08.7-1build1  amd64  SLURM central management daemon
ii  slurmd                   15.08.7-1build1  amd64  SLURM compute node daemon
ii  slurmdbd                 15.08.7-1build1  amd64  Secure enterprise-wide interface to a database for SLURM

--
Samuel Bancal
ENAC-IT
EPFL
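[Editor's note: a sketch of two possible remediations for the errors above, assuming the master should indeed not run slurmd, and that the "no TRES" fatal comes from the cluster never having been registered in the accounting database. The `cluster_name` extraction and the `-i` (immediate) flag are illustrative; verify against your setup before running.]

~~~ bash
# The master is not a compute node, so slurmd need not run there at all;
# slurmd only starts on hosts whose hostname matches a NodeName in slurm.conf.
systemctl disable --now slurmd

# slurmctld's "no TRES" fatal usually means the cluster has never been
# registered in the accounting database. Register it under the same name
# as ClusterName in slurm.conf, then restart slurmctld.
cluster_name=$(awk -F= '/^ClusterName=/{print $2}' /etc/slurm-llnl/slurm.conf)
sacctmgr -i add cluster "$cluster_name"
systemctl restart slurmctld
~~~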
2016-08-23 - SB – MariaDB
------------------------------------------------------------------------

~~~ bash
apt install mariadb-server
~~~

2016-08-23 - SB – Munge
------------------------------------------------------------------------

~~~ bash
apt install munge
~~~

Note: the munge service fails to start because:

~~~ output
Aug 23 15:36:10 our-slurm-master munged[20031]: munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
Aug 23 15:36:10 our-slurm-master systemd[1]: munge.service: Control process exited, code=exited status=1
Aug 23 15:36:10 our-slurm-master systemd[1]: Failed to start MUNGE authentication service.
~~~

Workaround https://github.com/dun/munge/issues/31#issuecomment-127726497 :

~~~ bash
systemctl edit --system --full munge
~~~

~~~ snip
ExecStart=/usr/sbin/munged --syslog
~~~

~~~ bash
systemctl start munge
~~~

2016-08-23 - SB – SLURM
------------------------------------------------------------------------

~~~ bash
apt install slurmd slurmdbd slurm slurm-wlm
vi /etc/slurm-llnl/slurmdbd.conf
~~~

~~~ snip
ArchiveEvents=yes
ArchiveJobs=yes
#ArchiveResv=no
ArchiveSteps=yes
ArchiveSuspend=yes
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=localhost
DbdPort=7031
#DbdBackupHost=
MessageTimeout=100
DebugLevel=4
#DefaultQOS=normal,standby
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
StorageHost=localhost
StoragePort=3306
StorageType=accounting_storage/mysql
StorageLoc=slurm_acct_db
StorageUser=slurm
StoragePass=<KeepassX>
~~~

~~~ bash
mysql -u root
~~~

~~~ sql
create user 'slurm'@'localhost' identified by '<KeepassX>';
create database slurm_acct_db;
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
commit;
~~~

~~~ bash
systemctl start slurmdbd
~~~

~~~ bash
vi /etc/slurm-llnl/slurm.conf
~~~

~~~ snip
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=our-slurm-master
ControlAddr=123.234.1.2
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=our-slurm
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
NodeName=DEFAULT TmpDisk=1024 State=UNKNOWN
NodeName=nodepc-[94-97] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Weight=1 Feature=GROUP0
NodeName=nodepc-[01-42] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Weight=1 Feature=GROUP1
NodeName=nodepc-[51-77] CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Weight=1 Feature=GROUP2
NodeName=nodepc[192-206] CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=6000 Weight=1 Feature=GROUP3
PartitionName=DEFAULT DefaultTime=60 MinNodes=1 MaxNodes=UNLIMITED MaxTime=31-0 PreemptMode=CANCEL Shared=NO State=UP Default=NO AllowGroups=ALL
PartitionName=group0 Nodes=nodepc-[94-97]
PartitionName=group1 Nodes=nodepc-[01-42]
PartitionName=group2 Nodes=nodepc-[51-77]
PartitionName=group12 Nodes=nodepc-[01-42],nodepc-[51-77]
PartitionName=group3 Nodes=nodepc[192-206]
PartitionName=group123 Nodes=nodepc-[01-42],nodepc-[51-77],nodepc-[94-97],nodepc[192-206]
~~~

~~~ bash
vi /etc/slurm-llnl/cgroup.conf
~~~

~~~ snip
CgroupMountpoint=/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir=/etc/slurm-llnl/cgroup
ConstrainCores=yes
TaskAffinity=yes
~~~
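[Editor's note: as a quick sanity check on the NodeName definitions above, CPUs should equal Sockets × CoresPerSocket × ThreadsPerCore for each node class; the values below are copied from the GROUP0 and GROUP3 lines.]

~~~ bash
# GROUP0-2 nodes: 1 socket x 4 cores x 2 threads per core
echo $((1 * 4 * 2))   # 8, matches CPUs=8
# GROUP3 nodes: 1 socket x 4 cores x 1 thread per core
echo $((1 * 4 * 1))   # 4, matches CPUs=4
~~~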
