Both folders exist under /var/spool/:

drwxr-xr-x 4 slurm slurm 4096 Feb 13 15:34 slurmctld
drwxr-xr-x 2 slurm slurm 4096 Feb 13 14:05 slurmd
Thank you for the tip; I am thinking about setting the slurm user to root.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Antony Cleave
Sent: Wednesday, February 13, 2019 15:12
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurmd not starting

There is a very strong likelihood that you have configured SlurmdUser=slurm and one of the following:

1) there is no /var/spool/slurmd folder
2) the /var/spool/slurmd folder exists but is owned by root

Make sure it exists and is owned by whatever SlurmdUser is set to, or change your SlurmdUser to run as root. That may not be acceptable to you for security reasons, but if you were to change it, it makes "doing cool stuff" in prologs and epilogs easier, as you can avoid complex passwordless sudo configs on all nodes.

Antony

On Wed, 13 Feb 2019 at 14:00, Nathalie Gocht <nathalie.go...@outlook.com> wrote:

Hey,

I am building up a one-node cluster. Master and node are on the same machine.
My slurm.conf:

ControlMachine=bayes
#
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
SelectType=select/linear
#
# LOGGING AND ACCOUNTING
AccountingStorageLoc=/var/log/slurm-llnl/job_accounting
AccountingStorageType=accounting_storage/filetxt
AccountingStoreJobComment=YES
ClusterName=bayes
JobCompLoc=/var/log/slurm-llnl/job_completion
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
GresTypes=gpu
NodeName=bayes Gres=gpu:tesla:1 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=long Nodes=bayes Default=YES MaxTime=INFINITE State=UP

I started the control daemon, but get this information:

$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2019-02-13 14:43:02 CET; 7min ago
     Docs: man:slurmctld(8)
  Process: 40552 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCE
 Main PID: 40560 (code=exited, status=1/FAILURE)

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
long*        up   infinite      1   idle bayes

I tried to start the slurm daemon, but the timeout is exceeded.
slurmd -Dvvv gives:

slurmd: error: chmod(/var/spool/slurmd, 0755): Operation not permitted
slurmd: error: Unable to initialize slurmd spooldir
slurmd: error: slurmd initialization failed

Does someone know what's going on?

Thanks,
Nathalie
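(For reference, the fix Antony describes can be sketched as the commands below. This assumes SlurmdUser=slurm and uses the SlurmdSpoolDir from the slurm.conf above; note that SlurmdUser defaults to root when it is not set in slurm.conf. Run as root on the node; the guard around chown lets the sketch be dry-run on a machine without a slurm account.)

```shell
# SPOOL mirrors SlurmdSpoolDir from slurm.conf.
SPOOL="${SPOOL:-/var/spool/slurmd}"
mkdir -p "$SPOOL"
chmod 755 "$SPOOL"
# Ownership must match whatever SlurmdUser is set to; skip silently
# if no slurm account exists (e.g. when dry-running this elsewhere).
if id slurm >/dev/null 2>&1; then
  chown slurm:slurm "$SPOOL"
fi
stat -c '%U:%G %a %n' "$SPOOL"
```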