Hello!

I would suggest the following steps:

1) (Optional) Configure a valid MailProg on your server so that you get
email notifications.

2) Change your "StateSaveLocation" parameter; don't use /tmp. Create a
dedicated directory and give write permission only to the SlurmUser (the
user who runs the slurmctld daemon).

3) Use both the ControlMachine and ControlAddr options. Set
ControlMachine=server-hostname and ControlAddr=server-ip, using a hostname
and IP that are visible/accessible on the same network as the compute nodes.

4) You could try setting ProctrackType=proctrack/linuxproc

5) The directories for the pid and log files must be writable by the
SlurmUser.

6) Before trying to use slurmdbd, at least set up the jobcomp mechanism
so that you have a history of completed jobs. Just set
JobCompType=jobcomp/filetxt and JobCompLoc="<dir>", where the JobCompLoc
path must be writable by the SlurmUser.

7) You could also set JobAcctGatherFrequency=0

8) You should also check your firewall: by default slurmctld listens on
port 6817 and slurmd on port 6818, and both must be reachable.

9) I see that you are trying to run slurmctld and slurmd on the same
machine. I have never tried this myself, but could you tell us whether you
can see both processes running? When you run "service slurm start", which
processes are started? In the end I would suggest using a separate
machine as a compute node. If you use VMs, it shouldn't be difficult
to set up a second VM ;)
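The permission fixes in points 2 and 5 above can be sketched like this
(the paths and the slurm user name are only illustrative; adjust them to
your installation):

```shell
#!/bin/sh
# Sketch for points 2 and 5: a dedicated state directory that is
# writable only by the SlurmUser, unlike the world-writable /tmp.
# The path here is illustrative so the sketch runs unprivileged; in
# production you would use e.g. /var/spool/slurmctld owned by the
# slurm user (chown slurm:slurm).
STATEDIR="${TMPDIR:-/tmp}/slurmctld-state"
mkdir -p "$STATEDIR"
chmod 700 "$STATEDIR"            # no longer world writable -> silences the warning
stat -c '%a %U %n' "$STATEDIR"   # show mode and owner for verification

# slurm.conf would then contain, for example:
#   StateSaveLocation=/var/spool/slurmctld
#   SlurmctldLogFile=/var/log/slurm/slurmctld.log
#   SlurmdLogFile=/var/log/slurm/slurmd.log
```

After restarting slurmctld with such a directory in place, the
"StateSaveLocation DIRECTORY IS WORLD WRITABLE" error should disappear,
and the node/job state files will survive reboots (unlike files in /tmp).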

I hope this will help you!

Best Regards,
Chrysovalantis Paschoulas


On 10/09/2014 09:11 PM, Uwe Sauter wrote:
And port 6817 as well

On 09.10.2014 21:04, Monica Marathe wrote:
Hey Michael,
I did build my configuration file:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=control-machine
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=control-machine CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=control-machine Default=YES MaxTime=INFINITE State=UP

But I get the following errors when I run slurmctld:

slurmctld: error: Configured MailProg is invalid

slurmctld: error: ################################################
slurmctld: error: ###       SEVERE SECURITY VULERABILTY        ###
slurmctld: error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
slurmctld: error: ###         CORRECT FILE PERMISSIONS         ###
slurmctld: error: ################################################
slurmctld: error: Could not open node state file /tmp/node_state: No
such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may
be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such
file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: debug:  Updating partition uid access list
slurmctld: error: Could not open reservation state file /tmp/resv_state:
No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may
be lost
slurmctld: No reservation state file (/tmp/resv_state.old) to recover
slurmctld: Recovered state of 0 reservations
slurmctld: error: Could not open trigger state file /tmp/trigger_state:
No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
slurmctld: error: Incomplete trigger data checkpoint file
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Reinitializing job accounting state
slurmctld: Running as primary controller

slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug:  Spawning registration agent for control-machine 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type
REQUEST_NODE_REGISTRATION_STATUS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to control-machine
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused

slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
slurmctld: agent/is_node_resp: node:control-machine rpc:1001 :
Communication connection failure
slurmctld: error: Nodes control-machine not responding

slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug:  sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: error: Could not open job state file /tmp/job_state: No such
file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) found
slurmctld: debug3: Writing job id 1 to header record of job_state file
slurmctld: debug4: unable to create link for /tmp/job_state ->
/tmp/job_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/node_state ->
/tmp/node_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/part_state ->
/tmp/part_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/resv_state ->
/tmp/resv_state.old: No such file or directory
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
slurmctld: debug4: unable to create link for /tmp/trigger_state ->
/tmp/trigger_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_mgr_state ->
/tmp/assoc_mgr_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_usage ->
/tmp/assoc_usage.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/qos_usage ->
/tmp/qos_usage.old: No such file or directory
slurmctld: debug3: _slurmctld_background shutting down
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission
denied

I am trying to run this on a single machine.
Any suggestions?

Thanks!
-Monica



On Thu, Oct 9, 2014 at 2:17 PM, Michael Jennings <[email protected]> wrote:

     Have you tried generating your own on the web?

     http://slurm.schedmd.com/configurator.easy.html
     http://slurm.schedmd.com/configurator.html

     You might get more appropriate results that way.  :-)

     Michael


     On Thu, Oct 9, 2014 at 11:11 AM, Monica Marathe
     <[email protected]> wrote:

         Hi,

         Can anyone send me a sample slurm.conf file? I am trying to
         configure SLURM on a single machine only.

         Thanks!
         -Monica

         --
         - Monica Marathe




     --
     Michael Jennings <[email protected] <mailto:[email protected]>>
     Senior HPC Systems Engineer
     High-Performance Computing Services
     Lawrence Berkeley National Laboratory
     Bldg 50B-3209E        W: 510-495-2687 <tel:510-495-2687>
     MS 050B-3209          F: 510-486-8615 <tel:510-486-8615>




--
- Monica Marathe


