Hello, I obtain the following error message when I try to use SLURM.
root@kosmos:~# sinfo sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received slurm_load_partitions: Zero Bytes were transmitted or received Here is the output of same command with an increased level of verbosity: root@kosmos:~# sinfo -vv ----------------------------- dead = false exact = 0 filtering = false format = %9P %.5a %.10l %.6D %.6t %N iterate = 0 long = false no_header = false node_field = false node_format = false nodes = n/a part_field = true partition = n/a responding = false states = (null) sort = (null) summarize = false verbose = 2 ----------------------------- all_flag = false avail_flag = true bg_flag = false cpus_flag = false default_time_flag =false disk_flag = false features_flag = false groups_flag = false gres_flag = false job_size_flag = false max_time_flag = true memory_flag = false partition_flag = true priority_flag = false reason_flag = false reason_timestamp_flag = false reason_user_flag = false reservation_flag = false root_flag = false share_flag = false state_flag = true weight_flag = false ----------------------------- sinfo: debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf Tue Oct 8 15:30:10 2013 sinfo: auth plugin for Munge (http://code.google.com/p/munge/) loaded sinfo: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received slurm_load_partitions: Zero Bytes were transmitted or received Everything seems fine with Munge: root@kosmos:~# munge -n|ssh k01 unmunge STATUS: Success (0) ENCODE_HOST: kosmos (10.3.1.80) ENCODE_TIME: 2013-10-08 15:30:48 (1381239048) DECODE_TIME: 2013-10-08 15:30:48 (1381239048) TTL: 300 CIPHER: aes128 (4) MAC: sha1 (3) ZIP: none (0) UID: root (0) GID: root (0) LENGTH: 0 Here is the slurm.conf: root@kosmos:~# cat /etc/slurm-llnl/slurm.conf # slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ControlMachine=kosmos #ControlAddr= #BackupController= #BackupAddr= # #AuthType=auth/none AuthType=auth/munge CacheGroups=0 #CheckpointType=checkpoint/none CryptoType=crypto/munge #CryptoType=crypto/openssl #DisableRootJobs=NO #EnforcePartLimits=NO #Epilog= #PrologSlurmctld= #FirstJobId=1 #MaxJobId=999999 #GresTypes= #GroupUpdateForce=0 #GroupUpdateTime=600 JobCheckpointDir=/var/lib/slurm-llnl/checkpoint #JobCredentialPrivateKey= #JobCredentialPublicCertificate= #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub #JobFileAppend=0 #JobRequeue=1 #JobSubmitPlugins=1 #KillOnBadExit=0 #Licenses=foo*4,bar #MailProg=/usr/bin/mail #MaxJobCount=5000 #MaxStepCount=40000 #MaxTasksPerNode=128 MpiDefault=none #MpiParams=ports=#-# #PluginDir= #PlugStackConfig= #PrivateData=jobs ProctrackType=proctrack/pgid #Prolog= #PrologSlurmctld= #PropagatePrioProcess=0 #PropagateResourceLimits= #PropagateResourceLimitsExcept= ReturnToService=1 #SallocDefaultCommand= SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd SlurmUser=slurm #SrunEpilog= #SrunProlog= StateSaveLocation=/var/lib/slurm-llnl/slurmctld SwitchType=switch/none #TaskEpilog= TaskPlugin=task/none #TaskPluginParam= #TaskProlog= #TopologyPlugin=topology/tree #TmpFs=/tmp #TrackWCKey=no #TreeWidth= #UnkillableStepProgram= #UsePAM=0 # # # TIMERS #BatchStartTimeout=10 #CompleteWait=0 #EpilogMsgTime=2000 #GetEnvTimeout=2 #HealthCheckInterval=0 #HealthCheckProgram= InactiveLimit=0 KillWait=30 #MessageTimeout=10 #ResvOverRun=0 MinJobAge=300 #OverTimeLimit=0 SlurmctldTimeout=120 SlurmdTimeout=300 #UnkillableStepTimeout=60 #VSizeFactor=0 Waittime=0 # # # SCHEDULING #DefMemPerCPU=0 FastSchedule=1 #MaxMemPerCPU=0 #SchedulerRootFilter=1 #SchedulerTimeSlice=30 SchedulerType=sched/backfill SchedulerPort=7321 SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory # # # JOB PRIORITY #PriorityType=priority/basic #PriorityDecayHalfLife= #PriorityCalcPeriod= #PriorityFavorSmall= #PriorityMaxAge= #PriorityUsageResetPeriod= #PriorityWeightAge= #PriorityWeightFairshare= #PriorityWeightJobSize= #PriorityWeightPartition= #PriorityWeightQOS= # # # LOGGING AND ACCOUNTING #AccountingStorageEnforce=0 #AccountingStorageHost= AccountingStorageLoc=/var/log/slurm/accounting.txt #AccountingStoragePass= #AccountingStoragePort= AccountingStorageType=accounting_storage/filetxt #AccountingStorageUser= AccountingStoreJobComment=YES ClusterName=cluster #DebugFlags= #JobCompHost= JobCompLoc=/var/log/slurm/slurm.log #JobCompPass= #JobCompPort= JobCompType=jobcomp/filetxt #JobCompUser= JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/linux SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log SlurmdDebug=3 SlurmdLogFile=/var/log/slurm-llnl/slurmd.log #SlurmSchedLogFile= #SlurmSchedLogLevel= # # # POWER SAVE SUPPORT FOR IDLE NODES (optional) #SuspendProgram= #ResumeProgram= #SuspendTimeout= #ResumeTimeout= #ResumeRate= #SuspendExcNodes= #SuspendExcParts= #SuspendRate= #SuspendTime= # # # COMPUTE NODES NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP NTP is running on all the nodes and the clocks are in sync. Thank you for your help! Best regards, Philippe
