Might be unrelated, but did you update your compute nodes as well? You have two "PartitionName=uag ..." lines in your slurm.conf. The second one is ignored, meaning that kosmos is not seen as a compute node. (You should see something like "slurmctld: error: _parse_part_spec: duplicate entry for partition uag, ignoring" in your logs.)
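As an untested sketch: if kosmos is meant to join the same partition, the two definitions could be merged into a single partition line (node and resource values copied from your posted slurm.conf), something like:

    NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
    NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=uag Nodes=k0[1-3],kosmos Default=YES MaxTime=INFINITE State=UP

Then restart slurmctld (or run "scontrol reconfigure") so the change is picked up.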
Do "sinfo -V" and "slurmctld -V" produce the same output?

HTH

damien

On 08 Oct 2013, at 19:59, [email protected] wrote:

>
> It does not seem to solve the problem.
>
> I have updated everything (I even did a make clean before the compilation)
> and have checked that I have the new version of the different components,
> but the problem persists.
>
> # sinfo
> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
> slurm_load_partitions: Zero Bytes were transmitted or received
>
> # tail /var/log/slurm-llnl/slurmctld.log -n 3
> [2013-10-08T19:53:44] error: Invalid Protocol Version 6656 from uid=0 at 10.3.1.80:38100
> [2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, re-link your code
> [2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, re-link your code
>
>
> Thanks.
>
>
> ----- Original Message -----
>> From: "Moe Jette" <[email protected]>
>> To: "slurm-dev" <[email protected]>
>> Sent: Tuesday, 8 October 2013 18:46:53
>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were
>> transmitted or received
>>
>>
>> That means your slurmctld is older than the command you are executing
>> and it does not understand the RPC. (Older commands than the daemon
>> would have worked fine.)
>>
>> Quoting [email protected]:
>>
>>>
>>> Thank you!
>>>
>>> The problem seems to lie there.
>>>
>>> slurmctld.log contains this:
>>>
>>> error: slurm_receive_msg: Protocol version has changed, re-link your code
>>>
>>> In fact, I was using the version of SLURM coming with Debian. Then I
>>> installed the latest version from source in order to have array jobs.
>>> The problem seems to come from there.
>>>
>>> Philippe
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Moe Jette" <[email protected]>
>>>> To: "slurm-dev" <[email protected]>
>>>> Sent: Tuesday, 8 October 2013 17:29:53
>>>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
>>>> were transmitted or received
>>>>
>>>>
>>>> The command and daemon are not communicating. Check your slurmctld
>>>> log file. There is also a troubleshooting guide online that may prove
>>>> helpful to you:
>>>> http://slurm.schedmd.com/troubleshoot.html
>>>>
>>>> Quoting [email protected]:
>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> Thank you for your reply.
>>>>>
>>>>> It seems to be running.
>>>>>
>>>>> root@kosmos:~# /etc/init.d/slurm-llnl status
>>>>> slurmctld (pid 6093) is running...
>>>>> slurmd (pid 6221) is running...
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Danny Auble" <[email protected]>
>>>>>> To: "slurm-dev" <[email protected]>
>>>>>> Sent: Tuesday, 8 October 2013 16:42:11
>>>>>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
>>>>>> were transmitted or received
>>>>>>
>>>>>> It doesn't appear your slurmctld is running or responsive.
>>>>>>
>>>>>>
>>>>>> [email protected] wrote:
>>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I get the following error message when I try to use SLURM.
>>>>>>
>>>>>> root@kosmos:~# sinfo
>>>>>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>>>> slurm_load_partitions: Zero Bytes were transmitted or received
>>>>>>
>>>>>>
>>>>>> Here is the output of the same command with an increased level of
>>>>>> verbosity:
>>>>>>
>>>>>> root@kosmos:~# sinfo -vv
>>>>>> -----------------------------
>>>>>> dead        = false
>>>>>> exact       = 0
>>>>>> filtering   = false
>>>>>> format      = %9P %.5a %.10l %.6D %.6t %N
>>>>>> iterate     = 0
>>>>>> long        = false
>>>>>> no_header   = false
>>>>>> node_field  = false
>>>>>> node_format = false
>>>>>> nodes       = n/a
>>>>>> part_field  = true
>>>>>> partition   = n/a
>>>>>> responding  = false
>>>>>> states      = (null)
>>>>>> sort        = (null)
>>>>>> summarize   = false
>>>>>> verbose     = 2
>>>>>> -----------------------------
>>>>>> all_flag              = false
>>>>>> avail_flag            = true
>>>>>> bg_flag               = false
>>>>>> cpus_flag             = false
>>>>>> default_time_flag     = false
>>>>>> disk_flag             = false
>>>>>> features_flag         = false
>>>>>> groups_flag           = false
>>>>>> gres_flag             = false
>>>>>> job_size_flag         = false
>>>>>> max_time_flag         = true
>>>>>> memory_flag           = false
>>>>>> partition_flag        = true
>>>>>> priority_flag         = false
>>>>>> reason_flag           = false
>>>>>> reason_timestamp_flag = false
>>>>>> reason_user_flag      = false
>>>>>> reservation_flag      = false
>>>>>> root_flag             = false
>>>>>> share_flag            = false
>>>>>> state_flag            = true
>>>>>> weight_flag           = false
>>>>>> -----------------------------
>>>>>>
>>>>>> sinfo: debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
>>>>>> Tue Oct 8 15:30:10 2013
>>>>>> sinfo: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>>>>>> sinfo: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
>>>>>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>>>>> slurm_load_partitions: Zero Bytes were transmitted or received
>>>>>>
>>>>>>
>>>>>> Everything seems fine with Munge:
>>>>>>
>>>>>> root@kosmos:~# munge -n|ssh k01 unmunge
>>>>>> STATUS:          Success (0)
>>>>>> ENCODE_HOST:     kosmos (10.3.1.80)
>>>>>> ENCODE_TIME:     2013-10-08 15:30:48 (1381239048)
>>>>>> DECODE_TIME:     2013-10-08 15:30:48 (1381239048)
>>>>>> TTL:             300
>>>>>> CIPHER:          aes128 (4)
>>>>>> MAC:             sha1 (3)
>>>>>> ZIP:             none (0)
>>>>>> UID:             root (0)
>>>>>> GID:             root (0)
>>>>>> LENGTH:          0
>>>>>>
>>>>>>
>>>>>> Here is the slurm.conf:
>>>>>>
>>>>>> root@kosmos:~# cat /etc/slurm-llnl/slurm.conf
>>>>>> # slurm.conf file generated by configurator.html.
>>>>>> # Put this file on all nodes of your cluster.
>>>>>> # See the slurm.conf man page for more information.
>>>>>> #
>>>>>> ControlMachine=kosmos
>>>>>> #ControlAddr=
>>>>>> #BackupController=
>>>>>> #BackupAddr=
>>>>>> #
>>>>>> #AuthType=auth/none
>>>>>> AuthType=auth/munge
>>>>>> CacheGroups=0
>>>>>> #CheckpointType=checkpoint/none
>>>>>> CryptoType=crypto/munge
>>>>>> #CryptoType=crypto/openssl
>>>>>> #DisableRootJobs=NO
>>>>>> #EnforcePartLimits=NO
>>>>>> #Epilog=
>>>>>> #PrologSlurmctld=
>>>>>> #FirstJobId=1
>>>>>> #MaxJobId=999999
>>>>>> #GresTypes=
>>>>>> #GroupUpdateForce=0
>>>>>> #GroupUpdateTime=600
>>>>>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>>>>>> #JobCredentialPrivateKey=
>>>>>> #JobCredentialPublicCertificate=
>>>>>> #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa
>>>>>> #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub
>>>>>> #JobFileAppend=0
>>>>>> #JobRequeue=1
>>>>>> #JobSubmitPlugins=1
>>>>>> #KillOnBadExit=0
>>>>>> #Licenses=foo*4,bar
>>>>>> #MailProg=/usr/bin/mail
>>>>>> #MaxJobCount=5000
>>>>>> #MaxStepCount=40000
>>>>>> #MaxTasksPerNode=128
>>>>>> MpiDefault=none
>>>>>> #MpiParams=ports=#-#
>>>>>> #PluginDir=
>>>>>> #PlugStackConfig=
>>>>>> #PrivateData=jobs
>>>>>> ProctrackType=proctrack/pgid
>>>>>> #Prolog=
>>>>>> #PrologSlurmctld=
>>>>>> #PropagatePrioProcess=0
>>>>>> #PropagateResourceLimits=
>>>>>> #PropagateResourceLimitsExcept=
>>>>>> ReturnToService=1
>>>>>> #SallocDefaultCommand=
>>>>>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>>>>>> SlurmctldPort=6817
>>>>>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>>>>>> SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>>>>>> SlurmUser=slurm
>>>>>> #SrunEpilog=
>>>>>> #SrunProlog=
>>>>>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>>>>>> SwitchType=switch/none
>>>>>> #TaskEpilog=
>>>>>> TaskPlugin=task/none
>>>>>> #TaskPluginParam=
>>>>>> #TaskProlog=
>>>>>> #TopologyPlugin=topology/tree
>>>>>> #TmpFs=/tmp
>>>>>> #TrackWCKey=no
>>>>>> #TreeWidth=
>>>>>> #UnkillableStepProgram=
>>>>>> #UsePAM=0
>>>>>> #
>>>>>> #
>>>>>> # TIMERS
>>>>>> #BatchStartTimeout=10
>>>>>> #CompleteWait=0
>>>>>> #EpilogMsgTime=2000
>>>>>> #GetEnvTimeout=2
>>>>>> #HealthCheckInterval=0
>>>>>> #HealthCheckProgram=
>>>>>> InactiveLimit=0
>>>>>> KillWait=30
>>>>>> #MessageTimeout=10
>>>>>> #ResvOverRun=0
>>>>>> MinJobAge=300
>>>>>> #OverTimeLimit=0
>>>>>> SlurmctldTimeout=120
>>>>>> SlurmdTimeout=300
>>>>>> #UnkillableStepTimeout=60
>>>>>> #VSizeFactor=0
>>>>>> Waittime=0
>>>>>> #
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> #DefMemPerCPU=0
>>>>>> FastSchedule=1
>>>>>> #MaxMemPerCPU=0
>>>>>> #SchedulerRootFilter=1
>>>>>> #SchedulerTimeSlice=30
>>>>>> SchedulerType=sched/backfill
>>>>>> SchedulerPort=7321
>>>>>> SelectType=select/cons_res
>>>>>> SelectTypeParameters=CR_Core_Memory
>>>>>> #
>>>>>> #
>>>>>> # JOB PRIORITY
>>>>>> #PriorityType=priority/basic
>>>>>> #PriorityDecayHalfLife=
>>>>>> #PriorityCalcPeriod=
>>>>>> #PriorityFavorSmall=
>>>>>> #PriorityMaxAge=
>>>>>> #PriorityUsageResetPeriod=
>>>>>> #PriorityWeightAge=
>>>>>> #PriorityWeightFairshare=
>>>>>> #PriorityWeightJobSize=
>>>>>> #PriorityWeightPartition=
>>>>>> #PriorityWeightQOS=
>>>>>> #
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> #AccountingStorageEnforce=0
>>>>>> #AccountingStorageHost=
>>>>>> AccountingStorageLoc=/var/log/slurm/accounting.txt
>>>>>> #AccountingStoragePass=
>>>>>> #AccountingStoragePort=
>>>>>> AccountingStorageType=accounting_storage/filetxt
>>>>>> #AccountingStorageUser=
>>>>>> AccountingStoreJobComment=YES
>>>>>> ClusterName=cluster
>>>>>> #DebugFlags=
>>>>>> #JobCompHost=
>>>>>> JobCompLoc=/var/log/slurm/slurm.log
>>>>>> #JobCompPass=
>>>>>> #JobCompPort=
>>>>>> JobCompType=jobcomp/filetxt
>>>>>> #JobCompUser=
>>>>>> JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> SlurmctldDebug=3
>>>>>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>>>>>> SlurmdDebug=3
>>>>>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>>>>>> #SlurmSchedLogFile=
>>>>>> #SlurmSchedLogLevel=
>>>>>> #
>>>>>> #
>>>>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>>>> #SuspendProgram=
>>>>>> #ResumeProgram=
>>>>>> #SuspendTimeout=
>>>>>> #ResumeTimeout=
>>>>>> #ResumeRate=
>>>>>> #SuspendExcNodes=
>>>>>> #SuspendExcParts=
>>>>>> #SuspendRate=
>>>>>> #SuspendTime=
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP
>>>>>>
>>>>>> NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP
>>>>>>
>>>>>>
>>>>>>
>>>>>> NTP is running on all the nodes and the clocks are in sync.
>>>>>>
>>>>>> Thank you for your help!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Philippe
>>>>>>
>>>>
>>>>
>>
