Might be unrelated, but did you update your compute nodes as well?
You have two PartitionName=uag lines in your slurm.conf. The second one is
ignored, meaning that kosmos is not included in that partition.
(You should see something like "slurmctld: error: _parse_part_spec: duplicate
entry for partition uag, ignoring" in your logs.)
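
If kosmos is meant to belong to the same partition as the other nodes, a single
partition line listing all of them (just a sketch, adapt to what you actually
want) avoids the duplicate:

  PartitionName=uag Nodes=k0[1-3],kosmos Default=YES MaxTime=INFINITE State=UP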

Do "sinfo -V" and "slurmctld -V" produce the same output?
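
Since you mentioned installing a source build on top of the Debian packages,
it may also be worth checking which binaries actually get picked up and
whether the old packages are still installed. Roughly (adjust to your setup):

  # list every sinfo/slurmctld found on the PATH
  which -a sinfo slurmctld
  # compare client and controller versions
  sinfo -V
  slurmctld -V
  # check whether the Debian slurm packages are still installed
  dpkg -l | grep -i slurm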

HTH

damien

On 08 Oct 2013, at 19:59, [email protected] wrote:

> 
> It does not seem to solve the problem.
> 
> I have updated everything (I even did a make clean before compiling)
> and have checked that I have the new versions of the different components,
> but the problem persists.
> 
> # sinfo 
> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
> slurm_load_partitions: Zero Bytes were transmitted or received
> 
> # tail /var/log/slurm-llnl/slurmctld.log -n 3
> [2013-10-08T19:53:44] error: Invalid Protocol Version 6656 from uid=0 at 10.3.1.80:38100
> [2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, re-link your code
> [2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, re-link your code
> 
> 
> Thanks. 
> 
> 
> ----- Original Message -----
>> From: "Moe Jette" <[email protected]>
>> To: "slurm-dev" <[email protected]>
>> Sent: Tuesday, 8 October 2013 18:46:53
>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were
>> transmitted or received
>> 
>> 
>> That means your slurmctld is older than the command you are executing,
>> so it does not understand the RPC. (Commands older than the daemon
>> would have worked fine.)
>> 
>> Quoting [email protected]:
>> 
>>> 
>>> Thank you!
>>> 
>>> The problem seems to lie there.
>>> 
>>> slurmctld.log contains this:
>>> 
>>> error: slurm_receive_msg: Protocol version has changed, re-link your code
>>> 
>>> In fact, I was using the version of SLURM that comes with Debian. Then
>>> I installed the latest version from source in order to have array jobs.
>>> The problem seems to come from there.
>>> 
>>> Philippe
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Moe Jette" <[email protected]>
>>>> To: "slurm-dev" <[email protected]>
>>>> Sent: Tuesday, 8 October 2013 17:29:53
>>>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
>>>> were transmitted or received
>>>> 
>>>> 
>>>> The command and daemon are not communicating. Check your slurmctld
>>>> log
>>>> file. There is also a troubleshooting guide online that may prove
>>>> helpful to you:
>>>> http://slurm.schedmd.com/troubleshoot.html
>>>> 
>>>> Quoting [email protected]:
>>>> 
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> Thank you for your reply.
>>>>> 
>>>>> It seems to be running.
>>>>> 
>>>>> root@kosmos:~# /etc/init.d/slurm-llnl status
>>>>> slurmctld (pid 6093) is running...
>>>>> slurmd (pid 6221) is running...
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>>> From: "Danny Auble" <[email protected]>
>>>>>> To: "slurm-dev" <[email protected]>
>>>>>> Sent: Tuesday, 8 October 2013 16:42:11
>>>>>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
>>>>>> were transmitted or received
>>>>>> 
>>>>>> It doesn't appear your slurmctld is running or responsive.
>>>>>> 
>>>>>> 
>>>>>> [email protected] wrote:
>>>>>> 
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I obtain the following error message when I try to use SLURM.
>>>>>> 
>>>>>> root@kosmos:~# sinfo
>>>>>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
>>>>>> received
>>>>>> slurm_load_partitions: Zero Bytes were transmitted or received
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Here is the output of the same command with an increased level of
>>>>>> verbosity:
>>>>>> 
>>>>>> root@kosmos:~# sinfo -vv
>>>>>> -----------------------------
>>>>>> dead        = false
>>>>>> exact       = 0
>>>>>> filtering   = false
>>>>>> format      = %9P %.5a %.10l %.6D %.6t %N
>>>>>> iterate     = 0
>>>>>> long        = false
>>>>>> no_header   = false
>>>>>> node_field  = false
>>>>>> node_format = false
>>>>>> nodes       = n/a
>>>>>> part_field  = true
>>>>>> partition   = n/a
>>>>>> responding  = false
>>>>>> states      = (null)
>>>>>> sort        = (null)
>>>>>> summarize   = false
>>>>>> verbose     = 2
>>>>>> -----------------------------
>>>>>> all_flag        = false
>>>>>> avail_flag      = true
>>>>>> bg_flag         = false
>>>>>> cpus_flag       = false
>>>>>> default_time_flag =false
>>>>>> disk_flag       = false
>>>>>> features_flag   = false
>>>>>> groups_flag     = false
>>>>>> gres_flag       = false
>>>>>> job_size_flag   = false
>>>>>> max_time_flag   = true
>>>>>> memory_flag     = false
>>>>>> partition_flag  = true
>>>>>> priority_flag   = false
>>>>>> reason_flag     = false
>>>>>> reason_timestamp_flag = false
>>>>>> reason_user_flag = false
>>>>>> reservation_flag = false
>>>>>> root_flag       = false
>>>>>> share_flag      = false
>>>>>> state_flag      = true
>>>>>> weight_flag     = false
>>>>>> -----------------------------
>>>>>> 
>>>>>> sinfo: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
>>>>>> Tue Oct  8 15:30:10 2013
>>>>>> sinfo: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>>>>>> sinfo: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
>>>>>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
>>>>>> received
>>>>>> slurm_load_partitions: Zero Bytes were transmitted or received
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Everything seems fine with Munge:
>>>>>> 
>>>>>> root@kosmos:~# munge -n|ssh k01 unmunge
>>>>>> STATUS:           Success (0)
>>>>>> ENCODE_HOST:      kosmos ( 10.3.1.80 )
>>>>>> ENCODE_TIME:      2013-10-08 15:30:48 (1381239048)
>>>>>> DECODE_TIME:      2013-10-08 15:30:48 (1381239048)
>>>>>> TTL:              300
>>>>>> CIPHER:           aes128 (4)
>>>>>> MAC:              sha1 (3)
>>>>>> ZIP:              none (0)
>>>>>> UID:              root (0)
>>>>>> GID:              root (0)
>>>>>> LENGTH:           0
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Here is the slurm.conf:
>>>>>> 
>>>>>> root@kosmos:~# cat /etc/slurm-llnl/slurm.conf
>>>>>> # slurm.conf file generated by configurator.html.
>>>>>> # Put this file on all nodes of your cluster.
>>>>>> # See the slurm.conf man page for more information.
>>>>>> #
>>>>>> ControlMachine=kosmos
>>>>>> #ControlAddr=
>>>>>> #BackupController=
>>>>>> #BackupAddr=
>>>>>> #
>>>>>> #AuthType=auth/none
>>>>>> AuthType=auth/munge
>>>>>> CacheGroups=0
>>>>>> #CheckpointType=checkpoint/none
>>>>>> CryptoType=crypto/munge
>>>>>> #CryptoType=crypto/openssl
>>>>>> #DisableRootJobs=NO
>>>>>> #EnforcePartLimits=NO
>>>>>> #Epilog=
>>>>>> #PrologSlurmctld=
>>>>>> #FirstJobId=1
>>>>>> #MaxJobId=999999
>>>>>> #GresTypes=
>>>>>> #GroupUpdateForce=0
>>>>>> #GroupUpdateTime=600
>>>>>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>>>>>> #JobCredentialPrivateKey=
>>>>>> #JobCredentialPublicCertificate=
>>>>>> #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa
>>>>>> #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub
>>>>>> #JobFileAppend=0
>>>>>> #JobRequeue=1
>>>>>> #JobSubmitPlugins=1
>>>>>> #KillOnBadExit=0
>>>>>> #Licenses=foo*4,bar
>>>>>> #MailProg=/usr/bin/mail
>>>>>> #MaxJobCount=5000
>>>>>> #MaxStepCount=40000
>>>>>> #MaxTasksPerNode=128
>>>>>> MpiDefault=none
>>>>>> #MpiParams=ports=#-#
>>>>>> #PluginDir=
>>>>>> #PlugStackConfig=
>>>>>> #PrivateData=jobs
>>>>>> ProctrackType=proctrack/pgid
>>>>>> #Prolog=
>>>>>> #PrologSlurmctld=
>>>>>> #PropagatePrioProcess=0
>>>>>> #PropagateResourceLimits=
>>>>>> #PropagateResourceLimitsExcept=
>>>>>> ReturnToService=1
>>>>>> #SallocDefaultCommand=
>>>>>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>>>>>> SlurmctldPort=6817
>>>>>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>>>>>> SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>>>>>> SlurmUser=slurm
>>>>>> #SrunEpilog=
>>>>>> #SrunProlog=
>>>>>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>>>>>> SwitchType=switch/none
>>>>>> #TaskEpilog=
>>>>>> TaskPlugin=task/none
>>>>>> #TaskPluginParam=
>>>>>> #TaskProlog=
>>>>>> #TopologyPlugin=topology/tree
>>>>>> #TmpFs=/tmp
>>>>>> #TrackWCKey=no
>>>>>> #TreeWidth=
>>>>>> #UnkillableStepProgram=
>>>>>> #UsePAM=0
>>>>>> #
>>>>>> #
>>>>>> # TIMERS
>>>>>> #BatchStartTimeout=10
>>>>>> #CompleteWait=0
>>>>>> #EpilogMsgTime=2000
>>>>>> #GetEnvTimeout=2
>>>>>> #HealthCheckInterval=0
>>>>>> #HealthCheckProgram=
>>>>>> InactiveLimit=0
>>>>>> KillWait=30
>>>>>> #MessageTimeout=10
>>>>>> #ResvOverRun=0
>>>>>> MinJobAge=300
>>>>>> #OverTimeLimit=0
>>>>>> SlurmctldTimeout=120
>>>>>> SlurmdTimeout=300
>>>>>> #UnkillableStepTimeout=60
>>>>>> #VSizeFactor=0
>>>>>> Waittime=0
>>>>>> #
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> #DefMemPerCPU=0
>>>>>> FastSchedule=1
>>>>>> #MaxMemPerCPU=0
>>>>>> #SchedulerRootFilter=1
>>>>>> #SchedulerTimeSlice=30
>>>>>> SchedulerType=sched/backfill
>>>>>> SchedulerPort=7321
>>>>>> SelectType=select/cons_res
>>>>>> SelectTypeParameters=CR_Core_Memory
>>>>>> #
>>>>>> #
>>>>>> # JOB PRIORITY
>>>>>> #PriorityType=priority/basic
>>>>>> #PriorityDecayHalfLife=
>>>>>> #PriorityCalcPeriod=
>>>>>> #PriorityFavorSmall=
>>>>>> #PriorityMaxAge=
>>>>>> #PriorityUsageResetPeriod=
>>>>>> #PriorityWeightAge=
>>>>>> #PriorityWeightFairshare=
>>>>>> #PriorityWeightJobSize=
>>>>>> #PriorityWeightPartition=
>>>>>> #PriorityWeightQOS=
>>>>>> #
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> #AccountingStorageEnforce=0
>>>>>> #AccountingStorageHost=
>>>>>> AccountingStorageLoc=/var/log/slurm/accounting.txt
>>>>>> #AccountingStoragePass=
>>>>>> #AccountingStoragePort=
>>>>>> AccountingStorageType=accounting_storage/filetxt
>>>>>> #AccountingStorageUser=
>>>>>> AccountingStoreJobComment=YES
>>>>>> ClusterName=cluster
>>>>>> #DebugFlags=
>>>>>> #JobCompHost=
>>>>>> JobCompLoc=/var/log/slurm/slurm.log
>>>>>> #JobCompPass=
>>>>>> #JobCompPort=
>>>>>> JobCompType=jobcomp/filetxt
>>>>>> #JobCompUser=
>>>>>> JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> SlurmctldDebug=3
>>>>>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>>>>>> SlurmdDebug=3
>>>>>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>>>>>> #SlurmSchedLogFile=
>>>>>> #SlurmSchedLogLevel=
>>>>>> #
>>>>>> #
>>>>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>>>> #SuspendProgram=
>>>>>> #ResumeProgram=
>>>>>> #SuspendTimeout=
>>>>>> #ResumeTimeout=
>>>>>> #ResumeRate=
>>>>>> #SuspendExcNodes=
>>>>>> #SuspendExcParts=
>>>>>> #SuspendRate=
>>>>>> #SuspendTime=
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP
>>>>>> 
>>>>>> NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> NTP is running on all the nodes and the clocks are in sync.
>>>>>> 
>>>>>> Thank you for your help!
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Philippe
>>>>>> 
>>>> 
>>>> 
>> 
