That means your slurmctld is older than the command you are executing, so it does not understand the RPC. (A command older than the daemon would have worked fine.)
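A quick way to confirm such a mismatch is to compare what the client tools and the daemon binary each report, e.g. `sinfo --version` against `slurmctld -V`. The sketch below uses hard-coded, illustrative version strings (not taken from this thread) in place of those command outputs:

```shell
#!/bin/sh
# Hypothetical version strings; on a real system capture them with:
#   client_ver=$(sinfo --version)
#   daemon_ver=$(slurmctld -V)
client_ver="slurm 2.6.3"   # command built from source (assumed)
daemon_ver="slurm 2.3.4"   # daemon from the Debian package (assumed)

if [ "$client_ver" = "$daemon_ver" ]; then
    echo "versions match"
else
    echo "MISMATCH: client=$client_ver daemon=$daemon_ver"
fi
```

Any difference in the major/minor version here is enough to produce RPC errors like the one below.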

Quoting [email protected]:


Thank you!

The problem seems to lie there.

slurmctld.log contains this:

error: slurm_receive_msg: Protocol version has changed, re-link your code

In fact, I was using the version of SLURM shipped with Debian. Then I installed the latest version from source in order to have array jobs. The problem seems to come from there.
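Mixing the Debian package with a source install typically leaves two copies of each binary on the system (commonly under /usr and /usr/local, an assumption about the default prefixes). A minimal sketch for spotting the duplicates:

```shell
#!/bin/bash
# List every copy of each SLURM command found on PATH; more than one
# hit for the same command suggests the Debian package and the source
# build coexisting. (The command list is illustrative.)
report=""
for cmd in sinfo squeue scontrol srun; do
    # `type -ap` prints all PATH matches, one per line; empty if none.
    hits=$(type -ap "$cmd" 2>/dev/null | tr '\n' ' ')
    report="$report$cmd: ${hits:-not found on PATH}"$'\n'
done
printf '%s' "$report"
```

If both a /usr and a /usr/local copy show up, removing the slurm-llnl packages (or rebuilding everything from one source tree) keeps the commands and daemons at the same protocol version.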

Philippe



----- Original Message -----
From: "Moe Jette" <[email protected]>
To: "slurm-dev" <[email protected]>
Sent: Tuesday 8 October 2013 17:29:53
Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received


The command and daemon are not communicating. Check your slurmctld log file. There is also a troubleshooting guide online that may prove helpful to you:
http://slurm.schedmd.com/troubleshoot.html

Quoting [email protected]:

>
> Hello,
>
> Thank you for your reply.
>
> It seems to be running.
>
> root@kosmos:~# /etc/init.d/slurm-llnl status
> slurmctld (pid 6093) is running...
> slurmd (pid 6221) is running...
>
>
>
> ----- Original Message -----
>> From: "Danny Auble" <[email protected]>
>> To: "slurm-dev" <[email protected]>
>> Sent: Tuesday 8 October 2013 16:42:11
>> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
>>
>> It doesn't appear your slurmctld is running or responsive.
>>
>>
>> [email protected] wrote:
>>
>>
>> Hello,
>>
>> I obtain the following error message when I try to use SLURM.
>>
>> root@kosmos:~# sinfo
>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
>> received
>> slurm_load_partitions: Zero Bytes were transmitted or received
>>
>>
>>
>> Here is the output of the same command with an increased level of
>> verbosity:
>>
>> root@kosmos:~# sinfo -vv
>> -----------------------------
>> dead        = false
>> exact       = 0
>> filtering   = false
>> format      = %9P %.5a %.10l %.6D %.6t %N
>> iterate     = 0
>> long        = false
>> no_header   = false
>> node_field  = false
>> node_format = false
>> nodes       = n/a
>> part_field  = true
>> partition   = n/a
>> responding  = false
>> states      = (null)
>> sort        = (null)
>> summarize   = false
>> verbose     = 2
>> -----------------------------
>> all_flag        = false
>> avail_flag      = true
>> bg_flag         = false
>> cpus_flag       = false
>> default_time_flag = false
>> disk_flag       = false
>> features_flag   = false
>> groups_flag     = false
>> gres_flag       = false
>> job_size_flag   = false
>> max_time_flag   = true
>> memory_flag     = false
>> partition_flag  = true
>> priority_flag   = false
>> reason_flag     = false
>> reason_timestamp_flag = false
>> reason_user_flag = false
>> reservation_flag = false
>> root_flag       = false
>> share_flag      = false
>> state_flag      = true
>> weight_flag     = false
>> -----------------------------
>>
>> sinfo: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
>> Tue Oct  8 15:30:10 2013
>> sinfo: auth plugin for Munge ( http://code.google.com/p/munge /)
>> loaded
>> sinfo: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
>> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
>> received
>> slurm_load_partitions: Zero Bytes were transmitted or received
>>
>>
>>
>> Everything seems fine with Munge:
>>
>> root@kosmos:~# munge -n|ssh k01 unmunge
>> STATUS:           Success (0)
>> ENCODE_HOST:      kosmos ( 10.3.1.80 )
>> ENCODE_TIME:      2013-10-08 15:30:48 (1381239048)
>> DECODE_TIME:      2013-10-08 15:30:48 (1381239048)
>> TTL:              300
>> CIPHER:           aes128 (4)
>> MAC:              sha1 (3)
>> ZIP:              none (0)
>> UID:              root (0)
>> GID:              root (0)
>> LENGTH:           0
>>
>>
>>
>> Here is the slurm.conf:
>>
>> root@kosmos:~# cat /etc/slurm-llnl/slurm.conf
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ControlMachine=kosmos
>> #ControlAddr=
>> #BackupController=
>> #BackupAddr=
>> #
>> #AuthType=auth/none
>> AuthType=auth/munge
>> CacheGroups=0
>> #CheckpointType=checkpoint/none
>> CryptoType=crypto/munge
>> #CryptoType=crypto/openssl
>> #DisableRootJobs=NO
>> #EnforcePartLimits=NO
>> #Epilog=
>> #PrologSlurmctld=
>> #FirstJobId=1
>> #MaxJobId=999999
>> #GresTypes=
>> #GroupUpdateForce=0
>> #GroupUpdateTime=600
>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa
>> #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub
>> #JobFileAppend=0
>> #JobRequeue=1
>> #JobSubmitPlugins=1
>> #KillOnBadExit=0
>> #Licenses=foo*4,bar
>> #MailProg=/usr/bin/mail
>> #MaxJobCount=5000
>> #MaxStepCount=40000
>> #MaxTasksPerNode=128
>> MpiDefault=none
>> #MpiParams=ports=#-#
>> #PluginDir=
>> #PlugStackConfig=
>> #PrivateData=jobs
>> ProctrackType=proctrack/pgid
>> #Prolog=
>> #PrologSlurmctld=
>> #PropagatePrioProcess=0
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> ReturnToService=1
>> #SallocDefaultCommand=
>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>> SlurmUser=slurm
>> #SrunEpilog=
>> #SrunProlog=
>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>> SwitchType=switch/none
>> #TaskEpilog=
>> TaskPlugin=task/none
>> #TaskPluginParam=
>> #TaskProlog=
>> #TopologyPlugin=topology/tree
>> #TmpFs=/tmp
>> #TrackWCKey=no
>> #TreeWidth=
>> #UnkillableStepProgram=
>> #UsePAM=0
>> #
>> #
>> # TIMERS
>> #BatchStartTimeout=10
>> #CompleteWait=0
>> #EpilogMsgTime=2000
>> #GetEnvTimeout=2
>> #HealthCheckInterval=0
>> #HealthCheckProgram=
>> InactiveLimit=0
>> KillWait=30
>> #MessageTimeout=10
>> #ResvOverRun=0
>> MinJobAge=300
>> #OverTimeLimit=0
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> #UnkillableStepTimeout=60
>> #VSizeFactor=0
>> Waittime=0
>> #
>> #
>> # SCHEDULING
>> #DefMemPerCPU=0
>> FastSchedule=1
>> #MaxMemPerCPU=0
>> #SchedulerRootFilter=1
>> #SchedulerTimeSlice=30
>> SchedulerType=sched/backfill
>> SchedulerPort=7321
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core_Memory
>> #
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/basic
>> #PriorityDecayHalfLife=
>> #PriorityCalcPeriod=
>> #PriorityFavorSmall=
>> #PriorityMaxAge=
>> #PriorityUsageResetPeriod=
>> #PriorityWeightAge=
>> #PriorityWeightFairshare=
>> #PriorityWeightJobSize=
>> #PriorityWeightPartition=
>> #PriorityWeightQOS=
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStorageEnforce=0
>> #AccountingStorageHost=
>> AccountingStorageLoc=/var/log/slurm/accounting.txt
>> #AccountingStoragePass=
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/filetxt
>> #AccountingStorageUser=
>> AccountingStoreJobComment=YES
>> ClusterName=cluster
>> #DebugFlags=
>> #JobCompHost=
>> JobCompLoc=/var/log/slurm/slurm.log
>> #JobCompPass=
>> #JobCompPort=
>> JobCompType=jobcomp/filetxt
>> #JobCompUser=
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>> #SlurmSchedLogFile=
>> #SlurmSchedLogLevel=
>> #
>> #
>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>> #SuspendProgram=
>> #ResumeProgram=
>> #SuspendTimeout=
>> #ResumeTimeout=
>> #ResumeRate=
>> #SuspendExcNodes=
>> #SuspendExcParts=
>> #SuspendRate=
>> #SuspendTime=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP
>>
>> NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP
>>
>>
>>
>> NTP is running on all the nodes and the clocks are in sync.
>>
>> Thank you for your help!
>>
>> Best regards,
>>
>> Philippe
>>


