Thank you!

The problem seems to lie there.

slurmctld.log contains this:

error: slurm_receive_msg: Protocol version has changed, re-link your code

In fact, I was using the version of SLURM packaged with Debian. I then installed 
the latest version from source in order to have array jobs. The problem seems to 
come from there.
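For anyone hitting the same error: the "Protocol version has changed, re-link your code" message means the client commands and the daemons were built from different SLURM releases, and they must agree at the major.minor level to speak the same RPC protocol. A quick way to check each component is `sinfo --version`, `slurmctld -V`, and `slurmd -V` on every node. The helper below is a minimal sketch (the version strings in the example are hypothetical, not taken from this thread) for comparing two reported versions:

```shell
# SLURM clients and daemons must agree on protocol version (major.minor).
# Inspect what each component reports with:
#   sinfo --version ; slurmctld -V ; slurmd -V   (run on every node)
# Helper: compare two reported version strings at the major.minor level.
same_protocol() {
  [ "$(printf '%s\n' "$1" | cut -d. -f1-2)" = \
    "$(printf '%s\n' "$2" | cut -d. -f1-2)" ]
}

# Hypothetical example: a distro-packaged 2.3.x vs a 2.6.x source build.
if same_protocol "slurm 2.3.4" "slurm 2.6.2"; then
  echo "versions compatible"
else
  echo "protocol mismatch: install matching versions everywhere"
fi
```

After reinstalling, make sure the old binaries are no longer first on the PATH and restart both slurmctld and slurmd so everything links against the same release.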

Philippe



----- Original Message -----
> From: "Moe Jette" <[email protected]>
> To: "slurm-dev" <[email protected]>
> Sent: Tuesday, October 8, 2013 17:29:53
> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were 
> transmitted or received
> 
> 
> The command and daemon are not communicating. Check your slurmctld
> log
> file. There is also a troubleshooting guide online that may prove
> helpful to you:
> http://slurm.schedmd.com/troubleshoot.html
> 
> Quoting [email protected]:
> 
> >
> > Hello,
> >
> > Thank you for your reply.
> >
> > It seems to be running.
> >
> > root@kosmos:~# /etc/init.d/slurm-llnl status
> > slurmctld (pid 6093) is running...
> > slurmd (pid 6221) is running...
> >
> >
> >
> > ----- Original Message -----
> >> From: "Danny Auble" <[email protected]>
> >> To: "slurm-dev" <[email protected]>
> >> Sent: Tuesday, October 8, 2013 16:42:11
> >> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
> >> were transmitted or received
> >>
> >> It doesn't appear your slurmctld is running or responsive.
> >>
> >>
> >> [email protected] wrote:
> >>
> >>
> >> Hello,
> >>
> >> I obtain the following error message when I try to use SLURM.
> >>
> >> root@kosmos:~# sinfo
> >> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
> >> received
> >> slurm_load_partitions: Zero Bytes were transmitted or received
> >>
> >>
> >>
> >> Here is the output of same command with an increased level of
> >> verbosity:
> >>
> >> root@kosmos:~# sinfo -vv
> >> -----------------------------
> >> dead        = false
> >> exact       = 0
> >> filtering   = false
> >> format      = %9P %.5a %.10l %.6D %.6t %N
> >> iterate     = 0
> >> long        = false
> >> no_header   = false
> >> node_field  = false
> >> node_format = false
> >> nodes       = n/a
> >> part_field  = true
> >> partition   = n/a
> >> responding  = false
> >> states      = (null)
> >> sort        = (null)
> >> summarize   = false
> >> verbose     = 2
> >> -----------------------------
> >> all_flag        = false
> >> avail_flag      = true
> >> bg_flag         = false
> >> cpus_flag       = false
> >> default_time_flag = false
> >> disk_flag       = false
> >> features_flag   = false
> >> groups_flag     = false
> >> gres_flag       = false
> >> job_size_flag   = false
> >> max_time_flag   = true
> >> memory_flag     = false
> >> partition_flag  = true
> >> priority_flag   = false
> >> reason_flag     = false
> >> reason_timestamp_flag = false
> >> reason_user_flag = false
> >> reservation_flag = false
> >> root_flag       = false
> >> share_flag      = false
> >> state_flag      = true
> >> weight_flag     = false
> >> -----------------------------
> >>
> >> sinfo: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
> >> Tue Oct  8 15:30:10 2013
> >> sinfo: auth plugin for Munge (http://code.google.com/p/munge/)
> >> loaded
> >> sinfo: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> >> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
> >> received
> >> slurm_load_partitions: Zero Bytes were transmitted or received
> >>
> >>
> >>
> >> Everything seems fine with Munge:
> >>
> >> root@kosmos:~# munge -n|ssh k01 unmunge
> >> STATUS:           Success (0)
> >> ENCODE_HOST:      kosmos ( 10.3.1.80 )
> >> ENCODE_TIME:      2013-10-08 15:30:48 (1381239048)
> >> DECODE_TIME:      2013-10-08 15:30:48 (1381239048)
> >> TTL:              300
> >> CIPHER:           aes128 (4)
> >> MAC:              sha1 (3)
> >> ZIP:              none (0)
> >> UID:              root (0)
> >> GID:              root (0)
> >> LENGTH:           0
> >>
> >>
> >>
> >> Here is the slurm.conf:
> >>
> >> root@kosmos:~# cat /etc/slurm-llnl/slurm.conf
> >> # slurm.conf file generated by configurator.html.
> >> # Put this file on all nodes of your cluster.
> >> # See the slurm.conf man page for more information.
> >> #
> >> ControlMachine=kosmos
> >> #ControlAddr=
> >> #BackupController=
> >> #BackupAddr=
> >> #
> >> #AuthType=auth/none
> >> AuthType=auth/munge
> >> CacheGroups=0
> >> #CheckpointType=checkpoint/none
> >> CryptoType=crypto/munge
> >> #CryptoType=crypto/openssl
> >> #DisableRootJobs=NO
> >> #EnforcePartLimits=NO
> >> #Epilog=
> >> #PrologSlurmctld=
> >> #FirstJobId=1
> >> #MaxJobId=999999
> >> #GresTypes=
> >> #GroupUpdateForce=0
> >> #GroupUpdateTime=600
> >> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> >> #JobCredentialPrivateKey=
> >> #JobCredentialPublicCertificate=
> >> #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa
> >> #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub
> >> #JobFileAppend=0
> >> #JobRequeue=1
> >> #JobSubmitPlugins=1
> >> #KillOnBadExit=0
> >> #Licenses=foo*4,bar
> >> #MailProg=/usr/bin/mail
> >> #MaxJobCount=5000
> >> #MaxStepCount=40000
> >> #MaxTasksPerNode=128
> >> MpiDefault=none
> >> #MpiParams=ports=#-#
> >> #PluginDir=
> >> #PlugStackConfig=
> >> #PrivateData=jobs
> >> ProctrackType=proctrack/pgid
> >> #Prolog=
> >> #PrologSlurmctld=
> >> #PropagatePrioProcess=0
> >> #PropagateResourceLimits=
> >> #PropagateResourceLimitsExcept=
> >> ReturnToService=1
> >> #SallocDefaultCommand=
> >> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> >> SlurmctldPort=6817
> >> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> >> SlurmdPort=6818
> >> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> >> SlurmUser=slurm
> >> #SrunEpilog=
> >> #SrunProlog=
> >> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> >> SwitchType=switch/none
> >> #TaskEpilog=
> >> TaskPlugin=task/none
> >> #TaskPluginParam=
> >> #TaskProlog=
> >> #TopologyPlugin=topology/tree
> >> #TmpFs=/tmp
> >> #TrackWCKey=no
> >> #TreeWidth=
> >> #UnkillableStepProgram=
> >> #UsePAM=0
> >> #
> >> #
> >> # TIMERS
> >> #BatchStartTimeout=10
> >> #CompleteWait=0
> >> #EpilogMsgTime=2000
> >> #GetEnvTimeout=2
> >> #HealthCheckInterval=0
> >> #HealthCheckProgram=
> >> InactiveLimit=0
> >> KillWait=30
> >> #MessageTimeout=10
> >> #ResvOverRun=0
> >> MinJobAge=300
> >> #OverTimeLimit=0
> >> SlurmctldTimeout=120
> >> SlurmdTimeout=300
> >> #UnkillableStepTimeout=60
> >> #VSizeFactor=0
> >> Waittime=0
> >> #
> >> #
> >> # SCHEDULING
> >> #DefMemPerCPU=0
> >> FastSchedule=1
> >> #MaxMemPerCPU=0
> >> #SchedulerRootFilter=1
> >> #SchedulerTimeSlice=30
> >> SchedulerType=sched/backfill
> >> SchedulerPort=7321
> >> SelectType=select/cons_res
> >> SelectTypeParameters=CR_Core_Memory
> >> #
> >> #
> >> # JOB PRIORITY
> >> #PriorityType=priority/basic
> >> #PriorityDecayHalfLife=
> >> #PriorityCalcPeriod=
> >> #PriorityFavorSmall=
> >> #PriorityMaxAge=
> >> #PriorityUsageResetPeriod=
> >> #PriorityWeightAge=
> >> #PriorityWeightFairshare=
> >> #PriorityWeightJobSize=
> >> #PriorityWeightPartition=
> >> #PriorityWeightQOS=
> >> #
> >> #
> >> # LOGGING AND ACCOUNTING
> >> #AccountingStorageEnforce=0
> >> #AccountingStorageHost=
> >> AccountingStorageLoc=/var/log/slurm/accounting.txt
> >> #AccountingStoragePass=
> >> #AccountingStoragePort=
> >> AccountingStorageType=accounting_storage/filetxt
> >> #AccountingStorageUser=
> >> AccountingStoreJobComment=YES
> >> ClusterName=cluster
> >> #DebugFlags=
> >> #JobCompHost=
> >> JobCompLoc=/var/log/slurm/slurm.log
> >> #JobCompPass=
> >> #JobCompPort=
> >> JobCompType=jobcomp/filetxt
> >> #JobCompUser=
> >> JobAcctGatherFrequency=30
> >> JobAcctGatherType=jobacct_gather/linux
> >> SlurmctldDebug=3
> >> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> >> SlurmdDebug=3
> >> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> >> #SlurmSchedLogFile=
> >> #SlurmSchedLogLevel=
> >> #
> >> #
> >> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> >> #SuspendProgram=
> >> #ResumeProgram=
> >> #SuspendTimeout=
> >> #ResumeTimeout=
> >> #ResumeRate=
> >> #SuspendExcNodes=
> >> #SuspendExcParts=
> >> #SuspendRate=
> >> #SuspendTime=
> >> #
> >> #
> >> # COMPUTE NODES
> >> NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> >> PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP
> >>
> >> NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> >> PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP
> >>
> >>
> >>
> >> NTP is running on all the nodes and the clocks are in sync.
> >>
> >> Thank you for your help!
> >>
> >> Best regards,
> >>
> >> Philippe
> >>
> 
> 
