It does not seem to solve the problem.

I have updated everything (I even ran a make clean before compiling) and 
have checked that all the components are at the new version, but the 
problem persists.

# sinfo 
sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurm_load_partitions: Zero Bytes were transmitted or received

# tail /var/log/slurm-llnl/slurmctld.log -n 3
[2013-10-08T19:53:44] error: Invalid Protocol Version 6656 from uid=0 at 
10.3.1.80:38100
[2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, 
re-link your code
[2013-10-08T19:53:44] error: slurm_receive_msg: Protocol version has changed, 
re-link your code


Thanks. 


----- Original Message -----
> From: "Moe Jette" <[email protected]>
> To: "slurm-dev" <[email protected]>
> Sent: Tuesday, October 8, 2013 18:46:53
> Objet: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes were 
> transmitted or received
> 
> 
> That means your slurmctld is older than the command you are executing
> and it does not understand the RPC. (Older commands than the daemon
> would have worked fine).
> 
> Quoting [email protected]:
> 
> >
> > Thank you!
> >
> > The problem seems to lie there.
> >
> > slurmctld.log contains this:
> >
> > error: slurm_receive_msg: Protocol version has changed, re-link
> > your code
> >
> > In fact, I was using the version of SLURM shipped with Debian. Then
> > I
> > installed the latest version from source in order to get array jobs.
> > The problem seems to come from there.
> >
> > Philippe
> >
> >
> >
> > ----- Original Message -----
> >> From: "Moe Jette" <[email protected]>
> >> To: "slurm-dev" <[email protected]>
> >> Sent: Tuesday, October 8, 2013 17:29:53
> >> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero Bytes
> >> were transmitted or received
> >>
> >>
> >> The command and daemon are not communicating. Check your slurmctld
> >> log
> >> file. There is also a troubleshooting guide online that may prove
> >> helpful to you:
> >> http://slurm.schedmd.com/troubleshoot.html
> >>
> >> Quoting [email protected]:
> >>
> >> >
> >> > Hello,
> >> >
> >> > Thank you for your reply.
> >> >
> >> > It seems to be running.
> >> >
> >> > root@kosmos:~# /etc/init.d/slurm-llnl status
> >> > slurmctld (pid 6093) is running...
> >> > slurmd (pid 6221) is running...
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "Danny Auble" <[email protected]>
> >> >> To: "slurm-dev" <[email protected]>
> >> >> Sent: Tuesday, October 8, 2013 16:42:11
> >> >> Subject: [slurm-dev] Re: sinfo: error: slurm_receive_msg: Zero
> >> >> Bytes
> >> >> were transmitted or received
> >> >>
> >> >> It doesn't appear your slurmctld is running or responsive.
> >> >>
> >> >>
> >> >> [email protected] wrote:
> >> >>
> >> >>
> >> >> Hello,
> >> >>
> >> >> I obtain the following error message when I try to use SLURM.
> >> >>
> >> >> root@kosmos:~# sinfo
> >> >> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
> >> >> received
> >> >> slurm_load_partitions: Zero Bytes were transmitted or received
> >> >>
> >> >>
> >> >>
> >> >> Here is the output of same command with an increased level of
> >> >> verbosity:
> >> >>
> >> >> root@kosmos:~# sinfo -vv
> >> >> -----------------------------
> >> >> dead        = false
> >> >> exact       = 0
> >> >> filtering   = false
> >> >> format      = %9P %.5a %.10l %.6D %.6t %N
> >> >> iterate     = 0
> >> >> long        = false
> >> >> no_header   = false
> >> >> node_field  = false
> >> >> node_format = false
> >> >> nodes       = n/a
> >> >> part_field  = true
> >> >> partition   = n/a
> >> >> responding  = false
> >> >> states      = (null)
> >> >> sort        = (null)
> >> >> summarize   = false
> >> >> verbose     = 2
> >> >> -----------------------------
> >> >> all_flag        = false
> >> >> avail_flag      = true
> >> >> bg_flag         = false
> >> >> cpus_flag       = false
> >> >> default_time_flag = false
> >> >> disk_flag       = false
> >> >> features_flag   = false
> >> >> groups_flag     = false
> >> >> gres_flag       = false
> >> >> job_size_flag   = false
> >> >> max_time_flag   = true
> >> >> memory_flag     = false
> >> >> partition_flag  = true
> >> >> priority_flag   = false
> >> >> reason_flag     = false
> >> >> reason_timestamp_flag = false
> >> >> reason_user_flag = false
> >> >> reservation_flag = false
> >> >> root_flag       = false
> >> >> share_flag      = false
> >> >> state_flag      = true
> >> >> weight_flag     = false
> >> >> -----------------------------
> >> >>
> >> >> sinfo: debug:  Reading slurm.conf file:
> >> >> /etc/slurm-llnl/slurm.conf
> >> >> Tue Oct  8 15:30:10 2013
> >> >> sinfo: auth plugin for Munge (http://code.google.com/p/munge/)
> >> >> loaded
> >> >> sinfo: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> >> >> sinfo: error: slurm_receive_msg: Zero Bytes were transmitted or
> >> >> received
> >> >> slurm_load_partitions: Zero Bytes were transmitted or received
> >> >>
> >> >>
> >> >>
> >> >> Everything seems fine with Munge:
> >> >>
> >> >> root@kosmos:~# munge -n|ssh k01 unmunge
> >> >> STATUS:           Success (0)
> >> >> ENCODE_HOST:      kosmos ( 10.3.1.80 )
> >> >> ENCODE_TIME:      2013-10-08 15:30:48 (1381239048)
> >> >> DECODE_TIME:      2013-10-08 15:30:48 (1381239048)
> >> >> TTL:              300
> >> >> CIPHER:           aes128 (4)
> >> >> MAC:              sha1 (3)
> >> >> ZIP:              none (0)
> >> >> UID:              root (0)
> >> >> GID:              root (0)
> >> >> LENGTH:           0
> >> >>
> >> >>
> >> >>
> >> >> Here is the slurm.conf:
> >> >>
> >> >> root@kosmos:~# cat /etc/slurm-llnl/slurm.conf
> >> >> # slurm.conf file generated by configurator.html.
> >> >> # Put this file on all nodes of your cluster.
> >> >> # See the slurm.conf man page for more information.
> >> >> #
> >> >> ControlMachine=kosmos
> >> >> #ControlAddr=
> >> >> #BackupController=
> >> >> #BackupAddr=
> >> >> #
> >> >> #AuthType=auth/none
> >> >> AuthType=auth/munge
> >> >> CacheGroups=0
> >> >> #CheckpointType=checkpoint/none
> >> >> CryptoType=crypto/munge
> >> >> #CryptoType=crypto/openssl
> >> >> #DisableRootJobs=NO
> >> >> #EnforcePartLimits=NO
> >> >> #Epilog=
> >> >> #PrologSlurmctld=
> >> >> #FirstJobId=1
> >> >> #MaxJobId=999999
> >> >> #GresTypes=
> >> >> #GroupUpdateForce=0
> >> >> #GroupUpdateTime=600
> >> >> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> >> >> #JobCredentialPrivateKey=
> >> >> #JobCredentialPublicCertificate=
> >> >> #JobCredentialPrivateKey=/home/slurm/ssl/id_rsa
> >> >> #JobCredentialPublicCertificate=/home/slurm/ssl/id_rsa.pub
> >> >> #JobFileAppend=0
> >> >> #JobRequeue=1
> >> >> #JobSubmitPlugins=1
> >> >> #KillOnBadExit=0
> >> >> #Licenses=foo*4,bar
> >> >> #MailProg=/usr/bin/mail
> >> >> #MaxJobCount=5000
> >> >> #MaxStepCount=40000
> >> >> #MaxTasksPerNode=128
> >> >> MpiDefault=none
> >> >> #MpiParams=ports=#-#
> >> >> #PluginDir=
> >> >> #PlugStackConfig=
> >> >> #PrivateData=jobs
> >> >> ProctrackType=proctrack/pgid
> >> >> #Prolog=
> >> >> #PrologSlurmctld=
> >> >> #PropagatePrioProcess=0
> >> >> #PropagateResourceLimits=
> >> >> #PropagateResourceLimitsExcept=
> >> >> ReturnToService=1
> >> >> #SallocDefaultCommand=
> >> >> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> >> >> SlurmctldPort=6817
> >> >> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> >> >> SlurmdPort=6818
> >> >> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> >> >> SlurmUser=slurm
> >> >> #SrunEpilog=
> >> >> #SrunProlog=
> >> >> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> >> >> SwitchType=switch/none
> >> >> #TaskEpilog=
> >> >> TaskPlugin=task/none
> >> >> #TaskPluginParam=
> >> >> #TaskProlog=
> >> >> #TopologyPlugin=topology/tree
> >> >> #TmpFs=/tmp
> >> >> #TrackWCKey=no
> >> >> #TreeWidth=
> >> >> #UnkillableStepProgram=
> >> >> #UsePAM=0
> >> >> #
> >> >> #
> >> >> # TIMERS
> >> >> #BatchStartTimeout=10
> >> >> #CompleteWait=0
> >> >> #EpilogMsgTime=2000
> >> >> #GetEnvTimeout=2
> >> >> #HealthCheckInterval=0
> >> >> #HealthCheckProgram=
> >> >> InactiveLimit=0
> >> >> KillWait=30
> >> >> #MessageTimeout=10
> >> >> #ResvOverRun=0
> >> >> MinJobAge=300
> >> >> #OverTimeLimit=0
> >> >> SlurmctldTimeout=120
> >> >> SlurmdTimeout=300
> >> >> #UnkillableStepTimeout=60
> >> >> #VSizeFactor=0
> >> >> Waittime=0
> >> >> #
> >> >> #
> >> >> # SCHEDULING
> >> >> #DefMemPerCPU=0
> >> >> FastSchedule=1
> >> >> #MaxMemPerCPU=0
> >> >> #SchedulerRootFilter=1
> >> >> #SchedulerTimeSlice=30
> >> >> SchedulerType=sched/backfill
> >> >> SchedulerPort=7321
> >> >> SelectType=select/cons_res
> >> >> SelectTypeParameters=CR_Core_Memory
> >> >> #
> >> >> #
> >> >> # JOB PRIORITY
> >> >> #PriorityType=priority/basic
> >> >> #PriorityDecayHalfLife=
> >> >> #PriorityCalcPeriod=
> >> >> #PriorityFavorSmall=
> >> >> #PriorityMaxAge=
> >> >> #PriorityUsageResetPeriod=
> >> >> #PriorityWeightAge=
> >> >> #PriorityWeightFairshare=
> >> >> #PriorityWeightJobSize=
> >> >> #PriorityWeightPartition=
> >> >> #PriorityWeightQOS=
> >> >> #
> >> >> #
> >> >> # LOGGING AND ACCOUNTING
> >> >> #AccountingStorageEnforce=0
> >> >> #AccountingStorageHost=
> >> >> AccountingStorageLoc=/var/log/slurm/accounting.txt
> >> >> #AccountingStoragePass=
> >> >> #AccountingStoragePort=
> >> >> AccountingStorageType=accounting_storage/filetxt
> >> >> #AccountingStorageUser=
> >> >> AccountingStoreJobComment=YES
> >> >> ClusterName=cluster
> >> >> #DebugFlags=
> >> >> #JobCompHost=
> >> >> JobCompLoc=/var/log/slurm/slurm.log
> >> >> #JobCompPass=
> >> >> #JobCompPort=
> >> >> JobCompType=jobcomp/filetxt
> >> >> #JobCompUser=
> >> >> JobAcctGatherFrequency=30
> >> >> JobAcctGatherType=jobacct_gather/linux
> >> >> SlurmctldDebug=3
> >> >> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> >> >> SlurmdDebug=3
> >> >> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> >> >> #SlurmSchedLogFile=
> >> >> #SlurmSchedLogLevel=
> >> >> #
> >> >> #
> >> >> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> >> >> #SuspendProgram=
> >> >> #ResumeProgram=
> >> >> #SuspendTimeout=
> >> >> #ResumeTimeout=
> >> >> #ResumeRate=
> >> >> #SuspendExcNodes=
> >> >> #SuspendExcParts=
> >> >> #SuspendRate=
> >> >> #SuspendTime=
> >> >> #
> >> >> #
> >> >> # COMPUTE NODES
> >> >> NodeName=k0[1-3] CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> >> >> PartitionName=uag Nodes=k0[1-3] Default=YES MaxTime=INFINITE State=UP
> >> >>
> >> >> NodeName=kosmos CPUs=32 RealMemory=129186 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> >> >> PartitionName=uag Nodes=kosmos Default=YES MaxTime=INFINITE State=UP
> >> >>
> >> >>
> >> >>
> >> >> NTP is running on all the nodes and the clocks are in sync.
> >> >>
> >> >> Thank you for your help!
> >> >>
> >> >> Best regards,
> >> >>
> >> >> Philippe
> >> >>
> >>
> >>
> 
> 