Thanks for the answer, Chrysovalantis! It was actually a misconfiguration in munge that was causing the problem.
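For anyone landing here with the same symptom, the usual munge sanity checks that expose this kind of misconfiguration can be sketched as follows. The hostnames (torquepbs, torquepbsno1) are the ones from this thread; everything else assumes a standard munge/Slurm installation reachable over ssh.

```shell
# 1. Encode and decode a credential locally on each machine.
munge -n | unmunge

# 2. Encode on the controller and decode on the compute node; this only
#    succeeds if both hosts share the same munge.key and their clocks agree.
munge -n | ssh torquepbsno1 unmunge

# 3. Confirm the key really is identical on both hosts.
md5sum /etc/munge/munge.key
ssh torquepbsno1 md5sum /etc/munge/munge.key

# 4. Rule out a Slurm daemon version mismatch while you are at it.
slurmctld -V
ssh torquepbsno1 slurmd -V

# 5. Check that the node can reach slurmctld's port (6817 by default)
#    through any firewall.
ssh torquepbsno1 'nc -zv torquepbs 6817'
```

Step 2 is the decisive one: a cross-host `munge -n | ssh <node> unmunge` fails immediately on mismatched keys, bad key permissions, or excessive clock skew, which covers most "node stays down" reports like this one.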
Thank you again.

2014-09-08 11:36 GMT-03:00 Chrysovalantis Paschoulas <[email protected]>:

> Hi!
>
> At first, I would suggest changing the "ControlAddr" parameter and
> setting it to the local-network IP so that torquepbsno1 can reach the
> server (although in your case, with lsof, I can see that slurmctld is
> listening on all interfaces, "TCP *:6817 (LISTEN)", so maybe it is not
> this option; maybe that IP is what the clients use...).
>
> After that, you should check the firewall rules, or maybe the routing
> configuration.
>
> Other common reasons for clients not being able to connect are a wrong
> Munge configuration (you said it is OK) or a version mismatch between
> the Slurm daemons.
>
> Of course, some logs would be really helpful for understanding the
> cause of your problems ;)
>
> Best Regards,
> Chrysovalantis Paschoulas
>
> Forschungszentrum Juelich - Juelich Supercomputing Centre
>
>
> On 09/08/2014 02:53 PM, Erica Riello wrote:
>
> Hello all,
>
> I have two machines running Slurm 14.03.07, called torquepbs and
> torquepbsno1. Slurmctld is running on torquepbs, and slurmd is running
> on both torquepbs and torquepbsno1. They both have the same munge key
> and the same configuration file (slurm.conf), but torquepbs is up and
> torquepbsno1 is down. The munge daemon is running on both machines.
>
> What could be wrong?
>
> slurm.conf:
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=torquepbs
> ControlAddr=localhost
> #
> MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> #SchedulerPort=7321
> SelectType=select/linear
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> #SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmcltd
> #SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd
> #
> #
> # COMPUTE NODES
> NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP
>
> Thanks in advance.
>
> --
> ===============
> Erica Riello
> Computer Engineering student, PUC-Rio

--
===============
Erica Riello
Computer Engineering student, PUC-Rio
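The NodeName definitions in the quoted slurm.conf can also be cross-checked against the hardware that slurmd itself detects, which is a quick way to catch a definition/hardware mismatch. A small sketch, assuming Slurm is installed on the node (the hostname is from this thread):

```shell
# slurmd -C prints the node's detected hardware (CPUs, Sockets,
# CoresPerSocket, ThreadsPerCore, RealMemory) in slurm.conf syntax,
# so its output can be compared directly with that node's NodeName line.
ssh torquepbsno1 slurmd -C
```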
