Hi! First, I would suggest changing the "ControlAddr" parameter to the control machine's local-network IP so that torquepbsno1 can reach the server. (Although in our case lsof shows slurmctld listening on all interfaces, "TCP *:6817 (LISTEN)", so maybe it is not this option — but the clients may still be using that address to connect.)
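For example, assuming torquepbs's address on the cluster network is 192.168.1.10 (a placeholder — substitute the real IP), the relevant slurm.conf lines would be:

```
ControlMachine=torquepbs
ControlAddr=192.168.1.10
```

With ControlAddr=localhost, slurmd on torquepbsno1 would try to reach slurmctld at 127.0.0.1, i.e. at itself, which would explain the node showing as down even though the controller is listening on all interfaces.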
After that, you should check the firewall rules and perhaps the routing configuration. Other common reasons for clients not being able to connect are a wrong MUNGE configuration (you said it is OK) and a version mismatch between the Slurm daemons. Of course, some logs would be really helpful for understanding the cause of your problems ;)

Best Regards,
Chrysovalantis Paschoulas
Forschungszentrum Juelich - Juelich Supercomputing Centre

On 09/08/2014 02:53 PM, Erica Riello wrote:

Hello all,

I have 2 machines running Slurm 14.03.07, called torquepbs and torquepbsno1. slurmctld is running on torquepbs, and there is a slurmd running on both torquepbs and torquepbsno1. They both have the same MUNGE key and the same configuration file (slurm.conf), but torquepbs is up and torquepbsno1 is down. The munge daemon is running on both machines. What can be wrong?

slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=torquepbs
ControlAddr=localhost
# MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmcltd
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd
#
#
# COMPUTE NODES
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP

Thanks in advance.

--
Erica Riello
Computer Engineering student
PUC-Rio
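P.S. A quick sanity check along the lines above, assuming ssh access between the two machines (standard Slurm and MUNGE commands; adjust hostnames to your setup):

```shell
# Same Slurm version on both nodes? A mismatch can keep slurmd from registering.
slurmd -V
ssh torquepbsno1 slurmd -V

# Can a MUNGE credential created on the controller be decoded on the node?
munge -n | ssh torquepbsno1 unmunge

# Can the node actually reach slurmctld (port 6817 by default)?
ssh torquepbsno1 scontrol ping

# Check the daemon logs configured in slurm.conf
tail -n 50 /var/log/slurm-llnl/slurmd      # on torquepbsno1
tail -n 50 /var/log/slurm-llnl/slurmcltd   # on torquepbs (note: the path in slurm.conf says "slurmcltd")
```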
