Hi!

First, I would suggest changing the "ControlAddr" parameter and setting it to the 
local-network IP so that torquepbsno1 can reach the server. (In our case, lsof shows 
slurmctld listening on all interfaces, "TCP *:6817 (LISTEN)", so this option may not be 
the problem itself, but the address may still be the one the clients use to contact the controller.)
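As a minimal sketch, the change could look like this (assuming torquepbs resolves to a LAN address the compute nodes can reach; the exact name or IP is site-specific):

```
ControlMachine=torquepbs
# Point ControlAddr at a name/IP reachable from the compute nodes,
# not at localhost, which only resolves back to the controller itself:
ControlAddr=torquepbs
```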

After that, check the firewall rules and possibly the routing configuration.
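A quick way to test reachability from torquepbsno1, independent of Slurm itself, is a plain TCP connect to the controller port. A minimal Python sketch (the hostname and port 6817 match the defaults in your slurm.conf; the helper name is mine):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Run this on torquepbsno1; SlurmctldPort defaults to 6817.
# If it prints False, a firewall or routing problem is likely.
print(port_open("torquepbs", 6817))
```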

Other common reasons for clients failing to connect are a wrong Munge 
configuration (you said that is OK) or a version mismatch between the Slurm daemons.
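On the version point: daemons from different Slurm releases often cannot talk to each other, so slurmctld and slurmd should come from the same major.minor release (here 14.03). A tiny sketch of that check (the helper name is mine, not part of Slurm; compare the outputs of `slurmctld -V` and `slurmd -V`):

```python
def same_release(version_a, version_b):
    """Compare only the MAJOR.MINOR part of two Slurm version strings."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]

print(same_release("14.03.7", "14.03.7"))  # same release
print(same_release("14.03.7", "2.6.9"))   # mismatched releases
```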

Of course, some logs would be really helpful for understanding the cause of your 
problem ;)

Best Regards,
Chrysovalantis Paschoulas

Forschungszentrum Juelich - Juelich Supercomputing Centre


On 09/08/2014 02:53 PM, Erica Riello wrote:
Hello all,

I have 2 machines running Slurm 14.03.07, called torquepbs and torquepbsno1.
Slurmctld is running on torquepbs, and slurmd is running on both torquepbs 
and torquepbsno1.
They both have the same munge key and the same configuration file (slurm.conf), 
but torquepbs is up and torquepbsno1 is down.
Munge daemon is running on both machines.

What can be wrong?

slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=torquepbs
ControlAddr=localhost
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd
#
#
# COMPUTE NODES
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP

Thanks in advance.

--
===============
Erica Riello
Computer Engineering student, PUC-Rio





