Thanks for the answer, Chrysovalantis!

It was actually a misconfiguration in munge that was causing the problem.
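
In case it helps anyone else: a quick way to verify munge end to end
(a sketch, assuming passwordless ssh between the nodes) is to encode a
credential on one host and decode it on the other:

    munge -n | ssh torquepbsno1 unmunge

If the keys (or the clocks) don't match, unmunge prints an error instead
of "STATUS: Success".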

Thank you again.

2014-09-08 11:36 GMT-03:00 Chrysovalantis Paschoulas <
[email protected]>:

>  Hi!
>
> At first, I would suggest changing the "ControlAddr" parameter and
> setting it to the local network IP, so that torquepbsno1 can reach the
> server (although in this case lsof shows slurmctld listening on all
> interfaces, "TCP *:6817 (LISTEN)", so it is probably not this option;
> still, the clients may be using that address to connect).
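>
> For example, you can double-check what slurmctld is bound to with either
> of these (ss needs root to show the owning process with -p):
>
>     lsof -i :6817
>     ss -tlnp | grep 6817
>
> An address of "*:6817" means it is listening on all interfaces.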
>
> After that, you should check the firewall rules and perhaps the routing
> configuration.
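>
> For instance, on an iptables-based firewall you could inspect the rules
> and, as a quick test only, open the Slurm ports (a sketch, assuming the
> default ports 6817/6818 as in your slurm.conf):
>
>     iptables -L -n
>     iptables -I INPUT -p tcp --dport 6817:6818 -j ACCEPT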
>
> Other common reasons for clients not being able to connect are a wrong
> Munge configuration (you said yours is OK) or a version mismatch between
> the Slurm daemons.
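>
> To compare the versions you can run, for example:
>
>     slurmctld -V    # on torquepbs
>     slurmd -V       # on torquepbs and torquepbsno1
>
> All of them should report the same release.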
>
> Of course, some logs would be really helpful for understanding the cause
> of your problems ;)
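>
> Based on the slurm.conf below, the SlurmctldLogFile on the server and the
> SlurmdLogFile on each node are the places to look; temporarily raising
> SlurmctldDebug and SlurmdDebug also helps. For example, on a node:
>
>     tail -f /var/log/slurm-llnl/slurmd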
>
> Best Regards,
> Chrysovalantis Paschoulas
>
> Forschungszentrum Juelich - Juelich Supercomputing Centre
>
>
>
> On 09/08/2014 02:53 PM, Erica Riello wrote:
>
>  Hello all,
>
>  I have 2 machines running Slurm 14.03.07, called torquepbs and
> torquepbsno1.
>  Slurmctld is running on torquepbs, and slurmd is running on both
> torquepbs and torquepbsno1.
>  Both machines have the same munge key and the same configuration file
> (slurm.conf), yet torquepbs is up and torquepbsno1 is down.
>  The munge daemon is running on both machines.
>
>  What could be wrong?
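>
>  (In case it is useful: "scontrol show node torquepbsno1" reports the
> node's State and, for a down node, a Reason field saying why Slurm marked
> it down.)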
>
>  slurm.conf:
>
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=torquepbs
> ControlAddr=localhost
> #
> MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> #SchedulerPort=7321
> SelectType=select/linear
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> #SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld
> #SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd
> #
> #
> # COMPUTE NODES
> NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2
> ThreadsPerCore=1 State=UNKNOWN
> NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2
> ThreadsPerCore=1 State=UNKNOWN
> NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2
> ThreadsPerCore=1 State=UNKNOWN
> PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2
> Default=YES MaxTime=INFINITE State=UP
>
>  Thanks in advance.
>
> --
> ===============
> Erica Riello
> Computer Engineering student, PUC-Rio


-- 
===============
Erica Riello
Computer Engineering student, PUC-Rio
