Hi Tim,

Forgive me if the answers to these are buried in the documentation you sent:

1. What version of MPICH2 are you running? (Is it the same on both compute nodes?)
2. Does "srun -N2 hostname" work as expected?
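For #1, the "mpich2version" utility (if your MPICH2 install provides it) will report this on each node. For #2, "as expected" means one line per compute node, something like:

$ srun -N2 hostname
computenode1
computenode2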

Andy

On 07/02/2012 07:27 AM, Tim Butters wrote:
Subject: Problem with MPICH2 communication between nodes
Hi,

First of all, thanks in advance for any help, and I apologise if I have missed something simple here; I can't seem to get things working properly.

I have SLURM installed on a small cluster with one control node and two compute nodes (each compute node has 48 cores). Everything works fine for jobs running on just one compute node, but when I run an MPI (MPICH2) job that spans both compute nodes, I get the following error:

$ srun -n49 ./a.out
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174).....................: MPI_Send(buf=0xb3e078, count=22, MPI_CHAR, dest=0, tag=50, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
srun: error: computenode2: task 48: Exited with exit code 1
Hello from 1 computenode1
Hello from 2 computenode1
............etc.

I get the results from the first compute node ("Hello from 1 computenode1", etc.), then it seems to hang indefinitely.

The jobs are compiled using mpic++ -L/usr/lib64/slurm -lpmi hello.cc.
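For reference, hello.cc is essentially the textbook MPI hello-world, along the lines of the sketch below (illustrative rather than verbatim): each non-zero rank sends a short greeting to rank 0 with tag 50, and rank 0 prints the messages as they arrive.

// hello.cc -- minimal sketch of the test program (illustrative, not the exact source)
#include <mpi.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    char msg[64];
    snprintf(msg, sizeof(msg), "Hello from %d %s", rank, host);

    if (rank == 0) {
        // Rank 0 prints its own greeting, then collects one from every other rank.
        printf("%s\n", msg);
        for (int i = 1; i < size; ++i) {
            char buf[64];
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE, 50,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", buf);
        }
    } else {
        // Every other rank sends its greeting to rank 0; this is the
        // MPI_Send (dest=0, tag=50) that fails for the ranks on computenode2.
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 50, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}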

If run using sbatch, the error file contains this line:
"srun: error: slurm_send_recv_rc_msg_only_one: Connection timed out"

I have run slurmctld and slurmd in terminals (-vvvvvvvvv) but haven't been able to find anything useful in the messages; I have attached the output to this email (controldnode.txt, computenode1.txt, and computenode2.txt). All IP addresses have been replaced with <*.*.*.*>. I have also attached the log files for slurmctld and slurmd from the two compute nodes.

Many thanks for your help,

Tim

scontrol show config:
Configuration data as of 2012-07-02T11:41:01
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageLoc    = /var/log/slurm_jobacct.log
AccountingStoragePort   = 0
AccountingStorageType   = accounting_storage/none
AccountingStorageUser   = root
AccountingStoreJobComment = YES
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2012-07-02T11:40:40
CacheGroups             = 0
CheckpointType          = checkpoint/none
ClusterName             = cluster
CompleteWait            = 0 sec
ControlAddr             = <*.*.*.*>
ControlMachine          = controlnode
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DisableRootJobs         = NO
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30 sec
JobAcctGatherType       = jobacct_gather/none
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 0
KillWait                = 30 sec
Licenses                = (null)
MailProg                = /bin/mail
MaxJobCount             = 10000
MaxJobId                = 4294901760
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 205
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/pgid
Prolog                  = (null)
PrologSlurmctld         = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 1
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU
SlurmUser               = slurm(494)
SlurmctldDebug          = 3
SlurmctldLogFile        = /var/log/slurm/slurmctld
SlurmSchedLogFile       = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = 3
SlurmdLogFile           = /var/log/slurm/slurmd
SlurmdPidFile           = /var/run/slurm/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm/slurmd
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurm/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.3.5
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /var/tmp
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/none
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/none
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at controlnode/(NULL) are UP/DOWN


--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP


