I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3 and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file. I am using MVAPICH2 2.0a (the latest is 2.0b).
I then wrote a simple MPI hello-world program that prints the process
rank and the processor name of whichever node it runs on.
I compiled the code using
mpicc -L/usr/local/lib/slurm -lpmi Hello.c
where /usr/local/lib/slurm is where the SLURM libraries reside.
Compilation and the subsequent commands were all entered on qdr3's
terminal, where slurmctld runs too.
$: salloc -N2 bash
salloc : Granted job allocation 24
$: sbcast a.out /tmp/random.a.out
$: srun /tmp/random.a.out
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
srun: Job step aborted: Waiting upto 2 seconds for job step to finish
srun: error: qdr3: task 0: Exited with exit code 1
srun: error: qdr4: task 1: Exited with exit code 1
I checked the /tmp folder on both qdr3 and qdr4, and each did contain
random.a.out. I can also log in to each machine from the other without
having to use a password.
Other invocations, such as
srun -n4 /tmp/random.a.out
srun -n2 /tmp/random.a.out
srun -n14 /tmp/random.a.out
also fail with similar errors.
What could be going wrong here?
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int tag = 42;
    int size;
    int id;
    int err, source_id;
    int i;
    char msg[80];
    MPI_Status status;

    err = MPI_Init(&argc, &argv);
    if (err != MPI_SUCCESS) {
        printf("MPI_Init failed\n");
        exit(1);
    }
    err = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (err != MPI_SUCCESS) {
        printf("MPI_Comm_size failed\n");
        exit(1);
    }
    err = MPI_Comm_rank(MPI_COMM_WORLD, &id);
    if (err != MPI_SUCCESS) {
        printf("MPI_Comm_rank failed\n");
        exit(1);
    }
    if (size < 2) {
        printf("You have to use at least 2 processes here\n");
        MPI_Finalize();
        exit(0);
    }

    if (id == 0) {
        int length;
        MPI_Get_processor_name(msg, &length);
        printf("Hello World from process %d running on %s\n", id, msg);
        /* Collect one greeting from every other rank. */
        for (i = 1; i < size; i++) {
            err = MPI_Recv(msg, 80, MPI_CHAR, MPI_ANY_SOURCE, tag,
                           MPI_COMM_WORLD, &status);
            if (err != MPI_SUCCESS) {
                printf("Error in MPI_Recv!\n");
                exit(1);
            }
            source_id = status.MPI_SOURCE;
            printf("Hello World from process %d running on %s\n",
                   source_id, msg);
        }
    } else {
        int length;
        MPI_Get_processor_name(msg, &length);
        /* Send length + 1 chars so the terminating '\0' travels too. */
        err = MPI_Send(msg, length + 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            printf("Process %i: Error in MPI_Send!\n", id);
            exit(1);
        }
    }

    printf("Hello World\n");
    err = MPI_Finalize();
    if (err != MPI_SUCCESS) {
        printf("Error in MPI_Finalize\n");
        exit(1);
    }
    return 0;
}
(Attachment: slurm.conf)