I have a cluster with two nodes qdr3 and qdr4. I run slurmctld on qdr3 and
slurmd on qdr3 and qdr4 both. I have attached the slurm.conf file. I am
using MVAPICH2 2.0a (the latest is 2.0b).
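I am not 100% sure my MVAPICH2 build was configured for SLURM's process
manager; if I remember the user guide correctly, the SLURM recipe is roughly
the following (flags from memory, so please correct me if they are wrong):

```shell
# Build MVAPICH2 against SLURM's process manager and PMI-1 interface
# (configure flags as I recall them from the MVAPICH2 user guide):
./configure --with-pm=slurm --with-pmi=pmi1
make && make install
```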

I then wrote a simple MPI hello world program that prints each process's
rank and the name of the node it is running on.

I compiled the code using
mpicc -L/usr/local/lib/slurm -lpmi Hello.c

where /usr/local/lib/slurm is the place where slurm libraries reside.
Compilation and the subsequent commands were all entered in qdr3's
terminal, where slurmctld runs too.
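In case link order matters here: with GNU ld, libraries are resolved left to
right, so -lpmi should come after the source file. A sketch of the build line
I believe is correct, plus a check that PMI actually got linked (paths are the
ones from my setup above):

```shell
# Put the -L/-l flags after the source file, since ld resolves
# symbols left to right; /usr/local/lib/slurm is my SLURM lib path.
mpicc -o a.out Hello.c -L/usr/local/lib/slurm -lpmi

# Verify the resulting binary really resolved SLURM's PMI library:
ldd ./a.out | grep -i pmi
```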


$: salloc -N2 bash
salloc : Granted job allocation 24
$: sbcast a.out /tmp/random.a.out
$: srun /tmp/random.a.out
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)

slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
srun: Job step aborted: Waiting upto 2 seconds for job step to finish
srun: error: qdr3: task 0: Exited with exit code 1
srun: error: qdr4: task 1: Exited with exit code 1


I checked the /tmp folder on both qdr3 and qdr4, and each contained
random.a.out. I can also ssh from each machine to the other without a
password.

Variations such as
                  srun -n4 /tmp/random.a.out
                  srun -n2 /tmp/random.a.out
                  srun -n14 /tmp/random.a.out
also fail with similar errors.
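One thing I have not ruled out is which PMI plugin srun is using. As far as I
know, SLURM can list the MPI plugin types it was built with and be told one
explicitly (the available names vary with the SLURM build, so these are just
examples):

```shell
# List the MPI/PMI plugin types this SLURM installation supports:
srun --mpi=list

# Then request one explicitly when launching, e.g.:
srun --mpi=pmi2 -n2 /tmp/random.a.out
```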


What could be going wrong here? For reference, Hello.c is below:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    const int tag = 42;
    int size;
    int id;
    int err, source_id;
    int i;
    char msg[80];
    MPI_Status status;

    err = MPI_Init(&argc, &argv);
    if (err != MPI_SUCCESS) {
        printf("MPI_Init failed\n");
        exit(1);
    }

    err = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (err != MPI_SUCCESS) {
        printf("MPI_Comm_size failed\n");
        exit(1);
    }

    err = MPI_Comm_rank(MPI_COMM_WORLD, &id);
    if (err != MPI_SUCCESS) {
        printf("MPI_Comm_rank failed\n");
        exit(1);
    }

    if (size < 2) {
        printf("You have to use at least 2 processes here\n");
        MPI_Finalize();
        exit(0);
    }

    if (id == 0) {
        int length;
        MPI_Get_processor_name(msg, &length);
        printf("Hello World from process %d running on %s\n", id, msg);
        for (i = 1; i < size; i++) {
            err = MPI_Recv(msg, 80, MPI_CHAR, MPI_ANY_SOURCE, tag,
                           MPI_COMM_WORLD, &status);
            if (err != MPI_SUCCESS) {
                printf("Error in MPI_Recv!\n");
                exit(1);
            }
            source_id = status.MPI_SOURCE;
            printf("Hello World from process %d running on %s\n",
                   source_id, msg);
        }
    }
    else {
        int length;
        MPI_Get_processor_name(msg, &length);
        /* Send length + 1 chars so the terminating '\0' travels too. */
        err = MPI_Send(msg, length + 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            printf("Process %i: Error in MPI_Send!\n", id);
            exit(1);
        }
    }

    printf("Hello World\n");
    err = MPI_Finalize();
    if (err != MPI_SUCCESS) {
        printf("Error in MPI_Finalize\n");
        exit(1);
    }
    return 0;
}


Attachment: slurm.conf
Description: Binary data
