Hi, I have set up an Xgrid consisting of one laptop and 7 Mac mini nodes (all dual-core machines). I have also installed Open MPI (version 1.2.1) on all nodes. The laptop node (hostname: sib) has three roles: agent, controller, and client; all the other nodes are agents only.
When I run "mpirun -n 8 /bin/hostname" from a terminal on the laptop node, it prints all 8 nodes' hostnames correctly, so Xgrid itself seems to work fine. Then I wanted to run a simple MPI program. The source file "Hello.c" was compiled with mpicc, and the executable "Hello" was copied to every node under the same path (I also verified that it runs properly locally on each node). When I ask for 1 or 2 processors, the job runs fine, but when I ask for 3 or more processors, every job fails. Below are the commands I ran and the results/messages I got. Can anybody help me out?

*************************************
running "hostname"; the results look good
*************************************
sib:sharcnet$ mpirun -n 8 /bin/hostname
node2
node8
node4
node5
node3
node7
sib
node6

*************************************
the simple MPI program Hello.c source code
*************************************
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

    MPI_Finalize();
    return 0;
}

*************************************
asking for 1 and 2 processors to run "Hello"; the results are all good
*************************************
sib:sharcnet$ mpirun -n 1 ~/openMPI_stuff/Hello
Process 0 on sib out of 1

sib:sharcnet$ mpirun -n 2 ~/openMPI_stuff/Hello
Process 1 on node2 out of 2
Process 0 on sib out of 2

*************************************
here is the problem: when I ask for 3 processors to run the job, these are all the messages I get
*************************************
sib:sharcnet$ mpirun -n 3 ~/openMPI_stuff/Hello

Process 0.1.1 is unable to reach 0.1.2 for MPI communication.
If you specified the use of a BTL component, you may have forgotten a
component (such as "self") in the list of usable components.

Process 0.1.2 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have forgotten a
component (such as "self") in the list of usable components.

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

mpirun noticed that job rank 0 with PID 817 on node xgrid-node-0 exited
on signal 15 (Terminated).

sib:sharcnet$
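*************************************
a diagnostic I am thinking of trying next (just a guess based on the error's hint about the "self" BTL; "-mca btl tcp,self" and "-mca btl_base_verbose" are standard Open MPI options, but I don't know whether they behave differently under the Xgrid launcher)
*************************************
sib:sharcnet$ mpirun -mca btl tcp,self -n 3 ~/openMPI_stuff/Hello
sib:sharcnet$ mpirun -mca btl_base_verbose 30 -n 3 ~/openMPI_stuff/Hello

Would explicitly listing the BTLs like this, or the verbose BTL output, help narrow down why ranks on different nodes cannot reach each other?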