Now I have the same Open MPI version, 1.3.2, recompiled on both nodes, and it works again on each node separately:

node1:

cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
mpirun (Open MPI) 1.3.2

cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals: 20
1: pi = 0.798498008827023
2: pi = 0.773339953424083
3: pi = 0.747089984650041
0: pi = 0.822248040052981
pi = 3.141175986954128

node2 (PS3):

root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
mpirun (Open MPI) 1.3.2
[...]
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
Input number of intervals: 20
0: pi = 1.595587993477064
1: pi = 1.545587993477064
pi = 3.141175986954128

BUT when I start it on node1 with more than 16 processes and the hostfile, I get these errors:

cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
This installation of Open MPI was configured without support for
heterogeneous architectures, but at least one node in the allocation
was detected to have a different architecture. The detected node was:

  Node: bioclust

In order to operate in a heterogeneous environment, please reconfigure
Open MPI with --enable-heterogeneous.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.
This failure appears to be an internal failure; here's some additional
information (which may only be relevant to an Open MPI developer):

  ompi_proc_set_arch failed
  --> Returned "Not supported" (-8) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1239] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[... the same four-line abort message repeats for the remaining bioclust PIDs 1238-1253 and for kasimir PID 12678 ...]
--------------------------------------------------------------------------
mpirun has exited due to process rank 16 with PID 12678 on
node 10.4.1.23 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[bioclust:01236] 16 more processes have sent help message help-mpi-runtime / heterogeneous-support-unavailable
[bioclust:01236] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[bioclust:01236] 16 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
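The first help message above names the fix directly: on a mixed x86-64/PowerPC cluster like this one (bioclust is little-endian x86-64, the PS3's Cell is big-endian PowerPC), Open MPI must be built with heterogeneous support on every node. A rough sketch of checking and rebuilding, assuming both installs come from the 1.3.2 source tarball (the install prefix below is a placeholder):

```shell
# Check whether the local build was configured for mixed architectures;
# ompi_info should report a "Heterogeneous support" line (yes/no).
ompi_info | grep -i heterogeneous

# If it says "no", rebuild on BOTH nodes with the flag from the error
# message. /opt/openmpi-1.3.2 is a placeholder prefix.
cd openmpi-1.3.2
./configure --prefix=/opt/openmpi-1.3.2 --enable-heterogeneous
make -j4
make install
```

Both nodes must end up with heterogeneous support enabled; a job spanning one build with it and one without will still fail at MPI_Init.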
>>> Lenny Verkhovsky <lenny.verkhov...@gmail.com> 17.11.2009 16:52 >>>
I noticed that you also have different versions of OMPI: you have 1.3.2 on node1 and 1.3 on node2. Can you try to put the same version of OMPI on both nodes? Can you also try running -np 16 on node1 when you run separately?
Lenny.

On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.muel...@umit.at> wrote:

>>> Ralph Castain 11/17/09 4:04 PM >>>
> Your cmd line is telling OMPI to run 17 processes. Since your hostfile indicates that only 16 of them are to
> run on 10.4.23.107 (which I assume is your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume
> this is node1?).

node1 has 16 cores (4 x AMD quad-core processors); node2 is the PS3 with two processors (slots).

> I would guess that the executable is compiled to run on the PS3 given your specified path, so I would
> expect it to bomb on node1 - which is exactly what appears to be happening.

The executable is compiled on each node separately and lies on each node in the same directory, /mnt/projects/PS3Cluster/Benchmark/pi. On each node different directories are mounted, so there is a separate executable compiled on each node.

In the end I want to run R on this cluster with Rmpi. As I get a similar problem there, I first wanted to try with a C program. With R the same thing happens: it works when I start it on each node, but if I want to start more than 16 processes on node1 it exits.

On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:

Hi,

I want to build a cluster with Open MPI.

2 nodes:
node 1: 4 x AMD Quad Core, Ubuntu 9.04, Open MPI 1.3.2
node 2: Sony PS3, Ubuntu 9.04, Open MPI 1.3

Both can connect with ssh to each other and to themselves without a password. I can run the sample program pi.c on both nodes separately (see below).
But if I try to start it on node1 with the --hostfile option to use node2 "remotely", I get this error:

cluster@bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

My hostfile:

cluster@bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
10.4.23.107 slots=16
10.4.1.23 slots=2

I can see with top that the processors of node2 begin to work shortly, then it aborts on node1.

I use this sample/test program:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int i, n;
    double h, pi, x;
    int me, nprocs;
    double piece;

    /* --------------------------------------------------- */
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank (MPI_COMM_WORLD, &me);
    /* --------------------------------------------------- */
    if (me == 0) {
        printf("%s", "Input number of intervals:\n");
        scanf ("%d", &n);
    }
    /* --------------------------------------------------- */
    MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    /* --------------------------------------------------- */
    h = 1. / (double) n;
    piece = 0.;
    for (i = me + 1; i <= n; i += nprocs) {
        x = (i - 1) * h;
        piece = piece + (4/(1+(x)*(x)) + 4/(1+(x+h)*(x+h))) / 2 * h;
    }
    printf("%d: pi = %25.15f\n", me, piece);
    /* --------------------------------------------------- */
    MPI_Reduce (&piece, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    /* --------------------------------------------------- */
    if (me == 0) {
        printf("pi = %25.15f\n", pi);
    }
    /* --------------------------------------------------- */
    MPI_Finalize();
    return 0;
}

It works on each node.
node1:

cluster@bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals: 20
0: pi = 0.822248040052981
2: pi = 0.773339953424083
3: pi = 0.747089984650041
1: pi = 0.798498008827023
pi = 3.141175986954128

node2:

cluster@kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals: 5
1: pi = 1.267463056905495
0: pi = 1.867463056905495
pi = 3.134926113810990
cluster@kasimir:~$

Thx in advance,
Laurin
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users