Bingo! This is why we ask for info on how you configure OMPI :-) You need to rebuild OMPI with --enable-heterogeneous. Because there is additional overhead associated with running hetero configurations, and so few people do so, it is disabled by default.
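Roughly, the rebuild looks like this on each node - both the x86_64 box and the PS3 need it (this is only a sketch: the install prefix below is an example, use whatever location and configure options you normally build with):

    # in the Open MPI 1.3.2 source tree on each node
    ./configure --enable-heterogeneous --prefix=/opt/openmpi-1.3.2
    make all
    make install

    # afterwards, verify that the new build actually has it enabled
    ompi_info | grep -i hetero
    # should report something like: "Heterogeneous support: yes"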
On Nov 18, 2009, at 2:55 AM, Laurin Müller wrote:

> Now I have the same Open MPI version on both nodes: 1.3.2.
>
> Recalculated on both nodes, and it works again on each node separately:
>
> node1:
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
> mpirun (Open MPI) 1.3.2
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
> /etc/openmpi/openmpi-default-hostfile -np 4
> /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 20
> 1: pi = 0.798498008827023
> 2: pi = 0.773339953424083
> 3: pi = 0.747089984650041
> 0: pi = 0.822248040052981
> pi = 3.141175986954128
>
> node2 (PS3):
> root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
> mpirun (Open MPI) 1.3.2
> [...]
> root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
> Input number of intervals:
> 20
> 0: pi = 1.595587993477064
> 1: pi = 1.545587993477064
> pi = 3.141175986954128
>
> BUT when I start it on node1 with more than 16 processes and the hostfile,
> I get these errors:
>
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
> /etc/openmpi/openmpi-default-hostfile -np 17
> /mnt/projects/PS3Cluster/Benchmark/pi
> --------------------------------------------------------------------------
> This installation of Open MPI was configured without support for
> heterogeneous architectures, but at least one node in the allocation
> was detected to have a different architecture. The detected node was:
>
>   Node: bioclust
>
> In order to operate in a heterogeneous environment, please reconfigure
> Open MPI with --enable-heterogeneous.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_proc_set_arch failed
>   --> Returned "Not supported" (-8) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1239] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> [... the same MPI_Init abort message repeats for the other bioclust
> processes (PIDs 1238, 1240-1253) and for rank 16 on kasimir (PID 12678) ...]
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 16 with PID 12678 on
> node 10.4.1.23 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [bioclust:01236] 16 more processes have sent help message help-mpi-runtime /
> heterogeneous-support-unavailable
> [bioclust:01236] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [bioclust:01236] 16 more processes have sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
>
> >>> Lenny Verkhovsky <lenny.verkhov...@gmail.com> 17.11.2009 16:52 >>>
> I noticed that you also have different versions of OMPI: you have 1.3.2 on
> node1 and 1.3 on node2.
> Can you try to put the same version of OMPI on both nodes?
> Can you also try running -np 16 on node1 when you run it separately?
> Lenny.
>
> On Tue, Nov 17, 2009 at 5:45 PM, Laurin Müller <laurin.muel...@umit.at> wrote:
>
> > >>> Ralph Castain 11/17/09 4:04 PM >>>
>
> > Your cmd line is telling OMPI to run 17 processes. Since your hostfile
> > indicates that only 16 of them are to run on 10.4.23.107 (which I assume is
> > your PS3 node?), 1 process is going to be run on 10.4.1.23 (I assume this
> > is node1?).
>
> node1 has 16 cores (4 x AMD quad-core processors);
> node2 is the PS3 with two processors (slots).
>
> > I would guess that the executable is compiled to run on the PS3 given your
> > specified path, so I would expect it to bomb on node1 - which is exactly
> > what appears to be happening.
>
> The executable is compiled on each node separately and lies on each node in
> the same directory: /mnt/projects/PS3Cluster/Benchmark/pi
> Different directories are mounted on each node, so there is a separate
> executable compiled on each node.
>
> In the end I want to run R on this cluster with Rmpi. Since I get a similar
> problem there, I first wanted to try with a C program.
> With R the same thing happens: it works when I start it on each node, but if
> I want to start more than 16 processes on node one it exits.
>
> On Nov 17, 2009, at 1:59 AM, Laurin Müller wrote:
>
>> Hi,
>> I want to build a cluster with openmpi.
>> 2 nodes:
>> node 1: 4 x AMD Quad Core, Ubuntu 9.04, openmpi 1.3.2
>> node 2: Sony PS3, Ubuntu 9.04, openmpi 1.3
>> Both can connect with ssh to each other and to themselves without a password.
>> I can run the sample program pi.c on both nodes separately (see below). But
>> if I try to start it on node1 with the --hostfile option to use node 2
>> "remotely", I get this error:
>> cluster@bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile
>> -np 17 /mnt/projects/PS3Cluster/Benchmark/pi
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> my hostfile:
>> cluster@bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
>> 10.4.23.107 slots=16
>> 10.4.1.23 slots=2
>> I can see with top that the processors of node2 begin to work briefly; then
>> it aborts on node1.
>> I use this sample/test program:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include "mpi.h"
>>
>> int main(int argc, char *argv[])
>> {
>>   int i, n;
>>   double h, pi, x;
>>   int me, nprocs;
>>   double piece;
>>
>>   MPI_Init (&argc, &argv);
>>   MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
>>   MPI_Comm_rank (MPI_COMM_WORLD, &me);
>>
>>   /* rank 0 reads the number of intervals */
>>   if (me == 0)
>>   {
>>     printf("%s", "Input number of intervals:\n");
>>     scanf ("%d", &n);
>>   }
>>
>>   MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>
>>   /* each rank integrates its share of the intervals (trapezoidal rule) */
>>   h = 1. / (double) n;
>>   piece = 0.;
>>   for (i = me + 1; i <= n; i += nprocs)
>>   {
>>     x = (i - 1) * h;
>>     piece = piece + ( 4/(1+(x)*(x)) + 4/(1+(x+h)*(x+h))) / 2 * h;
>>   }
>>   printf("%d: pi = %25.15f\n", me, piece);
>>
>>   /* sum the partial results on rank 0 */
>>   MPI_Reduce (&piece, &pi, 1, MPI_DOUBLE,
>>               MPI_SUM, 0, MPI_COMM_WORLD);
>>
>>   if (me == 0)
>>   {
>>     printf("pi = %25.15f\n", pi);
>>   }
>>
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> It works on each node.
>>
>> node1:
>> cluster@bioclust:~$ mpirun -np 4 /mnt/projects/PS3Cluster/Benchmark/pi
>> Input number of intervals:
>> 20
>> 0: pi = 0.822248040052981
>> 2: pi = 0.773339953424083
>> 3: pi = 0.747089984650041
>> 1: pi = 0.798498008827023
>> pi = 3.141175986954128
>>
>> node2:
>> cluster@kasimir:~$ mpirun -np 2 /mnt/projects/PS3Cluster/Benchmark/pi
>> Input number of intervals:
>> 5
>> 1: pi = 1.267463056905495
>> 0: pi = 1.867463056905495
>> pi = 3.134926113810990
>> cluster@kasimir:~$
>>
>> Thx in advance,
>> Laurin
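Once both installations are rebuilt with --enable-heterogeneous, the cross-node launch quoted above should get past MPI_Init. As a sketch only (the hostfile contents and paths are the ones from the thread; the mapping of 10.4.23.107 to bioclust and 10.4.1.23 to kasimir is inferred from the log output, and each node keeps its own architecture-specific build of pi at the same path):

    cluster@bioclust:~$ cat /etc/openmpi/openmpi-default-hostfile
    10.4.23.107 slots=16
    10.4.1.23 slots=2

    cluster@bioclust:~$ mpirun --hostfile /etc/openmpi/openmpi-default-hostfile \
        -np 17 /mnt/projects/PS3Cluster/Benchmark/pi

    # an MPMD-style command line makes the split across the two nodes explicit;
    # the exact per-app-context options may vary slightly between Open MPI versions
    cluster@bioclust:~$ mpirun -host 10.4.23.107 -np 16 /mnt/projects/PS3Cluster/Benchmark/pi : \
        -host 10.4.1.23 -np 1 /mnt/projects/PS3Cluster/Benchmark/pi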