A rather stable production code that has worked with various versions of MPI on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.
That led me to this thread. I made some very small changes to Eugene's code; here's the diff:

$ diff testorig.c billtest.c
3,5c3,4
<
< #define N 40000
< #define M 40000
---
> #define N 8000
> #define M 8000
17c16
<
---
> fprintf (stderr, "Initialized\n");
32,33c31,39
< MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0,
< rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
---
> {
> if ((me == 0) && (i % 100 == 0))
> {
> fprintf (stderr, "%d\n", i);
> }
> MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
> MPI_COMM_WORLD, &status);
> }
>

Basically, it prints some occasional progress, and M and N are smaller.

I'm running on a new dual-socket Intel Nehalem system with CentOS 5.4. I compiled gcc-4.4.2 and openmpi myself with all the defaults, except that I had to point gcc at mpfr-2.4.1.

If I run:

$ mpirun -np 4 ./billtest

about 1 in 2 times I get something like:

[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
<hang>

The next run worked. The one after that:

[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
<hang>

The next run hung at 7100. The one after that worked.

If I strace it when hung I get something like:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)

If I run gdb on a hung job (compiled with -O4 -g):

(gdb) bt
#0  0x00002ab3b34cb385 in ompi_request_default_wait () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#1  0x00002ab3b34f0d48 in PMPI_Sendrecv () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#2  0x0000000000400b88 in main (argc=1, argv=0x7fff083fd298) at billtest.c:36
(gdb)

If I recompile with -O1 I get the same thing. Even with just -g I get the same thing.
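To put a number on the hang rate instead of rerunning by hand, something like the following could be used. This is only a sketch: run_with_timeout and count_hangs are names made up here (nothing from Open MPI), it assumes a POSIX system, and "still running after timeout_sec seconds" is just a guess at what counts as a hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: run argv[] and wait at most timeout_sec seconds.
 * Returns 0 if the command exited by itself (whatever its exit code,
 * including a failed exec), 1 if it was still running at the deadline
 * (treated as a hang and killed), -1 on fork/wait error. */
int run_with_timeout(char *const argv[], int timeout_sec)
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    const struct timespec tick = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    time_t deadline = time(NULL) + timeout_sec;

    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0;               /* finished before the deadline */
        if (r < 0)
            return -1;
        if (time(NULL) >= deadline)
            break;
        nanosleep(&tick, NULL);     /* poll again shortly */
    }

    /* SIGKILL is blunt; it only targets the process we started. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return 1;
}

/* Run the same command `runs` times and return how many runs hung. */
int count_hangs(char *const argv[], int runs, int timeout_sec)
{
    int hangs = 0;
    for (int i = 0; i < runs; i++)
        if (run_with_timeout(argv, timeout_sec) == 1)
            hangs++;
    return hangs;
}
```

Called with argv = {"mpirun", "-np", "4", "./billtest", NULL}, a run count of 20, and a timeout comfortably longer than a normal run, count_hangs would give the number of hung runs out of 20.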
If I compile the application with gcc-4.3 but still use a gcc-4.4-compiled openmpi, I still get hangs. If I compile openmpi-1.3.3 with gcc-4.3 and the application with gcc-4.3 and run it 20 times, I get zero hangs. It seems that gcc-4.4 and openmpi-1.3.3 are incompatible.

In my production code I'd always hang at MPI_Waitall, but the above is obviously inside of Sendrecv.

To be paranoid I just reran it 40 times without a hang.

Original code below.

Eugene Loh wrote:
...
> #include <stdio.h>
> #include <mpi.h>
>
> #define N 40000
> #define M 40000
>
> int main(int argc, char **argv) {
>   int np, me, i, top, bottom;
>   float sbuf[N], rbuf[N];
>   MPI_Status status;
>
>   MPI_Init(&argc,&argv);
>   MPI_Comm_size(MPI_COMM_WORLD,&np);
>   MPI_Comm_rank(MPI_COMM_WORLD,&me);
>
>   top = me + 1; if ( top >= np ) top -= np;
>   bottom = me - 1; if ( bottom < 0 ) bottom += np;
>
>   for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>   for ( i = 0; i < N; i++ ) rbuf[i] = 0;
>
>   MPI_Barrier(MPI_COMM_WORLD);
>   for ( i = 0; i < M - 1; i++ )
>     MPI_Sendrecv(sbuf, N, MPI_FLOAT, top , 0,
>                  rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>   MPI_Barrier(MPI_COMM_WORLD);
>
>   MPI_Finalize();
>   return 0;
> }
>
> Can you reproduce your problem with this test case?
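For anyone skimming the test case: the communication pattern is a ring. Every rank sends to top and receives from bottom in the same MPI_Sendrecv call. The neighbour arithmetic can be pulled out and checked without MPI; the sketch below does that. The function names ring_top and ring_bottom are mine, not from Eugene's code.

```c
#include <assert.h>

/* Hypothetical helpers (not in the quoted test case): neighbours of rank
 * `me` in a ring of `np` ranks, using the same wraparound as the test. */
int ring_top(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    int top = me + 1;
    if (top >= np) top -= np;        /* last rank wraps to rank 0 */
    return top;
}

int ring_bottom(int me, int np)
{
    assert(np > 0 && me >= 0 && me < np);
    int bottom = me - 1;
    if (bottom < 0) bottom += np;    /* rank 0 wraps to the last rank */
    return bottom;
}
```

With -np 4, rank 3 sends to rank 0 and rank 0 receives from rank 3, so a single stuck rank is enough to stall the whole ring.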