A rather stable production code that has worked with various versions of MPI
on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.

Which led me to this thread.

I made some very small changes to Eugene's code, here's the diff:
$ diff testorig.c billtest.c
3,5c3,4
<
< #define N 40000
< #define M 40000
---
> #define N 8000
> #define M 8000
17c16
<
---
>   fprintf (stderr, "Initialized\n");
32,33c31,39
<     MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0,
<                 rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
---
>     {
>       if ((me == 0) && (i % 100 == 0))
>       {
>         fprintf (stderr, "%d\n", i);
>       }
>       MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
>                   MPI_COMM_WORLD, &status);
>     }
>

Basically print some occasional progress, and shrink M and N.
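
For anyone following along without the original at hand: Eugene's test (quoted at the end) is a ring exchange, where each rank sends to me+1 and receives from me-1, wrapping at the ends. Below is a minimal MPI-free sketch of just that neighbor arithmetic. The function names ring_top and ring_bottom are mine for illustration, not from the test code.

```c
#include <assert.h>

/* Rank that `me` sends to: the next rank up, wrapping np-1 -> 0.
   Same arithmetic as the `top` variable in the quoted test. */
int ring_top(int me, int np)
{
  assert(np > 0 && me >= 0 && me < np);
  int top = me + 1;
  if (top >= np) top -= np;
  return top;
}

/* Rank that `me` receives from: the next rank down, wrapping 0 -> np-1.
   Same arithmetic as the `bottom` variable in the quoted test. */
int ring_bottom(int me, int np)
{
  assert(np > 0 && me >= 0 && me < np);
  int bottom = me - 1;
  if (bottom < 0) bottom += np;
  return bottom;
}
```

With -np 4, rank 0 sends to rank 1 and receives from rank 3. Since every rank sits in the same MPI_Sendrecv on each iteration, I would expect one stalled rank to stall the whole ring eventually.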

I'm running on a new intel dual socket nehalem system with centos-5.4.  I
compiled gcc-4.4.2 and openmpi myself with all the defaults, except I had to
point out mpfr-2.4.1 to gcc.

If I run:
$ mpirun -np 4 ./billtest

About 1 in 2 times I get something like:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
<hang>

Next time worked, next time:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
<hang>

Next time hung at 7100.

Next time worked.

If I strace it when hung I get something like:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) =
0 (Timeout)

If I run gdb on a hung job (compiled with -O4 -g)
(gdb) bt
#0  0x00002ab3b34cb385 in ompi_request_default_wait ()
   from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#1  0x00002ab3b34f0d48 in PMPI_Sendrecv () from
/share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#2  0x0000000000400b88 in main (argc=1, argv=0x7fff083fd298) at billtest.c:36
(gdb)

If I recompile with -O1 I get the same thing.

Even with just -g I get the same thing.

If I compile the application with gcc-4.3 and still use a gcc-4.4 compiled
openmpi I still get hangs.

If I compile openmpi-1.3.3 with gcc-4.3 and the application with gcc-4.3 and
run it 20 times, I get zero hangs.  It seems that gcc-4.4 and openmpi-1.3.3
are incompatible.  In my production code I'd always hang in MPI_Waitall,
but the above is obviously inside of Sendrecv.

To be paranoid I just reran it 40 times without a hang.
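
Counting hangs over 20 or 40 runs by hand gets tedious, so here is a rough sketch of a small C harness that runs a command repeatedly and treats any run exceeding a timeout as a hang. This is not what I used for the numbers above. The names run_with_timeout and count_hangs are made up for illustration, and the 2 second grace period between SIGTERM and SIGKILL is an arbitrary assumption.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv[0] with arguments argv (NULL-terminated).
   Returns 0 if the child exits on its own within timeout_sec seconds
   (whatever its exit status), 1 if it had to be killed (counted as a
   hang), and -1 if fork or waitpid fails. */
int run_with_timeout(char *const argv[], int timeout_sec)
{
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    execvp(argv[0], argv);
    _exit(127);                     /* exec failed */
  }

  time_t deadline = time(NULL) + timeout_sec;
  struct timespec tick = {0, 100000000L};   /* poll every 0.1 s */
  for (;;) {
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid) return 0;         /* child finished by itself */
    if (r < 0) return -1;
    if (time(NULL) >= deadline) {
      kill(pid, SIGTERM);           /* ask politely first */
      sleep(2);                     /* assumed grace period */
      kill(pid, SIGKILL);           /* harmless if already dead */
      waitpid(pid, &status, 0);     /* reap the child */
      return 1;
    }
    nanosleep(&tick, NULL);
  }
}

/* Run the command `runs` times and return how many runs hung. */
int count_hangs(char *const argv[], int runs, int timeout_sec)
{
  int hangs = 0;
  for (int i = 0; i < runs; i++)
    if (run_with_timeout(argv, timeout_sec) == 1) hangs++;
  return hangs;
}
```

For the case here, argv would be {"mpirun", "-np", "4", "./billtest", NULL}, with a timeout chosen well above the normal runtime of a good run. Sending SIGTERM first is meant to give mpirun a chance to clean up its ranks before the SIGKILL.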

Original code below.

Eugene Loh wrote:
...

> #include <stdio.h>
> #include <mpi.h>
> 
> #define N 40000
> #define M 40000
> 
> int main(int argc, char **argv) {
>  int np, me, i, top, bottom;
>  float sbuf[N], rbuf[N];
>  MPI_Status status;
> 
>  MPI_Init(&argc,&argv);
>  MPI_Comm_size(MPI_COMM_WORLD,&np);
>  MPI_Comm_rank(MPI_COMM_WORLD,&me);
> 
>  top    = me + 1;   if ( top  >= np ) top    -= np;
>  bottom = me - 1;   if ( bottom < 0 ) bottom += np;
> 
>  for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>  for ( i = 0; i < N; i++ ) rbuf[i] = 0;
> 
>  MPI_Barrier(MPI_COMM_WORLD);
>  for ( i = 0; i < M - 1; i++ )
>    MPI_Sendrecv(sbuf, N, MPI_FLOAT, top   , 0,
>                 rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>  MPI_Barrier(MPI_COMM_WORLD);
> 
>  MPI_Finalize();
>  return 0;
> }
> 
> Can you reproduce your problem with this test case?
