Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Ashley Pittman
On Wed, 2009-11-18 at 01:28 -0800, Bill Broadley wrote:
> A rather stable production code that has worked with various versions
> of MPI on various architectures started hanging with gcc-4.4.2 and
> openmpi 1.3.3.
>
> Which led me to this thread.

If you're investigating hangs in a parallel job, take a look at the tool
linked below (padb); it should be able to give you a parallel stack
trace and the message queues for the job.

http://padb.pittman.org.uk/full-report.html

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Eugene Loh




Vincent Loechner wrote:

> Bill,
>
> > A rather stable production code that has worked with various versions of MPI
> > on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3
>
> Probably this bug:
> https://svn.open-mpi.org/trac/ompi/ticket/2043
>
> While waiting for a fix, try adding this option to mpirun:
> -mca btl_sm_num_fifos 5

Bill, I noticed you updated the ticket.  Thank you.  I've been working
on this in earnest.  Something funny is going on as far as the "memory
model" goes:  values written to the shared-memory FIFOs go goofy.
For example, a FIFO slot that was initialized to be free, and still
"should be" free, looks occupied when a writer checks it, yet it is empty
immediately thereafter even though "presumably" no one has accessed that
location.  I almost have a stand-alone program (C only, no OMPI
infrastructure) that demonstrates the problem, but I'm not quite there.
Once I do, either it will become evident to me what's wrong, or I'll be
able to show other people more easily why I think something is wrong.
At this point, I really have no idea whether the problem is GCC 4.4.x or
OMPI 1.3.x.
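
To make the symptom concrete, here is a minimal sketch of the kind of
free-slot check described above. It is not the Open MPI sm BTL code: the
names fifo_t, fifo_write and fifo_read, the SLOT_FREE sentinel and the
slot count are all assumptions made for illustration. It handles one
writer and one reader only, and it relies on C11 atomics for ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_SLOTS 8               /* hypothetical queue depth */
#define SLOT_FREE  ((uintptr_t)0)  /* hypothetical "empty slot" sentinel */

typedef struct {
    atomic_uintptr_t slot[FIFO_SLOTS];
    size_t head;  /* next slot to write; touched only by the writer */
    size_t tail;  /* next slot to read; touched only by the reader */
} fifo_t;

/* Mark every slot free before either side uses the queue. */
static void fifo_init(fifo_t *f) {
    for (size_t i = 0; i < FIFO_SLOTS; i++)
        atomic_init(&f->slot[i], SLOT_FREE);
    f->head = 0;
    f->tail = 0;
}

/* Writer side: returns 0 on success, -1 if the next slot looks occupied. */
static int fifo_write(fifo_t *f, uintptr_t value) {
    assert(value != SLOT_FREE);  /* the sentinel cannot be a payload */
    size_t i = f->head % FIFO_SLOTS;
    if (atomic_load_explicit(&f->slot[i], memory_order_acquire) != SLOT_FREE)
        return -1;
    atomic_store_explicit(&f->slot[i], value, memory_order_release);
    f->head++;
    return 0;
}

/* Reader side: returns the payload, or SLOT_FREE if nothing is queued. */
static uintptr_t fifo_read(fifo_t *f) {
    size_t i = f->tail % FIFO_SLOTS;
    uintptr_t v = atomic_load_explicit(&f->slot[i], memory_order_acquire);
    if (v == SLOT_FREE)
        return SLOT_FREE;
    atomic_store_explicit(&f->slot[i], SLOT_FREE, memory_order_release);
    f->tail++;
    return v;
}
```

In a scheme like this, a writer that sees a non-free value in a slot
nobody has written would report the FIFO as full, which matches the
"looks occupied" observation. Whether that is what actually happens in
OMPI 1.3.x is exactly the open question.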




Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Vincent Loechner

Bill,

> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3

Probably this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2043

While waiting for a fix, try adding this option to mpirun:
-mca btl_sm_num_fifos 5

--Vincent


Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Bill Broadley
A rather stable production code that has worked with various versions of MPI
on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.

Which led me to this thread.

I made some very small changes to Eugene's code, here's the diff:
$ diff testorig.c billtest.c
3,5c3,4
<
< #define N 4
< #define M 4
---
> #define N 8000
> #define M 8000
17c16
<
---
>   fprintf (stderr, "Initialized\n");
32,33c31,39
<     MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0,
<                   rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
---
>     {
>       if ((me == 0) && (i % 100 == 0))
>         {
>           fprintf (stderr, "%d\n", i);
>         }
>       MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
>                     MPI_COMM_WORLD, &status);
>     }
>

Basically, print some occasional progress and increase M and N.

I'm running on a new dual-socket Intel Nehalem system with CentOS 5.4.  I
compiled gcc-4.4.2 and openmpi myself with all the defaults, except that I
had to point gcc at mpfr-2.4.1.

If I run:
$ mpirun -np 4 ./billtest

About 1 in 2 times I get something like:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100


Next time worked, next time:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500


Next time hung at 7100.

Next time worked.

If I strace it when hung I get something like:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) =
0 (Timeout)
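
For readers unfamiliar with this trace: the last argument to poll() is
the timeout, and 0 means "return immediately", so a result of 0 only says
that no descriptor had input at that instant. Repeated lines of this form
are consistent with a process spinning in a progress loop rather than
sleeping in the kernel. A minimal sketch of such a non-blocking check
follows; ready_now is a hypothetical helper, not an Open MPI function.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Non-blocking readiness check: returns how many descriptors have input
 * pending, 0 if none are ready right now, or -1 on error. */
static int ready_now(const int *fds, int nfds) {
    struct pollfd p[16];
    assert(nfds >= 0 && nfds <= 16);  /* fixed-size sketch, at most 16 fds */
    for (int i = 0; i < nfds; i++) {
        p[i].fd = fds[i];
        p[i].events = POLLIN;
        p[i].revents = 0;
    }
    return poll(p, (nfds_t)nfds, 0);  /* timeout 0: never sleeps */
}
```

A caller that loops on ready_now() while waiting for a message that never
arrives would produce an strace like the one above.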

If I run gdb on a hung job (compiled with -O4 -g)
(gdb) bt
#0  0x2ab3b34cb385 in ompi_request_default_wait ()
   from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#1  0x2ab3b34f0d48 in PMPI_Sendrecv () from
/share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#2  0x00400b88 in main (argc=1, argv=0x7fff083fd298) at billtest.c:36
(gdb)

If I recompile with -O1 I get the same thing.

Even with only -g I get the same thing.

If I compile the application with gcc-4.3 but still use a gcc-4.4 compiled
openmpi, I still get hangs.

If I compile openmpi-1.3.3 with gcc-4.3 and the application with gcc-4.3 and
run it 20 times, I get zero hangs.  It seems that gcc-4.4 and openmpi-1.3.3
are incompatible.  In my production code I'd always hang in MPI_Waitall,
but the above is obviously inside MPI_Sendrecv.

To be paranoid I just reran it 40 times without a hang.
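
Repeated runs like these can be scripted so that a hang is counted
instead of blocking the terminal. The following is only a sketch under
stated assumptions: run_with_timeout and count_hangs are hypothetical
helpers, a run is called a "hang" if it has not exited within the given
number of seconds, and the child is stopped with SIGKILL, which gives
mpirun no chance to clean up its ranks (a real harness may prefer to send
SIGTERM first).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv as a child process. Returns 0 if it exits within timeout_s
 * seconds, 1 if it had to be killed (counted as a hang), -1 on error. */
static int run_with_timeout(char *const argv[], int timeout_s) {
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }
    const struct timespec tick = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    for (int waited_ms = 0; waited_ms < timeout_s * 1000; waited_ms += 100) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0;   /* child finished on its own */
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }
    kill(pid, SIGKILL);     /* still running: treat as hung */
    waitpid(pid, NULL, 0);  /* reap the killed child */
    return 1;
}

/* Repeat the command `runs` times and return how many runs hung. */
static int count_hangs(char *const argv[], int runs, int timeout_s) {
    int hangs = 0;
    for (int i = 0; i < runs; i++)
        if (run_with_timeout(argv, timeout_s) == 1)
            hangs++;
    return hangs;
}
```

With argv set to { "mpirun", "-np", "4", "./billtest", NULL } and a
timeout comfortably above the normal run time, count_hangs(argv, 40, t)
would give a failure count to compare between compiler builds.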

Original code below.

Eugene Loh wrote:
...

> #include <stdio.h>
> #include <mpi.h>
>
> #define N 4
> #define M 4
>
> int main(int argc, char **argv) {
>   int np, me, i, top, bottom;
>   float sbuf[N], rbuf[N];
>   MPI_Status status;
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &np);
>   MPI_Comm_rank(MPI_COMM_WORLD, &me);
>
>   top    = me + 1;   if ( top    >= np ) top    -= np;
>   bottom = me - 1;   if ( bottom <  0  ) bottom += np;
>
>   for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>   for ( i = 0; i < N; i++ ) rbuf[i] = 0;
>
>   MPI_Barrier(MPI_COMM_WORLD);
>   for ( i = 0; i < M - 1; i++ )
>     MPI_Sendrecv(sbuf, N, MPI_FLOAT, top   , 0,
>                  rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>   MPI_Barrier(MPI_COMM_WORLD);
>
>   MPI_Finalize();
>   return 0;
> }
> 
> Can you reproduce your problem with this test case?



Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-05-18 Thread Eugene Loh

Simone Pellegrini wrote:

> sorry for the delay but I did some additional experiments to find out
> whether the problem was openmpi or gcc!
>
> The program just hangs... and never terminates! I am running on an SMP
> machine with 32 cores, actually a Sun Fire X4600 X2 (8 quad-core
> Barcelona AMD chips); the OS is CentOS 5 and the kernel is
> 2.6.18-92.el5.src-PAPI (patched with PAPI).
> I use an N of 1024, and if I print out the value of the iterator i,
> sometimes it stops around 165, other times around 520... and it
> doesn't make any sense.
>
> If I run the program (and it's important to note that I don't recompile
> it, I just use another mpirun from a different mpi version) the
> program works fine. I did some experiments during the weekend and if I
> use openmpi-1.3.2 compiled with gcc433 everything works fine.
>
> So I really think the problem is strictly related to the usage of
> gcc-4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs
> even when I use OpenMPI 1.3.1 compiled with gcc 4.4!


I finally got GCC 4.4, but was unable to reproduce the problem.  How 
small can you make np (number of MPI processes) and still see the 
problem?  How reproducible is the problem?  When it hangs, can you get 
stack traces of all the processes?  We're trying to hunt down some 
similar behavior, but I think yours is of a different flavor.


Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-05-04 Thread Simone Pellegrini

Hi,
sorry for the delay but I did some additional experiments to find out
whether the problem was openmpi or gcc!

Attached you will find the program that causes the problem mentioned before.
I compile the program with the following line:

$HOME/openmpi-1.3.2-gcc44/bin/mpicc -O3 -g -Wall -fmessage-length=0 -m64 \
    bug.c -o bug

When I run the program using openmpi 1.3.2 compiled with gcc44 in the
following way:

$HOME/openmpi-1.3.2-gcc44/bin/mpirun --mca btl self,sm --np 32 ./bug 1024

the program just hangs... and never terminates! I am running on an SMP
machine with 32 cores, actually a Sun Fire X4600 X2 (8 quad-core
Barcelona AMD chips); the OS is CentOS 5 and the kernel is
2.6.18-92.el5.src-PAPI (patched with PAPI).
I use an N of 1024, and if I print out the value of the iterator i,
sometimes it stops around 165, other times around 520... and it doesn't
make any sense.

If I run the program (and it's important to note that I don't recompile it,
I just use another mpirun from a different mpi version) the program
works fine. I did some experiments during the weekend and if I use
openmpi-1.3.2 compiled with gcc433 everything works fine.

So I really think the problem is strictly related to the usage of
gcc-4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs
even when I use OpenMPI 1.3.1 compiled with gcc 4.4!


I hope everything is clear now.

regards, Simone

Eugene Loh wrote:
> So far, I'm unable to reproduce this problem.  I haven't exactly
> reproduced your test conditions, but then I can't.  At a minimum, I
> don't have exactly the code you ran (and I'm not convinced I want to!).  So:
>
> *) Can you reproduce the problem with the stand-alone test case I sent out?
> *) Does the problem correlate with OMPI version?  (I.e., 1.3.1 versus 1.3.2.)
> *) Does the problem occur at lower np?
> *) Does the problem correlate with the compiler version?  (I.e., GCC 4.4
>    versus 4.3.3.)
> *) What is the failure rate?  How many times should I expect to run to
>    see failures?
> *) How large is N?


Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-05-01 Thread Eugene Loh
So far, I'm unable to reproduce this problem.  I haven't exactly
reproduced your test conditions, but then I can't.  At a minimum, I
don't have exactly the code you ran (and I'm not convinced I want to!).  So:

*) Can you reproduce the problem with the stand-alone test case I sent out?
*) Does the problem correlate with OMPI version?  (I.e., 1.3.1 versus 1.3.2.)
*) Does the problem occur at lower np?
*) Does the problem correlate with the compiler version?  (I.e., GCC 4.4
   versus 4.3.3.)
*) What is the failure rate?  How many times should I expect to run to
   see failures?
*) How large is N?


Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-04-30 Thread Eugene Loh
I'm responsible for some sm changes in 1.3.2, so I can try looking at 
this.  Some questions below:


Simone Pellegrini wrote:


> Dear all,
> I have successfully compiled and installed openmpi 1.3.2 on an 8-socket
> quad-core machine from Sun.
>
> I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase,
> but when I try to run simple MPI programs the processes hang. This is
> the kernel of the application I am trying to run:
>
> MPI_Barrier(MPI_COMM_WORLD);
> total = MPI_Wtime();
> for(i=0; i0)
> MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N,
> MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, );
>
> for(k=0; k

[OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-04-30 Thread Simone Pellegrini

Dear all,
I have successfully compiled and installed openmpi 1.3.2 on an 8-socket
quad-core machine from Sun.


I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but
when I try to run simple MPI programs the processes hang. This is
the kernel of the application I am trying to run:


MPI_Barrier(MPI_COMM_WORLD);
   total = MPI_Wtime();
   for(i=0; i0)
   MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, 
MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, );

   for(k=0; k