Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Vincent Loechner

Bill,

> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.

Probably this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2043

Until a fix is released, try adding this option to mpirun:
-mca btl_sm_num_fifos 5
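
For example, for a 6-process job (the process count and program name below are
only placeholders):
$ mpirun -mca btl_sm_num_fifos 5 -n 6 ./your_program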

--Vincent


Re: [OMPI users] sending/receiving large buffers

2009-11-09 Thread Vincent Loechner

Martin,

> I expect problems with sizes larger than 2^31-1, but these array sizes
> are still much smaller.

No, they are bigger: you allocate two arrays of 320 M doubles each, which is
2 * 320M * 8 bytes = 5 GB in total.
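
For reference, a minimal C sketch of that size arithmetic (assuming "320M"
means 320 * 2^20 elements; with 320 * 10^6 it comes to about 5.12 GB):

// sizecheck.c ---
#include <stdio.h>

int main( void )
{
    /* Assumes a 64-bit size_t, as on the x86_64 systems discussed here. */
    size_t n = 320UL * 1024 * 1024;            /* 320 M doubles per array */
    size_t total = 2 * n * sizeof( double );   /* two such arrays */
    printf( "%zu bytes = %.2f GiB\n",
            total, total / (1024.0 * 1024.0 * 1024.0) );   /* prints 5.00 GiB */
    return( 0 );
}
// sizecheck.c ---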

Are your processes limited to 4 GB of virtual memory?

--Vincent


Re: [OMPI users] collective communications broken on more than 4 cores

2009-10-29 Thread Vincent Loechner

> >>> It seems that the calls to collective communication are not
> >>> returning for some MPI processes, when the number of processes is
> >>> greater than or equal to 5. It's reproducible, on two different
> >>> architectures, with two different versions of OpenMPI (1.3.2 and
> >>> 1.3.3). It was working correctly with OpenMPI version 1.2.7.
> >>
> >> Does it work if you turn off the shared memory transport layer;  
> >> that is,
> >>
> >> mpirun -n 6 -mca btl ^sm ./testmpi
> >
> > Yes it does, on both my configurations (AMD and Intel processor).
> > So it seems that the shared memory synchronization process is
> > broken.
> 
> Presumably that is this bug:
> https://svn.open-mpi.org/trac/ompi/ticket/2043

Yes it is.

> I also found by trial and error that increasing the number of fifos, e.g.
> -mca btl_sm_num_fifos 5
> on a 6-processor job, apparently worked around the problem.
> But yes, something seems broken in Open MPI's shared memory transport with
> gcc 4.4.x.

Yes, same for me: -mca btl_sm_num_fifos 5 worked.
Thanks for your answer, Jonathan.

If I can help the developers track down this bug in any way, please get in
touch with me.

--Vincent


Re: [OMPI users] collective communications broken on more than 4 cores

2009-10-29 Thread Vincent Loechner

> > It seems that the calls to collective communication are not
> > returning for some MPI processes, when the number of processes is
> > greater than or equal to 5. It's reproducible, on two different
> > architectures, with two different versions of OpenMPI (1.3.2 and
> > 1.3.3). It was working correctly with OpenMPI version 1.2.7.
> 
> Does it work if you turn off the shared memory transport layer; that is,
> 
> mpirun -n 6 -mca btl ^sm ./testmpi

Yes it does, on both my configurations (AMD and Intel processor).
So it seems that the shared memory synchronization process is
broken.

It could be a system bug; I don't know what mechanism OpenMPI uses for this
(is it IPC?). Both of my systems run Linux 2.6.31: the AMD machine runs
Ubuntu and the Intel one runs Arch Linux.

--Vincent


[OMPI users] collective communications broken on more than 4 cores

2009-10-29 Thread Vincent Loechner

Hello to the list,

I ran into a problem running a simple program with collective
communications on a 6-core processor (6 local MPI processes).
It seems that the calls to collective communication are not
returning for some MPI processes, when the number of processes is
greater than or equal to 5. It's reproducible, on two different
architectures, with two different versions of OpenMPI (1.3.2 and
1.3.3). It was working correctly with OpenMPI version 1.2.7.


I just wrote a very simple test, making 1000 calls to MPI_Barrier().
Running on an Istanbul processor (6-core AMD Opteron):
$ uname -a
Linux istanbool 2.6.31-14-generic #46-Ubuntu SMP Tue Oct 13 16:47:28 UTC 2009
x86_64 GNU/Linux
with the OpenMPI Ubuntu package, version 1.3.2.
Running with 5 or 6 MPI processes, it just hangs after a random
number of iterations, ranging from 3 to 600, and sometimes it
finishes correctly (about 1 time out of 8). I just ran:
'mpirun -n 6 ./testmpi'
The behavior is the same with more MPI processes.

I tried the '--mca coll_basic_priority 50' option: the program is more
likely to finish (about one time out of 2), but it still deadlocks the
rest of the time after a random number of iterations.
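
That is, a command line of the form (reconstructed from the run shown above):
$ mpirun --mca coll_basic_priority 50 -n 6 ./testmpi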

Without setting the coll_basic_priority option, I ran a debugger, and
found out that some processes are blocked in:
#0  0x7f858f272f7a in opal_progress () from /usr/lib/libopen-pal.so.0
#1  0x7f858f7524f5 in ?? () from /usr/lib/libmpi.so.0
#2  0x7f8589e74c5a in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3  0x7f8589e7cefa in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x7f858f767b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#5  0x00400c10 in main (argc=1, argv=0x7fff9d59acf8) at testmpi.c:24

and the others in:
#0  0x7f05799e933a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1  0x7f057dd22fba in opal_progress () from /usr/lib/libopen-pal.so.0
#2  0x7f057e2024f5 in ?? () from /usr/lib/libmpi.so.0
#3  0x7f0578924c5a in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4  0x7f057892cefa in ?? ()
   from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#5  0x7f057e217b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#6  0x00400c10 in main (argc=1, argv=0x7fff1b55b4a8) at testmpi.c:24


Other collective communications also seem to be broken: my original
program was blocked in a call to MPI_Allreduce.

I also ran tests on a 4-core Intel Core i7 with OpenMPI version 1.3.3,
with exactly the same problem: calls to collective communication do not
return for some MPI processes when the number of processes is greater
than or equal to 5.

Below are some technical details on my configuration, the input file, and
example outputs. The output of ompi_info --all is attached to this mail.

Best regards,
-- 
Vincent LOECHNER |0---0  |  ICPS, LSIIT (UMR 7005),
 PhD |   /|  /|  |  Equipe INRIA CAMUS,
 Phone: +33 (0)368 85 45 37  |  0---0 |  |  Université de Strasbourg
 Fax  : +33 (0)368 85 45 47  |  | 0-|-0  |  Pôle API, Bd. Sébastien Brant
 |  |/  |/   |  F-67412 ILLKIRCH Cedex
 loech...@unistra.fr |  0---0|  http://icps.u-strasbg.fr
--


Input program:
// testmpi.c ---
#include <stdio.h>
#include <mpi.h>
#define MCW MPI_COMM_WORLD

int main( int argc, char **argv )
{
    int n, r;   /* number of processes, process rank */
    int i;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MCW, &n );
    MPI_Comm_rank( MCW, &r );

    for( i=0 ; i<1000 ; i++ )
    {
        printf( "(%d) %d\n", r, i ); fflush(stdout);
        MPI_Barrier( MCW );
    }

    MPI_Finalize();
    return( 0 );
}
// testmpi.c ---

Compilation line:
$ mpicc -O2 -Wall -g testmpi.c -o testmpi

GCC version:
$ mpicc --version
gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1

OpenMPI version: 1.3.2
$ ompi_info -v ompi full
                 Package: Open MPI buildd@crested Distribution
                Open MPI: 1.3.2
   Open MPI SVN revision: r21054
   Open MPI release date: Apr 21, 2009
                Open RTE: 1.3.2
   Open RTE SVN revision: r21054
   Open RTE release date: Apr 21, 2009
                    OPAL: 1.3.2
       OPAL SVN revision: r21054
       OPAL release date: Apr 21, 2009
            Ident string: 1.3.2

--- example run (I hit ^C after a while)
$ mpirun  -n 6 ./testmpi
(0) 0
(0) 1
(0) 2
(0) 3
(1) 0
(1) 1
(1) 2
(2) 0
(2) 1
(2) 2
(2) 3
(3) 0
(3) 1
(3) 2
(4) 0
(4) 1
(4) 2
(4) 3
(5) 0
(5) 1
(5) 2
(5) 3
^Cmpirun: killing job...

--
mpirun noticed that process rank 0 with PID 10466 on node istanbool exited on
signal 0 (Unknown signal 0).