Hello to the list,
I ran into a problem running a simple program with collective
communications on a 6-core processor (6 local MPI processes).
The calls to collective communication routines do not return for
some MPI processes when the number of processes is greater than
or equal to 5. It's reproducible on two different architectures,
with two different versions of OpenMPI (1.3.2 and 1.3.3). It was
working correctly with OpenMPI version 1.2.7.
I just wrote a very simple test, making 1000 calls to MPI_Barrier().
Running on an Istanbul processor (6-core AMD Opteron):
$ uname -a
Linux istanbool 2.6.31-14-generic #46-Ubuntu SMP Tue Oct 13 16:47:28 UTC 2009
x86_64 GNU/Linux
with the Ubuntu OpenMPI package, version 1.3.2.
Running with 5 or 6 MPI processes, it just hangs after a random
number of iterations (ranging from 3 to 600); sometimes it
finishes correctly (about 1 time out of 8). I simply ran:
'mpirun -n 6 ./testmpi'
The behavior is the same with more MPI processes.
I tried the '--mca coll_basic_priority 50' option; the program
then has a better chance of finishing (about one time out of 2),
but it still deadlocks the rest of the time after a random number
of iterations.
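It might also be worth trying to isolate the component at fault. The two runs below are suggestions only (I have not verified that they avoid the hang); both use standard Open MPI MCA selection syntax:

```shell
# Disable the shared-memory BTL so local ranks fall back to another
# transport (e.g. TCP loopback, assuming that BTL is available in the
# Ubuntu package). If the hang disappears, it points at the sm
# transport rather than the collective algorithms themselves.
mpirun --mca btl ^sm -n 6 ./testmpi

# Select the basic (non-tuned) collective component outright,
# instead of only raising its priority above coll_tuned's.
mpirun --mca coll basic,self -n 6 ./testmpi
```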
Without setting the coll_basic_priority option, I ran a debugger
and found that some processes are blocked in:
#0 0x7f858f272f7a in opal_progress () from /usr/lib/libopen-pal.so.0
#1 0x7f858f7524f5 in ?? () from /usr/lib/libmpi.so.0
#2 0x7f8589e74c5a in ?? ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#3 0x7f8589e7cefa in ?? ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4 0x7f858f767b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#5 0x00400c10 in main (argc=1, argv=0x7fff9d59acf8) at testmpi.c:24
and the others in:
#0 0x7f05799e933a in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so
#1 0x7f057dd22fba in opal_progress () from /usr/lib/libopen-pal.so.0
#2 0x7f057e2024f5 in ?? () from /usr/lib/libmpi.so.0
#3 0x7f0578924c5a in ?? ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#4 0x7f057892cefa in ?? ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#5 0x7f057e217b32 in PMPI_Barrier () from /usr/lib/libmpi.so.0
#6 0x00400c10 in main (argc=1, argv=0x7fff1b55b4a8) at testmpi.c:24
It seems that other collective communications are broken as well:
my original program blocked after a call to MPI_Allreduce.
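For completeness, here is an Allreduce variant of the barrier test, with the same loop structure as the testmpi.c program below. It is only a sketch (the file name testallreduce.c is my own; I have not run this exact version on the failing machines):

```c
// testallreduce.c --- MPI_Allreduce analogue of testmpi.c (sketch)
#include <stdio.h>
#include <mpi.h>

int main( int argc, char **argv )
{
    int n, r;    /* number of processes, process rank */
    int i, sum;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &n );
    MPI_Comm_rank( MPI_COMM_WORLD, &r );
    for( i = 0 ; i < 1000 ; i++ )
    {
        printf( "(%d) %d\n", r, i ); fflush( stdout );
        /* sum of all ranks; every process should get n*(n-1)/2 */
        MPI_Allreduce( &r, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );
    }
    MPI_Finalize();
    return 0;
}
```

Compiled and run the same way ('mpicc -O2 -Wall -g testallreduce.c -o testallreduce', then 'mpirun -n 6 ./testallreduce').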
I also made tests on a 4-core Intel Core i7, OpenMPI version
1.3.3, with exactly the same problem: calls to collective
communications do not return for some MPI processes when the
number of processes is greater than or equal to 5.
Below are some technical details on my configuration, the input
file, and example outputs. The output of 'ompi_info --all' is
attached to this mail.
Best regards,
----------
Vincent LOECHNER |0---0 | ICPS, LSIIT (UMR 7005),
PhD | /| /| | Equipe INRIA CAMUS,
Phone: +33 (0)368 85 45 37 | 0---0 | | Université de Strasbourg
Fax : +33 (0)368 85 45 47 | | 0-|-0 | Pôle API, Bd. Sébastien Brant
| |/ |/ | F-67412 ILLKIRCH Cedex
loech...@unistra.fr | 0---0| http://icps.u-strasbg.fr
--
Input program:
// testmpi.c ---
#include <stdio.h>
#include <mpi.h>
#define MCW MPI_COMM_WORLD
int main( int argc, char **argv )
{
    int n, r;    /* number of processes, process rank */
    int i;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MCW, &n );
    MPI_Comm_rank( MCW, &r );
    for( i=0 ; i<1000 ; i++ )
    {
        printf( "(%d) %d\n", r, i ); fflush(stdout);
        MPI_Barrier( MCW );
    }
    MPI_Finalize();
    return( 0 );
}
// testmpi.c ---
Compilation line:
$ mpicc -O2 -Wall -g testmpi.c -o testmpi
GCC version:
$ mpicc --version
gcc (Ubuntu 4.4.1-4ubuntu7) 4.4.1
OpenMPI version: 1.3.2
$ ompi_info -v ompi full
Package: Open MPI buildd@crested Distribution
Open MPI: 1.3.2
Open MPI SVN revision: r21054
Open MPI release date: Apr 21, 2009
Open RTE: 1.3.2
Open RTE SVN revision: r21054
Open RTE release date: Apr 21, 2009
OPAL: 1.3.2
OPAL SVN revision: r21054
OPAL release date: Apr 21, 2009
Ident string: 1.3.2
--- example run (I hit ^C after a while)
$ mpirun -n 6 ./testmpi
(0) 0
(0) 1
(0) 2
(0) 3
(1) 0
(1) 1
(1) 2
(2) 0
(2) 1
(2) 2
(2) 3
(3) 0
(3) 1
(3) 2
(4) 0
(4) 1
(4) 2
(4) 3
(5) 0
(5) 1
(5) 2
(5) 3
^Cmpirun: killing job...
--
mpirun noticed that process rank 0 with PID 10466 o