Re: [OMPI devel] Failures
George,

I was able to reproduce the hang with Intel compiler 14.0.0, but I am still unable to reproduce it with Intel compiler 14.3. I was not able to understand where the issue comes from, so I could not create an appropriate test in configure; at this stage, I can only recommend that you update your compiler version.

Cheers,

Gilles

On 2015/01/17 0:19, George Bosilca wrote:
> Your patch solves the issue with opal_tree. The opal_lifo test remains broken.
>
>   George.
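Since no configure-level test could be derived, one stopgap would be a compile-time guard on the compiler version macro. This is only a sketch of the idea, not actual OMPI configure logic, and it also shows why a clean test is hard to write here:

/* Hypothetical compile-time guard, not actual OMPI code.
 * icc defines __INTEL_COMPILER, e.g. 1400 for the whole 14.0 series,
 * so this macro alone cannot separate the broken 14.0.0 from the
 * working 14.0.3.174 -- which may be exactly why no precise
 * configure check could be derived from this thread. */
#if defined(__INTEL_COMPILER) && (__INTEL_COMPILER == 1400)
#warning "icc 14.0.x is suspected of miscompiling OPAL atomics; consider updating"
#endif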
Re: [OMPI devel] Failures
Your patch solves the issue with opal_tree. The opal_lifo test remains broken.

  George.

On Fri, Jan 16, 2015 at 5:12 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> George,
>
> i pushed
> https://github.com/open-mpi/ompi/commit/ac16970d21d21f529f1ec01ebe0520843227475b
> in order to get the Intel compiler to work with OMPI.
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Failures
George,

I pushed
https://github.com/open-mpi/ompi/commit/ac16970d21d21f529f1ec01ebe0520843227475b
in order to get the Intel compiler to work with OMPI.

Cheers,

Gilles
Re: [OMPI devel] Failures
George,

I was unable to reproduce the hang with icc 14.0.3.174 and greater on a RHEL6-like distro.

I was able to reproduce the opal_tree failure and found two possible workarounds:
a) manually compile opal/class/opal_tree.lo *without* the -finline-functions flag
b) update deserialize_add_tree_item and declare curr_delim as volatile char * (see the patch below)

This function is recursive, and the compiler could be generating incorrect code for it.

Cheers,

Gilles

diff --git a/opal/class/opal_tree.c b/opal/class/opal_tree.c
index e8964e0..492e8dc 100644
--- a/opal/class/opal_tree.c
+++ b/opal/class/opal_tree.c
@@ -465,7 +465,7 @@ int opal_tree_serialize(opal_tree_item_t *start_item, opal_buffer_t *buffer)
 static int deserialize_add_tree_item(opal_buffer_t *data,
                                      opal_tree_item_t *parent_item,
                                      opal_tree_item_deserialize_fn_t deserialize,
-                                     char *curr_delim,
+                                     volatile char *curr_delim,
                                      int depth)
 {
     int idx = 1, rc;
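To illustrate workaround (b) outside of OMPI: the pattern is to qualify the recursive function's cursor argument as volatile, which forces the compiler to re-read the pointed-to bytes on every access and so sidesteps the suspected -finline-functions miscompilation. A minimal, self-contained sketch (a hypothetical function, not the real deserializer):

#include <stdio.h>

/* Hypothetical recursive scanner in the shape of
 * deserialize_add_tree_item: the cursor is advanced on each
 * recursive call.  Declaring it 'volatile char *' prevents the
 * compiler from caching loads across the inlined recursion. */
static int count_levels(volatile char *curr_delim, int depth)
{
    if (*curr_delim == '\0') {
        return depth;
    }
    if (*curr_delim == '/') {
        return count_levels(curr_delim + 1, depth + 1);  /* one level deeper */
    }
    return count_levels(curr_delim + 1, depth);
}

int main(void)
{
    printf("%d\n", count_levels("a/b/c", 0));  /* prints 2 */
    return 0;
}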
[OMPI devel] Failures
Today's trunk compiled with icc fails to complete the check on two tests: opal_lifo and opal_tree.

For opal_tree the output is:

OPAL dss:unpack: got type 9 when expecting type 3
Failure : failed tree deserialization size compare
SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)

and opal_lifo gets stuck forever, in the single-threaded call to thread_test, in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough to find the root cause, but a quick look at the opal_config.h file indicates that our configure detects __int128 as a supported type when it should not.

  George

Open MPI git d13c14e configured with --enable-debug
icc (ICC) 14.0.0 20130728
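If the root cause is indeed configure accepting __int128 too eagerly, the probe would have to go beyond a compile-only check. A sketch of a run-time test, assuming the GCC/ICC __atomic builtins (on x86-64 a 128-bit CAS may additionally need -mcx16 or libatomic):

/* Run-time probe for a usable 128-bit CAS.  A compile-only test can
 * pass while the generated code is still wrong, which is what this
 * thread suggests happened with icc 14.0.0. */
int main(void)
{
    __int128 value = 0;
    __int128 expected = 0;
    __int128 desired = 1;

    /* Returns nonzero and stores 'desired' iff value == expected. */
    if (!__atomic_compare_exchange_n(&value, &expected, desired,
                                     0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        return 1;  /* CAS claimed failure: reject __int128 support */
    }
    return (value == 1) ? 0 : 1;
}

Since the failure mode reported here is a hang rather than a wrong result, configure would also need to run such a probe under a timeout.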
Re: [OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()
On 7/11/07, George Bosilca wrote:
> The two errors you provide are quite different. The first one was
> addressed a few days ago in the trunk
> (https://svn.open-mpi.org/trac/ompi/changeset/15291). If instead of
> 1.2.3 you use anything after r15291, you will be safe in the
> threaded case.

Please take into account that in this case I did not use MPI_Init_thread()... In any case, sorry for the noise if this was already reported. I have other issues to report, but perhaps I should first try the svn version. Please understand, I am really too busy with many things to stay up to date with every source code I use. Sorry again.

> The second is different. The problem is that memcpy is a lot faster
> than memmove, and that's why we use it.

Yes, of course.

> The cases where the two data regions overlap are quite rare. I'll
> take a look to see exactly what happened there.

Initially, I thought it was my error, but then realized that this seems to happen in Comm.Split() internals.

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
Re: [OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()
Lisandro,

The two errors you provide are quite different. The first one was addressed a few days ago in the trunk (https://svn.open-mpi.org/trac/ompi/changeset/15291). If instead of 1.2.3 you use anything after r15291, you will be safe in the threaded case.

The second is different. The problem is that memcpy is a lot faster than memmove, and that's why we use it. The cases where the two data regions overlap are quite rare. I'll take a look to see exactly what happened there.

  george.
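For readers following along, a minimal illustration of the trade-off George describes (standard C, nothing OMPI-specific): memcpy may assume its source and destination do not overlap, which is what makes it faster, and also what valgrind flags later in this thread.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[8] = "abcdef";

    /* Overlapping regions: [a, a+4) and [a+2, a+6) share two bytes.
     * memmove handles overlap and is well-defined here... */
    memmove(a, a + 2, 4);
    printf("%s\n", a);  /* prints "cdefef" */

    /* ...while memcpy with the same arguments is undefined behavior,
     * which is precisely what valgrind reports inside
     * ompi_ddt_copy_content_same_ddt in the message below. */
    /* memcpy(a, a + 2, 4); */
    return 0;
}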
[OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()
Oops, sent to the wrong list; forwarding here...

---------- Forwarded message ----------
From: Lisandro Dalcin
Date: Jul 11, 2007 8:58 PM
Subject: failures running mpi4py testsuite, perhaps Comm.Split()
To: Open MPI

Hello all, after a long time I'm here again. I am improving mpi4py in order to support MPI threads, and I've found some problems with the latest version, 1.2.3.

I've configured with:

$ ./configure --prefix /usr/local/openmpi/1.2.3 --enable-mpi-threads --disable-dependency-tracking

However, for the failure below, MPI_Init_thread() was not used.

This test creates an intercommunicator by using Comm.Split() followed by Intracomm.Create_intercomm(). When running on two or more procs (with one proc this test is skipped), I (sometimes) got the following trace:

[trantor:06601] *** Process received signal ***
[trantor:06601] Signal: Segmentation fault (11)
[trantor:06601] Signal code: Address not mapped (1)
[trantor:06601] Failing at address: 0xa8
[trantor:06601] [ 0] [0x958440]
[trantor:06601] [ 1] /usr/local/openmpi/1.2.3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1483) [0x995553]
[trantor:06601] [ 2] /usr/local/openmpi/1.2.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0x645d06]
[trantor:06601] [ 3] /usr/local/openmpi/1.2.3/lib/libopen-pal.so.0(opal_progress+0x58) [0x1a2c88]
[trantor:06601] [ 4] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_request_wait_all+0xea) [0x140a8a]
[trantor:06601] [ 5] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc8) [0x22d6e8]
[trantor:06601] [ 6] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_bruck+0xf2) [0x231ca2]
[trantor:06601] [ 7] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_dec_fixed+0x8b) [0x22db7b]
[trantor:06601] [ 8] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_comm_split+0x9d) [0x12d92d]
[trantor:06601] [ 9] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(MPI_Comm_split+0xad) [0x15a53d]
[trantor:06601] [10] /u/dalcinl/lib/python/mpi4py/_mpi.so [0x508500]
[trantor:06601] [11] /usr/local/lib/libpython2.5.so.1.0(PyCFunction_Call+0x14d) [0xe150ad]
[trantor:06601] [12] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x64af) [0xe626bf]
[trantor:06601] [13] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [14] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x5a43) [0xe61c53]
[trantor:06601] [15] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x6130) [0xe62340]
[trantor:06601] [16] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [17] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] [18] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [19] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x42eb) [0xe604fb]
[trantor:06601] [20] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [21] /usr/local/lib/libpython2.5.so.1.0 [0xe0137a]
[trantor:06601] [22] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [23] /usr/local/lib/libpython2.5.so.1.0 [0xde6de5]
[trantor:06601] [24] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [25] /usr/local/lib/libpython2.5.so.1.0 [0xe2abc9]
[trantor:06601] [26] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [27] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x1481) [0xe5d691]
[trantor:06601] [28] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [29] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] *** End of error message ***

As the problem seems to originate in Comm.Split(), I've written a small python script to test it::

from mpi4py import MPI

# true MPI_COMM_WORLD_HANDLE
BASECOMM = MPI.__COMM_WORLD__

BASE_SIZE = BASECOMM.Get_size()
BASE_RANK = BASECOMM.Get_rank()

if BASE_RANK < (BASE_SIZE // 2):
    COLOR = 0
else:
    COLOR = 1

INTRACOMM = BASECOMM.Split(COLOR, key=0)

print 'Done!!!'

This always seems to work, but running it under valgrind (note: valgrind-py below is just an alias adding a suppression file for python) I get the following:

$ mpiexec -n 3 valgrind-py python test.py
==6727== Warning: set address range perms: large range 134217728 (defined)
==6727== Source and destination overlap in memcpy(0x4C93EA0, 0x4C93EA8, 16)
==6727==    at 0x4006CE6: memcpy (mc_replace_strmem.c:116)
==6727==    by 0x46C59CA: ompi_ddt_copy_content_same_ddt (in /usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6727==    by 0x4BADDCE: ompi_coll_tuned_allgather_intra_bruck (in /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==    by 0x4BA9B7A: ompi_coll_tuned_allgather_intra_dec_fixed (in /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==    by 0x46A692C: ompi_comm_split (in /usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6
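For anyone without a Python stack handy, here is a rough C equivalent of the reproducer above. This is a sketch under the assumption that a plain MPI_Comm_split over MPI_COMM_WORLD exercises the same ompi_comm_split/allgather path that valgrind flags:

#include <mpi.h>
#include <stdio.h>

/* Split MPI_COMM_WORLD into two halves, mirroring the mpi4py
 * BASECOMM.Split(COLOR, key=0) call in the script above. */
int main(int argc, char *argv[])
{
    int size, rank, color;
    MPI_Comm intracomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    color = (rank < size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &intracomm);

    printf("Done!!!\n");

    MPI_Comm_free(&intracomm);
    MPI_Finalize();
    return 0;
}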