Re: [OMPI devel] Failures

2015-01-19 Thread Gilles Gouaillardet
George,

I was able to reproduce the hang with Intel compiler 14.0.0,
but I am still unable to reproduce it with Intel compiler 14.3.

I was not able to work out where the issue comes from, so
I could not create an appropriate test in configure.

At this stage, I can only recommend that you update your compiler version.


Cheers,

Gilles

On 2015/01/17 0:19, George Bosilca wrote:
> Your patch solves the issue with opal_tree. The opal_lifo test remains broken.
>
>   George.
>
>
> On Fri, Jan 16, 2015 at 5:12 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  George,
>>
>> I pushed
>> https://github.com/open-mpi/ompi/commit/ac16970d21d21f529f1ec01ebe0520843227475b
>> in order to get the Intel compiler to work with ompi.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2015/01/16 17:29, Gilles Gouaillardet wrote:
>>
>> George,
>>
>> I was unable to reproduce the hang with icc 14.0.3.174 and greater on a
>> RHEL6-like distro.
>>
>> I was able to reproduce the opal_tree failure and found two possible
>> workarounds:
>> a) manually compile opal/class/opal_tree.lo *without* the
>> -finline-functions flag
>> b) update deserialize_add_tree_item and declare curr_delim as volatile
>> char * (see the patch below)
>>
>> This function is recursive, and the compiler could be generating
>> incorrect code for it.
>>
>> Cheers,
>>
>> Gilles
>>
>> diff --git a/opal/class/opal_tree.c b/opal/class/opal_tree.c
>> index e8964e0..492e8dc 100644
>> --- a/opal/class/opal_tree.c
>> +++ b/opal/class/opal_tree.c
>> @@ -465,7 +465,7 @@ int opal_tree_serialize(opal_tree_item_t
>> *start_item, opal_buffer_t *buffer)
>>  static int deserialize_add_tree_item(opal_buffer_t *data,
>>   opal_tree_item_t *parent_item,
>>   opal_tree_item_deserialize_fn_t
>> deserialize,
>> - char *curr_delim,
>> + volatile char *curr_delim,
>>   int depth)
>>  {
>>  int idx = 1, rc;
>>
>> On 2015/01/16 8:57, George Bosilca wrote:
>>
>>  Today's trunk compiled with icc fails to complete the check on 2 tests:
>> opal_lifo and opal_tree.
>>
>> For opal_tree the output is:
>> OPAL dss:unpack: got type 9 when expecting type 3
>>  Failure :  failed tree deserialization size compare
>> SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)
>>
>> and opal_lifo gets stuck forever in the single-threaded call to thread_test
>> in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough
>> to see what the root cause is, but a quick look at the opal_config.h file
>> indicates that our configure detects that __int128 is a supported type when
>> it should not be.
>>
>>   George
>>
>> Open MPI git d13c14e configured with --enable-debug
>> icc (ICC) 14.0.0 20130728



Re: [OMPI devel] Failures

2015-01-16 Thread George Bosilca
Your patch solves the issue with opal_tree. The opal_lifo test remains broken.

  George.


On Fri, Jan 16, 2015 at 5:12 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  George,
>
> I pushed
> https://github.com/open-mpi/ompi/commit/ac16970d21d21f529f1ec01ebe0520843227475b
> in order to get the Intel compiler to work with ompi.
>
> Cheers,
>
> Gilles
>
>
> On 2015/01/16 17:29, Gilles Gouaillardet wrote:
>
> George,
>
> I was unable to reproduce the hang with icc 14.0.3.174 and greater on a
> RHEL6-like distro.
>
> I was able to reproduce the opal_tree failure and found two possible
> workarounds:
> a) manually compile opal/class/opal_tree.lo *without* the
> -finline-functions flag
> b) update deserialize_add_tree_item and declare curr_delim as volatile
> char * (see the patch below)
>
> This function is recursive, and the compiler could be generating
> incorrect code for it.
>
> Cheers,
>
> Gilles
>
> diff --git a/opal/class/opal_tree.c b/opal/class/opal_tree.c
> index e8964e0..492e8dc 100644
> --- a/opal/class/opal_tree.c
> +++ b/opal/class/opal_tree.c
> @@ -465,7 +465,7 @@ int opal_tree_serialize(opal_tree_item_t
> *start_item, opal_buffer_t *buffer)
>  static int deserialize_add_tree_item(opal_buffer_t *data,
>   opal_tree_item_t *parent_item,
>   opal_tree_item_deserialize_fn_t
> deserialize,
> - char *curr_delim,
> + volatile char *curr_delim,
>   int depth)
>  {
>  int idx = 1, rc;
>
> On 2015/01/16 8:57, George Bosilca wrote:
>
>  Today's trunk compiled with icc fails to complete the check on 2 tests:
> opal_lifo and opal_tree.
>
> For opal_tree the output is:
> OPAL dss:unpack: got type 9 when expecting type 3
>  Failure :  failed tree deserialization size compare
> SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)
>
> and opal_lifo gets stuck forever in the single-threaded call to thread_test
> in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough
> to see what the root cause is, but a quick look at the opal_config.h file
> indicates that our configure detects that __int128 is a supported type when
> it should not be.
>
>   George
>
> Open MPI git d13c14e configured with --enable-debug
> icc (ICC) 14.0.0 20130728


Re: [OMPI devel] Failures

2015-01-16 Thread Gilles Gouaillardet
George,

I pushed
https://github.com/open-mpi/ompi/commit/ac16970d21d21f529f1ec01ebe0520843227475b
in order to get the Intel compiler to work with ompi.

Cheers,

Gilles

On 2015/01/16 17:29, Gilles Gouaillardet wrote:
> George,
>
> I was unable to reproduce the hang with icc 14.0.3.174 and greater on a
> RHEL6-like distro.
>
> I was able to reproduce the opal_tree failure and found two possible
> workarounds:
> a) manually compile opal/class/opal_tree.lo *without* the
> -finline-functions flag
> b) update deserialize_add_tree_item and declare curr_delim as volatile
> char * (see the patch below)
>
> This function is recursive, and the compiler could be generating
> incorrect code for it.
>
> Cheers,
>
> Gilles
>
> diff --git a/opal/class/opal_tree.c b/opal/class/opal_tree.c
> index e8964e0..492e8dc 100644
> --- a/opal/class/opal_tree.c
> +++ b/opal/class/opal_tree.c
> @@ -465,7 +465,7 @@ int opal_tree_serialize(opal_tree_item_t
> *start_item, opal_buffer_t *buffer)
>  static int deserialize_add_tree_item(opal_buffer_t *data,
>   opal_tree_item_t *parent_item,
>   opal_tree_item_deserialize_fn_t
> deserialize,
> - char *curr_delim,
> + volatile char *curr_delim,
>   int depth)
>  {
>  int idx = 1, rc;
>
> On 2015/01/16 8:57, George Bosilca wrote:
>> Today's trunk compiled with icc fails to complete the check on 2 tests:
>> opal_lifo and opal_tree.
>>
>> For opal_tree the output is:
>> OPAL dss:unpack: got type 9 when expecting type 3
>>  Failure :  failed tree deserialization size compare
>> SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)
>>
>> and opal_lifo gets stuck forever in the single-threaded call to thread_test
>> in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough
>> to see what the root cause is, but a quick look at the opal_config.h file
>> indicates that our configure detects that __int128 is a supported type when
>> it should not be.
>>
>>   George
>>
>> Open MPI git d13c14e configured with --enable-debug
>> icc (ICC) 14.0.0 20130728



Re: [OMPI devel] Failures

2015-01-16 Thread Gilles Gouaillardet
George,

I was unable to reproduce the hang with icc 14.0.3.174 and greater on a
RHEL6-like distro.

I was able to reproduce the opal_tree failure and found two possible
workarounds:
a) manually compile opal/class/opal_tree.lo *without* the
-finline-functions flag
b) update deserialize_add_tree_item and declare curr_delim as volatile
char * (see the patch below)

This function is recursive, and the compiler could be generating
incorrect code for it.

Cheers,

Gilles

diff --git a/opal/class/opal_tree.c b/opal/class/opal_tree.c
index e8964e0..492e8dc 100644
--- a/opal/class/opal_tree.c
+++ b/opal/class/opal_tree.c
@@ -465,7 +465,7 @@ int opal_tree_serialize(opal_tree_item_t
*start_item, opal_buffer_t *buffer)
 static int deserialize_add_tree_item(opal_buffer_t *data,
  opal_tree_item_t *parent_item,
  opal_tree_item_deserialize_fn_t
deserialize,
- char *curr_delim,
+ volatile char *curr_delim,
  int depth)
 {
 int idx = 1, rc;
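
For illustration, here is a self-contained sketch (hypothetical code, not
the opal_tree implementation) of the pattern the workaround targets: a
recursive function that walks a delimited buffer through a char * cursor,
with the cursor declared volatile so the compiler reloads it instead of
caching it across the inlined recursive calls.

#include <stdio.h>
#include <string.h>

/* Counts '|'-delimited fields by recursing once per field, mirroring the
 * shape of deserialize_add_tree_item. Marking the cursor volatile is the
 * same workaround as in the patch above: the compiler must reload the
 * pointer instead of keeping it in a register across the recursion. */
static int count_fields(volatile char *curr_delim, int depth)
{
    char *next = strchr((char *)curr_delim, '|');
    if (NULL == next) {
        return depth + 1;
    }
    return count_fields(next + 1, depth + 1);
}

int main(void)
{
    char buffer[] = "root|child|grandchild";
    printf("fields: %d\n", count_fields(buffer, 0));  /* prints 3 */
    return 0;
}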

On 2015/01/16 8:57, George Bosilca wrote:
> Today's trunk compiled with icc fails to complete the check on 2 tests:
> opal_lifo and opal_tree.
>
> For opal_tree the output is:
> OPAL dss:unpack: got type 9 when expecting type 3
>  Failure :  failed tree deserialization size compare
> SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)
>
> and opal_lifo gets stuck forever in the single-threaded call to thread_test
> in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough
> to see what the root cause is, but a quick look at the opal_config.h file
> indicates that our configure detects that __int128 is a supported type when
> it should not be.
>
>   George
>
> Open MPI git d13c14e configured with --enable-debug
> icc (ICC) 14.0.0 20130728



[OMPI devel] Failures

2015-01-15 Thread George Bosilca
Today's trunk compiled with icc fails to complete the check on 2 tests:
opal_lifo and opal_tree.

For opal_tree the output is:
OPAL dss:unpack: got type 9 when expecting type 3
 Failure :  failed tree deserialization size compare
SUPPORT: OMPI Test failed: opal_tree_t (1 of 12 failed)

and opal_lifo gets stuck forever in the single-threaded call to thread_test
in a 128-bit atomic CAS. Unfortunately I lack the time to dig deep enough
to see what the root cause is, but a quick look at the opal_config.h file
indicates that our configure detects that __int128 is a supported type when
it should not be.
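
For reference, a minimal standalone probe (a sketch, not the actual Open
MPI configure test) of what that detection implies: it exercises a 128-bit
compare-and-swap on __int128, which is the operation the opal_lifo test
hangs in. Build with icc, or with gcc -mcx16, on x86_64.

#include <stdio.h>

int main(void)
{
    /* Hypothetical probe: if the compiler only accepts the __int128 type
     * but cannot generate a working 128-bit CAS, this is where a lock-free
     * LIFO built on top of it goes wrong. */
    __int128 value = 0;
    __int128 oldval = 0;
    __int128 newval = 1;

    int ok = __sync_bool_compare_and_swap(&value, oldval, newval);

    printf("128-bit CAS %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}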

  George

Open MPI git d13c14e configured with --enable-debug
icc (ICC) 14.0.0 20130728


icc.tgz
Description: GNU Zip compressed data


Re: [OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()

2007-07-11 Thread Lisandro Dalcin

On 7/11/07, George Bosilca  wrote:

The two errors you provide are quite different. The first one was
addressed a few days ago in the trunk
(https://svn.open-mpi.org/trac/ompi/changeset/15291). If instead of
1.2.3 you use anything after r15291 you will be safe in the threading case.


Please take into account that in this case I did not use MPI_Init_thread() ...

In any case, sorry for the noise if this was already reported. I
have other issues to report, but perhaps I should try the SVN version first.
Please understand, I am too busy with many other things to stay
up to date with every code base I use. Sorry again.




The second is different. The problem is that memcpy is a lot faster
than memmove, and that's why we use it.


Yes, of course.


The cases where the two data
regions overlap are quite rare. I'll take a look to see exactly what
happened there.


Initially I thought it was my error, but then realized that this seems
to happen inside the Comm.Split() internals.



--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()

2007-07-11 Thread George Bosilca

Lisandro,

The two errors you provide are quite different. The first one was
addressed a few days ago in the trunk
(https://svn.open-mpi.org/trac/ompi/changeset/15291). If instead of
1.2.3 you use anything after r15291 you will be safe in the threading case.


The second is different. The problem is that memcpy is a lot faster
than memmove, and that's why we use it. The cases where the two data
regions overlap are quite rare. I'll take a look to see exactly what
happened there.
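
As a minimal standalone illustration (a hypothetical buffer, not the Open
MPI datatype code) of the distinction valgrind is flagging: copying between
overlapping regions is undefined with memcpy and requires memmove.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "0123456789";

    /* Source (buf + 2) and destination (buf) overlap, so memmove is
     * required; memcpy here would be exactly the undefined behaviour
     * valgrind reports as "Source and destination overlap". */
    memmove(buf, buf + 2, strlen(buf + 2) + 1);

    printf("%s\n", buf);  /* prints "23456789" */
    return 0;
}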


  george.

On Jul 11, 2007, at 8:08 PM, Lisandro Dalcin wrote:


Oops, sent to the wrong list, forwarding here...

-- Forwarded message --
From: Lisandro Dalcin 
Date: Jul 11, 2007 8:58 PM
Subject: failures running mpi4py testsuite, perhaps Comm.Split()
To: Open MPI 


Hello all, after a long time I'm here again. I am improving mpi4py in
order to support MPI threads, and I've found some problems with the latest
version, 1.2.3.

I've configured with:

$ ./configure --prefix /usr/local/openmpi/1.2.3 --enable-mpi-threads
--disable-dependency-tracking

However, for the following failure, MPI_Init_thread() was not used. This
test creates an intercommunicator by using Comm.Split() followed by
Intracomm.Create_intercomm(). When running on two or more procs (for
one proc this test is skipped), I sometimes got the following trace:

[trantor:06601] *** Process received signal ***
[trantor:06601] Signal: Segmentation fault (11)
[trantor:06601] Signal code: Address not mapped (1)
[trantor:06601] Failing at address: 0xa8
[trantor:06601] [ 0] [0x958440]
[trantor:06601] [ 1]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1483)
[0x995553]
[trantor:06601] [ 2]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36)
[0x645d06]
[trantor:06601] [ 3]
/usr/local/openmpi/1.2.3/lib/libopen-pal.so.0(opal_progress+0x58)
[0x1a2c88]
[trantor:06601] [ 4]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_request_wait_all+0xea)
[0x140a8a]
[trantor:06601] [ 5]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc8)
[0x22d6e8]
[trantor:06601] [ 6]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_bruck+0xf2)
[0x231ca2]
[trantor:06601] [ 7]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_dec_fixed+0x8b)
[0x22db7b]
[trantor:06601] [ 8]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_comm_split+0x9d)
[0x12d92d]
[trantor:06601] [ 9]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(MPI_Comm_split+0xad)
[0x15a53d]
[trantor:06601] [10] /u/dalcinl/lib/python/mpi4py/_mpi.so [0x508500]
[trantor:06601] [11]
/usr/local/lib/libpython2.5.so.1.0(PyCFunction_Call+0x14d) [0xe150ad]
[trantor:06601] [12]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x64af)
[0xe626bf]
[trantor:06601] [13]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [14]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x5a43)
[0xe61c53]
[trantor:06601] [15]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x6130)
[0xe62340]
[trantor:06601] [16]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [17] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] [18]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [19]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x42eb)
[0xe604fb]
[trantor:06601] [20]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [21] /usr/local/lib/libpython2.5.so.1.0 [0xe0137a]
[trantor:06601] [22]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [23] /usr/local/lib/libpython2.5.so.1.0 [0xde6de5]
[trantor:06601] [24]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [25] /usr/local/lib/libpython2.5.so.1.0 [0xe2abc9]
[trantor:06601] [26]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [27]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x1481)
[0xe5d691]
[trantor:06601] [28]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [29] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] *** End of error message ***


As the problem seems to originate in Comm.Split(), I've written a
small python script to test it::

from mpi4py import MPI

# true MPI_COMM_WORLD_HANDLE
BASECOMM = MPI.__COMM_WORLD__

BASE_SIZE = BASECOMM.Get_size()
BASE_RANK = BASECOMM.Get_rank()

if BASE_RANK < (BASE_SIZE // 2) :
    COLOR = 0
else:
    COLOR = 1

INTRACOMM = BASECOMM.Split(COLOR, key=0)
print 'Done!!!'

This always seems to work, but running it under valgrind (note:
valgrind-py below is just an alias adding a suppression file for
Python) I get the following:

mpiexec -n 3 valgrind-py python test.py

=6727== Warning: set address range perms: large range 134217728 (defined)
==6727== Source and destination overlap in memcpy(0x4C93EA0, 0x4C93EA8, 16)

[OMPI devel] failures running mpi4py testsuite, perhaps Comm.Split()

2007-07-11 Thread Lisandro Dalcin

Oops, sent to the wrong list, forwarding here...

-- Forwarded message --
From: Lisandro Dalcin 
List-Post: devel@lists.open-mpi.org
Date: Jul 11, 2007 8:58 PM
Subject: failures running mpi4py testsuite, perhaps Comm.Split()
To: Open MPI 


Hello all, after a long time I'm here again. I am improving mpi4py in
order to support MPI threads, and I've found some problems with the latest
version, 1.2.3.

I've configured with:

$ ./configure --prefix /usr/local/openmpi/1.2.3 --enable-mpi-threads
--disable-dependency-tracking

However, for the following failure, MPI_Init_thread() was not used. This
test creates an intercommunicator by using Comm.Split() followed by
Intracomm.Create_intercomm(). When running on two or more procs (for
one proc this test is skipped), I sometimes got the following trace:

[trantor:06601] *** Process received signal ***
[trantor:06601] Signal: Segmentation fault (11)
[trantor:06601] Signal code: Address not mapped (1)
[trantor:06601] Failing at address: 0xa8
[trantor:06601] [ 0] [0x958440]
[trantor:06601] [ 1]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1483)
[0x995553]
[trantor:06601] [ 2]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36)
[0x645d06]
[trantor:06601] [ 3]
/usr/local/openmpi/1.2.3/lib/libopen-pal.so.0(opal_progress+0x58)
[0x1a2c88]
[trantor:06601] [ 4]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_request_wait_all+0xea)
[0x140a8a]
[trantor:06601] [ 5]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc8)
[0x22d6e8]
[trantor:06601] [ 6]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_bruck+0xf2)
[0x231ca2]
[trantor:06601] [ 7]
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_dec_fixed+0x8b)
[0x22db7b]
[trantor:06601] [ 8]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_comm_split+0x9d)
[0x12d92d]
[trantor:06601] [ 9]
/usr/local/openmpi/1.2.3/lib/libmpi.so.0(MPI_Comm_split+0xad)
[0x15a53d]
[trantor:06601] [10] /u/dalcinl/lib/python/mpi4py/_mpi.so [0x508500]
[trantor:06601] [11]
/usr/local/lib/libpython2.5.so.1.0(PyCFunction_Call+0x14d) [0xe150ad]
[trantor:06601] [12]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x64af)
[0xe626bf]
[trantor:06601] [13]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [14]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x5a43)
[0xe61c53]
[trantor:06601] [15]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x6130)
[0xe62340]
[trantor:06601] [16]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [17] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] [18]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [19]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x42eb)
[0xe604fb]
[trantor:06601] [20]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [21] /usr/local/lib/libpython2.5.so.1.0 [0xe0137a]
[trantor:06601] [22]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [23] /usr/local/lib/libpython2.5.so.1.0 [0xde6de5]
[trantor:06601] [24]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [25] /usr/local/lib/libpython2.5.so.1.0 [0xe2abc9]
[trantor:06601] [26]
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [27]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x1481)
[0xe5d691]
[trantor:06601] [28]
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [29] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] *** End of error message ***


As the problem seems to originate in Comm.Split(), I've written a
small python script to test it::

from mpi4py import MPI

# true MPI_COMM_WORLD_HANDLE
BASECOMM = MPI.__COMM_WORLD__

BASE_SIZE = BASECOMM.Get_size()
BASE_RANK = BASECOMM.Get_rank()

if BASE_RANK < (BASE_SIZE // 2) :
   COLOR = 0
else:
   COLOR = 1

INTRACOMM = BASECOMM.Split(COLOR, key=0)
print 'Done!!!'

This always seems to work, but running it under valgrind (note:
valgrind-py below is just an alias adding a suppression file for
Python) I get the following:

mpiexec -n 3 valgrind-py python test.py

=6727== Warning: set address range perms: large range 134217728 (defined)
==6727== Source and destination overlap in memcpy(0x4C93EA0, 0x4C93EA8, 16)
==6727==at 0x4006CE6: memcpy (mc_replace_strmem.c:116)
==6727==by 0x46C59CA: ompi_ddt_copy_content_same_ddt (in
/usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6727==by 0x4BADDCE: ompi_coll_tuned_allgather_intra_bruck (in
/usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==by 0x4BA9B7A: ompi_coll_tuned_allgather_intra_dec_fixed
(in /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==by 0x46A692C: ompi_comm_split (in
/usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6