Sorry -- I've been offline since Friday morning, so I am slow to reply on this 
thread.

To be totally clear: I am seeing --enable-heterogeneous builds fail even on 
homogeneous clusters.  I was seeing timeouts and segfaults in Cisco's MTT last 
week, IIRC, so I disabled the --enable-heterogeneous builds.

I only have access to Intel/x86-based servers for Cisco's MTT, so I can only 
test this one case.

If we want to keep the heterogeneous code:

1. George's suggestion of bisecting to find the problem would probably be a 
good first step.  Unfortunately, I do not have the cycles to do this -- does 
someone else?

2. Someone really needs to commit to regular, periodic testing of actual 
heterogeneous test cases (as I mentioned in a prior email, a minimum of once 
a week would probably be good).  I think Gilles mentioned running big endian, 
little endian, and mixed big/little endian cases -- that would cover the 
entire range, and would be great.  Even a trivial send/recv (see the sketch 
below) is enough to trigger the problem.
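
For reference, here is a minimal sketch of the kind of trivial send/recv test 
involved.  This is illustrative only -- it is not the actual reproducer from 
Gilles's ticket -- but per his report below, even something this simple can 
hang under an --enable-heterogeneous build running between two little-endian 
peers:

    /* hetero_sendrecv.c -- a minimal sketch, not the actual reproducer.
       Build against an Open MPI configured with --enable-heterogeneous
       and run with 2 procs, e.g.: mpirun -np 2 ./hetero_sendrecv */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            value = 42;
            /* Per Gilles's report, even a trivial send like this can
               hang in an --enable-heterogeneous build, with both peers
               little endian. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (1 == rank) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Whoever picks up the periodic testing in #2 could run a test like this under 
MTT on little endian, big endian, and mixed setups to cover all three cases.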




On Apr 28, 2014, at 9:05 AM, Ralph Castain <r...@open-mpi.org> wrote:

> No, it looks like something has broken it since I last tested. Sorry about 
> the confusion.
> 
> On Apr 27, 2014, at 10:55 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
>> I might have misunderstood Jeff's comment:
>> 
>>> The broken part(s) is(are) likely somewhere in the datatype and/or PML code 
>>> (my guess).  Keep in mind that my only testing of this feature is in 
>>> *homogeneous* mode -- i.e., I compile with --enable-heterogeneous and then 
>>> run tests on homogeneous machines.  Meaning: it's not only broken for 
>>> actual heterogeneity, it's also broken in the "unity"/homogeneous case.
>> 
>> Unfortunately, a trivial send/recv can hang in this case 
>> (--enable-heterogeneous on a homogeneous cluster of little endian procs).
>> 
>> I opened #4568 (https://svn.open-mpi.org/trac/ompi/ticket/4568) to track 
>> this issue (uninitialized data can cause a hang with this configuration).
>> 
>> trunk is affected; v1.8 is very likely affected too.
>> 
>> Gilles
>> 
>> On 2014/04/28 12:22, Ralph Castain wrote:
>>> I think you misunderstood his comment. It works fine on a homogeneous 
>>> cluster, even with --enable-hetero. I've run it that way on my cluster.
>>> 
>>> On Apr 27, 2014, at 7:50 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>> 
>>> 
>>>> According to Jeff's comment, Open MPI compiled with
>>>> --enable-heterogeneous is broken even on a homogeneous cluster.
>>>> 
>>>> As a first step, MTT could be run with Open MPI compiled with
>>>> --enable-heterogeneous on a homogeneous cluster (ideally on both
>>>> little and big endian) in order to identify and fix the
>>>> bug/regression.
>>>> /* This build is currently disabled in the MTT config of the
>>>> cisco-community cluster. */
>>>> 
>>>> Gilles
>>>> 
>>>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
