Sorry -- I've been offline since Friday morning, so I am slow to reply on this thread.
To be totally clear: I am seeing --enable-heterogeneous fail on *homogeneous* clusters. I was seeing timeouts and segv's in Cisco's MTT last week, IIRC, so I disabled the --enable-heterogeneous builds. I only have access to Intel/x86-based servers for Cisco's MTT, so I can only test this one case.

If we want to keep the heterogeneous code:

1. George's suggestion of doing a bisect to find the problem would probably be a good first step. I unfortunately do not have the cycles to do this. Does someone else?

2. Someone really needs to commit to doing regular, periodic testing of actual heterogeneous test cases (as I think I mentioned in a prior email, a minimum of once a week would probably be good). I think Gilles mentioned running big endian, little endian, and mixed big-little endian cases -- that would cover the entire range, and would be great.

On Apr 28, 2014, at 9:05 AM, Ralph Castain <r...@open-mpi.org> wrote:

> No, it looks like something has broken it since I last tested. Sorry about the confusion.
>
> On Apr 27, 2014, at 10:55 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
>> I might have misunderstood Jeff's comment:
>>
>>> The broken part(s) is(are) likely somewhere in the datatype and/or PML code (my guess). Keep in mind that my only testing of this feature is in *homogeneous* mode -- i.e., I compile with --enable-heterogeneous and then run tests on homogeneous machines. Meaning: it's not only broken for actual heterogeneity, it's also broken in the "unity"/homogeneous case.
>>
>> Unfortunately, a trivial send/recv can hang in this case (--enable-heterogeneous on a homogeneous cluster of little endian procs).
>>
>> I opened #4568 https://svn.open-mpi.org/trac/ompi/ticket/4568 in order to track this issue (uninitialized data can cause a hang with this config).
>>
>> Trunk is affected; v1.8 is very likely affected too.
>>
>> Gilles
>>
>> On 2014/04/28 12:22, Ralph Castain wrote:
>>> I think you misunderstood his comment. It works fine on a homogeneous cluster, even with --enable-hetero. I've run it that way on my cluster.
>>>
>>> On Apr 27, 2014, at 7:50 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> According to Jeff's comment, Open MPI compiled with --enable-heterogeneous is broken even on a homogeneous cluster.
>>>>
>>>> As a first step, MTT could be run with Open MPI compiled with --enable-heterogeneous on a homogeneous cluster (ideally on both little and big endian) in order to identify and fix the bug/regression.
>>>> /* this build is currently disabled in the MTT config of the cisco-community cluster */
>>>>
>>>> Gilles
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/04/14624.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/04/14625.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/