Yeah, I have some concerns about it too... I've been trying to test it out some 
more. It would be good to see just how much that one change makes - maybe 
restoring just the hostname wouldn't have that big an impact.

I'm leery of trying to ensure we strip out all the opal_output loops that 
could occur if we don't find the hostname.

On Aug 19, 2013, at 2:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> As a result of this patch, the first decode of a peer's host name might 
> happen in the middle of a debug message (on the first call to 
> ompi_proc_get_hostname). Such behavior might generate deadlocks, depending on 
> the level of output verbosity, and has significant potential to reintroduce 
> the recursive behavior the new state machine was supposed to remove.
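> 
> (An illustrative call chain for the hazard - hypothetical, not an observed 
> trace:)
> 
>   opal_output_verbose(...)              /* debug message wants the peer name */
>     -> ompi_proc_get_hostname(proc)     /* first call: name not cached yet   */
>        -> PMI_KVS_Get(...)              /* blocking fetch, mid-debug-message */
>           -> opal_output_verbose(...)   /* verbose tracing re-enters output  */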
> 
>  George.
> 
> 
> On Aug 17, 2013, at 02:49 , svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: rhc (Ralph Castain)
>> Date: 2013-08-16 20:49:18 EDT (Fri, 16 Aug 2013)
>> New Revision: 29040
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29040
>> 
>> Log:
>> When we direct launch an application, we rely on PMI for wireup support. In 
>> doing so, we lose the de facto data compression we get from the ORTE modex 
>> since we no longer get all the wireup info from every proc in a single blob. 
>> Instead, we have to iterate over all the procs, calling PMI_KVS_Get for 
>> every value we require.
>> 
>> This creates really bad scaling behavior. Users have found a nearly 20% 
>> launch time differential between mpirun and PMI, with PMI being the slower 
>> method. Some of the problem is attributable to poor exchange algorithms in 
>> RMs like Slurm and ALPS, but we make things worse by calling "get" so many 
>> times.
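>> 
>> A rough sketch of the per-key pattern (PMI_KVS_Get is the real PMI-1 call; 
>> everything else here is illustrative, not the actual code):
>> 
>>   #include <stdio.h>
>>   #include <pmi.h>
>> 
>>   static void fetch_all_peer_values(const char *kvsname, int nprocs,
>>                                     const char **key_names, int num_keys)
>>   {
>>       char key[64], val[1024];   /* illustrative buffer sizes */
>>       for (int rank = 0; rank < nprocs; rank++) {
>>           for (int k = 0; k < num_keys; k++) {
>>               snprintf(key, sizeof(key), "%s-%d", key_names[k], rank);
>>               /* one blocking round trip per key per peer:
>>                * O(nprocs * num_keys) calls into the RM */
>>               PMI_KVS_Get(kvsname, key, val, (int)sizeof(val));
>>           }
>>       }
>>   }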
>> 
>> Nathan (with a tad of advice from me) has attempted to alleviate this problem 
>> by reducing the number of "get" calls. This required the following changes:
>> 
>> * upon first request for data, have the OPAL db pmi component fetch and 
>> decode *all* the info from a given remote proc. It turned out we weren't 
>> caching the info, so we would continually request it and only decode the 
>> piece we needed for the immediate request. We now decode all the info and 
>> push it into the db hash component for local storage - and then all 
>> subsequent retrievals are fulfilled locally.
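>> 
>> A minimal sketch of that flow (every helper name here is hypothetical; the 
>> real logic lives in the opal db pmi and db hash components):
>> 
>>   static const char *get_value(peer_t *proc, const char *wanted_key)
>>   {
>>       if (!peer_is_cached(proc)) {
>>           /* first request for this peer: one PMI fetch returns the
>>            * whole encoded blob; decode every key/value in it and
>>            * push them into the local hash */
>>           blob_t *blob = pmi_fetch_blob(proc);
>>           const char *key, *val;
>>           while (decode_next_kv(blob, &key, &val)) {
>>               db_hash_store(proc, key, val);
>>           }
>>           mark_peer_cached(proc);
>>       }
>>       /* this and all later lookups are local hash hits */
>>       return db_hash_fetch(proc, wanted_key);
>>   }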
>> 
>> * reduced the amount of data by eliminating the exchange of the OMPI_ARCH 
>> value if heterogeneity is not enabled. This was used solely as a check so we 
>> would error out if the system wasn't actually homogeneous, which was fine 
>> when we thought there was no cost in doing the check. Unfortunately, at 
>> large scale and with direct launch, there is a non-zero cost to performing 
>> this test. We are open to finding a compromise (perhaps turning the test off 
>> if requested?) if people feel strongly about keeping it.
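>> 
>> Roughly (OPAL_ENABLE_HETEROGENEOUS_SUPPORT is the existing configure-time 
>> macro; the send call is illustrative):
>> 
>>   #if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
>>       /* only pay for the arch exchange when heterogeneous
>>        * support was actually built in */
>>       modex_send_value("OMPI_ARCH", &my_arch, sizeof(my_arch));
>>   #endif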
>> 
>> * reduced the amount of RTE data being automatically fetched, and fetched 
>> the rest only upon request. In particular, we no longer immediately fetch 
>> the hostname (which is only used for error reporting), but instead get it 
>> when needed. Likewise for the RML URI, as that info is only required for some 
>> (not all) environments. In addition, we no longer fetch the locality unless 
>> required, relying instead on the PMI clique info to tell us who is on our 
>> local node (if additional info is required, the fetch is performed when a 
>> modex_recv is issued).
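>> 
>> Fetch-on-first-use instead of fetch-at-startup - a sketch, with the PMI 
>> fetch helper being hypothetical:
>> 
>>   const char *ompi_proc_get_hostname(ompi_proc_t *proc)
>>   {
>>       if (NULL == proc->proc_hostname) {
>>           /* no longer pre-fetched during startup; pulled from PMI
>>            * only when something (e.g., an error report) asks */
>>           proc->proc_hostname = pmi_fetch_string(proc, "hostname");
>>       }
>>       return proc->proc_hostname;
>>   }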
>> 
>> Again, all this only impacts direct launch - all the info is provided when 
>> launched via mpirun, as there is no added cost to getting it.
>> 
>> Barring objections, we may move this (plus any other required pieces) to the 
>> 1.7 branch once it soaks for an appropriate time.
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
