That solution is fine with me.

-Nathan

On Tue, Aug 20, 2013 at 12:41:49AM +0200, George Bosilca wrote:
> If your offer is between quadratic and non-deterministic, I'll take the 
> former.
> 
> I would advocate for a middle-ground solution. Clearly document in the header 
> file that ompi_proc_get_hostname is __not__ safe to use in all contexts, as 
> it may exhibit recursive behavior due to communication. Then revert all of 
> its uses in opal_output, opal_output_verbose, and all variants back to using 
> "->proc_hostname". We might get a (null) instead of the peer name, but this 
> removes the potential loops.
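> 
> Roughly what I have in mind -- a sketch only, where the exact prototype, 
> verbosity level, and output id are illustrative rather than the real code:
> 
>     /* header: warn that the accessor may communicate on first use */
>     /**
>      * Return the hostname of a peer process.
>      *
>      * WARNING: not safe to call from all contexts.  The first call for a
>      * given peer may trigger a modex/db fetch (i.e. communication), so it
>      * can recurse if invoked from within opal_output / opal_output_verbose.
>      */
>     const char *ompi_proc_get_hostname(ompi_proc_t *proc);
> 
>     /* debug/verbose output: use the raw field - may print (null), but can
>      * never trigger communication */
>     opal_output_verbose(10, some_output_id /* illustrative */,
>                         "peer on %s unreachable",
>                         (NULL == proc->proc_hostname) ? "(null)"
>                                                       : proc->proc_hostname);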
> 
>   George.
> 
> On Aug 19, 2013, at 23:52 , Nathan Hjelm <hje...@lanl.gov> wrote:
> 
> > It would require a db read from every rank, which is what we are trying
> > to avoid. That scales quadratically at best on Cray systems.
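> > 
> > For the archives, the pattern we are trying to avoid looks roughly like
> > this (a sketch only - the key name is hypothetical): every one of the N
> > ranks issues a get per peer, so the job as a whole performs O(N^2) KVS
> > operations.
> > 
> >     char kvs[256], key[64], val[1024];
> >     int nprocs, r;
> >     PMI_Get_size(&nprocs);
> >     PMI_KVS_Get_my_name(kvs, sizeof(kvs));
> >     for (r = 0; r < nprocs; r++) {
> >         snprintf(key, sizeof(key), "hostname-%d", r);   /* hypothetical key */
> >         PMI_KVS_Get(kvs, key, val, sizeof(val));         /* one get per peer */
> >     }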
> > 
> > -Nathan
> > 
> > On Mon, Aug 19, 2013 at 02:48:18PM -0700, Ralph Castain wrote:
> >> Yeah, I have some concerns about it too... I've been trying to test it out 
> >> some more. It would be good to see just how much difference that one change 
> >> makes - maybe restoring just the hostname wouldn't have that big an impact.
> >> 
> >> I'm leery of trying to ensure we strip all the opal_output loops if we 
> >> don't find the hostname.
> >> 
> >> On Aug 19, 2013, at 2:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >> 
> >>> As a result of this patch, the first decode of a peer's host name might 
> >>> happen in the middle of a debug message (on the first call to 
> >>> ompi_proc_get_hostname). Such behavior might generate deadlocks, depending 
> >>> on the level of output verbosity, and has significant potential to 
> >>> reintroduce the recursive behavior the new state machine was supposed to 
> >>> remove.
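> >>> 
> >>> The failure mode I am worried about, roughly (the call chain below is 
> >>> illustrative, not an observed trace):
> >>> 
> >>>     opal_output_verbose(...)                /* message mentions a peer     */
> >>>       -> ompi_proc_get_hostname(peer)       /* hostname not yet cached     */
> >>>         -> modex/db fetch (communication)   /* may itself emit verbose     */
> >>>                                             /* output ...                  */
> >>>           -> opal_output_verbose(...)       /* ... re-entering the path    */
> >>>             -> ompi_proc_get_hostname(...)  /* and possibly looping or     */
> >>>                                             /* deadlocking                 */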
> >>> 
> >>> George.
> >>> 
> >>> 
> >>> On Aug 17, 2013, at 02:49 , svn-commit-mai...@open-mpi.org wrote:
> >>> 
> >>>> Author: rhc (Ralph Castain)
> >>>> Date: 2013-08-16 20:49:18 EDT (Fri, 16 Aug 2013)
> >>>> New Revision: 29040
> >>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29040
> >>>> 
> >>>> Log:
> >>>> When we direct-launch an application, we rely on PMI for wireup support. 
> >>>> In doing so, we lose the de facto data compression we get from the ORTE 
> >>>> modex, since we no longer get all the wireup info from every proc in a 
> >>>> single blob. Instead, we have to iterate over all the procs, calling 
> >>>> PMI_KVS_get for every value we require.
> >>>> 
> >>>> This creates really bad scaling behavior. Users have found nearly a 20% 
> >>>> launch-time differential between mpirun and PMI, with PMI being the 
> >>>> slower method. Some of the problem is attributable to poor exchange 
> >>>> algorithms in RMs like Slurm and ALPS, but we make things worse by 
> >>>> calling "get" so many times.
> >>>> 
> >>>> Nathan (with a tad of advice from me) has attempted to alleviate this 
> >>>> problem by reducing the number of "get" calls. This required the 
> >>>> following changes:
> >>>> 
> >>>> * upon first request for data, have the OPAL db pmi component fetch and 
> >>>> decode *all* the info from a given remote proc. It turned out we weren't 
> >>>> caching the info, so we would continually request it and only decode the 
> >>>> piece we needed for the immediate request. We now decode all the info 
> >>>> and push it into the db hash component for local storage - and then all 
> >>>> subsequent retrievals are fulfilled locally (see the first sketch after 
> >>>> this list).
> >>>> 
> >>>> * reduced the amount of data by eliminating the exchange of the 
> >>>> OMPI_ARCH value if heterogeneity is not enabled. This was used solely as 
> >>>> a check so we would error out if the system wasn't actually homogeneous, 
> >>>> which was fine when we thought there was no cost in doing the check. 
> >>>> Unfortunately, at large scale and with direct launch, there is a 
> >>>> non-zero cost to making this test (second sketch below). We are open to 
> >>>> finding a compromise (perhaps turning the test off if requested?), if 
> >>>> people feel strongly about performing the test.
> >>>> 
> >>>> * reduced the amount of RTE data being automatically fetched, and 
> >>>> fetched the rest only upon request. In particular, we no longer 
> >>>> immediately fetch the hostname (which is only used for error reporting), 
> >>>> but instead get it when needed (third sketch below). Likewise for the 
> >>>> RML URI, as that info is only required for some (not all) environments. 
> >>>> In addition, we no longer fetch the locality unless required, relying 
> >>>> instead on the PMI clique info to tell us who is on our local node (if 
> >>>> additional info is required, the fetch is performed when a modex_recv 
> >>>> is issued).
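> >>>> 
> >>>> First sketch - the fetch-everything-once-then-cache behavior. This is a 
> >>>> toy illustration, not the real db pmi / db hash code; all names and the 
> >>>> fake "blob" contents are made up:
> >>>> 
> >>>>     #include <stdbool.h>
> >>>>     #include <stdio.h>
> >>>> 
> >>>>     #define NPEERS 4
> >>>> 
> >>>>     struct peer_cache { bool have_blob; char hostname[64]; };
> >>>>     static struct peer_cache cache[NPEERS];
> >>>> 
> >>>>     /* stand-in for one PMI get of the peer's whole encoded blob,
> >>>>      * followed by decoding every key/value pair it contains */
> >>>>     static void fetch_and_decode_all(int peer) {
> >>>>         snprintf(cache[peer].hostname, sizeof cache[peer].hostname,
> >>>>                  "node%03d", peer);
> >>>>         cache[peer].have_blob = true;
> >>>>     }
> >>>> 
> >>>>     static const char *get_hostname(int peer) {
> >>>>         if (!cache[peer].have_blob) {    /* first request: one remote fetch  */
> >>>>             fetch_and_decode_all(peer);  /* decodes *all* of the peer's data */
> >>>>         }
> >>>>         return cache[peer].hostname;     /* later requests never touch PMI   */
> >>>>     }
> >>>> 
> >>>>     int main(void) {
> >>>>         printf("%s\n", get_hostname(2)); /* triggers the fetch */
> >>>>         printf("%s\n", get_hostname(2)); /* served from cache  */
> >>>>         return 0;
> >>>>     }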
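> >>>> 
> >>>> Second sketch - the shape of the OMPI_ARCH change is simply a 
> >>>> configure-time guard; the publish call below is a placeholder, not the 
> >>>> literal diff:
> >>>> 
> >>>>     #if OPAL_ENABLE_HETEROGENEOUS_SUPPORT
> >>>>         /* only publish (and later fetch/verify) the architecture when a
> >>>>          * heterogeneous build could actually hit a mismatch; homogeneous
> >>>>          * builds save one key per proc at scale */
> >>>>         publish_local_arch();   /* placeholder for the real modex send */
> >>>>     #endif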
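> >>>> 
> >>>> Third sketch - the lazy hostname fetch, again illustrative rather than 
> >>>> the committed code (the lookup helper name is a placeholder):
> >>>> 
> >>>>     const char *ompi_proc_get_hostname(ompi_proc_t *proc)
> >>>>     {
> >>>>         if (NULL == proc->proc_hostname) {
> >>>>             /* only error-reporting / verbose paths normally get here, so
> >>>>              * the db lookup cost is paid only when the value is needed */
> >>>>             proc->proc_hostname = lookup_hostname_in_db(proc); /* placeholder */
> >>>>         }
> >>>>         return proc->proc_hostname;
> >>>>     }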
> >>>> 
> >>>> Again, all this only impacts direct launch - all the info is provided 
> >>>> when launched via mpirun, as there is no added cost to getting it.
> >>>> 
> >>>> Barring objections, we may move this (plus any other required pieces) to 
> >>>> the 1.7 branch once it soaks for an appropriate time.
> >>> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
