It would require a db read from every rank, which is what we are trying
to avoid. That scales quadratically at best on Cray systems.

-Nathan

On Mon, Aug 19, 2013 at 02:48:18PM -0700, Ralph Castain wrote:
> Yeah, I have some concerns about it too... I've been trying to test it out
> some more. It would be good to see just how much difference that one change
> makes - maybe restoring just the hostname wouldn't have that big an impact.
> 
> I'm leery of trying to ensure we strip all the opal_output loops if we don't 
> find the hostname.
> 
> On Aug 19, 2013, at 2:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> > As a result of this patch, the first decode of a peer's hostname might happen 
> > in the middle of a debug message (on the first call to 
> > ompi_proc_get_hostname). Such behavior might generate deadlocks depending on 
> > the level of output verbosity, and has significant potential to reintroduce 
> > the recursive behavior the new state machine was supposed to remove.
> > 
> >  George.
> > 
> > 
> > On Aug 17, 2013, at 02:49 , svn-commit-mai...@open-mpi.org wrote:
> > 
> >> Author: rhc (Ralph Castain)
> >> Date: 2013-08-16 20:49:18 EDT (Fri, 16 Aug 2013)
> >> New Revision: 29040
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/29040
> >> 
> >> Log:
> >> When we direct launch an application, we rely on PMI for wireup support. 
> >> In doing so, we lose the de facto data compression we get from the ORTE 
> >> modex since we no longer get all the wireup info from every proc in a 
> >> single blob. Instead, we have to iterate over all the procs, calling 
> >> PMI_KVS_get for every value we require.
> >> 
> >> This creates really bad scaling behavior. Users have found a nearly 20% 
> >> launch time differential between mpirun and PMI, with PMI being the slower 
> >> method. Some of the problem is attributable to poor exchange algorithms in 
> >> RMs like Slurm and ALPS, but we make things worse by calling "get" so 
> >> many times.
> >> 
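For reference, the per-value pattern described above looks roughly like the
sketch below (not the actual OMPI code; the key names are hypothetical and
error handling is omitted):

    #include <stdio.h>
    #include <pmi.h>

    /* Illustrative only: one PMI_KVS_Get round trip per (peer, key) pair. */
    static void fetch_wireup_info(int nprocs)
    {
        char kvsname[256], key[64], value[1024];
        int rank, vallen;

        PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));
        PMI_KVS_Get_value_length_max(&vallen);
        if (vallen > (int)sizeof(value)) vallen = (int)sizeof(value);

        for (rank = 0; rank < nprocs; rank++) {
            /* every field of every peer is a separate request to the RM */
            snprintf(key, sizeof(key), "btl-info-%d", rank);   /* hypothetical key */
            PMI_KVS_Get(kvsname, key, value, vallen);
            snprintf(key, sizeof(key), "hostname-%d", rank);   /* hypothetical key */
            PMI_KVS_Get(kvsname, key, value, vallen);
            /* ...and so on for each piece of wireup info... */
        }
    }

With N procs and K values per proc, that is N*K "get" calls at startup.
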
> >> Nathan (with a tad of advice from me) has attempted to alleviate this problem 
> >> by reducing the number of "get" calls. This required the following changes:
> >> 
> >> * upon first request for data, have the OPAL db pmi component fetch and 
> >> decode *all* the info from a given remote proc. It turned out we weren't 
> >> caching the info, so we would continually request it and only decode the 
> >> piece we needed for the immediate request. We now decode all the info and 
> >> push it into the db hash component for local storage - and then all 
> >> subsequent retrievals are fulfilled locally
> >> 
> >> * reduced the amount of data by eliminating the exchange of the OMPI_ARCH 
> >> value if heterogeneity is not enabled. This was used solely as a check so 
> >> we would error out if the system wasn't actually homogeneous, which was 
> >> fine when we thought there was no cost in doing the check. Unfortunately, 
> >> at large scale and with direct launch, there is a non-zero cost of making 
> >> this test. We are open to finding a compromise (perhaps turning the test 
> >> off if requested?), if people feel strongly about performing the test
> >> 
> >> * reduced the amount of RTE data being automatically fetched, and fetched 
> >> the rest only upon request. In particular, we no longer immediately fetch 
> >> the hostname (which is only used for error reporting), but instead get it 
> >> when needed. Likewise for the RML URI, as that info is only required for 
> >> some (not all) environments. In addition, we no longer fetch the locality 
> >> unless required, relying instead on the PMI clique info to tell us who is 
> >> on our local node (if additional info is required, the fetch is performed 
> >> when a modex_recv is issued).
> >> 
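To make the first item above concrete, here is a rough sketch of the
"fetch once, cache locally" idea. The real code decodes the blob and pushes
it into the opal db hash component; the trivial per-rank cache and the key
name below are only stand-ins:

    #include <stdio.h>
    #include <stdlib.h>
    #include <pmi.h>

    struct peer_cache {
        int   fetched;    /* already pulled this peer's blob? */
        char *blob;       /* decoded wireup info for the peer */
    };
    static struct peer_cache *cache;   /* indexed by rank, allocated at init */

    /* Return the peer's info, fetching it from PMI only on first use. */
    static const char *get_peer_info(const char *kvsname, int rank, int vallen)
    {
        if (!cache[rank].fetched) {
            char key[64];
            cache[rank].blob = malloc(vallen);
            snprintf(key, sizeof(key), "modex-blob-%d", rank);  /* hypothetical key */
            PMI_KVS_Get(kvsname, key, cache[rank].blob, vallen);
            cache[rank].fetched = 1;
        }
        /* all later lookups are satisfied locally, with no further "get" calls */
        return cache[rank].blob;
    }
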
> >> Again, all this only impacts direct launch - all the info is provided when 
> >> launched via mpirun as there is no added cost to getting it
> >> 
> >> Barring objections, we may move this (plus any required other pieces) to 
> >> the 1.7 branch once it soaks for an appropriate time.
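
The "PMI clique info" mentioned in the third item is the list of ranks
sharing the local node. A minimal sketch of querying it with the standard
PMI-1 calls (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <pmi.h>

    static void show_local_peers(void)
    {
        int size, i, *ranks;

        PMI_Get_clique_size(&size);          /* number of ranks on this node */
        ranks = malloc(size * sizeof(int));
        PMI_Get_clique_ranks(ranks, size);   /* their global rank numbers */

        for (i = 0; i < size; i++) {
            printf("rank %d is on my node\n", ranks[i]);
        }
        free(ranks);
    }

This is why locality no longer has to be fetched eagerly: the clique list is
already available locally from PMI.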
> > 
