On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
> On Aug 19, 2013, at 8:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> That's how it works now. My concern is with the error message scenario.
>> IIRC, Jeff's issue was that the error message only contains the hostname of
>> the proc that generates it - it doesn't tell you the hostname of the remote
>> proc. Hence, we included that info in the proc_t.
>
> This is quite important for getting useful error messages.
>
>> However, IIRC we also provided an option to *not* send that info due to
>> scaling concerns way back when. I wonder if we can resolve this simply by
>> having Nathan set that option in his platform .conf files, and then removing
>> ompi_proc_get_hostname completely. Since the IP-based comm channels will
>> call modex_recv anyway, we'll get the hostname at that time. Otherwise, the
>> errors print "NULL" for proc->hostname.
>>
>> Yes, that means that users of direct-launched apps on Nathan's systems will
>> get less informative error messages - but they can always override Nathan's
>> default param if they want better info. After all, the vast majority of
>> users aren't running such big jobs as to care about this optimization.
>
> I'm good with it. It could also be (might already be) a run-time MCA
> param...?

I think it is - I'll check tonight

> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N
> procs, send the hostname around, otherwise, don't send it (we can argue over
> the value of N -- e.g., 1024 or 2048).

That makes the most sense to me - for small jobs, the time difference is too tiny to measure.

> --
> Jeff Squyres
> jsquy...@cisco.com
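
For reference, the -1 default proposed above could look roughly like the sketch below. This is only an illustration of the decision logic: the function name, the macro, and the cutoff are hypothetical and are not existing Open MPI symbols, and 1024 is just one of the two values of N floated in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff N; the thread leaves the value open (1024 or 2048). */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/*
 * Hypothetical helper, not an Open MPI function.
 *
 * param: -1 = decide from the job size, 0 = never send, 1 = always send.
 * np:     number of procs in the job.
 * cutoff: the N from the proposal; hostnames are sent when np <= cutoff.
 */
static bool should_send_hostname(int param, size_t np, size_t cutoff)
{
    if (param < 0) {
        /* Automatic mode: small jobs pay the cost, large jobs skip it. */
        return np <= cutoff;
    }
    /* Explicit 0 or 1 from the user or a platform file wins. */
    return param != 0;
}
```

With this shape, a platform file that sets the param to 0 still disables hostname exchange outright, and a user who wants the remote hostname in error messages can force it back on with 1 regardless of job size.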