On Aug 19, 2013, at 8:02 PM, Ralph Castain <r...@open-mpi.org> wrote:

> That's how it works now. My concern is with the error message scenario. IIRC, 
> Jeff's issue was that the error message only contains the hostname of the 
> proc that generates it - it doesn't tell you the hostname of the remote proc. 
> Hence, we included that info in the proc_t.

This is quite important for getting useful error messages.

> However, IIRC we also provided an option to *not* send that info due to 
> scaling concerns way back when. I wonder if we can resolve this simply by 
> having Nathan set that option in his platform .conf files, and then removing 
> ompi_proc_get_hostname completely. Since the IP-based comm channels will call 
> modex_recv anyway, we'll get the hostname at that time. Otherwise, the errors 
> print "NULL" for proc->hostname.
> 
> Yes, that means that users of direct-launched apps on Nathan's systems will 
> get less informative error messages - but they can always override Nathan's 
> default param if they want better info. After all, the vast majority of users 
> aren't running such big jobs as to care about this optimization.

I'm good with it.  It could also be (might already be) a run-time MCA param...?

We could also default the value to -1 (vs. 0 or 1), meaning: if np <= N 
procs, send the hostname around; otherwise, don't send it (we can argue over 
the value of N -- e.g., 1024 or 2048).
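
A minimal sketch of that tri-state default, just to make the proposal concrete. The function name, the parameter names, and the cutoff macro are all hypothetical (none of them exist in the code base today), and the cutoff of 1024 is only one of the candidate values mentioned above:

```c
#include <stdbool.h>

/* Hypothetical cutoff "N"; whether it should be 1024 or 2048 is still open. */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/*
 * Hypothetical helper, not an existing Open MPI function.
 *
 * send_hostnames:  1 = always send the hostname around
 *                  0 = never send it
 *                 -1 = decide from the job size (the proposed default)
 * np:             number of procs in the job
 * cutoff:         the "N" threshold used when send_hostnames is -1
 */
static bool should_send_hostname(int send_hostnames, int np, int cutoff)
{
    if (send_hostnames >= 0) {
        /* An explicit 0 or 1 from the user or a platform file wins. */
        return send_hostnames != 0;
    }
    /* -1: send only for jobs with np <= cutoff. */
    return np <= cutoff;
}
```

With this shape, a platform file could still force 0 for very large systems, and a user who wants the better error messages could force 1 regardless of job size.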

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
