On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
wrote:

> On Aug 19, 2013, at 8:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> That's how it works now. My concern is with the error message scenario. 
>> IIRC, Jeff's issue was that the error message only contains the hostname of 
>> the proc that generates it - it doesn't tell you the hostname of the remote 
>> proc. Hence, we included that info in the proc_t.
> 
> This is quite important for getting useful error messages.
> 
>> However, IIRC we also provided an option to *not* send that info due to 
>> scaling concerns way back when. I wonder if we can resolve this simply by 
>> having Nathan set that option in his platform .conf files, and then removing 
>> ompi_proc_get_hostname completely. Since the IP-based comm channels will 
>> call modex_recv anyway, we'll get the hostname at that time. Otherwise, the 
>> errors print "NULL" for proc->hostname.
>> 
>> Yes, that means that users of direct-launched apps on Nathan's systems will 
>> get less informative error messages - but they can always override Nathan's 
>> default param if they want better info. After all, the vast majority of 
>> users aren't running such big jobs as to care about this optimization.
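
To make the error-message concern concrete, here is a minimal sketch of how a message could degrade gracefully when the remote hostname was never sent. It assumes a stand-in `struct peer` with only a `hostname` field; it is not the real ompi_proc_t, and the helper names `peer_hostname_or_unknown` and `format_conn_error` are hypothetical. It also substitutes the string "unknown" for a missing hostname rather than printing "NULL", which is an illustrative choice and not what the current code does.

```c
#include <stdio.h>

/* Hypothetical stand-in for a proc entry; NOT the real ompi_proc_t. */
struct peer {
    const char *hostname;   /* NULL when the hostname was not exchanged */
};

/* Return a printable hostname, falling back to "unknown" when the
 * peer (or its hostname) is missing. */
static const char *peer_hostname_or_unknown(const struct peer *p)
{
    return (p != NULL && p->hostname != NULL) ? p->hostname : "unknown";
}

/* Format an error message that names both the local and the remote host.
 * Returns the snprintf() result (number of characters that would have
 * been written, excluding the terminating NUL). */
static int format_conn_error(char *buf, size_t len,
                             const char *local_host, const struct peer *p)
{
    return snprintf(buf, len, "%s: connection to peer on host %s failed",
                    local_host, peer_hostname_or_unknown(p));
}
```

With a peer whose hostname is "node17", a call from "node01" would yield "node01: connection to peer on host node17 failed"; with a NULL hostname the remote side is reported as "unknown".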
> 
> I'm good with it.  It could also be (might already be) a run-time MCA 
> param...?

I think it is. I'll check tonight.

> 
> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N 
> procs, send the hostname around; otherwise, don't send it (we can argue over 
> the value of N, e.g., 1024 or 2048).

That makes the most sense to me - for small jobs, the time difference is too 
tiny to measure.
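
The tri-state default proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the enum names, the `should_send_hostname` helper, and the reading of 0 as "never send" and 1 as "always send" are hypothetical and not taken from the Open MPI code base, and the cutoff of 1024 is just one of the two values floated in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tri-state setting (assumed meanings, not an actual
 * Open MPI parameter):
 *    0 -> never send hostnames
 *    1 -> always send hostnames
 *   -1 -> automatic: send only when the job is small enough */
enum {
    HOSTNAME_AUTO   = -1,
    HOSTNAME_NEVER  = 0,
    HOSTNAME_ALWAYS = 1
};

/* Placeholder for N; the thread leaves the choice (1024 vs. 2048) open. */
#define HOSTNAME_AUTO_CUTOFF 1024

/* Decide whether hostnames should be exchanged for a job of nprocs. */
static bool should_send_hostname(int setting, size_t nprocs)
{
    if (setting == HOSTNAME_AUTO) {
        /* np <= N: the extra data is cheap, so send it. */
        return nprocs <= HOSTNAME_AUTO_CUTOFF;
    }
    return setting != HOSTNAME_NEVER;
}
```

Under these assumptions, a 1024-proc job with the default of -1 would still exchange hostnames, a 1025-proc job would not, and an explicit 0 or 1 overrides the job size entirely.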

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
