Okay, please see r29052 - I believe this will address everyone's concerns. 
Please give it a test so we can verify it is clean - it worked for me, but I 
can't test all environments.


On Aug 20, 2013, at 7:35 AM, Ralph Castain <r...@open-mpi.org> wrote:

> The error messages already output the name of the other proc, so that should 
> be available. Besides, I just spent all yesterday afternoon auditing our MPI 
> layer's memory usage byte-by-byte and getting my ears burned about the need to 
> reduce that footprint - not really thrilled about adding to it.
> 
> I think the key here is to only do this reduction when directed to do so. It 
> only benefits really big scale, which is the exception and not the rule. And 
> if someone in that scenario wants the error output, they can just ask for it 
> (assuming their sys admin defaulted it to not include the hostname).
> 
> 
> On Aug 20, 2013, at 3:18 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> If we don't want to lose the usefulness of the error messages (and don't 
>> care that much about the memory requirements), we can initialize this value 
>> with the string of the rank of the process in MPI_COMM_WORLD (instead of 
>> NULL). We will at least get an idea where to start looking in case of 
>> troubles …
>> 
>> George.
>> 
>> On Aug 20, 2013, at 04:20 , Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> 
>>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> 
>>> wrote:
>>> 
>>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>>> That's how it works now. My concern is with the error message scenario. 
>>>>> IIRC, Jeff's issue was that the error message only contains the hostname 
>>>>> of the proc that generates it - it doesn't tell you the hostname of the 
>>>>> remote proc. Hence, we included that info in the proc_t.
>>>> 
>>>> This is quite important for getting useful error messages.
>>>> 
>>>>> However, IIRC we also provided an option to *not* send that info due to 
>>>>> scaling concerns way back when. I wonder if we can resolve this simply by 
>>>>> having Nathan set that option in his platform .conf files, and then 
>>>>> removing ompi_proc_get_hostname completely. Since the IP-based comm 
>>>>> channels will call modex_recv anyway, we'll get the hostname at that 
>>>>> time. Otherwise, the errors print "NULL" for proc->hostname.
>>>>> 
>>>>> Yes, that means that users of direct-launched apps on Nathan's systems 
>>>>> will get less informative error messages - but they can always override 
>>>>> Nathan's default param if they want better info. After all, the vast 
>>>>> majority of users aren't running such big jobs as to care about this 
>>>>> optimization.
>>>> 
>>>> I'm good with it.  It could also be (might already be) a run-time MCA 
>>>> param...?
>>> 
>>> I think it is - I'll check tonight
>>> 
>>>> 
>>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N 
>>>> procs, send the hostname around, otherwise, don't send it (we can argue 
>>>> over the value of N -- e.g., 1024 or 2048).
>>> 
>>> That makes the most sense to me - for small jobs, the time difference is 
>>> too tiny to measure.
>>> 
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
> 
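
For reference, the three-way default floated in the quoted thread (-1 meaning "send the hostname only when np <= N") could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in r29052: the function names, the macro, and the choice that 1 means "always send" and 0 means "never send" are all hypothetical, and the cutoff of 1024 is just one of the two values mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cutoff "N" from the thread; 1024 and 2048 were both floated. */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/*
 * Decide whether to send the hostname around.
 * Assumed meaning of mode (not confirmed by the thread):
 *    1 -> always send the hostname
 *    0 -> never send the hostname
 *   -1 -> send it only when the job has at most `cutoff` procs
 */
bool should_send_hostname(int mode, int np, int cutoff)
{
    assert(np > 0);
    if (mode == 1) {
        return true;
    }
    if (mode == 0) {
        return false;
    }
    return np <= cutoff;   /* mode == -1: decide from job size */
}

/* When the hostname was never exchanged, error output falls back to "NULL". */
const char *peer_hostname_for_msg(const char *hostname)
{
    return hostname != NULL ? hostname : "NULL";
}
```

With the assumed defaults, a 512-proc job would exchange hostnames and a 4096-proc job would not, unless the user overrides the setting to ask for the more informative error output.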
