Okay, please see r29052 - I believe this will address everyone's concerns. Please give it a test so we can verify it is clean - it worked for me, but I can't test all environments.
On Aug 20, 2013, at 7:35 AM, Ralph Castain <[email protected]> wrote:

> The error messages already output the name of the other proc, so that should
> be available. Besides, I just spent all yesterday afternoon auditing our MPI
> layers memory usage byte-by-byte and getting my ears burned about the need to
> reduce that footprint - not really thrilled about adding to it.
>
> I think the key here is to only do this reduction when directed to do so. It
> only benefits really big scale, which is the exception and not the rule. And
> if someone in that scenario wants the error output, they can just ask for it
> (assuming their sys admin defaulted it to not include the hostname).
>
>
> On Aug 20, 2013, at 3:18 AM, George Bosilca <[email protected]> wrote:
>
>> If we don't want to lose the usefulness of the error messages (and don't
>> care that much about the memory requirements), we can initialize this value
>> with the string of the rank of the process in MPI_COMM_WORLD (instead of
>> NULL). We will at least get an idea where to start looking in case of
>> troubles …
>>
>> George.
>>
>> On Aug 20, 2013, at 04:20 , Ralph Castain <[email protected]> wrote:
>>
>>>
>>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <[email protected]>
>>> wrote:
>>>
>>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain <[email protected]> wrote:
>>>>
>>>>> That's how it works now. My concern is with the error message scenario.
>>>>> IIRC, Jeff's issue was that the error message only contains the hostname
>>>>> of the proc that generates it - it doesn't tell you the hostname of the
>>>>> remote proc. Hence, we included that info in the proc_t.
>>>>
>>>> This is quite important for getting useful error messages.
>>>>
>>>>> However, IIRC we also provided an option to *not* send that info due to
>>>>> scaling concerns way back when. I wonder if we can resolve this simply by
>>>>> having Nathan set that option in his platform .conf files, and then
>>>>> removing ompi_proc_get_hostname completely. Since the IP-based comm
>>>>> channels will call modex_recv anyway, we'll get the hostname at that
>>>>> time. Otherwise, the errors print "NULL" for proc->hostname.
>>>>>
>>>>> Yes, that means that users of direct-launched apps on Nathan's systems
>>>>> will get less informative error messages - but they can always override
>>>>> Nathan's default param if they want better info. After all, the vast
>>>>> majority of users aren't running such big jobs as to care about this
>>>>> optimization.
>>>>
>>>> I'm good with it. It could also be (might already be) a run-time MCA
>>>> param...?
>>>
>>> I think it is - I'll check tonight
>>>
>>>>
>>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N
>>>> procs, send the hostname around, otherwise, don't send it (we can argue
>>>> over the value of N -- e.g., 1024 or 2048).
>>>
>>> That makes the most sense to me - for small jobs, the time difference is
>>> too tiny to measure.
>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> [email protected]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> [email protected]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
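
For anyone skimming the thread, the -1 / 0 / 1 default discussed above, together with George's idea of falling back to the MPI_COMM_WORLD rank when no hostname is known, could look roughly like the sketch below. This is only an illustration under stated assumptions: the names send_hostname, peer_label, HOSTNAME_AUTO and HOSTNAME_CUTOFF_DEFAULT are hypothetical, the "rank N" label format is made up, and none of this is taken from r29052 or from the actual Open MPI sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical tri-state setting mirroring the -1 / 0 / 1 values in the thread. */
enum { HOSTNAME_AUTO = -1, HOSTNAME_NEVER = 0, HOSTNAME_ALWAYS = 1 };

/* Hypothetical cutoff N; the thread floats 1024 or 2048 without settling on one. */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/* Decide whether hostnames should be sent around for a job of np procs. */
static bool send_hostname(int setting, int np, int cutoff)
{
    assert(np >= 0 && cutoff >= 0);
    if (setting == HOSTNAME_AUTO) {
        /* Small jobs: the extra data is too cheap to matter, so send it. */
        return np <= cutoff;
    }
    /* Explicit 0 or 1 overrides the size-based default. */
    return setting != HOSTNAME_NEVER;
}

/* Pick a label for a peer in an error message: the hostname if we have one,
 * otherwise a string built from its rank (illustrative "rank N" format). */
static const char *peer_label(const char *hostname, int world_rank,
                              char *buf, size_t len)
{
    if (hostname != NULL) {
        return hostname;
    }
    snprintf(buf, len, "rank %d", world_rank);
    return buf;
}
```

With this shape, a site default of 0 on very large systems can still be overridden by a user who passes 1, and the error path never has to print "NULL" for a peer.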
