Okay, please see r29052 - I believe this will address everyone's concerns. Please give it a test so we can verify it is clean - it worked for me, but I can't test all environments.

On Aug 20, 2013, at 7:35 AM, Ralph Castain <r...@open-mpi.org> wrote:

> The error messages already output the name of the other proc, so that should
> be available. Besides, I just spent all yesterday afternoon auditing our MPI
> layers memory usage byte-by-byte and getting my ears burned about the need to
> reduce that footprint - not really thrilled about adding to it.
>
> I think the key here is to only do this reduction when directed to do so. It
> only benefits really big scale, which is the exception and not the rule. And
> if someone in that scenario wants the error output, they can just ask for it
> (assuming their sys admin defaulted it to not include the hostname).
>
>
> On Aug 20, 2013, at 3:18 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> If we don't want to lose the usefulness of the error messages (and don't
>> care that much about the memory requirements), we can initialize this value
>> with the string of the rank of the process in MPI_COMM_WORLD (instead of
>> NULL). We will at least get an idea where to start looking in case of
>> troubles …
>>
>> George.
>>
>> On Aug 20, 2013, at 04:20 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>>>
>>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>>> wrote:
>>>
>>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> That's how it works now. My concern is with the error message scenario.
>>>>> IIRC, Jeff's issue was that the error message only contains the hostname
>>>>> of the proc that generates it - it doesn't tell you the hostname of the
>>>>> remote proc. Hence, we included that info in the proc_t.
>>>>
>>>> This is quite important for getting useful error messages.
>>>>
>>>>> However, IIRC we also provided an option to *not* send that info due to
>>>>> scaling concerns way back when. I wonder if we can resolve this simply by
>>>>> having Nathan set that option in his platform .conf files, and then
>>>>> removing ompi_proc_get_hostname completely. Since the IP-based comm
>>>>> channels will call modex_recv anyway, we'll get the hostname at that
>>>>> time. Otherwise, the errors print "NULL" for proc->hostname.
>>>>>
>>>>> Yes, that means that users of direct-launched apps on Nathan's systems
>>>>> will get less informative error messages - but they can always override
>>>>> Nathan's default param if they want better info. After all, the vast
>>>>> majority of users aren't running such big jobs as to care about this
>>>>> optimization.
>>>>
>>>> I'm good with it. It could also be (might already be) a run-time MCA
>>>> param...?
>>>
>>> I think it is - I'll check tonight
>>>
>>>>
>>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N
>>>> procs, send the hostname around, otherwise, don't send it (we can argue
>>>> over the value of N -- e.g., 1024 or 2048).
>>>
>>> That makes the most sense to me - for small jobs, the time difference is
>>> too tiny to measure.
>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
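
For anyone following the thread, here is a rough sketch of the tri-state default discussed above (-1 vs. 0 or 1) combined with George's idea of falling back to the MPI_COMM_WORLD rank when no hostname is available. It is only an illustration of the logic, not what r29052 or the Open MPI code actually does: the names `should_send_hostname`, `peer_label`, and `HOSTNAME_CUTOFF_DEFAULT` are hypothetical, and the cutoff of 1024 is just one of the two values floated for N.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cutoff N; the thread floats 1024 or 2048 as candidates. */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/*
 * Hypothetical tri-state parameter, following the proposal in the thread:
 *    0 -> never send the hostname
 *    1 -> always send the hostname
 *   -1 -> send it only when the job has np <= cutoff procs
 */
static bool should_send_hostname(int param, int np, int cutoff)
{
    if (param == 0) {
        return false;
    }
    if (param == 1) {
        return true;
    }
    /* param == -1 (or anything else): decide by job size */
    return np <= cutoff;
}

/*
 * Label to print for a remote peer in an error message: the hostname if
 * we have one, otherwise a string built from its rank instead of "NULL".
 * buf is caller-provided scratch space used only for the fallback.
 */
static const char *peer_label(const char *hostname, int rank,
                              char *buf, size_t len)
{
    if (hostname != NULL) {
        return hostname;
    }
    snprintf(buf, len, "rank %d", rank);
    return buf;
}
```

With the default of -1, a 64-proc job would still exchange hostnames, while a 100000-proc job would skip the exchange and its error messages would name the peer by rank unless the user overrides the param.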