To be honest, I don't care so much today; I'm just fighting so that the output doesn't get worse. At some point, we do need to figure out a better way of dealing with error messages, but not today :).

Brian

On 11/2/11 11:53 AM, "Ralph Castain" <r...@open-mpi.org> wrote:

>Hmmm....since it was my bug that surfaced the problem, maybe the best
>answer is to just return an error code. I'll slowly work thru the param
>registrations in ORTE and make them all check the return code. I'm
>willing to look at OPAL as I go, but someone else will have to deal with
>the OMPI layer.
>
>I don't know how to entirely avoid the message issue Brian mentions -
>I'll still have to say -something- when I get an error code, but I have
>come up with some methods for reducing the clutter.
>
>On Nov 2, 2011, at 11:43 AM, Barrett, Brian W wrote:
>
>> I really don't like our show_help at every level behavior (look at what
>> happens when MPI_INIT fails, you get a page per process of the same
>> error message from each level of the call stack). If you want to
>> show_help and abort on debug, that makes sense. It doesn't make any
>> sense on a production build. Return an error code and let the upper
>> layer deal with it.
>>
>> Brian
>>
>> On 11/2/11 11:27 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>
>>> Brian: you were the one that had an allergic reaction to #1 on the
>>> call.
>>>
>>> Thoughts?
>>>
>>>
>>> On Nov 2, 2011, at 1:23 PM, George Bosilca wrote:
>>>
>>>> As it has been said, this is not something supposed to make it in a
>>>> release. On the unfortunate case where it does, always having a
>>>> show_help will ensure a quick complaint on one of our mailing lists
>>>> and increase the probability of a [very] quick fix.
>>>>
>>>> george.
>>>>
>>>> On Nov 2, 2011, at 06:26 , TERRY DONTJE wrote:
>>>>
>>>>>
>>>>> On 11/1/2011 7:48 PM, Jeff Squyres wrote:
>>>>>> So this was slightly different than the opinion that was discussed
>>>>>> on the call today, which was 2. The rationale for #2 was to punish
>>>>>> developers, but if such a bug did make it through to production,
>>>>>> users wouldn't be annoyed with show_help messages all the time.
>>>>>>
>>>>>> Does anyone have strong opinions here? I don't.
>>>>>>
>>>>>> I offer the following two points:
>>>>>>
>>>>>> - this is a coding error on the OMPI developer
>>>>>> - it's pretty rare
>>>>>>
>>>>> I think a show_help + return is very helpful in this case. I
>>>>> wouldn't think that we'd run into this case that much and it would
>>>>> seem that it would be a rare occurrence that one could just fix when
>>>>> they run into it. However, since there was some opposition to having
>>>>> show_help messages possibly coming up all over the place I thought a
>>>>> fall back of only doing the show_help on enable_debug builds was a
>>>>> reasonable middle ground.
>>>>>
>>>>> --td
>>>>>> On Nov 1, 2011, at 7:30 PM, George Bosilca wrote:
>>>>>>
>>>>>>> 1
>>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>> On Nov 1, 2011, at 17:23 , Jeff Squyres wrote:
>>>>>>>
>>>>>>>> Can you clarify -- I can parse your text multiple ways. Which are
>>>>>>>> you voting for?
>>>>>>>>
>>>>>>>> 1. show_help + return error code in all cases.
>>>>>>>> 2. if OPAL_ENABLE_DEBUG, show_help + exit(1), else silently return
>>>>>>>>    error code.
>>>>>>>> 3. show_help. if OPAL_ENABLE_DEBUG, exit(1), else return error
>>>>>>>>    code.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 1, 2011, at 4:50 PM, George Bosilca wrote:
>>>>>>>>
>>>>>>>>> This is a much saner solution. We [mostly] stayed away from
>>>>>>>>> calling exit deep into our libraries, there is no reason to add
>>>>>>>>> it now. I'll vote in favor of show_help + return code.
>>>>>>>>>
>>>>>>>>> george.
>>>>>>>>>
>>>>>>>>> On Nov 1, 2011, at 15:14 , Jeff Squyres wrote:
>>>>>>>>>
>>>>>>>>>> We talked about this on the call today.
>>>>>>>>>>
>>>>>>>>>> A good suggestion was made: call show_help/opal_finalize/exit
>>>>>>>>>> only when OPAL_ENABLE_DEBUG is true. Otherwise, return an error
>>>>>>>>>> code.
>>>>>>>>>>
>>>>>>>>>> If no one objects to this, I'll commit this tomorrow.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 31, 2011, at 4:16 PM, Jeff Squyres wrote:
>>>>>>>>>>
>>>>>>>>>>> WHAT: what to do if registering an MCA param results in an
>>>>>>>>>>> error?
>>>>>>>>>>>
>>>>>>>>>>> WHERE: opal/mca/base/mca_base_param.c
>>>>>>>>>>>
>>>>>>>>>>> WHY: MCA param re-registration issues should be treated as OMPI
>>>>>>>>>>> developer errors
>>>>>>>>>>>
>>>>>>>>>>> WHEN: COB Friday, 4 Nov 2011
>>>>>>>>>>>
>>>>>>>>>>> -----------------
>>>>>>>>>>>
>>>>>>>>>>> Short version:
>>>>>>>>>>>
>>>>>>>>>>> Re-registering an MCA param to be a different type (e.g., it
>>>>>>>>>>> was initially registered to be a string, but was later
>>>>>>>>>>> re-registered to be an int) should be treated as an OMPI
>>>>>>>>>>> developer error, and should opal_finalize()/exit(1).
>>>>>>>>>>>
>>>>>>>>>>> More details:
>>>>>>>>>>>
>>>>>>>>>>> A mistaken MCA param re-registration recently caused an orted
>>>>>>>>>>> segv.
>>>>>>>>>>>
>>>>>>>>>>> The MCA param subsystem was fixed to avoid this segv, but
>>>>>>>>>>> silently convert the MCA param to the newly-registered type.
>>>>>>>>>>> Upon reflection and some discussion, this seems to be a bad
>>>>>>>>>>> idea. Instead, we should loudly complain via a show_help
>>>>>>>>>>> message and then exit(1).
>>>>>>>>>>>
>>>>>>>>>>> Specifically: this kind of behavior is clearly an error and
>>>>>>>>>>> should be fixed. Unfortunately, in most cases, we don't
>>>>>>>>>>> actually check the return value from MCA param registration
>>>>>>>>>>> functions, so if we change the MCA param function to simply
>>>>>>>>>>> return a non OPAL_SUCCESS status, it's unlikely that anyone
>>>>>>>>>>> will notice until some code tries to read the param value,
>>>>>>>>>>> likely still resulting in a segv.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have heartburn if I change the error behavior to
>>>>>>>>>>> opal_finalize()/exit(1)?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>> --
>>>>> Terry D. Dontje | Principal Software Engineer
>>>>> Developer Tools Engineering | +1.781.442.2631
>>>>> Oracle - Performance Technologies
>>>>> 95 Network Drive, Burlington, MA 01803
>>>>> Email terry.don...@oracle.com
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories
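
For anyone reading this thread later, the three options enumerated in the quoted messages can be sketched in C roughly as follows. This is only an illustration under stated assumptions: every `demo_*` and `DEMO_*` name is hypothetical and stands in for the real MCA param registration code, `DEMO_ENABLE_DEBUG` plays the role of `OPAL_ENABLE_DEBUG`, and treating a same-type re-registration as harmless is my assumption, not something the thread states.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* All demo_* / DEMO_* names are hypothetical stand-ins, not the OPAL API. */
#ifndef DEMO_ENABLE_DEBUG            /* plays the role of OPAL_ENABLE_DEBUG */
#define DEMO_ENABLE_DEBUG 0
#endif

enum { DEMO_SUCCESS = 0, DEMO_ERR_TYPE_MISMATCH = -1, DEMO_ERR_FULL = -2 };

typedef enum { DEMO_TYPE_INT, DEMO_TYPE_STRING } demo_type_t;

/* The three options from the thread. */
typedef enum {
    DEMO_OPTION_1,  /* show_help + return error code in all cases */
    DEMO_OPTION_2,  /* debug: show_help + exit(1); else silently return */
    DEMO_OPTION_3   /* show_help; debug: exit(1); else return error code */
} demo_option_t;

#define DEMO_MAX_PARAMS 16
static struct { const char *name; demo_type_t type; } demo_params[DEMO_MAX_PARAMS];
static int demo_num_params = 0;
int demo_help_count = 0;             /* number of help messages printed */

/* Stand-in for a show_help call: print one message and count it. */
static void demo_show_help(const char *name)
{
    fprintf(stderr, "param \"%s\" was re-registered with a different type\n",
            name);
    ++demo_help_count;
}

/* Apply the chosen option to a type-mismatch error. */
static int demo_type_mismatch(demo_option_t option, const char *name)
{
    int debug = DEMO_ENABLE_DEBUG;

    /* Option 2 stays silent on production builds; 1 and 3 always complain. */
    if (DEMO_OPTION_2 != option || debug) {
        demo_show_help(name);
    }
    /* Options 2 and 3 abort, but only on debug builds. */
    if (DEMO_OPTION_1 != option && debug) {
        exit(1);  /* the proposal in the thread also finalizes first */
    }
    return DEMO_ERR_TYPE_MISMATCH;
}

/* Register a param; re-registering with a different type is an error. */
int demo_param_register(const char *name, demo_type_t type,
                        demo_option_t option)
{
    assert(NULL != name);
    for (int i = 0; i < demo_num_params; ++i) {
        if (0 == strcmp(demo_params[i].name, name)) {
            /* Assumption: same-type re-registration is harmless. */
            return (demo_params[i].type == type)
                       ? DEMO_SUCCESS
                       : demo_type_mismatch(option, name);
        }
    }
    if (demo_num_params >= DEMO_MAX_PARAMS) {
        return DEMO_ERR_FULL;
    }
    demo_params[demo_num_params].name = name;  /* caller keeps string alive */
    demo_params[demo_num_params].type = type;
    ++demo_num_params;
    return DEMO_SUCCESS;
}
```

Whichever option is picked, the sketch only helps if callers compare the result of `demo_param_register()` against `DEMO_SUCCESS` and propagate the failure upward, which is the return-code checking Ralph describes adding in ORTE. On a non-debug build, option 2 returns the error without printing anything, so the upper layer decides what (if anything) to say.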