George -

I wrote the error code gorp; I'm pretty sure I know exactly how it was
supposed to work.

There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
That's not the intent of the error code logic at all.  If you want to
change that, I'm not necessarily opposed to it, but that's something that
should be discussed in an RFC.  What the current code does is not
consistent with the original intent.

I don't agree that you shouldn't propagate error codes through OMPI; in
fact, the original intent of the design was to allow such propagation.
Again, such a change should be discussed as part of an RFC.

Brian

On 10/19/11 4:50 PM, "George Bosilca" <bosi...@eecs.utk.edu> wrote:

>I don't know how you think that the error codes work in Open MPI, so I'll
>take the liberty to depict it here so we all agree we're talking about
>the same thing.
>
>The opal_strerror is a nice feature: it allows registering a range of
>error codes with a particular error converter. Every time you look up
>the meaning of a particular error code, the first converter whose range
>encloses that value translates it into an error string.
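
A minimal sketch of the range-based lookup described above. Every name and range here (demo_register, demo_strerror, the two converters, the 100-code ranges) is hypothetical and chosen for illustration; this is not the actual opal_strerror implementation. It assumes negative error codes, with each layer owning the open interval (max, base).

```c
#include <stddef.h>

/* Hypothetical names and ranges, for illustration only. Codes are
 * negative, so a layer owns the open interval (max, base), max < base. */
typedef const char *(*demo_converter_fn)(int code);

struct demo_range {
    int base;               /* upper bound of the range (exclusive) */
    int max;                /* lower bound of the range (exclusive) */
    demo_converter_fn fn;   /* converter for codes inside the range */
};

#define DEMO_MAX_CONVERTERS 8
static struct demo_range demo_converters[DEMO_MAX_CONVERTERS];
static size_t demo_num_converters = 0;

/* Register a converter for codes in (max, base). Returns 0 on success,
 * -1 if the arguments are invalid or the table is full. */
int demo_register(int base, int max, demo_converter_fn fn)
{
    if (NULL == fn || max >= base ||
        DEMO_MAX_CONVERTERS == demo_num_converters) {
        return -1;
    }
    demo_converters[demo_num_converters].base = base;
    demo_converters[demo_num_converters].max = max;
    demo_converters[demo_num_converters].fn = fn;
    demo_num_converters++;
    return 0;
}

/* The first registered converter whose range encloses the code wins. */
const char *demo_strerror(int code)
{
    size_t i;
    for (i = 0; i < demo_num_converters; i++) {
        if (code < demo_converters[i].base && code > demo_converters[i].max) {
            return demo_converters[i].fn(code);
        }
    }
    return "unrecognized error";
}

/* Two toy converters standing in for two layers. */
const char *demo_lower_str(int code)
{
    return (-1 == code) ? "lower layer: generic error"
                        : "lower layer: other error";
}

const char *demo_upper_str(int code)
{
    return (-101 == code) ? "upper layer: bad request"
                          : "upper layer: other error";
}
```

Registering demo_lower_str for (-100, 0) and demo_upper_str for (-200, -100) lets demo_strerror(-101) reach the upper-layer converter. If two registered ranges overlapped, the earlier registration would shadow the later one, which is why overlapping codes matter for this kind of lookup.
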
>
>This is only currently used by OPAL and ORTE directly. It worked at the
>OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>ones. This behavior didn't change after my patch, you can still use
>opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>
>There is a small "variation" for OMPI_ERR_REQUEST, the only truly
>OMPI-specific error code today. The OMPI error codes are actually inserted
>between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>there is __no__ possible overlap right now. If at some point we add more
>than 100 OMPI-level codes, we should certainly revisit this.
>
>Now, resulting from my patch, there is a difference. One should not
>simply forward an ORTE code into the stack of OMPI, and hope it just
>works. Errors should be dealt with where they happen, and if that is not
>possible they should be translated into an error code that has a meaning
>at the layer above (here, the OMPI level), so that error propagation
>stays compartmentalized. The current patch should not prevent code that
>mixes error codes from working, as opal_strerror retains the same
>behavior as before. However, this coding practice should be avoided. I
>tried to clean the current code of such instances a few days ago in
>r25230.
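
The translate-at-the-boundary practice described above could be sketched as follows. All constants and functions (LOWER_*, UPPER_*, lower_send, upper_send, lower_to_upper) are hypothetical stand-ins for an ORTE-like and an OMPI-like layer; none of them are real Open MPI symbols or values.

```c
/* Hypothetical codes for a lower (ORTE-like) layer. */
enum {
    LOWER_SUCCESS     = 0,
    LOWER_ERR_TIMEOUT = -1001,
    LOWER_ERR_UNREACH = -1002
};

/* Hypothetical codes for an upper (OMPI-like) layer. */
enum {
    UPPER_SUCCESS     = 0,
    UPPER_ERROR       = -1,   /* generic fallback */
    UPPER_ERR_TIMEOUT = -2,
    UPPER_ERR_UNREACH = -3
};

/* Map a lower-layer code to an upper-layer code. Codes without a
 * specific equivalent collapse to the generic UPPER_ERROR. */
static int lower_to_upper(int rc)
{
    switch (rc) {
    case LOWER_SUCCESS:     return UPPER_SUCCESS;
    case LOWER_ERR_TIMEOUT: return UPPER_ERR_TIMEOUT;
    case LOWER_ERR_UNREACH: return UPPER_ERR_UNREACH;
    default:                return UPPER_ERROR;
    }
}

/* Stand-in for a lower-layer call; it returns whatever it is told to,
 * so the example can simulate failures. */
static int lower_send(int simulated_rc)
{
    return simulated_rc;
}

/* Upper-layer function: checks the lower-layer result right after the
 * call and returns only upper-layer codes to its own callers. */
int upper_send(int simulated_rc)
{
    int rc = lower_send(simulated_rc);
    if (LOWER_SUCCESS != rc) {
        return lower_to_upper(rc);
    }
    return UPPER_SUCCESS;
}
```

Here upper_send(LOWER_ERR_TIMEOUT) yields UPPER_ERR_TIMEOUT, while an unmapped lower-layer code yields UPPER_ERROR. How much detail survives the boundary depends on how many specific upper-layer equivalents the mapping defines.
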
>
>Moreover, this is similar to how we deal with the error codes between
>OMPI and MPI layers, and seems like a sane way to compose libraries. You
>deal with a specific layer error code when you get it (usually after the
>call to a function from that specific layer), not later on when you don't
>even know exactly what the execution path was.
>
>  george.
>
>PS: I'll fix the +/- issue.
>
>On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>
>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>> codes. That seems like a very bad idea (in addition to the mixing of +
>> and -).
>> 
>> For one thing, that breaks opal_strerror().  That, in itself, seems
>> like a dealbreaker.
>> 
>> 
>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>> 
>>> I actually think it's worse than that.  An ORTE error code can now have
>>> the same value as an OMPI error code.  OMPI_ERR_REQUEST and
>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>> Or, they would, if George hadn't made a mistake (see below).  The
>>> sharing of return codes seems... bad.
>>> 
>>> Also, there's a bug in George's patch.  Error codes are all negative, so
>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>> OMPI_ERR_BASE - 1, not plus 2.
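
To illustrate why the sign matters when codes are negative, here is a small sketch. The DEMO_* values are invented for the example (a 100-code range per layer is assumed, with the upper layer starting where the lower one ends); they are not the real OPAL or OMPI constants.

```c
/* Hypothetical layout: the lower layer owns (-100, 0), the upper layer
 * owns (-200, -100). All values are invented for illustration. */
#define DEMO_LOWER_ERR_BASE  0
#define DEMO_LOWER_ERR_MAX   (DEMO_LOWER_ERR_BASE - 100)
#define DEMO_UPPER_ERR_BASE  DEMO_LOWER_ERR_MAX
#define DEMO_UPPER_ERR_MAX   (DEMO_UPPER_ERR_BASE - 100)

/* Correct: subtracting moves further into the upper layer's range. */
#define DEMO_UPPER_ERR_REQUEST      (DEMO_UPPER_ERR_BASE - 1)

/* Mistake: adding moves back toward zero, into the lower layer's range. */
#define DEMO_UPPER_ERR_REQUEST_BAD  (DEMO_UPPER_ERR_BASE + 1)

/* A code belongs to a layer if it lies strictly between max and base. */
int demo_in_range(int code, int base, int max)
{
    return code < base && code > max;
}
```

With these values, DEMO_UPPER_ERR_REQUEST is -101 and stays inside the upper layer's range, whereas DEMO_UPPER_ERR_REQUEST_BAD is -99 and collides with the lower layer's range.
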
>>> 
>>> Brian
>>> 
>>> On 10/19/11 1:32 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>> 
>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>> the right answer. So please consider this a general design question
>>>> for the community.
>>>> 
>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e.,
>>>> we used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>> constant. I understand the thinking (or at least, what I suspect was
>>>> the thought), but it creates an issue.
>>>> 
>>>> Suppose I have an ompi-level function (A) that calls another
>>>> ompi-level function (B). Invisible to A is that B calls an orte-level
>>>> function. B dutifully checks the error return from the orte-level
>>>> function against an ORTE-prefixed constant.
>>>> 
>>>> However, if that return isn't "success", what does B return up to A?
>>>> It cannot return the OMPI equivalent to the orte error constant
>>>> because it no longer exists. It could return the orte error code, but
>>>> A has no way of knowing it is going to get a non-OMPI constant, and
>>>> therefore won't be able to understand it - it will be an
>>>> "unrecognized error".
>>>> 
>>>> I guess one option is to require that B "translate" the return code
>>>> and pass some OMPI error up the chain, but this prevents anything
>>>> upwards from understanding the nature of the problem and potentially
>>>> taking corrective and/or alternative action. Seems awfully limiting,
>>>> as most of the time the only option will be the vanilla "OMPI_ERROR".
>>>> 
>>>> Thoughts?
>>> -- 
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>
>
>
>


-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories





