Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

George Bosilca Wed, 19 Oct 2011 18:48:34 -0400

There are several OPAL level error codes not used in the current code.

OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED
OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED
OPAL_ERR_TOPO_CORE_NOT_SUPPORTED
OPAL_ERR_NOT_ENOUGH_SOCKETS
OPAL_ERR_NOT_ENOUGH_CORES
OPAL_ERR_INVALID_PHYS_CPU
OPAL_ERR_MULTIPLE_AFFINITIES


If somebody feels like filing an RFC to remove them, please feel free to go
ahead.

  george.

On Oct 19, 2011, at 18:41 , George Bosilca wrote:

> A careful reading of the committed patch would have shown that none of
> the concerns raised so far were true: the "old-way" behavior of the OMPI code
> was preserved. Moreover, every single one of the removed error codes had
> not been used in ages.
> 
> What Brian pointed out as evil, evil being a subjective notion by itself,
> didn't prevent the correct behavior of the code, nor did it affect its
> correctness in any way. Anyway, to address his concern I pushed a patch
> (25333) putting the OMPI error codes back where they were originally.
> 
> In other words, we spent a very unproductive day arguing over unfounded
> claims and "thought-to-be" behaviors.
> 
>  george.
> 
> 
> On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:
> 
>> George -
>> 
>> I wrote the error code gorp; I'm pretty sure I know exactly how it was
>> supposed to work.
>> 
>> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
>> OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
>> That's not the intent of the error code logic at all.  If you want to
>> change that, I'm not necessarily opposed to it, but that's something that
>> should be discussed in an RFC.  What the current code does is not
>> consistent with the original intent.
>> 
>> I don't agree that you shouldn't propagate error codes through OMPI; in
>> fact, the original intent of the design was to allow such propagation.
>> Again, such a change should be discussed as part of an RFC.
>> 
>> Brian
>> 
>> On 10/19/11 4:50 PM, "George Bosilca" <[email protected]> wrote:
>> 
>>> I don't know how you think that the error codes work in Open MPI, so I'll
>>> take the liberty to depict it here so we all agree we're talking about
>>> the same thing.
>>> 
>>> opal_strerror is a nice feature: it allows registering a range of
>>> error codes with a particular error converter. Every time you look up
>>> the meaning of a particular error code, the first converter whose range
>>> encloses that value translates it into an error string.
>>> 
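The range-based lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual opal_strerror implementation: `register_converter`, `lookup_error`, the fixed-size table, and the two example layers with their codes and messages are all hypothetical names invented here, and the range convention (negative codes, exclusive bounds) is an assumption.

```c
#include <stddef.h>

/* Hypothetical converter type: maps a code to a string, or NULL if unknown. */
typedef const char *(*err_converter_fn)(int errnum);

/* One registered range. Codes are assumed negative, so err_base is the
 * bound closest to zero and err_max the most negative bound; both are
 * exclusive in this sketch. */
struct converter_entry {
    int err_base;
    int err_max;
    err_converter_fn converter;
};

#define MAX_CONVERTERS 8
static struct converter_entry converters[MAX_CONVERTERS];
static size_t num_converters = 0;

/* Register a converter for the open range (err_max, err_base).
 * Returns 0 on success, -1 if the table is full or the arguments are bad. */
static int register_converter(int err_base, int err_max, err_converter_fn fn)
{
    if (NULL == fn || err_max >= err_base || num_converters == MAX_CONVERTERS) {
        return -1;
    }
    converters[num_converters].err_base = err_base;
    converters[num_converters].err_max = err_max;
    converters[num_converters].converter = fn;
    ++num_converters;
    return 0;
}

/* The first registered converter whose range encloses errnum wins. */
static const char *lookup_error(int errnum)
{
    for (size_t i = 0; i < num_converters; ++i) {
        if (errnum < converters[i].err_base && errnum > converters[i].err_max) {
            const char *msg = converters[i].converter(errnum);
            return (NULL != msg) ? msg : "Unknown error";
        }
    }
    return "Unknown error";
}

/* Two made-up layers, each owning a block of 100 codes. */
static const char *low_layer_strerror(int errnum)
{
    switch (errnum) {
    case -1: return "low layer: generic error";
    case -2: return "low layer: out of resource";
    default: return NULL;
    }
}

static const char *mid_layer_strerror(int errnum)
{
    return (-101 == errnum) ? "mid layer: bad request" : NULL;
}
```

With the low layer registered for (−100, 0) and the middle layer for (−200, −100), a single `lookup_error` call resolves codes from either layer, which is why non-overlapping ranges matter for this scheme.
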
>>> This is only currently used by OPAL and ORTE directly. It worked at the
>>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>>> ones. This behavior didn't change after my patch, you can still use
>>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>>> 
>>> There is a small "variation" for OMPI_ERR_REQUEST, the only really OMPI
>>> specific error code today. The OMPI error codes are actually inserted
>>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>>> there is __no__ possible overlap right now. If at some point we add more
>>> than 100 OMPI-level error codes, we should certainly revisit this.
>>> 
>>> Now, resulting from my patch, there is a difference. One should not
>>> simply forward an ORTE code into the stack of OMPI, and hope it just
>>> works. Errors should be dealt with where they happen, and if that is not
>>> possible they should be translated into the error code of the actual
>>> layer. The error propagation should be compartmentalized, and has to be
>>> translated into an error code that has a meaning at the OMPI level. The
>>> current patch should not prevent code that mixes error codes from
>>> working, as opal_strerror retains the same behavior as before. However,
>>> this coding practice should be avoided. I tried to clean the current code
>>> of such instances a few days ago in r25230.
>>> 
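The translate-at-the-boundary practice argued for above might look like the sketch below. Everything in it is hypothetical: the `LOWER_*` and `UPPER_*` constants, their values, and the functions `lower_send`, `upper_from_lower`, and `upper_send` are stand-ins for a lower and an upper layer, not real ORTE or OMPI symbols.

```c
/* Hypothetical lower-layer codes (stand-ins, not real ORTE constants). */
enum {
    LOWER_SUCCESS             = 0,
    LOWER_ERR_OUT_OF_RESOURCE = -2,
    LOWER_ERR_TIMEOUT         = -15
};

/* Hypothetical upper-layer codes (stand-ins, not real OMPI constants). */
enum {
    UPPER_SUCCESS             = 0,
    UPPER_ERROR               = -1,
    UPPER_ERR_OUT_OF_RESOURCE = -2,
    UPPER_ERR_TIMEOUT         = -3
};

/* Translate a lower-layer code into the upper layer's vocabulary.
 * Codes without a specific equivalent collapse to the generic error. */
static int upper_from_lower(int lower_rc)
{
    switch (lower_rc) {
    case LOWER_SUCCESS:             return UPPER_SUCCESS;
    case LOWER_ERR_OUT_OF_RESOURCE: return UPPER_ERR_OUT_OF_RESOURCE;
    case LOWER_ERR_TIMEOUT:         return UPPER_ERR_TIMEOUT;
    default:                        return UPPER_ERROR;
    }
}

/* Stand-in for a lower-layer call; it simply returns the code it is given
 * so that the example can simulate failures. */
static int lower_send(int simulated_rc)
{
    return simulated_rc;
}

/* Upper-layer function: checks the lower-layer result right after the
 * call and never lets a lower-layer code escape to its own callers. */
static int upper_send(int simulated_rc)
{
    int rc = lower_send(simulated_rc);
    if (LOWER_SUCCESS != rc) {
        return upper_from_lower(rc);
    }
    return UPPER_SUCCESS;
}
```

The `default` branch is where the concern raised earlier in the thread shows up: any lower-layer code without an explicit mapping reaches the caller only as the generic error, so how much detail survives depends on how complete the mapping is.
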
>>> Moreover, this is similar to how we deal with the error codes between
>>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>>> deal with a specific layer error code when you get it (usually after the
>>> call to a function from that specific layer), not later on when you don't
>>> even know exactly what the execution path was.
>>> 
>>> george.
>>> 
>>> PS: I'll fix the +/- issue.
>>> 
>>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>>> 
>>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>>> and -).
>>>> 
>>>> For one thing, that breaks opal_strerror().  That, in itself, seems
>>>> like a dealbreaker.
>>>> 
>>>> 
>>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>> 
>>>>> I actually think it's worse than that.  An ORTE error code can now have
>>>>> the same value as an OMPI error code.  OMPI_ERR_REQUEST and
>>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>>> Or, they should, if George hadn't made a mistake (see below).  The
>>>>> sharing of return codes seems... bad.
>>>>> 
>>>>> Also, there's a bug in George's patch.  Error codes are all negative, so
>>>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE - 1 and OMPI_ERR_MAX should be
>>>>> OMPI_ERR_BASE - 1, not plus 2.
>>>>> 
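To illustrate why the sign matters when codes are negative, here is a small sketch under assumed values. The `LOW_*` and `MID_*` macros, the block size of 100, and `code_in_range` are hypothetical and do not reproduce the constants in the patch under discussion.

```c
/* Hypothetical layout: two layers, 100 codes each, all negative. */
#define LOW_ERR_BASE   0
#define LOW_ERR_MAX    (LOW_ERR_BASE - 100)
#define MID_ERR_BASE   LOW_ERR_MAX
#define MID_ERR_MAX    (MID_ERR_BASE - 100)

/* Correct: a layer's codes are obtained by subtracting from its base. */
#define MID_ERR_REQUEST     (MID_ERR_BASE - 1)
/* Wrong sign (illustration only): adding lands in the low layer's block. */
#define MID_ERR_WRONG_SIGN  (MID_ERR_BASE + 2)

/* Returns 1 if code lies strictly between max and base, else 0. */
static int code_in_range(int code, int base, int max)
{
    return (code < base && code > max) ? 1 : 0;
}
```

Under these assumptions `MID_ERR_REQUEST` falls inside the middle layer's own block, while `MID_ERR_WRONG_SIGN` falls inside the low layer's block and would be claimed by that layer's converter in a range-based lookup.
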
>>>>> Brian
>>>>> 
>>>>> On 10/19/11 1:32 PM, "Ralph Castain" <[email protected]> wrote:
>>>>> 
>>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>>> the right answer. So please consider this a general design question for
>>>>>> the community.
>>>>>> 
>>>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>>>> thought), but it creates an issue.
>>>>>> 
>>>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>>>> dutifully checks the error return from the orte-level function against an
>>>>>> ORTE-prefixed constant.
>>>>>> 
>>>>>> However, if that return isn't "success", what does B return up to A? It
>>>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>>>> no longer exists. It could return the orte error code, but A has no way
>>>>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>>>>> able to understand it - it will be an "unrecognized error".
>>>>>> 
>>>>>> I guess one option is to require that B "translate" the return code and
>>>>>> pass some OMPI error up the chain, but this prevents anything upwards
>>>>>> from understanding the nature of the problem and potentially taking
>>>>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>>>>> the time the only option will be the vanilla "OMPI_ERROR".
>>>>>> 
>>>>>> Thoughts?
>>>>> -- 
>>>>> Brian W. Barrett
>>>>> Dept. 1423: Scalable System Software
>>>>> Sandia National Laboratories
>>>>> 
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> [email protected]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> [email protected]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories