[tickets] [opensaf:tickets] #1607 Handle AIS error codes properly

Anders Widell Fri, 20 Nov 2015 00:53:44 -0800


---

** [tickets:#1607] Handle AIS error codes properly**

**Status:** assigned
**Milestone:** 5.0.FC
**Created:** Fri Nov 20, 2015 08:53 AM UTC by Anders Widell
**Last Updated:** Fri Nov 20, 2015 08:53 AM UTC
**Owner:** Anders Widell


A wide range of AIS error codes is defined in saAis.h, and an API user is 
supposed to handle each of them appropriately. Currently, however, the OpenSAF 
services themselves do not handle these error codes properly internally. This 
ticket proposes a general improvement / cleanup of the code where we are (or in 
most cases: are *not*) handling AIS error codes in the OpenSAF services. The 
proposal is also to add common library helper functions for the AIS error 
handling mechanism, to minimize code duplication.

Examples of error codes and how to handle them:

* SA_AIS_ERR_TRY_AGAIN: Retry the function
* SA_AIS_ERR_NO_RESOURCES: Similar to SA_AIS_ERR_TRY_AGAIN
* SA_AIS_ERR_TIMEOUT: Retry if the function is idempotent. If the function 
isn't idempotent, we have to decide case by case whether it should be retried 
or not.
* SA_AIS_ERR_BAD_HANDLE: Initialize a new handle (and possibly also do other 
things like setting OI implementer in case of an OI handle). Retry with the new 
handle. In the case of an IMM CCB handle, an incomplete IMM transaction may 
have to be "replayed".
* SA_AIS_ERR_FAILED_OPERATION: When applying an IMM transaction, this code is 
returned when the transaction was aborted. It can be returned both in the case 
of a validation error and in the case of a resource error. To distinguish 
between the two causes, use the new functionality introduced in ticket [#744]. 
If it was a resource abort, retry by replaying the whole transaction.
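
The list above could be captured in a common helper roughly along the following lines. This is only a sketch: the `ais_action` enum and `ais_classify()` are hypothetical names that do not exist in OpenSAF, and the small `SaAisErrorT` enum is a stand-in so that the example compiles on its own (real code would include saAis.h instead). Distinguishing a validation abort from a resource abort is left to the caller, since it depends on the functionality from ticket [#744].

```c
#include <stdbool.h>

/* Stand-in for the subset of SaAisErrorT used here (assumption: values as
 * in saAis.h). Real code would include saAis.h instead of this enum. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TIMEOUT = 5,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NO_RESOURCES = 18,
    SA_AIS_ERR_FAILED_OPERATION = 21
} SaAisErrorT;

/* Hypothetical classification of what the caller should do next. */
typedef enum {
    AIS_ACTION_DONE,              /* call succeeded */
    AIS_ACTION_RETRY,             /* call the same function again */
    AIS_ACTION_CASE_BY_CASE,      /* timeout on a non-idempotent call */
    AIS_ACTION_REINIT_HANDLE,     /* get a new handle, then retry */
    AIS_ACTION_CHECK_ABORT_CAUSE, /* validation abort or resource abort? */
    AIS_ACTION_FAIL               /* not covered by the list in this ticket */
} ais_action;

/* Map an AIS return code to the handling described in the ticket.
 * 'idempotent' tells whether the failed call can safely be repeated. */
ais_action ais_classify(SaAisErrorT rc, bool idempotent)
{
    switch (rc) {
    case SA_AIS_OK:
        return AIS_ACTION_DONE;
    case SA_AIS_ERR_TRY_AGAIN:
    case SA_AIS_ERR_NO_RESOURCES:
        return AIS_ACTION_RETRY;
    case SA_AIS_ERR_TIMEOUT:
        return idempotent ? AIS_ACTION_RETRY : AIS_ACTION_CASE_BY_CASE;
    case SA_AIS_ERR_BAD_HANDLE:
        return AIS_ACTION_REINIT_HANDLE;
    case SA_AIS_ERR_FAILED_OPERATION:
        return AIS_ACTION_CHECK_ABORT_CAUSE;
    default:
        return AIS_ACTION_FAIL;
    }
}
```

Keeping the classification separate from the retry loop means each service only has to supply the service-specific parts, such as re-creating a handle or replaying a CCB.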

# For how long should we keep retrying?

It is very difficult to set a maximum time limit for how long we need to keep 
retrying before we give up, as can be seen for example in ticket [#1582]. It is 
also in many cases difficult to decide what to do when we give up. Sometimes, 
we can just skip the action and continue anyway. An example of this case would 
be logging; logging a message is normally not vital to the function of the 
system. In those cases, we should only retry for a short while (or not at all), 
and then give up the operation and continue in the same way as if it had been 
successful. However, in many cases the operation cannot be skipped. Restarting 
the calling process is unlikely to help, since the AIS call is failing because 
some *other* OpenSAF service (possibly on another node) is unresponsive. 
Therefore, the proposal is that in these cases where the operation is vital, we 
should keep retrying forever and let higher-level monitoring (NID or AMF 
healthcheck) detect and recover hanging processes. For debugging purposes, we 
should however log a message to syslog to indicate where we are stuck in a 
retry loop. This logging should be done by the common helper functions.
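
A minimal sketch of such a helper is shown below, to illustrate the two policies (vital: retry forever; non-vital: give up after a few attempts) and the syslog message. All names here (`retry_policy`, `ais_retry()`, `demo_flaky_op()`) are hypothetical and not existing OpenSAF functions, the delay and logging interval are arbitrary illustration values, and the `SaAisErrorT` enum is a stand-in for saAis.h. Only SA_AIS_ERR_TRY_AGAIN and SA_AIS_ERR_NO_RESOURCES are retried here; handling of bad handles and aborted transactions is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <syslog.h>
#include <time.h>

/* Stand-in for the subset of SaAisErrorT used here (assumption: values as
 * in saAis.h). Real code would include saAis.h instead of this enum. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_NO_RESOURCES = 18
} SaAisErrorT;

/* The wrapped AIS call; 'ctx' carries whatever arguments it needs. */
typedef SaAisErrorT (*ais_op_fn)(void *ctx);

typedef struct {
    bool vital;            /* true: keep retrying forever */
    unsigned max_attempts; /* used only when vital is false */
    unsigned delay_ms;     /* pause between attempts */
    unsigned log_every;    /* log every N failed attempts, 0 = never */
} retry_policy;

static void sleep_ms(unsigned ms)
{
    struct timespec ts = { (time_t)(ms / 1000), (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Call op() until it returns something other than TRY_AGAIN/NO_RESOURCES.
 * For non-vital operations, give up after max_attempts and return the last
 * error, so the caller can continue as if the call had succeeded. */
SaAisErrorT ais_retry(const char *what, ais_op_fn op, void *ctx,
                      const retry_policy *p)
{
    assert(op != NULL && p != NULL);
    unsigned attempts = 0;
    for (;;) {
        SaAisErrorT rc = op(ctx);
        ++attempts;
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_NO_RESOURCES)
            return rc;
        if (!p->vital && attempts >= p->max_attempts)
            return rc;
        if (p->log_every != 0 && attempts % p->log_every == 0)
            syslog(LOG_WARNING, "%s: still retrying after %u attempts (rc=%d)",
                   what, attempts, (int)rc);
        sleep_ms(p->delay_ms);
    }
}

/* Stand-in for a real AIS call: fails *ctx times with TRY_AGAIN, then
 * succeeds. */
SaAisErrorT demo_flaky_op(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return SA_AIS_ERR_TRY_AGAIN;
    }
    return SA_AIS_OK;
}
```

A vital caller would pass a policy such as `{ true, 0, 100, 50 }`, while a best-effort caller such as logging could use `{ false, 3, 100, 0 }` and ignore the returned error. Because the vital case never returns on a persistent failure, it relies on NID or the AMF healthcheck to recover the process, as proposed above.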



