I have updated the functional/design spec for the BE error and
observability project based on all of the feedback received so far.
I believe it covers all of the issues and concerns raised.

Please provide your comments by Thursday 8/27.

Thanks,
-evan

Problem statement:
Currently in libbe, when an error occurs during a call into the
library, only error codes are returned. While these error codes
provide some information on why an operation failed, they do not
provide enough context to tell the user what actually caused the
problem or what they may need to do to solve it.

To get more context the user can turn on extra error and debug
output through the use of the BE_PRINT_ERR environment variable.
However, this does not always provide enough information either and
can confuse the user. It is also a problem because it requires the
library to print these messages directly and requires the user to
rerun the failing command to retrieve the needed error or debug
output.

Scope:
- This project will provide the ability to return an nvlist of
  information describing a failure and its context from calls into libbe.
- It is expected that this will not replace the use of be_print_err
  throughout the library at this time.
    - As we move forward and are able to provide all of the needed
      error information for all error conditions, all instances of
      be_print_err will be removed. However, its removal is not
      planned for this release.
- We will not provide the overall library for handling errors and
  logging as described in the Caiman Unified Design (CUD) documents.
    - However, the design here keeps the code generic enough that a
      later move to something that fits with CUD will be easier.

Requirements:
- Calls into the library need to include enough information to
  determine the cause of a failure and possible solutions to that
  failure.
    - The error information should include:
        - The operation being performed (the entry point into the
          library such as activating a BE).
        - What was being performed when the error occurred (for
          example running installgrub).
        - What the failure was (for example the error string
          returned from installgrub or a zfs_promote call).
        - What steps can be taken to correct the problem or, if this
          is not available, a link to more information on possible
          issues to check.
- As stated above the design will be generic enough that moving
  into something that will fit with CUD will be easier. This will
  be done by keeping the calls in a separate file and header file
  that can at a later time be removed. Also the functionality itself
  will be kept generic enough that it can be moved easily outside
  of libbe.
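
The four pieces of error information listed above could be carried
together and rendered for a user roughly as follows. This is only a
sketch: the err_report structure, its field names and
format_err_report() are hypothetical and not part of libbe, and the
sample strings in the comments are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical container for the four required pieces of information. */
struct err_report {
	const char *operation;	/* entry point, e.g. "activate BE" */
	const char *failed_at;	/* step in progress, e.g. "installgrub" */
	const char *failure;	/* error string returned by that step */
	const char *fixit;	/* corrective steps, or NULL if unknown */
	const char *more_info;	/* where to read more, used if fixit is NULL */
};

/* Avoid passing NULL to a %s conversion. */
static const char *
or_none(const char *s)
{
	return (s != NULL ? s : "(none)");
}

/*
 * Render the report into buf. Returns what snprintf() returns (the
 * length the full message needs), or -1 on bad arguments.
 */
int
format_err_report(const struct err_report *r, char *buf, size_t len)
{
	if (r == NULL || buf == NULL || len == 0)
		return (-1);

	return (snprintf(buf, len,
	    "Operation: %s\nFailed at: %s\nFailure: %s\n%s: %s\n",
	    or_none(r->operation), or_none(r->failed_at),
	    or_none(r->failure),
	    r->fixit != NULL ? "To fix" : "More info",
	    r->fixit != NULL ? r->fixit : or_none(r->more_info)));
}
```

When no corrective step is known, the last line falls back to the
pointer to further information, matching the last requirement above.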

Requirements on other projects:
- This project will require changes to any consumer of libbe so that
  they can make use of this new error information.

Errors Corrected Internally:
- For errors we can fix internally, the library handle will hold a linked
  list that relays to the consumer any informational data about the
  corrective action taken. This linked list will be made up of the same
  data structures shown below and will use the same interfaces to retrieve
  this information.
        For example, if we find that the grub menu is missing, we attempt to
        create a new menu.lst file. When this is done, the corrective action
        is added to the linked list of fixed error data. For these entries
        the error type will always be "no error" since the error was corrected.
- When logging is available this information can also be logged from
  within the library and separately from this linked list.
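
Recording a corrective action on the fixed-error list might look
something like the sketch below. It makes several assumptions: err_info
is reduced to the two fields the example needs, MAXLEN is given an
arbitrary value, the list's next pointer is taken to point at the next
list node, and fixed_err_append() and fixed_err_free() are hypothetical
helper names, not existing libbe functions.

```c
#include <stdio.h>
#include <stdlib.h>

#define	MAXLEN	256	/* assumed size; the spec does not give a value */

enum { EI_NO_ERR = 0, EI_BE_ERR = 5000, EI_BE_CLEANUP };

/* Simplified stand-in for err_info: only what this sketch needs. */
typedef struct err_info {
	char	ei_failed_str[MAXLEN];	/* here: the corrective action */
	int	ei_err_type;
} err_info_t;

typedef struct err_info_list {
	err_info_t		*el_err_info;
	struct err_info_list	*next;
} err_info_list_t;

/*
 * Hypothetical helper: append a corrective action to the tail of the
 * list. Returns 0 on success, -1 on bad arguments or allocation failure.
 */
int
fixed_err_append(err_info_list_t **head, const char *action)
{
	err_info_list_t *node, **tail;
	err_info_t *info;

	if (head == NULL || action == NULL)
		return (-1);
	if ((info = calloc(1, sizeof (*info))) == NULL)
		return (-1);
	if ((node = calloc(1, sizeof (*node))) == NULL) {
		free(info);
		return (-1);
	}

	/* The error was corrected, so the type is always "no error". */
	info->ei_err_type = EI_NO_ERR;
	(void) snprintf(info->ei_failed_str, MAXLEN, "%s", action);
	node->el_err_info = info;

	/* Walk to the end so entries stay in the order they occurred. */
	for (tail = head; *tail != NULL; tail = &(*tail)->next)
		;
	*tail = node;
	return (0);
}

/* Hypothetical helper: release every node and its err_info. */
void
fixed_err_free(err_info_list_t *head)
{
	while (head != NULL) {
		err_info_list_t *next = head->next;

		free(head->el_err_info);
		free(head);
		head = next;
	}
}
```

In the menu.lst example, the library would call something like
fixed_err_append(&list, "created missing menu.lst") after recreating
the file, where list stands for the handle's fixed-error list.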

Logging:
- The logging side of things is outside the scope of this project and will
  be done as part of the Caiman Unified Design project. That said, we can
  see the possibility of two types or levels of logging being needed. The
  first is logging that the consumer of the library will do, based on the
  information returned through the library's handle. The second is a
  debugging form of logging, which will be done inside the library.

Library Handle:
- The library interfaces will be changed to pass back a handle which will
  contain primarily the error information. This handle will be allocated by
  the library and returned to the consumer. When the consumer has retrieved
  the information they are interested in, the handle must be closed, which
  frees the memory and performs any other cleanup that may be needed.
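
The handle lifecycle described above (the library allocates, the
consumer reads, the consumer closes) could be sketched as follows. The
structures are simplified, be_example_op() is a hypothetical entry
point invented for illustration, and the body given for
be_close_handle() is only one possible implementation of the interface
named in this spec.

```c
#include <stdlib.h>

enum { EI_NO_ERR = 0, EI_BE_ERR = 5000, EI_BE_CLEANUP };

/* Simplified stand-ins for the structures defined in this spec. */
typedef struct err_info {
	int	ei_err_type;
} err_info_t;

typedef struct be_handle {
	err_info_t	*be_err_info;
	err_info_t	*be_cleanup_info;
} be_handle_t;

/*
 * Hypothetical entry point that always fails, to show the library
 * allocating the handle and handing it back through be_hdp.
 * Returns nonzero (standing in for a be_errno) on failure.
 */
int
be_example_op(be_handle_t **be_hdp)
{
	be_handle_t *hd;

	if (be_hdp == NULL)
		return (-1);
	*be_hdp = NULL;

	if ((hd = calloc(1, sizeof (*hd))) == NULL)
		return (-1);
	if ((hd->be_err_info = calloc(1, sizeof (err_info_t))) == NULL) {
		free(hd);
		return (-1);
	}
	hd->be_err_info->ei_err_type = EI_BE_ERR;

	*be_hdp = hd;
	return (1);
}

/* Frees the handle and everything attached to it. */
int
be_close_handle(be_handle_t *be_hd)
{
	if (be_hd == NULL)
		return (-1);
	free(be_hd->be_err_info);	/* free(NULL) is a no-op */
	free(be_hd->be_cleanup_info);
	free(be_hd);
	return (0);
}
```

A consumer would call the entry point, inspect the error information
through the accessors, and then call be_close_handle() exactly once.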

Structure definitions:

        internal to the library:
        struct err_info {
                union {
                        /* this is a be_errno */
                        int             ei_err_num;
                        /* enum of libbe operations */
                        int             ei_op_num;
                        /* enum of fixit strings or URLs */
                        int             ei_fixit_str_num;
                        /* enum of function calls */
                        int             ei_failed_at;
                        /* error string returned from failure */
                        char            ei_failed_str[MAXLEN];
                } ei_info;
                int     ei_err_type; /* The type of failure */
        };

        enum {
                EI_NO_ERR = 0,
                EI_BE_ERR = 5000, /* libbe errors */
                EI_BE_CLEANUP /* libbe cleanup errors */
        } err_type;


        Public definitions:
        typedef struct err_info err_info_t;

        typedef struct err_info_list {
                err_info_t              *el_err_info;
                struct err_info_list    *next;
        } err_info_list_t;

        typedef struct be_handle {
                /* information for the actual failure */
                err_info_t      *be_err_info;
                /* information on any needed cleanup */
                err_info_t      *be_cleanup_info;
                /* list of errors fixed internally */
                err_info_list_t *be_fixed_err_info;
                ....
        } be_handle_t;


Public Functions:
These functions are used to access the fields in the data structure as
the err_info structure itself will be encapsulated within the library.

/* retrieves error information */
int be_get_err_info(err_info_t *be_err_info, nvlist_t *be_err_nvl);

/* retrieves any cleanup information needed due to error */
int be_get_cleanup_info(err_info_t *be_cleanup_info, nvlist_t *be_cleanup_nvl);

/* closes the library handle and frees up the error and clean-up information. */
int be_close_handle(be_handle_t *be_hd);

The information from these nvlists is then pulled into specific dictionaries
for these types of errors within the libbe python module and then returned to
consumers of the module. The information can then be used as the consumer 
chooses.
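
To illustrate how an accessor might flatten the encapsulated error
data into name/value pairs, here is a rough sketch. It does not use
libnvpair so that it stays self-contained: kv_list_t is a stand-in for
nvlist_t, the err_info layout is simplified, and the pair names
("err_type", "err_num", "op_num", "failed_str") and
sketch_get_err_info() are all assumptions. In libbe itself the pairs
would presumably be added with libnvpair calls such as
nvlist_add_int32() and nvlist_add_string().

```c
#include <stdio.h>
#include <string.h>

#define	MAXLEN		256	/* assumed size */
#define	KV_MAX		8
#define	KV_NAMELEN	32

/* Stand-in for nvlist_t so the sketch builds without libnvpair. */
typedef struct kv_list {
	int	kv_count;
	char	kv_name[KV_MAX][KV_NAMELEN];
	char	kv_value[KV_MAX][MAXLEN];
} kv_list_t;

/* Simplified, flattened view of the error data. */
typedef struct err_info {
	int	ei_err_num;
	int	ei_op_num;
	char	ei_failed_str[MAXLEN];
	int	ei_err_type;
} err_info_t;

static int
kv_add(kv_list_t *kv, const char *name, const char *value)
{
	if (kv->kv_count >= KV_MAX)
		return (-1);
	(void) snprintf(kv->kv_name[kv->kv_count], KV_NAMELEN, "%s", name);
	(void) snprintf(kv->kv_value[kv->kv_count], MAXLEN, "%s", value);
	kv->kv_count++;
	return (0);
}

static int
kv_add_int(kv_list_t *kv, const char *name, int value)
{
	char num[16];

	(void) snprintf(num, sizeof (num), "%d", value);
	return (kv_add(kv, name, num));
}

/* Returns the value stored under name, or NULL if it is absent. */
const char *
kv_lookup(const kv_list_t *kv, const char *name)
{
	int i;

	for (i = 0; i < kv->kv_count; i++) {
		if (strcmp(kv->kv_name[i], name) == 0)
			return (kv->kv_value[i]);
	}
	return (NULL);
}

/* Hypothetical accessor: copy the error fields into name/value pairs. */
int
sketch_get_err_info(const err_info_t *ei, kv_list_t *out)
{
	if (ei == NULL || out == NULL)
		return (-1);
	out->kv_count = 0;

	if (kv_add_int(out, "err_type", ei->ei_err_type) != 0 ||
	    kv_add_int(out, "err_num", ei->ei_err_num) != 0 ||
	    kv_add_int(out, "op_num", ei->ei_op_num) != 0 ||
	    kv_add(out, "failed_str", ei->ei_failed_str) != 0)
		return (-1);
	return (0);
}
```

Each pair maps naturally onto one dictionary entry on the Python side,
which keeps the consumer independent of the C structure layout.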

Deliverables:
    - In addition to library changes to support the components mentioned
      above we will need to add the following:
        - Addition of code that will try to determine what the user may need
          to do to correct the problem.
        - Addition of a handle that will be passed back from library calls.
          This handle will contain the error, cleanup and corrected error
          information. Accessor functions will be used to retrieve the error
          information out of the error structures attached to this handle.
        - Additional documentation on the existence of this error
          information.
        - Addition of the HTML content that describes possible solutions
          to various errors. This will at first be minimal, with more
          information added as more errors are found where a solution can't
          be determined from the available information.
