Re: Better error reporting in Mesos

Marco Massenzio Fri, 06 Nov 2015 10:58:50 -0800

On a related note (and in full agreement with Michael and Alex) I have
always felt not terribly at ease with our current practice of *not* logging
at the point of error - relying instead on returning an Error() object
instead.


I have now witnessed a large number of instances, when debugging
user-reported failures/errors, when we had to "grep" the codebase for the
error message and find the actual call point that way, as the logs would
not provide a sufficiently clear indication as to *where* the error
actually occurred.

So, in addition to what Michael suggests, and in line with Alex's
exhortation to have "excellent core error reporting facilities," I would
also like to propose that we consider adopting a pattern I've seen
elsewhere:

if (errorOccurred) {
  string msg = "Dude, something went wrong!";
  LOG(ERROR) << msg;
  return Error(msg);
}

or a variation thereof.

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Fri, Nov 6, 2015 at 4:55 AM, Alexander Rojas <alexan...@mesosphere.io>
wrote:

> I also like the idea of having better error reports. On that side, I
> always liked the design of boost system_error where one has categories of
> errors with a set of error codes (integer) where each of them have a
> specific string associated and which can be complemented with extra text.
>
>
> > On 06 Nov 2015, at 09:29, Michael Hopcroft <m...@hopcroft.org> wrote:
> >
> > Error reporting in Mesos is based on Stout’s Error class and a variety of
> > other classes, such as ErrnoError and WindowsError, that derive from
> Error.
> > One goal of these classes is to simplify the construction and storage of
> > error message strings. As an example, the ErrnoError constructor
> > automatically incorporates std::string(strerror(errno)) into its error
> > message.
> >
> >
> > Stout catches exceptions and converts them into Try<T>  and Result<T>
> > objects which are used as polymorphic return values that can express
> either
> > an error condition or a proper return value. One consequence of catching
> > exceptions is that call stacks from throw sites and error sites are lost.
> > This makes it harder to debug production environment crashes post hoc.
> >
> >
> > One way to improve debugability would be to add more information about
> the
> > error site to the error message. This could be done by convention, but it
> > seems that the vast majority of the error messages in Mesos currently
> don’t
> > include much information about the site that originally generated an
> error.
> >
> >
> > Because much of the code in Mesos uses Error and ErrnoError, we have an
> > opportunity to add diagnostic information automatically with only a small
> > amount of churn. What we’d need to do is provide Error creation macros
> that
> > would pass the source file name and line number to the constructors of
> the
> > various Error classes.
> >
> >
> > Concretely this would involve
> >
> >
> >   - class Error
> >      - Rename declaration and definition to ErrorImpl (or ErrorBase).
> This
> >      change is not error prone.
> >      - Rename references to class’ type. These changes are mostly limited
> >      to result.hpp and try.hpp.
> >      - Don’t rename calls to constructor. These will now just invoke the
> >      macro.
> >      - Modify the constructor to take a line number and source filename
> in
> >      addition to a message.
> >      - Define a macro called Error(message) that has the same prototype
> as
> >      the original Error constructor. This macro would invoke
> >      ErrorImpl::ErrorImpl, passing the current line number and source
> filename
> >      along with the user-provided message. The macro would make use
> > of __FILE__
> >      and __LINE__, which are predefined in gcc and Visual Studio.
> >   - class ErrnoError, WindowsError.
> >      - Similar pattern as above.
> >
> > *Pros.*
> >
> >
> >   - Codebase is more debuggable in production environment because the
> vast
> >   majority of error sites will document the file name and line number in
> the
> >   error message.
> >   - Most error sites don’t require a code change.
> >
> > *Cons.*
> >
> >
> >   - Introduces a macro. In general, we’d like to move away from macros
> >   wherever possible.
> >   - Discourages subclassing of Error because each subclass would need its
> >   own special macro to call its constructor. A brief search of the code
> base
> >   seems to indicate that Mesos programmers don’t tend to declare
> subclasses
> >   of Error. Right now, it seems that subclasses exist mainly to provide
> >   different construction semantics. The big question is whether one would
> >   want to use subclasses down the road to distinguish between equivalence
> >   classes of errors, say those that should bring the process down vs
> those
> >   that are recoverable.
> >   - Some amount of code churn, mainly in error.hpp, result.hpp, and
> >   try.hpp.
> >
> > *Limitations.*
> >
> >
> >   - This approach by itself won’t address error sites that invoke
> >   convenience functions such as Try<T>::error() and Result<T>::error().
> These
> >   could be addressed with a similar macro, but it would need to somehow
> deal
> >   with the template parameter as well.
> >
> > *Risks and Potential Complications.*
> >
> >
> >   - The rename process might fail to correctly distinguish between error
> >   sites (i.e. calls to Error::Error()) and other uses of the term
> “Error”.
> >   - There may be important compilers that don’t support __FILE__ and
> >   __LINE__. These predefined macros have been in the ANSI/ISO standard
> for a
> >   while, but it would still be wise to check each compiler we care about.
> >   - General hazards associated with macro expansion.
> >   - Users of Mesos may have written diagnostic tools that depend on the
> >   current values of the error strings.
> >   - Unit tests may depend on exact values of error strings. If this were
> >   the case, we'd have to provide a convenience function that strips off
> the
> >   file names and line numbers so that unit tests don't break when error
> sites
> >   move around due to edits elsewhere in the file.
> >
> > Let me know what you think of this idea. If the approach seems reasonable
> > and a worthwhile, I would be willing to prototype it to see if it has
> some
> > fatal flaw. Thanks for your input!
> >
> >
> >
> > -Mike
>
>

Re: Better error reporting in Mesos

Reply via email to