Re: Better error reporting in Mesos

Michael Hopcroft Tue, 10 Nov 2015 12:28:59 -0800

There have been a lot of great suggestions on this thread. My
recommendation is that I do an experiment to uncover any hidden issues with
the macro approach outlined in my original email. I will try to design the
API of the ErrorImpl class so that it would be, not only possible, but easy
to add support for error chaining and message formatting, as has been
suggested on this thread. If I don’t encounter any significant problems
with the basic experiment, I will come back to this group with a patch and
we can then discuss whether to push the design further forward. I’m a big
fan of an error infrastructure because it gives us a centralized location
in the code to add debugging hooks and apply various policies such as
logging and formatting.


On Fri, Nov 6, 2015 at 1:56 PM, Alex Clemmer <[email protected]>
wrote:

> In general, I think a good answer to my question is for the place that
> is reporting or unboxing the error to aggregate data into an error
> struct that holds things like the "error stack" (and other semantics
> about the log, including perhaps error codes), and then let whoever is
> logging, reporting or printing the message, handle the pretty-printing
> semantics. This is a departure from our current log-the-error-message
> approach, where the error message acts as the sole communicator of
> error semantics and context, including the details about the line
> number.
>
> In fact, I think it's worth wondering whether the place where we
> create an error should even log the error message -- right now we add
> a lot of strings together, and it makes more sense to me to have a
> methodology for just populating an "error struct" with data about the
> error, and then the `toString` method (or whatever) can simply
> generate the error messages as necessary, in one holistic place. That
> way, we have centralized place that determining how an error gets
> transformed into a string, instead of a lot of ad hoc string
> concatenations and error unboxing. Another advantage is that you'd
> have very fine-grained control over how to respond to errors, as it is
> a typesafe struct. In fact, you could even imagine having a system
> where error stringifying is handled completely out-of-band, meaning
> that logging could become much, much cheaper in general.
>
> But anyway, more broadly, what I'm trying to say is that the place
> where we are returning an error should not know or care about how the
> downstream sites pretty-print it or log it. Those are separate
> concerns.
>
> On Fri, Nov 6, 2015 at 1:17 PM, Benjamin Mahler
> <[email protected]> wrote:
> > Thanks for the proposal Michael. This has come up in discussions before
> and
> > I think it's really valuable, so it's nice to see folks thinking about
> > this. Some things to consider:
> >
> > Don't forget 'Failure' which is unfortunately not unified with 'Error'.
> >
> > How does this affect error message composition (as Alex alluded to)?
> > Currently we do manual error message composition ("<Caller message>:
> > <callee message>: <callee's callee message>: ...") but in the future we
> may
> > add composition support to our synchronous types like Try,Result,etc to
> be
> > similar to Future composition with .then. See:
> > https://issues.apache.org/jira/browse/MESOS-785
> >
> > Given composition, it sounds like we want to decouple the trace
> information
> > from the error message, so that we can store a "stack-trace" of multiple
> > code locations?
> >
> > It would be good to be aware of the limitations around Future composition
> > and failure messages. When a .then chain encounters a failure, no error
> > message wrapping occurs as the failure bubbles up. The result is that we
> > instead get "<bottom-most callee message>" as the failure message without
> > the caller wrapping I mentioned above. Tracing will help alleviate only
> > having the bottom most callee failure, but it would be great to think
> about
> > how to produce better messages as well (given MESOS-785 lands us in the
> > same boat for Try,Result,etc).
> >
> > Typed errors/failures has come up before as well, but might be orthogonal
> > enough to discuss separately.
> >
> > On Fri, Nov 6, 2015 at 1:16 AM, Alex Clemmer <
> [email protected]>
> > wrote:
> >
> >> I'm strongly in favor of moving down this path, or a path like it.
> >>
> >> As Mesos becomes increasingly important as core distributed systems
> >> infrastructure, it will become correspondingly helpful and important
> >> to have things like excellent core error reporting facilities, even if
> >> they come at the cost of some technical risk in the short term. In
> >> this case in particular, I believe this error reporting model, where
> >> the developer need only fill in semantic data about the error (e.g.,
> >> the error message), and the boilerplate context of the error (e.g.,
> >> the line number of the error) is filled in automatically and opaquely
> >> by the system, is clearly the "right" model. On the merits of simply
> >> adding line numbers itself, I think this is worth prototyping, and if
> >> anything, I'd ask for the model to go further -- I've worked on scale
> >> systems where errors are all "stringly typed" and the call sites are
> >> all tribal knowledge, and I've worked on systems where the errors are
> >> highly semantically regimented, and it is extremely clear to me that
> >> the latter is clearly, unequivocally better.
> >>
> >> It's true there are some subtleties to think through. We'll need to
> >> think about things like ABI compatibility, and we'll need to think
> >> about whether there are situations where we _don't_ want line numbers
> >> to be reported (e.g., we often "unbox" the error message and repackage
> >> it, and it's worth wondering whether these semantics hurt or help
> >> those error handling practices). It will indeed require a steady hand
> >> to make sure the needle is "threaded correctly" and we don't stomp on
> >> something important by accident.
> >>
> >> But I think it's worth taking the time to give this a shot anyway. I'm
> >> happy to do the code reviews myself, if that helps. :)
> >>
> >> On Fri, Nov 6, 2015 at 12:29 AM, Michael Hopcroft <[email protected]>
> >> wrote:
> >> > Error reporting in Mesos is based on Stout’s Error class and a
> variety of
> >> > other classes, such as ErrnoError and WindowsError, that derive from
> >> Error.
> >> > One goal of these classes is to simplify the construction and storage
> of
> >> > error message strings. As an example, the ErrnoError constructor
> >> > automatically incorporates std::string(strerror(errno)) into its error
> >> > message.
> >> >
> >> >
> >> > Stout catches exceptions and converts them into Try<T>  and Result<T>
> >> > objects which are used as polymorphic return values that can express
> >> either
> >> > an error condition or a proper return value. One consequence of
> catching
> >> > exceptions is that call stacks from throw sites and error sites are
> lost.
> >> > This makes it harder to debug production environment crashes post hoc.
> >> >
> >> >
> >> > One way to improve debugability would be to add more information about
> >> the
> >> > error site to the error message. This could be done by convention,
> but it
> >> > seems that the vast majority of the error messages in Mesos currently
> >> don’t
> >> > include much information about the site that originally generated an
> >> error.
> >> >
> >> >
> >> > Because much of the code in Mesos uses Error and ErrnoError, we have
> an
> >> > opportunity to add diagnostic information automatically with only a
> small
> >> > amount of churn. What we’d need to do is provide Error creation macros
> >> that
> >> > would pass the source file name and line number to the constructors of
> >> the
> >> > various Error classes.
> >> >
> >> >
> >> > Concretely this would involve
> >> >
> >> >
> >> >    - class Error
> >> >       - Rename declaration and definition to ErrorImpl (or ErrorBase).
> >> This
> >> >       change is not error prone.
> >> >       - Rename references to class’ type. These changes are mostly
> >> limited
> >> >       to result.hpp and try.hpp.
> >> >       - Don’t rename calls to constructor. These will now just invoke
> the
> >> >       macro.
> >> >       - Modify the constructor to take a line number and source
> filename
> >> in
> >> >       addition to a message.
> >> >       - Define a macro called Error(message) that has the same
> prototype
> >> as
> >> >       the original Error constructor. This macro would invoke
> >> >       ErrorImpl::ErrorImpl, passing the current line number and source
> >> filename
> >> >       along with the user-provided message. The macro would make use
> >> > of __FILE__
> >> >       and __LINE__, which are predefined in gcc and Visual Studio.
> >> >    - class ErrnoError, WindowsError.
> >> >       - Similar pattern as above.
> >> >
> >> > *Pros.*
> >> >
> >> >
> >> >    - Codebase is more debuggable in production environment because the
> >> vast
> >> >    majority of error sites will document the file name and line number
> >> in the
> >> >    error message.
> >> >    - Most error sites don’t require a code change.
> >> >
> >> > *Cons.*
> >> >
> >> >
> >> >    - Introduces a macro. In general, we’d like to move away from
> macros
> >> >    wherever possible.
> >> >    - Discourages subclassing of Error because each subclass would need
> >> its
> >> >    own special macro to call its constructor. A brief search of the
> code
> >> base
> >> >    seems to indicate that Mesos programmers don’t tend to declare
> >> subclasses
> >> >    of Error. Right now, it seems that subclasses exist mainly to
> provide
> >> >    different construction semantics. The big question is whether one
> >> would
> >> >    want to use subclasses down the road to distinguish between
> >> equivalence
> >> >    classes of errors, say those that should bring the process down vs
> >> those
> >> >    that are recoverable.
> >> >    - Some amount of code churn, mainly in error.hpp, result.hpp, and
> >> >    try.hpp.
> >> >
> >> > *Limitations.*
> >> >
> >> >
> >> >    - This approach by itself won’t address error sites that invoke
> >> >    convenience functions such as Try<T>::error() and
> Result<T>::error().
> >> These
> >> >    could be addressed with a similar macro, but it would need to
> somehow
> >> deal
> >> >    with the template parameter as well.
> >> >
> >> > *Risks and Potential Complications.*
> >> >
> >> >
> >> >    - The rename process might fail to correctly distinguish between
> error
> >> >    sites (i.e. calls to Error::Error()) and other uses of the term
> >> “Error”.
> >> >    - There may be important compilers that don’t support __FILE__ and
> >> >    __LINE__. These predefined macros have been in the ANSI/ISO
> standard
> >> for a
> >> >    while, but it would still be wise to check each compiler we care
> >> about.
> >> >    - General hazards associated with macro expansion.
> >> >    - Users of Mesos may have written diagnostic tools that depend on
> the
> >> >    current values of the error strings.
> >> >    - Unit tests may depend on exact values of error strings. If this
> were
> >> >    the case, we'd have to provide a convenience function that strips
> off
> >> the
> >> >    file names and line numbers so that unit tests don't break when
> error
> >> sites
> >> >    move around due to edits elsewhere in the file.
> >> >
> >> > Let me know what you think of this idea. If the approach seems
> reasonable
> >> > and a worthwhile, I would be willing to prototype it to see if it has
> >> some
> >> > fatal flaw. Thanks for your input!
> >> >
> >> >
> >> >
> >> > -Mike
> >>
> >>
> >>
> >> --
> >> Alex
> >>
> >> Theory is the first term in the Taylor series of practice. -- Thomas M
> >> Cover (1992)
> >>
>
>
>
> --
> Alex
>
> Theory is the first term in the Taylor series of practice. -- Thomas M
> Cover (1992)
>

Re: Better error reporting in Mesos

Reply via email to