Re: Better error reporting in Mesos

Benjamin Mahler Fri, 06 Nov 2015 13:18:06 -0800

Thanks for the proposal Michael. This has come up in discussions before and
I think it's really valuable, so it's nice to see folks thinking about
this. Some things to consider:


Don't forget 'Failure' which is unfortunately not unified with 'Error'.

How does this affect error message composition (as Alex alluded to)?
Currently we do manual error message composition ("<Caller message>:
<callee message>: <callee's callee message>: ...") but in the future we may
add composition support to our synchronous types like Try,Result,etc to be
similar to Future composition with .then. See:
https://issues.apache.org/jira/browse/MESOS-785

Given composition, it sounds like we want to decouple the trace information
from the error message, so that we can store a "stack-trace" of multiple
code locations?

It would be good to be aware of the limitations around Future composition
and failure messages. When a .then chain encounters a failure, no error
message wrapping occurs as the failure bubbles up. The result is that we
instead get "<bottom-most callee message>" as the failure message without
the caller wrapping I mentioned above. Tracing will help alleviate only
having the bottom most callee failure, but it would be great to think about
how to produce better messages as well (given MESOS-785 lands us in the
same boat for Try,Result,etc).

Typed errors/failures has come up before as well, but might be orthogonal
enough to discuss separately.

On Fri, Nov 6, 2015 at 1:16 AM, Alex Clemmer <[email protected]>
wrote:

> I'm strongly in favor of moving down this path, or a path like it.
>
> As Mesos becomes increasingly important as core distributed systems
> infrastructure, it will become correspondingly helpful and important
> to have things like excellent core error reporting facilities, even if
> they come at the cost of some technical risk in the short term. In
> this case in particular, I believe this error reporting model, where
> the developer need only fill in semantic data about the error (e.g.,
> the error message), and the boilerplate context of the error (e.g.,
> the line number of the error) is filled in automatically and opaquely
> by the system, is clearly the "right" model. On the merits of simply
> adding line numbers itself, I think this is worth prototyping, and if
> anything, I'd ask for the model to go further -- I've worked on scale
> systems where errors are all "stringly typed" and the call sites are
> all tribal knowledge, and I've worked on systems where the errors are
> highly semantically regimented, and it is extremely clear to me that
> the latter is clearly, unequivocally better.
>
> It's true there are some subtleties to think through. We'll need to
> think about things like ABI compatibility, and we'll need to think
> about whether there are situations where we _don't_ want line numbers
> to be reported (e.g., we often "unbox" the error message and repackage
> it, and it's worth wondering whether these semantics hurt or help
> those error handling practices). It will indeed require a steady hand
> to make sure the needle is "threaded correctly" and we don't stomp on
> something important by accident.
>
> But I think it's worth taking the time to give this a shot anyway. I'm
> happy to do the code reviews myself, if that helps. :)
>
> On Fri, Nov 6, 2015 at 12:29 AM, Michael Hopcroft <[email protected]>
> wrote:
> > Error reporting in Mesos is based on Stout’s Error class and a variety of
> > other classes, such as ErrnoError and WindowsError, that derive from
> Error.
> > One goal of these classes is to simplify the construction and storage of
> > error message strings. As an example, the ErrnoError constructor
> > automatically incorporates std::string(strerror(errno)) into its error
> > message.
> >
> >
> > Stout catches exceptions and converts them into Try<T>  and Result<T>
> > objects which are used as polymorphic return values that can express
> either
> > an error condition or a proper return value. One consequence of catching
> > exceptions is that call stacks from throw sites and error sites are lost.
> > This makes it harder to debug production environment crashes post hoc.
> >
> >
> > One way to improve debugability would be to add more information about
> the
> > error site to the error message. This could be done by convention, but it
> > seems that the vast majority of the error messages in Mesos currently
> don’t
> > include much information about the site that originally generated an
> error.
> >
> >
> > Because much of the code in Mesos uses Error and ErrnoError, we have an
> > opportunity to add diagnostic information automatically with only a small
> > amount of churn. What we’d need to do is provide Error creation macros
> that
> > would pass the source file name and line number to the constructors of
> the
> > various Error classes.
> >
> >
> > Concretely this would involve
> >
> >
> >    - class Error
> >       - Rename declaration and definition to ErrorImpl (or ErrorBase).
> This
> >       change is not error prone.
> >       - Rename references to class’ type. These changes are mostly
> limited
> >       to result.hpp and try.hpp.
> >       - Don’t rename calls to constructor. These will now just invoke the
> >       macro.
> >       - Modify the constructor to take a line number and source filename
> in
> >       addition to a message.
> >       - Define a macro called Error(message) that has the same prototype
> as
> >       the original Error constructor. This macro would invoke
> >       ErrorImpl::ErrorImpl, passing the current line number and source
> filename
> >       along with the user-provided message. The macro would make use
> > of __FILE__
> >       and __LINE__, which are predefined in gcc and Visual Studio.
> >    - class ErrnoError, WindowsError.
> >       - Similar pattern as above.
> >
> > *Pros.*
> >
> >
> >    - Codebase is more debuggable in production environment because the
> vast
> >    majority of error sites will document the file name and line number
> in the
> >    error message.
> >    - Most error sites don’t require a code change.
> >
> > *Cons.*
> >
> >
> >    - Introduces a macro. In general, we’d like to move away from macros
> >    wherever possible.
> >    - Discourages subclassing of Error because each subclass would need
> its
> >    own special macro to call its constructor. A brief search of the code
> base
> >    seems to indicate that Mesos programmers don’t tend to declare
> subclasses
> >    of Error. Right now, it seems that subclasses exist mainly to provide
> >    different construction semantics. The big question is whether one
> would
> >    want to use subclasses down the road to distinguish between
> equivalence
> >    classes of errors, say those that should bring the process down vs
> those
> >    that are recoverable.
> >    - Some amount of code churn, mainly in error.hpp, result.hpp, and
> >    try.hpp.
> >
> > *Limitations.*
> >
> >
> >    - This approach by itself won’t address error sites that invoke
> >    convenience functions such as Try<T>::error() and Result<T>::error().
> These
> >    could be addressed with a similar macro, but it would need to somehow
> deal
> >    with the template parameter as well.
> >
> > *Risks and Potential Complications.*
> >
> >
> >    - The rename process might fail to correctly distinguish between error
> >    sites (i.e. calls to Error::Error()) and other uses of the term
> “Error”.
> >    - There may be important compilers that don’t support __FILE__ and
> >    __LINE__. These predefined macros have been in the ANSI/ISO standard
> for a
> >    while, but it would still be wise to check each compiler we care
> about.
> >    - General hazards associated with macro expansion.
> >    - Users of Mesos may have written diagnostic tools that depend on the
> >    current values of the error strings.
> >    - Unit tests may depend on exact values of error strings. If this were
> >    the case, we'd have to provide a convenience function that strips off
> the
> >    file names and line numbers so that unit tests don't break when error
> sites
> >    move around due to edits elsewhere in the file.
> >
> > Let me know what you think of this idea. If the approach seems reasonable
> > and a worthwhile, I would be willing to prototype it to see if it has
> some
> > fatal flaw. Thanks for your input!
> >
> >
> >
> > -Mike
>
>
>
> --
> Alex
>
> Theory is the first term in the Taylor series of practice. -- Thomas M
> Cover (1992)
>

Re: Better error reporting in Mesos

Reply via email to