Re: Better error reporting in Mesos

Alex Clemmer Fri, 06 Nov 2015 13:57:21 -0800

In general, I think a good answer to my question is for the place that
is reporting or unboxing the error to aggregate data into an error
struct that holds things like the "error stack" (and other semantics
about the log, including perhaps error codes), and then let whoever is
logging, reporting or printing the message, handle the pretty-printing
semantics. This is a departure from our current log-the-error-message
approach, where the error message acts as the sole communicator of
error semantics and context, including the details about the line
number.


In fact, I think it's worth wondering whether the place where we
create an error should even log the error message -- right now we add
a lot of strings together, and it makes more sense to me to have a
methodology for just populating an "error struct" with data about the
error, and then the `toString` method (or whatever) can simply
generate the error messages as necessary, in one holistic place. That
way, we have centralized place that determining how an error gets
transformed into a string, instead of a lot of ad hoc string
concatenations and error unboxing. Another advantage is that you'd
have very fine-grained control over how to respond to errors, as it is
a typesafe struct. In fact, you could even imagine having a system
where error stringifying is handled completely out-of-band, meaning
that logging could become much, much cheaper in general.

But anyway, more broadly, what I'm trying to say is that the place
where we are returning an error should not know or care about how the
downstream sites pretty-print it or log it. Those are separate
concerns.

On Fri, Nov 6, 2015 at 1:17 PM, Benjamin Mahler
<benjamin.mah...@gmail.com> wrote:
> Thanks for the proposal Michael. This has come up in discussions before and
> I think it's really valuable, so it's nice to see folks thinking about
> this. Some things to consider:
>
> Don't forget 'Failure' which is unfortunately not unified with 'Error'.
>
> How does this affect error message composition (as Alex alluded to)?
> Currently we do manual error message composition ("<Caller message>:
> <callee message>: <callee's callee message>: ...") but in the future we may
> add composition support to our synchronous types like Try,Result,etc to be
> similar to Future composition with .then. See:
> https://issues.apache.org/jira/browse/MESOS-785
>
> Given composition, it sounds like we want to decouple the trace information
> from the error message, so that we can store a "stack-trace" of multiple
> code locations?
>
> It would be good to be aware of the limitations around Future composition
> and failure messages. When a .then chain encounters a failure, no error
> message wrapping occurs as the failure bubbles up. The result is that we
> instead get "<bottom-most callee message>" as the failure message without
> the caller wrapping I mentioned above. Tracing will help alleviate only
> having the bottom most callee failure, but it would be great to think about
> how to produce better messages as well (given MESOS-785 lands us in the
> same boat for Try,Result,etc).
>
> Typed errors/failures has come up before as well, but might be orthogonal
> enough to discuss separately.
>
> On Fri, Nov 6, 2015 at 1:16 AM, Alex Clemmer <clemmer.alexan...@gmail.com>
> wrote:
>
>> I'm strongly in favor of moving down this path, or a path like it.
>>
>> As Mesos becomes increasingly important as core distributed systems
>> infrastructure, it will become correspondingly helpful and important
>> to have things like excellent core error reporting facilities, even if
>> they come at the cost of some technical risk in the short term. In
>> this case in particular, I believe this error reporting model, where
>> the developer need only fill in semantic data about the error (e.g.,
>> the error message), and the boilerplate context of the error (e.g.,
>> the line number of the error) is filled in automatically and opaquely
>> by the system, is clearly the "right" model. On the merits of simply
>> adding line numbers itself, I think this is worth prototyping, and if
>> anything, I'd ask for the model to go further -- I've worked on scale
>> systems where errors are all "stringly typed" and the call sites are
>> all tribal knowledge, and I've worked on systems where the errors are
>> highly semantically regimented, and it is extremely clear to me that
>> the latter is clearly, unequivocally better.
>>
>> It's true there are some subtleties to think through. We'll need to
>> think about things like ABI compatibility, and we'll need to think
>> about whether there are situations where we _don't_ want line numbers
>> to be reported (e.g., we often "unbox" the error message and repackage
>> it, and it's worth wondering whether these semantics hurt or help
>> those error handling practices). It will indeed require a steady hand
>> to make sure the needle is "threaded correctly" and we don't stomp on
>> something important by accident.
>>
>> But I think it's worth taking the time to give this a shot anyway. I'm
>> happy to do the code reviews myself, if that helps. :)
>>
>> On Fri, Nov 6, 2015 at 12:29 AM, Michael Hopcroft <m...@hopcroft.org>
>> wrote:
>> > Error reporting in Mesos is based on Stout’s Error class and a variety of
>> > other classes, such as ErrnoError and WindowsError, that derive from
>> Error.
>> > One goal of these classes is to simplify the construction and storage of
>> > error message strings. As an example, the ErrnoError constructor
>> > automatically incorporates std::string(strerror(errno)) into its error
>> > message.
>> >
>> >
>> > Stout catches exceptions and converts them into Try<T>  and Result<T>
>> > objects which are used as polymorphic return values that can express
>> either
>> > an error condition or a proper return value. One consequence of catching
>> > exceptions is that call stacks from throw sites and error sites are lost.
>> > This makes it harder to debug production environment crashes post hoc.
>> >
>> >
>> > One way to improve debugability would be to add more information about
>> the
>> > error site to the error message. This could be done by convention, but it
>> > seems that the vast majority of the error messages in Mesos currently
>> don’t
>> > include much information about the site that originally generated an
>> error.
>> >
>> >
>> > Because much of the code in Mesos uses Error and ErrnoError, we have an
>> > opportunity to add diagnostic information automatically with only a small
>> > amount of churn. What we’d need to do is provide Error creation macros
>> that
>> > would pass the source file name and line number to the constructors of
>> the
>> > various Error classes.
>> >
>> >
>> > Concretely this would involve
>> >
>> >
>> >    - class Error
>> >       - Rename declaration and definition to ErrorImpl (or ErrorBase).
>> This
>> >       change is not error prone.
>> >       - Rename references to class’ type. These changes are mostly
>> limited
>> >       to result.hpp and try.hpp.
>> >       - Don’t rename calls to constructor. These will now just invoke the
>> >       macro.
>> >       - Modify the constructor to take a line number and source filename
>> in
>> >       addition to a message.
>> >       - Define a macro called Error(message) that has the same prototype
>> as
>> >       the original Error constructor. This macro would invoke
>> >       ErrorImpl::ErrorImpl, passing the current line number and source
>> filename
>> >       along with the user-provided message. The macro would make use
>> > of __FILE__
>> >       and __LINE__, which are predefined in gcc and Visual Studio.
>> >    - class ErrnoError, WindowsError.
>> >       - Similar pattern as above.
>> >
>> > *Pros.*
>> >
>> >
>> >    - Codebase is more debuggable in production environment because the
>> vast
>> >    majority of error sites will document the file name and line number
>> in the
>> >    error message.
>> >    - Most error sites don’t require a code change.
>> >
>> > *Cons.*
>> >
>> >
>> >    - Introduces a macro. In general, we’d like to move away from macros
>> >    wherever possible.
>> >    - Discourages subclassing of Error because each subclass would need
>> its
>> >    own special macro to call its constructor. A brief search of the code
>> base
>> >    seems to indicate that Mesos programmers don’t tend to declare
>> subclasses
>> >    of Error. Right now, it seems that subclasses exist mainly to provide
>> >    different construction semantics. The big question is whether one
>> would
>> >    want to use subclasses down the road to distinguish between
>> equivalence
>> >    classes of errors, say those that should bring the process down vs
>> those
>> >    that are recoverable.
>> >    - Some amount of code churn, mainly in error.hpp, result.hpp, and
>> >    try.hpp.
>> >
>> > *Limitations.*
>> >
>> >
>> >    - This approach by itself won’t address error sites that invoke
>> >    convenience functions such as Try<T>::error() and Result<T>::error().
>> These
>> >    could be addressed with a similar macro, but it would need to somehow
>> deal
>> >    with the template parameter as well.
>> >
>> > *Risks and Potential Complications.*
>> >
>> >
>> >    - The rename process might fail to correctly distinguish between error
>> >    sites (i.e. calls to Error::Error()) and other uses of the term
>> “Error”.
>> >    - There may be important compilers that don’t support __FILE__ and
>> >    __LINE__. These predefined macros have been in the ANSI/ISO standard
>> for a
>> >    while, but it would still be wise to check each compiler we care
>> about.
>> >    - General hazards associated with macro expansion.
>> >    - Users of Mesos may have written diagnostic tools that depend on the
>> >    current values of the error strings.
>> >    - Unit tests may depend on exact values of error strings. If this were
>> >    the case, we'd have to provide a convenience function that strips off
>> the
>> >    file names and line numbers so that unit tests don't break when error
>> sites
>> >    move around due to edits elsewhere in the file.
>> >
>> > Let me know what you think of this idea. If the approach seems reasonable
>> > and a worthwhile, I would be willing to prototype it to see if it has
>> some
>> > fatal flaw. Thanks for your input!
>> >
>> >
>> >
>> > -Mike
>>
>>
>>
>> --
>> Alex
>>
>> Theory is the first term in the Taylor series of practice. -- Thomas M
>> Cover (1992)
>>



-- 
Alex

Theory is the first term in the Taylor series of practice. -- Thomas M
Cover (1992)

Re: Better error reporting in Mesos

Reply via email to