Re: Better error reporting in Mesos

Alex Clemmer Mon, 23 Nov 2015 10:48:34 -0800

An additional argument for logging a struct and putting all the
to-string logic somewhere else is that it would allow us to
incorporate localization and internationalization relatively
painlessly. If we have to pass around a string, it is much more
painful to do this.


On Tue, Nov 10, 2015 at 12:28 PM, Michael Hopcroft <[email protected]> wrote:
> There have been a lot of great suggestions on this thread. My
> recommendation is that I do an experiment to uncover any hidden issues with
> the macro approach outlined in my original email. I will try to design the
> API of the ErrorImpl class so that it would be, not only possible, but easy
> to add support for error chaining and message formatting, as has been
> suggested on this thread. If I don’t encounter any significant problems
> with the basic experiment, I will come back to this group with a patch and
> we can then discuss whether to push the design further forward. I’m a big
> fan of an error infrastructure because it gives us a centralized location
> in the code to add debugging hooks and apply various policies such as
> logging and formatting.
>
> On Fri, Nov 6, 2015 at 1:56 PM, Alex Clemmer <[email protected]>
> wrote:
>
>> In general, I think a good answer to my question is for the place that
>> is reporting or unboxing the error to aggregate data into an error
>> struct that holds things like the "error stack" (and other semantics
>> about the log, including perhaps error codes), and then let whoever is
>> logging, reporting or printing the message, handle the pretty-printing
>> semantics. This is a departure from our current log-the-error-message
>> approach, where the error message acts as the sole communicator of
>> error semantics and context, including the details about the line
>> number.
>>
>> In fact, I think it's worth wondering whether the place where we
>> create an error should even log the error message -- right now we add
>> a lot of strings together, and it makes more sense to me to have a
>> methodology for just populating an "error struct" with data about the
>> error, and then the `toString` method (or whatever) can simply
>> generate the error messages as necessary, in one holistic place. That
>> way, we have centralized place that determining how an error gets
>> transformed into a string, instead of a lot of ad hoc string
>> concatenations and error unboxing. Another advantage is that you'd
>> have very fine-grained control over how to respond to errors, as it is
>> a typesafe struct. In fact, you could even imagine having a system
>> where error stringifying is handled completely out-of-band, meaning
>> that logging could become much, much cheaper in general.
>>
>> But anyway, more broadly, what I'm trying to say is that the place
>> where we are returning an error should not know or care about how the
>> downstream sites pretty-print it or log it. Those are separate
>> concerns.
>>
>> On Fri, Nov 6, 2015 at 1:17 PM, Benjamin Mahler
>> <[email protected]> wrote:
>> > Thanks for the proposal Michael. This has come up in discussions before
>> and
>> > I think it's really valuable, so it's nice to see folks thinking about
>> > this. Some things to consider:
>> >
>> > Don't forget 'Failure' which is unfortunately not unified with 'Error'.
>> >
>> > How does this affect error message composition (as Alex alluded to)?
>> > Currently we do manual error message composition ("<Caller message>:
>> > <callee message>: <callee's callee message>: ...") but in the future we
>> may
>> > add composition support to our synchronous types like Try,Result,etc to
>> be
>> > similar to Future composition with .then. See:
>> > https://issues.apache.org/jira/browse/MESOS-785
>> >
>> > Given composition, it sounds like we want to decouple the trace
>> information
>> > from the error message, so that we can store a "stack-trace" of multiple
>> > code locations?
>> >
>> > It would be good to be aware of the limitations around Future composition
>> > and failure messages. When a .then chain encounters a failure, no error
>> > message wrapping occurs as the failure bubbles up. The result is that we
>> > instead get "<bottom-most callee message>" as the failure message without
>> > the caller wrapping I mentioned above. Tracing will help alleviate only
>> > having the bottom most callee failure, but it would be great to think
>> about
>> > how to produce better messages as well (given MESOS-785 lands us in the
>> > same boat for Try,Result,etc).
>> >
>> > Typed errors/failures has come up before as well, but might be orthogonal
>> > enough to discuss separately.
>> >
>> > On Fri, Nov 6, 2015 at 1:16 AM, Alex Clemmer <
>> [email protected]>
>> > wrote:
>> >
>> >> I'm strongly in favor of moving down this path, or a path like it.
>> >>
>> >> As Mesos becomes increasingly important as core distributed systems
>> >> infrastructure, it will become correspondingly helpful and important
>> >> to have things like excellent core error reporting facilities, even if
>> >> they come at the cost of some technical risk in the short term. In
>> >> this case in particular, I believe this error reporting model, where
>> >> the developer need only fill in semantic data about the error (e.g.,
>> >> the error message), and the boilerplate context of the error (e.g.,
>> >> the line number of the error) is filled in automatically and opaquely
>> >> by the system, is clearly the "right" model. On the merits of simply
>> >> adding line numbers itself, I think this is worth prototyping, and if
>> >> anything, I'd ask for the model to go further -- I've worked on scale
>> >> systems where errors are all "stringly typed" and the call sites are
>> >> all tribal knowledge, and I've worked on systems where the errors are
>> >> highly semantically regimented, and it is extremely clear to me that
>> >> the latter is clearly, unequivocally better.
>> >>
>> >> It's true there are some subtleties to think through. We'll need to
>> >> think about things like ABI compatibility, and we'll need to think
>> >> about whether there are situations where we _don't_ want line numbers
>> >> to be reported (e.g., we often "unbox" the error message and repackage
>> >> it, and it's worth wondering whether these semantics hurt or help
>> >> those error handling practices). It will indeed require a steady hand
>> >> to make sure the needle is "threaded correctly" and we don't stomp on
>> >> something important by accident.
>> >>
>> >> But I think it's worth taking the time to give this a shot anyway. I'm
>> >> happy to do the code reviews myself, if that helps. :)
>> >>
>> >> On Fri, Nov 6, 2015 at 12:29 AM, Michael Hopcroft <[email protected]>
>> >> wrote:
>> >> > Error reporting in Mesos is based on Stout’s Error class and a
>> variety of
>> >> > other classes, such as ErrnoError and WindowsError, that derive from
>> >> Error.
>> >> > One goal of these classes is to simplify the construction and storage
>> of
>> >> > error message strings. As an example, the ErrnoError constructor
>> >> > automatically incorporates std::string(strerror(errno)) into its error
>> >> > message.
>> >> >
>> >> >
>> >> > Stout catches exceptions and converts them into Try<T>  and Result<T>
>> >> > objects which are used as polymorphic return values that can express
>> >> either
>> >> > an error condition or a proper return value. One consequence of
>> catching
>> >> > exceptions is that call stacks from throw sites and error sites are
>> lost.
>> >> > This makes it harder to debug production environment crashes post hoc.
>> >> >
>> >> >
>> >> > One way to improve debugability would be to add more information about
>> >> the
>> >> > error site to the error message. This could be done by convention,
>> but it
>> >> > seems that the vast majority of the error messages in Mesos currently
>> >> don’t
>> >> > include much information about the site that originally generated an
>> >> error.
>> >> >
>> >> >
>> >> > Because much of the code in Mesos uses Error and ErrnoError, we have
>> an
>> >> > opportunity to add diagnostic information automatically with only a
>> small
>> >> > amount of churn. What we’d need to do is provide Error creation macros
>> >> that
>> >> > would pass the source file name and line number to the constructors of
>> >> the
>> >> > various Error classes.
>> >> >
>> >> >
>> >> > Concretely this would involve
>> >> >
>> >> >
>> >> >    - class Error
>> >> >       - Rename declaration and definition to ErrorImpl (or ErrorBase).
>> >> This
>> >> >       change is not error prone.
>> >> >       - Rename references to class’ type. These changes are mostly
>> >> limited
>> >> >       to result.hpp and try.hpp.
>> >> >       - Don’t rename calls to constructor. These will now just invoke
>> the
>> >> >       macro.
>> >> >       - Modify the constructor to take a line number and source
>> filename
>> >> in
>> >> >       addition to a message.
>> >> >       - Define a macro called Error(message) that has the same
>> prototype
>> >> as
>> >> >       the original Error constructor. This macro would invoke
>> >> >       ErrorImpl::ErrorImpl, passing the current line number and source
>> >> filename
>> >> >       along with the user-provided message. The macro would make use
>> >> > of __FILE__
>> >> >       and __LINE__, which are predefined in gcc and Visual Studio.
>> >> >    - class ErrnoError, WindowsError.
>> >> >       - Similar pattern as above.
>> >> >
>> >> > *Pros.*
>> >> >
>> >> >
>> >> >    - Codebase is more debuggable in production environment because the
>> >> vast
>> >> >    majority of error sites will document the file name and line number
>> >> in the
>> >> >    error message.
>> >> >    - Most error sites don’t require a code change.
>> >> >
>> >> > *Cons.*
>> >> >
>> >> >
>> >> >    - Introduces a macro. In general, we’d like to move away from
>> macros
>> >> >    wherever possible.
>> >> >    - Discourages subclassing of Error because each subclass would need
>> >> its
>> >> >    own special macro to call its constructor. A brief search of the
>> code
>> >> base
>> >> >    seems to indicate that Mesos programmers don’t tend to declare
>> >> subclasses
>> >> >    of Error. Right now, it seems that subclasses exist mainly to
>> provide
>> >> >    different construction semantics. The big question is whether one
>> >> would
>> >> >    want to use subclasses down the road to distinguish between
>> >> equivalence
>> >> >    classes of errors, say those that should bring the process down vs
>> >> those
>> >> >    that are recoverable.
>> >> >    - Some amount of code churn, mainly in error.hpp, result.hpp, and
>> >> >    try.hpp.
>> >> >
>> >> > *Limitations.*
>> >> >
>> >> >
>> >> >    - This approach by itself won’t address error sites that invoke
>> >> >    convenience functions such as Try<T>::error() and
>> Result<T>::error().
>> >> These
>> >> >    could be addressed with a similar macro, but it would need to
>> somehow
>> >> deal
>> >> >    with the template parameter as well.
>> >> >
>> >> > *Risks and Potential Complications.*
>> >> >
>> >> >
>> >> >    - The rename process might fail to correctly distinguish between
>> error
>> >> >    sites (i.e. calls to Error::Error()) and other uses of the term
>> >> “Error”.
>> >> >    - There may be important compilers that don’t support __FILE__ and
>> >> >    __LINE__. These predefined macros have been in the ANSI/ISO
>> standard
>> >> for a
>> >> >    while, but it would still be wise to check each compiler we care
>> >> about.
>> >> >    - General hazards associated with macro expansion.
>> >> >    - Users of Mesos may have written diagnostic tools that depend on
>> the
>> >> >    current values of the error strings.
>> >> >    - Unit tests may depend on exact values of error strings. If this
>> were
>> >> >    the case, we'd have to provide a convenience function that strips
>> off
>> >> the
>> >> >    file names and line numbers so that unit tests don't break when
>> error
>> >> sites
>> >> >    move around due to edits elsewhere in the file.
>> >> >
>> >> > Let me know what you think of this idea. If the approach seems
>> reasonable
>> >> > and a worthwhile, I would be willing to prototype it to see if it has
>> >> some
>> >> > fatal flaw. Thanks for your input!
>> >> >
>> >> >
>> >> >
>> >> > -Mike
>> >>
>> >>
>> >>
>> >> --
>> >> Alex
>> >>
>> >> Theory is the first term in the Taylor series of practice. -- Thomas M
>> >> Cover (1992)
>> >>
>>
>>
>>
>> --
>> Alex
>>
>> Theory is the first term in the Taylor series of practice. -- Thomas M
>> Cover (1992)
>>



-- 
Alex

Theory is the first term in the Taylor series of practice. -- Thomas M
Cover (1992)

Re: Better error reporting in Mesos

Reply via email to