Re: IMPALA-2549: Debug rules -> injecting failures and exercising edge cases

Henry Robinson Fri, 16 Sep 2016 16:51:45 -0700

Couple of answers inline (sorry for the delay in replying!)

On 6 September 2016 at 16:29, Juan <any...@gmail.com> wrote:


> Hey Henry,
>
> This is great. The rules are easy to read, easy to set, make failure
> injection much easier so we could add tests to verify Impala's behavior. I
> love that you can change the impala config dynamically without restart
> cluster. Not only it saves testing time, also adding flexibility to tests.
> I have a few questions:
> 1. could I set multiple debug rules for the same scope? for example, random
> rpc failures on multiple rpcs, like first transimit data RPC fails, then
> cancel fragment RPC fails as well.
>

You can set any number of rules, but a maximum of one on each tracepoint.
What you're asking for I think involves coordination between rules - that
would be possible to do quite easily, but wasn't in scope (i.e. someway to
say "this rule only takes effect if some other rule has already fired" -
easy to express that as a new action).



> 2. how easy to repeat a test failure? some actions allow inject failure
> randomly on fragments, If we could repeat the test with exactly the same
> failure injection sequence, we could probably figure out what's wrong
> faster.
>

That's something I want to make easy: random seeds should be logged, for
example, as should whether or not a rule was triggered. One goal is to make
it easy to write deterministic failure tests, so outside of rule sets
produced by randomized rule generation it should be very easy to see what's
going on.

Henry


>
> Thanks,
> Juan
>
> On Fri, Sep 2, 2016 at 4:51 PM, Henry Robinson <he...@apache.org> wrote:
>
> > Hi all -
> >
> > I want to start to get some feedback on a skunkworks project I did
> > recently.
> >
> > Impala has long had 'debug options' which are small scripts that can
> inject
> > a limited class of failures into query execution, mostly at the exec node
> > scope (e.g. delaying Prepare(), returning an error during Open() etc).
> > These are useful, but have a number of limitations:
> >
> > * Extending them is hard, because of the lack of regular language
> > * The set of available actions is limited to waiting forever, or
> > immediately cancelling.
> > * Actions can't be composed or sequenced, or executed with some
> > probability.
> > * Actions are only applicable to exec nodes, and generalising them is
> hard.
> > * Only one debug action may be applied per-query.
> >
> > I have wanted to make improvements here for a long time, not least
> because
> > testing any RPC changes is hard, and am about done with the first cut
> that
> > addresses all of the above issues and more. I'm calling them 'debug
> rules',
> > and they have the following features:
> >
> >    1. Regular rule structure so that the language may be extended without
> >    touching a parser.
> >    2. JSON-based serialisation so that rules may be human readable and
> >    (somewhat) writeable.
> >    3. Python builder classes to make complex rule construction easy.
> >    4. Extensible set of actions, which now includes setting a CLI flag at
> >    runtime, logging a message, failing with a probability, sleeping, and
> >    setting RPC timeouts.
> >    5. Actions may have arguments which may themselves be functions.
> >    Functions may access their environment in a limited way (so, for
> > example,
> >    they may log their fragment ID).
> >    6. Actions may be applied at any point in Impalad, not just in exec
> >    nodes.
> >    7. Disabled completely in release builds.
> >
> >
> > The full commit is the HEAD here: https://github.com/
> > henryr/impala/tree/debug-rules. But I'd like some general feedback on
> the
> > 'user experience', and with that in mind I'd like to point you to four
> > locations:
> >
> >    1. Introductory documentation here: https://github.com/
> >    henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25
> >    <https://github.com/henryr/Impala/blob/debug-rules/be/
> > src/runtime/debug-rules.h#L25>
> >    2. A set of tests for failure modes after IMPALA-1599, which
> >    parallelised fragment start-up: https://github.com/
> >    henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_1599.py>
> >    3. A randomized test that introduces random delays to a subset of
> phases
> >    in PHJ, PHA, scans and xchg operators: https://github.com/henryr/
> >    Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_exec_node_injection.py>
> >    4. A port of the existing test_rpc_timeouts.py test (that confirms the
> >    RPC timeout mechanism that Juan wrote is working correctly) that
> doesn't
> >    need to be a custom cluster test - debug rules do all the parameter
> and
> >    flag setting: https://github.com/henryr/Impala/blob/debug-
> >    rules/tests/rpc/test_timeouts.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_timeouts.py>
> >
> > I like the last one particularly as it shaves more than 60s off the
> > execution time of that test.
> >
> > The changes themselves are very large (~2000 lines and growing), and
> aren't
> > remotely ready for review yet (you'll see half commented code all over
> the
> > place) - but I'd like feedback on whether this is useful, would help with
> > any testing scenarios you haven't been able to address yet, and if the
> > abstractions and test APIs seem about right.
> >
> > Let me know what you think -
> >
> > Henry
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

Re: IMPALA-2549: Debug rules -> injecting failures and exercising edge cases

Reply via email to