some ideas for failure tracking and testing: built in chaos monkey; percentage failure tracking

Steve Loughran Wed, 02 Jul 2014 06:07:36 -0700

I've been thinking about failures, in particular:

   1. How do we trigger failures while running, say, the standard Accumulo
   test suites? We can do it via IPC calls today, but that would require a
   client and an even-more-complex test run.
   2. In production, how do we set good thresholds for deciding that the app
   is unstable and should be halted? SLIDER-77 proposed weighted moving
   averages, which are complex to tune.



Here are my thoughts:

Failures in tests: we would have a built-in chaos monkey, so when a Slider
instance is started, it reads in its polling/failure values and, driven by
a (seedable) random number generator, triggers failures appropriately.
Proposed in SLIDER-202 <https://issues.apache.org/jira/browse/SLIDER-202>
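
To make the chaos monkey idea concrete, here is a minimal sketch in Python.
It is not the SLIDER-202 design: the names ChaosTarget, ChaosMonkey and
poll() are hypothetical, and the "polling/failure values" are assumed to be
a per-poll probability for each kind of failure. The point it illustrates is
that a seeded random number generator makes a failure run repeatable.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ChaosTarget:
    """One kind of failure the monkey may trigger (hypothetical structure)."""
    name: str
    probability: float          # assumed: chance of firing on each poll, 0.0 to 1.0
    action: Callable[[], None]  # what "failing" means, e.g. killing a container


class ChaosMonkey:
    """Triggers configured failures at random, driven by a seedable RNG."""

    def __init__(self, targets: Iterable[ChaosTarget], seed: Optional[int] = None):
        self.targets = list(targets)
        # A private Random instance: the same seed replays the same failures.
        self.rng = random.Random(seed)

    def poll(self) -> List[str]:
        """Run one polling interval; return the names of the failures fired."""
        fired = []
        for target in self.targets:
            # random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always does.
            if self.rng.random() < target.probability:
                target.action()
                fired.append(target.name)
        return fired
```

A test run would construct the monkey with a fixed seed and call poll() on
each polling interval; re-running with the same seed and the same targets
then triggers the same failures in the same order.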

Instability threshold: we use percentages of a component type failing over
a 24-hour period. We could allow, say, 400% failures of the (few) masters,
but workers must stay above 80% for the cluster to be considered stable.
Proposed in SLIDER-203 <https://issues.apache.org/jira/browse/SLIDER-203>
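
One way the percentage check could work is sketched below in Python. This is
an illustration under assumptions, not the SLIDER-203 design: the
FailureTracker name and its methods are hypothetical, and "percentage
failing" is assumed to mean failures seen in the last 24 hours divided by
the desired instance count of that component type. Under that reading, a
400% limit on 2 masters tolerates 8 failures in the window.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour period from the proposal


class FailureTracker:
    """Tracks failures of one component type over a sliding time window."""

    def __init__(self, desired_instances, max_failure_percent,
                 window_seconds=WINDOW_SECONDS):
        if desired_instances <= 0:
            raise ValueError("desired_instances must be positive")
        self.desired_instances = desired_instances
        self.max_failure_percent = max_failure_percent
        self.window_seconds = window_seconds
        self.failures = deque()  # failure timestamps in seconds, oldest first

    def record_failure(self, now):
        """Record one failure at time `now` (seconds, non-decreasing)."""
        self.failures.append(now)

    def failure_percent(self, now):
        """Failures inside the window as a percentage of desired instances."""
        cutoff = now - self.window_seconds
        # Drop failures that have aged out of the window.
        while self.failures and self.failures[0] <= cutoff:
            self.failures.popleft()
        return 100.0 * len(self.failures) / self.desired_instances

    def is_unstable(self, now):
        """True once the failure percentage exceeds the configured limit."""
        return self.failure_percent(now) > self.max_failure_percent
```

Each component type would get its own tracker with its own limit, for
example FailureTracker(desired_instances=2, max_failure_percent=400.0) for
the masters. Timestamps are passed in explicitly so that tests can use a
simulated clock.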

Comments welcome, especially on the JIRAs. I don't plan to do any of them
this sprint.

