I've been thinking about failures, in particular:

1. How do we trigger failures while running, say, the standard Accumulo test suites? We can do it via IPC calls today, but that would require a client and an even more complex test run.
2. In production, how do we set good thresholds for deciding that the app is unstable and should be halted? SLIDER-77 proposed weighted moving averages, which are complex to tune.

Here are my thoughts:

Failures in tests: we have a built-in chaos monkey, so when a Slider instance is started, it reads in its polling/failure values and, driven by a (seedable) random number, triggers failures accordingly. Proposed in SLIDER-202 <https://issues.apache.org/jira/browse/SLIDER-202>

Instability threshold: we use the percentage of each component type failing over a 24-hour period. We could allow, say, 400% failures of the (few) masters, but workers must stay above 80% for the cluster to be considered stable. Proposed in SLIDER-203 <https://issues.apache.org/jira/browse/SLIDER-203>

Comments welcome, especially on the JIRAs. I don't plan to do any of them this sprint.
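
To make the chaos monkey idea a little more concrete, here is a rough sketch, in Python for brevity rather than Slider's Java. It is not the SLIDER-202 design: the class name `ChaosMonkey`, the `failure_probability` parameter and the `tick` callback are all hypothetical, and the only point it illustrates is that a seeded random source makes the injected failures repeatable across test runs.

```python
import random


class ChaosMonkey:
    """Hypothetical sketch: on each polling tick, maybe trigger a failure."""

    def __init__(self, failure_probability, seed=None):
        if not 0.0 <= failure_probability <= 1.0:
            raise ValueError("failure_probability must be in [0, 1]")
        self.failure_probability = failure_probability
        # A dedicated generator with a fixed seed gives the same
        # sequence of failure decisions on every run.
        self._rng = random.Random(seed)

    def tick(self, trigger_failure):
        """Call once per polling interval; invokes trigger_failure() at random.

        Returns True if a failure was triggered on this tick.
        """
        if self._rng.random() < self.failure_probability:
            trigger_failure()
            return True
        return False
```

Two monkeys built with the same seed and probability make identical decisions on every tick, so a test run that hit a bug can be replayed by reusing its seed.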

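
The percentage threshold could be sketched along these lines, again in Python and again only as an illustration. Everything here is an assumption rather than the SLIDER-203 design: the name `ComponentFailureTracker`, the choice to express the percentage as failures in the window divided by the desired instance count, and the reading of "workers must stay above 80%" as "at most 20% of workers failing". Timestamps are plain seconds and are assumed to arrive in order.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour period


class ComponentFailureTracker:
    """Hypothetical sketch: failure percentage for one component type."""

    def __init__(self, desired_instances, max_failure_percent,
                 window_seconds=WINDOW_SECONDS):
        if desired_instances <= 0:
            raise ValueError("desired_instances must be positive")
        self.desired_instances = desired_instances
        self.max_failure_percent = max_failure_percent
        self.window_seconds = window_seconds
        self._failures = deque()  # failure timestamps, oldest first

    def record_failure(self, timestamp):
        self._failures.append(timestamp)

    def failure_percent(self, now):
        # Forget failures that have aged out of the window.
        while self._failures and self._failures[0] <= now - self.window_seconds:
            self._failures.popleft()
        return 100.0 * len(self._failures) / self.desired_instances

    def is_stable(self, now):
        return self.failure_percent(now) <= self.max_failure_percent


# Assumed limits: 400% for the few masters, 20% for the workers.
masters = ComponentFailureTracker(desired_instances=2, max_failure_percent=400)
workers = ComponentFailureTracker(desired_instances=10, max_failure_percent=20)
```

With two masters and a 400% limit, eight master failures within 24 hours are tolerated and the ninth marks the cluster unstable; with ten workers and a 20% limit, the third worker failure does.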