Steve Loughran created SLIDER-203:
-------------------------------------
Summary: Implement scalable failure threshold based on percentage
of instances failing over a time period
Key: SLIDER-203
URL: https://issues.apache.org/jira/browse/SLIDER-203
Project: Slider
Issue Type: Sub-task
Components: appmaster, test
Affects Versions: Slider 0.40
Reporter: Steve Loughran
SLIDER-77 proposed weighted moving averages for failures. This has some flaws:
# it's hard to understand and configure
# different cluster sizes need different default values
# if you flex a cluster, the threshold may become inappropriate
I propose something more tangible and closer to how physical nodes are
tracked: the percentage of instances failing over a time period.
For example, we could define the failure limits of a functional HBase cluster as:
200% of masters failing per day (for 2 masters, that's 4 failures)
80% of region servers failing per day (for 20 region servers, that's 16 failures)
The counters could be reset every day.
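
As a rough sketch of the arithmetic only (not Slider code): the function names
failure_limit and threshold_exceeded are hypothetical, and the rule that the
threshold trips only when the count goes strictly above the limit is an
assumption, since the proposal does not say which side of the limit counts.

```python
def failure_limit(instances: int, percent: int) -> int:
    """Failures tolerated per time period for a role.

    instances: current number of instances of the role
    percent:   threshold as a percentage; may exceed 100
    """
    # Integer arithmetic avoids float rounding; fractions round down.
    return instances * percent // 100


def threshold_exceeded(failures: int, instances: int, percent: int) -> bool:
    # Assumption: reaching the limit is tolerated, exceeding it is not.
    return failures > failure_limit(instances, percent)


if __name__ == "__main__":
    print(failure_limit(2, 200))   # masters from the example above
    print(failure_limit(20, 80))   # region servers from the example above
```

Because the limit is derived from the instance count, the same percentage
gives sensible limits for clusters of different sizes.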
Flexing complicates the equation: it may be simplest just to reset the
counters, at least when scaling down. Otherwise, a 20-worker cluster with a
failure count of 5 and a 40% threshold (a limit of 8 failures) would be fine,
but scale it down to 10 nodes (a limit of 4) and the failure count is
immediately over the limit.
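
One way the reset-on-scale-down idea might look, as a minimal sketch rather
than a proposed implementation. RoleFailureTracker and all of its members are
hypothetical names; the one-day default window, the injectable clock, and the
"strictly over the limit" rule are assumptions made for illustration.

```python
import time


class RoleFailureTracker:
    """Counts failures for one role over a fixed time window (hypothetical)."""

    def __init__(self, instances, percent, window_seconds=24 * 60 * 60,
                 clock=time.monotonic):
        self.instances = instances
        self.percent = percent
        self.window_seconds = window_seconds
        self._clock = clock            # injectable so tests can fake time
        self._window_start = clock()
        self.failures = 0

    def _roll_window(self):
        # Reset the counter once the current window has elapsed.
        if self._clock() - self._window_start >= self.window_seconds:
            self._window_start = self._clock()
            self.failures = 0

    def exceeded(self):
        # Assumption: only a count strictly above the limit trips it.
        return self.failures > self.instances * self.percent // 100

    def record_failure(self):
        self._roll_window()
        self.failures += 1
        return self.exceeded()

    def flex(self, new_instances):
        # Scaling down shrinks the limit, so discard the old count
        # rather than let it trip the new, smaller threshold.
        if new_instances < self.instances:
            self.failures = 0
        self.instances = new_instances


if __name__ == "__main__":
    tracker = RoleFailureTracker(instances=20, percent=40)
    for _ in range(5):
        tracker.record_failure()
    tracker.flex(10)
    print(tracker.failures, tracker.exceeded())
```

Scaling up keeps the existing count in this sketch, since a larger cluster
only raises the limit; whether that is the right choice is left open above.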