Steve Loughran created SLIDER-203:
-------------------------------------

             Summary: Implement scalable failure threshold based on percentage 
of instances failing over a time period
                 Key: SLIDER-203
                 URL: https://issues.apache.org/jira/browse/SLIDER-203
             Project: Slider
          Issue Type: Sub-task
          Components: appmaster, test
    Affects Versions: Slider 0.40
            Reporter: Steve Loughran


SLIDER-77 proposed weighted moving averages for failures. This has some flaws:
# it's hard to understand and configure
# different cluster sizes need different default values
# if you flex a cluster, the threshold may become inappropriate

I propose something more tangible, closer to how people track physical nodes: 
the percentage of instances failing over a time period.

For example, we could define the failure thresholds of a functional HBase cluster as:
200% of masters failing per day (for two masters, that's 4 failures)
80% of region servers failing per day (for 20 region servers, that's 16 failures)

Every day the counter could be reset.
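
A minimal sketch of the idea, in Python for brevity rather than as appmaster code. All names here (failure_limit, FailureWindow, period_seconds) are hypothetical. Rounding the limit up and treating the threshold as exceeded only when the count goes strictly above the limit are assumptions, not part of the proposal.

```python
def failure_limit(instances: int, percent: int) -> int:
    """Failures tolerated per period for a component with this many instances."""
    # Assumption: round up, so a small component with a non-zero percentage
    # always tolerates at least one failure.
    return (instances * percent + 99) // 100


class FailureWindow:
    """Counts failures for one component and resets the count each period."""

    def __init__(self, percent: int, period_seconds: float = 24 * 60 * 60):
        self.percent = percent
        self.period_seconds = period_seconds
        self.failures = 0
        self.window_start = 0.0

    def _maybe_reset(self, now: float) -> None:
        # Start a fresh window once the current period has elapsed.
        if now - self.window_start >= self.period_seconds:
            self.failures = 0
            self.window_start = now

    def record_failure(self, now: float) -> None:
        self._maybe_reset(now)
        self.failures += 1

    def exceeded(self, instances: int, now: float) -> bool:
        # Assumption: "over the limit" means strictly more failures than the limit.
        self._maybe_reset(now)
        return self.failures > failure_limit(instances, self.percent)
```

Because the limit is computed from the current instance count each time it is checked, the same percentage works for any cluster size: failure_limit(2, 200) gives 4 for the masters and failure_limit(20, 80) gives 16 for the region servers.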

Flexing complicates the equation: it may be simplest just to reset the 
counters, at least when scaling down. Otherwise, if a 20-worker cluster had a 
failure count of 5 and a 40% threshold (a limit of 8), all would be well. But 
scale it down to 10 nodes (a limit of 4) and the failure count is immediately 
over the limit.
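
The arithmetic of the flex-down case can be sketched as follows. This is an illustration only: the function names are hypothetical, and the strict "greater than" comparison is an assumption about what "over the limit" means.

```python
def over_limit(failures: int, instances: int, percent: int) -> bool:
    """True if the failure count exceeds percent% of the current instance count."""
    # Compare in integers: failures / instances > percent / 100.
    return failures * 100 > instances * percent


def failures_after_flex(failures: int, old_size: int, new_size: int) -> int:
    """Failure count to carry across a flex: reset when scaling down, else keep."""
    return 0 if new_size < old_size else failures


# 20 workers, 5 failures, 40% threshold: limit is 8, so all is well.
before = over_limit(5, 20, 40)

# Flex down to 10 workers without a reset: limit is 4, 5 failures is over it.
carried = over_limit(5, 10, 40)

# Flex down with a reset: the count restarts at 0.
reset = over_limit(failures_after_flex(5, 20, 10), 10, 40)
```

Here before is False, carried is True and reset is False, which is the case for resetting the counters on a scale-down.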






--
This message was sent by Atlassian JIRA
(v6.2#6252)
