[ 
https://issues.apache.org/jira/browse/KUDU-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-3025:
------------------------------
    Component/s: metrics

> Add metric for the open file descriptors usage vs the limit
> -----------------------------------------------------------
>
>                 Key: KUDU-3025
>                 URL: https://issues.apache.org/jira/browse/KUDU-3025
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, metrics, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>              Labels: Availability, observability, scalability
>
> In the case of even replica distribution across all available nodes, once one 
> tablet server hits the maximum number of open file descriptors and goes down 
> (e.g., upon hosting another tablet replica), the system will automatically 
> re-replicate the tablet replicas from that tablet server, most likely bringing 
> other tablet servers down as well.  That's a cascading failure scenario that 
> nobody wants to experience.
> Monitoring the number of open file descriptors vs the limit can help to 
> prevent a full Kudu cluster outage in such a case, if operators are given a 
> chance to handle those situations proactively.  Once some threshold is 
> reached (e.g., 90%), an operator could update the limit via the corresponding 
> {{ulimit}} setting, preventing an outage.
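
The check described above can be sketched as follows. This is only an illustration of the idea, not Kudu's implementation: it assumes a Linux host where /proc/self/fd lists the process's open descriptors, and the function names fd_usage and over_threshold are hypothetical. The 0.9 default mirrors the 90% example in the description.

```python
import os
import resource


def fd_usage(fd_dir="/proc/self/fd"):
    """Return (open_fds, soft_limit) for the current process (Linux only)."""
    # Each entry under /proc/self/fd is one open descriptor. Listing the
    # directory briefly opens one more, so the count may be high by one.
    open_fds = len(os.listdir(fd_dir))
    # The soft limit is the value reported by `ulimit -n`.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


def over_threshold(open_fds, limit, threshold=0.9):
    """Return True when usage has reached the given fraction of the limit."""
    # An unlimited or non-positive limit can never be "nearly exhausted".
    if limit == resource.RLIM_INFINITY or limit <= 0:
        return False
    return open_fds / limit >= threshold


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"open file descriptors: {used} of {limit}")
    if over_threshold(used, limit):
        print("warning: usage is at or above 90% of the limit")
```

Exposing the two numbers separately, rather than only a ratio, lets an operator pick the alerting threshold in the monitoring system instead of in the server.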



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
