[ https://issues.apache.org/jira/browse/KUDU-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-3025:
------------------------------
    Component/s: metrics

> Add metric for the open file descriptors usage vs the limit
> -----------------------------------------------------------
>
>                 Key: KUDU-3025
>                 URL: https://issues.apache.org/jira/browse/KUDU-3025
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, metrics, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>              Labels: Availability, observability, scalability
>
> In the case of even replica distribution across all available nodes, once one
> tablet server hits the maximum number of open file descriptors and goes down
> (e.g., upon hosting another tablet replica), the system will automatically
> re-replicate tablet replicas from that tablet server, most likely bringing
> other tablet servers down as well. That's a cascading failure scenario that
> nobody wants to experience.
> Monitoring the number of open file descriptors vs the limit can help to
> prevent a full Kudu cluster outage in such a case, if operators are given a
> chance to handle those situations proactively. Once some threshold is
> reached (e.g., 90%), an operator could update the limit via the corresponding
> {{ulimit}} setting, preventing an outage.
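
As a rough illustration of the check described above, the following Python sketch compares a process's open file descriptor count against its soft limit. It is not Kudu code and does not use any Kudu API. The function names (current_fd_usage, fd_ratio, is_near_limit) are hypothetical. The 0.9 default mirrors the 90% example in the issue. Counting entries under /proc/self/fd assumes a Linux-style procfs, with /dev/fd as a fallback.

```python
import os
import resource


def current_fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Assumes a directory with one entry per open descriptor exists:
    /proc/self/fd on Linux, /dev/fd on some other Unix systems.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # Listing the directory briefly opens one extra descriptor,
            # so the count may be one higher than the steady-state value.
            return len(os.listdir(fd_dir)), soft_limit
    raise OSError("no per-process fd directory found on this platform")


def fd_ratio(open_fds, soft_limit):
    """Fraction of the limit in use; 0.0 when the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def is_near_limit(open_fds, soft_limit, threshold=0.9):
    """True once usage reaches the alerting threshold (hypothetical default)."""
    return fd_ratio(open_fds, soft_limit) >= threshold


if __name__ == "__main__":
    used, limit = current_fd_usage()
    print(f"open fds: {used}, soft limit: {limit}, "
          f"near limit: {is_near_limit(used, limit)}")
```

A monitoring agent could run a check like this periodically and alert the operator before the limit is hit, leaving time to raise the limit via ulimit. An in-server metric, as the issue requests, would expose the same two numbers directly.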