EBernhardson has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/259443

Change subject: [elasticsearch] Collect cluster health stats about shard movement
......................................................................

[elasticsearch] Collect cluster health stats about shard movement

Adding statistics for relocating/initializing/unassigned shards
should give us better insight into when the cluster drops a node and
how it recovers afterwards. I would have thought this was uncommon,
but we have dropped a node twice in the last 3 days and need better
monitoring of what happens.

Bug: T117284
Change-Id: I69788c5455115b5aa54167facfcb2dd83954e0bc
---
M modules/elasticsearch/files/monitor/wmfelastic.py
1 file changed, 26 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/43/259443/1

diff --git a/modules/elasticsearch/files/monitor/wmfelastic.py b/modules/elasticsearch/files/monitor/wmfelastic.py
index 15aa081..8a5d8f2 100644
--- a/modules/elasticsearch/files/monitor/wmfelastic.py
+++ b/modules/elasticsearch/files/monitor/wmfelastic.py
@@ -37,8 +37,18 @@
 
         self.endpoints = {
             'node': '_nodes/_local/stats',
-            'cluster': '_cluster/stats',
+            'cluster_stats': '_cluster/stats',
+            'cluster_health': '_cluster/health',
         }
+
+        # Metrics provided at cluster level
+        # _cluster/health
+        self.health_metrics = [
+            "delayed_unassigned_shards",
+            "unassigned_shards",
+            "initializing_shards",
+            "relocating_shards",
+        ]
 
         # Metrics provided at cluster level
         # _cluster/stats
@@ -179,8 +189,19 @@
     def dict_path(self, m, sep='.'):
         return m.split(sep)
 
+    def cluster_health(self):
+        chealth = self._get(self.endpoints['cluster_health'])
+        gmetrics = {}
+        for metric in self.health_metrics:
+            try:
+                gmetrics[metric] = chealth[metric]
+            except KeyError:
+                self.errors += 1
+                pass
+        return gmetrics
+
     def cluster_stats(self):
-        cstats = self._get(self.endpoints['cluster'])
+        cstats = self._get(self.endpoints['cluster_stats'])
         gmetrics = {}
         for m in self.cluster_metrics:
             depth = self.dict_path(m)
@@ -223,6 +244,9 @@
             cluster_stats = self.cluster_stats()
             for metric, value in cluster_stats.iteritems():
                 self.publish(metric, value)
+            health_stats = self.cluster_health()
+            for metric, value in health_stats.iteritems():
+                self.publish(metric, value)
 
             # Remaining fall under the hostname context
             self.config['path_prefix'] = self.o_prefix
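
For readers who want to try the extraction step outside the collector, here is a minimal standalone sketch of what the new cluster_health() method appears to do: pick a fixed set of keys out of a decoded _cluster/health response and count any that are missing. The names HEALTH_METRICS and extract_health_metrics, and the sample response, are assumptions made for illustration; only the four metric keys come from the patch.

```python
# Metric keys taken from the patch; every other name here is hypothetical.
HEALTH_METRICS = [
    "delayed_unassigned_shards",
    "unassigned_shards",
    "initializing_shards",
    "relocating_shards",
]


def extract_health_metrics(health, wanted=HEALTH_METRICS):
    """Return (metrics, missing) for the keys listed in `wanted`.

    `health` is assumed to be the decoded JSON body of a
    _cluster/health response, i.e. a plain dict.
    """
    metrics = {}
    missing = 0
    for name in wanted:
        if name in health:
            metrics[name] = health[name]
        else:
            # Like the patch, count the gap and keep going.
            missing += 1
    return metrics, missing


# Made-up sample for illustration, not real cluster output.
# It deliberately omits one of the wanted keys.
sample = {
    "status": "yellow",
    "relocating_shards": 2,
    "initializing_shards": 4,
    "unassigned_shards": 10,
}
metrics, missing = extract_health_metrics(sample)
```

Returning the missing count instead of mutating shared state keeps the sketch easy to test; the collector in the patch increments self.errors instead.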

-- 
To view, visit https://gerrit.wikimedia.org/r/259443
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I69788c5455115b5aa54167facfcb2dd83954e0bc
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: EBernhardson <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
