Faidon has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/80557


Change subject: authdns: make the Ganglia plugin more resilient
......................................................................

authdns: make the Ganglia plugin more resilient

The graphs currently show a lot of spikes. The theory is that the
plugin occasionally returns 0 for values, making Ganglia misdetect the
rate of change.
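
To illustrate the theory, here is a minimal sketch of how a rate derived
from successive counter samples reacts to a single bogus 0. This is not
Ganglia's actual code: the helper name `rates`, the sample values and the
15s sampling interval are all assumptions made up for the example.

```python
def rates(samples, interval=15.0):
    """Naive per-second rate of change between consecutive samples."""
    return [(cur - prev) / interval
            for prev, cur in zip(samples, samples[1:])]


# Hypothetical counter growing by 150 per interval, i.e. a steady 10/s.
healthy = [1000, 1150, 1300, 1450, 1600]
# The same series, with one failed fetch reported as 0.
glitchy = [1000, 1150, 0, 1450, 1600]

healthy_rates = rates(healthy)  # flat line
glitchy_rates = rates(glitchy)  # a dive, then a spike, around the 0
```

A single 0 in the middle of the series distorts two consecutive rate
values: one large negative step into the 0 and one large positive step
out of it.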

The first problem is that the plugin currently makes an HTTP request for
every metric, which is wasteful (this is due to a limitation of
Ganglia's Python module). This can aggravate the problem by making a
fetch failure surface much more often than it otherwise would. So, add
a caching layer that fetches metrics at most once a second (which is
how often gdnsd updates stats anyway).

Since requests now happen far less often, we can afford to repeat them
if they fail for whatever reason. Two attempts should make a failure
much less likely.
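
As a back-of-the-envelope check of this reasoning, the sketch below
assumes that each HTTP fetch fails independently with probability p and
that all metrics in one poll are served from a single cached fetch. Both
p and the metric count n are hypothetical numbers, not measurements from
this plugin, and `p_any_failure` is a name invented for the example.

```python
def p_any_failure(p, fetches):
    """Probability that at least one of `fetches` independent fetches fails."""
    return 1 - (1 - p) ** fetches


p = 0.01  # hypothetical per-request failure probability
n = 20    # hypothetical number of metrics reported per poll

per_metric = p_any_failure(p, n)  # one request per metric, no retry
cached = p_any_failure(p, 1)      # one shared request per poll
cached_retry = p ** 2             # shared request, both attempts must fail
```

With these made-up numbers, the chance of a poll seeing at least one
failed fetch drops from roughly 18% to 1% with caching, and to 0.01%
with one retry on top.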

Finally, add a grace period of 15s (the plugin's refresh time) during
which stale data is served if all fetch attempts fail.

Change-Id: I94bbcf6db6f6389cd7fdbc500b7a16441b448dc3
---
M modules/authdns/files/ganglia/ganglia_gdnsd.py
1 file changed, 42 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/57/80557/1

diff --git a/modules/authdns/files/ganglia/ganglia_gdnsd.py b/modules/authdns/files/ganglia/ganglia_gdnsd.py
index 97792aa..b3453b7 100644
--- a/modules/authdns/files/ganglia/ganglia_gdnsd.py
+++ b/modules/authdns/files/ganglia/ganglia_gdnsd.py
@@ -22,6 +22,7 @@
 
 import urllib2
 import json
+import time
 
 
 CONF = {
@@ -50,6 +51,10 @@
     'tcp_sendfail': 'DNS TCP sendfail',
     'tcp_recvfail': 'DNS TCP recvfail',
 }
+CACHE = {
+    'time': 0,
+    'data': {},
+}
 
 
 def build_desc(skel, prop):
@@ -77,8 +82,41 @@
         data = response.read()
         response.close()
         metrics = json.loads(data)
-    except (urllib2.URLError, KeyError):
+    except Exception:  # pylint: disable-msg=W0703
+        # Could be URLError, HTTPError, HTTPException or ValueError (from json)
+        # doesn't matter why, as Ganglia won't propagate a message.
+        # pass, i.e. just return {}.
         pass
+
+    return metrics
+
+
+def fetch_metrics_cached(url=CONF['stats_url']):
+    """Fetches, decodes and caches metrics from gdnsd.
+    Fetches at most once a second, otherwise serving from the cache.
+    Tries to fetch twice, if the first attempt failed.
+    Serves stale data up to 15s old if both attempts failed.
+
+    :param url: URL for gdnsd's json output
+    :returns: decoded dict
+    """
+    # fetch at most once a second; especially useful considering that
+    # the callback gets called for every single metric independently
+    if time.time() - CACHE['time'] < 1 and CACHE['data']:
+        return CACHE['data']
+
+    metrics = fetch_metrics(url)
+    # failed, try once more
+    if not metrics:
+        metrics = fetch_metrics(url)
+
+    if metrics:
+        CACHE['time'] = time.time()
+        CACHE['data'] = metrics
+    else:
+        # failed twice, return cached data up to 15s to avoid dives/spikes
+        if time.time() - CACHE['time'] <= 15:
+            metrics = CACHE['data']
 
     return metrics
 
@@ -89,12 +127,12 @@
     :param name: metric name
     :returns: metric value
     """
-    raw = fetch_metrics()
+    raw = fetch_metrics_cached()
     try:
         _, category, metric = name.split('_', 2)
         val = raw[category][metric]
     except KeyError:
-        val = 0
+        val = None
     return val
 
 
@@ -116,7 +154,7 @@
         'groups': CONF['groups'],
     }
 
-    raw = fetch_metrics()
+    raw = fetch_metrics_cached()
     descriptors = []
     for category in raw:
         try:

-- 
To view, visit https://gerrit.wikimedia.org/r/80557
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I94bbcf6db6f6389cd7fdbc500b7a16441b448dc3
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Faidon <fai...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
