Faidon has submitted this change and it was merged.

Change subject: authdns: make the Ganglia plugin more resilient
......................................................................
authdns: make the Ganglia plugin more resilient

The graphs currently have a lot of spikes. The theory is that this
happens because the plugin occasionally returns 0 for values, making
Ganglia misdetect the rate of change.

The first problem is that the plugin currently makes an HTTP request
for every metric, which is kind of silly (this is due to a limitation
of Ganglia's python module). This could aggravate the problem by
making a potential fetching failure appear much more often than it
otherwise would. So, we add a caching layer, fetching metrics at most
once a second (which is how often gdnsd updates stats anyway).

Since requests are now made much more rarely, we can afford to repeat
them if they fail for whatever reason. Two attempts should make the
occurrence of a failure much rarer.

Finally, add a grace period of 15s (the plugin's refresh time) during
which we'll serve stale data even if all fetch attempts failed.

Change-Id: I94bbcf6db6f6389cd7fdbc500b7a16441b448dc3
---
M modules/authdns/files/ganglia/ganglia_gdnsd.py
1 file changed, 42 insertions(+), 4 deletions(-)

Approvals:
  Faidon: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/modules/authdns/files/ganglia/ganglia_gdnsd.py b/modules/authdns/files/ganglia/ganglia_gdnsd.py
index 97792aa..b3453b7 100644
--- a/modules/authdns/files/ganglia/ganglia_gdnsd.py
+++ b/modules/authdns/files/ganglia/ganglia_gdnsd.py
@@ -22,6 +22,7 @@
 import urllib2
 import json
+import time
 
 
 CONF = {
@@ -50,6 +51,10 @@
     'tcp_sendfail': 'DNS TCP sendfail',
     'tcp_recvfail': 'DNS TCP recvfail',
 }
+CACHE = {
+    'time': 0,
+    'data': {},
+}
 
 
 def build_desc(skel, prop):
@@ -77,8 +82,41 @@
         data = response.read()
         response.close()
         metrics = json.loads(data)
-    except (urllib2.URLError, KeyError):
+    except Exception:  # pylint: disable-msg=W0703
+        # Could be URLError, HTTPError, HTTPException or ValueError (from json)
+        # doesn't matter why, as Ganglia won't propagate a message.
+        # pass, i.e. just return {}.
         pass
+
+    return metrics
+
+
+def fetch_metrics_cached(url=CONF['stats_url']):
+    """Fetches, decodes and caches metrics from gdnsd.
+    Fetches at most once a second, otherwise serving from the cache.
+    Tries to fetch twice, if the first attempt failed.
+    Serves stale data up to 15s old if both attempts failed.
+
+    :param url: URL for gdnsd's json output
+    :returns: decoded dict
+    """
+    # fetch at most once a second; especially useful considering that
+    # the callback gets called for every single metric independently
+    if time.time() - CACHE['time'] < 1 and CACHE['data']:
+        return CACHE['data']
+
+    metrics = fetch_metrics(url)
+    # failed, try once more
+    if not metrics:
+        metrics = fetch_metrics(url)
+
+    if metrics:
+        CACHE['time'] = time.time()
+        CACHE['data'] = metrics
+    else:
+        # failed twice, return cached data up to 15s to avoid dives/spikes
+        if time.time() - CACHE['time'] <= 15:
+            metrics = CACHE['data']
 
     return metrics
 
@@ -89,12 +127,12 @@
     :param name: metric name
     :returns: metric value
     """
-    raw = fetch_metrics()
+    raw = fetch_metrics_cached()
     try:
         _, category, metric = name.split('_', 2)
         val = raw[category][metric]
     except KeyError:
-        val = 0
+        val = None
 
     return val
 
@@ -116,7 +154,7 @@
         'groups': CONF['groups'],
     }
 
-    raw = fetch_metrics()
+    raw = fetch_metrics_cached()
     descriptors = []
     for category in raw:
         try:

-- 
To view, visit https://gerrit.wikimedia.org/r/80557
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I94bbcf6db6f6389cd7fdbc500b7a16441b448dc3
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Faidon <fai...@wikimedia.org>
Gerrit-Reviewer: Faidon <fai...@wikimedia.org>
Gerrit-Reviewer: jenkins-bot

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
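For anyone who wants to experiment with the cache/retry/grace pattern this change introduces, here is a minimal, self-contained sketch. The `fetch` callable and the injectable `now` clock are hypothetical stand-ins (the real plugin calls its HTTP fetcher and `time.time()` directly); the 1 s fetch interval and 15 s grace period mirror the diff.

```python
import time

# Module-level cache mirroring the CACHE dict in the diff: 'time' is the
# timestamp of the last successful fetch, 'data' the decoded metrics.
CACHE = {'time': 0, 'data': {}}


def fetch_metrics_cached(fetch, now=time.time):
    """Return metrics from `fetch`, doing at most one real fetch per second.

    Retries once if the first attempt returns nothing, then falls back to
    stale cached data for up to 15 seconds before giving up with {}.
    `fetch` and `now` are parameters here only to make the sketch testable.
    """
    # Serve from the cache if the last successful fetch is under 1s old.
    if now() - CACHE['time'] < 1 and CACHE['data']:
        return CACHE['data']

    metrics = fetch()
    if not metrics:
        # first attempt failed, try once more
        metrics = fetch()

    if metrics:
        CACHE['time'] = now()
        CACHE['data'] = metrics
    elif now() - CACHE['time'] <= 15:
        # Both attempts failed: serve stale data within the 15s grace
        # period, so Ganglia doesn't see a bogus 0 (dives/spikes).
        metrics = CACHE['data']
    return metrics
```

Only after the grace period expires does the function return an empty dict, which is what lets the plugin report `None` instead of a misleading 0 for a brief fetch hiccup.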