Hello Dan Kenigsberg, Francesco Romani, Martin Polednik, I'd like you to do a code review. Please visit
https://gerrit.ovirt.org/59996 to review the following change. Change subject: sampling: periodic: improve logging ...................................................................... sampling: periodic: improve logging We received reports of incorrect monitoring, e.g. VMs set unresponsive, or periodic monitoring being slow to get back to normal after a disruption. Analysis of post-mortem logs provided nothing conclusive; actually Vdsm seems to have reacted well under the disruption. Still, we want to make sure everything is right. This patch refactors the bulk stats sampling logging, to add more information and to move it outside the critical section. We know logging is a controversial matter. Too much logging is unpractical and taxing over the host. Too little logging makes troubleshooting very hard, especially for monitoring issues that are most often detected and reported much after the incident happen. We still log once per operation, so four entries per minute, which is definitely bearable. Change-Id: I1e52a9c356be212ec72aace0f8d8aa5f1580ccb3 Bug-Url: https://bugzilla.redhat.com/1351247 Backport-To: 4.0 Backport-To: 3.6 Signed-off-by: Francesco Romani <from...@redhat.com> Reviewed-on: https://gerrit.ovirt.org/58972 Reviewed-by: Milan Zamazal <mzama...@redhat.com> Reviewed-by: Martin Polednik <mpoled...@redhat.com> Continuous-Integration: Jenkins CI Reviewed-by: Dan Kenigsberg <dan...@redhat.com> --- M lib/vdsm/virt/sampling.py 1 file changed, 8 insertions(+), 1 deletion(-) git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/96/59996/1 diff --git a/lib/vdsm/virt/sampling.py b/lib/vdsm/virt/sampling.py index b49666b..4cfd3b1 100644 --- a/lib/vdsm/virt/sampling.py +++ b/lib/vdsm/virt/sampling.py @@ -485,11 +485,13 @@ self._log = logging.getLogger("virt.sampling.VMBulkSampler") def __call__(self): + log_status = True timestamp = self._stats_cache.clock() acquired = self._sampling.acquire(blocking=False) # we are deep in the hot path. bool(ExpiringCache) # *is* costly so we should avoid it if we can. fast_path = acquired and not self._skip_doms + doms = [] # whitelist, meaningful only in the slow path try: if fast_path: # This is expected to be the common case. @@ -499,7 +501,6 @@ # A previous call got stuck, or not every domain # has properly recovered. Thus we must whitelist domains. doms = self._get_responsive_doms() - self._log.debug('sampling %d domains', len(doms)) if doms: bulk_stats = self._conn.domainListGetStats( doms, self._stats_flags) @@ -507,11 +508,17 @@ bulk_stats = [] except Exception: self._log.exception("vm sampling failed") + log_status = False else: self._stats_cache.put(_translate(bulk_stats), timestamp) finally: if acquired: self._sampling.release() + if log_status: + self._log.debug( + 'sampled timestamp %r elapsed %.3f acquired %r domains %s', + timestamp, self._stats_cache.clock() - timestamp, acquired, + 'all' if fast_path else len(doms)) def _get_responsive_doms(self): vms = self._get_vms() -- To view, visit https://gerrit.ovirt.org/59996 To unsubscribe, visit https://gerrit.ovirt.org/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I1e52a9c356be212ec72aace0f8d8aa5f1580ccb3 Gerrit-PatchSet: 1 Gerrit-Project: vdsm Gerrit-Branch: ovirt-4.0 Gerrit-Owner: Milan Zamazal <mzama...@redhat.com> Gerrit-Reviewer: Dan Kenigsberg <dan...@redhat.com> Gerrit-Reviewer: Francesco Romani <from...@redhat.com> Gerrit-Reviewer: Martin Polednik <mpoled...@redhat.com> _______________________________________________ vdsm-patches mailing list vdsm-patches@lists.fedorahosted.org https://lists.fedorahosted.org/admin/lists/vdsm-patches@lists.fedorahosted.org