Hello Dan Kenigsberg, Martin Polednik, I'd like you to do a code review. Please visit
https://gerrit.ovirt.org/62418 to review the following change. Change subject: vm: periodic: fix stats age reporting ...................................................................... vm: periodic: fix stats age reporting We need to report the Vm responsiveness, because it is one important indicator of the Vm health. This information can trigger the administator intervention, or automatic corrective action (e.g. Vm migrations). The Vm class reports the monitor responsiveness in two ways: 1. if a libvirt operation reports a timeout (a standard libvirt exception) 2. if one sampling operation hangs for too long In the second case, given the transient nature of the detected failure, the responsiveness is reported, but not stored as Vm field. This is noteworthy, but totally expected and legitimate. The sampling code stores the bulk samples only for a limited number of cycles, and drops the old values. The sampling code also used to calculate the sample age based on the stored values. But if a VM hangs for too long, all its samples are eventually evicted from the samples cache, so its stats age becomes None, which is misinterpred as responsive VM. We should keep reporting None when we don't have enough meaningful data, but we should also fix the sampling age reporting; To do so, we leverage the last recorded vm timestamp. This timestamp is needed by the flow, so it is guaranteed to be present. This patch leaves two potential races open. We do so because we think both are insignificant: 1. the sampling stats cache is initialized in Vm._domDepedentInit(), so if a VM takes too long (> vars.vm_command_timeout, default 60s) to get created, it will be reported as unresponsive until the domain is ready. This is actually a more correct representation of reality. 2. if the powerdown flow takes too long (same as #1), again the VM will be reported as unresponsive until the domain is completely cleaned up. This is also a more thrutful reporting. Change-Id: I3e61626625a2e0517d55dc61e361f3f5eb690c00 Bug-Url: https://bugzilla.redhat.com/1360986 Backport-To: 4.0 Backport-To: 3.6 Signed-off-by: Francesco Romani <from...@redhat.com> Reviewed-on: https://gerrit.ovirt.org/61310 Continuous-Integration: Jenkins CI Reviewed-by: Dan Kenigsberg <dan...@redhat.com> Reviewed-on: https://gerrit.ovirt.org/61822 Reviewed-by: Martin Polednik <mpoled...@redhat.com> --- M tests/samplingTests.py M vdsm/virt/sampling.py 2 files changed, 31 insertions(+), 5 deletions(-) git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/18/62418/1 diff --git a/tests/samplingTests.py b/tests/samplingTests.py index 4adb252..10307a3 100644 --- a/tests/samplingTests.py +++ b/tests/samplingTests.py @@ -482,8 +482,10 @@ ({'a': 'foo'}, 1), ({'a': 'bar'}, 2) )) + self.fake_monotonic_time.freeze(value=3) res = self.cache.get('b') self.assertTrue(res.is_empty()) + self.assertEqual(res.stats_age, 3) def test_put_overwrite(self): self._feed_cache(( @@ -520,10 +522,35 @@ # here we lost sampling for 'b' ({'a': 'baz', 'b': 'baz'}, 3), )) + self.fake_monotonic_time.freeze(value=4) self.assertEqual(self.cache.get('a'), - ('bar', 'baz', 1, FakeClock.STEP)) + ('bar', 'baz', 1, 1)) res = self.cache.get('b') self.assertTrue(res.is_empty()) + self.assertEqual(res.stats_age, 1) + + def test_missing_initially(self): + self._feed_cache(( + ({'a': 'foo'}, 1), + ({'a': 'bar', 'b': 'baz'}, 2) + )) + self.fake_monotonic_time.freeze(value=3) + res = self.cache.get('b') + self.assertTrue(res.is_empty()) + self.assertEqual(res.stats_age, 1) + + def test_seen_long_ago(self): + # simulate a sample long evicted from the cache + self.cache.add('a') + # but the cache must not be empty! + self._feed_cache(( + ({'b': 'foo'}, 1), + ({'b': 'foo', 'c': 'bar'}, 2), + )) + self.fake_monotonic_time.freeze(value=101) + res = self.cache.get('a') + self.assertTrue(res.is_empty()) + self.assertEqual(res.stats_age, 100) def _feed_cache(self, samples): for sample in samples: diff --git a/vdsm/virt/sampling.py b/vdsm/virt/sampling.py index 5d50539..dee75c1 100644 --- a/vdsm/virt/sampling.py +++ b/vdsm/virt/sampling.py @@ -453,17 +453,16 @@ """ with self._lock: first_batch, last_batch, interval = self._samples.stats() + stats_age = self._clock() - self._vm_last_timestamp[vmid] if first_batch is None: - return StatsSample(None, None, None, None) + return StatsSample(None, None, None, stats_age) first_sample = first_batch.get(vmid) last_sample = last_batch.get(vmid) if first_sample is None or last_sample is None: - return StatsSample(None, None, None, None) - - stats_age = self._clock() - self._vm_last_timestamp[vmid] + return StatsSample(None, None, None, stats_age) return StatsSample(first_sample, last_sample, interval, stats_age) -- To view, visit https://gerrit.ovirt.org/62418 To unsubscribe, visit https://gerrit.ovirt.org/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I3e61626625a2e0517d55dc61e361f3f5eb690c00 Gerrit-PatchSet: 1 Gerrit-Project: vdsm Gerrit-Branch: ovirt-3.6 Gerrit-Owner: Francesco Romani <from...@redhat.com> Gerrit-Reviewer: Dan Kenigsberg <dan...@redhat.com> Gerrit-Reviewer: Martin Polednik <mpoled...@redhat.com> _______________________________________________ vdsm-patches mailing list vdsm-patches@lists.fedorahosted.org https://lists.fedorahosted.org/admin/lists/vdsm-patches@lists.fedorahosted.org