Hello Dan Kenigsberg,

I'd like you to do a code review.  Please visit

    https://gerrit.ovirt.org/61822

to review the following change.

Change subject: vm: periodic: fix stats age reporting
......................................................................

vm: periodic: fix stats age reporting

We need to report the Vm responsiveness, because it is one
important indicator of the Vm health. This information can
trigger the administator intervention, or automatic corrective
action (e.g. Vm migrations).

The Vm class reports the monitor responsiveness in two ways:
1. if a libvirt operation reports a timeout (a standard libvirt
exception)
2. if one sampling operation hangs for too long

In the second case, given the transient nature of the detected
failure, the responsiveness is reported, but not stored as Vm
field. This is noteworthy, but totally expected and legitimate.

The sampling code stores the bulk samples only for a limited
number of cycles, and drops the old values.
The sampling code also used to calculate the sample age
based on the stored values.

But if a VM hangs for too long, all its samples are eventually
evicted from the samples cache, so its stats age becomes None,
which is misinterpred as responsive VM.

We should keep reporting None when we don't have enough meaningful
data, but we should also fix the sampling age reporting;
To do so, we leverage the last recorded vm timestamp.
This timestamp is needed by the flow, so it is guaranteed to be present.

This patch leaves two potential races open. We do so because we
think both are insignificant:

1. the sampling stats cache is initialized in Vm._domDepedentInit(),
so if a VM takes too long (> vars.vm_command_timeout, default 60s)
to get created, it will be reported as unresponsive until the domain
is ready. This is actually a more correct representation of reality.
2. if the powerdown flow takes too long (same as #1), again the
VM will be reported as unresponsive until the domain is completely
cleaned up. This is also a more thrutful reporting.

Change-Id: I3e61626625a2e0517d55dc61e361f3f5eb690c00
Backport-To: 4.0
Backport-To: 3.6
Bug-Url: https://bugzilla.redhat.com/1357798
Signed-off-by: Francesco Romani <from...@redhat.com>
Reviewed-on: https://gerrit.ovirt.org/61310
Continuous-Integration: Jenkins CI
Reviewed-by: Dan Kenigsberg <dan...@redhat.com>
---
M lib/vdsm/virt/sampling.py
M tests/samplingTests.py
2 files changed, 31 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/22/61822/1

diff --git a/lib/vdsm/virt/sampling.py b/lib/vdsm/virt/sampling.py
index f84746b..7367cd4 100644
--- a/lib/vdsm/virt/sampling.py
+++ b/lib/vdsm/virt/sampling.py
@@ -418,17 +418,16 @@
         """
         with self._lock:
             first_batch, last_batch, interval = self._samples.stats()
+            stats_age = self._clock() - self._vm_last_timestamp[vmid]
 
             if first_batch is None:
-                return StatsSample(None, None, None, None)
+                return StatsSample(None, None, None, stats_age)
 
             first_sample = first_batch.get(vmid)
             last_sample = last_batch.get(vmid)
 
             if first_sample is None or last_sample is None:
-                return StatsSample(None, None, None, None)
-
-            stats_age = self._clock() - self._vm_last_timestamp[vmid]
+                return StatsSample(None, None, None, stats_age)
 
             return StatsSample(first_sample, last_sample,
                                interval, stats_age)
diff --git a/tests/samplingTests.py b/tests/samplingTests.py
index f0f22e5..3dc1e0b 100644
--- a/tests/samplingTests.py
+++ b/tests/samplingTests.py
@@ -282,8 +282,10 @@
             ({'a': 'foo'}, 1),
             ({'a': 'bar'}, 2)
         ))
+        self.fake_monotonic_time.freeze(value=3)
         res = self.cache.get('b')
         self.assertTrue(res.is_empty())
+        self.assertEqual(res.stats_age, 3)
 
     def test_put_overwrite(self):
         self._feed_cache((
@@ -320,10 +322,35 @@
             # here we lost sampling for 'b'
             ({'a': 'baz', 'b': 'baz'}, 3),
         ))
+        self.fake_monotonic_time.freeze(value=4)
         self.assertEqual(self.cache.get('a'),
-                         ('bar', 'baz', 1, FakeClock.STEP))
+                         ('bar', 'baz', 1, 1))
         res = self.cache.get('b')
         self.assertTrue(res.is_empty())
+        self.assertEqual(res.stats_age, 1)
+
+    def test_missing_initially(self):
+        self._feed_cache((
+            ({'a': 'foo'}, 1),
+            ({'a': 'bar', 'b': 'baz'}, 2)
+        ))
+        self.fake_monotonic_time.freeze(value=3)
+        res = self.cache.get('b')
+        self.assertTrue(res.is_empty())
+        self.assertEqual(res.stats_age, 1)
+
+    def test_seen_long_ago(self):
+        # simulate a sample long evicted from the cache
+        self.cache.add('a')
+        # but the cache must not be empty!
+        self._feed_cache((
+            ({'b': 'foo'}, 1),
+            ({'b': 'foo', 'c': 'bar'}, 2),
+        ))
+        self.fake_monotonic_time.freeze(value=101)
+        res = self.cache.get('a')
+        self.assertTrue(res.is_empty())
+        self.assertEqual(res.stats_age, 100)
 
     def _feed_cache(self, samples):
         for sample in samples:


-- 
To view, visit https://gerrit.ovirt.org/61822
To unsubscribe, visit https://gerrit.ovirt.org/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I3e61626625a2e0517d55dc61e361f3f5eb690c00
Gerrit-PatchSet: 1
Gerrit-Project: vdsm
Gerrit-Branch: ovirt-4.0
Gerrit-Owner: Francesco Romani <from...@redhat.com>
Gerrit-Reviewer: Dan Kenigsberg <dan...@redhat.com>
_______________________________________________
vdsm-patches mailing list
vdsm-patches@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/vdsm-patches@lists.fedorahosted.org

Reply via email to