On Wed, Mar 24, 2021 at 1:24 PM Alan G <alan+ov...@griff.me.uk> wrote:
> Looking back in the logs, in fact the first error we get is Out of memory.
> So it seems we're hitting
> https://bugzilla.redhat.com/show_bug.cgi?id=1623851
>
> It's not clear from the ticket. Is there an explicit fix for this in 4.4,
> or the problem just kind of went away?
>

If it is the described issue, the problem seems to go away in 4.4. The reason
might be a newer kernel and libnl3.

>
>
> ---- On Wed, 24 Mar 2021 11:18:57 +0000 *Alan G <alan+ov...@griff.me.uk
> <alan%2bov...@griff.me.uk>>* wrote ----
>
> Hi,
>
> I sent this a while back and never got a response. We've since upgraded to
> 4.3 and the issue persists.
>
> 2021-03-24 10:53:48,934+0000 ERROR (periodic/2) [virt.periodic.Operation]
> <vdsm.virt.sampling.HostMonitor object at 0x7f5964398350> operation failed
> (periodic:188)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
>     self._func()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
>     stats = hostapi.get_stats(self._cif, self._samples.stats())
>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
>     decStats = stats.produce(first_sample, last_sample)
>   File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
>     stats.update(get_interfaces_stats())
>   File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
>     return net_api.network_stats()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
>     return netstats.report()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
>     stats = link_stats.report()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
>     for iface_properties in iface.list():
>   File "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
>     for properties in itertools.chain(link.iter_links(), dpdk_links):
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 47, in iter_links
>     with _nl_link_cache(sock) as cache:
>   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
>     return self.gen.next()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 108, in _cache_manager
>     cache = cache_allocator(sock)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 157, in _rtnl_link_alloc_cache
>     return libnl.rtnl_link_alloc_cache(socket, AF_UNSPEC)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 578, in rtnl_link_alloc_cache
>     raise IOError(-err, nl_geterror(err))
> IOError: [Errno 16] Message sequence number mismatch
>
> This occurs on both nodes in the cluster. A restart of vdsm/supervdsm will
> sort it for a while, but within 24 hours it occurs again. We run a number
> of clusters and it only occurs on one, so it must be some specific corner
> case we're triggering.
>
> I can find almost no information on this. The best I could find was this
> https://linuxlizard.com/2020/10/18/message-sequence-number-mismatch-in-libnl/
> which details a sequence number issue. I'm guessing I'm experiencing the
> same issue in that the nl sequence numbers are getting out of sync and
> closing/re-opening the nl socket (aka restart vdsm) is the only way to
> resolve.
>
> I've completely hit a brick wall with it. We've had to disable fencing on
> both nodes as sometimes they get erroneously fenced when vdsm stops
> functioning correctly. At this point I'm thinking about replacing the
> servers with different models in case it's something in the NIC drivers...
>
> Alan
>
>
> ---- On Mon, 06 Jan 2020 10:54:52 +0000 *Alan G <alan+ov...@griff.me.uk
> <alan+ov...@griff.me.uk>>* wrote ----
>
> Hi,
>
> I have issues with one host where supervdsm is failing in network_caps.
>
> I see the following trace in the log.
>
> MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
>     res = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
>     return netswitch.configurator.netcaps(compatibility=30600)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
>     net_caps = netinfo(compatibility=compatibility)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
>     _netinfo = netinfo_get(vdsmnets, compatibility)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
>     return _stringify_mtus(_get(vdsmnets))
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
>     ipaddrs = getIpAddrs()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
>     for addr in nl_addr.iter_addrs():
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
>     with _nl_addr_cache(sock) as addr_cache:
>   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
>     return self.gen.next()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
>     cache = cache_allocator(sock)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
>     raise IOError(-err, nl_geterror(err))
> IOError: [Errno 16] Message sequence number mismatch
>
> A restart of supervdsm will resolve the issue for a period, maybe 24
> hours, then it will occur again. So I'm thinking it's resource exhaustion
> or a leak of some kind?
>
> Running 4.2.8.2 with VDSM at 4.20.46.
>
> I've had a look through the bugzilla and can't find an exact match,
> closest was this one https://bugzilla.redhat.com/show_bug.cgi?id=1666123
> which seems to be a RHV only fix.
>
> Thanks,
>
> Alan
>
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/4YGTPGGNZJ3JT4Z6ZPIQOPPD73WRG72E/
>

--
Ales Musil
Software Engineer - RHV Network
Red Hat EMEA <https://www.redhat.com>
amu...@redhat.com    IM: amusil
<https://red.ht/sig>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/COFOPDWFXQ3GQ7A2BAM73FR4ZDBTU5FS/
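
For readers hitting the same error: the thread guesses that closing and
re-opening the netlink socket (in practice, restarting vdsm) is what clears
the sequence number mismatch. Below is a minimal Python sketch of that idea,
retrying a query on a fresh socket when the error number is 16. It is not
vdsm code, and the thread does not establish that it would help here.
run_with_fresh_socket, open_socket, query, FakeSocket and fake_query are all
hypothetical names made up for illustration; only the error number and the
message text come from the tracebacks above.

```python
# Error number carried by the IOError in the tracebacks:
# "[Errno 16] Message sequence number mismatch".
NLE_SEQ_MISMATCH = 16


def run_with_fresh_socket(open_socket, query, attempts=3):
    """Run query(sock) on a new socket, retrying on a sequence mismatch.

    open_socket and query are hypothetical callables: the first returns an
    object with a close() method, the second uses it and returns plain data.
    Any error other than a sequence mismatch is re-raised immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sock = open_socket()
        try:
            return query(sock)
        except IOError as e:
            if e.errno != NLE_SEQ_MISMATCH:
                raise
            last_error = e
        finally:
            # Always drop the socket so the next attempt starts clean.
            sock.close()
    raise last_error


class FakeSocket:
    """Stand-in for a netlink socket; counts how many were opened."""

    opened = 0

    def __init__(self):
        FakeSocket.opened += 1
        self.generation = FakeSocket.opened
        self.closed = False

    def close(self):
        self.closed = True


def fake_query(sock):
    """Simulated query that fails on the first two sockets opened."""
    if sock.generation < 3:
        raise IOError(NLE_SEQ_MISMATCH, "Message sequence number mismatch")
    return ["lo", "eth0"]
```

Calling run_with_fresh_socket(FakeSocket, fake_query) opens three fake
sockets and returns the interface list from the third. Other errors, such as
the Out of memory error mentioned at the top of the thread, are deliberately
not retried, since a retry would only hide them.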