On Wed, Mar 24, 2021 at 1:24 PM Alan G <alan+ov...@griff.me.uk> wrote:

> Looking back in the logs, in fact the first error we get is Out of memory.
> So it seems we're hitting
> https://bugzilla.redhat.com/show_bug.cgi?id=1623851
>
> It's not clear from the ticket. Is there an explicit fix for this in 4.4,
> or did the problem just kind of go away?
>

If it is the described issue, the problem seems to go away in 4.4. The
reason might be a newer kernel and libnl3.
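
For background, libnl tags every request with an incrementing sequence
number and checks it against the kernel's reply; once a stray or reordered
reply lands on the socket, later calls fail with NLE_SEQ_MISMATCH, which
vdsm surfaces as the "[Errno 16] Message sequence number mismatch" in your
traceback. Below is a minimal ctypes sketch of that per-socket check --
purely illustrative, not vdsm code, assuming libnl-3 is available as
libnl-3.so.200:

    from ctypes import CDLL, c_int, c_void_p

    # Illustration only. libnl-3 verifies that each kernel reply carries
    # the sequence number of the request that produced it; a socket whose
    # counter has drifted fails every call with NLE_SEQ_MISMATCH (16).
    LIBNL = CDLL('libnl-3.so.200')
    LIBNL.nl_socket_alloc.restype = c_void_p
    LIBNL.nl_connect.argtypes = [c_void_p, c_int]
    LIBNL.nl_socket_disable_seq_check.argtypes = [c_void_p]
    LIBNL.nl_socket_free.argtypes = [c_void_p]

    NETLINK_ROUTE = 0  # protocol constant from <linux/netlink.h>

    sock = LIBNL.nl_socket_alloc()
    if LIBNL.nl_connect(sock, NETLINK_ROUTE) != 0:
        raise RuntimeError('nl_connect() failed')
    # For ad-hoc debugging the check can be switched off per socket.
    # That masks the symptom rather than fixing the desync, so treat it
    # as a diagnostic aid, not a workaround vdsm itself uses.
    LIBNL.nl_socket_disable_seq_check(sock)
    LIBNL.nl_socket_free(sock)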




>
>
>
> ---- On Wed, 24 Mar 2021 11:18:57 +0000 Alan G <alan+ov...@griff.me.uk> wrote ----
>
> Hi,
>
> I sent this a while back and never got a response. We've since upgraded to
> 4.3 and the issue persists.
>
> 2021-03-24 10:53:48,934+0000 ERROR (periodic/2) [virt.periodic.Operation] <vdsm.virt.sampling.HostMonitor object at 0x7f5964398350> operation failed (periodic:188)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
>     self._func()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
>     stats = hostapi.get_stats(self._cif, self._samples.stats())
>   File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
>     decStats = stats.produce(first_sample, last_sample)
>   File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
>     stats.update(get_interfaces_stats())
>   File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
>     return net_api.network_stats()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
>     return netstats.report()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
>     stats = link_stats.report()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
>     for iface_properties in iface.list():
>   File "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
>     for properties in itertools.chain(link.iter_links(), dpdk_links):
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 47, in iter_links
>     with _nl_link_cache(sock) as cache:
>   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
>     return self.gen.next()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 108, in _cache_manager
>     cache = cache_allocator(sock)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 157, in _rtnl_link_alloc_cache
>     return libnl.rtnl_link_alloc_cache(socket, AF_UNSPEC)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 578, in rtnl_link_alloc_cache
>     raise IOError(-err, nl_geterror(err))
> IOError: [Errno 16] Message sequence number mismatch
>
> This occurs on both nodes in the cluster. A restart of vdsm/supervdsm will
> sort it for a while, but within 24 hours it occurs again. We run a number
> of clusters and it only occurs on one, so it must be some specific corner
> case we're triggering.
>
> I can find almost no information on this. The best I could find was this
> https://linuxlizard.com/2020/10/18/message-sequence-number-mismatch-in-libnl/
> which details a sequence number issue. I'm guessing I'm experiencing the
> same issue: the netlink sequence numbers are getting out of sync, and
> closing/re-opening the netlink socket (i.e. restarting vdsm) is the only
> way to resolve it. The sketch below shows the kind of in-place recovery
> I'd hope for instead.
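>
> Purely hypothetical sketch of the recovery I have in mind -- none of
> this is actual vdsm code, and open_sock/close_sock/read_stats are
> stand-ins for vdsm's netlink helpers:
>
>     NLE_SEQ_MISMATCH = 16  # libnl error code behind IOError errno 16
>
>     def read_with_recovery(open_sock, close_sock, read_stats):
>         for attempt in (1, 2):
>             sock = open_sock()
>             try:
>                 return read_stats(sock)
>             except IOError as err:
>                 # Retry exactly once, and only on a sequence mismatch;
>                 # a fresh socket starts with a fresh sequence counter.
>                 if err.errno != NLE_SEQ_MISMATCH or attempt == 2:
>                     raise
>             finally:
>                 close_sock(sock)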
>
> I've completely hit a brick wall with it. We've had to disable fencing on
> both nodes as sometimes they get erroneously fenced when vdsm stops
> functioning correctly. At this point I'm thinking about replacing the
> servers with different models in case it's something in the NIC drivers...
>
> Alan
>
>
> ---- On Mon, 06 Jan 2020 10:54:52 +0000 Alan G <alan+ov...@griff.me.uk> wrote ----
>
> Hi,
>
> I have issues with one host where supervdsm is failing in network_caps.
>
> I see the following trace in the log.
>
> MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
>     res = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
>     return netswitch.configurator.netcaps(compatibility=30600)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
>     net_caps = netinfo(compatibility=compatibility)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
>     _netinfo = netinfo_get(vdsmnets, compatibility)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
>     return _stringify_mtus(_get(vdsmnets))
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
>     ipaddrs = getIpAddrs()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
>     for addr in nl_addr.iter_addrs():
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
>     with _nl_addr_cache(sock) as addr_cache:
>   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
>     return self.gen.next()
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
>     cache = cache_allocator(sock)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
>     raise IOError(-err, nl_geterror(err))
> IOError: [Errno 16] Message sequence number mismatch
>
> A restart of supervdsm will resolve the issue for a period, maybe 24
> hours, then it will occur again. So I'm thinking it's resource exhaustion
> or a leak of some kind? Something like the rough sketch below is how I
> plan to watch supervdsm between failures to test that theory.
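>
>     # Rough, hypothetical monitoring sketch -- not vdsm code. It polls
>     # supervdsmd's resident memory and open-fd count from /proc; a
>     # steady climb over the ~24h window between failures would support
>     # the leak theory. Assumes the daemon matches the name 'supervdsmd'
>     # and that this runs as root (reading /proc/<pid>/fd needs it).
>     import os
>     import subprocess
>     import time
>
>     pid = subprocess.check_output(
>         ['pgrep', '-f', 'supervdsmd']).split()[0].decode()
>
>     while True:
>         with open('/proc/%s/status' % pid) as status:
>             rss = next(line.strip() for line in status
>                        if line.startswith('VmRSS'))
>         nfds = len(os.listdir('/proc/%s/fd' % pid))
>         print('%s  %s  open fds: %d'
>               % (time.strftime('%Y-%m-%d %H:%M:%S'), rss, nfds))
>         time.sleep(600)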
>
> Running 4.2.8.2 with VDSM at 4.20.46.
>
> I've had a look through Bugzilla and can't find an exact match; the
> closest was https://bugzilla.redhat.com/show_bug.cgi?id=1666123, which
> seems to be an RHV-only fix.
>
> Thanks,
>
> Alan
>


-- 

Ales Musil

Software Engineer - RHV Network

Red Hat EMEA <https://www.redhat.com>

amu...@redhat.com    IM: amusil
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/COFOPDWFXQ3GQ7A2BAM73FR4ZDBTU5FS/
