[ovirt-users] Re: supervdsm failing during network_caps

2021-03-24 Thread Ales Musil
On Wed, Mar 24, 2021 at 1:24 PM Alan G  wrote:

> Looking back in the logs, in fact the first error we get is Out of memory.
> So it seems we're hitting
> https://bugzilla.redhat.com/show_bug.cgi?id=1623851
>
> It's not clear from the ticket. Is there an explicit fix for this in 4.4,
> or did the problem just kind of go away?
>

If it is the described issue, the problem seems to go away in 4.4. The
reason might be a newer kernel and libnl3.





[ovirt-users] Re: supervdsm failing during network_caps

2021-03-24 Thread Alan G
Looking back in the logs, the first error we get is in fact Out of memory. So
it seems we're hitting https://bugzilla.redhat.com/show_bug.cgi?id=1623851

It's not clear from the ticket. Is there an explicit fix for this in 4.4, or
did the problem just kind of go away?

[ovirt-users] Re: supervdsm failing during network_caps

2021-03-24 Thread Alan G
Hi,

I sent this a while back and never got a response. We've since upgraded to 4.3
and the issue persists.

2021-03-24 10:53:48,934+ ERROR (periodic/2) [virt.periodic.Operation] operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
    stats = hostapi.get_stats(self._cif, self._samples.stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
    decStats = stats.produce(first_sample, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
    stats.update(get_interfaces_stats())
  File "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
    return net_api.network_stats()
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
    return netstats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
    stats = link_stats.report()
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
    for iface_properties in iface.list():
  File "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
    for properties in itertools.chain(link.iter_links(), dpdk_links):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 47, in iter_links
    with _nl_link_cache(sock) as cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 108, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 157, in _rtnl_link_alloc_cache
    return libnl.rtnl_link_alloc_cache(socket, AF_UNSPEC)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 578, in rtnl_link_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch
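
A note on reading the last line of the traceback: the frame above it raises IOError(-err, nl_geterror(err)), so the 16 in "[Errno 16]" appears to be a libnl error code rather than an OS errno. The sketch below assumes the numbering in libnl's netlink/errno.h (quoted from memory, so worth checking against the installed headers) and shows the two readings side by side. LIBNL_ERRORS and explain are illustrative names, not part of vdsm or libnl.

```python
import errno

# Assumption: a small subset of libnl's error codes from netlink/errno.h,
# listed from memory. Verify against the libnl headers on your host.
LIBNL_ERRORS = {
    5: "NLE_NOMEM",
    16: "NLE_SEQ_MISMATCH",
}


def explain(code):
    """Return both the libnl reading and the OS errno reading of a code."""
    libnl_name = LIBNL_ERRORS.get(code, "unknown")
    os_name = errno.errorcode.get(code, "unknown")
    return "libnl: %s, OS errno: %s" % (libnl_name, os_name)


# The same number means different things depending on who produced it.
print(explain(16))
```

Read this way, the error text "Message sequence number mismatch" matches the libnl name, whatever the OS happens to call errno 16.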


This occurs on both nodes in the cluster. Restarting vdsm/supervdsm sorts it
out for a while, but within 24 hours it occurs again. We run a number of
clusters and it only occurs on one, so it must be some specific corner case
we're triggering.

I can find almost no information on this. The best I could find was
https://linuxlizard.com/2020/10/18/message-sequence-number-mismatch-in-libnl/
which details a sequence number issue. I'm guessing I'm experiencing the same
issue, in that the nl sequence numbers are getting out of sync and
closing/re-opening the nl socket (i.e. restarting vdsm) is the only way to
resolve it.
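
The "close and re-open the socket" workaround described above can be sketched as a retry wrapper. This is only an illustration of the idea, not how vdsm manages its netlink sockets: call_with_fresh_socket, open_socket, operation, FakeSocket and flaky_dump are hypothetical stand-ins, and the code 16 is taken from the traceback in this thread. At best it would paper over the symptom; it does not explain why the sequence numbers drift.

```python
# Error code seen in the traceback ("[Errno 16] Message sequence number mismatch").
SEQ_MISMATCH = 16


def call_with_fresh_socket(open_socket, operation, retries=1):
    """Run operation(sock) on a newly opened socket.

    On a sequence-number mismatch, discard the socket and try again with a
    fresh one, up to `retries` extra attempts. Other errors propagate.
    """
    attempt = 0
    while True:
        sock = open_socket()
        try:
            return operation(sock)
        except IOError as e:
            if e.errno != SEQ_MISMATCH or attempt >= retries:
                raise
            attempt += 1
        finally:
            # Always release the socket, whether we return, retry or raise.
            sock.close()


class FakeSocket(object):
    """Hypothetical stand-in for a netlink socket, used only for the demo."""
    opened = 0

    def __init__(self):
        FakeSocket.opened += 1
        self.closed = False

    def close(self):
        self.closed = True


calls = []


def flaky_dump(sock):
    """Fail with a sequence mismatch on the first call, succeed afterwards."""
    calls.append(sock)
    if len(calls) == 1:
        raise IOError(SEQ_MISMATCH, "Message sequence number mismatch")
    return ["lo", "eth0"]


links = call_with_fresh_socket(FakeSocket, flaky_dump)
print(links, FakeSocket.opened)
```

In the demo the first attempt fails, the first socket is closed, and the second attempt succeeds on a new socket.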

I've completely hit a brick wall with it. We've had to disable fencing on both
nodes as sometimes they get erroneously fenced when vdsm stops functioning
correctly. At this point I'm thinking about replacing the servers with different
models in case it's something in the NIC drivers...

Alan



On Mon, 06 Jan 2020 10:54:52 + Alan G wrote

Hi,

I have issues with one host where supervdsm is failing in network_caps.

I see the following trace in the log.

MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
    net_caps = netinfo(compatibility=compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
    _netinfo = netinfo_get(vdsmnets, compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
    return _stringify_mtus(_get(vdsmnets))
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
    ipaddrs = getIpAddrs()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
    for addr in nl_addr.iter_addrs():
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
    with _nl_addr_cache(sock) as addr_cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cac