So it looks like my main issue and the one mentioned in the PS section
may actually be one and the same.

I've now run an experiment where I monitored the normal leakage over
30 minutes, which gave me around 1.1GiB of extra kmalloc-256.
I then repeated the experiment with a firewall (good old iptables)
dropping all non-ICMP traffic headed to the OVN router addresses; this
reduced the leakage to just 89MiB.
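
For reference, the firewall rules were conceptually along these lines
(placeholder addresses; the exact table/chain to use depends on where
the traffic enters the host):

```
# Drop everything but ICMP/ICMPv6 headed to the OVN router's external
# addresses (203.0.113.10 and 2001:db8::10 are placeholders).
iptables  -t raw -I PREROUTING -d 203.0.113.10 ! -p icmp      -j DROP
ip6tables -t raw -I PREROUTING -d 2001:db8::10 ! -p ipv6-icmp -j DROP
```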

So it looks like that traffic gets stuck in a recirculation loop in
OVN/OVS for a while, triggering the "openvswitch: ovs-system:
deferred action limit reached, drop recirc action" log entry and
leaking a bunch of memory in the process. We did attempt to pin the
issue on this cause earlier by generating some traffic which would hit
the recirc loop, but that didn't cause a very visible increase at the
time. Given the new data, our attempt at manually triggering it on an
otherwise unused network must have been flawed somehow.
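
For what it's worth, hitting the loop itself only takes sending some
non-ICMP traffic at a router's external address, e.g. something like
the following (placeholder address and port; not necessarily what we
used back then):

```
# A plain TCP connection attempt to the router's external address is
# enough to enter the recirculation loop (203.0.113.10 is a placeholder).
nc -z -w 2 203.0.113.10 80
```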

The amount of traffic dropped by the firewall is also quite small,
which suggests that things could be significantly worse in busier
environments.
In this case I saw a total of 3478 packets dropped, for a total of 183852 bytes.
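
As a rough back-of-envelope (estimates only): ~1GiB of extra
kmalloc-256 attributable to ~3478 dropped packets over 30 minutes
works out to roughly 300KiB of slab growth per offending packet, i.e.
around 400 leaked objects each at the 0.75K debug-padded object size,
which would fit a packet bouncing through a large number of
recirculations before being dropped.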

We'll be doing a bit more digging on the OVN side to see what may be
causing this and will report back.

Stéphane

On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <stgra...@ubuntu.com> wrote:
>
> Hello,
>
> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
> 5.17.1 mainline kernel.
>
> I'm trying to debug a very very problematic kernel memory leak which
> is happening in this environment.
> Sadly I don't have a clear reproducer or much idea of when it first
> appeared. I went through about a year of metrics and some amount of
> leakage may always have been present, but its magnitude has recently
> grown to the point that my normal weekly server maintenance is no
> longer frequent enough to take care of it.
>
> Basically what I'm doing is running a LXD + OVN setup on 3 servers;
> they all act as OVN chassis with various priorities to spread the
> load.
> All 3 servers also normally run a combination of containers and
> virtual machines attached to about a dozen different OVN networks.
>
> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
> which, after enabling slub debugging, can be tracked down to
> nf_ct_tmpl_alloc kernel calls, such as those made by the openvswitch
> kernel module as part of its ovs_ct_copy_action function, which is
> exposed to userspace as OVS_ACTION_ATTR_CT through the openvswitch
> netlink API.
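>
> (For context on how that tracking is obtained: the alloc/free traces
> shown below require SLUB user tracking. A minimal sketch, assuming
> CONFIG_SLUB_DEBUG is built in, is to boot with something like the
> following kernel command line addition; adjust to your setup.)
>
> ```
> # Record allocation/free stack traces for the kmalloc-256 cache only,
> # which is what populates /sys/kernel/debug/slab/kmalloc-256/*_traces.
> slub_debug=U,kmalloc-256
> ```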
>
> This means that in just a couple of days I'm dealing with just shy of
> 40GiB of those kmalloc-256 entries.
> Here is one of the servers which has been running for just 6 hours:
>
> ```
> root@abydos:~# uptime
>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
>
> root@abydos:~# slabtop -o -s c | head -n10
>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
>
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
> 2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
> 460411 460411 100%    0.57K  16444       28    263104K radix_tree_node
>
> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces |
> sort -rn | head -n5
> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack]
> age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
>  803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408
> pid=0-683331 cpus=0-31 nodes=0-1
>   32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515
> pid=1-731302 cpus=0-31 nodes=0-1
>    3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386
> pid=752-648826 cpus=0-31 nodes=0-1
>    2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698
> cpus=0-31 nodes=0-1
>
> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces |
> sort -rn | head -n5
> 8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
> 2770915 reserve_sfa_size+0xdf/0x110 [openvswitch]
> age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
> 1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033
> cpus=0-31 nodes=0-1
>  288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack]
> age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
>  134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch]
> age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
> ```
>
> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to 
> nf_ct_tmpl_free.
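>
> (For reference, the ~2MB/s figure comes from watching the cache grow
> over time; a minimal way to sample that, not necessarily the exact
> method I used, would be something like:)
>
> ```
> # Log kmalloc-256 usage once a minute for half an hour; in
> # /proc/slabinfo, num_objs * objsize gives the cache footprint in bytes.
> for i in $(seq 30); do
>     date +%s
>     grep '^kmalloc-256 ' /proc/slabinfo
>     sleep 60
> done
> ```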
>
> Things I've done so far to try to isolate things:
>
> 1) I've evacuated all workloads from the server, so the only thing
> running on it is OVS vswitchd. This did not change anything.
> 2) I've added iptables/ip6tables raw table rules marking all traffic
> as NOTRACK (sketched after this list). This did not change anything.
> 3) I've played with the chassis assignment. This does change things in
> that a server with no active chassis will not show any leakage
> (thankfully). The busier the network I move back to the host, the
> faster the leakage.
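>
> (The NOTRACK rules from item 2 above, roughly sketched; the exact
> rule set isn't reproduced here:)
>
> ```
> # Skip connection tracking for all traffic, in both directions.
> iptables  -t raw -I PREROUTING -j NOTRACK
> iptables  -t raw -I OUTPUT     -j NOTRACK
> ip6tables -t raw -I PREROUTING -j NOTRACK
> ip6tables -t raw -I OUTPUT     -j NOTRACK
> ```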
>
>
> I've had both Frode and Tom (Cced) assist with a variety of ideas and
> questions, but while we have found some unrelated OVS and kernel
> issues, we've yet to figure this one out. So I wanted to reach out to
> the wider community to see if anyone has either seen something like
> this before or has suggestions as to where to look next.
>
> I can pretty easily rebuild the kernel, OVS or OVN and while this
> cluster is a production environment, the fact that I can evacuate one
> of the three servers with no user-visible impact makes it not too bad
> to debug. Having to constantly reboot the entire setup to clear the
> memory leak is the bigger annoyance right now :)
>
> Stéphane
>
> PS: The side OVS/kernel issue I'm referring to is
> https://lore.kernel.org/netdev/20220330194244.3476544-1-stgra...@ubuntu.com/
> which is what made it possible to track down an issue where OVN
> logical routers properly respond to ICMP on their external address
> but get into a recirculation loop when any other kind of traffic is
> thrown at them (instead of that traffic being immediately dropped or
> rejected).