On 3/31/22 06:24, Stéphane Graber wrote:
> Thanks for the patch!
>
> I've applied it on top of my 5.17.1 and I'm now getting a very very
> large number of:
>
> [  188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3
>
> After just 3 minutes of uptime I'm seeing 230 such lines, but also:
>
> [  108.726680] net_ratelimit: 6283 callbacks suppressed
> [  114.515126] net_ratelimit: 3959 callbacks suppressed
> [  120.523252] net_ratelimit: 4388 callbacks suppressed
> [  127.458617] net_ratelimit: 2992 callbacks suppressed
> [  133.543592] net_ratelimit: 2348 callbacks suppressed
> [  140.099915] net_ratelimit: 5089 callbacks suppressed
> [  145.116922] net_ratelimit: 6136 callbacks suppressed
> [  150.551786] net_ratelimit: 5953 callbacks suppressed
> [  155.555861] net_ratelimit: 5108 callbacks suppressed
> [  162.740087] net_ratelimit: 6507 callbacks suppressed
> [  168.059741] net_ratelimit: 3161 callbacks suppressed
> [  173.092062] net_ratelimit: 4377 callbacks suppressed
> [  178.104068] net_ratelimit: 3090 callbacks suppressed
> [  183.135704] net_ratelimit: 2735 callbacks suppressed
> [  188.616160] net_ratelimit: 2893 callbacks suppressed
> [  193.628158] net_ratelimit: 4149 callbacks suppressed
> [  199.128232] net_ratelimit: 3813 callbacks suppressed
> [  204.136278] net_ratelimit: 2884 callbacks suppressed
> [  211.443642] net_ratelimit: 5322 callbacks suppressed
> [  216.643209] net_ratelimit: 5820 callbacks suppressed
> [  221.648342] net_ratelimit: 4908 callbacks suppressed
> [  226.656369] net_ratelimit: 4533 callbacks suppressed
> [  231.680416] net_ratelimit: 4247 callbacks suppressed
> [  237.172416] net_ratelimit: 2481 callbacks suppressed
>
> I did a 30min test run like earlier and I'm now seeing a tiny increase
> in kmalloc-256 of just 448KB :)
>
> So that fix definitely did the trick to plug that leak!
>
> Tested-by: Stéphane Graber <stgra...@ubuntu.com>
Thanks! That fast flow rotation on a system is not a healthy sign,
though. But at least it doesn't leak memory anymore.

May I ask you to find the datapath flow that triggers the issue?
I wasn't able to reproduce it with OVN system tests, so I'm curious
what it looks like. The flow should have nested clone() or
check_pkt_len() actions with the ct() action inside. 'depth 3'
suggests that it is double nested, i.e. clone(...clone(...ct())) or
clone(...check_pkt_len(...ct())) or something similar. ovs-dpctl can
be used to dump the flows (a purely hypothetical sketch of such a
flow is included further below).

For now, I'll work on a proper patch for the kernel.

Best regards, Ilya Maximets.

>
> Stéphane
>
> On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 3/31/22 03:00, Stéphane Graber wrote:
>>> So it looks like my main issue and the one mentioned in the PS section
>>> may actually be one and the same.
>>>
>>> I've now run an experiment where I monitored the normal leakage over
>>> 30min, which showed around 1.1GiB of extra kmalloc-256.
>>> I then repeated the experiment with a firewall (good old iptables)
>>> dropping all non-ICMP traffic headed to the OVN router addresses; this
>>> reduced the leakage to just 89MiB.
>>>
>>> So it looks like that traffic is getting lost in a redirection loop
>>> for a while in OVN/OVS, triggers the "openvswitch: ovs-system:
>>> deferred action limit reached, drop recirc action" entry, and also
>>> leaks a bunch of memory in the process. We did attempt to pin the
>>> issue to this cause earlier by generating some traffic which would hit
>>> the recirc loop, but that didn't cause a very visible increase at the
>>> time. Given the new data, however, our attempt at manually triggering
>>> it on an otherwise unused network must have been flawed somehow.
>>>
>>> The amount of traffic dropped by the firewall is quite small too,
>>> which suggests that things could be significantly worse.
>>> In this case I saw a total of 3478 packets dropped, for a total of
>>> 183852 bytes.
>>>
>>> We'll be doing a bit more digging on the OVN side to see what may be
>>> causing this and will report back.
>>>
>>> Stéphane
>>>
>>> On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <stgra...@ubuntu.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
>>>> 5.17.1 mainline kernel.
>>>>
>>>> I'm trying to debug a very very problematic kernel memory leak which
>>>> is happening in this environment.
>>>> Sadly, I don't have a clear reproducer or much idea of when it first
>>>> appeared. I went through about a year of metrics, and some amount of
>>>> leakage may always have been present; the magnitude of it just
>>>> changed recently, making it such that my normal weekly server
>>>> maintenance is no longer frequent enough to take care of it.
>>>>
>>>> Basically, what I'm doing is running a LXD + OVN setup on 3 servers;
>>>> they all act as OVN chassis with various priorities to spread the
>>>> load.
>>>> All 3 servers also normally run a combination of containers and
>>>> virtual machines attached to about a dozen different OVN networks.
>>>>
>>>> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
>>>> which, after enabling slub debugging, can be tracked down to
>>>> nf_ct_tmpl_alloc() kernel calls, such as those made by the
>>>> openvswitch kernel module as part of its ovs_ct_copy_action()
>>>> function, which is exposed as OVS_ACTION_ATTR_CT to userspace
>>>> through the openvswitch netlink API.
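To make the nesting Ilya asks about above concrete: a ct() action like the one Stéphane traces here reaches the kernel as OVS_ACTION_ATTR_CT, possibly wrapped inside other actions. Below is a purely hypothetical sketch of what a flow triggering the 'depth 3' message could look like in ovs-dpctl dump-flows output; the match fields, statistics, zone, and recirculation IDs are all invented placeholders, not values taken from this report:

```
# Dump the kernel datapath flows on the affected chassis:
ovs-dpctl dump-flows

# Hypothetical shape of a flow that would log "depth 3": a ct() nested
# two levels down, e.g. inside check_pkt_len() inside clone().
# All values here are invented placeholders (wrapped for readability):
recirc_id(0x8),in_port(2),eth_type(0x0800),ipv4(frag=no),
  packets:1612, bytes:97440, used:0.204s,
  actions:clone(check_pkt_len(size=1500,
                              gt(recirc(0x9)),
                              le(ct(commit,zone=1),recirc(0xa))))
```

If the real dump were to show a different nesting, e.g. via sample() or dec_ttl(), the patch later in this thread handles those action lists the same way.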
>>>> This means that in just a couple of days I'm dealing with just shy
>>>> of 40GiB of those kmalloc-256 entries.
>>>> Here is one of the servers, which has been running for just 6 hours:
>>>>
>>>> ```
>>>> root@abydos:~# uptime
>>>>  20:51:44 up 6:32, 1 user, load average: 6.63, 5.95, 5.26
>>>>
>>>> root@abydos:~# slabtop -o -s c | head -n10
>>>>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
>>>>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
>>>>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
>>>>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
>>>>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
>>>>
>>>>     OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 13048822 13048737  99%    0.75K 310688       42  9942016K kmalloc-256
>>>>  2959671  2677549  90%    0.10K  75889       39   303556K buffer_head
>>>>   460411   460411 100%    0.57K  16444       28   263104K radix_tree_node
>>>>
>>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
>>>> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
>>>>   803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
>>>>    32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
>>>>     3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
>>>>     2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1
>>>>
>>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
>>>>  8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
>>>>  2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
>>>>  1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
>>>>   288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
>>>>   134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
>>>> ```
>>>>
>>>> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to
>>>> nf_ct_tmpl_free.
>>>>
>>>> Things I've done so far to try to isolate the issue:
>>>>
>>>> 1) I've evacuated all workloads from the server, so the only thing
>>>>    running on it is OVS vswitchd. This did not change anything.
>>>> 2) I've added iptables/ip6tables raw table rules marking all traffic
>>>>    as NOTRACK. This did not change anything.
>>>> 3) I've played with the chassis assignment. This does change things,
>>>>    in that a server with no active chassis will not show any leakage
>>>>    (thankfully). The busier the network I move back to the host, the
>>>>    faster the leakage.
>>>>
>>>> I've had both Frode and Tom (CCed) assist with a variety of ideas
>>>> and questions, but while we have found some unrelated OVS and kernel
>>>> issues, we've yet to figure this one out. So I wanted to reach out
>>>> to the wider community to see if anyone has either seen something
>>>> like this before or has suggestions as to where to look next.
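A reproducibility note on the trace files quoted above: the alloc_traces and free_traces views come from SLUB's user tracking. A minimal sketch of enabling it, assuming a kernel built with CONFIG_SLUB_DEBUG (the cache name is the one from the report; the rest is a generic recipe rather than the exact setup used here):

```
# Enable allocation/free tracking ("U" = user tracking) for the
# affected cache only, to limit the overhead; add this to the kernel
# command line and reboot:
#   slub_debug=U,kmalloc-256

# The per-cache traces then appear under debugfs:
sort -rn /sys/kernel/debug/slab/kmalloc-256/alloc_traces | head -n5
sort -rn /sys/kernel/debug/slab/kmalloc-256/free_traces | head -n5
```

Restricting the flag to a single cache keeps the runtime overhead of the tracking bounded, which matters on a production host like the ones described here.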
>>>> I can pretty easily rebuild the kernel, OVS or OVN, and while this
>>>> cluster is a production environment, the fact that I can evacuate
>>>> one of the three servers with no user-visible impact makes it not
>>>> too bad to debug. Having to constantly reboot the entire setup to
>>>> clear the memory leak is the bigger annoyance right now :)
>>>>
>>>> Stéphane
>>>>
>>>> PS: The side OVS/kernel issue I'm referring to is
>>>> https://lore.kernel.org/netdev/20220330194244.3476544-1-stgra...@ubuntu.com/
>>>> That kernel change made it possible to track down an issue with OVN
>>>> logical routers properly responding to ICMP on their external
>>>> address but getting into a recirculation loop when any other kind
>>>> of traffic is thrown at them (instead of being immediately dropped
>>>> or rejected).
>>
>> Hi, Stéphane.
>>
>> Thanks for the report.
>>
>> I'm not sure if that's what you see, but I think that I found one
>> pretty serious memory leak in the openvswitch module. In short,
>> it leaks all the dynamically allocated memory for nested actions.
>> E.g. if you have a datapath flow with actions:clone(ct(commit)),
>> the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
>> will not be freed.
>>
>> Could you try the change below?
>>
>> It will additionally print a rate-limited error if a nested CT action
>> is encountered:
>>
>>   openvswitch: netlink: ovs_ct_free_action at depth 2
>>
>> This message can be used to confirm the issue.
>>
>> The very last bit of the fix is not necessary, but I just spotted the
>> additional formatting issue and fixed it.
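To see why the leak happens, it helps to look at the cleanup path before this change. The function below is a sketch reconstructed from the removed and context lines of the diff that follows; it walks only the top-level action list, which is exactly why anything allocated for an action nested inside clone(), sample(), check_pkt_len(), or dec_ttl() was never freed:

```
/* Sketch of the pre-patch ovs_nla_free_flow_actions(), reconstructed
 * from the '-' and context lines of the diff below (kernel code,
 * shown for illustration only).  Only the top-level attribute list
 * is scanned: action lists nested inside clone(), sample(),
 * check_pkt_len(), or dec_ttl() are never descended into, so the
 * conntrack template allocated by nf_ct_tmpl_alloc() for a nested
 * ct() action is never freed. */
void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
{
	const struct nlattr *a;
	int rem;

	if (!sf_acts)
		return;

	nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
		switch (nla_type(a)) {
		case OVS_ACTION_ATTR_SET:
			ovs_nla_free_set_action(a);
			break;
		case OVS_ACTION_ATTR_CT:
			/* Only ever reached for a top-level ct(). */
			ovs_ct_free_action(a);
			break;
		}
	}

	kfree(sf_acts);
}
```

The patch below fixes this by recursing into each nested action list, threading a depth counter through so the temporary debug message can report how deeply the freed ct() action was nested.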
>> ---
>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>> index 5176f6ccac8e..248c39da465e 100644
>> --- a/net/openvswitch/flow_netlink.c
>> +++ b/net/openvswitch/flow_netlink.c
>> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
>>  	return sfa;
>>  }
>>
>> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
>> +					  int len, int depth);
>> +
>> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
>> +					      int depth)
>> +{
>> +	const struct nlattr *a;
>> +	int rem;
>> +
>> +	nla_for_each_nested(a, action, rem) {
>> +		switch (nla_type(a)) {
>> +		case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
>> +		case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
>> +			ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
>> +						      depth);
>> +			break;
>> +		}
>> +	}
>> +}
>> +
>> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
>> +{
>> +	const struct nlattr *a;
>> +	int rem;
>> +
>> +	nla_for_each_nested(a, action, rem) {
>> +		switch (nla_type(a)) {
>> +		case OVS_CLONE_ATTR_EXEC:
>> +			/* Real list of actions follows this attribute.
>> +			 * Free them and return. */
>> +			a = nla_next(a, &rem);
>> +			ovs_nla_free_flow_actions_nla(a, rem, depth);
>> +			return;
>> +		}
>> +	}
>> +}
>> +
>> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
>> +{
>> +	const struct nlattr *a = nla_data(action);
>> +
>> +	switch (nla_type(a)) {
>> +	case OVS_DEC_TTL_ATTR_ACTION:
>> +		ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
>> +		break;
>> +	}
>> +}
>> +
>> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
>> +{
>> +	const struct nlattr *a = nla_data(action);
>> +
>> +	switch (nla_type(a)) {
>> +	case OVS_SAMPLE_ATTR_ARG:
>> +		ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
>> +		break;
>> +	}
>> +}
>> +
>>  static void ovs_nla_free_set_action(const struct nlattr *a)
>>  {
>>  	const struct nlattr *ovs_key = nla_data(a);
>> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
>>  	}
>>  }
>>
>> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
>> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
>> +					  int len, int depth)
>>  {
>>  	const struct nlattr *a;
>>  	int rem;
>>
>> -	if (!sf_acts)
>> +	if (!actions)
>>  		return;
>>
>> -	nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
>> +	depth++;
>> +
>> +	nla_for_each_attr(a, actions, len, rem) {
>>  		switch (nla_type(a)) {
>> -		case OVS_ACTION_ATTR_SET:
>> -			ovs_nla_free_set_action(a);
>> +		case OVS_ACTION_ATTR_CHECK_PKT_LEN:
>> +			ovs_nla_free_check_pkt_len_action(a, depth);
>> +			break;
>> +
>> +		case OVS_ACTION_ATTR_CLONE:
>> +			ovs_nla_free_clone_action(a, depth);
>>  			break;
>> +
>>  		case OVS_ACTION_ATTR_CT:
>> +			if (depth != 1)
>> +				OVS_NLERR(true,
>> +					  "ovs_ct_free_action at depth %d", depth);
>>  			ovs_ct_free_action(a);
>>  			break;
>> +
>> +		case OVS_ACTION_ATTR_DEC_TTL:
>> +			ovs_nla_free_dec_ttl_action(a, depth);
>> +			break;
>> +
>> +		case OVS_ACTION_ATTR_SAMPLE:
>> +			ovs_nla_free_sample_action(a, depth);
>> +			break;
>> +
>> +		case OVS_ACTION_ATTR_SET:
>> +			ovs_nla_free_set_action(a);
>> +			break;
>>  		}
>>  	}
>> +}
>> +
>> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
>> +{
>> +	if (!sf_acts)
>> +		return;
>>
>> +	ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
>>  	kfree(sf_acts);
>>  }
>>
>> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
>>  	if (!start)
>>  		return -EMSGSIZE;
>>
>> -	err = ovs_nla_put_actions(nla_data(attr), rem, skb);
>> +	/* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
>> +	attr = nla_next(nla_data(attr), &rem);
>> +	err = ovs_nla_put_actions(attr, rem, skb);
>>
>>  	if (err)
>>  		nla_nest_cancel(skb, start);
>> ---
>>
>> Best regards, Ilya Maximets.
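For completeness, the before/after comparison used throughout this thread boils down to sampling the kmalloc-256 cache over a fixed interval. A minimal sketch of that check (the 30min window and cache name come from the report; the exact commands are an assumption, not quoted from the thread):

```
# Sample kmalloc-256 usage, wait 30 minutes, sample again.  With the
# fix applied, the report above saw only ~448KB of growth over 30min,
# versus ~1.1GiB before it.
grep '^kmalloc-256 ' /proc/slabinfo
sleep 1800
grep '^kmalloc-256 ' /proc/slabinfo
```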