On 3/31/22 06:24, Stéphane Graber wrote:
> Thanks for the patch!
> 
> I've applied it on top of my 5.17.1 and I'm now getting a very very
> large number of:
> 
> [  188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3
> 
> After just 3 minutes of uptime I'm seeing 230 such lines but also:
> 
> [  108.726680] net_ratelimit: 6283 callbacks suppressed
> [  114.515126] net_ratelimit: 3959 callbacks suppressed
> [  120.523252] net_ratelimit: 4388 callbacks suppressed
> [  127.458617] net_ratelimit: 2992 callbacks suppressed
> [  133.543592] net_ratelimit: 2348 callbacks suppressed
> [  140.099915] net_ratelimit: 5089 callbacks suppressed
> [  145.116922] net_ratelimit: 6136 callbacks suppressed
> [  150.551786] net_ratelimit: 5953 callbacks suppressed
> [  155.555861] net_ratelimit: 5108 callbacks suppressed
> [  162.740087] net_ratelimit: 6507 callbacks suppressed
> [  168.059741] net_ratelimit: 3161 callbacks suppressed
> [  173.092062] net_ratelimit: 4377 callbacks suppressed
> [  178.104068] net_ratelimit: 3090 callbacks suppressed
> [  183.135704] net_ratelimit: 2735 callbacks suppressed
> [  188.616160] net_ratelimit: 2893 callbacks suppressed
> [  193.628158] net_ratelimit: 4149 callbacks suppressed
> [  199.128232] net_ratelimit: 3813 callbacks suppressed
> [  204.136278] net_ratelimit: 2884 callbacks suppressed
> [  211.443642] net_ratelimit: 5322 callbacks suppressed
> [  216.643209] net_ratelimit: 5820 callbacks suppressed
> [  221.648342] net_ratelimit: 4908 callbacks suppressed
> [  226.656369] net_ratelimit: 4533 callbacks suppressed
> [  231.680416] net_ratelimit: 4247 callbacks suppressed
> [  237.172416] net_ratelimit: 2481 callbacks suppressed
> 
> I did a 30min test run like earlier and I'm now seeing a tiny increase
> in kmalloc-256 of just 448KB :)
> 
> So that fix definitely did the trick to plug that leak!
> 
> Tested-by: Stéphane Graber <stgra...@ubuntu.com>

Thanks!

Such fast flow rotation on a system is not a healthy sign, though.
But at least it doesn't leak memory anymore.

May I ask you to find the datapath flow that triggers the issue?
I wasn't able to reproduce it with OVN system tests, so I'm curious
what it looks like.  The flow should have a nested clone() or
check_pkt_len() action with a ct() action inside.  'depth 3'
suggests that it is double nested, i.e. clone(...clone(...ct())) or
clone(...check_pkt_len(...ct())) or something similar.  ovs-dpctl
can be used to dump flows.

For now, I'll work on a proper patch for the kernel.

Best regards, Ilya Maximets.

> 
> Stéphane
> 
> 
> On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 3/31/22 03:00, Stéphane Graber wrote:
>>> So it looks like my main issue and the one mentioned in the PS section
>>> may actually be one and the same.
>>>
>>> I've run an experiment now where I monitor the normal leakage over
>>> 30min, getting me around 1.1GiB of extra kmalloc-256.
>>> I then repeated the experiment but with firewall (good old iptables)
>>> dropping all non-ICMP traffic headed to the OVN router addresses, this
>>> reduced the leakage to just 89MiB.
>>>
>>> So it looks like that traffic gets lost in a redirection loop in
>>> OVN/OVS for a while, which triggers the "openvswitch: ovs-system:
>>> deferred action limit reached, drop recirc action" message and also
>>> leaks a bunch of memory in the process. We did attempt to pin the
>>> issue to this cause earlier by generating some traffic which would hit
>>> the recirc loop but that didn't cause a very visible increase at the
>>> time. Given the new data however, our attempt at manually triggering
>>> it on an otherwise unused network must have been flawed somehow.
>>>
>>> The amount of traffic dropped by the firewall is quite small too,
>>> which suggests that things could be significantly worse.
>>> In this case I saw a total of 3478 packets dropped, for a total of 183852 bytes.
>>>
>>> We'll be doing a bit more digging on the OVN side to see what may be
>>> causing this and will report back.
>>>
>>> Stéphane
>>>
>>>
>>> On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <stgra...@ubuntu.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
>>>> 5.17.1 mainline kernel.
>>>>
>>>> I'm trying to debug a very very problematic kernel memory leak which
>>>> is happening in this environment.
>>>> Sadly I don't have a clear reproducer or much idea of when it first
>>>> appeared. I went through about a year of metrics, and some amount of
>>>> leakage may always have been present; its magnitude only grew
>>>> recently, to the point that my normal weekly server maintenance
>>>> is no longer frequent enough to take care of it.
>>>>
>>>> Basically what I'm doing is running a LXD + OVN setup on 3 servers,
>>>> they all act as OVN chassis with various priorities to spread the
>>>> load.
>>>> All 3 servers also normally run a combination of containers and
>>>> virtual machines attached to about a dozen different OVN networks.
>>>>
>>>> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
>>>> which, after enabling slub debugging, can be tracked down to
>>>> nf_ct_tmpl_alloc kernel calls, such as those made by the openvswitch
>>>> kernel module as part of its ovs_ct_copy_action function, which is
>>>> exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
>>>> netlink API.
>>>>
>>>> This means that in just a couple of days I'm dealing with just shy of
>>>> 40GiB of those kmalloc-256 entries.
>>>> Here is one of the servers which has been running for just 6 hours:
>>>>
>>>> ```
>>>> root@abydos:~# uptime
>>>>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
>>>>
>>>> root@abydos:~# slabtop -o -s c | head -n10
>>>>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
>>>>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
>>>>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
>>>>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
>>>>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
>>>>
>>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
>>>> 2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
>>>> 460411 460411 100%    0.57K  16444       28    263104K radix_tree_node
>>>>
>>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
>>>> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
>>>>  803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
>>>>   32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
>>>>    3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
>>>>    2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1
>>>>
>>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
>>>> 8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
>>>> 2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
>>>> 1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
>>>>  288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
>>>>  134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
>>>> ```
>>>>
>>>> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k calls to nf_ct_tmpl_free.
>>>>
>>>> Things I've done so far to try to isolate things:
>>>>
>>>> 1) I've evacuated all workloads from the server, so the only thing
>>>> running on it is OVS vswitchd. This did not change anything.
>>>> 2) I've added iptables/ip6tables raw table rules marking all traffic
>>>> as NOTRACK. This did not change anything.
>>>> 3) I've played with the chassis assignment. This does change things in
>>>> that a server with no active chassis will not show any leakage
>>>> (thankfully). The busier the network I move back to the host, the
>>>> faster the leakage.
>>>>
>>>>
>>>> I've had both Frode and Tom (Cced) assist with a variety of ideas and
>>>> questions but while we have found some unrelated OVS and kernel
>>>> issues, we're yet to figure this one out. So I wanted to reach out to
>>>> the wider community to see if anyone has either seen something like
>>>> this before or has suggestions as to where to look next.
>>>>
>>>> I can pretty easily rebuild the kernel, OVS or OVN and while this
>>>> cluster is a production environment, the fact that I can evacuate one
>>>> of the three servers with no user-visible impact makes it not too bad
>>>> to debug. Having to constantly reboot the entire setup to clear the
>>>> memory leak is the bigger annoyance right now :)
>>>>
>>>> Stéphane
>>>>
>>>> PS: The side OVS/kernel issue I'm referring to is
>>>> https://lore.kernel.org/netdev/20220330194244.3476544-1-stgra...@ubuntu.com/
>>>> which allowed us to track down an issue where OVN logical routers
>>>> properly respond to ICMP on their external address but get into
>>>> a recirculation loop when any other kind of traffic is thrown at them
>>>> (instead of immediately dropping or rejecting it).
>>
>> Hi, Stéphane.
>>
>> Thanks for the report.
>>
>> I'm not sure if that's what you see, but I think that I found one
>> pretty serious memory leak in the openvswitch module.  In short,
>> it leaks all the dynamically allocated memory for nested actions.
>> E.g. if you have a datapath flow with actions:clone(ct(commit)),
>> the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
>> will not be freed.
>>
>> Could you try the change below?
>>
>> It will additionally print a rate-limited error if a nested CT
>> action is encountered:
>>   openvswitch: netlink: ovs_ct_free_action at depth 2
>>
>> This message can be used to confirm the issue.
>>
>> The very last bit of the fix is not necessary, but I spotted an
>> additional formatting issue there and fixed it.
>>
>> ---
>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>> index 5176f6ccac8e..248c39da465e 100644
>> --- a/net/openvswitch/flow_netlink.c
>> +++ b/net/openvswitch/flow_netlink.c
>> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
>>         return sfa;
>>  }
>>
>> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
>> +                                         int len, int depth);
>> +
>> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
>> +                                             int depth)
>> +{
>> +       const struct nlattr *a;
>> +       int rem;
>> +
>> +       nla_for_each_nested(a, action, rem) {
>> +               switch (nla_type(a)) {
>> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
>> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
>> +                       ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
>> +                                                     depth);
>> +                       break;
>> +               }
>> +       }
>> +}
>> +
>> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
>> +{
>> +       const struct nlattr *a;
>> +       int rem;
>> +
>> +       nla_for_each_nested(a, action, rem) {
>> +               switch (nla_type(a)) {
>> +               case OVS_CLONE_ATTR_EXEC:
>> +                       /* Real list of actions follows this attribute.
>> +                        * Free them and return. */
>> +                       a = nla_next(a, &rem);
>> +                       ovs_nla_free_flow_actions_nla(a, rem, depth);
>> +                       return;
>> +               }
>> +       }
>> +}
>> +
>> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
>> +{
>> +       const struct nlattr *a = nla_data(action);
>> +
>> +       switch (nla_type(a)) {
>> +       case OVS_DEC_TTL_ATTR_ACTION:
>> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
>> +               break;
>> +       }
>> +}
>> +
>> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
>> +{
>> +       const struct nlattr *a = nla_data(action);
>> +
>> +       switch (nla_type(a)) {
>> +       case OVS_SAMPLE_ATTR_ARG:
>> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
>> +               break;
>> +       }
>> +}
>> +
>>  static void ovs_nla_free_set_action(const struct nlattr *a)
>>  {
>>         const struct nlattr *ovs_key = nla_data(a);
>> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
>>         }
>>  }
>>
>> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
>> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
>> +                                         int len, int depth)
>>  {
>>         const struct nlattr *a;
>>         int rem;
>>
>> -       if (!sf_acts)
>> +       if (!actions)
>>                 return;
>>
>> -       nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
>> +       depth++;
>> +
>> +       nla_for_each_attr(a, actions, len, rem) {
>>                 switch (nla_type(a)) {
>> -               case OVS_ACTION_ATTR_SET:
>> -                       ovs_nla_free_set_action(a);
>> +               case OVS_ACTION_ATTR_CHECK_PKT_LEN:
>> +                       ovs_nla_free_check_pkt_len_action(a, depth);
>> +                       break;
>> +
>> +               case OVS_ACTION_ATTR_CLONE:
>> +                       ovs_nla_free_clone_action(a, depth);
>>                         break;
>> +
>>                 case OVS_ACTION_ATTR_CT:
>> +                       if (depth != 1)
>> +                               OVS_NLERR(true,
>> +                                       "ovs_ct_free_action at depth %d", depth);
>>                         ovs_ct_free_action(a);
>>                         break;
>> +
>> +               case OVS_ACTION_ATTR_DEC_TTL:
>> +                       ovs_nla_free_dec_ttl_action(a, depth);
>> +                       break;
>> +
>> +               case OVS_ACTION_ATTR_SAMPLE:
>> +                       ovs_nla_free_sample_action(a, depth);
>> +                       break;
>> +
>> +               case OVS_ACTION_ATTR_SET:
>> +                       ovs_nla_free_set_action(a);
>> +                       break;
>>                 }
>>         }
>> +}
>> +
>> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
>> +{
>> +       if (!sf_acts)
>> +               return;
>>
>> +       ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
>>         kfree(sf_acts);
>>  }
>>
>> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
>>         if (!start)
>>                 return -EMSGSIZE;
>>
>> -       err = ovs_nla_put_actions(nla_data(attr), rem, skb);
>> +       /* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
>> +       attr = nla_next(nla_data(attr), &rem);
>> +       err = ovs_nla_put_actions(attr, rem, skb);
>>
>>         if (err)
>>                 nla_nest_cancel(skb, start);
>> ---
>>
>> Best regards, Ilya Maximets.
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
