On Thu, Mar 31, 2022 at 6:55 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 3/31/22 06:24, Stéphane Graber wrote:
> > Thanks for the patch!
> >
> > I've applied it on top of my 5.17.1 and I'm now getting a very very
> > large number of:
> >
> > [  188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3
> >
> > After just 3 minutes of uptime I'm seeing 230 such lines but also:
> >
> > [  108.726680] net_ratelimit: 6283 callbacks suppressed
> > [  114.515126] net_ratelimit: 3959 callbacks suppressed
> > [  120.523252] net_ratelimit: 4388 callbacks suppressed
> > [  127.458617] net_ratelimit: 2992 callbacks suppressed
> > [  133.543592] net_ratelimit: 2348 callbacks suppressed
> > [  140.099915] net_ratelimit: 5089 callbacks suppressed
> > [  145.116922] net_ratelimit: 6136 callbacks suppressed
> > [  150.551786] net_ratelimit: 5953 callbacks suppressed
> > [  155.555861] net_ratelimit: 5108 callbacks suppressed
> > [  162.740087] net_ratelimit: 6507 callbacks suppressed
> > [  168.059741] net_ratelimit: 3161 callbacks suppressed
> > [  173.092062] net_ratelimit: 4377 callbacks suppressed
> > [  178.104068] net_ratelimit: 3090 callbacks suppressed
> > [  183.135704] net_ratelimit: 2735 callbacks suppressed
> > [  188.616160] net_ratelimit: 2893 callbacks suppressed
> > [  193.628158] net_ratelimit: 4149 callbacks suppressed
> > [  199.128232] net_ratelimit: 3813 callbacks suppressed
> > [  204.136278] net_ratelimit: 2884 callbacks suppressed
> > [  211.443642] net_ratelimit: 5322 callbacks suppressed
> > [  216.643209] net_ratelimit: 5820 callbacks suppressed
> > [  221.648342] net_ratelimit: 4908 callbacks suppressed
> > [  226.656369] net_ratelimit: 4533 callbacks suppressed
> > [  231.680416] net_ratelimit: 4247 callbacks suppressed
> > [  237.172416] net_ratelimit: 2481 callbacks suppressed
> >
> > I did a 30min test run like the earlier one and I'm now seeing a
> > tiny increase in kmalloc-256 of just 448KB :)
> >
> > So that fix definitely did the trick to plug that leak!
> >
> > Tested-by: Stéphane Graber <stgra...@ubuntu.com>
>
> Thanks!
>
> Such fast flow rotation on a system is not a healthy sign, though.
> But at least it doesn't leak memory anymore.
>
> May I ask you to find the datapath flow that triggers the issue?
> I wasn't able to reproduce it with OVN system tests, so I'm curious
> what it looks like.  The flow should have a nested clone() or
> check_pkt_len() action with a ct() action inside.  'depth 3'
> suggests that it is doubly nested, i.e. clone(...clone(...ct())) or
> clone(...check_pkt_len(...ct())) or something similar.  ovs-dpctl
> can be used to dump flows.
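>
> For illustration, a doubly nested flow of that kind might look
> roughly like this in the dump (a made-up example, not a real flow):
>
>   in_port(2),eth_type(0x0800),ipv4(frag=no), actions:clone(check_pkt_len(size=1500,gt(ct(commit),3),le(4)))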

So I've run (a rather crazy) "while :; do ovs-dpctl dump-flows | grep
clone; done" on all 3 servers and didn't see a single hit.

Instead I went to grep for one of the addresses directly and I'm seeing:
https://gist.github.com/stgraber/4c7fe7b4e45cd3fa36d54759091bb2c0

I don't know if that's telling you anything useful, but it appears to
be a couple of machines trying to access a TCP port on the OVN router.
My OVN router addresses are publicly reachable (45.45.148.136 in this
case), so some amount of port scanning and the like is fairly common
and the likely cause of this traffic in the first place.
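
(That grep was simply a filter on the flow dump output, i.e. something
along the lines of "ovs-dpctl dump-flows | grep 45.45.148.136"; a
sketch rather than the exact command used.)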

> For now, I'll work on a proper patch for the kernel.

Thanks!

Stéphane

> Best regards, Ilya Maximets.
>
> >
> > Stéphane
> >
> >
> > On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>
> >> On 3/31/22 03:00, Stéphane Graber wrote:
> >>> So it looks like my main issue and the one mentioned in the PS section
> >>> may actually be one and the same.
> >>>
> >>> I've now run an experiment where I monitored the normal leakage over
> >>> 30min, which got me around 1.1GiB of extra kmalloc-256.
> >>> I then repeated the experiment but with a firewall (good old iptables)
> >>> dropping all non-ICMP traffic headed to the OVN router addresses; this
> >>> reduced the leakage to just 89MiB.
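> >>>
> >>> The dropping rules were of this general shape (a sketch using one of
> >>> the router addresses, not the exact rules used):
> >>>
> >>>   iptables -A FORWARD -d 45.45.148.136 -p icmp -j ACCEPT
> >>>   iptables -A FORWARD -d 45.45.148.136 -j DROP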
> >>>
> >>> So it looks like that traffic gets lost in a redirection loop in
> >>> OVN/OVS for a while, triggers the "openvswitch: ovs-system:
> >>> deferred action limit reached, drop recirc action" entry, and also
> >>> leaks a bunch of memory in the process. We did attempt to pin the
> >>> issue on this cause earlier by generating some traffic which would hit
> >>> the recirc loop, but that didn't cause a very visible increase at the
> >>> time. Given the new data, however, our attempt at manually triggering
> >>> it on an otherwise unused network must have been flawed somehow.
> >>>
> >>> The amount of traffic dropped by the firewall is quite small too,
> >>> which suggests that things could be significantly worse.
> >>> In this case I saw a total of 3478 packets dropped, for a total of
> >>> 183852 bytes.
> >>>
> >>> We'll be doing a bit more digging on the OVN side to see what may be
> >>> causing this and will report back.
> >>>
> >>> Stéphane
> >>>
> >>> On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <stgra...@ubuntu.com> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
> >>>> 5.17.1 mainline kernel.
> >>>>
> >>>> I'm trying to debug a very very problematic kernel memory leak which
> >>>> is happening in this environment.
> >>>> Sadly I don't have a clear reproducer or much idea of when it first
> >>>> appeared. I went through about a year of metrics, and some amount of
> >>>> leakage may always have been present; its magnitude just grew
> >>>> recently, to the point that my normal weekly server maintenance is no
> >>>> longer frequent enough to take care of it.
> >>>>
> >>>> Basically what I'm doing is running a LXD + OVN setup on 3 servers;
> >>>> they all act as OVN chassis with various priorities to spread the
> >>>> load.
> >>>> All 3 servers also normally run a combination of containers and
> >>>> virtual machines attached to about a dozen different OVN networks.
> >>>>
> >>>> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
> >>>> which, after enabling slub debugging, can be tracked down to
> >>>> nf_ct_tmpl_alloc kernel calls, such as those made by the openvswitch
> >>>> kernel module as part of its ovs_ct_copy_action function, which is
> >>>> exposed to userspace as OVS_ACTION_ATTR_CT through the openvswitch
> >>>> netlink API.
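> >>>>
> >>>> To illustrate that path, here is a simplified C sketch of the
> >>>> upstream code in net/openvswitch/conntrack.c (an illustration, not
> >>>> the verbatim kernel source):
> >>>>
> >>>> ```
> >>>> /* Every ct() action copied from a netlink OVS_ACTION_ATTR_CT
> >>>>  * attribute allocates one conntrack template (which lands in
> >>>>  * kmalloc-256 with slub debugging enabled). */
> >>>> int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
> >>>>                        const struct sw_flow_key *key,
> >>>>                        struct sw_flow_actions **sfa, bool log)
> >>>> {
> >>>>         struct ovs_conntrack_info ct_info = {};
> >>>>
> >>>>         /* ... parse the nested ct() attributes into ct_info ... */
> >>>>
> >>>>         ct_info.ct = nf_ct_tmpl_alloc(net, &ct_info.zone, GFP_KERNEL);
> >>>>         if (!ct_info.ct)
> >>>>                 return -ENOMEM;
> >>>>
> >>>>         /* ... copy the action into 'sfa' ... */
> >>>>
> >>>>         /* The template is only released again by ovs_ct_free_action()
> >>>>          * when the flow's actions are freed; if that free is never
> >>>>          * reached for this action, the object leaks. */
> >>>>         return 0;
> >>>> }
> >>>> ```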
> >>>>
> >>>> This means that in just a couple of days I'm dealing with just shy of
> >>>> 40GiB of those kmalloc-256 entries.
> >>>> Here is one of the servers which has been running for just 6 hours:
> >>>>
> >>>> ```
> >>>> root@abydos:~# uptime
> >>>>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
> >>>>
> >>>> root@abydos:~# slabtop -o -s c | head -n10
> >>>>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
> >>>>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
> >>>>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
> >>>>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
> >>>>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
> >>>>
> >>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
> >>>> 2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
> >>>> 460411 460411 100%    0.57K  16444       28    263104K radix_tree_node
> >>>>
> >>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
> >>>> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
> >>>>  803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
> >>>>   32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
> >>>>    3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
> >>>>    2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1
> >>>>
> >>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
> >>>> 8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
> >>>> 2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
> >>>> 1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
> >>>>  288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
> >>>>  134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
> >>>> ```
> >>>>
> >>>> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k calls
> >>>> to nf_ct_tmpl_free, a difference of roughly 11.9M objects. With slub
> >>>> debugging inflating each kmalloc-256 object to 0.75K, those leaked
> >>>> templates account for the bulk of the kmalloc-256 cache size shown
> >>>> above.
> >>>>
> >>>> Things I've done so far to try to isolate things:
> >>>>
> >>>> 1) I've evacuated all workloads from the server, so the only thing
> >>>> running on it is OVS vswitchd. This did not change anything.
> >>>> 2) I've added iptables/ip6tables raw table rules marking all traffic
> >>>> as NOTRACK (see the sketch after this list). This did not change
> >>>> anything.
> >>>> 3) I've played with the chassis assignment. This does change things in
> >>>> that a server with no active chassis will not show any leakage
> >>>> (thankfully). The busier the network I move back to the host, the
> >>>> faster the leakage.
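> >>>>
> >>>> The NOTRACK rules from item 2 above were of the usual raw-table
> >>>> shape; a sketch, not the exact rules used:
> >>>>
> >>>>   iptables  -t raw -A PREROUTING -j CT --notrack
> >>>>   ip6tables -t raw -A PREROUTING -j CT --notrack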
> >>>>
> >>>>
> >>>> I've had both Frode and Tom (Cced) assist with a variety of ideas and
> >>>> questions, but while we have found some unrelated OVS and kernel
> >>>> issues, we've yet to figure this one out. So I wanted to reach out to
> >>>> the wider community to see if anyone has either seen something like
> >>>> this before or has suggestions as to where to look next.
> >>>>
> >>>> I can pretty easily rebuild the kernel, OVS, or OVN, and while this
> >>>> cluster is a production environment, the fact that I can evacuate one
> >>>> of the three servers with no user-visible impact makes it not too bad
> >>>> to debug. Having to constantly reboot the entire setup to clear the
> >>>> memory leak is the bigger annoyance right now :)
> >>>>
> >>>> Stéphane
> >>>>
> >>>> PS: The side OVS/kernel issue I'm referring to is
> >>>> https://lore.kernel.org/netdev/20220330194244.3476544-1-stgra...@ubuntu.com/
> >>>> which made it possible to track down an issue where OVN logical
> >>>> routers properly respond to ICMP on their external address but get
> >>>> into a recirculation loop when any other kind of traffic is thrown at
> >>>> them (instead of that traffic being immediately dropped or rejected).
> >>
> >> Hi, Stéphane.
> >>
> >> Thanks for the report.
> >>
> >> I'm not sure if that's what you see, but I think that I found one
> >> pretty serious memory leak in the openvswitch module.  In short,
> >> it leaks all the dynamically allocated memory for nested actions.
> >> E.g. if you have a datapath flow with actions:clone(ct(commit)),
> >> the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
> >> will not be freed.
> >>
> >> Could you try the change below?
> >>
> >> It will additionally print a rate-limited error if a nested ct()
> >> action is encountered:
> >>   openvswitch: netlink: ovs_ct_free_action at depth 2
> >>
> >> This message can be used to confirm the issue.  (Depth 1 is the
> >> top-level action list, so any depth of 2 or more means the ct()
> >> action was nested.)
> >>
> >> The very last bit of the fix is not necessary, but I just spotted
> >> an additional formatting issue and fixed it.
> >>
> >> ---
> >> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> >> index 5176f6ccac8e..248c39da465e 100644
> >> --- a/net/openvswitch/flow_netlink.c
> >> +++ b/net/openvswitch/flow_netlink.c
> >> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
> >>         return sfa;
> >>  }
> >>
> >> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> >> +                                         int len, int depth);
> >> +
> >> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
> >> +                                             int depth)
> >> +{
> >> +       const struct nlattr *a;
> >> +       int rem;
> >> +
> >> +       nla_for_each_nested(a, action, rem) {
> >> +               switch (nla_type(a)) {
> >> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
> >> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
> >> +                       ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
> >> +                                                     depth);
> >> +                       break;
> >> +               }
> >> +       }
> >> +}
> >> +
> >> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
> >> +{
> >> +       const struct nlattr *a;
> >> +       int rem;
> >> +
> >> +       nla_for_each_nested(a, action, rem) {
> >> +               switch (nla_type(a)) {
> >> +               case OVS_CLONE_ATTR_EXEC:
> >> +                       /* Real list of actions follows this attribute.
> >> +                        * Free them and return. */
> >> +                       a = nla_next(a, &rem);
> >> +                       ovs_nla_free_flow_actions_nla(a, rem, depth);
> >> +                       return;
> >> +               }
> >> +       }
> >> +}
> >> +
> >> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
> >> +{
> >> +       const struct nlattr *a = nla_data(action);
> >> +
> >> +       switch (nla_type(a)) {
> >> +       case OVS_DEC_TTL_ATTR_ACTION:
> >> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> >> +               break;
> >> +       }
> >> +}
> >> +
> >> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
> >> +{
> >> +       const struct nlattr *a = nla_data(action);
> >> +
> >> +       switch (nla_type(a)) {
> >> +       case OVS_SAMPLE_ATTR_ARG:
> >> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> >> +               break;
> >> +       }
> >> +}
> >> +
> >>  static void ovs_nla_free_set_action(const struct nlattr *a)
> >>  {
> >>         const struct nlattr *ovs_key = nla_data(a);
> >> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
> >>         }
> >>  }
> >>
> >> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> >> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> >> +                                         int len, int depth)
> >>  {
> >>         const struct nlattr *a;
> >>         int rem;
> >>
> >> -       if (!sf_acts)
> >> +       if (!actions)
> >>                 return;
> >>
> >> -       nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
> >> +       depth++;
> >> +
> >> +       nla_for_each_attr(a, actions, len, rem) {
> >>                 switch (nla_type(a)) {
> >> -               case OVS_ACTION_ATTR_SET:
> >> -                       ovs_nla_free_set_action(a);
> >> +               case OVS_ACTION_ATTR_CHECK_PKT_LEN:
> >> +                       ovs_nla_free_check_pkt_len_action(a, depth);
> >> +                       break;
> >> +
> >> +               case OVS_ACTION_ATTR_CLONE:
> >> +                       ovs_nla_free_clone_action(a, depth);
> >>                         break;
> >> +
> >>                 case OVS_ACTION_ATTR_CT:
> >> +                       if (depth != 1)
> >> +                               OVS_NLERR(true,
> >> +                                       "ovs_ct_free_action at depth %d", depth);
> >>                         ovs_ct_free_action(a);
> >>                         break;
> >> +
> >> +               case OVS_ACTION_ATTR_DEC_TTL:
> >> +                       ovs_nla_free_dec_ttl_action(a, depth);
> >> +                       break;
> >> +
> >> +               case OVS_ACTION_ATTR_SAMPLE:
> >> +                       ovs_nla_free_sample_action(a, depth);
> >> +                       break;
> >> +
> >> +               case OVS_ACTION_ATTR_SET:
> >> +                       ovs_nla_free_set_action(a);
> >> +                       break;
> >>                 }
> >>         }
> >> +}
> >> +
> >> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> >> +{
> >> +       if (!sf_acts)
> >> +               return;
> >> +
> >> +       ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
> >>         kfree(sf_acts);
> >>  }
> >>
> >> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
> >>         if (!start)
> >>                 return -EMSGSIZE;
> >>
> >> -       err = ovs_nla_put_actions(nla_data(attr), rem, skb);
> >> +       /* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
> >> +       attr = nla_next(nla_data(attr), &rem);
> >> +       err = ovs_nla_put_actions(attr, rem, skb);
> >>
> >>         if (err)
> >>                 nla_nest_cancel(skb, start);
> >> ---
> >>
> >> Best regards, Ilya Maximets.
