On Mon, May 30, 2022 at 5:25 PM Frode Nordahl
<frode.nord...@canonical.com> wrote:
>
> On Fri, May 27, 2022 at 10:04 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >
> > On 5/26/22 14:53, Frode Nordahl wrote:
> > >
> > >
> > > On Thu, May 26, 2022 at 2:45 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >
> > >     On 5/26/22 13:00, Frode Nordahl wrote:
> > >     > On Wed, May 25, 2022 at 9:55 AM Frode Nordahl
> > >     > <frode.nord...@canonical.com> wrote:
> > >     >>
> > >     >> On Tue, May 24, 2022 at 1:32 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >     >>>
> > >     >>> On 5/24/22 12:54, Frode Nordahl wrote:
> > >     >>>> On Mon, May 23, 2022 at 3:49 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >     >>>>>
> > >     >>>>> On 5/21/22 12:49, Frode Nordahl wrote:
> > >     >>>>>> On Thu, May 19, 2022 at 3:39 PM Frode Nordahl
> > >     >>>>>> <frode.nord...@canonical.com> wrote:
> > >     >>>>>>>
> > >     >>>>>>> On Sat, May 14, 2022 at 2:10 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> > >     >>>>>>>>
> > >     >>>>>>>> On 5/13/22 10:36, Frode Nordahl wrote:
> > >     >>>>>>>>> On Fri, Mar 11, 2022 at 2:04 PM Liam Young <liam.yo...@canonical.com> wrote:
> > >     >>>>>>>>>>
> > >     >>>>>>>>>> Hi,
> > >     >>>>>>>>>>
> > >     >>>>>>>>>> <tl;dr> Commit 355fef6f2 seems to break connectivity in my 
> > > setup</tl;dr>
> > >     >>>>>>>>>
> > >     >>>>>>>>> After OVN started colocating sNAT and dNAT operations in the same
> > >     >>>>>>>>> CT zone [0], the above-mentioned OVS commit appears to also break
> > >     >>>>>>>>> OVN [1].  I have been discussing this with Numan and he has found
> > >     >>>>>>>>> a correlation with the behavior of the Open vSwitch kernel datapath
> > >     >>>>>>>>> conntrack `skb_nfct_cached` function [2], i.e. if that is changed
> > >     >>>>>>>>> to always return 'false' the erratic behavior disappears.
> > >     >>>>>>>>>
> > >     >>>>>>>>> So I guess the question then becomes: is this an OVS userspace or
> > >     >>>>>>>>> an OVS kernel problem?
> > >     >>>>>>>>>
> > >     >>>>>>>>> We have a reproducer in [3].
> > >     >>>>>>>>>
> > >     >>>>>>>>> 0: https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a
> > >     >>>>>>>>> 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
> > >     >>>>>>>>> 2: https://elixir.bootlin.com/linux/latest/source/net/openvswitch/conntrack.c#L683
> > >     >>>>>>>>> 3: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6
> > >     >>>>>>>>>
> > >     >>>>>>>>
> > >     >>>>>>>> Hrm.  I think there is a logical bug in the implementation of the
> > >     >>>>>>>> ct() datapath action in both datapaths.
> > >     >>>>>>>>
> > >     >>>>>>>> The problem appears when the OpenFlow pipeline for the 
> > > packet contains
> > >     >>>>>>>> several ct() actions.  The NXAST_CT action (i.e. every ct()
> > >     >>>>>>>> action) must always put the packet into an untracked state.  Tracked
> > > state will be
> > >     >>>>>>>> set only in the 'recirc_table' where the packet is cloned by 
> > > the ct()
> > >     >>>>>>>> action for further processing.
> > >     >>>>>>>>
> > >     >>>>>>>> If an OF pipeline looks like this:
> > >     >>>>>>>>
> > >     >>>>>>>>   actions=ct(),something,something,ct(),something
> > >     >>>>>>>>
> > >     >>>>>>>> each action must be entered with the 'untracked' conntrack 
> > > state in
> > >     >>>>>>>> the packet metadata.
> > >     >>>>>>>>
> > >     >>>>>>>> However, the ct() action in the datapath, unlike OpenFlow, doesn't
> > >     >>>>>>>> work this way.  It modifies the conntrack state of the packet it is
> > >     >>>>>>>> processing.  During flow translation OVS inserts a recirculation
> > >     >>>>>>>> right after the datapath ct() action.  This way the conntrack state
> > >     >>>>>>>> affects only processing after recirculation.  Sort of.
> > >     >>>>>>>>
> > >     >>>>>>>> The issue is that not all OpenFlow ct() actions have recirc_table
> > >     >>>>>>>> specified.  These actions are supposed to change the state of the
> > >     >>>>>>>> conntrack subsystem, but they still must leave the packet itself in
> > >     >>>>>>>> the untracked state with no strings attached.  But since datapath
> > >     >>>>>>>> ct() actions don't work that way, and since no recirculation is
> > >     >>>>>>>> inserted (no recirc_table specified), the packet carries the
> > >     >>>>>>>> conntrack state along to all other actions after the conntrack
> > >     >>>>>>>> execution.  This doesn't impact normal actions, but it seems to
> > >     >>>>>>>> break subsequent ct() actions on the same pipeline.
> > >     >>>>>>>>
> > >     >>>>>>>> In general, it looks to me like the ct() action in the datapath is
> > >     >>>>>>>> not supposed to cache any connection data inside the packet's
> > >     >>>>>>>> metadata.  This seems to be the root cause of the problem.  Fields
> > >     >>>>>>>> that OF knows about should not trigger any issues if carried along
> > >     >>>>>>>> with a packet, but datapath-specific metadata cannot be cleared,
> > >     >>>>>>>> because OFproto simply doesn't know about it.
> > >     >>>>>>>>
> > >     >>>>>>>> But, I guess, not caching the connection and other internal 
> > > structures
> > >     >>>>>>>> might significantly impact datapath performance.  Will it?
> > >     >>>>>>>>
> > >     >>>>>>>> Looking at the reproducer in [3], it has, for example, the 
> > > flow with the
> > >     >>>>>>>> following actions:
> > >     >>>>>>>>
> > >     >>>>>>>>   actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
> > >     >>>>>>>>           
> > > set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
> > >     >>>>>>>>           
> > > set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
> > >     >>>>>>>>           ct(zone=8),recirc(0x4)
> > >     >>>>>>>>
> > >     >>>>>>>> So, if the first ct() changes some datapath-specific packet
> > >     >>>>>>>> metadata, the second ct() may try to use that information.
> > >     >>>>>>>>
> > >     >>>>>>>> It looks like during flow translation we must add a ct_clear action
> > >     >>>>>>>> explicitly after every ct() action, unless it is the last action in
> > >     >>>>>>>> the list.  This would end up adding a lot of ct_clear actions
> > >     >>>>>>>> everywhere.
> > >     >>>>>>>>
> > >     >>>>>>>> Another option is the patch below, which tracks whether a ct_clear
> > >     >>>>>>>> is required and adds that action before the next ct() action in
> > >     >>>>>>>> the datapath.
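> > >     >>>>>>>>
> > >     >>>>>>>> To illustrate the bookkeeping only, here is a toy, self-contained
> > >     >>>>>>>> model of that second option (it is not the actual patch; the names
> > >     >>>>>>>> and the simplified action strings are made up for the example):
> > >     >>>>>>>>
> > >     >>>>>>>> /* Toy model: emit a ct_clear lazily, only if a previous ct()
> > >     >>>>>>>>  * without recirculation left conntrack state on the packet. */
> > >     >>>>>>>> #include <stdbool.h>
> > >     >>>>>>>> #include <stdio.h>
> > >     >>>>>>>>
> > >     >>>>>>>> struct toy_ctx {
> > >     >>>>>>>>     bool pending_ct_clear;  /* A ct() ran and no recirc followed. */
> > >     >>>>>>>> };
> > >     >>>>>>>>
> > >     >>>>>>>> static void emit(const char *action) { printf("%s,", action); }
> > >     >>>>>>>>
> > >     >>>>>>>> static void compose_ct(struct toy_ctx *ctx, bool has_recirc)
> > >     >>>>>>>> {
> > >     >>>>>>>>     if (ctx->pending_ct_clear) {
> > >     >>>>>>>>         /* Scrub whatever the earlier ct() cached on the packet. */
> > >     >>>>>>>>         emit("ct_clear");
> > >     >>>>>>>>         ctx->pending_ct_clear = false;
> > >     >>>>>>>>     }
> > >     >>>>>>>>     emit(has_recirc ? "ct(...),recirc(...)" : "ct(...)");
> > >     >>>>>>>>     /* Without a recirculation right after it, the datapath ct()
> > >     >>>>>>>>      * leaves its state on the packet for the rest of the list. */
> > >     >>>>>>>>     ctx->pending_ct_clear = !has_recirc;
> > >     >>>>>>>> }
> > >     >>>>>>>>
> > >     >>>>>>>> int main(void)
> > >     >>>>>>>> {
> > >     >>>>>>>>     struct toy_ctx ctx = { .pending_ct_clear = false };
> > >     >>>>>>>>
> > >     >>>>>>>>     compose_ct(&ctx, false);   /* ct(commit,zone=8,nat(src)) */
> > >     >>>>>>>>     emit("set(eth(...))");
> > >     >>>>>>>>     emit("set(ipv4(...))");
> > >     >>>>>>>>     compose_ct(&ctx, true);    /* ct(zone=8),recirc(0x4) */
> > >     >>>>>>>>     putchar('\n');
> > >     >>>>>>>>     return 0;
> > >     >>>>>>>> }
> > >     >>>>>>>>
> > >     >>>>>>>> With the actions from the reproducer above, the model prints a
> > >     >>>>>>>> ct_clear between the set() actions and the second ct(), which is
> > >     >>>>>>>> where the real change needs to insert it.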
> > >     >>>>>>>>
> > >     >>>>>>>> BTW, the test [3] fails with both kernel and userspace 
> > > datapaths.
> > >     >>>>>>>>
> > >     >>>>>>>> The change below should fix the problem, I think.
> > >     >>>>>>>> It looks like we also have to put a ct_clear action before
> > >     >>>>>>>> translating output to the patch port if we're in the 'conntracked'
> > >     >>>>>>>> state.
> > >     >>>>>>>>
> > >     >>>>>>>> And I do not know how to fix the problem for kernels without 
> > > ct_clear
> > >     >>>>>>>> support.  I don't think we can clear metadata that is 
> > > internal to
> > >     >>>>>>>> the kernel.  Ideas are welcome.
> > >     >>>>>>>>
> > >     >>>>>>>> CC: Aaron, Paolo.
> > >     >>>>>>>>
> > >     >>>>>>>> Any thoughts?
> > >     >>>>>>>
> > >     >>>>>>> Thank you so much for the detailed explanation of what's 
> > > going on and
> > >     >>>>>>> for providing a proposed fix.
> > >     >>>>>>>
> > >     >>>>>>> I took it for a spin and it does indeed appear to fix the OVN
> > >     >>>>>>> hairpin issue, but it unfortunately does not appear to fix the
> > >     >>>>>>> issue Liam reported for the ML2/OVS use case [4].
> > >     >>>>>>>
> > >     >>>>>>> When trying that use case with your patch I see this in the 
> > > Open vSwitch log:
> > >     >>>>>>> 
> > > 2022-05-19T08:17:02.668Z|00001|ofproto_dpif_xlate(handler1)|WARN|over
> > >     >>>>>>> max translation depth 64 on bridge br-int while processing
> > >     >>>>>>> 
> > > ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=3,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0
> > >     >>>>>>> 
> > > 2022-05-19T08:17:02.668Z|00002|ofproto_dpif_upcall(handler1)|WARN|Dropped
> > >     >>>>>>> 2 log messages in last 205 seconds (most recently, 187 
> > > seconds ago)
> > >     >>>>>>> due to excessive rate
> > >     >>>>>>> 
> > > 2022-05-19T08:17:02.668Z|00003|ofproto_dpif_upcall(handler1)|WARN|Flow:
> > >     >>>>>>> 
> > > ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=5,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0
> > >     >>>>>>>
> > >     >>>>>>> bridge("br-int")
> > >     >>>>>>> ----------------
> > >     >>>>>>>  0. priority 0, cookie 0x56d6d58c018b51bd
> > >     >>>>>>>     goto_table:60
> > >     >>>>>>> 60. priority 3, cookie 0x56d6d58c018b51bd
> > >     >>>>>>>     NORMAL
> > >     >>>>>>>      >>>> received packet on unknown port 5 <<<<
> > >     >>>>>>>      >> no input bundle, dropping
> > >     >>>>>>>
> > >     >>>>>>> Final flow: unchanged
> > >     >>>>>>> Megaflow: 
> > > recirc_id=0,eth,ipv6,in_port=5,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,nw_frag=no
> > >     >>>>>>> Datapath actions: drop
> > >     >>>>>>>
> > >     >>>>>>>
> > >     >>>>>>> On the back of your explanation and the log output I made an 
> > > attempt
> > >     >>>>>>> at adding this patch on top of yours:
> > >     >>>>>>> ---
> > >     >>>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
> > > b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>>> index 110dab0ec..2955e8e4d 100644
> > >     >>>>>>> --- a/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>>> @@ -7144,6 +7144,7 @@ do_xlate_actions(const struct ofpact 
> > > *ofpacts,
> > >     >>>>>>> size_t ofpacts_len,
> > >     >>>>>>>           * resubmit to the frozen actions.
> > >     >>>>>>>           */
> > >     >>>>>>>          case OFPACT_RESUBMIT:
> > >     >>>>>>> +            ctx->pending_ct_clear = true;
> > >     >>>>>>>              xlate_ofpact_resubmit(ctx, 
> > > ofpact_get_RESUBMIT(a), last);
> > >     >>>>>>>              continue;
> > >     >>>>>>>          case OFPACT_GOTO_TABLE:
> > >     >>>>>>> ---
> > >     >>>>>>>
> > >     >>>>>>> And the two patches together do appear to resolve the issue
> > >     >>>>>>> reported in [4] as well as the OVN hairpin issue [1].  It does,
> > >     >>>>>>> however, make a couple of tests fail, so I need to look into
> > >     >>>>>>> whether that is expected from the change or whether the approach
> > >     >>>>>>> must be changed.
> > >     >>>>>>>
> > >     >>>>>>> 4: https://bugs.launchpad.net/openvswitch/+bug/1964117
> > >     >>>>>>
> > >     >>>>>> The following patch also works and does not cause any tests to 
> > > fail.
> > >     >>>>>>
> > >     >>>>>> ---
> > >     >>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
> > > b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>> index 110dab0ec..905f6994d 100644
> > >     >>>>>> --- a/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>> +++ b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>>> @@ -7354,7 +7354,9 @@ do_xlate_actions(const struct ofpact 
> > > *ofpacts,
> > >     >>>>>> size_t ofpacts_len,
> > >     >>>>>>              break;
> > >     >>>>>>
> > >     >>>>>>          case OFPACT_CT_CLEAR:
> > >     >>>>>> -            if (ctx->conntracked || ctx->pending_ct_clear) {
> > >     >>>>>> +            if (ctx->conntracked || ctx->pending_ct_clear
> > >     >>>>>> +                || (flow->ct_state && flow->ct_state & 
> > > CS_TRACKED))
> > >     >>>>>> +            {
> > >     >>>>>>                  compose_ct_clear_action(ctx);
> > >     >>>>>>              }
> > >     >>>>>>              break;
> > >     >>>>>> ---
> > >     >>>>>>
> > >     >>>>>> The OpenStack ML2/OVS Open vSwitch firewall driver appears to make
> > >     >>>>>> explicit use of the ct_clear action as part of progressing packets
> > >     >>>>>> through the tables, and the optimization in 355fef6f2 interrupts
> > >     >>>>>> that workflow.
> > >     >>>>>>
> > >     >>>>>> I am not quite sure whether the definition of `struct
> > >     >>>>>> xlate_ctx->conntracked` is "this flow has been exposed to conntrack"
> > >     >>>>>> or "conntrack sees this flow as established".  If anyone can shed
> > >     >>>>>> light on that, we would know whether the above patch is OK as is, or
> > >     >>>>>> whether this flow state should have set struct xlate_ctx->conntracked
> > >     >>>>>> to true in the first place.
> > >     >>>>>
> > >     >>>>> I believe that ctx->conntracked should mean "this flow has been
> > >     >>>>> exposed to conntrack".  And it looks like the condition:
> > >     >>>>>  (!ctx->conntracked && flow->ct_state & CS_TRACKED)
> > >     >>>>> should not be possible.  We're probably missing a clear_conntrack()
> > >     >>>>> call somewhere.
> > >     >>>>>
> > >     >>>>> One thing I seem to be missing in my patch is the conntrack clear
> > >     >>>>> after returning from the patch port processing:
> > >     >>>>>
> > >     >>>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
> > > b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>> index 110dab0ec..53d4f78b2 100644
> > >     >>>>> --- a/ofproto/ofproto-dpif-xlate.c
> > >     >>>>> +++ b/ofproto/ofproto-dpif-xlate.c
> > >     >>>>> @@ -3955,8 +3955,13 @@ patch_port_output(struct xlate_ctx *ctx, 
> > > const struct xport *in_dev,
> > >     >>>>>
> > >     >>>>>      /* The out bridge's conntrack execution should have no 
> > > effect on the
> > >     >>>>>       * original bridge. */
> > >     >>>>> -    ctx->conntracked = old_conntrack;
> > >     >>>>>      ctx->pending_ct_clear = old_ct_clear;
> > >     >>>>> +    if (ctx->conntracked) {
> > >     >>>>> +        /* Conntrack was involved in the other bridge.  We 
> > > need to clear
> > >     >>>>> +         * whatever information was cached in the datapath. */
> > >     >>>>> +        ctx->pending_ct_clear = true;
> > >     >>>>> +    }
> > >     >>>>> +    ctx->conntracked = old_conntrack;
> > >     >>>>>
> > >     >>>>>      /* The fact that the out bridge exits (for any reason) 
> > > does not mean
> > >     >>>>>       * that the original bridge should exit.  Specifically, if 
> > > the out
> > >     >>>>> ---
> > >     >>>>>
> > >     >>>>> But that doesn't seem to be related to the (!ctx->conntracked &&
> > >     >>>>> flow->ct_state & CS_TRACKED) case...
> > >     >>>>
> > >     >>>> Thank you for confirming that ctx->conntracked should have been set
> > >     >>>> in this flow state; that led me to the likely root cause and fix for
> > >     >>>> this particular use case.
> > >     >>>>
> > >     >>>> Enabling debug logging for the dpif module while initiating a new
> > >     >>>> flow, I see that the flow can enter the system as a miss upcall
> > >     >>>> without recirculation and with ct_state already set:
> > >     >>>> 
> > > 2022-05-24T10:42:57.869Z|00036|dpif(handler6)|DBG|system@ovs-system:
> > >     >>>> miss upcall:
> > >     >>>> 
> > > recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0x21),ct_zone(0),ct_mark(0),ct_label(0),ct_tuple6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,proto=58,src_port=128,dst_port=0),eth(src=fa:16:3e:81:22:4d,dst=fa:16:3e:e5:e7:b2),eth_type(0x86dd),ipv6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,label=0x357f1,proto=58,tclass=0,hlimit=64,frag=no),icmpv6(type=128,code=0)
> > >     >>>>
> > >     >>>> Looking at the code, it appears that for this state ctx->conntracked
> > >     >>>> would not be set; it is currently initialized to 'false' and only
> > >     >>>> updated based on the frozen state for recirculated packets.
> > >     >>>>
> > >     >>>> Adding this patch resolves the inconsistency we discussed above and
> > >     >>>> subsequently the ML2/OVS problem:
> > >     >>>> diff --git a/ofproto/ofproto-dpif-xlate.c 
> > > b/ofproto/ofproto-dpif-xlate.c
> > >     >>>> index 110dab0ec..7bc7426ac 100644
> > >     >>>> --- a/ofproto/ofproto-dpif-xlate.c
> > >     >>>> +++ b/ofproto/ofproto-dpif-xlate.c
> > >     >>>> @@ -7852,6 +7852,12 @@ xlate_actions(struct xlate_in *xin, struct
> > >     >>>> xlate_out *xout)
> > >     >>>>          goto exit;
> > >     >>>>      }
> > >     >>>>
> > >     >>>> +    if (!xin->frozen_state
> > >     >>>> +        && xin->flow.ct_state
> > >     >>>> +        && xin->flow.ct_state & CS_TRACKED) {
> > >     >>>> +        ctx.conntracked = true;
> > >     >>>> +    }
> > >     >>>> +
> > >     >>>>      /* Tunnel metadata in udpif format must be normalized before
> > >     >>>> translation. */
> > >     >>>>      if (flow->tunnel.flags & FLOW_TNL_F_UDPIF) {
> > >     >>>>          const struct tun_table *tun_tab = ofproto_get_tun_tab(
> > >     >>>> ---
> > >     >>>>
> > >     >>>> What do you think?
> >
> > I was ready to send the kernel patch below, but when I tried to run
> > system tests, 2 of them failed.  And then I discovered this gem:
> >
> >   c2926d6d1cfd ("system-traffic: Add ct tests using local stack.")
>
> Good catch! I only tested your proposed kernel change with the OVN
> system tests and not the OVS ones.
>
> > So, it might be that your change to set the 'conntracked' to 'true'
> > is actually the right one, since OVS system tests are expecting
> > this scenario to work...
>
> Ok, I've found a usable system test for this condition, so I'm ready
> to submit something for this bit:
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 239105e89..53af3d3da 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -6807,6 +6807,46 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep
> table=2, | OFPROTO_CLEAR_DURATION_IDLE
>  OVS_TRAFFIC_VSWITCHD_STOP
>  AT_CLEANUP
>
> +dnl This is essentially a copy of the "datapath - ping over geneve tunnel"
> +dnl test, and we use it to confirm OVS behavior when ct_state is set on
> +dnl flows by the kernel without conscious action by the OVS user space code.
> +AT_SETUP([conntrack - ct_state from outside OVS])
> +OVS_CHECK_TUNNEL_TSO()
> +OVS_CHECK_GENEVE()
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-underlay])
> +
> +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=100,ct_state=+trk,actions=ct_clear,resubmit(,0)"])
> +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=10,actions=normal"])
> +
> +ADD_NAMESPACES(at_ns0)
> +
> +dnl Set up underlay link from host into the namespace using veth pair.
> +ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
> +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
> +AT_CHECK([ip link set dev br-underlay up])
> +
> +dnl Set up tunnel endpoints on OVS outside the namespace and with a native
> +dnl linux device inside the namespace.
> +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24])
> +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
> +                  [vni 0])
> +
> +dnl First, check the underlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +dnl Okay, now check the overlay
> +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl
> +3 packets transmitted, 3 received, 0% packet loss, time 0ms
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP
> +
>  AT_BANNER([IGMP])
>
>  AT_SETUP([IGMP - flood under normal action])
> ---
>
> I should probably add something to confirm the ct_state=+trk flow rule
> is hit so this does not suddenly turn into a false positive.
>
>
> The ct state across patch ports issue is a bit more involved to test,
> so I wonder if we should use an OVN test for that? We could of course
> have OVN stage it and see if it is easy to extract something to encode
> into OVS system tests.
>
> > I'm starting to question reality here already... Will get back
> > to that issue next week.
>
> Friday afternoon can have that effect on the best of us :-)

I've pushed the first part of the fix here:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/394450.html

For the patch port issue: our previous discussion about the disconnect
between how the datapath operates and how OF operates, combined with
the fact that everyone currently appears to issue a `ct_clear` action
after a change of datapath, made me think. Perhaps the issue is that
OVS currently only clears its internal state and does not do anything
about the datapath state when "cloning" the flow over to the new
bridge/datapath?

With the first part of the fix in place, this is sufficient to fix the
OVN hairpin issue:
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 578cbfe58..cc29425b0 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -7285,6 +7285,9 @@ do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
         }

         case OFPACT_CT:
+            if (ctx->conntracked) {
+                compose_ct_clear_action(ctx);
+            }
             compose_conntrack_action(ctx, ofpact_get_CT(a), last);
             break;

---

Which makes me ask: perhaps we should just compose a ct_clear action
at the same time as the call to `clear_conntrack(ctx);` in the
`patch_port_output` function? It does of course feel riskier than the
delayed approach you proposed, but if it does not work without an
explicit ct_clear action anyway, it becomes tempting.
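
To make that concrete, a minimal sketch of what I mean (untested, hunk
context omitted; it only assumes the existing `clear_conntrack(ctx);`
call in `patch_port_output` and reuses the existing
`compose_ct_clear_action()` helper):
---
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ patch_port_output @@
-    clear_conntrack(ctx);
+    /* Clear both our internal view of the conntrack state and whatever
+     * state the datapath may have cached on the packet before it enters
+     * the peer bridge. */
+    clear_conntrack(ctx);
+    compose_ct_clear_action(ctx);
---
That way the peer bridge would always start from an untracked packet in
the datapath as well, not just in the translation context.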

-- 
Frode Nordahl

> --
> Frode Nordahl
>
> > >     >>>>
> > >     >>>
> > >     >>> Hrm.  I suspected something like this.  My guess is that the packet
> > >     >>> gets conntracked in OVS, then leaves OVS, gets routed inside the
> > >     >>> kernel, and enters OVS again while retaining the ct state from the
> > >     >>> previous run.  Can you confirm that?
> > >     >>
> > >     >> In this setup there is an OVS interface of type internal present in
> > >     >> the system.  It's essentially there to give a path into an
> > >     >> OpenStack-managed overlay network for some management daemons
> > >     >> running "outside the cloud".
> > >     >>
> > >     >> Without the patches we've been working on in this thread so far, and
> > >     >> looking at the OF flow configuration side, the packets traverse the
> > >     >> tables like this:
> > >     >> table=0, n_packets=6, n_bytes=604, priority=0 actions=resubmit(,59)
> > >     >> table=59, n_packets=6, n_bytes=604, priority=0 
> > > actions=resubmit(,60)
> > >     >> table=60, n_packets=4, n_bytes=384, priority=100,in_port="o-hm0"
> > >     >> 
> > > actions=load:0x3->NXM_NX_REG5[],load:0x1->NXM_NX_REG6[],resubmit(,71)
> > >     >> table=71, n_packets=65, n_bytes=7670, priority=110,ct_state=+trk
> > >     >> actions=ct_clear,resubmit(,71)
> > >     >>
> > >     >> Since we currently ignore the request for the ct_clear action in
> > >     >> this state, the packet will keep being resubmitted by the last rule
> > >     >> above until it hits the translation depth limit.
> > >     >>
> > >     >> The OVS kernel event tracing output for the same flow looks like this:
> > >     >>             ping-207769  [010] ..... 33681.265450: ovs_dp_upcall:
> > >     >> dpaddr=0000000053422f7e dp_name=ovs-system dev=o-hm0
> > >     >> skbaddr=00000000662a695d len=118 data_len=0 truesize=768 nr_frags=0
> > >     >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 
> > > recirc_id=0x00000000
> > >     >> keyaddr=00000000900b6d94 eth_type=0xdd86 ct_state=21 
> > > ct_orig_proto=3a
> > >     >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2805034951
> > >     >> upcall_mru=0
> > >     >>           <idle>-0       [010] ..s.. 33686.518866: ovs_dp_upcall:
> > >     >> dpaddr=0000000053422f7e dp_name=ovs-system dev=o-hm0
> > >     >> skbaddr=00000000ab9954ee len=86 data_len=0 truesize=768 nr_frags=0
> > >     >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 
> > > recirc_id=0x00000000
> > >     >> keyaddr=00000000cfad77d1 eth_type=0xdd86 ct_state=00 
> > > ct_orig_proto=00
> > >     >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2805034951
> > >     >> upcall_mru=0
> > >     >>        handler12-205602  [010] ..... 33686.519158:
> > >     >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system
> > >     >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768
> > >     >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000
> > >     >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86
> > >     >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1
> > >     >> action_type=3 action_len=12 action_data=0000000046f8c388 is_last=0
> > >     >>        handler12-205602  [010] ..... 33686.519159:
> > >     >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system
> > >     >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768
> > >     >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000
> > >     >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86
> > >     >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1
> > >     >> action_type=1 action_len=4 action_data=00000000acdfdf2a is_last=0
> > >     >>        handler12-205602  [010] ..... 33686.519197:
> > >     >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system
> > >     >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768
> > >     >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000
> > >     >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86
> > >     >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1
> > >     >> action_type=4 action_len=4 action_data=0000000069492ddd is_last=0
> > >     >>        handler12-205602  [010] ..... 33686.519197:
> > >     >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system
> > >     >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768
> > >     >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000
> > >     >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86
> > >     >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1
> > >     >> action_type=1 action_len=4 action_data=00000000e5733215 is_last=1
> > >     >>           <idle>-0       [011] ..s.. 33686.521143: ovs_dp_upcall:
> > >     >> dpaddr=0000000053422f7e dp_name=ovs-system dev=gre_sys
> > >     >> skbaddr=00000000369641d8 len=78 data_len=0 truesize=768 nr_frags=0
> > >     >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 
> > > recirc_id=0x00000000
> > >     >> keyaddr=00000000dc600580 eth_type=0xdd86 ct_state=00 
> > > ct_orig_proto=00
> > >     >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=3508503525
> > >     >> upcall_mru=0
> > >     >>        handler13-205603  [011] ..... 33686.521383:
> > >     >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system
> > >     >> dev=gre_sys skbaddr=000000007efe6e4d len=78 data_len=0 truesize=768
> > >     >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000
> > >     >> recirc_id=0x00000000 keyaddr=00000000303b841b eth_type=0xdd86
> > >     >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1
> > >     >> action_type=1 action_len=4 action_data=0000000033bb5908 is_last=1
> > >     >>
> > >     >> So to me it looks like the packet is conntracked before entering 
> > > OVS
> > >     >> the first time, and I guess we would have the same situation if it
> > >     >> took a path like you suggested above.
> > >     >>
> > >     >>> If that's the case, I think we actually have to clear the ct state
> > >     >>> when the packet first enters the kernel datapath.  Otherwise it can
> > >     >>> match on incorrect flows in the kernel cache.  Marking it as
> > >     >>> conntracked would probably go against the logic of the OF pipeline.
> > >     >>
> > >     >> We used to clear this from the OVS user space side, but I agree with
> > >     >> you that that is problematic.  This is probably why Numan was on the
> > >     >> trail of that in-kernel NF cache function.
> > >     >>
> > >     >>> Something like this (not tested):
> > >     >>>
> > >     >>> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> > >     >>> index 82a74f998966..ebea64494196 100644
> > >     >>> --- a/net/openvswitch/vport.c
> > >     >>> +++ b/net/openvswitch/vport.c
> > >     >>> @@ -451,6 +451,12 @@ int ovs_vport_receive(struct vport *vport, 
> > > struct sk_buff *skb,
> > >     >>>                 kfree_skb(skb);
> > >     >>>                 return error;
> > >     >>>         }
> > >     >>> +
> > >     >>> +       /* Packets entering OVS have to be in the untracked state.
> > >     >>> +        * Clear here to be sure that no state persists from the
> > >     >>> +        * last pass through OVS.
> > >     >>> +        */
> > >     >>> +       ovs_ct_clear(skb, &key);
> > >     >>> +
> > >     >>>         ovs_dp_process_packet(skb, &key);
> > >     >>>         return 0;
> > >     >>>  }
> > >     >>> ---
> > >     >>
> > >     >> Yep, this works:
> > >     >>             ping-3265    [011] .....    97.820835: ovs_dp_upcall:
> > >     >> dpaddr=00000000a05ba3f9 dp_name=ovs-system dev=o-hm0
> > >     >> skbaddr=000000003b1e06d0 len=118 data_len=0 truesize=768 nr_frags=0
> > >     >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 
> > > recirc_id=0x00000000
> > >     >> keyaddr=0000000077e1165d eth_type=0xdd86 ct_state=00 
> > > ct_orig_proto=00
> > >     >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2787207471
> > >     >> upcall_mru=0
> > >     >>
> > >     >> An end-to-end workload test for the system described in [4] with
> > >     >> this patched kernel is also successful.  For good measure I also ran
> > >     >> through the OVN kernel system test suite and observed no regressions
> > >     >> with this kernel.
> > >     >>
> > >     >>
> > >     >> So to sum up, I guess we have two distinct bugs here: we need your
> > >     >> proposed ovs-dpif patch to fix the OVN hairpin issue, and the kernel
> > >     >> patch for the issue of already-conntracked packets entering the
> > >     >> system, which affects ML2/OVS.
> > >     >
> > >     > If we agree on how to proceed I'd be happy to help turn this into
> > >     > formal patches with tests and appropriate attribution etc.  Let me
> > >     > know what you think and how you would prefer we split the work of
> > >     > bringing this to a resolution.
> > >     >
> > >
> > >     I'm working on a good commit message for the kernel piece above
> > >     (ovs_vport_receive), will send it soon.
> > >
> > >
> > > Great, thank you for doing that!
> > >
> > >     If you can help with stitching all the pieces for the OVS userspace
> > >     part and writing tests for it, that would be great.
> > >
> > >
> > > I will work on this part and hopefully submit tomorrow.
> > >
> > > --
> > > Frode Nordahl
> > >
> > >     Best regards, Ilya Maximets.
> > >
> >