On Fri, May 27, 2022 at 10:04 PM Ilya Maximets <i.maxim...@ovn.org> wrote: > > On 5/26/22 14:53, Frode Nordahl wrote: > > > > > > tor. 26. mai 2022, 14:45 skrev Ilya Maximets <i.maxim...@ovn.org > > <mailto:i.maxim...@ovn.org>>: > > > > On 5/26/22 13:00, Frode Nordahl wrote: > > > On Wed, May 25, 2022 at 9:55 AM Frode Nordahl > > > <frode.nord...@canonical.com <mailto:frode.nord...@canonical.com>> > > wrote: > > >> > > >> On Tue, May 24, 2022 at 1:32 PM Ilya Maximets <i.maxim...@ovn.org > > <mailto:i.maxim...@ovn.org>> wrote: > > >>> > > >>> On 5/24/22 12:54, Frode Nordahl wrote: > > >>>> On Mon, May 23, 2022 at 3:49 PM Ilya Maximets <i.maxim...@ovn.org > > <mailto:i.maxim...@ovn.org>> wrote: > > >>>>> > > >>>>> On 5/21/22 12:49, Frode Nordahl wrote: > > >>>>>> On Thu, May 19, 2022 at 3:39 PM Frode Nordahl > > >>>>>> <frode.nord...@canonical.com > > <mailto:frode.nord...@canonical.com>> wrote: > > >>>>>>> > > >>>>>>> On Sat, May 14, 2022 at 2:10 AM Ilya Maximets > > <i.maxim...@ovn.org <mailto:i.maxim...@ovn.org>> wrote: > > >>>>>>>> > > >>>>>>>> On 5/13/22 10:36, Frode Nordahl wrote: > > >>>>>>>>> On Fri, Mar 11, 2022 at 2:04 PM Liam Young > > <liam.yo...@canonical.com <mailto:liam.yo...@canonical.com>> wrote: > > >>>>>>>>>> > > >>>>>>>>>> Hi, > > >>>>>>>>>> > > >>>>>>>>>> <tl;dr> Commit 355fef6f2 seems to break connectivity in my > > setup</tl;dr> > > >>>>>>>>> > > >>>>>>>>> After OVN started colocating sNAT and dNAT operations in the > > same CT > > >>>>>>>>> zone [0] the above mentioned OVS commit appears to also break > > OVN [1]. > > >>>>>>>>> I have been discussing with Numan and he has found a > > correlation with > > >>>>>>>>> the behavior of the open vswitch kernel data path conntrack > > >>>>>>>>> `skb_nfct_cached` function, i.e. if that is changed to always > > return > > >>>>>>>>> 'false' the erratic behavior disappears. > > >>>>>>>>> > > >>>>>>>>> So I guess the question then becomes, is this a OVS userspace > > or OVS > > >>>>>>>>> kernel problem? > > >>>>>>>>> > > >>>>>>>>> We have a reproducer in [3]. > > >>>>>>>>> > > >>>>>>>>> 0: > > https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a > > > > <https://github.com/ovn-org/ovn/commit/4deac4509abbedd6ffaecf27eed01ddefccea40a> > > >>>>>>>>> 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856 > > <https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856> > > >>>>>>>>> 2: > > https://elixir.bootlin.com/linux/latest/source/net/openvswitch/conntrack.c#L683 > > > > <https://elixir.bootlin.com/linux/latest/source/net/openvswitch/conntrack.c#L683> > > >>>>>>>>> 3: > > https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6 > > <https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856/comments/6> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> Hrm. I think, there is a logical bug in implementations of > > ct() > > >>>>>>>> datapath action in both datapaths. > > >>>>>>>> > > >>>>>>>> The problem appears when the OpenFlow pipeline for the packet > > contains > > >>>>>>>> several ct() actions. NXAST_CT action (i.e. every ct() > > action) must > > >>>>>>>> always put the packet into an untracked state. Tracked state > > will be > > >>>>>>>> set only in the 'recirc_table' where the packet is cloned by > > the ct() > > >>>>>>>> action for further processing. 
> > >>>>>>>> > > >>>>>>>> If an OF pipeline looks like this: > > >>>>>>>> > > >>>>>>>> actions=ct(),something,something,ct(),something > > >>>>>>>> > > >>>>>>>> each action must be entered with the 'untracked' conntrack > > state in > > >>>>>>>> the packet metadata. > > >>>>>>>> > > >>>>>>>> However, ct() action in the datapath, unlike OpenFlow, doesn't > > work > > >>>>>>>> this way. It modifies the conntrack state for the packet it > > is processing. > > >>>>>>>> During the flow translation OVS inserts recirculation right > > after the > > >>>>>>>> datapath ct() action. This way conntrack state affects only > > processing > > >>>>>>>> after recirculation. Sort of. > > >>>>>>>> > > >>>>>>>> The issue is that not all OpenFlow ct() actions have > > recirc_table > > >>>>>>>> specified. These actions supposed to change the state of the > > conntrack > > >>>>>>>> subsystem, but they still must leave the packet itself in the > > untracked > > >>>>>>>> state with no strings attached. But since datapath ct() > > actions doesn't > > >>>>>>>> work that way and since recirculation is not inserted (no > > recirc_table > > >>>>>>>> specified), packet after conntrack execution carries the state > > along to > > >>>>>>>> all other actions. > > >>>>>>>> This doesn't impact normal actions, but seems to break > > subsequent ct() > > >>>>>>>> actions on the same pipeline. > > >>>>>>>> > > >>>>>>>> In general, it looks to me that ct() action in the datapath is > > >>>>>>>> not supposed to cache any connection data inside the packet's > > metadata. > > >>>>>>>> This seems to be the root cause of the problem. Fields that > > OF knows > > >>>>>>>> about should not trigger any issues if carried along with a > > packet, > > >>>>>>>> but datapath-specific metadata can not be cleared, because > > OFproto > > >>>>>>>> simply doesn't know about it. > > >>>>>>>> > > >>>>>>>> But, I guess, not caching the connection and other internal > > structures > > >>>>>>>> might significantly impact datapath performance. Will it? > > >>>>>>>> > > >>>>>>>> Looking at the reproducer in [3], it has, for example, the > > flow with the > > >>>>>>>> following actions: > > >>>>>>>> > > >>>>>>>> actions:ct(commit,zone=8,mark=0/0x1,nat(src)), > > >>>>>>>> > > set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)), > > >>>>>>>> set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)), > > >>>>>>>> ct(zone=8),recirc(0x4) > > >>>>>>>> > > >>>>>>>> So, if the first ct() will change some datapath-specific > > packet metadata, > > >>>>>>>> the second ct() may try to use that information. > > >>>>>>>> > > >>>>>>>> It looks like during the flow translation we must add ct_clear > > action > > >>>>>>>> explicitly after every ct() action, unless it was the last > > action in > > >>>>>>>> the list. This will end up with adding a lot of ct_clear > > actions > > >>>>>>>> everywhere. > > >>>>>>>> > > >>>>>>>> Another option is the patch below that tries to track if > > ct_clear > > >>>>>>>> is required and adds that action before the next ct() action in > > >>>>>>>> the datapath. > > >>>>>>>> > > >>>>>>>> BTW, the test [3] fails with both kernel and userspace > > datapaths. > > >>>>>>>> > > >>>>>>>> The change below should fix the problem, I think. > > >>>>>>>> It looks like we also have to put ct_clear action before > > translating > > >>>>>>>> output to the patch port if we're in 'conntracked' state. > > >>>>>>>> > > >>>>>>>> And I do not know how to fix the problem for kernels without > > ct_clear > > >>>>>>>> support. 
I don't think we can clear metadata that is internal > > to > > >>>>>>>> the kernel. Ideas are welcome. > > >>>>>>>> > > >>>>>>>> CC: Aaron, Paolo. > > >>>>>>>> > > >>>>>>>> Any thoughts? > > >>>>>>> > > >>>>>>> Thank you so much for the detailed explanation of what's going > > on and > > >>>>>>> for providing a proposed fix. > > >>>>>>> > > >>>>>>> I took it for a spin and it does indeed appear to fix the OVN > > hairpin > > >>>>>>> issue, but it does unfortunately not appear to fix the issue > > Liam > > >>>>>>> reported for the ML2/OVS use case [4]. > > >>>>>>> > > >>>>>>> When trying that use case with your patch I see this in the > > Open vSwitch log: > > >>>>>>> > > 2022-05-19T08:17:02.668Z|00001|ofproto_dpif_xlate(handler1)|WARN|over > > >>>>>>> max translation depth 64 on bridge br-int while processing > > >>>>>>> > > ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=3,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0 > > >>>>>>> > > 2022-05-19T08:17:02.668Z|00002|ofproto_dpif_upcall(handler1)|WARN|Dropped > > >>>>>>> 2 log messages in last 205 seconds (most recently, 187 seconds > > ago) > > >>>>>>> due to excessive rate > > >>>>>>> > > 2022-05-19T08:17:02.668Z|00003|ofproto_dpif_upcall(handler1)|WARN|Flow: > > >>>>>>> > > ct_state=new|trk,ct_ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ct_ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ct_nw_proto=58,ct_tp_src=128,ct_tp_dst=0,icmp6,in_port=5,vlan_tci=0x0000,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,ipv6_src=fc00:24:159c:555b:f816:3eff:fe8b:3ad0,ipv6_dst=fc00:24:159c:555b:f816:3eff:fe8f:9302,ipv6_label=0xac5db,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=128,icmp_code=0 > > >>>>>>> > > >>>>>>> bridge("br-int") > > >>>>>>> ---------------- > > >>>>>>> 0. priority 0, cookie 0x56d6d58c018b51bd > > >>>>>>> goto_table:60 > > >>>>>>> 60. priority 3, cookie 0x56d6d58c018b51bd > > >>>>>>> NORMAL > > >>>>>>> >>>> received packet on unknown port 5 <<<< > > >>>>>>> >> no input bundle, dropping > > >>>>>>> > > >>>>>>> Final flow: unchanged > > >>>>>>> Megaflow: > > recirc_id=0,eth,ipv6,in_port=5,dl_src=fa:16:3e:8b:3a:d0,dl_dst=fa:16:3e:8f:93:02,nw_frag=no > > >>>>>>> Datapath actions: drop > > >>>>>>> > > >>>>>>> > > >>>>>>> On the back of your explanation and the log output I made an > > attempt > > >>>>>>> at adding this patch on top of yours: > > >>>>>>> --- > > >>>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c > > b/ofproto/ofproto-dpif-xlate.c > > >>>>>>> index 110dab0ec..2955e8e4d 100644 > > >>>>>>> --- a/ofproto/ofproto-dpif-xlate.c > > >>>>>>> +++ b/ofproto/ofproto-dpif-xlate.c > > >>>>>>> @@ -7144,6 +7144,7 @@ do_xlate_actions(const struct ofpact > > *ofpacts, > > >>>>>>> size_t ofpacts_len, > > >>>>>>> * resubmit to the frozen actions. > > >>>>>>> */ > > >>>>>>> case OFPACT_RESUBMIT: > > >>>>>>> + ctx->pending_ct_clear = true; > > >>>>>>> xlate_ofpact_resubmit(ctx, ofpact_get_RESUBMIT(a), > > last); > > >>>>>>> continue; > > >>>>>>> case OFPACT_GOTO_TABLE: > > >>>>>>> --- > > >>>>>>> > > >>>>>>> And the two patches together do appear to resolve the issue > > reported > > >>>>>>> in [4] as well as the OVN hairpin issue [1]. 
It does however > > make a > > >>>>>>> couple of tests fail, so I need to look into if that is > > expected from > > >>>>>>> the change or if the approach must be changed. > > >>>>>>> > > >>>>>>> 4: https://bugs.launchpad.net/openvswitch/+bug/1964117 > > <https://bugs.launchpad.net/openvswitch/+bug/1964117> > > >>>>>> > > >>>>>> The following patch also works and does not cause any tests to > > fail. > > >>>>>> > > >>>>>> --- > > >>>>>> diff --git a/ofproto/ofproto-dpif-xlate.c > > b/ofproto/ofproto-dpif-xlate.c > > >>>>>> index 110dab0ec..905f6994d 100644 > > >>>>>> --- a/ofproto/ofproto-dpif-xlate.c > > >>>>>> +++ b/ofproto/ofproto-dpif-xlate.c > > >>>>>> @@ -7354,7 +7354,9 @@ do_xlate_actions(const struct ofpact > > *ofpacts, > > >>>>>> size_t ofpacts_len, > > >>>>>> break; > > >>>>>> > > >>>>>> case OFPACT_CT_CLEAR: > > >>>>>> - if (ctx->conntracked || ctx->pending_ct_clear) { > > >>>>>> + if (ctx->conntracked || ctx->pending_ct_clear > > >>>>>> + || (flow->ct_state && flow->ct_state & > > CS_TRACKED)) > > >>>>>> + { > > >>>>>> compose_ct_clear_action(ctx); > > >>>>>> } > > >>>>>> break; > > >>>>>> --- > > >>>>>> > > >>>>>> The OpenStack ML2/OVS Open vSwitch firewall driver appears to > > make > > >>>>>> explicit use of ct_clear action as part of progressing packets > > through > > >>>>>> the tables and the optimization in 355fef6f2 interrupts that > > workflow. > > >>>>>> > > >>>>>> I am not quite sure if the definition of `struct > > >>>>>> xlate_ctx->conntracked` is "this flow has been exposed to > > conntrack" > > >>>>>> or if it is "conntrack sees this flow as established", if anyone > > can > > >>>>>> shed light on that we would know if the above patch is ok as is > > or if > > >>>>>> this flow state should have set struct xlate_ctx->conntracked to > > true > > >>>>>> in the first place. > > >>>>> > > >>>>> I beleive that ctx->conntracked should men that "this flow has > > been exposed > > >>>>> to conntrack". And it looks like the condition: > > >>>>> (!ctx->conntracked && flow->ct_state & CS_TRACKED) > > >>>>> should not be possible. We're, probably, missing > > clear_conntrack() call > > >>>>> somewhere. > > >>>>> > > >>>>> One thing I seem to miss in my patch is the conntrack clear after > > returning > > >>>>> from the patch port processing: > > >>>>> > > >>>>> diff --git a/ofproto/ofproto-dpif-xlate.c > > b/ofproto/ofproto-dpif-xlate.c > > >>>>> index 110dab0ec..53d4f78b2 100644 > > >>>>> --- a/ofproto/ofproto-dpif-xlate.c > > >>>>> +++ b/ofproto/ofproto-dpif-xlate.c > > >>>>> @@ -3955,8 +3955,13 @@ patch_port_output(struct xlate_ctx *ctx, > > const struct xport *in_dev, > > >>>>> > > >>>>> /* The out bridge's conntrack execution should have no > > effect on the > > >>>>> * original bridge. */ > > >>>>> - ctx->conntracked = old_conntrack; > > >>>>> ctx->pending_ct_clear = old_ct_clear; > > >>>>> + if (ctx->conntracked) { > > >>>>> + /* Conntrack was involved in the other bridge. We need > > to clear > > >>>>> + * whatever information was cached in the datapath. */ > > >>>>> + ctx->pending_ct_clear = true; > > >>>>> + } > > >>>>> + ctx->conntracked = old_conntrack; > > >>>>> > > >>>>> /* The fact that the out bridge exits (for any reason) does > > not mean > > >>>>> * that the original bridge should exit. Specifically, if > > the out > > >>>>> --- > > >>>>> > > >>>>> But that doesn't seem to be related to the (!ctx->conntracked && > > >>>>> flow->ct_state & CS_TRACKED) case... 
> > >>>> > > >>>> Thank you for confirming that ctx->conntracked should have been > > set in > > >>>> this flow state, that led me onto the possible root and fix of this > > >>>> particular use case. > > >>>> > > >>>> Enabling debug log for dpif module while initiating a new flow I > > see > > >>>> that the flow can enter the system as a miss upcall without > > >>>> recirculation and with ct_state set: > > >>>> > > 2022-05-24T10:42:57.869Z|00036|dpif(handler6)|DBG|system@ovs-system: > > >>>> miss upcall: > > >>>> > > recirc_id(0),dp_hash(0),skb_priority(0),in_port(6),skb_mark(0),ct_state(0x21),ct_zone(0),ct_mark(0),ct_label(0),ct_tuple6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,proto=58,src_port=128,dst_port=0),eth(src=fa:16:3e:81:22:4d,dst=fa:16:3e:e5:e7:b2),eth_type(0x86dd),ipv6(src=fc00:a3f0:723f:7bc4:f816:3eff:fe81:224d,dst=fc00:a3f0:723f:7bc4:f816:3eff:fee5:e7b2,label=0x357f1,proto=58,tclass=0,hlimit=64,frag=no),icmpv6(type=128,code=0) > > >>>> > > >>>> Looking at the code it appears that for this state the > > >>>> ctx->conntracked would not be set, it is currently initialized to > > >>>> 'false' and only updated based on frozen state for recirculated > > >>>> packets. > > >>>> > > >>>> Adding this patch resolves inconsistency we discussed above and > > >>>> subsequently the ML2/OVS problem: > > >>>> diff --git a/ofproto/ofproto-dpif-xlate.c > > b/ofproto/ofproto-dpif-xlate.c > > >>>> index 110dab0ec..7bc7426ac 100644 > > >>>> --- a/ofproto/ofproto-dpif-xlate.c > > >>>> +++ b/ofproto/ofproto-dpif-xlate.c > > >>>> @@ -7852,6 +7852,12 @@ xlate_actions(struct xlate_in *xin, struct > > >>>> xlate_out *xout) > > >>>> goto exit; > > >>>> } > > >>>> > > >>>> + if (!xin->frozen_state > > >>>> + && xin->flow.ct_state > > >>>> + && xin->flow.ct_state & CS_TRACKED) { > > >>>> + ctx.conntracked = true; > > >>>> + } > > >>>> + > > >>>> /* Tunnel metadata in udpif format must be normalized before > > >>>> translation. */ > > >>>> if (flow->tunnel.flags & FLOW_TNL_F_UDPIF) { > > >>>> const struct tun_table *tun_tab = ofproto_get_tun_tab( > > >>>> --- > > >>>> > > >>>> What do you think? > > I was ready to send the kernel patch below, but when I tried to run > system tests, 2 of them failed. And then I discovered this gem: > > c2926d6d1cfd ("system-traffic: Add ct tests using local stack.")
Good catch! I only tested your proposed kernel change with the OVN system tests and not the OVS ones. > So, it might be that your change to set the 'conntracked' to 'true' > is actually the right one, since OVS system tests are expecting > this scenario to work... Ok, I've found a usable system test for this condition, so I'm ready to submit something for this bit: diff --git a/tests/system-traffic.at b/tests/system-traffic.at index 239105e89..53af3d3da 100644 --- a/tests/system-traffic.at +++ b/tests/system-traffic.at @@ -6807,6 +6807,46 @@ AT_CHECK([ovs-ofctl dump-flows br0 | grep table=2, | OFPROTO_CLEAR_DURATION_IDLE OVS_TRAFFIC_VSWITCHD_STOP AT_CLEANUP +dnl This is essentially a copy of the "datapath - ping over geneve tunnel" +dnl test, and we use it to confirm OVS behavior when ct_state is set on +dnl flows by the kernel without conscious action by the OVS user space code. +AT_SETUP([conntrack - ct_state from outside OVS]) +OVS_CHECK_TUNNEL_TSO() +OVS_CHECK_GENEVE() + +OVS_TRAFFIC_VSWITCHD_START() +ADD_BR([br-underlay]) + +AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"]) +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=100,ct_state=+trk,actions=ct_clear,resubmit(,0)"]) +AT_CHECK([ovs-ofctl add-flow br-underlay "priority=10,actions=normal"]) + +ADD_NAMESPACES(at_ns0) + +dnl Set up underlay link from host into the namespace using veth pair. +ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24") +AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"]) +AT_CHECK([ip link set dev br-underlay up]) + +dnl Set up tunnel endpoints on OVS outside the namespace and with a native +dnl linux device inside the namespace. +ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24]) +ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24], + [vni 0]) + +dnl First, check the underlay +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl +3 packets transmitted, 3 received, 0% packet loss, time 0ms +]) + +dnl Okay, now check the overlay +NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.100 | FORMAT_PING], [0], [dnl +3 packets transmitted, 3 received, 0% packet loss, time 0ms +]) + +OVS_TRAFFIC_VSWITCHD_STOP +AT_CLEANUP + AT_BANNER([IGMP]) AT_SETUP([IGMP - flood under normal action]) --- I should probably add something to confirm the ct_state=+trk flow rule is hit so this does not suddenly turn into a false positive. The ct state across patch ports issue is a bit more involved to test, so I wonder if we should use a OVN test for that? We could of course have OVN stage it and see if it is easy to extract something to encode into OVS system tests. > I'm starting to questioning reality here already... Will get back > to that issue next week. Friday afternoon can have that effect on the best of us :-) -- Frode Nordahl > > >>>> > > >>> > > >>> Hrm. I suspected something like this. My guess is that packet > > >>> gets conntracked in OVS, then leaves OVS, gets routed inside the > > >>> kernel and enters OVS again while retaining the ct state from the > > >>> previous run. Can you confirm that? > > >> > > >> In this setup there is an interface present in the system attached to > > >> OVS using an interface of type internal. It's essentially there to > > >> give a path into an OpenStack managed overlay network for some > > >> management daemons running "outside the cloud". 
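A quick aside on why traffic from such an internal port can already look conntracked to OVS: if the host stack has run the packet through netfilter conntrack before it reaches the datapath (which appears to be the case here), the skb still carries its nf_conn reference, and flow key extraction turns that into ct_state bits. The sketch below is a simplified illustration of that translation, loosely modelled on ovs_ct_update_key()/ovs_ct_fill_key() in net/openvswitch/conntrack.c; the helper name and the reduced set of ctinfo mappings are illustrative only, not the exact upstream code.

---
#include <linux/openvswitch.h>              /* OVS_CS_F_* bits */
#include <linux/skbuff.h>
#include <net/netfilter/nf_conntrack.h>     /* nf_ct_get(), struct nf_conn */

/* Simplified illustration: derive ct_state flow-key bits from conntrack
 * state already attached to the skb.  A packet routed through the host's
 * netfilter before entering an OVS internal port arrives with a valid
 * nf_conn, which is why the miss upcall earlier in this thread shows
 * ct_state(0x21), i.e. new|trk, at recirc_id(0) before any OVS ct()
 * action has run. */
static u8 sketch_ct_state_from_skb(const struct sk_buff *skb)
{
        enum ip_conntrack_info ctinfo;
        struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
        u8 state = 0;

        if (ct) {
                state = OVS_CS_F_TRACKED;
                if (ctinfo == IP_CT_NEW)
                        state |= OVS_CS_F_NEW;
                else if (ctinfo == IP_CT_ESTABLISHED)
                        state |= OVS_CS_F_ESTABLISHED;
        }
        return state;
}
---

That leftover state is exactly what the ovs_ct_clear() call proposed for ovs_vport_receive() further down in this thread wipes before the packet enters the pipeline.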
> > >> > > >> Without the patches we've been working on in this thread so far, > > >> looking at the OF flow configuration side the packets traverse the > > >> tables like this: > > >> table=0, n_packets=6, n_bytes=604, priority=0 actions=resubmit(,59) > > >> table=59, n_packets=6, n_bytes=604, priority=0 actions=resubmit(,60) > > >> table=60, n_packets=4, n_bytes=384, priority=100,in_port="o-hm0" > > >> actions=load:0x3->NXM_NX_REG5[],load:0x1->NXM_NX_REG6[],resubmit(,71) > > >> table=71, n_packets=65, n_bytes=7670, priority=110,ct_state=+trk > > >> actions=ct_clear,resubmit(,71) > > >> > > >> Since we currently ignore the request for the ct_clear action in this > > >> state the packet will be resubmitted until hitting the limit in the > > >> last rule above. > > >> > > >> The OVS kernel event tracing output from the same flow look like > > this: > > >> ping-207769 [010] ..... 33681.265450: ovs_dp_upcall: > > >> dpaddr=0000000053422f7e dp_name=ovs-system dev=o-hm0 > > >> skbaddr=00000000662a695d len=118 data_len=0 truesize=768 nr_frags=0 > > >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 recirc_id=0x00000000 > > >> keyaddr=00000000900b6d94 eth_type=0xdd86 ct_state=21 ct_orig_proto=3a > > >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2805034951 > > >> upcall_mru=0 > > >> <idle>-0 [010] ..s.. 33686.518866: ovs_dp_upcall: > > >> dpaddr=0000000053422f7e dp_name=ovs-system dev=o-hm0 > > >> skbaddr=00000000ab9954ee len=86 data_len=0 truesize=768 nr_frags=0 > > >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 recirc_id=0x00000000 > > >> keyaddr=00000000cfad77d1 eth_type=0xdd86 ct_state=00 ct_orig_proto=00 > > >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2805034951 > > >> upcall_mru=0 > > >> handler12-205602 [010] ..... 33686.519158: > > >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system > > >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768 > > >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 > > >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86 > > >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1 > > >> action_type=3 action_len=12 action_data=0000000046f8c388 is_last=0 > > >> handler12-205602 [010] ..... 33686.519159: > > >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system > > >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768 > > >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 > > >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86 > > >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1 > > >> action_type=1 action_len=4 action_data=00000000acdfdf2a is_last=0 > > >> handler12-205602 [010] ..... 33686.519197: > > >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system > > >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768 > > >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 > > >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86 > > >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1 > > >> action_type=4 action_len=4 action_data=0000000069492ddd is_last=0 > > >> handler12-205602 [010] ..... 
33686.519197: > > >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system > > >> dev=o-hm0 skbaddr=00000000662a695d len=86 data_len=0 truesize=768 > > >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 > > >> recirc_id=0x00000000 keyaddr=0000000008b47a0c eth_type=0xdd86 > > >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1 > > >> action_type=1 action_len=4 action_data=00000000e5733215 is_last=1 > > >> <idle>-0 [011] ..s.. 33686.521143: ovs_dp_upcall: > > >> dpaddr=0000000053422f7e dp_name=ovs-system dev=gre_sys > > >> skbaddr=00000000369641d8 len=78 data_len=0 truesize=768 nr_frags=0 > > >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 recirc_id=0x00000000 > > >> keyaddr=00000000dc600580 eth_type=0xdd86 ct_state=00 ct_orig_proto=00 > > >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=3508503525 > > >> upcall_mru=0 > > >> handler13-205603 [011] ..... 33686.521383: > > >> ovs_do_execute_action: dpaddr=0000000053422f7e dp_name=ovs-system > > >> dev=gre_sys skbaddr=000000007efe6e4d len=78 data_len=0 truesize=768 > > >> nr_frags=0 gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 > > >> recirc_id=0x00000000 keyaddr=00000000303b841b eth_type=0xdd86 > > >> ct_state=00 ct_orig_proto=00 ct_Zone=0000 flow_key_valid=1 > > >> action_type=1 action_len=4 action_data=0000000033bb5908 is_last=1 > > >> > > >> So to me it looks like the packet is conntracked before entering OVS > > >> the first time, and I guess we would have the same situation if it > > >> took a path like you suggested above. > > >> > > >>> If that's the case, I think, we actually have to clear ct state > > >>> while packet is first entering the kernel datapath. Otherwise > > >>> it can match on incorrect flows in the kenrel cache. Marking > > >>> it as conntracked will probably be against the logic of the OF > > >>> pipeline. > > >> > > >> We did use to clear this from the OVS user space side, but I agree > > >> with you that that is problematic. This is probably why Numan was on > > >> the track of that in-kernel NF cache function. > > >> > > >>> Something like this (not tested): > > >>> > > >>> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c > > >>> index 82a74f998966..ebea64494196 100644 > > >>> --- a/net/openvswitch/vport.c > > >>> +++ b/net/openvswitch/vport.c > > >>> @@ -451,6 +451,12 @@ int ovs_vport_receive(struct vport *vport, > > struct sk_buff *skb, > > >>> kfree_skb(skb); > > >>> return error; > > >>> } > > >>> + > > >>> + /* Packets entering OVS has to be in untracked state. > > Clearing here > > >>> + * to be sure that no state persisted from the last pass > > through OVS. > > >>> + */ > > >>> + ovs_ct_clear(skb, &key); > > >>> + > > >>> ovs_dp_process_packet(skb, &key); > > >>> return 0; > > >>> } > > >>> --- > > >> > > >> Yep, this works: > > >> ping-3265 [011] ..... 97.820835: ovs_dp_upcall: > > >> dpaddr=00000000a05ba3f9 dp_name=ovs-system dev=o-hm0 > > >> skbaddr=000000003b1e06d0 len=118 data_len=0 truesize=768 nr_frags=0 > > >> gso_size=0 gso_type=0x0 ovs_flow_hash=0x00000000 recirc_id=0x00000000 > > >> keyaddr=0000000077e1165d eth_type=0xdd86 ct_state=00 ct_orig_proto=00 > > >> ct_zone=0000 flow_key_valid=1 upcall_cmd=1 upcall_port=2787207471 > > >> upcall_mru=0 > > >> > > >> A end to end workload test for the system described in [4] with this > > >> patched kernel is also successful. For good measure I also ran > > through > > >> the OVN kernel system testsuite and observed no regressions with this > > >> kernel. 
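A small decoding note for the traces above, since the hex values are easy to gloss over: ct_state(0x21) in the original miss upcall is just the tracked and new bits together, i.e. the same new|trk state seen in the ofproto translation log, while the patched kernel now reports ct_state=00. A minimal standalone check, assuming the OVS_CS_F_*/CS_* bit values from the kernel uAPI (0x01 for new, 0x20 for tracked):

---
#include <stdint.h>
#include <stdio.h>

/* Local redefinitions for the example; the numeric values mirror the
 * OVS_CS_F_* flags in include/uapi/linux/openvswitch.h, which OVS user
 * space mirrors as CS_*. */
#define EXAMPLE_CS_NEW      0x01
#define EXAMPLE_CS_TRACKED  0x20

int main(void)
{
    uint8_t before = 0x21; /* ct_state from the unpatched-kernel upcall */
    uint8_t after  = 0x00; /* ct_state once ovs_vport_receive() clears it */

    printf("before: tracked=%d new=%d\n",
           !!(before & EXAMPLE_CS_TRACKED), !!(before & EXAMPLE_CS_NEW));
    printf("after:  tracked=%d new=%d\n",
           !!(after & EXAMPLE_CS_TRACKED), !!(after & EXAMPLE_CS_NEW));
    return 0;
}
---

The tracked bit here is the one the earlier `flow->ct_state & CS_TRACKED` check in ofproto-dpif-xlate.c keys off.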
> > >> > > >> So to sum up, I guess we have two distinct bugs here and we need your > > >> proposed patch for ovs-dpif to fix the OVN hairpin issue, and the > > >> kernel patch for the conntracked packet entering the system issue > > that > > >> affects ML2/OVS. > > > > > > If we agree on how to proceed I'd be happy to help put this into > > > formal patches with tests and appropriate accreditation etc. Let me > > > know what you think and how you would prefer we can split the work of > > > bringing this to a resolution. > > > > > > > I'm working on a good commit message for the kernel piece above > > (ovs_vport_receive), will send it soon. > > > > > > Great, thank you for doing that! > > > > If you can help with stitching all the pieces for the OVS userspace > > part and writing tests for it, that would be great. > > > > > > I will work on this part and hopefully submit tomorrow. > > > > -- > > Frode Nordahl > > > > Best regards, Ilya Maximets.