Ilya, hi!

Last time I didn't figure out the issue I wrote about in this thread, but I really want to put an end to it. To remind you what I wrote: I applied my patches to version 24.03; these patches shift table numbers in northd. After that, the OVN userspace tests 79-82 (ECMP symmetric reply) started failing. The tests failed because they expect to see the conntrack state ct_state(+new-est-rpl+trk), but after my patches I see ct_state(+new-rpl+trk) in the datapath flow.

I understand why the userspace test was failing: you said it was all due to the classifier's behavior. I figured it out back then and it all made sense. I changed the test to expect ct_state(+new-rpl+trk). But after that, the test started failing in the kernel datapath case. In your last message you said it was likely due to supported features, but I compared the supported features, first in the controller logs and then in OpenFlow, and they are identical.

I did some investigation, without success. I set up several machines with different distributions. Initially, I expected that the test failure might depend on the kernel version. I tried different kernels and ended up with the following result: the test only fails on Ubuntu 24, the same distribution that is used in the CI tests on GitHub (it has a 6+ kernel installed). With the same kernel version, the test does not fail on CentOS 9, Fedora, or Ubuntu 22. I checked that the issue is not related to the GCC version. I compared the kernel configs and didn't see anything significant there. I also compared the kernel module support.

I understand that the issue is likely environmental, possibly related to the version of some libraries or system packages. But I'm completely out of ideas on where to look next. Maybe you could suggest where else I might investigate to get to the bottom of this test failure?
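For reference, the classifier behavior mentioned above (a higher-priority rule that matches on +est causing -est to appear in the datapath flow) can be illustrated with a toy Python model. This is only a sketch under heavy assumptions, not the OVS classifier code: the `Rule` class, the `lookup` and `ct_state` helpers, the field names, and the priority-200 rule are all hypothetical, and the real classifier works with subtables and staged lookups rather than a per-rule scan.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int
    match: dict  # field name -> required value (hypothetical toy fields)


def lookup(table, packet):
    """Toy lookup: return (matching rule or None, datapath-flow match).

    Assumption: every field of every rule that had to be examined is
    un-wildcarded, i.e. copied into the datapath match with the packet's
    own value. Lower priorities are never examined after a hit.
    """
    dp_match = {}
    for prio in sorted({r.priority for r in table}, reverse=True):
        for rule in (r for r in table if r.priority == prio):
            for field in rule.match:
                dp_match[field] = packet[field]
            if all(packet[f] == v for f, v in rule.match.items()):
                return rule, dp_match
    return None, dp_match


def ct_state(dp_match):
    """Render the conntrack flags present in the match, e.g. '+new-rpl+trk'."""
    flags = ("new", "est", "rpl", "trk")
    return "".join(("+" if dp_match[f] else "-") + f
                   for f in flags if f in dp_match)


# A new, tracked, non-reply packet in the router datapath (metadata=1).
packet = {"metadata": 1, "new": 1, "est": 0, "rpl": 0, "trk": 1}

# ECMP-like rule for the router, modeled on the priority-100 rule in the thread.
ecmp = Rule(100, {"metadata": 1, "new": 1, "rpl": 0, "trk": 1})

# Hypothetical higher-priority rule of another datapath sharing the table.
other = Rule(200, {"metadata": 2, "est": 1, "trk": 1})

print(ct_state(lookup([ecmp], packet)[1]))
print(ct_state(lookup([other, ecmp], packet)[1]))
```

In this model the packet hits the same priority-100 rule in both tables, so the actions are identical. Only the datapath match differs: the shared table adds -est because the unrelated priority-200 rule had to be ruled out first.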

On 18.06.2025 21:41, Ilya Maximets wrote:
> On 5/26/25 3:45 PM, Rukomoinikova Aleksandra wrote:
>> Ilya, hi! Thank you for your help and sorry for the long reply.
>>
>> Generally, I figured out the classifier code and understood what you
>> meant, but I noticed that when testing with a kernel module, there's a
>> difference in the way the classifier works, or something similar. In the
>> test I was having trouble with, the expected conntrack state is:
>> +new-est-rpl+trk. After my changes, it turns out to be: +new-rpl+trk. I
>> corrected the test to expect +new-rpl+trk, but in tests with the OVS
>> kernel datapath(check kernel), the old conntrack state is expected:
>> +new-est-rpl+trk. I couldn't figure out the reason for this behavior,
>> can you suggest where it would be better to look and what it could be
>> connected with? Thanks!
> Sorry, lost track of this discussion for some time. If you run the same
> test with 'make check-kernel' and the 'make check-system-userspace', do
> you see different conntrack states? There might be some slight difference
> in what kind of OpenFlow rules ovn-controller generates depending on the
> OVS datapath type (features supported).
>
>> On 19.05.2025 23:14, Ilya Maximets wrote:
>>> On 5/16/25 11:16 AM, Rukomoinikova Aleksandra wrote:
>>>> Hi!
>>>>
>>>> We've encountered a strange issue while backporting patches to the
>>>> version 24.03 branch (ovs v3.3.4) and running tests. Let me describe the
>>>> situation:
>>>> I took the upstream branch 24.03, added a stage at the beginning of the
>>>> switch pipeline, and added a 'match all' flow with 'next;' action.
>>>> Commit example:
>>>> https://github.com/Sashhkaa/ovn/commit/f20295315c327addfeb6fe455c3b3c655d6b3666.
>>>> After this change, OVN 79-82 userspace tests (ECMP symmetric reply)
>>>> started failing.
>>>> According to the test logs, I see the following:
>>>> The test expects to see the conntrack state ct_state(+new-est-rpl+trk)
>>>> in the datapath flow, but gets ct_state(+new-rpl+trk) - that is, -est
>>>> disappears. I will also attach more detailed dumps below.
>>> In general, the extra -est match is harmless and doesn't affect correctness,
>>> because +new traffic is always -est. And +est traffic is always -new.
>>> So, I think, you may just update the test in your internal backport and
>>> call it a day.
>>>
>>> For the actual reason why this is happening, the answer is: OpenFlow table
>>> sharing between the switch and the router pipelines.
>>>
>>> Both the router and the switch pipelines have their OpenFlow rules in the
>>> exact same OpenFlow tables starting from table 8. This means that on 24.03
>>> the ls_in_acl_action and the lr_in_ecmp_stateful stages are using the same
>>> OpenFlow table 17. When you add one stage to the switch pipeline, you shift
>>> all switch tables by one while keeping router pipelines in place. So, now
>>> lr_in_ecmp_stateful shares the table with ls_in_acl_eval instead.
>>>
>>> All the rules have a match on metadata fields that distinguishes switches
>>> from routers and so there are no issues with correctness caused by sharing.
>>> However, the classifier may add extra matches due to internal implementation
>>> details. Classifier will traverse all the rules in the OpenFlow table
>>> starting with the highest priority. If there are no rules that match the
>>> packet in the current priority, classifier adds a minimal match to the
>>> datapath flow that will distinguish this packet from any OpenFlow rule in
>>> this table at this priority. So, if one of the rules with the higher
>>> priority had +est in the match, classifier will add -est to the datapath
>>> flow for the packet that didn't match that flow.
>>>
>>> So, by adding an extra stage to the router pipeline, you're just restoring
>>> the mapping of switch and router pipelines to OpenFlow tables like it was
>>> before the backport. By playing with ACL priorities, you're making the
>>> classifier go to the next table before evaluating a lower priority rule that
>>> has an extra +est match.
>>>
>>> We worked on one similar issue recently:
>>>
>>> https://patchwork.ozlabs.org/project/ovn/patch/20250414085122.348614-4-dce...@redhat.com/
>>> Here we had a -dnat match leak from the router pipeline to the switch
>>> pipeline for the packet that does not even go through the router. And that
>>> breaks hardware offload because neither kernel nor hardware NICs support
>>> offloading of NAT flags.
>>>
>>> Leaking of match criteria between switch and router pipelines is an
>>> interesting side effect of OVN design, but should not generally cause
>>> issues,
>>> except for hardware offloading in some cases.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>>> The expected state should be set by matching this OpenFlow rule in table
>>>> 17 (in OVN it is router pipeline table 9 - ECMP stateful):
>>>>
>>>> cookie=0xdda3b0a7, duration=2.635s, table=17, n_packets=6,
>>>> n_bytes=636, idle_age=1,
>>>> priority=100,ct_state=+new-rpl+trk,ipv6,reg14=0x2,metadata=0x1,ipv6_dst=fd01::/126
>>>> actions=ct(commit,zone=NXM_NX_REG11[0..15],nat(src),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[32..79],load:0x2->NXM_NX_CT_MARK[16..31])),resubmit(,18)
>>>>
>>>> cookie=0xdda3b0a7, duration=2.635s, table=17, n_packets=14,
>>>> n_bytes=1396, idle_age=0,
>>>> priority=100,ct_state=+est-rpl+trk,ipv6,reg14=0x2,metadata=0x1,ipv6_dst=fd01::/126
>>>> actions=ct(commit,zone=NXM_NX_REG11[0..15],nat(src),exec(move:NXM_OF_ETH_SRC[]->NXM_NX_CT_LABEL[32..79],load:0x2->NXM_NX_CT_MARK[16..31])),resubmit(,18)
>>>>
>>>>
>>>> I found two logical flow changes, that work, though it's not clear why:
>>>> 1) Adding a router table before ECMP processing:
>>>> By inserting just one table at the very beginning of the router
>>>> pipeline, before the ECMP stateful handling (for example,
>>>> https://github.com/odivlad/ovn/commit/eb6d0d7409ff78f1fc0908a28225d0a2a47daa29
>>>> one table is enough), the test starts passing. The mechanism isn't clear
>>>> - packets now match the default flow in table 17 and only hit the proper
>>>> ECMP rule in table 18, yet this somehow resolves the issue.
>>>> 2) Modifying ACL evaluation rules:
>>>> The second solution is even more strange. Since this test case doesn't
>>>> use ACLs or load balancers, northd adds match all' flow with 'next;'
>>>> action and priority 65535 to the acl_eval table (logical table 9 in
>>>> switch, OpenFlow table 17). When we lower the priority of these rules
>>>> below 100(less priority for the ecmp rules), the test begins working.
>>>> This suggests some hidden interaction between router and switch pipeline
>>>> rules, despite their different metadata matching criteria.
>>>>
>>>> When examining the OVS traces for both cases - the initial failed test
>>>> with just a stage addition versus the working version where we also
>>>> modified the ACL eval table priority to 0 - the packet's path through
>>>> the tables shows no differences except for two key aspects: first, the
>>>> rule matching in ACL eval (OpenFlow table 17), and second, the resulting
>>>> datapath action where the -est state unexpectedly disappears. The trace
>>>> comparison reveals that only the rule priorities in table 17 actually
>>>> changed, yet this somehow impacts the connection tracking state. You can
>>>> see the complete trace comparison showing both scenarios - with just the
>>>> stage addition and with the priority modification - along with the
>>>> contents of table 17 and the diff between traces at this link:
>>>> https://gist.github.com/Sashhkaa/58b2c616e7d46fc2dafb898ed832960f.
>>>> I've verified this behavior persists in newer versions of Open vSwitch
>>>> as well.
>>>> Does anyone understand what could be causing this issue? I'd appreciate
>>>> any insights or suggestions for a proper fix. Thank you!
>>>>

--
regards, Alexandra.