On 7/7/22 00:08, Han Zhou wrote:
> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> Hi Han,
>>
>> On 7/6/22 00:41, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves the dst IP and port to registers is causing a critical dataplane performance impact for short-lived connections, because it unwildcards megaflows with an exact match on dst IP and L4 ports. Any new connection with a different client-side L4 port encounters a datapath flow miss and an upcall to ovs-vswitchd.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a problem in the LB hairpin use case where different VIPs share an overlapping backend+port [0]. The change [0] might not have had as wide a performance impact at the time, because the match condition "REGBIT_CONNTRACK_NAT == 1" was then set only for established and NATted traffic; the impact is more obvious now because, since commit [1] and several other indirectly related optimizations and refactors, REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is configured on the LS).
>>>
>>> Since the changes that introduced the performance problem have their own value (they fix problems or optimize performance), we don't want to revert any of them (and it is also not straightforward to revert them, because there have been lots of changes and refactors on top).
>>>
>>> Change [0] itself added an alternative way to solve the overlapping-backends problem, which utilizes ct fields instead of saving the dst IP and port to registers. This patch forces the use of that approach and removes the flows/actions that save the dst IP and port, to avoid the dataplane performance problem for short-lived connections.
>>>
>>> (With this approach, the ct_state DNAT match is not HW-offload friendly, so it may result in those flows not being offloaded; this is supposed to be solved in a follow-up patch.)
>>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>
>>> Signed-off-by: Han Zhou <hz...@ovn.org>
>>> ---
>>
>> I think the main concern I have is that this forces us to choose between:
>> a. non-HWOL-friendly flows (reduced performance)
>> b. less functionality (with the knob in patch 3/3 set to false).
>>
> Thanks, Dumitru, for the comments! I agree the solution is not ideal, but if we look at it from a different angle, even with a), for most pod->service traffic the performance is still much better than it is today (a non-offloaded kernel datapath is still much better than the userspace slowpath). And *hopefully* b) is acceptable for most use cases in order to get HW-offload capability.
>
>> Change [0] was added to address the case where a service in Kubernetes is exposed via two different k8s Service objects that share the same endpoints. That translates in ovn-k8s to two different OVN load balancer VIPs that share the same backends. For such cases, if the service is being accessed by one of its own backends, we need to be able to differentiate based on the VIP address it used to connect to.
>>
>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on how common it is that an OVN-networked pod accesses two (or more) services that might have the pod itself as a backend.
>>
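Just to illustrate the scenario for anyone less familiar with the k8s side: a minimal, hypothetical manifest for it would be two Services that use the same selector (all names and ports below are made up for the example):

  apiVersion: v1
  kind: Service
  metadata:
    name: svc-a            # hypothetical name
  spec:
    selector:
      app: demo            # both Services select the same backend pods
    ports:
    - port: 80
      targetPort: 8080
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: svc-b            # second Service, same selector as svc-a
  spec:
    selector:
      app: demo
    ports:
    - port: 80
      targetPort: 8080

If a pod labeled app=demo connects to either of the two ClusterIPs, both hairpinned connections end up on the same backend IP and port, so OVN needs the VIP address that was originally used in order to tell the sessions apart.
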
> Yes, we definitely need input from the ovn-k8s side. The information we got so far: the change [0] was to fix a bug [2] reported by Tim. However, the bug description didn't mention anything about two VIPs sharing the same backend. Tim also mentioned in the ovn-k8s meeting last week that the original user bug report for [2] was [3], and [3] was in fact a completely different problem (although it is related to hairpin, too).
> So, I am under the impression that "an OVN-networked pod accesses two (or more) services that might have the pod itself as a backend" might be a very rare use case, if it exists at all.
>

I am not completely sure about the link between [3] and [2]; maybe Tim remembers more.

I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob to "false" and pushed it to my fork to run the ovn-kubernetes CI. This runs a subset of the Kubernetes conformance tests (AFAICT) and some specific e2e ovn-kubernetes tests. The results are here:

https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true

Focusing on the conformance failures:

2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for endpoint-Service: http
2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
...
2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for multiple endpoint-Services with same selector
2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
...

Checking how these tests are defined:

https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236

It seems to me that they're testing explicitly for a "pod that accesses two services that might have the pod itself as a backend". So, if I'm not wrong, we'd become non-compliant in this case.

>> If this turns out to be mandatory, I guess we might want to also look into alternatives like:
>> - getting help from the HW to offload matches like ct_tuple()
>
> I believe this is going to happen in the future. HWOL is continuously being enhanced.
>

That would make things simpler.

>> - limiting the impact of "a." only to some load balancers (e.g., would it help to use different hairpin lookup tables for such load balancers?)
>
> I am not sure this would work, and not sure it is a good approach either. In general, I believe it is possible to solve the problem with more complex pipelines, but we need to keep in mind that it is quite easy to introduce other performance problems (either control plane or data plane) - many of the changes that led to the current implementation were performance optimizations, some for the control plane, some for HWOL, and some for reducing recirculations. I'd avoid complexity unless it is really necessary. Let's get more input on the problem, and based on that we can decide whether we want to move to a more complex solution.
>

Sure, I agree we need to find the best solution. In my opinion OVN should be HW-agnostic. We did try to adjust the way OVN generates openflows in order to make it more "HWOL-friendly", but that shouldn't come at the cost of breaking CMS features (if that's the case here).

> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
>
> Thanks,
> Han

Thanks,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev