Hi Han,
On 7/7/22 19:02, Dumitru Ceara wrote:
> On 7/7/22 18:21, Han Zhou wrote:
>> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 7/7/22 13:45, Dumitru Ceara wrote:
>>>> On 7/7/22 00:08, Han Zhou wrote:
>>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>> On 7/6/22 00:41, Han Zhou wrote:
>>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>>>> registers is causing a critical dataplane performance impact for
>>>>>>> short-lived connections, because it unwildcards megaflows with an exact
>>>>>>> match on dst IP and L4 ports. Any new connection with a different
>>>>>>> client-side L4 port will encounter a datapath flow miss and upcall to
>>>>>>> ovs-vswitchd.
>>>>>>>
>>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>>> problem in the LB hairpin use case when different VIPs share
>>>>>>> overlapping backend+port [0]. The change [0] might not have had as
>>>>>>> wide a performance impact as it has now, because at that time one of
>>>>>>> the match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for
>>>>>>> established and natted traffic, while now the impact is more obvious
>>>>>>> because REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
>>>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>>>> related optimizations and refactors.
>>>>>>>
>>>>>>> Since the changes that introduced the performance problem have their
>>>>>>> own value (they fix problems or optimize performance), we don't want
>>>>>>> to revert any of them (and it is also not straightforward to revert
>>>>>>> any of them because there have been lots of changes and refactors
>>>>>>> on top of them).
>>>>>>>
>>>>>>> Change [0] itself added an alternative way to solve the overlapping
>>>>>>> backends problem, which utilizes ct fields instead of saving dst IP
>>>>>>> and port to registers. This patch switches to that approach and
>>>>>>> removes the flows/actions that save the dst IP and port, to avoid the
>>>>>>> dataplane performance problem for short-lived connections.
>>>>>>>
>>>>>>> (With this approach, the ct_state DNAT match is not HW-offload
>>>>>>> friendly, so it may result in those flows not being offloaded, which
>>>>>>> is supposed to be solved in a follow-up patch.)

While we're waiting for more input from ovn-k8s on this, I have a
slightly different question.

Aren't we hitting a similar problem in the router pipeline, due to
REG_ORIG_TP_DPORT_ROUTER?

Thanks,
Dumitru

>>>>>>>
>>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
>>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>>>>>
>>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
>>>>>>> ---
>>>>>>
>>>>>> I think the main concern I have is that this forces us to choose between:
>>>>>> a. non-HWOL-friendly flows (reduced performance)
>>>>>> b. less functionality (with the knob in patch 3/3 set to false).
>>>>>>
>>>>> Thanks Dumitru for the comments! I agree the solution is not ideal, but
>>>>> if we look at it from a different angle, even with a), for most
>>>>> pod->service traffic the performance is still much better than it is
>>>>> today (a non-offloaded kernel datapath is still much better than the
>>>>> userspace slowpath). And *hopefully* b) is ok for most use cases to get
>>>>> HW-offload capability.
>>>>>
>>>
>>> Just a note on this item.
>>> I'm a bit confused about why all traffic would be slowpath-ed? It's
>>> just the first packet that goes to vswitchd as an upcall, right?
>>>
>>
>> It is about all traffic for *short-lived* connections. Any client ->
>> service traffic with the pattern:
>> 1. TCP connection setup
>> 2. Send API request, receive response
>> 3. Close TCP connection
>> It can be tested with netperf TCP_CRR. Every time the client-side TCP
>> port is different, but since the server -> client DP flow includes the
>> client TCP port, for each such transaction there is going to be at least
>> one DP flow miss that goes to userspace. The application latency would be
>> very high. In addition, it causes the OVS handler CPU usage to spike,
>> which would further impact the dataplane performance of the system.
>>
>>> Once the megaflow (even if it's more specific than ideal) is installed,
>>> all following traffic in that session should be forwarded in the fast
>>> path (kernel).
>>>
>>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>>> your changes too for pod->service. The datapath flow includes the
>>> dp_hash() match, and that's likely different for different connections.
>>>
>>
>> With the change it is not going to happen, because the match is on the
>> server-side port only.
>> For dp_hash(), from what I remember, there are at most as many megaflows
>> as there are buckets (the masked hash values).
>>
>
> OK, that explains it, thanks.
>
>>> Or am I missing something?
>>>
>>>>>> Change [0] was added to address the case when a service in kubernetes
>>>>>> is exposed via two different k8s service objects that share the same
>>>>>> endpoints. That translates in ovn-k8s to two different OVN load
>>>>>> balancer VIPs that share the same backends. For such cases, if the
>>>>>> service is being accessed by one of its own backends we need to be
>>>>>> able to differentiate based on the VIP address it used to connect to.
>>>>>>
>>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side
>>>>>> on how common it is that an OVN-networked pod accesses two (or more)
>>>>>> services that might have the pod itself as a backend.
>>>>>>
>>>>>
>>>>> Yes, we definitely need input from the ovn-k8s side. The information we
>>>>> have so far: change [0] was to fix a bug [2] reported by Tim. However,
>>>>> the bug description didn't mention anything about two VIPs sharing the
>>>>> same backend. Tim also mentioned in the ovn-k8s meeting last week that
>>>>> the original user bug report for [2] was [3], and [3] was in fact a
>>>>> completely different problem (although it is related to hairpin, too).
>>>>> So, I am under
>>>>
>>>> I am not completely sure about the link between [3] and [2], maybe Tim
>>>> remembers more.
>>>>
>>>>> the impression that "an OVN-networked pod accesses two (or more)
>>>>> services that might have the pod itself as a backend" might be a very
>>>>> rare use case, if it exists at all.
>>>>>
>>>>
>>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
>>>> to "false" and pushed it to my fork to run the ovn-kubernetes CI. This
>>>> runs a subset of the kubernetes conformance tests (AFAICT) and some
>>>> specific e2e ovn-kubernetes tests.
>>>>
>>>> The results are here:
>>>>
>>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
>>>>
>>>> Focusing on the conformance failures:
>>>>
>>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for endpoint-Service: http
>>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>>> ...
>>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for multiple endpoint-Services with same selector
>>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>>> ...
>>>>
>>>> Checking how these tests are defined:
>>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
>>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
>>>>
>>
>> Thanks for the test and information! We really need input from the k8s
>> folks to understand more.
>>
>> Thanks,
>> Han
>>
>
> Like I said below, these are kubernetes conformance tests, so I'll let
> the k8s folks confirm whether such failures can be ignored/worked around.
>
>>>> It seems to me that they're testing explicitly for a "pod that accesses
>>>> two services that might have the pod itself as a backend".
>>>>
>>>> So, if I'm not wrong, we'd become non-compliant in this case.
>>>>
>>>>>> If this turns out to be mandatory I guess we might want to also look
>>>>>> into alternatives like:
>>>>>> - getting help from the HW to offload matches like ct_tuple()
>>>>>
>>>>> I believe this is going to happen in the future. HWOL is continuously
>>>>> being enhanced.
>>>>>
>>>>
>>>> That would make things simpler.
>>>>
>>>>>> - limiting the impact of "a." only to some load balancers (e.g., would
>>>>>> it help to use different hairpin lookup tables for such load
>>>>>> balancers?)
>>>>>
>>>>> I am not sure if this would work, and not sure if this is a good
>>>>> approach, either. In general, I believe it is possible to solve the
>>>>> problem with more complex pipelines, but we need to keep in mind that
>>>>> it is quite easy to introduce other performance problems (either
>>>>> control plane or data plane) - many of the changes that led to the
>>>>> current implementation were performance optimizations, some for the
>>>>> control plane, some for HWOL, and some for reducing recirculations.
>>>>> I'd avoid complexity unless it is really necessary. Let's get more
>>>>> input on the problem, and based on that we can decide if we want to
>>>>> move to a more complex solution.
>>>>>
>>>>
>>>> Sure, I agree we need to find the best solution.
>>>>
>>>> In my opinion OVN should be HW-agnostic. We did try to adjust the way
>>>> OVN generates openflows in order to make it more "HWOL-friendly", but
>>>> that shouldn't come at the cost of breaking CMS features (if that's
>>>> the case here).
>>>>
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
>>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
>>>>>
>>>>> Thanks,
>>>>> Han
>>>>>
>>>>
>>>> Thanks,
>>>> Dumitru
>>>
>>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
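
As a concrete illustration of the short-lived connection pattern discussed
above (the same pattern a netperf TCP_CRR test generates), the Python
sketch below opens a new TCP connection per transaction. The service
address 10.0.0.10:80 is purely hypothetical; the point is only that the
kernel picks a fresh client-side ephemeral port for every connection, so
any datapath megaflow that matches on that port is only ever hit by a
single transaction.

# Illustrative only: mimics a netperf TCP_CRR-style workload against a
# hypothetical service VIP (10.0.0.10:80 is a made-up address/port).
import socket

VIP = ("10.0.0.10", 80)

def one_transaction():
    # 1. TCP connection setup (a new ephemeral source port each time)
    s = socket.create_connection(VIP, timeout=5)
    try:
        # 2. Send a request, receive the response
        s.sendall(b"GET / HTTP/1.0\r\n\r\n")
        s.recv(4096)
    finally:
        # 3. Close the TCP connection
        s.close()

# Every iteration uses a different client-side port, so a megaflow that
# matches on the client L4 port is only ever reused within one transaction.
for _ in range(1000):
    one_transaction()

With the pre-patch flows described in the commit message, each iteration is
expected to install a new reply-direction megaflow (and trigger an upcall);
with the patch, the reply-direction megaflow should be shared across
iterations, which is the behaviour Han is arguing for.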
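
To make the scaling argument in the dp_hash() exchange above more tangible,
here is a toy model (not OVS code; addresses, ports, and the bucket count
are made up) of how many distinct megaflow cache keys each matching
strategy produces: including the client's ephemeral port yields roughly one
key per connection, matching only the server-side tuple yields one key, and
a masked dp_hash() match is bounded by the number of hash buckets
regardless of the connection count.

# Toy model of megaflow cache keys; not OVS code, just to illustrate the
# scaling argument. Addresses, ports, and bucket count are assumptions.
import random

SERVER = ("10.0.0.10", 80)
N_BUCKETS = 64  # assumed number of masked dp_hash() buckets

client_ports = [random.randint(1024, 65535) for _ in range(10000)]

# Reply-direction key that includes the client's ephemeral port:
# roughly one megaflow per distinct client port.
keys_client_port = {SERVER + (cport,) for cport in client_ports}

# Key that only includes the server-side tuple: a single shared megaflow.
keys_server_only = {SERVER}

# dp_hash()-based key: bounded by the number of buckets, not connections.
keys_dp_hash = {SERVER + (hash(cport) % N_BUCKETS,) for cport in client_ports}

print(len(keys_client_port), len(keys_server_only), len(keys_dp_hash))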