On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> Hi Han,
>
> On 7/7/22 19:02, Dumitru Ceara wrote:
> > On 7/7/22 18:21, Han Zhou wrote:
> >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>
> >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> >>>> On 7/7/22 00:08, Han Zhou wrote:
> >>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>>>
> >>>>>> Hi Han,
> >>>>>>
> >>>>>> On 7/6/22 00:41, Han Zhou wrote:
> >>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and port
> >>>>>>> to registers is causing a critical dataplane performance impact for
> >>>>>>> short-lived connections, because it unwildcards megaflows with an
> >>>>>>> exact match on dst IP and L4 ports. Any new connection with a
> >>>>>>> different client-side L4 port will encounter a datapath flow miss
> >>>>>>> and upcall to ovs-vswitchd.
> >>>>>>>
> >>>>>>> These fields (dst IP and port) were saved to registers to solve a
> >>>>>>> problem in the LB hairpin use case where different VIPs share
> >>>>>>> overlapping backend+port combinations [0]. The change [0] might not
> >>>>>>> have had as wide a performance impact as it does now, because at
> >>>>>>> that time one of the match conditions, "REGBIT_CONNTRACK_NAT == 1",
> >>>>>>> was set only for established and natted traffic. The impact is more
> >>>>>>> obvious now because, since commit [1] and several other indirectly
> >>>>>>> related optimizations and refactors, REGBIT_CONNTRACK_NAT is set
> >>>>>>> for all IP traffic (if any VIP is configured on the LS).
> >>>>>>>
> >>>>>>> Since the changes that introduced the performance problem had their
> >>>>>>> own value (fixing problems or optimizing performance), we don't
> >>>>>>> want to revert any of them (and it is also not straightforward to
> >>>>>>> revert them, because there have been lots of changes and refactors
> >>>>>>> on top).
> >>>>>>>
> >>>>>>> Change [0] itself added an alternative way to solve the overlapping
> >>>>>>> backends problem, which utilizes ct fields instead of saving dst IP
> >>>>>>> and port to registers. This patch switches to that approach and
> >>>>>>> removes the flows/actions that save the dst IP and port, to avoid
> >>>>>>> the dataplane performance problem for short-lived connections.
> >>>>>>>
> >>>>>>> (With this approach, the ct_state DNAT match is not HW-offload
> >>>>>>> friendly, so it may result in those flows not being offloaded;
> >>>>>>> this is supposed to be solved in a follow-up patch.)
>
> While we're waiting for more input from ovn-k8s on this, I have a
> slightly different question.
>
> Aren't we hitting a similar problem in the router pipeline, due to
> REG_ORIG_TP_DPORT_ROUTER?
>
> Thanks,
> Dumitru
>
> >>>>>>>
> >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> >>>>>>> shared backends.")
> >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> >>>>>>> traffic.")
> >>>>>>>
> >>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
> >>>>>>> ---
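> >>>>>>>
> >>>>>>> For reference, the flow being removed looks roughly like this (a
> >>>>>>> sketch; the table number and register assignments vary across OVN
> >>>>>>> versions):
> >>>>>>>
> >>>>>>>   table=6 (ls_in_pre_stateful), priority=120
> >>>>>>>     match=(reg0[2] == 1 && ip4 && tcp)
> >>>>>>>     action=(reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb;)
> >>>>>>>
> >>>>>>> It is the "reg1 = ip4.dst; reg2[0..15] = tcp.dst" part that forces
> >>>>>>> the datapath to unwildcard the dst IP and L4 destination port in
> >>>>>>> the megaflow for every connection.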
> >>>>>>
> >>>>>> I think the main concern I have is that this forces us to choose
> >>>>>> between:
> >>>>>> a. non-HWOL-friendly flows (reduced performance)
> >>>>>> b. less functionality (with the knob in patch 3/3 set to false).
> >>>>>>
> >>>>> Thanks Dumitru for the comments! I agree the solution is not ideal,
> >>>>> but if we look at it from a different angle, even with a), for most
> >>>>> pod->service traffic the performance is still much better than it is
> >>>>> today (a non-offloaded kernel datapath is still much better than the
> >>>>> userspace slowpath). And *hopefully* b) is ok for most use cases to
> >>>>> get HW-offload capability.
> >>>>>
> >>>
> >>> Just a note on this item. I'm a bit confused about why all traffic
> >>> would be slowpath-ed? It's just the first packet that goes to
> >>> vswitchd as an upcall, right?
> >>>
> >>
> >> It is about all traffic for *short-lived* connections. Think of any
> >> client -> service traffic with the pattern:
> >> 1. TCP connection setup
> >> 2. Send an API request, receive the response
> >> 3. Close the TCP connection
> >> It can be tested with netperf TCP_CRR. Every time the client-side TCP
> >> port is different, but since the server -> client DP flow includes the
> >> client TCP port, each such transaction incurs at least one DP flow miss
> >> and an upcall to userspace. Such application latency would be very
> >> high. In addition, it makes the OVS handler CPU usage spike very high,
> >> which would further impact the dataplane performance of the system.
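> >>
> >> For example, something like this (illustrative; $SERVICE_VIP is just a
> >> placeholder for the load balancer VIP):
> >>
> >>   # each TCP_CRR transaction opens a new connection from a fresh
> >>   # client-side ephemeral port, then closes it
> >>   netperf -H $SERVICE_VIP -t TCP_CRR -l 60
> >>
> >> While this runs, the "missed" counter in the "lookups" line of
> >> "ovs-appctl dpctl/show" should grow at roughly the transaction rate.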
> >>
> >>> Once the megaflow (even if it's more specific than ideal) is
> >>> installed all following traffic in that session should be forwarded
> >>> in fast path (kernel).
> >>>
> >>> Also, I'm not sure I follow why the same behavior wouldn't happen
> >>> with your changes too for pod->service. The datapath flow includes
> >>> the dp_hash() match, and that's likely different for different
> >>> connections.
> >>>
> >>
> >> With the change it is not going to happen, because the match is on the
> >> server-side port only. As for dp_hash(), from what I remember, there
> >> are at most as many megaflows as the number of buckets (the masked
> >> hash values).
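> >>
> >> To illustrate (simplified placeholder masks, not actual
> >> dpctl/dump-flows output; CLIENT_IP, CLIENT_PORT and SERVICE_PORT are
> >> placeholders): before this patch the backend -> client megaflow has
> >> the shape
> >>
> >>   ipv4(dst=CLIENT_IP),tcp(src=SERVICE_PORT,dst=CLIENT_PORT)
> >>
> >> i.e. one megaflow per client ephemeral port, while after the patch
> >> the client port can stay wildcarded:
> >>
> >>   ipv4(dst=CLIENT_IP),tcp(src=SERVICE_PORT)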
> >>
> >
> > OK, that explains it, thanks.
> >
> >>> Or am I missing something?
> >>>
> >>>>>> Change [0] was added to address the case when a service in
> >>>>>> kubernetes is exposed via two different k8s service objects that
> >>>>>> share the same endpoints. That translates in ovn-k8s to two
> >>>>>> different OVN load balancer VIPs that share the same backends. For
> >>>>>> such cases, if the service is being accessed by one of its own
> >>>>>> backends we need to be able to differentiate based on the VIP
> >>>>>> address it used to connect to.
> >>>>>>
> >>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s
> >>>>>> side on how common it is that an OVN-networked pod accesses two (or
> >>>>>> more) services that might have the pod itself as a backend.
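> >>>>>>
> >>>>>> As a concrete (hypothetical) illustration of that pattern, two k8s
> >>>>>> Services selecting the same pods (names are made up):
> >>>>>>
> >>>>>>   apiVersion: v1
> >>>>>>   kind: Service
> >>>>>>   metadata: { name: web-a }
> >>>>>>   spec:
> >>>>>>     selector: { app: web }
> >>>>>>     ports: [ { port: 80, targetPort: 8080 } ]
> >>>>>>   ---
> >>>>>>   apiVersion: v1
> >>>>>>   kind: Service
> >>>>>>   metadata: { name: web-b }
> >>>>>>   spec:
> >>>>>>     selector: { app: web }
> >>>>>>     ports: [ { port: 80, targetPort: 8080 } ]
> >>>>>>
> >>>>>> In ovn-k8s these become two OVN load balancer VIPs with identical
> >>>>>> backends; an "app: web" pod connecting to either ClusterIP hairpins
> >>>>>> to itself, and the reply must still come from the right VIP.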
> >>>>>>
> >>>>>
> >>>>> Yes, we definitely need input from the ovn-k8s side. The information
> >>>>> we got so far: the change [0] was to fix a bug [2] reported by Tim.
> >>>>> However, the bug description didn't mention anything about two VIPs
> >>>>> sharing the same backend. Tim also mentioned in the ovn-k8s meeting
> >>>>> last week that the original user bug report for [2] was [3], and [3]
> >>>>> was in fact a completely different problem (although it is related
> >>>>> to hairpin, too). So, I am under
> >>>>
> >>>> I am not completely sure about the link between [3] and [2], maybe
> >>>> Tim remembers more.
> >>>>
> >>>>> the impression that "an OVN-networked pod accesses two (or more)
> >>>>> services that might have the pod itself as a backend" might be a
> >>>>> very rare use case, if it exists at all.
> >>>>>
> >>>>
> >>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend
> >>>> knob to "false" and pushed it to my fork to run the ovn-kubernetes
> >>>> CI. This runs a subset of the kubernetes conformance tests (AFAICT)
> >>>> and some specific e2e ovn-kubernetes tests.
> >>>>
> >>>> The results are here:
> >>>>
> >>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
> >>>>
> >>>> Focusing on the conformance failures:
> >>>>
> >>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular
> >>>> Checks: Services [It] should function for endpoint-Service: http
> >>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> >>>> ...
> >>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular
> >>>> Checks: Services [It] should function for multiple endpoint-Services
> >>>> with same selector
> >>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> >>>> ...
> >>>>
> >>>> Checking how these tests are defined:
> >>>>
> >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
> >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
> >>>>
> >> Thanks for the test and information! Really need input from k8s folks
> >> to understand more.
> >>
> >> Thanks,
> >> Han
> >>
> >
> > Like I said below, these are kubernetes conformance tests so I'll let
> > k8s folks confirm if such failures can be ignored/worked around.
> >
> >>>> It seems to me that they're testing explicitly for a "pod that
> >>>> accesses two services that might have the pod itself as a backend".
> >>>>
> >>>> So, if I'm not wrong, we'd become non-compliant in this case.
> >>>>
> >>>>>> If this turns out to be mandatory I guess we might want to also
> >>>>>> look into alternatives like:
> >>>>>> - getting help from the HW to offload matches like ct_tuple()
> >>>>>
> >>>>> I believe this is going to happen in the future. HWOL is
> >>>>> continuously enhanced.
> >>>>>
> >>>>
> >>>> That would make things simpler.
> >>>>
> >>>>>> - limiting the impact of "a." only to some load balancers (e.g.,
> >>>>>> would it help to use different hairpin lookup tables for such load
> >>>>>> balancers?)
> >>>>>
> >>>>> I am not sure this would work, and not sure it is a good approach,
> >>>>> either. In general, I believe it is possible to solve the problem
> >>>>> with more complex pipelines, but we need to keep in mind that it is
> >>>>> quite easy to introduce other performance problems (either control
> >>>>> plane or data plane) - many of the changes that led to the current
> >>>>> implementation were performance optimizations, some for the control
> >>>>> plane, some for HWOL, and some for reducing recirculations. I'd
> >>>>> avoid complexity unless it is really necessary. Let's get more input
> >>>>> on the problem, and based on that we can decide if we want to move
> >>>>> to a more complex solution.
> >>>>>
> >>>>
> >>>> Sure, I agree we need to find the best solution.
> >>>>
> >>>> In my opinion OVN should be HW-agnostic. We did try to adjust the way
> >>>> OVN generates openflows in order to make them more "HWOL-friendly",
> >>>> but that shouldn't come at the cost of breaking CMS features (if
> >>>> that's the case here).
> >>>>
> >>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
> >>>>>
> >>>>> Thanks,
> >>>>> Han
> >>>>>
> >>>>
> >>>> Thanks,
> >>>> Dumitru

Hi Han, Dumitru,

What's the status of this patch series? Does it need a v2? Sorry I
didn't follow all the discussions.

If the patch series doesn't need a v2, I can probably start reviewing.

Thanks
Numan

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev