On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> Hi Han,
>
> On 7/7/22 19:02, Dumitru Ceara wrote:
> > On 7/7/22 18:21, Han Zhou wrote:
> >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>
> >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> >>>> On 7/7/22 00:08, Han Zhou wrote:
> >>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>>>
> >>>>>> Hi Han,
> >>>>>>
> >>>>>> On 7/6/22 00:41, Han Zhou wrote:
> >>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and port
> >>>>>>> to registers is causing a critical dataplane performance impact for
> >>>>>>> short-lived connections, because it unwildcards megaflows with an
> >>>>>>> exact match on dst IP and L4 ports. Any new connection with a
> >>>>>>> different client-side L4 port will encounter a datapath flow miss
> >>>>>>> and upcall to ovs-vswitchd.
> >>>>>>>
> >>>>>>> These fields (dst IP and port) were saved to registers to solve a
> >>>>>>> problem in the LB hairpin use case where different VIPs share
> >>>>>>> overlapping backend+port combinations [0]. The change [0] might not
> >>>>>>> have had as wide a performance impact as it does now, because at
> >>>>>>> that time one of the match conditions, "REGBIT_CONNTRACK_NAT == 1",
> >>>>>>> was set only for established and natted traffic. The impact is more
> >>>>>>> obvious now because, since commit [1] and several other indirectly
> >>>>>>> related optimizations and refactors, REGBIT_CONNTRACK_NAT is set
> >>>>>>> for all IP traffic (if any VIP is configured on the LS).
> >>>>>>>
> >>>>>>> Since the changes that introduced the performance problem had their
> >>>>>>> own value (fixing problems or optimizing performance), we don't
> >>>>>>> want to revert any of them (and it is also not straightforward to
> >>>>>>> revert them, because there have been lots of changes and refactors
> >>>>>>> on top).
> >>>>>>>
> >>>>>>> Change [0] itself added an alternative way to solve the overlapping
> >>>>>>> backends problem, which utilizes ct fields instead of saving dst IP
> >>>>>>> and port to registers. This patch switches to that approach and
> >>>>>>> removes the flows/actions that save the dst IP and port, to avoid
> >>>>>>> the dataplane performance problem for short-lived connections.
> >>>>>>>
> >>>>>>> (With this approach, the ct_state DNAT match is not HW-offload
> >>>>>>> friendly, so it may result in those flows not being offloaded;
> >>>>>>> this is supposed to be solved in a follow-up patch.)
>
> While we're waiting for more input from ovn-k8s on this, I have a
> slightly different question.
>
> Aren't we hitting a similar problem in the router pipeline, due to
> REG_ORIG_TP_DPORT_ROUTER?
>
> Thanks,
> Dumitru
>
> >>>>>>>
> >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> >>>>>>> shared backends.")
> >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> >>>>>>> traffic.")
> >>>>>>>
> >>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
> >>>>>>> ---
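> >>>>>>>
> >>>>>>> For reference, the flow being removed looks roughly like this (a
> >>>>>>> sketch; the table number and register assignments vary across OVN
> >>>>>>> versions):
> >>>>>>>
> >>>>>>>   table=6 (ls_in_pre_stateful), priority=120
> >>>>>>>     match=(reg0[2] == 1 && ip4 && tcp)
> >>>>>>>     action=(reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb;)
> >>>>>>>
> >>>>>>> It is the "reg1 = ip4.dst; reg2[0..15] = tcp.dst" part that forces
> >>>>>>> the datapath to unwildcard the dst IP and L4 destination port in
> >>>>>>> the megaflow for every connection.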
> >>>>>>
> >>>>>> I think the main concern I have is that this forces us to choose
> >>>>>> between:
> >>>>>> a. non-HWOL-friendly flows (reduced performance)
> >>>>>> b. less functionality (with the knob in patch 3/3 set to false).
> >>>>>>
> >>>>> Thanks Dumitru for the comments! I agree the solution is not ideal,
> >>>>> but if we look at it from a different angle, even with a), for most
> >>>>> pod->service traffic the performance is still much better than it is
> >>>>> today (a non-offloaded kernel datapath is still much better than the
> >>>>> userspace slowpath). And *hopefully* b) is ok for most use cases to
> >>>>> get HW-offload capability.
> >>>>>
> >>>
> >>> Just a note on this item. I'm a bit confused about why all traffic
> >>> would be slowpath-ed? It's just the first packet that goes to
> >>> vswitchd as an upcall, right?
> >>>
> >>
> >> It is about all traffic for *short-lived* connections. Think of any
> >> client -> service traffic with the pattern:
> >> 1. TCP connection setup
> >> 2. Send an API request, receive the response
> >> 3. Close the TCP connection
> >> It can be tested with netperf TCP_CRR. Every time the client-side TCP
> >> port is different, but since the server -> client DP flow includes the
> >> client TCP port, each such transaction incurs at least one DP flow miss
> >> and an upcall to userspace. Such application latency would be very
> >> high. In addition, it makes the OVS handler CPU usage spike very high,
> >> which would further impact the dataplane performance of the system.
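> >>
> >> For example, something like this (illustrative; $SERVICE_VIP is just a
> >> placeholder for the load balancer VIP):
> >>
> >>   # each TCP_CRR transaction opens a new connection from a fresh
> >>   # client-side ephemeral port, then closes it
> >>   netperf -H $SERVICE_VIP -t TCP_CRR -l 60
> >>
> >> While this runs, the "missed" counter in the "lookups" line of
> >> "ovs-appctl dpctl/show" should grow at roughly the transaction rate.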
> >>
> >>> Once the megaflow (even if it's more specific than ideal) is
> >>> installed all following traffic in that session should be forwarded
> >>> in fast path (kernel).
> >>>
> >>> Also, I'm not sure I follow why the same behavior wouldn't happen
> >>> with your changes too for pod->service. The datapath flow includes
> >>> the dp_hash() match, and that's likely different for different
> >>> connections.
> >>>
> >>
> >> With the change it is not going to happen, because the match is on the
> >> server-side port only. As for dp_hash(), from what I remember, there
> >> are at most as many megaflows as the number of buckets (the masked
> >> hash values).
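> >>
> >> To illustrate (simplified placeholder masks, not actual
> >> dpctl/dump-flows output; CLIENT_IP, CLIENT_PORT and SERVICE_PORT are
> >> placeholders): before this patch the backend -> client megaflow has
> >> the shape
> >>
> >>   ipv4(dst=CLIENT_IP),tcp(src=SERVICE_PORT,dst=CLIENT_PORT)
> >>
> >> i.e. one megaflow per client ephemeral port, while after the patch
> >> the client port can stay wildcarded:
> >>
> >>   ipv4(dst=CLIENT_IP),tcp(src=SERVICE_PORT)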
> >>
> >
> > OK, that explains it, thanks.
> >
> >>> Or am I missing something?
> >>>
> >>>>>> Change [0] was added to address the case when a service in
> >>>>>> kubernetes is exposed via two different k8s service objects that
> >>>>>> share the same endpoints. That translates in ovn-k8s to two
> >>>>>> different OVN load balancer VIPs that share the same backends. For
> >>>>>> such cases, if the service is being accessed by one of its own
> >>>>>> backends we need to be able to differentiate based on the VIP
> >>>>>> address it used to connect to.
> >>>>>>
> >>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s
> >>>>>> side on how common it is that an OVN-networked pod accesses two (or
> >>>>>> more) services that might have the pod itself as a backend.
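> >>>>>>
> >>>>>> As a concrete (hypothetical) illustration of that pattern, two k8s
> >>>>>> Services selecting the same pods (names are made up):
> >>>>>>
> >>>>>>   apiVersion: v1
> >>>>>>   kind: Service
> >>>>>>   metadata: { name: web-a }
> >>>>>>   spec:
> >>>>>>     selector: { app: web }
> >>>>>>     ports: [ { port: 80, targetPort: 8080 } ]
> >>>>>>   ---
> >>>>>>   apiVersion: v1
> >>>>>>   kind: Service
> >>>>>>   metadata: { name: web-b }
> >>>>>>   spec:
> >>>>>>     selector: { app: web }
> >>>>>>     ports: [ { port: 80, targetPort: 8080 } ]
> >>>>>>
> >>>>>> In ovn-k8s these become two OVN load balancer VIPs with identical
> >>>>>> backends; an "app: web" pod connecting to either ClusterIP hairpins
> >>>>>> to itself, and the reply must still come from the right VIP.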
> >>>>>>
> >>>>>
> >>>>> Yes, we definitely need input from the ovn-k8s side. The information
> >>>>> we got so far: the change [0] was to fix a bug [2] reported by Tim.
> >>>>> However, the bug description didn't mention anything about two VIPs
> >>>>> sharing the same backend. Tim also mentioned in the ovn-k8s meeting
> >>>>> last week that the original user bug report for [2] was [3], and [3]
> >>>>> was in fact a completely different problem (although it is related
> >>>>> to hairpin, too). So, I am under
> >>>>
> >>>> I am not completely sure about the link between [3] and [2], maybe
> >>>> Tim remembers more.
> >>>>
> >>>>> the impression that "an OVN-networked pod accesses two (or more)
> >>>>> services that might have the pod itself as a backend" might be a
> >>>>> very rare use case, if it exists at all.
> >>>>>
> >>>>
> >>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend
> >>>> knob to "false" and pushed it to my fork to run the ovn-kubernetes
> >>>> CI. This runs a subset of the kubernetes conformance tests (AFAICT)
> >>>> and some specific e2e ovn-kubernetes tests.
> >>>>
> >>>> The results are here:
> >>>>
> >>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
> >>>>
> >>>> Focusing on the conformance failures:
> >>>>
> >>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular
> >>>> Checks: Services [It] should function for endpoint-Service: http
> >>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> >>>> ...
> >>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular
> >>>> Checks: Services [It] should function for multiple endpoint-Services
> >>>> with same selector
> >>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> >>>> ...
> >>>>
> >>>> Checking how these tests are defined:
> >>>>
> >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
> >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
> >>>>
> >> Thanks for the test and information! Really need input from k8s folks
> >> to understand more.
> >>
> >> Thanks,
> >> Han
> >>
> >
> > Like I said below, these are kubernetes conformance tests so I'll let
> > k8s folks confirm if such failures can be ignored/worked around.
> >
> >>>> It seems to me that they're testing explicitly for a "pod that
> >>>> accesses two services that might have the pod itself as a backend".
> >>>>
> >>>> So, if I'm not wrong, we'd become non-compliant in this case.
> >>>>
> >>>>>> If this turns out to be mandatory I guess we might want to also
> >>>>>> look into alternatives like:
> >>>>>> - getting help from the HW to offload matches like ct_tuple()
> >>>>>
> >>>>> I believe this is going to happen in the future. HWOL is
> >>>>> continuously enhanced.
> >>>>>
> >>>>
> >>>> That would make things simpler.
> >>>>
> >>>>>> - limiting the impact of "a." only to some load balancers (e.g.,
> >>>>>> would it help to use different hairpin lookup tables for such load
> >>>>>> balancers?)
> >>>>>
> >>>>> I am not sure this would work, and not sure it is a good approach,
> >>>>> either. In general, I believe it is possible to solve the problem
> >>>>> with more complex pipelines, but we need to keep in mind that it is
> >>>>> quite easy to introduce other performance problems (either control
> >>>>> plane or data plane) - many of the changes that led to the current
> >>>>> implementation were performance optimizations, some for the control
> >>>>> plane, some for HWOL, and some for reducing recirculations. I'd
> >>>>> avoid complexity unless it is really necessary. Let's get more input
> >>>>> on the problem, and based on that we can decide if we want to move
> >>>>> to a more complex solution.
> >>>>>
> >>>>
> >>>> Sure, I agree we need to find the best solution.
> >>>>
> >>>> In my opinion OVN should be HW-agnostic. We did try to adjust the way
> >>>> OVN generates openflows in order to make them more "HWOL-friendly",
> >>>> but that shouldn't come at the cost of breaking CMS features (if
> >>>> that's the case here).
> >>>>
> >>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
> >>>>>
> >>>>> Thanks,
> >>>>> Han
> >>>>>
> >>>>
> >>>> Thanks,
> >>>> Dumitru

Hi Han, Dumitru,

What's the status of this patch series? Does it need a v2? Sorry I
didn't follow all the discussions.

If the patch series doesn't need a v2, I can probably start reviewing.

Thanks
Numan

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev