On Wed, Jul 20, 2022 at 12:13 PM Numan Siddique <num...@ovn.org> wrote:
>
> On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >
> >
> > Hi Han,
> >
> > On 7/7/22 19:02, Dumitru Ceara wrote:
> > > On 7/7/22 18:21, Han Zhou wrote:
> > >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <dce...@redhat.com> wrote:
> > >>>
> > >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> > >>>> On 7/7/22 00:08, Han Zhou wrote:
> > >>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
> > >>>>>>
> > >>>>>> Hi Han,
> > >>>>>>
> > >>>>>> On 7/6/22 00:41, Han Zhou wrote:
> > >>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and port
> > >>>>>>> to registers is causing a critical dataplane performance impact for
> > >>>>>>> short-lived connections, because it unwildcards megaflows with exact
> > >>>>>>> match on dst IP and L4 ports. Any new connection with a different
> > >>>>>>> client-side L4 port will encounter a datapath flow miss and upcall
> > >>>>>>> to ovs-vswitchd.
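To illustrate the effect, here is a minimal Python sketch of the megaflow
cache behavior; the packet fields, addresses and counts below are
illustrative assumptions, not OVS internals:

# Sketch: why unwildcarding the megaflow on the client-side L4 port turns
# every short-lived connection into a datapath miss and an upcall.

def key_exact_port(pkt):
    # Megaflow unwildcarded on dst IP *and* dst L4 port: on the
    # server->client leg the dst port is the client's ephemeral port,
    # unique per connection.
    return (pkt["ip_dst"], pkt["tp_dst"])

def key_no_client_port(pkt):
    # Wider megaflow that does not depend on the client's ephemeral port.
    return (pkt["ip_dst"],)

def upcalls(packets, key_fn):
    cache, misses = set(), 0
    for pkt in packets:
        k = key_fn(pkt)
        if k not in cache:   # datapath flow miss -> upcall to ovs-vswitchd
            misses += 1
            cache.add(k)
    return misses

# TCP_CRR-style workload: every transaction uses a new client ephemeral port.
pkts = [{"ip_dst": "192.0.2.1", "tp_dst": 32768 + i} for i in range(10000)]
print(upcalls(pkts, key_exact_port))      # 10000: one upcall per connection
print(upcalls(pkts, key_no_client_port))  # 1: later connections hit the cache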
> > >>>>>>>
> > >>>>>>> These fields (dst IP and port) were saved to registers to solve a
> > >>>>>>> problem in the LB hairpin use case when different VIPs share
> > >>>>>>> overlapping backend+port [0]. The change [0] might not have had as
> > >>>>>>> wide a performance impact as it does now, because at that time one
> > >>>>>>> of the match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only
> > >>>>>>> for established and natted traffic, while now the impact is more
> > >>>>>>> obvious because REGBIT_CONNTRACK_NAT is set for all IP traffic (if
> > >>>>>>> any VIP is configured on the LS) since commit [1], after several
> > >>>>>>> other indirectly related optimizations and refactors.
> > >>>>>>>
> > >>>>>>> Since the changes that introduced the performance problem have
> > >>>>>>> their own value (they fix problems or optimize performance), we
> > >>>>>>> don't want to revert any of them (and it is also not straightforward
> > >>>>>>> to do so because there have been lots of changes and refactors on
> > >>>>>>> top of them).
> > >>>>>>>
> > >>>>>>> Change [0] itself added an alternative way to solve the overlapping
> > >>>>>>> backends problem, which utilizes ct fields instead of saving dst IP
> > >>>>>>> and port to registers. This patch forces that approach and removes
> > >>>>>>> the flows/actions that save the dst IP and port, to avoid the
> > >>>>>>> dataplane performance problem for short-lived connections.
> > >>>>>>>
> > >>>>>>> (With this approach, the ct_state DNAT match is not HW offload
> > >>>>>>> friendly, so it may result in those flows not being offloaded, which
> > >>>>>>> is supposed to be solved in a follow-up patch.)
> >
> > While we're waiting for more input from ovn-k8s on this, I have a
> > slightly different question.
> >
> > Aren't we hitting a similar problem in the router pipeline, due to
> > REG_ORIG_TP_DPORT_ROUTER?
> >
> > Thanks,
> > Dumitru
> >
> > >>>>>>>
> > >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> > >>>>>>>     shared backends.")
> > >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> > >>>>>>>     traffic.")
> > >>>>>>>
> > >>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
> > >>>>>>> ---
> > >>>>>>
> > >>>>>> I think the main concern I have is that this forces us to choose
> > >>>>>> between:
> > >>>>>> a. non hwol friendly flows (reduced performance)
> > >>>>>> b. less functionality (with the knob in patch 3/3 set to false).
> > >>>>>>
> > >>>>> Thanks Dumitru for the comments! I agree the solution is not ideal,
> > >>>>> but if we look at it from a different angle, even with a), for most
> > >>>>> pod->service traffic the performance is still much better than how it
> > >>>>> is today (a not-offloaded kernel datapath is still much better than
> > >>>>> the userspace slowpath). And *hopefully* b) is ok for most use cases
> > >>>>> to get HW-offload capability.
> > >>>>>
> > >>>
> > >>> Just a note on this item.  I'm a bit confused about why all traffic
> > >>> would be slowpath-ed?  It's just the first packet that goes to
> > >>> vswitchd as an upcall, right?
> > >>>
> > >>
> > >> It is about all traffic for *short-lived* connections. Any clients ->
> > >> service traffic with the pattern:
> > >> 1. TCP connection setup
> > >> 2. Set API request, receives response
> > >> 3. Close TCP connection
> > >> It can be tested with netperf TCP_CRR. Every time the client-side TCP
> > >> port is different, but since the server -> client DP flow includes the
> > >> client TCP port, for each such transaction there is going to be at
> > >> least one DP flow miss and an upcall to userspace. The application
> > >> latency would be very high. In addition, it causes the OVS handler CPU
> > >> usage to spike, which would further impact the dataplane performance
> > >> of the system.
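For reference, a minimal sketch of such a CRR-style client (the host, port
and payload are hypothetical); each iteration opens a brand new connection,
so the kernel picks a new ephemeral source port every time:

import socket

def crr_once(host="192.0.2.10", port=80):
    # One connect/request/response/close cycle, as in netperf TCP_CRR.
    with socket.create_connection((host, port)) as s:  # new ephemeral port
        s.sendall(b"GET / HTTP/1.0\r\n\r\n")
        s.recv(4096)

for _ in range(1000):
    crr_once()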
> > >>
> > >>> Once the megaflow (even if it's more specific than ideal) is installed,
> > >>> all following traffic in that session should be forwarded in the fast
> > >>> path (kernel).
> > >>>
> > >>> Also, I'm not sure I follow why the same behavior wouldn't happen with
> > >>> your changes too for pod->service.  The datapath flow includes the
> > >>> dp_hash() match, and that's likely different for different connections.
> > >>>
> > >>
> > >> With the change it is not going to happen, because the match is on the
> > >> server-side port only.
> > >> For dp_hash(), from what I remember, there are at most as many megaflows
> > >> as the number of buckets (the masked hash values).
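A tiny illustration of that bound (the bucket count and tuples are made up):
masking the hash down to the bucket space caps the number of distinct
megaflow keys regardless of how many connections there are.

N_BUCKETS = 16  # assumed number of hash buckets (e.g., LB backends)

def masked_hash(flow_tuple, n_buckets=N_BUCKETS):
    # Stand-in for dp_hash() plus the mask applied in the megaflow.
    return hash(flow_tuple) % n_buckets

keys = {masked_hash(("192.0.2.1", 32768 + i)) for i in range(100000)}
print(len(keys))  # at most N_BUCKETS distinct megaflow keys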
> > >>
> > >
> > > OK, that explains it, thanks.
> > >
> > >>> Or am I missing something?
> > >>>
> > >>>>>> Change [0] was added to address the case when a service in
> > >>>>>> kubernetes is exposed via two different k8s service objects that
> > >>>>>> share the same endpoints.  That translates in ovn-k8s to two
> > >>>>>> different OVN load balancer VIPs that share the same backends.  For
> > >>>>>> such cases, if the service is being accessed by one of its own
> > >>>>>> backends we need to be able to differentiate based on the VIP
> > >>>>>> address it used to connect to.
> > >>>>>>
> > >>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s
> > >>>>>> side on how common it is that an OVN-networked pod accesses two (or
> > >>>>>> more) services that might have the pod itself as a backend.
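To make the constraint concrete, a conceptual sketch (hypothetical VIP and
backend addresses, not OVN code): when two VIPs DNAT to the same
backend+port, a hairpinned reply can only be translated back to the right
VIP if the pre-DNAT destination is still available, either saved in
registers or recovered from the conntrack entry.

BACKEND = ("10.244.0.5", 8080)
VIPS = {("10.96.0.10", 80): BACKEND,   # service A
        ("10.96.0.20", 80): BACKEND}   # service B, same endpoints

def hairpin_reply_source(ct_original_dst):
    # ct_original_dst is the (VIP, port) the pod originally connected to,
    # recorded by conntrack before DNAT; matching on it is what
    # disambiguates service A from service B on the reply path.
    assert ct_original_dst in VIPS
    return ct_original_dst

print(hairpin_reply_source(("10.96.0.10", 80)))  # reply uses service A's VIP
print(hairpin_reply_source(("10.96.0.20", 80)))  # reply uses service B's VIP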
> > >>>>>>
> > >>>>>
> > >>>>> Yes, we definitely need input from ovn-k8s side. The information we
> > >>>>> got so far: the change [0] was to fix a bug [2] reported by Tim.
> > >>>>> However, the bug description didn't mention anything about two VIPs
> > >>>>> sharing the same backend. Tim also mentioned in the ovn-k8s meeting
> > >>>>> last week that the original user bug report for [2] was [3], and [3]
> > >>>>> was in fact a completely different problem (although it is related
> > >>>>> to hairpin, too). So, I am under
> > >>>>
> > >>>> I am not completely sure about the link between [3] and [2], maybe
> > >>>> Tim remembers more.
> > >>>>
> > >>>>> the impression that "an OVN-networked pod accesses two (or more)
> > >>>>> services that might have the pod itself as a backend" might be a
> > >>>>> very rare use case, if it exists at all.
> > >>>>>
> > >>>>
> > >>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
> > >>>> to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
> > >>>> runs a subset of the kubernetes conformance tests (AFAICT) and some
> > >>>> specific e2e ovn-kubernetes tests.
> > >>>>
> > >>>> The results are here:
> > >>>>
> > >>>>
> > >>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
> > >>>>
> > >>>> Focusing on the conformance failures:
> > >>>>
> > >>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking
> > >>>>   Granular Checks: Services [It] should function for
> > >>>>   endpoint-Service: http
> > >>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> > >>>> ...
> > >>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking
> > >>>>   Granular Checks: Services [It] should function for multiple
> > >>>>   endpoint-Services with same selector
> > >>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
> > >>>> ...
> > >>>>
> > >>>> Checking how these tests are defined:
> > >>>>
> > >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
> > >>>>
> > >>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
> > >>>>
> > >> Thanks for the test and information! Really need input from k8s folks
> > >> to understand more.
> > >>
> > >> Thanks,
> > >> Han
> > >>
> > >
> > > Like I said below, these are kubernetes conformance tests so I'll let
> > > k8s folks confirm if such failures can be ignored/worked around.
> > >
> > >>>> It seems to me that they're testing explicitly for a "pod that
> > >>>> accesses two services that might have the pod itself as a backend".
> > >>>>
> > >>>> So, if I'm not wrong, we'd become non-compliant in this case.
> > >>>>
> > >>>>>> If this turns out to be mandatory I guess we might want to also look
> > >>>>>> into alternatives like:
> > >>>>>> - getting help from the HW to offload matches like ct_tuple()
> > >>>>>
> > >>>>> I believe this is going to happen in the future. HWOL is continuously
> > >>>>> enhanced.
> > >>>>>
> > >>>>
> > >>>> That would make things simpler.
> > >>>>
> > >>>>>> - limiting the impact of "a." only to some load balancers (e.g.,
> > >>>>>>   would it help to use different hairpin lookup tables for such load
> > >>>>>>   balancers?)
> > >>>>>
> > >>>>> I am not sure if this would work, and not sure if this is a good
> > >>>>> approach, either. In general, I believe it is possible to solve the
> > >>>>> problem with more complex pipelines, but we need to keep in mind that
> > >>>>> it is quite easy to introduce other performance problems (either
> > >>>>> control plane or data plane) - many of the changes that led to the
> > >>>>> current implementation were performance optimizations, some for the
> > >>>>> control plane, some for HWOL, and some for reducing recirculations.
> > >>>>> I'd avoid complexity unless it is really necessary. Let's get more
> > >>>>> input on the problem, and based on that we can decide if we want to
> > >>>>> move to a more complex solution.
> > >>>>>
> > >>>>
> > >>>> Sure, I agree we need to find the best solution.
> > >>>>
> > >>>> In my opinion OVN should be HW-agnostic.  We did try to adjust the way
> > >>>> OVN generates openflows in order to make it more "HWOL-friendly" but
> > >>>> that shouldn't come at the cost of breaking CMS features (if that's
> > >>>> the case here).
> > >>>>
> > >>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
> > >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Han
> > >>>>>
> > >>>>
> > >>>> Thanks,
> > >>>> Dumitru
>
>
> Hi Han,  Dumitru,
>
> What's the status of this patch series?  Does it need a v2?  Sorry I
> didn't follow all the discussions.
>
> If the patch series doesn't need a v2, I can probably start reviewing.

Hi Numan,
I am still waiting for clarification of the k8s requirement, which Tim said
at last week's ovn-k8s meeting he would discuss with Dumitru.
At the same time I am trying an alternative to see if we can solve the
performance problem while still supporting the corner cases (without adding
too much complexity to the pipeline).
So you can hold the review for now.

Thanks,
Han

>
> Thanks
> Numan
>
> > >>>
> > >>
> >
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
