On 7/7/22 00:08, Han Zhou wrote:
> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> Hi Han,
>>
>> On 7/6/22 00:41, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves the dst IP and port
>>> to registers causes a critical dataplane performance impact for
>>> short-lived connections, because it unwildcards megaflows with an exact
>>> match on the dst IP and L4 ports. Any new connection with a different
>>> client-side L4 port encounters a datapath flow miss and an upcall to
>>> ovs-vswitchd.
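
(As an aside, to illustrate the unwildcarding effect: the flow in question
is roughly of the following shape; this is from memory, so the exact
register assignments may differ, and REGBIT_CONNTRACK_NAT is shown
symbolically while the real lflow matches on a reg0 bit:

  table=ls_in_pre_stateful, priority=120,
    match=(REGBIT_CONNTRACK_NAT == 1 && ip4 && tcp),
    action=(reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb;)

Because the action copies ip4.dst and tcp.dst into registers, the datapath
megaflows end up with exact matches on those fields, which is what makes
connections that differ only in the client-side L4 port miss in the
datapath and upcall, as described above.)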
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem in the LB hairpin use case where different VIPs share an
>>> overlapping backend+port [0]. Change [0] probably did not have as wide a
>>> performance impact at the time, because one of the match conditions,
>>> "REGBIT_CONNTRACK_NAT == 1", was set only for established and natted
>>> traffic. The impact is much more obvious now because, since commit [1]
>>> (after several other indirectly related optimizations and refactors),
>>> REGBIT_CONNTRACK_NAT is set for all IP traffic if any VIP is configured
>>> on the LS.
>>>
>>> The changes that introduced the performance problem have their own
>>> value (they fix problems or optimize performance), so we don't want to
>>> revert any of them (and it would also not be straightforward to do so,
>>> because there have been many changes and refactors on top of them).
>>>
>>> Change [0] itself added an alternative way to solve the overlapping
>>> backends problem, which utilizes ct fields instead of saving the dst IP
>>> and port to registers. This patch forces that approach and removes the
>>> flows/actions that save the dst IP and port, to avoid the dataplane
>>> performance problem for short-lived connections.
>>>
>>> (With this approach, the ct_state DNAT match is not HW-offload friendly,
>>> so those flows may not get offloaded; this is expected to be addressed
>>> in a follow-up patch.)
>>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
>>> backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>
>>> Signed-off-by: Han Zhou <hz...@ovn.org>
>>> ---
>>
>> I think the main concern I have is that this forces us to choose between:
>> a. non-HWOL-friendly flows (reduced performance)
>> b. less functionality (with the knob in patch 3/3 set to false).
>>
> Thanks Dumitru for the comments! I agree the solution is not ideal, but if
> we look at it from a different angle, even with a), for most pod->service
> traffic the performance is still much better than it is today (a
> non-offloaded kernel datapath is still much better than the userspace
> slowpath). And *hopefully* b) is acceptable for most use cases, in
> exchange for HW-offload capability.
> 
>> Change [0] was added to address the case where a service in Kubernetes is
>> exposed via two different k8s Service objects that share the same
>> endpoints.  In ovn-k8s that translates to two different OVN load
>> balancer VIPs that share the same backends.  In such cases, if the
>> service is being accessed by one of its own backends, we need to be able
>> to differentiate based on the VIP address used to connect to it.
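
To make the scenario concrete, a minimal, hypothetical configuration
(addresses and names made up) would look something like this on the OVN
side:

  # Two k8s Services -> two OVN load balancers, same backend 10.244.0.5
  ovn-nbctl lb-add svc-a 10.96.0.10:80 10.244.0.5:8080
  ovn-nbctl lb-add svc-b 10.96.0.20:80 10.244.0.5:8080
  ovn-nbctl ls-lb-add <pod-logical-switch> svc-a
  ovn-nbctl ls-lb-add <pod-logical-switch> svc-b

If 10.244.0.5 connects to either VIP and gets load balanced back to itself
(hairpin), the return traffic has to be rewritten using the VIP the pod
originally connected to, hence the need to tell the two VIPs apart.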
>>
>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on
>> how common it is that an OVN-networked pod accesses two (or more)
>> services that might have the pod itself as a backend.
>>
> 
> Yes, we definitely need input from the ovn-k8s side. The information we
> have so far: change [0] was to fix a bug [2] reported by Tim. However, the
> bug description didn't mention anything about two VIPs sharing the same
> backend. Tim also mentioned in the ovn-k8s meeting last week that the
> original user bug report for [2] was [3], and [3] was in fact a completely
> different problem (although it is related to hairpin, too). So, I am under

I am not completely sure about the link between [3] and [2], maybe Tim
remembers more.

> the impression that "an OVN-networked pod accesses two (or more) services
> that might have the pod itself as a backend" might be a very rare use case,
> if it exists at all.
> 

I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
runs a subset of the kubernetes conformance tests (AFAICT) and some
specific e2e ovn-kubernetes tests.
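
(For reference, I'm assuming here that patch 3/3 exposes the knob as an
NB_Global option, i.e. something along the lines of:

  ovn-nbctl set NB_Global . options:ovn-allow-vips-share-hairpin-backend=false

but the exact way to set it is whatever patch 3/3 defines.)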

The results are here:

https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true

Focusing on the conformance failures:

2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking 
Granular Checks: Services [It] should function for 
endpoint-Service: http 
2022-07-07T10:31:24.7228940Z 
vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
...
2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking 
Granular Checks: Services [It] should function for 
multiple endpoint-Services with same selector 
2022-07-07T10:31:24.7240819Z 
vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
...

Checking how these tests are defined:
https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236

It seems to me that they're testing explicitly for a "pod that accesses
two services that might have the pod itself as a backend".

So, if I'm not wrong, we'd become non-compliant in this case.
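
For anyone who wants to try the scenario outside the e2e suite, a rough
sketch of what these tests exercise (names/ports made up, and assuming
curl is available in the pod image):

  # One pod, exposed through two Services with the same selector
  kubectl run backend --image=nginx --labels=app=backend --port=80
  kubectl expose pod backend --name=svc-a --port=80 --target-port=80
  kubectl expose pod backend --name=svc-b --port=80 --target-port=80

  # Access both services from the backend pod itself (hairpin case)
  kubectl exec backend -- curl -s http://svc-a
  kubectl exec backend -- curl -s http://svc-b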
 
>> If this turns out to be mandatory, I guess we might also want to look
>> into alternatives like:
>> - getting help from the HW to offload matches like ct_tuple()
> 
> I believe this is going to happen in the future. HWOL is continuously
> enhanced.
> 

That would make things simpler.

>> - limiting the impact of "a." only to some load balancers (e.g., would
>> it help to use different hairpin lookup tables for such load balancers?)
> 
> I am not sure this would work, and I'm not sure it is a good approach,
> either. In general, I believe it is possible to solve the problem with more
> complex pipelines, but we need to keep in mind that it is quite easy to
> introduce other performance problems (either control plane or data plane) -
> many of the changes that led to the current implementation were performance
> optimizations, some for the control plane, some for HWOL, and some for
> reducing recirculations. I'd avoid complexity unless it is really
> necessary. Let's get more input on the problem, and based on that we can
> decide whether we want to move to a more complex solution.
> 

Sure, I agree we need to find the best solution.

In my opinion OVN should be HW-agnostic.  We did try to adjust the way
OVN generates OpenFlow flows in order to make them more "HWOL-friendly",
but that shouldn't come at the cost of breaking CMS features (if that's
the case here).

> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
> 
> Thanks,
> Han
> 

Thanks,
Dumitru
