Hi Han,

On 7/7/22 19:02, Dumitru Ceara wrote:
> On 7/7/22 18:21, Han Zhou wrote:
>> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 7/7/22 13:45, Dumitru Ceara wrote:
>>>> On 7/7/22 00:08, Han Zhou wrote:
>>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>
>>>>>> Hi Han,
>>>>>>
>>>>>> On 7/6/22 00:41, Han Zhou wrote:
>>>>>>> The ls_in_pre_stateful priority 120 flow that saves the dst IP and
>>>>>>> port to registers is causing a critical dataplane performance impact
>>>>>>> for short-lived connections, because it unwildcards megaflows with an
>>>>>>> exact match on the dst IP and L4 ports. Any new connection with a
>>>>>>> different client side L4 port will encounter a datapath flow miss and
>>>>>>> an upcall to ovs-vswitchd.
>>>>>>>
>>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>>> problem with the LB hairpin use case when different VIPs share
>>>>>>> overlapping backend+port combinations [0]. The change [0] might not
>>>>>>> have had as wide a performance impact back then, because at that time
>>>>>>> one of the match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only
>>>>>>> for established and natted traffic. The impact is more obvious now
>>>>>>> because, since commit [1] and several other indirectly related
>>>>>>> optimizations and refactors, REGBIT_CONNTRACK_NAT is set for all IP
>>>>>>> traffic (if any VIP is configured on the LS).
>>>>>>>
>>>>>>> The changes that introduced the performance problem have their own
>>>>>>> value (they fix problems or optimize performance), so we don't want to
>>>>>>> revert any of them (and it is also not straightforward to do so,
>>>>>>> because there have been lots of changes and refactors on top of them).
>>>>>>>
>>>>>>> Change [0] itself added an alternative way to solve the overlapping
>>>>>>> backends problem, which utilizes ct fields instead of saving the dst
>>>>>>> IP and port to registers. This patch switches to that approach and
>>>>>>> removes the flows/actions that save the dst IP and port, to avoid the
>>>>>>> dataplane performance problem for short-lived connections.
>>>>>>>
>>>>>>> (With this approach, the ct_state DNAT match is not HW-offload
>>>>>>> friendly, so those flows may not be offloaded; this is supposed to be
>>>>>>> addressed in a follow-up patch.)
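
To make the dataplane issue concrete (hand-written illustration, not taken
from a real flow dump; addresses and ports are made up), the backend ->
client direction today ends up with one megaflow per client connection,
roughly:

  recirc_id(0x2),eth_type(0x0800),
  ipv4(src=10.244.1.6,dst=10.244.0.5,proto=6,frag=no),
  tcp(src=8080,dst=43128), packets:3, bytes:230, used:0.5s, actions:...

where 43128 is the client's ephemeral port, so every new short-lived
connection triggers at least one upcall for its return traffic.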

While we're waiting for more input from ovn-k8s on this, I have a
slightly different question.

Aren't we hitting a similar problem in the router pipeline, due to
REG_ORIG_TP_DPORT_ROUTER?

Thanks,
Dumitru

>>>>>>>
>>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>>>>>>>     shared backends.")
>>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>>>>>>>     traffic.")
>>>>>>>
>>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
>>>>>>> ---
>>>>>>
>>>>>> I think the main concern I have is that this forces us to choose
>>>>>> between:
>>>>>> a. non-HWOL-friendly flows (reduced performance)
>>>>>> b. less functionality (with the knob in patch 3/3 set to false).
>>>>>>
>>>>> Thanks Dumitru for the comments! I agree the solution is not ideal, but
>>>>> if we look at it from a different angle, even with a), for most
>>>>> pod->service traffic the performance is still much better than it is
>>>>> today (a non-offloaded kernel datapath is still much better than the
>>>>> userspace slowpath). And *hopefully* b) is OK for most use cases that
>>>>> want HW-offload capability.
>>>>>
>>>
>>> Just a note on this item.  I'm a bit confused about why all traffic
>>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
>>> as an upcall, right?
>>>
>>
>> It is about all traffic for *short-lived* connections: any client ->
>> service traffic with the pattern:
>> 1. TCP connection setup
>> 2. Send API request, receive response
>> 3. Close TCP connection
>> It can be tested with netperf TCP_CRR. Every time the client side TCP port
>> is different, but since the server -> client DP flow includes the client
>> TCP port, each such transaction causes at least one DP flow miss and an
>> upcall to userspace. The application latency would be very high. In
>> addition, it makes the OVS handler CPU usage spike, which would further
>> impact the dataplane performance of the system.
>>
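
For reference, something along the lines below should reproduce and show the
effect (the VIP address is just an example):

  # drive many short-lived connections through the VIP
  netperf -H 172.30.0.10 -t TCP_CRR -l 60
  # in parallel, watch the datapath flow count and the handler/upcall load
  ovs-appctl dpctl/dump-flows | wc -l
  ovs-appctl upcall/show
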
>>> Once the megaflow (even if it's more specific than ideal) is installed
>>> all following traffic in that session should be forwarded in fast path
>>> (kernel).
>>>
>>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>>> your changes too for pod->service.  The datapath flow includes the
>>> dp_hash() match, and that's likely different for different connections.
>>>
>>
>> With the change it is not going to happen, because the match is on the
>> server side port only.
>> For dp_hash(), from what I remember, there are at most as many megaflows
>> as there are buckets (masked hash values).
>>
> 
> OK, that explains it, thanks.
> 
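
For reference, the dp_hash() part shows up in the datapath flow dump under a
mask, roughly like this (made-up values):

  recirc_id(0x3),dp_hash(0xb/0xf),eth_type(0x0800),..., actions:...

so the number of distinct megaflows is bounded by the number of masked hash
buckets rather than by the number of client connections.
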
>>> Or am I missing something?
>>>
>>>>>> Change [0] was added to address the case when a service in kubernetes
>>>>>> is exposed via two different k8s Service objects that share the same
>>>>>> endpoints.  That translates in ovn-k8s to two different OVN load
>>>>>> balancer VIPs that share the same backends.  For such cases, if the
>>>>>> service is being accessed by one of its own backends we need to be able
>>>>>> to differentiate based on the VIP address it used to connect to.
>>>>>>
>>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side
>>>>>> on how common it is that an OVN-networked pod accesses two (or more)
>>>>>> services that might have the pod itself as a backend.
>>>>>>
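
For concreteness, that case boils down to an NB setup roughly like the one
below (names and addresses are made up): two VIPs pointing at the same
backends, with one of the backends also acting as a client.

  ovn-nbctl lb-add lb-svc1 172.30.0.10:80 10.244.0.5:8080,10.244.1.6:8080
  ovn-nbctl lb-add lb-svc2 172.30.0.11:80 10.244.0.5:8080,10.244.1.6:8080
  ovn-nbctl ls-lb-add ls-node1 lb-svc1
  ovn-nbctl ls-lb-add ls-node1 lb-svc2
  # Hairpin case: 10.244.0.5 connects to either VIP and may get load-balanced
  # back to itself; OVN then needs to be able to tell which VIP the
  # connection was originally made to.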
>>>>>
>>>>> Yes, we definitely need input from the ovn-k8s side. The information we
>>>>> got so far: the change [0] was to fix a bug [2] reported by Tim.
>>>>> However, the bug description didn't mention anything about two VIPs
>>>>> sharing the same backend. Tim also mentioned in the ovn-k8s meeting last
>>>>> week that the original user bug report for [2] was [3], and [3] was in
>>>>> fact a completely different problem (although it is related to hairpin,
>>>>> too).
>>>>
>>>> I am not completely sure about the link between [3] and [2], maybe Tim
>>>> remembers more.
>>>>
>>>>> So, I am under the impression that "an OVN-networked pod accesses two
>>>>> (or more) services that might have the pod itself as a backend" might be
>>>>> a very rare use case, if it exists at all.
>>>>>
>>>>
>>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
>>>> to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
>>>> runs a subset of the kubernetes conformance tests (AFAICT) and some
>>>> specific e2e ovn-kubernetes tests.
>>>>
>>>> The results are here:
>>>>
>>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
>>>>
>>>> Focusing on the conformance failures:
>>>>
>>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for endpoint-Service: http
>>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>>> ...
>>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for multiple endpoint-Services with same selector
>>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>>> ...
>>>>
>>>> Checking how these tests are defined:
>>>>
>>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
>>>>
>>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
>>>>
>> Thanks for the test and information! Really need input from k8s folks to
>> understand more.
>>
>> Thanks,
>> Han
>>
> 
> Like I said below, these are kubernetes conformance tests so I'll let
> k8s folks confirm if such failures can be ignored/worked around.
> 
>>>> It seems to me that they're testing explicitly for a  "pod that accesses
>>>> two services that might have the pod itself as a backend".
>>>>
>>>> So, if I'm not wrong, we'd become non-compliant in this case.
>>>>
>>>>>> If this turns out to be mandatory I guess we might want to also look
>>>>>> into alternatives like:
>>>>>> - getting help from the HW to offload matches like ct_tuple()
>>>>>
>>>>> I believe this is going to happen in the future. HWOL is continuously
>>>>> enhanced.
>>>>>
>>>>
>>>> That would make things simpler.
>>>>
>>>>>> - limiting the impact of "a." only to some load balancers (e.g., would
>>>>>>   it help to use different hairpin lookup tables for such load
>>>>>>   balancers?)
>>>>>
>>>>> I am not sure if this would work, and I am not sure it is a good
>>>>> approach either. In general, I believe it is possible to solve the
>>>>> problem with more complex pipelines, but we need to keep in mind that it
>>>>> is quite easy to introduce other performance problems (either control
>>>>> plane or data plane): many of the changes that led to the current
>>>>> implementation were performance optimizations, some for the control
>>>>> plane, some for HWOL, and some for reducing recirculations. I'd avoid
>>>>> complexity unless it is really necessary. Let's get more input on the
>>>>> problem, and based on that we can decide if we want to move to a more
>>>>> complex solution.
>>>>>
>>>>
>>>> Sure, I agree we need to find the best solution.
>>>>
>>>> In my opinion OVN should be HW-agnostic.  We did try to adjust the way
>>>> OVN generates openflows in order to make it more "HWOL-friendly", but
>>>> that shouldn't come at the cost of breaking CMS features (if that's the
>>>> case here).
>>>>
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
>>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
>>>>>
>>>>> Thanks,
>>>>> Han
>>>>>
>>>>
>>>> Thanks,
>>>> Dumitru
>>>
>>
