Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-08-24 Thread Han Zhou
On Wed, Jul 20, 2022 at 10:55 PM Han Zhou  wrote:
>
>
>
> On Wed, Jul 20, 2022 at 12:13 PM Numan Siddique  wrote:
> >
> > On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara  wrote:
> > >
> > >
> > > Hi Han,
> > >
> > > On 7/7/22 19:02, Dumitru Ceara wrote:
> > > > On 7/7/22 18:21, Han Zhou wrote:
> > > >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara 
wrote:
> > > >>>
> > > >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> > >  On 7/7/22 00:08, Han Zhou wrote:
> > > > On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara 
wrote:
> > > >>
> > > >> Hi Han,
> > > >>
> > > >> On 7/6/22 00:41, Han Zhou wrote:
> > > >>> The ls_in_pre_stateful priority 120 flow that saves dst IP
and Port
> > > >> to
> > > >>> registers is causing a critical dataplane performance impact
to
> > > >>> short-lived connections, because it unwildcards megaflows
with exact
> > > >>> match on dst IP and L4 ports. Any new connections with a
different
> > > >>> client side L4 port will encounter datapath flow miss and
upcall to
> > > >>> ovs-vswitchd.
> > > >>>
> > > >>> These fields (dst IP and port) were saved to registers to
solve a
> > > >>> problem of LB hairpin use case when different VIPs are sharing
> > > >>> overlapping backend+port [0]. The change [0] might not have
as wide
> > > >>> performance impact as it is now because at that time one of
the match
> > > >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for
established
> > > >> and
> > > >>> natted traffic, while now the impact is more obvious because
> > > >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > > >>> configured on the LS) since commit [1], after several other
> > > >> indirectly
> > > >>> related optimizations and refactors.
> > > >>>
> > > >>> Since the changes that introduced the performance problem had
their
> > > >>> own values (fixes problems or optimizes performance), so we
don't
> > > >> want
> > > >>> to revert any of the changes (and it is also not
straightforward to
> > > >>> revert any of them because there have been lots of changes and
> > > >> refactors
> > > >>> on top of them).
> > > >>>
> > > >>> Change [0] itself has added an alternative way to solve the
> > > >> overlapping
> > > >>> backends problem, which utilizes ct fields instead of saving
dst IP
> > > >> and
> > > >>> port to registers. This patch forces to that approach and
removes the
> > > >>> flows/actions that saves the dst IP and port to avoid the
dataplane
> > > >>> performance problem for short-lived connections.
> > > >>>
> > > >>> (With this approach, the ct_state DNAT is not HW offload
friendly,
> > > >> so it
> > > >>> may result in those flows not being offloaded, which is
supposed to
> > > >> be
> > > >>> solved in a follow-up patch)
> > >
> > > While we're waiting for more input from ovn-k8s on this, I have a
> > > slightly different question.
> > >
> > > Aren't we hitting a similar problem in the router pipeline, due to
> > > REG_ORIG_TP_DPORT_ROUTER?
> > >
> > > Thanks,
> > > Dumitru
> > >
> > > >>>
> > > >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs
with
> > > >> shared
> > > > backends.")
> > > >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> > > >> traffic.")
> > > >>>
> > > >>> Signed-off-by: Han Zhou 
> > > >>> ---
> > > >>
> > > >> I think the main concern I have is that this forces us to
choose
> > > >> between:
> > > >> a. non hwol friendly flows (reduced performance)
> > > >> b. less functionality (with the knob in patch 3/3 set to
false).
> > > >>
> > > > Thanks Dumitru for the comments! I agree the solution is not
ideal,
> > > >> but if
> > > > we look at it from a different angle, even with a), for most
> > > >> pod->service
> > > > traffic the performance is still much better than how it is
today (not
> > > > offloaded kernel datapath is still much better than userspace
> > > >> slowpath).
> > > > And *hopefully* b) is ok for most use cases to get HW-offload
> > > >> capability.
> > > >
> > > >>>
> > > >>> Just a note on this item.  I'm a bit confused about why all
traffic
> > > >>> would be slowpath-ed?  It's just the first packet that goes to
vswitchd
> > > >>> as an upcall, right?
> > > >>>
> > > >>
> > > >> It is about all traffic for *short-lived* connections. Any clients
->
> > > >> service traffic with the pattern:
> > > >> 1. TCP connection setup
> > > >> 2. Set API request, receives response
> > > >> 3. Close TCP connection
> > > >> It can be tested with netperf TCP_CRR. Every time the client side
TCP port
> > > >> is different, but since the server -> client DP flow includes the
client
> > > >> TCP port, for each such transaction there is going to be at least
a DP flow
> > > >> miss and goes to userspace. Such application latency would be very
high. In
> > > >> addition, it causes the OVS 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Han Zhou
On Wed, Jul 20, 2022 at 12:13 PM Numan Siddique  wrote:
>
> On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara  wrote:
> >
> >
> > Hi Han,
> >
> > On 7/7/22 19:02, Dumitru Ceara wrote:
> > > On 7/7/22 18:21, Han Zhou wrote:
> > >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara 
wrote:
> > >>>
> > >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> >  On 7/7/22 00:08, Han Zhou wrote:
> > > On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara 
wrote:
> > >>
> > >> Hi Han,
> > >>
> > >> On 7/6/22 00:41, Han Zhou wrote:
> > >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and
Port
> > >> to
> > >>> registers is causing a critical dataplane performance impact to
> > >>> short-lived connections, because it unwildcards megaflows with
exact
> > >>> match on dst IP and L4 ports. Any new connections with a
different
> > >>> client side L4 port will encounter datapath flow miss and
upcall to
> > >>> ovs-vswitchd.
> > >>>
> > >>> These fields (dst IP and port) were saved to registers to solve
a
> > >>> problem of LB hairpin use case when different VIPs are sharing
> > >>> overlapping backend+port [0]. The change [0] might not have as
wide
> > >>> performance impact as it is now because at that time one of the
match
> > >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for
established
> > >> and
> > >>> natted traffic, while now the impact is more obvious because
> > >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > >>> configured on the LS) since commit [1], after several other
> > >> indirectly
> > >>> related optimizations and refactors.
> > >>>
> > >>> Since the changes that introduced the performance problem had
their
> > >>> own values (fixes problems or optimizes performance), so we
don't
> > >> want
> > >>> to revert any of the changes (and it is also not
straightforward to
> > >>> revert any of them because there have been lots of changes and
> > >> refactors
> > >>> on top of them).
> > >>>
> > >>> Change [0] itself has added an alternative way to solve the
> > >> overlapping
> > >>> backends problem, which utilizes ct fields instead of saving
dst IP
> > >> and
> > >>> port to registers. This patch forces to that approach and
removes the
> > >>> flows/actions that saves the dst IP and port to avoid the
dataplane
> > >>> performance problem for short-lived connections.
> > >>>
> > >>> (With this approach, the ct_state DNAT is not HW offload
friendly,
> > >> so it
> > >>> may result in those flows not being offloaded, which is
supposed to
> > >> be
> > >>> solved in a follow-up patch)
> >
> > While we're waiting for more input from ovn-k8s on this, I have a
> > slightly different question.
> >
> > Aren't we hitting a similar problem in the router pipeline, due to
> > REG_ORIG_TP_DPORT_ROUTER?
> >
> > Thanks,
> > Dumitru
> >
> > >>>
> > >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> > >> shared
> > > backends.")
> > >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> > >> traffic.")
> > >>>
> > >>> Signed-off-by: Han Zhou 
> > >>> ---
> > >>
> > >> I think the main concern I have is that this forces us to choose
> > >> between:
> > >> a. non hwol friendly flows (reduced performance)
> > >> b. less functionality (with the knob in patch 3/3 set to false).
> > >>
> > > Thanks Dumitru for the comments! I agree the solution is not
ideal,
> > >> but if
> > > we look at it from a different angle, even with a), for most
> > >> pod->service
> > > traffic the performance is still much better than how it is today
(not
> > > offloaded kernel datapath is still much better than userspace
> > >> slowpath).
> > > And *hopefully* b) is ok for most use cases to get HW-offload
> > >> capability.
> > >
> > >>>
> > >>> Just a note on this item.  I'm a bit confused about why all traffic
> > >>> would be slowpath-ed?  It's just the first packet that goes to
vswitchd
> > >>> as an upcall, right?
> > >>>
> > >>
> > >> It is about all traffic for *short-lived* connections. Any clients ->
> > >> service traffic with the pattern:
> > >> 1. TCP connection setup
> > >> 2. Set API request, receives response
> > >> 3. Close TCP connection
> > >> It can be tested with netperf TCP_CRR. Every time the client side
TCP port
> > >> is different, but since the server -> client DP flow includes the
client
> > >> TCP port, for each such transaction there is going to be at least a
DP flow
> > >> miss and goes to userspace. Such application latency would be very
high. In
> > >> addition, it causes the OVS handler CPU spikes very high which would
> > >> further impact the dataplane performance of the system.
> > >>
> > >>> Once the megaflow (even if it's more specific than ideal) is
installed
> > >>> all following traffic in that session should be forwarded in fast
path
> > >>> 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Numan Siddique
On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara  wrote:
>
>
> Hi Han,
>
> On 7/7/22 19:02, Dumitru Ceara wrote:
> > On 7/7/22 18:21, Han Zhou wrote:
> >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
> >>>
> >>> On 7/7/22 13:45, Dumitru Ceara wrote:
>  On 7/7/22 00:08, Han Zhou wrote:
> > On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
> >>
> >> Hi Han,
> >>
> >> On 7/6/22 00:41, Han Zhou wrote:
> >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
> >> to
> >>> registers is causing a critical dataplane performance impact to
> >>> short-lived connections, because it unwildcards megaflows with exact
> >>> match on dst IP and L4 ports. Any new connections with a different
> >>> client side L4 port will encounter datapath flow miss and upcall to
> >>> ovs-vswitchd.
> >>>
> >>> These fields (dst IP and port) were saved to registers to solve a
> >>> problem of LB hairpin use case when different VIPs are sharing
> >>> overlapping backend+port [0]. The change [0] might not have as wide
> >>> performance impact as it is now because at that time one of the match
> >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
> >> and
> >>> natted traffic, while now the impact is more obvious because
> >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >>> configured on the LS) since commit [1], after several other
> >> indirectly
> >>> related optimizations and refactors.
> >>>
> >>> Since the changes that introduced the performance problem had their
> >>> own values (fixes problems or optimizes performance), so we don't
> >> want
> >>> to revert any of the changes (and it is also not straightforward to
> >>> revert any of them because there have been lots of changes and
> >> refactors
> >>> on top of them).
> >>>
> >>> Change [0] itself has added an alternative way to solve the
> >> overlapping
> >>> backends problem, which utilizes ct fields instead of saving dst IP
> >> and
> >>> port to registers. This patch forces to that approach and removes the
> >>> flows/actions that saves the dst IP and port to avoid the dataplane
> >>> performance problem for short-lived connections.
> >>>
> >>> (With this approach, the ct_state DNAT is not HW offload friendly,
> >> so it
> >>> may result in those flows not being offloaded, which is supposed to
> >> be
> >>> solved in a follow-up patch)
>
> While we're waiting for more input from ovn-k8s on this, I have a
> slightly different question.
>
> Aren't we hitting a similar problem in the router pipeline, due to
> REG_ORIG_TP_DPORT_ROUTER?
>
> Thanks,
> Dumitru
>
> >>>
> >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> >> shared
> > backends.")
> >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> >> traffic.")
> >>>
> >>> Signed-off-by: Han Zhou 
> >>> ---
> >>
> >> I think the main concern I have is that this forces us to choose
> >> between:
> >> a. non hwol friendly flows (reduced performance)
> >> b. less functionality (with the knob in patch 3/3 set to false).
> >>
> > Thanks Dumitru for the comments! I agree the solution is not ideal,
> >> but if
> > we look at it from a different angle, even with a), for most
> >> pod->service
> > traffic the performance is still much better than how it is today (not
> > offloaded kernel datapath is still much better than userspace
> >> slowpath).
> > And *hopefully* b) is ok for most use cases to get HW-offload
> >> capability.
> >
> >>>
> >>> Just a note on this item.  I'm a bit confused about why all traffic
> >>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
> >>> as an upcall, right?
> >>>
> >>
> >> It is about all traffic for *short-lived* connections. Any clients ->
> >> service traffic with the pattern:
> >> 1. TCP connection setup
> >> 2. Set API request, receives response
> >> 3. Close TCP connection
> >> It can be tested with netperf TCP_CRR. Every time the client side TCP port
> >> is different, but since the server -> client DP flow includes the client
> >> TCP port, for each such transaction there is going to be at least a DP flow
> >> miss and goes to userspace. Such application latency would be very high. In
> >> addition, it causes the OVS handler CPU spikes very high which would
> >> further impact the dataplane performance of the system.
> >>
> >>> Once the megaflow (even if it's more specific than ideal) is installed
> >>> all following traffic in that session should be forwarded in fast path
> >>> (kernel).
> >>>
> >>> Also, I'm not sure I follow why the same behavior wouldn't happen with
> >>> your changes too for pod->service.  The datapath flow includes the
> >>> dp_hash() match, and that's likely different for different connections.
> >>>
> >>
> >> With the change it is not 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Dumitru Ceara


Hi Han,

On 7/7/22 19:02, Dumitru Ceara wrote:
> On 7/7/22 18:21, Han Zhou wrote:
>> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
>>>
>>> On 7/7/22 13:45, Dumitru Ceara wrote:
 On 7/7/22 00:08, Han Zhou wrote:
> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>>
>> Hi Han,
>>
>> On 7/6/22 00:41, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
>> to
>>> registers is causing a critical dataplane performance impact to
>>> short-lived connections, because it unwildcards megaflows with exact
>>> match on dst IP and L4 ports. Any new connections with a different
>>> client side L4 port will encounter datapath flow miss and upcall to
>>> ovs-vswitchd.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem of LB hairpin use case when different VIPs are sharing
>>> overlapping backend+port [0]. The change [0] might not have as wide
>>> performance impact as it is now because at that time one of the match
>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
>> and
>>> natted traffic, while now the impact is more obvious because
>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>> configured on the LS) since commit [1], after several other
>> indirectly
>>> related optimizations and refactors.
>>>
>>> Since the changes that introduced the performance problem had their
>>> own values (fixes problems or optimizes performance), so we don't
>> want
>>> to revert any of the changes (and it is also not straightforward to
>>> revert any of them because there have been lots of changes and
>> refactors
>>> on top of them).
>>>
>>> Change [0] itself has added an alternative way to solve the
>> overlapping
>>> backends problem, which utilizes ct fields instead of saving dst IP
>> and
>>> port to registers. This patch forces to that approach and removes the
>>> flows/actions that saves the dst IP and port to avoid the dataplane
>>> performance problem for short-lived connections.
>>>
>>> (With this approach, the ct_state DNAT is not HW offload friendly,
>> so it
>>> may result in those flows not being offloaded, which is supposed to
>> be
>>> solved in a follow-up patch)

While we're waiting for more input from ovn-k8s on this, I have a
slightly different question.

Aren't we hitting a similar problem in the router pipeline, due to
REG_ORIG_TP_DPORT_ROUTER?

Thanks,
Dumitru

>>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>> shared
> backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>> traffic.")
>>>
>>> Signed-off-by: Han Zhou 
>>> ---
>>
>> I think the main concern I have is that this forces us to choose
>> between:
>> a. non hwol friendly flows (reduced performance)
>> b. less functionality (with the knob in patch 3/3 set to false).
>>
> Thanks Dumitru for the comments! I agree the solution is not ideal,
>> but if
> we look at it from a different angle, even with a), for most
>> pod->service
> traffic the performance is still much better than how it is today (not
> offloaded kernel datapath is still much better than userspace
>> slowpath).
> And *hopefully* b) is ok for most use cases to get HW-offload
>> capability.
>
>>>
>>> Just a note on this item.  I'm a bit confused about why all traffic
>>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
>>> as an upcall, right?
>>>
>>
>> It is about all traffic for *short-lived* connections. Any clients ->
>> service traffic with the pattern:
>> 1. TCP connection setup
>> 2. Set API request, receives response
>> 3. Close TCP connection
>> It can be tested with netperf TCP_CRR. Every time the client side TCP port
>> is different, but since the server -> client DP flow includes the client
>> TCP port, for each such transaction there is going to be at least a DP flow
>> miss and goes to userspace. Such application latency would be very high. In
>> addition, it causes the OVS handler CPU spikes very high which would
>> further impact the dataplane performance of the system.
>>
>>> Once the megaflow (even if it's more specific than ideal) is installed
>>> all following traffic in that session should be forwarded in fast path
>>> (kernel).
>>>
>>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>>> your changes too for pod->service.  The datapath flow includes the
>>> dp_hash() match, and that's likely different for different connections.
>>>
>>
>> With the change it is not going to happen, because the match is for server
>> side port only.
>> For dp_hash(), for what I remembered, there are as many as the number of
>> megaflows as the number of buckets (the masked hash value) at most.
>>
> 
> OK, that explains it, thanks.
> 
>>> Or am I missing 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-07 Thread Dumitru Ceara
On 7/7/22 18:21, Han Zhou wrote:
> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
>>
>> On 7/7/22 13:45, Dumitru Ceara wrote:
>>> On 7/7/22 00:08, Han Zhou wrote:
 On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>
> Hi Han,
>
> On 7/6/22 00:41, Han Zhou wrote:
>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
> to
>> registers is causing a critical dataplane performance impact to
>> short-lived connections, because it unwildcards megaflows with exact
>> match on dst IP and L4 ports. Any new connections with a different
>> client side L4 port will encounter datapath flow miss and upcall to
>> ovs-vswitchd.
>>
>> These fields (dst IP and port) were saved to registers to solve a
>> problem of LB hairpin use case when different VIPs are sharing
>> overlapping backend+port [0]. The change [0] might not have as wide
>> performance impact as it is now because at that time one of the match
>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
> and
>> natted traffic, while now the impact is more obvious because
>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>> configured on the LS) since commit [1], after several other
> indirectly
>> related optimizations and refactors.
>>
>> Since the changes that introduced the performance problem had their
>> own values (fixes problems or optimizes performance), so we don't
> want
>> to revert any of the changes (and it is also not straightforward to
>> revert any of them because there have been lots of changes and
> refactors
>> on top of them).
>>
>> Change [0] itself has added an alternative way to solve the
> overlapping
>> backends problem, which utilizes ct fields instead of saving dst IP
> and
>> port to registers. This patch forces to that approach and removes the
>> flows/actions that saves the dst IP and port to avoid the dataplane
>> performance problem for short-lived connections.
>>
>> (With this approach, the ct_state DNAT is not HW offload friendly,
> so it
>> may result in those flows not being offloaded, which is supposed to
> be
>> solved in a follow-up patch)
>>
>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> shared
 backends.")
>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> traffic.")
>>
>> Signed-off-by: Han Zhou 
>> ---
>
> I think the main concern I have is that this forces us to choose
> between:
> a. non hwol friendly flows (reduced performance)
> b. less functionality (with the knob in patch 3/3 set to false).
>
 Thanks Dumitru for the comments! I agree the solution is not ideal,
> but if
 we look at it from a different angle, even with a), for most
> pod->service
 traffic the performance is still much better than how it is today (not
 offloaded kernel datapath is still much better than userspace
> slowpath).
 And *hopefully* b) is ok for most use cases to get HW-offload
> capability.

>>
>> Just a note on this item.  I'm a bit confused about why all traffic
>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
>> as an upcall, right?
>>
> 
> It is about all traffic for *short-lived* connections. Any clients ->
> service traffic with the pattern:
> 1. TCP connection setup
> 2. Set API request, receives response
> 3. Close TCP connection
> It can be tested with netperf TCP_CRR. Every time the client side TCP port
> is different, but since the server -> client DP flow includes the client
> TCP port, for each such transaction there is going to be at least a DP flow
> miss and goes to userspace. Such application latency would be very high. In
> addition, it causes the OVS handler CPU spikes very high which would
> further impact the dataplane performance of the system.
> 
>> Once the megaflow (even if it's more specific than ideal) is installed
>> all following traffic in that session should be forwarded in fast path
>> (kernel).
>>
>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>> your changes too for pod->service.  The datapath flow includes the
>> dp_hash() match, and that's likely different for different connections.
>>
> 
> With the change it is not going to happen, because the match is for server
> side port only.
> For dp_hash(), for what I remembered, there are as many as the number of
> megaflows as the number of buckets (the masked hash value) at most.
> 

OK, that explains it, thanks.

>> Or am I missing something?
>>
> Change [0] was added to address the case when a service in kubernetes
> is
> exposed via two different k8s services objects that share the same
> endpoints.  That translates in ovn-k8s to two different OVN load
> balancer VIPs that share the same backends.  For such cases, if the
> service is being accessed by one of its own backends we 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-07 Thread Han Zhou
On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
>
> On 7/7/22 13:45, Dumitru Ceara wrote:
> > On 7/7/22 00:08, Han Zhou wrote:
> >> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
> >>>
> >>> Hi Han,
> >>>
> >>> On 7/6/22 00:41, Han Zhou wrote:
>  The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
to
>  registers is causing a critical dataplane performance impact to
>  short-lived connections, because it unwildcards megaflows with exact
>  match on dst IP and L4 ports. Any new connections with a different
>  client side L4 port will encounter datapath flow miss and upcall to
>  ovs-vswitchd.
> 
>  These fields (dst IP and port) were saved to registers to solve a
>  problem of LB hairpin use case when different VIPs are sharing
>  overlapping backend+port [0]. The change [0] might not have as wide
>  performance impact as it is now because at that time one of the match
>  condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
and
>  natted traffic, while now the impact is more obvious because
>  REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>  configured on the LS) since commit [1], after several other
indirectly
>  related optimizations and refactors.
> 
>  Since the changes that introduced the performance problem had their
>  own values (fixes problems or optimizes performance), so we don't
want
>  to revert any of the changes (and it is also not straightforward to
>  revert any of them because there have been lots of changes and
refactors
>  on top of them).
> 
>  Change [0] itself has added an alternative way to solve the
overlapping
>  backends problem, which utilizes ct fields instead of saving dst IP
and
>  port to registers. This patch forces to that approach and removes the
>  flows/actions that saves the dst IP and port to avoid the dataplane
>  performance problem for short-lived connections.
> 
>  (With this approach, the ct_state DNAT is not HW offload friendly,
so it
>  may result in those flows not being offloaded, which is supposed to
be
>  solved in a follow-up patch)
> 
>  [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
shared
> >> backends.")
>  [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
traffic.")
> 
>  Signed-off-by: Han Zhou 
>  ---
> >>>
> >>> I think the main concern I have is that this forces us to choose
between:
> >>> a. non hwol friendly flows (reduced performance)
> >>> b. less functionality (with the knob in patch 3/3 set to false).
> >>>
> >> Thanks Dumitru for the comments! I agree the solution is not ideal,
but if
> >> we look at it from a different angle, even with a), for most
pod->service
> >> traffic the performance is still much better than how it is today (not
> >> offloaded kernel datapath is still much better than userspace
slowpath).
> >> And *hopefully* b) is ok for most use cases to get HW-offload
capability.
> >>
>
> Just a note on this item.  I'm a bit confused about why all traffic
> would be slowpath-ed?  It's just the first packet that goes to vswitchd
> as an upcall, right?
>

It is about all traffic for *short-lived* connections: any client ->
service traffic with the pattern:
1. Set up a TCP connection
2. Send an API request, receive the response
3. Close the TCP connection
This can be tested with netperf TCP_CRR. The client-side TCP port is
different every time, but since the server -> client DP flow includes the
client TCP port, each such transaction hits at least one DP flow miss and
an upcall to userspace. The application latency becomes very high, and in
addition the OVS handler CPU usage spikes, which further impacts the
dataplane performance of the system.
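
A minimal sketch of such a short-lived-connection client, for illustration
only (SERVER_IP, SERVER_PORT and the request bytes below are placeholders,
not taken from the patch or from netperf):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define SERVER_IP   "10.0.0.10"   /* placeholder service VIP */
#define SERVER_PORT 80            /* placeholder service port */

int
main(void)
{
    for (int i = 0; i < 5; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            return 1;
        }

        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port = htons(SERVER_PORT) };
        inet_pton(AF_INET, SERVER_IP, &srv.sin_addr);

        if (!connect(fd, (struct sockaddr *) &srv, sizeof srv)) {
            struct sockaddr_in local;
            socklen_t len = sizeof local;

            /* The kernel picks a new ephemeral source port for every
             * connection, so a megaflow that matches on the client L4
             * port is never reused across transactions. */
            getsockname(fd, (struct sockaddr *) &local, &len);
            printf("transaction %d used client port %u\n",
                   i, (unsigned) ntohs(local.sin_port));

            const char req[] = "GET / HTTP/1.0\r\n\r\n";
            char resp[512];
            send(fd, req, sizeof req - 1, 0);
            recv(fd, resp, sizeof resp, 0);
        }
        close(fd);    /* short-lived: tear the connection down right away */
    }
    return 0;
}

netperf TCP_CRR drives this same connect/request/response/close pattern at
a much higher rate, which is why it exposes the per-transaction upcalls.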

> Once the megaflow (even if it's more specific than ideal) is installed
> all following traffic in that session should be forwarded in fast path
> (kernel).
>
> Also, I'm not sure I follow why the same behavior wouldn't happen with
> your changes too for pod->service.  The datapath flow includes the
> dp_hash() match, and that's likely different for different connections.
>

With the change it is not going to happen, because the match is on the
server-side port only.
As for dp_hash(), from what I remember, there are at most as many megaflows
as there are hash buckets (distinct masked hash values).
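
A small standalone illustration of that bound (the 0xff mask and the use of
random() as a stand-in for dp_hash() are assumptions for the example, not
values taken from OVS):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const uint32_t mask = 0xff;     /* assumed bucket mask: 256 buckets */
    bool seen[256] = { false };
    unsigned int buckets = 0;

    for (int conn = 0; conn < 100000; conn++) {
        uint32_t hash = (uint32_t) random();  /* stand-in for dp_hash() */
        uint32_t bucket = hash & mask;        /* what the megaflow matches */

        if (!seen[bucket]) {
            seen[bucket] = true;
            buckets++;
        }
    }
    /* Prints at most 256, no matter how many connections were hashed,
     * whereas matching on the client L4 port would grow with the number
     * of distinct client ports. */
    printf("distinct masked hash values: %u\n", buckets);
    return 0;
}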

> Or am I missing something?
>
> >>> Change [0] was added to address the case when a service in kubernetes
is
> >>> exposed via two different k8s services objects that share the same
> >>> endpoints.  That translates in ovn-k8s to two different OVN load
> >>> balancer VIPs that share the same backends.  For such cases, if the
> >>> service is being accessed by one of its own backends we need to be
able
> >>> to differentiate based on the VIP address it used to connect to.
> >>>
> >>> CC: Tim Rozet, Dan Williams for some more input from the 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-07 Thread Dumitru Ceara
On 7/7/22 13:45, Dumitru Ceara wrote:
> On 7/7/22 00:08, Han Zhou wrote:
>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>>>
>>> Hi Han,
>>>
>>> On 7/6/22 00:41, Han Zhou wrote:
 The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
 registers is causing a critical dataplane performance impact to
 short-lived connections, because it unwildcards megaflows with exact
 match on dst IP and L4 ports. Any new connections with a different
 client side L4 port will encounter datapath flow miss and upcall to
 ovs-vswitchd.

 These fields (dst IP and port) were saved to registers to solve a
 problem of LB hairpin use case when different VIPs are sharing
 overlapping backend+port [0]. The change [0] might not have as wide
 performance impact as it is now because at that time one of the match
 condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
 natted traffic, while now the impact is more obvious because
 REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
 configured on the LS) since commit [1], after several other indirectly
 related optimizations and refactors.

 Since the changes that introduced the performance problem had their
 own values (fixes problems or optimizes performance), so we don't want
 to revert any of the changes (and it is also not straightforward to
 revert any of them because there have been lots of changes and refactors
 on top of them).

 Change [0] itself has added an alternative way to solve the overlapping
 backends problem, which utilizes ct fields instead of saving dst IP and
 port to registers. This patch forces to that approach and removes the
 flows/actions that saves the dst IP and port to avoid the dataplane
 performance problem for short-lived connections.

 (With this approach, the ct_state DNAT is not HW offload friendly, so it
 may result in those flows not being offloaded, which is supposed to be
 solved in a follow-up patch)

 [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
>> backends.")
 [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")

 Signed-off-by: Han Zhou 
 ---
>>>
>>> I think the main concern I have is that this forces us to choose between:
>>> a. non hwol friendly flows (reduced performance)
>>> b. less functionality (with the knob in patch 3/3 set to false).
>>>
>> Thanks Dumitru for the comments! I agree the solution is not ideal, but if
>> we look at it from a different angle, even with a), for most pod->service
>> traffic the performance is still much better than how it is today (not
>> offloaded kernel datapath is still much better than userspace slowpath).
>> And *hopefully* b) is ok for most use cases to get HW-offload capability.
>>

Just a note on this item.  I'm a bit confused about why all traffic
would be slow-pathed; it's just the first packet that goes to vswitchd
as an upcall, right?

Once the megaflow (even if it's more specific than ideal) is installed
all following traffic in that session should be forwarded in fast path
(kernel).

Also, I'm not sure I follow why the same behavior wouldn't happen with
your changes too for pod->service.  The datapath flow includes the
dp_hash() match, and that's likely different for different connections.

Or am I missing something?

>>> Change [0] was added to address the case when a service in kubernetes is
>>> exposed via two different k8s services objects that share the same
>>> endpoints.  That translates in ovn-k8s to two different OVN load
>>> balancer VIPs that share the same backends.  For such cases, if the
>>> service is being accessed by one of its own backends we need to be able
>>> to differentiate based on the VIP address it used to connect to.
>>>
>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on
>>> how common it is that an OVN-networked pod accesses two (or more)
>>> services that might have the pod itself as a backend.
>>>
>>
>> Yes, we definitely need input from ovn-k8s side. The information we got so
>> far: the change [0] was to fix a bug [2] reported by Tim. However, the bug
>> description didn't mention anything about two VIPs sharing the same
>> backend. Tim also mentioned in the ovn-k8s meeting last week that the
>> original user bug report for [2] was [3], and [3] was in fact a completely
>> different problem (although it is related to hairpin, too). So, I am under
> 
> I am not completely sure about the link between [3] and [2], maybe Tim
> remembers more.
> 
>> the impression that "an OVN-networked pod accesses two (or more) services
>> that might have the pod itself as a backend" might be a very rare use case,
>> if it exists at all.
>>
> 
> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
> to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
> runs a subset of the 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-07 Thread Dumitru Ceara
On 7/7/22 00:08, Han Zhou wrote:
> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>>
>> Hi Han,
>>
>> On 7/6/22 00:41, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>> registers is causing a critical dataplane performance impact to
>>> short-lived connections, because it unwildcards megaflows with exact
>>> match on dst IP and L4 ports. Any new connections with a different
>>> client side L4 port will encounter datapath flow miss and upcall to
>>> ovs-vswitchd.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem of LB hairpin use case when different VIPs are sharing
>>> overlapping backend+port [0]. The change [0] might not have as wide
>>> performance impact as it is now because at that time one of the match
>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
>>> natted traffic, while now the impact is more obvious because
>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>> configured on the LS) since commit [1], after several other indirectly
>>> related optimizations and refactors.
>>>
>>> Since the changes that introduced the performance problem had their
>>> own values (fixes problems or optimizes performance), so we don't want
>>> to revert any of the changes (and it is also not straightforward to
>>> revert any of them because there have been lots of changes and refactors
>>> on top of them).
>>>
>>> Change [0] itself has added an alternative way to solve the overlapping
>>> backends problem, which utilizes ct fields instead of saving dst IP and
>>> port to registers. This patch forces to that approach and removes the
>>> flows/actions that saves the dst IP and port to avoid the dataplane
>>> performance problem for short-lived connections.
>>>
>>> (With this approach, the ct_state DNAT is not HW offload friendly, so it
>>> may result in those flows not being offloaded, which is supposed to be
>>> solved in a follow-up patch)
>>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
> backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
>>>
>>> Signed-off-by: Han Zhou 
>>> ---
>>
>> I think the main concern I have is that this forces us to choose between:
>> a. non hwol friendly flows (reduced performance)
>> b. less functionality (with the knob in patch 3/3 set to false).
>>
> Thanks Dumitru for the comments! I agree the solution is not ideal, but if
> we look at it from a different angle, even with a), for most pod->service
> traffic the performance is still much better than how it is today (not
> offloaded kernel datapath is still much better than userspace slowpath).
> And *hopefully* b) is ok for most use cases to get HW-offload capability.
> 
>> Change [0] was added to address the case when a service in kubernetes is
>> exposed via two different k8s services objects that share the same
>> endpoints.  That translates in ovn-k8s to two different OVN load
>> balancer VIPs that share the same backends.  For such cases, if the
>> service is being accessed by one of its own backends we need to be able
>> to differentiate based on the VIP address it used to connect to.
>>
>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on
>> how common it is that an OVN-networked pod accesses two (or more)
>> services that might have the pod itself as a backend.
>>
> 
> Yes, we definitely need input from ovn-k8s side. The information we got so
> far: the change [0] was to fix a bug [2] reported by Tim. However, the bug
> description didn't mention anything about two VIPs sharing the same
> backend. Tim also mentioned in the ovn-k8s meeting last week that the
> original user bug report for [2] was [3], and [3] was in fact a completely
> different problem (although it is related to hairpin, too). So, I am under

I am not completely sure about the link between [3] and [2], maybe Tim
remembers more.

> the impression that "an OVN-networked pod accesses two (or more) services
> that might have the pod itself as a backend" might be a very rare use case,
> if it exists at all.
> 

I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
runs a subset of the kubernetes conformance tests (AFAICT) and some
specific e2e ovn-kubernetes tests.

The results are here:

https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true

Focusing on the conformance failures:

2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for endpoint-Service: http 
2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
...
2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular Checks: Services [It] should function for multiple endpoint-Services with 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-06 Thread Han Zhou
On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>
> Hi Han,
>
> On 7/6/22 00:41, Han Zhou wrote:
> > The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> > registers is causing a critical dataplane performance impact to
> > short-lived connections, because it unwildcards megaflows with exact
> > match on dst IP and L4 ports. Any new connections with a different
> > client side L4 port will encounter datapath flow miss and upcall to
> > ovs-vswitchd.
> >
> > These fields (dst IP and port) were saved to registers to solve a
> > problem of LB hairpin use case when different VIPs are sharing
> > overlapping backend+port [0]. The change [0] might not have as wide
> > performance impact as it is now because at that time one of the match
> > condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> > natted traffic, while now the impact is more obvious because
> > REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > configured on the LS) since commit [1], after several other indirectly
> > related optimizations and refactors.
> >
> > Since the changes that introduced the performance problem had their
> > own values (fixes problems or optimizes performance), so we don't want
> > to revert any of the changes (and it is also not straightforward to
> > revert any of them because there have been lots of changes and refactors
> > on top of them).
> >
> > Change [0] itself has added an alternative way to solve the overlapping
> > backends problem, which utilizes ct fields instead of saving dst IP and
> > port to registers. This patch forces to that approach and removes the
> > flows/actions that saves the dst IP and port to avoid the dataplane
> > performance problem for short-lived connections.
> >
> > (With this approach, the ct_state DNAT is not HW offload friendly, so it
> > may result in those flows not being offloaded, which is supposed to be
> > solved in a follow-up patch)
> >
> > [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared
backends.")
> > [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >
> > Signed-off-by: Han Zhou 
> > ---
>
> I think the main concern I have is that this forces us to choose between:
> a. non hwol friendly flows (reduced performance)
> b. less functionality (with the knob in patch 3/3 set to false).
>
Thanks Dumitru for the comments! I agree the solution is not ideal, but if
we look at it from a different angle, even with a), for most pod->service
traffic the performance is still much better than it is today (a
non-offloaded kernel datapath is still much better than the userspace
slowpath). And *hopefully* b) is OK for most use cases, so they can get
HW-offload capability.

> Change [0] was added to address the case when a service in kubernetes is
> exposed via two different k8s services objects that share the same
> endpoints.  That translates in ovn-k8s to two different OVN load
> balancer VIPs that share the same backends.  For such cases, if the
> service is being accessed by one of its own backends we need to be able
> to differentiate based on the VIP address it used to connect to.
>
> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on
> how common it is that an OVN-networked pod accesses two (or more)
> services that might have the pod itself as a backend.
>

Yes, we definitely need input from the ovn-k8s side. The information we have
so far: the change [0] was to fix a bug [2] reported by Tim. However, the bug
description didn't mention anything about two VIPs sharing the same
backend. Tim also mentioned in the ovn-k8s meeting last week that the
original user bug report for [2] was [3], and [3] was in fact a completely
different problem (although it is related to hairpin, too). So, I am under
the impression that "an OVN-networked pod accesses two (or more) services
that might have the pod itself as a backend" might be a very rare use case,
if it exists at all.

> If this turns out to be mandatory I guess we might want to also look
> into alternatives like:
> - getting help from the HW to offload matches like ct_tuple()

I believe this is going to happen in the future. HWOL is continuously
enhanced.

> - limiting the impact of "a." only to some load balancers (e.g., would
> it help to use different hairpin lookup tables for such load balancers?)

I am not sure this would work, and I am not sure it is a good approach,
either. In general, I believe it is possible to solve the problem with more
complex pipelines, but we need to keep in mind that it is quite easy to
introduce other performance problems (either control plane or data plane);
many of the changes that led to the current implementation were performance
optimizations, some for the control plane, some for HWOL, and some to reduce
recirculations. I'd avoid complexity unless it is really necessary. Let's
get more input on the problem, and based on that we can decide whether we
want to move to a more complex solution.

[2] 

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-06 Thread Dumitru Ceara
Hi Han,

On 7/6/22 00:41, Han Zhou wrote:
> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
> registers is causing a critical dataplane performance impact to
> short-lived connections, because it unwildcards megaflows with exact
> match on dst IP and L4 ports. Any new connections with a different
> client side L4 port will encounter datapath flow miss and upcall to
> ovs-vswitchd.
> 
> These fields (dst IP and port) were saved to registers to solve a
> problem of LB hairpin use case when different VIPs are sharing
> overlapping backend+port [0]. The change [0] might not have as wide
> performance impact as it is now because at that time one of the match
> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established and
> natted traffic, while now the impact is more obvious because
> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> configured on the LS) since commit [1], after several other indirectly
> related optimizations and refactors.
> 
> Since the changes that introduced the performance problem had their
> own values (fixes problems or optimizes performance), so we don't want
> to revert any of the changes (and it is also not straightforward to
> revert any of them because there have been lots of changes and refactors
> on top of them).
> 
> Change [0] itself has added an alternative way to solve the overlapping
> backends problem, which utilizes ct fields instead of saving dst IP and
> port to registers. This patch forces to that approach and removes the
> flows/actions that saves the dst IP and port to avoid the dataplane
> performance problem for short-lived connections.
> 
> (With this approach, the ct_state DNAT is not HW offload friendly, so it
> may result in those flows not being offloaded, which is supposed to be
> solved in a follow-up patch)
> 
> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared 
> backends.")
> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> 
> Signed-off-by: Han Zhou 
> ---

I think the main concern I have is that this forces us to choose between:
a. non-HWOL-friendly flows (reduced performance)
b. less functionality (with the knob in patch 3/3 set to false).

Change [0] was added to address the case when a service in kubernetes is
exposed via two different k8s services objects that share the same
endpoints.  That translates in ovn-k8s to two different OVN load
balancer VIPs that share the same backends.  For such cases, if the
service is being accessed by one of its own backends we need to be able
to differentiate based on the VIP address it used to connect to.

CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side on
how common it is that an OVN-networked pod accesses two (or more)
services that might have the pod itself as a backend.

If this turns out to be mandatory I guess we might want to also look
into alternatives like:
- getting help from the HW to offload matches like ct_tuple()
- limiting the impact of "a." only to some load balancers (e.g., would
it help to use different hairpin lookup tables for such load balancers?)

Thanks,
Dumitru

>  controller/lflow.c   |  74 ++--
>  include/ovn/logical-fields.h |   5 -
>  lib/lb.c |   3 -
>  lib/lb.h |   3 -
>  northd/northd.c  |  95 ---
>  northd/ovn-northd.8.xml  |  30 +
>  tests/ovn-northd.at  |  78 -
>  tests/ovn.at | 218 +--
>  8 files changed, 141 insertions(+), 365 deletions(-)
> 
> diff --git a/controller/lflow.c b/controller/lflow.c
> index 934b23698..a44f6d056 100644
> --- a/controller/lflow.c
> +++ b/controller/lflow.c
> @@ -1932,10 +1932,6 @@ add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
>  }
>  
>  /* Adds flows to detect hairpin sessions.
> - *
> - * For backwards compatibilty with older ovn-northd versions, uses
> - * ct_nw_dst(), ct_ipv6_dst(), ct_tp_dst(), otherwise uses the
> - * original destination tuple stored by ovn-northd.
>   */
>  static void
>  add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
> @@ -1956,10 +1952,8 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
>      /* Matching on ct_nw_dst()/ct_ipv6_dst()/ct_tp_dst() requires matching
>       * on ct_state first.
>       */
> -    if (!lb->hairpin_orig_tuple) {
> -        uint32_t ct_state = OVS_CS_F_TRACKED | OVS_CS_F_DST_NAT;
> -        match_set_ct_state_masked(&hairpin_match, ct_state, ct_state);
> -    }
> +    uint32_t ct_state = OVS_CS_F_TRACKED | OVS_CS_F_DST_NAT;
> +    match_set_ct_state_masked(&hairpin_match, ct_state, ct_state);
>  
>      if (IN6_IS_ADDR_V4MAPPED(&lb_vip->vip)) {
>          ovs_be32 bip4 = in6_addr_get_mapped_ipv4(&lb_backend->ip);
> @@ -1971,14 +1965,7 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
>          match_set_dl_type(&hairpin_match, htons(ETH_TYPE_IP));
>          match_set_nw_src(&hairpin_match, bip4);
>          match_set_nw_dst(&hairpin_match, bip4);

[ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-05 Thread Han Zhou
The ls_in_pre_stateful priority 120 flow that saves the dst IP and Port to
registers is causing a critical dataplane performance impact for
short-lived connections, because it unwildcards megaflows with an exact
match on the dst IP and L4 ports. Any new connection with a different
client-side L4 port will encounter a datapath flow miss and an upcall to
ovs-vswitchd.

These fields (dst IP and port) were saved to registers to solve a problem
in the LB hairpin use case when different VIPs share an overlapping
backend+port [0]. The change [0] might not have had as wide a performance
impact at the time, because one of the match conditions,
"REGBIT_CONNTRACK_NAT == 1", was then set only for established and natted
traffic. The impact is more obvious now because, since commit [1] and
several other indirectly related optimizations and refactors,
REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is configured on
the LS).

The changes that introduced the performance problem have their own value
(they fix problems or optimize performance), so we don't want to revert any
of them (and it is also not straightforward to do so, because there have
been many changes and refactors on top of them).

Change [0] itself added an alternative way to solve the overlapping
backends problem, which uses ct fields instead of saving the dst IP and
port to registers. This patch enforces that approach and removes the
flows/actions that save the dst IP and port, avoiding the dataplane
performance problem for short-lived connections.

(With this approach, the ct_state DNAT match is not HW-offload friendly, so
those flows may not be offloaded; this is supposed to be solved in a
follow-up patch.)

[0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
[1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")

Signed-off-by: Han Zhou 
---
 controller/lflow.c   |  74 ++--
 include/ovn/logical-fields.h |   5 -
 lib/lb.c |   3 -
 lib/lb.h |   3 -
 northd/northd.c  |  95 ---
 northd/ovn-northd.8.xml  |  30 +
 tests/ovn-northd.at  |  78 -
 tests/ovn.at | 218 +--
 8 files changed, 141 insertions(+), 365 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 934b23698..a44f6d056 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -1932,10 +1932,6 @@ add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
 }
 
 /* Adds flows to detect hairpin sessions.
- *
- * For backwards compatibilty with older ovn-northd versions, uses
- * ct_nw_dst(), ct_ipv6_dst(), ct_tp_dst(), otherwise uses the
- * original destination tuple stored by ovn-northd.
  */
 static void
 add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
@@ -1956,10 +1952,8 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
     /* Matching on ct_nw_dst()/ct_ipv6_dst()/ct_tp_dst() requires matching
      * on ct_state first.
      */
-    if (!lb->hairpin_orig_tuple) {
-        uint32_t ct_state = OVS_CS_F_TRACKED | OVS_CS_F_DST_NAT;
-        match_set_ct_state_masked(&hairpin_match, ct_state, ct_state);
-    }
+    uint32_t ct_state = OVS_CS_F_TRACKED | OVS_CS_F_DST_NAT;
+    match_set_ct_state_masked(&hairpin_match, ct_state, ct_state);
 
     if (IN6_IS_ADDR_V4MAPPED(&lb_vip->vip)) {
         ovs_be32 bip4 = in6_addr_get_mapped_ipv4(&lb_backend->ip);
@@ -1971,14 +1965,7 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
         match_set_dl_type(&hairpin_match, htons(ETH_TYPE_IP));
         match_set_nw_src(&hairpin_match, bip4);
         match_set_nw_dst(&hairpin_match, bip4);
-
-        if (!lb->hairpin_orig_tuple) {
-            match_set_ct_nw_dst(&hairpin_match, vip4);
-        } else {
-            match_set_reg(&hairpin_match,
-                          MFF_LOG_LB_ORIG_DIP_IPV4 - MFF_LOG_REG0,
-                          ntohl(vip4));
-        }
+        match_set_ct_nw_dst(&hairpin_match, vip4);
 
         add_lb_vip_hairpin_reply_action(NULL, snat_vip4, lb_proto,
                                         lb_backend->port,
@@ -1993,17 +1980,7 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,
         match_set_dl_type(&hairpin_match, htons(ETH_TYPE_IPV6));
         match_set_ipv6_src(&hairpin_match, bip6);
         match_set_ipv6_dst(&hairpin_match, bip6);
-
-        if (!lb->hairpin_orig_tuple) {
-            match_set_ct_ipv6_dst(&hairpin_match, &lb_vip->vip);
-        } else {
-            ovs_be128 vip6_value;
-
-            memcpy(&vip6_value, &lb_vip->vip, sizeof vip6_value);
-            match_set_xxreg(&hairpin_match,
-                            MFF_LOG_LB_ORIG_DIP_IPV6 - MFF_LOG_XXREG0,
-                            ntoh128(vip6_value));
-        }
+        match_set_ct_ipv6_dst(&hairpin_match, &lb_vip->vip);
 
         add_lb_vip_hairpin_reply_action(snat_vip6, 0, lb_proto,
                                         lb_backend->port,
@@ -2014,14 +1991,8 @@ add_lb_vip_hairpin_flows(struct ovn_controller_lb *lb,