Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding datapath flows.

2022-10-18 Thread Jan Scheurich via dev
Hi guys,

I am afraid that commit is too long ago for me to remember any details about what 
caused us to change the code in beb75a40fdc2 ("userspace: Switching of L3 
packets in L2 pipeline"). What I vaguely remember is that I couldn't 
comprehend the original code and that it was not working correctly in some of the 
cases we needed/tested. But perhaps the changes we introduced also had corner 
cases we didn't consider. 

Question though: 

> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.

That sounds strange to me. I always believed the datapath only supports 
non-overlapping flows, i.e. a packet can match at most one datapath flow. Only 
with that prerequisite can the dpcls classifier work without priorities. Have 
I been wrong about this? What would be the semantics of adding a wider flow to the 
datapath? To my knowledge there is no guarantee that the dpcls subtables are 
visited in any specific order that would honor the mask width, and the first 
match will win.
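
Just to illustrate what I mean by "first match wins", here is a toy stand-alone
example (this is not the dpcls code, only my mental model of a per-mask lookup
without priorities; all names are made up):

    /* Toy model: one table per mask, no priorities, arbitrary iteration order. */
    #include <stdint.h>
    #include <stdio.h>

    struct toy_subtable { uint32_t mask; uint32_t match; const char *name; };

    static const char *
    toy_lookup(const struct toy_subtable *tbl, int n, uint32_t key)
    {
        for (int i = 0; i < n; i++) {
            if ((key & tbl[i].mask) == tbl[i].match) {
                return tbl[i].name;               /* first match wins */
            }
        }
        return "miss";
    }

    int main(void)
    {
        struct toy_subtable tbl[] = {
            { 0x80000000, 0x80000000, "wide flow (/1 mask)"   },
            { 0xc0000000, 0xc0000000, "narrow flow (/2 mask)" },
        };
        /* 0xc1010101 matches both entries; the wide one happens to be hit first. */
        printf("%s\n", toy_lookup(tbl, 2, 0xc1010101));
        return 0;
    }

If overlapping flows with different masks were really allowed, the lookup result
would depend entirely on the iteration order.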

Please clarify this. And in which sense does OVS rely on this behavior?

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, 18 October 2022 21:40
> To: Eelco Chaudron ; d...@openvswitch.org
> Cc: i.maxim...@ovn.org; Jan Scheurich 
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding
> datapath flows.
> 
> On 10/18/22 18:42, Eelco Chaudron wrote:
> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.
> >
> > However, if we try to add a wildcard rule, the installation fails:
> >
> > # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >   ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
> > # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >   ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
> > ovs-vswitchd: updating flow table (File exists)
> >
> > The reason is that the key used to determine if the flow is already
> > present in the system uses the original key ANDed with the mask.
> > This results in the IP address not being part of the (miniflow) key,
> > i.e., being substituted with an all-zero value. When doing the actual
> > lookup, this results in the key wrongfully matching the first flow,
> > and therefore the flow does not get installed. The solution is to use
> > the unmasked key for the existence check, the same way this is handled
> > in the userspace datapath.
> >
> > Signed-off-by: Eelco Chaudron 
> > ---
> >  lib/dpif-netdev.c|   33 +
> >  tests/dpif-netdev.at |   14 ++
> >  2 files changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > a45b46014..daa00aa2f 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -3321,6 +3321,28 @@ netdev_flow_key_init_masked(struct netdev_flow_key *dst,
> >  (dst_u64 - miniflow_get_values(&dst->mf)) * 8);
> >  }
> >
> > +/* Initializes 'dst' as a copy of 'flow'. */
> > +static inline void
> > +netdev_flow_key_init(struct netdev_flow_key *key,
> > +                     const struct flow *flow)
> > +{
> > +    uint64_t *dst = miniflow_values(&key->mf);
> > +    uint32_t hash = 0;
> > +    uint64_t value;
> > +
> > +    miniflow_map_init(&key->mf, flow);
> > +    miniflow_init(&key->mf, flow);
> > +
> > +    size_t n = dst - miniflow_get_values(&key->mf);
> > +
> > +    FLOW_FOR_EACH_IN_MAPS (value, flow, key->mf.map) {
> > +        hash = hash_add64(hash, value);
> > +    }
> > +
> > +    key->hash = hash_finish(hash, n * 8);
> > +    key->len = netdev_flow_key_size(n);
> > +}
> > +
> >  static inline void
> >  emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
> >                   const struct netdev_flow_key *key)
> > @@ -4195,7 +4217,7 @@ static int
> >  dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
> >  {
> >  struct dp_netdev *dp = get_dp_netdev(dpif);
> > -struct netdev_flow_key key, mask;
> > +struct netdev_flow_key key;
> >  struct dp_netdev_pmd_thread *pmd;
> >

Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

2022-07-21 Thread Jan Scheurich via dev
Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node 
is slightly higher than from a local NUMA node, so cross-NUMA polling of 
rx queues has some overhead. However, the packet processing cost is much more 
influenced by the location of the target vhostuser ports. If the majority of the 
rx queue traffic is going to a VM on the other NUMA node, it is actually 
*better* to poll the packets in a PMD on the VM's NUMA node.

Long story short, OVS doesn't have sufficient data to correctly 
predict the actual rxq load when an rxq is assigned to another PMD in a different 
queue configuration. The rxq processing cycles measured on the current PMD are the 
best estimate we have for balancing the overall load on the PMDs. We need to 
live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the 
load over PMDs is *not* to minimize the total cycles spent by PMDs on 
processing packets. The PMDs run in a busy loop anyhow and burn all cycles of 
the CPU. The purpose is to prevent some PMD from getting unnecessarily congested 
(i.e. load > 95%) while others have a lot of spare capacity and could take over 
some rxqs.

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable 
tool to help OVS's cycle-based rxq-balancing algorithm to do its job, and I 
strongly suggest we allow the proposed per-port opt-in option.
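
For reference, the opt-in proposed in our patch is a plain per-interface setting,
e.g. (the interface name is only an example):

    ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true

If the option is absent or set to false, the legacy NUMA-local behaviour applies.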

BR, Jan

From: Anurag Agarwal 
Sent: Thursday, 21 July 2022 07:15
To: lic...@chinatelecom.cn
Cc: Jan Scheurich 
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hello Cheng,
With cross-numa enabled, we flatten the PMD list across NUMAs and select 
the least loaded PMD. Thus I would not like to consider the case below.

Regards,
Anurag

From: lic...@chinatelecom.cn
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal 
Cc: Jan Scheurich 
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

"If local numa has bandwidth for rxq, we are not supposed to assign a rxq to 
remote pmd."
Would you like to consider this case? If not, I think we don't have to resolve 
the cycles measurement issue for cross numa case.


李成

From: Anurag Agarwal
Date: 2022-07-21 10:21
To: lic...@chinatelecom.cn
CC: Jan Scheurich
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports
+ Jan

Hello Cheng,
  Thanks for your insightful comments. Please find my inputs inline.

Regards,
Anurag

From: lic...@chinatelecom.cn
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal 
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

Sorry for the late reply, I was busy on a task for the last two weeks.

I think your proposal can cover the case I reported. It looks good to me.
>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to 
address.

From my test, cross-numa polling has worse performance than numa-affine 
polling (at least 10% worse).
So if the local numa has bandwidth for an rxq, we are not supposed to assign that 
rxq to a remote pmd.
Unfortunately, we don't know whether a pmd is out of bandwidth from its assigned 
rxq cycles, because the rx batch size impacts the rxq cycles a lot in my test:
    rx batch    cycles per pkt
    1.00        5738
    5.00        2353
    12.15       1770
    32.00       1533

The faster packets arrive, the larger the rx batch size; and the more rxqs a pmd is 
assigned, the larger the rx batch size.
Imagine that pmd pA has only one rxq assigned and packets arrive at 1 pkt per 5738 
cycles; the rxq rx batch size is then 1.00.
Now pA has 2 rxqs assigned, each with packets arriving at 1 pkt per 5738 cycles.
The pmd spends 5738 cycles processing the first rxq, and then the second.
After the second rxq is processed, the pmd comes back to the first rxq, which now 
has 2 pkts ready (because 2*5738 cycles have passed).
The rxq batch size becomes 2.
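
To put rough numbers on it (a back-of-the-envelope illustration only, using the
measured figures above):

    1 rxq  on pA: revisit interval ~5738 cycles -> ~1 pkt per poll -> ~5738 cycles/pkt
    2 rxqs on pA: revisit interval ~2 x 5738    -> ~2 pkts per poll -> noticeably fewer cycles/pkt

So the per-packet rxq cycles shrink as the pmd picks up more rxqs, even though the
offered load per rxq is unchanged.
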
>> Ok. Do you think it is a more generic problem with cycles measurement and 
>> PMD utilization? Not specific to cross-numa feature..

So it's hard to tell whether a pmd is overloaded from the rxq cycles alone.
Finally, I think the cross-numa feature is very nice. I will put effort into this as 
well to cover cases in our company.
Let's keep in sync on progress :)
>> Thanks



李成

From: Anurag Agarwal<mailto:anurag.agar...@ericsson.com>
Date: 2022-06-29 14:12
To: lic...@chinatelecom.cn

Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
> I found this one though:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20220403222617.31688-1-jan.scheurich@web.de/
> 
> It was held back by the mail list and appears to be better formatted.
> Let's see if it works.
> 
> P.S. I marked that email address for acceptance, so the mail list
>  should no longer block it.

Thanks Ilya, that is a good work-around for me until I find a solution for SMTP 
with my Ericsson mail.

Regards, Jan



[ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread jan . scheurich
From: Jan Scheurich 

A packet received from a tunnel port with legacy_l3 packet-type (e.g.
lisp, L3 gre, gtpu) is conceptually wrapped in a dummy Ethernet header
for processing in an OF pipeline that is not packet-type-aware. Before
transmission of the packet to another legacy_l3 tunnel port, the dummy
Ethernet header is stripped again.

In ofproto-xlate, wrapping in the dummy Ethernet header is done by
simply changing the packet_type to PT_ETH. The generation of the
push_eth datapath action is deferred until the packet's flow changes
need to be committed, for example at output to a normal port. The
deferred Ethernet encapsulation is marked in the pending_encap flag.

This patch fixes a bug in the translation of the output action to a
legacy_l3 tunnel port, where the packet_type of the flow is reverted
from PT_ETH to PT_IPV4 or PT_IPV6 (depending on the dl_type) to remove
its Ethernet header without clearing the pending_encap flag if it was
set. At the subsequent commit of the flow changes, the unexpected
combination of pending_encap == true with a PT_IPV4 or PT_IPV6
packet_type hit the OVS_NOT_REACHED() abort.

The pending_encap flag is now cleared in this situation.

Reported-By: Dincer Beken 
Signed-off-By: Jan Scheurich 
Signed-off-By: Dincer Beken 

v1->v2:
  A specific test has been added to tunnel-push-pop.at to verify the
  correct behavior.
---
 ofproto/ofproto-dpif-xlate.c |  4 
 tests/tunnel-push-pop.at | 22 ++
 2 files changed, 26 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index cc9c1c628..1843a5d66 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4195,6 +4195,10 @@ compose_output_action__(struct xlate_ctx *ctx, 
ofp_port_t ofp_port,
 if (xport->pt_mode == NETDEV_PT_LEGACY_L3) {
 flow->packet_type = PACKET_TYPE_BE(OFPHTN_ETHERTYPE,
ntohs(flow->dl_type));
+if (ctx->pending_encap) {
+/* The Ethernet header was not actually added yet. */
+ctx->pending_encap = false;
+}
 }
 }

diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at
index 57589758f..70462d905 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -546,6 +546,28 @@ AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  
[[37]]' | sort], [0], [dnl
   port  7: rx pkts=5, bytes=434, drop=?, errs=?, frame=?, over=?, crc=?
 ])

+dnl Send out packets received from L3GRE tunnel back to L3GRE tunnel
+AT_CHECK([ovs-ofctl del-flows int-br])
+AT_CHECK([ovs-ofctl add-flow int-br 
"in_port=7,actions=set_field:3->in_port,7"])
+AT_CHECK([ovs-vsctl -- set Interface br0 options:pcap=br0.pcap])
+
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+
+ovs-appctl time/warp 1000
+
+AT_CHECK([ovs-pcap p0.pcap > p0.pcap.txt 2>&1])
+AT_CHECK([tail -6 p0.pcap.txt], [0], [dnl
+aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f2021222

Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
Hi Ilya,

Sorry for spamming, but I am again having problems sending correctly formatted 
patches to the ovs-dev list. My previous SMTP server for git send-email no 
longer works, and patches I send through my private SMTP provider do not reach 
the mailing list. Resending from Outlook obviously still screws up the patch ☹.

Any chance that you could pick the PATCH v2 manually and feed it through the 
process? I doubt that I will find a solution to my mailing issues soon.

Thanks, Jan

> -Original Message-
> From: 0-day Robot 
> Sent: Monday, 4 April, 2022 09:56
> To: Jan Scheurich 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding
> packet between legacy_l3 tunnels
> 
> Bleep bloop.  Greetings Jan Scheurich, I am a robot and I have tried out your
> patch.
> Thanks for your contribution.
> 
> I encountered some error that I wasn't expecting.  See the details below.
> 
> 
> git-am:
> error: corrupt patch at line 85
> error: could not build fake ancestor
> hint: Use 'git am --show-current-patch' to see the failed patch
> Patch failed at 0001 ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
> 
> 
> Patch skipped due to previous failure.
> 
> Please check this out.  If you feel there has been an error, please email
> acon...@redhat.com
> 
> Thanks,
> 0-day Robot


[ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels

2022-04-04 Thread Jan Scheurich via dev
From: Jan Scheurich 

A packet received from a tunnel port with legacy_l3 packet-type (e.g.
lisp, L3 gre, gtpu) is conceptually wrapped in a dummy Ethernet header
for processing in an OF pipeline that is not packet-type-aware. Before
transmission of the packet to another legacy_l3 tunnel port, the dummy
Ethernet header is stripped again.

In ofproto-xlate, wrapping in the dummy Ethernet header is done by 
simply changing the packet_type to PT_ETH. The generation of the 
push_eth datapath action is deferred until the packet's flow changes 
need to be committed, for example at output to a normal port. The 
deferred Ethernet encapsulation is marked in the pending_encap flag.

This patch fixes a bug in the translation of the output action to a
legacy_l3 tunnel port, where the packet_type of the flow is reverted 
from PT_ETH to PT_IPV4 or PT_IPV6 (depending on the dl_type) to 
remove its Ethernet header without clearing the pending_encap flag 
if it was set. At the subsequent commit of the flow changes, the 
unexpected combination of pending_encap == true with a PT_IPV4 
or PT_IPV6 packet_type hit the OVS_NOT_REACHED() abort.

The pending_encap flag is now cleared in this situation.

Reported-By: Dincer Beken 
Signed-off-By: Jan Scheurich 
Signed-off-By: Dincer Beken 

v1->v2:
  A specific test has been added to tunnel-push-pop.at to verify the
  correct behavior.
---
 ofproto/ofproto-dpif-xlate.c |  4 
 tests/tunnel-push-pop.at | 22 ++
 2 files changed, 26 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 
cc9c1c628..1843a5d66 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4195,6 +4195,10 @@ compose_output_action__(struct xlate_ctx *ctx, 
ofp_port_t ofp_port,
 if (xport->pt_mode == NETDEV_PT_LEGACY_L3) {
 flow->packet_type = PACKET_TYPE_BE(OFPHTN_ETHERTYPE,
ntohs(flow->dl_type));
+if (ctx->pending_encap) {
+/* The Ethernet header was not actually added yet. */
+ctx->pending_encap = false;
+}
 }
 }

diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at index 
57589758f..70462d905 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -546,6 +546,28 @@ AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  
[[37]]' | sort], [0], [dnl
   port  7: rx pkts=5, bytes=434, drop=?, errs=?, frame=?, over=?, crc=?
 ])

+dnl Send out packets received from L3GRE tunnel back to L3GRE tunnel 
+AT_CHECK([ovs-ofctl del-flows int-br]) AT_CHECK([ovs-ofctl add-flow 
+int-br "in_port=7,actions=set_field:3->in_port,7"])
+AT_CHECK([ovs-vsctl -- set Interface br0 options:pcap=br0.pcap])
+
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
+'aa55aa55001b213cab640800457079464000402fba630101025c0101025820
+00080001c84554ba20400184861e011e024227e75400030
+af31955f2650100101112131415161718191a1b1c1d1e1f20212223
+2425262728292a2b2c2d2e2f3031323334353637'])
+
+ovs-appctl time/warp 1000
+
+AT_CHECK([ovs-pcap p0.pcap > p0.pcap.txt 2>&1]) AT_CHECK([tail -6 
+p0.pcap.txt], [0], [dnl
+aa55aa55001b213cab640800457079464000402fba630101025c01010258200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab640800457079464000402fba630101025c01010258200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c200
+0080001c84554ba20400184861e011e024227e75400030a
+f31955f2650100101112131415161718191a1b1c1d1e1f202122232
+425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab64080045

Re: [ovs-dev] [PATCH v2] dpif-netdev: Allow cross-NUMA polling on selected ports

2022-03-24 Thread Jan Scheurich via dev
Hi Kevin,

This was a bit of a misunderstanding. We didn't check your RFC patch carefully 
enough to realize that you had meant to encompass our cross-numa-polling 
function in that RFC patch. Sorry for the confusion.

I wouldn't say we are particularly keen on upstreaming exactly our 
implementation of cross-numa-polling for ALB, as long as we get the 
functionality with a per-interface configuration option (preferably as in our 
patch, so that we can maintain backward compatibility with our downstream 
solution).

I suggest we have a closer look at your RFC and come back with comments on that.

> 
> 'roundrobin' and 'cycles', are dependent on RR of cores per numa, so I agree 
> it
> makes sense in those cases to still RR the numas. Otherwise a mix of cross-
> numa enabled and cross-numa disabled interfaces would conflict with each
> other when selecting a core. So even though the user has selected to ignore
> numa for this interface, we don't have a choice but to still RR the numa.
> 
> For 'group' it is about finding the lowest loaded pmd core. In that case we 
> don't
> need to RR numa. We can just find the lowest loaded pmd core from any numa.
> 
> This is better because the user is choosing to remove numa based selection for
> the interface and as the rxq scheduling algorithm is not dependent on it, we
> can fully remove it too by checking pmds from all numas. I have done this in 
> my
> RFC.
> 
> It is also better to do this where possible because there may not be same
> amount of pmd cores on each numa, or one numa could already be more
> heavily loaded than the other.
> 
> Another difference is with above for 'group' I added tiebreaker for a 
> local-to-
> interface numa pmd to be selected if multiple pmds from different numas were
> available with same load. This is most likely to be helpful for initial 
> selection
> when there is no load on any pmds.

At first glance I agree with your reasoning. Choosing the least-loaded PMD from 
all NUMAs using NUMA-locality as a tie-breaker makes sense in the 'group' 
algorithm. How do you select between equally loaded PMDs on the same NUMA node? 
Just pick any?
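
Just so I understand the selection correctly, I picture it roughly like this
(a hypothetical stand-alone sketch with made-up types and names, not your RFC code):

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the real PMD thread struct; only what the sketch needs. */
    struct pmd_info {
        int numa_id;
        uint64_t load;          /* measured or assigned load, in cycles */
    };

    /* Pick the least-loaded PMD from a flat list spanning all NUMA nodes,
     * preferring the port-local NUMA node on a tie. */
    static struct pmd_info *
    select_best_pmd(struct pmd_info *pmds, size_t n_pmds, int port_numa_id)
    {
        struct pmd_info *best = NULL;

        for (size_t i = 0; i < n_pmds; i++) {
            struct pmd_info *pmd = &pmds[i];

            if (!best || pmd->load < best->load) {
                best = pmd;
            } else if (pmd->load == best->load
                       && pmd->numa_id == port_numa_id
                       && best->numa_id != port_numa_id) {
                best = pmd;     /* tie-break: prefer the local NUMA node */
            }
        }
        return best;
    }

Between equally loaded PMDs on the same NUMA node this would simply keep the
first one found, hence my question.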

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-09 Thread Jan Scheurich via dev
> Thanks for sharing your experience with it. My fear with the proposal is that
> someone turns this on and then tells us performance is worse and/or OVS
> assignments/ALB are broken, because it has an impact on their case.
> 
> In terms of limiting possible negative effects,
> - it can be opt-in and recommended only for phy ports
> - could print a warning when it is enabled
> - ALB is currently disabled with cross-numa polling (except a limited
> case) but it's clear you want to remove that restriction too
> - for ALB, a user could increase the improvement threshold to account for any
> reassignments triggered by inaccuracies

[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) 
ports in ALB "group" mode as an opt-in config option (default off). Based on 
our observations we are not too concerned about the loss of ALB prediction 
accuracy, but increasing the threshold may be a way of taking that into account, 
if wanted.
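
For reference, raising the threshold would be something along these lines (the
value is only an example, the option name quoted from memory):

    ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50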

> 
> There is also some improvements that can be made to the proposed method
> when used with group assignment,
> - we can prefer local numa where there is no difference between pmd cores.
> (e.g. two unused cores available, pick the local numa one)
> - we can flatten the list of pmds, so best pmd can be selected. This will 
> remove
> issues with RR numa when there are different num of pmd cores or loads per
> numa.
> - I wrote an RFC that does these two items, I can post when(/if!) consensus is
> reached on the broader topic

[Jan] In our alternative version of the current upstream "group" ALB [1] we 
already maintained a flat list of PMDs. So we would support that feature. Using 
NUMA-locality as a tie-breaker makes sense.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html

> 
> In summary, it's a trade-off,
> 
> With no cross-numa polling (current):
> - won't have any impact to OVS assignment or ALB accuracy
> - there could be a bottleneck on one numa pmds while other numa pmd cores
> are idle and unused
> 
> With cross-numa rx pinning (current):
> - will have access to pmd cores on all numas
> - may require more cycles for some traffic paths
> - won't have any impact to OVS assignment or ALB accuracy
> - >1 pinned rxqs per core may cause a bottleneck depending on traffic
> 
> With cross-numa interface setting (proposed):
> - will have access to all pmd cores on all numas (i.e. no unused pmd cores
> during highest load)
> - will require more cycles for some traffic paths
> - will impact on OVS assignment and ALB accuracy
> 
> Anything missing above, or is it a reasonable summary?

I think that is a reasonable summary, albeit I would have characterized the 
third option a bit more positively:
- Gives ALB maximum freedom to balance load of PMDs on all NUMA nodes (in the 
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for a better utilization of 
free PMD cycles
- Mostly suitable for phy ports due to the limited cycle increase for cross-NUMA 
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios

We will post a new version of our patch [2] for cross-numa polling on selected 
ports adapted to the current OVS master shortly.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html

Thanks, Jan




Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-17 Thread Jan Scheurich via dev
Hi Kevin,

> > We have done extensive benchmarking and found that we get better overall
> PMD load balance and resulting OVS performance when we do not statically
> pin any rx queues and instead let the auto-load-balancing find the optimal
> distribution of phy rx queues over both NUMA nodes to balance an asymmetric
> load of vhu rx queues (polled only on the local NUMA node).
> >
> > Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
> to cross-NUMA access to volatile virtio ring pointers in every iteration (not 
> only
> when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
> have a similar issue.
> >
> 
> I agree that for vhost rxq polling, it always causes a performance penalty 
> when
> there is cross-numa polling.
> 
> For polling phy rxq, when phy and vhost are in different numas, I don't see 
> any
> additional penalty for cross-numa polling the phy rxq.
> 
> For the case where phy and vhost are both in the same numa, if I change to 
> poll
> the phy rxq cross-numa, then I see about a >20% tput drop for traffic from 
> phy -> vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.

> 
> Also, the fact that a different numa can poll the phy rxq after every 
> rebalance
> means that the ability of the auto-load-balancer to estimate and trigger a
> rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.

> 
> It seems like simple pinning some phy rxqs cross-numa would avoid all the
> issues above and give most of the benefit of cross-numa polling for phy rxqs.

That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.

> 
> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
> still assign other rxqs to those cores which have with pinned phy rxqs and
> properly adjust the assignments based on the load from the pinned rxqs.

Yes, sometimes the vhu rxq load is distributed such that it can be used to 
balance the PMDs, but not always. Sometimes the balance is just better when phy 
rxqs are not pinned.
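
For illustration, the combination you describe would look something like this
(port name, queue and core numbers are only examples):

    ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
    ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:7"

The phy rxqs stay pinned to cores 3 and 7, but with pmd-rxq-isolate=false OVS may
still place other rxqs on those cores based on their load.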

> 
> New assignments or auto-load-balance would not change the numa polling
> those rxqs, so it it would have no impact to ALB or ability to assign based on
> load.

In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are 
pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the 
prediction error when doing auto-lb dry-runs cannot be significant.

In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb a free hand (no cross-NUMA polling of vhu rxqs, 
though).

BR, Jan


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> > We do acknowledge the benefit of non-pinned polling of phy rx queues by
> PMD threads on all NUMA nodes. It gives the auto-load balancer much better
> options to utilize spare capacity on PMDs on all NUMA nodes.
> >
> > Our patch proposed in
> > https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> > indeed covers the difference between phy and vhu ports.
> > One has to explicitly enable cross-NUMA-polling for individual interfaces
> with:
> >
> >ovs-vsctl set interface  other_config:cross-numa-polling=true
> >
> > This would typically only be done by static configuration for the fixed set 
> > of
> physical ports. There is no code in the OpenStack's os-vif handler to apply 
> such
> configuration for dynamically created vhu ports.
> >
> > I would strongly suggest that cross-num-polling be introduced as a per-
> interface option as in our patch rather than as a per-datapath option as in 
> your
> patch. Why not adapt our original patch to the latest OVS code base? We can
> help you with that.
> >
> > BR, Jan
> >
> 
> Hi, Jan Scheurich
> 
> We can achieve the static setting of pinning a phy port by combining pmd-rxq-
> isolate and pmd-rxq-affinity.  This setting can get the same result. And we 
> have
> seen the benefits.
> The new issue is the polling of vhu on one numa. Under heavy traffic, polling
> vhu + phy will make the pmds reach 100% usage. While other pmds on the
> other numa with only phy port reaches 70% usage. Enabling cross-numa polling
> for a vhu port would give us more benefits in this case. Overloads of 
> different
> pmds on both numa would be balanced.
> As you have mentioned, there is no code to apply this config for vhu while
> creating them. A global setting would save us from dynamically detecting the
> vhu name or any new creation.

Hi Wan Junjie,

We have done extensive benchmarking and found that we get better overall PMD 
load balance and resulting OVS performance when we do not statically pin any rx 
queues and instead let the auto-load-balancing find the optimal distribution of 
phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx 
queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to 
cross-NUMA access to volatile virtio ring pointers in every iteration (not only 
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't 
have a similar issue.

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> >
> > Btw, this patch is similar in functionality to the one posted by
> > Anurag [0] and there was also some discussion about this approach here [1].
> >
> 
> Thanks for pointing this out.
> IMO, setting interface cross-numa would be good for phy port but not good for
> vhu.  Since vhu can be destroyed and created relatively frequently.
> But yes the main idea is the same.
> 

We do acknowledge the benefit of non-pinned polling of phy rx queues by PMD 
threads on all NUMA nodes. It gives the auto-load balancer much better options 
to utilize spare capacity on PMDs on all NUMA nodes.

Our patch proposed in 
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
indeed covers the difference between phy and vhu ports. One has to explicitly 
enable cross-NUMA-polling for individual interfaces with:

   ovs-vsctl set interface  other_config:cross-numa-polling=true

This would typically only be done by static configuration for the fixed set of 
physical ports. There is no code in the OpenStack's os-vif handler to apply 
such configuration for dynamically created vhu ports.

I would strongly suggest that cross-numa-polling be introduced as a 
per-interface option as in our patch rather than as a per-datapath option as in 
your patch. Why not adapt our original patch to the latest OVS code base? We 
can help you with that.

BR, Jan



Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.

2021-07-14 Thread Jan Scheurich via dev
> > In our patch series we decided to skip the check on cross-numa polling 
> > during
> auto-load balancing. The rationale is as follows:
> >
> > If the estimated PMD-rxq distribution includes cross-NUMA rxq assignments,
> the same must apply for the current distribution, as none of the scheduling
> algorithms would voluntarily assign rxqs across NUMA nodes. So, current and
> estimated rxq assignments are comparable and it makes sense to consider
> rebalancing when the variance improves.
> >
> > Please consider removing this check.
> >
> 
> The first thing is that this patch is not changing any behaviour, just re-
> implementing to reuse the common code, so it would not be the place to
> change this functionality.

Fair enough. We should address this in a separate patch.

> About the proposed change itself, just to be clear what is allowed currently. 
> It
> will allow rebalance when there are local pmds, OR there are no local pmds
> and there is one other NUMA node with pmds available for cross-numa polling.
> 
> The rationale of not doing a rebalance when there are no local pmds but
> multiple other NUMAs available for cross-NUMA polling is that the estimate
> may be incorrect due a different cross-NUMA being choosen for an Rxq than is
> currently used.
> 
> I thought about some things like making an Rxq sticky with a particular cross-
> NUMA etc for this case but that brings a whole new set of problems, e.g. what
> happens if that NUMA gets overloaded, reduced cores, how can it ever be reset
> etc. so I decided not to pursue it as I think it is probably a corner case 
> (at least
> for now).

We currently don't see any scenarios with more than two NUMA nodes, but 
different CPU/server architectures may perhaps have more NUMA nodes than CPU 
sockets. 

> I know the case of no local pmd and one NUMA with pmds is not a corner case
> as I'm aware of users doing that.

Agreed, such configurations are a must to support with auto-lb.

> We can discuss further about the multiple non-local NUMA case and maybe
> there's some improvements we can think of, or maybe I've made some wrong
> assumptions but it would be a follow on from the current patchset.

Our main use case for cross-NUMA balancing comes with the additional freedom to 
allow cross-NUMA polling for selected ports that we introduce with the fourth patch:

dpif-netdev: Allow cross-NUMA polling on selected ports

Today dpif-netdev considers PMD threads on a non-local NUMA node for
automatic assignment of the rxqs of a port only if there are no local,
non-isolated PMDs.

On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU
resources. The alternative, to manually pin the rxqs to PMDs on remote
NUMA nodes, also has drawbacks as it limits OVS' ability to auto
load-balance the rxqs.

This patch introduces a new interface configuration option to allow
ports to be automatically polled by PMDs on any NUMA node:

ovs-vsctl set interface  other_config:cross-numa-polling=true

If this option is not present or set to false, legacy behaviour applies.

We indeed use this for our physical ports to be polled by non-isolated PMDs on 
both NUMAs. The observed capacity improvement is very substantial, so we plan 
to port this feature on top of your patches once they are merged. 

This can only fly if the auto-load balancing is allowed to activate rxq 
assignments with cross-numa polling also in the case there are local 
non-isolated PMDs.

Anyway, we can take this up later in our upcoming patch that introduces this 
option.

BR, Jan


Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.

2021-07-14 Thread Jan Scheurich via dev
> >> +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can
> >> +be
> >> +*non-isolated* by setting::
> >> +
> >> +  $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
> >
> > Is there any specific reason why the new pmd-rxq-isolate option should be
> limited to the "group" scheduling algorithm? In my view it would make more
> sense and simplify documentation and code if these aspects of scheduling were
> kept orthogonal.
> >
> 
> David had a similar comment on an earlier version. I will add the fuller reply
> below. In summary, pinning and the other algorithms (particularly if there was
> multiple pinnings) conflict in how they operate because they are based on RR
> pmd's to add equal number of Rxqs/PMD.
> 
> ---
> Yes, the main issue is that the other algorithms are based on a pmd order on
> the assumption that they start from empty. For 'roundrobin' it is to equally
> distribute num of rxqs on pmd RR - if we pin several rxqs on one pmds, it is 
> not
> clear what to do etc. Even worse for 'cycles' it is based on placing rxqs in 
> order
> of busy cycles on the pmds RR. If we pin an rxq initially we are conflicting
> against how the algorithm operates.
> 
> Because 'group' algorithm is not tied to a defined pmd order, it is better 
> suited
> to be able to place a pinned rxq initially and be able to consider that pmd on
> it's merits along with the others later.
> ---

Makes sense. The legacy round-robin and cycles algorithms are not really 
prepared for pinned rxqs.

In our patch set we had replaced the original round-robin and cycles algorithms 
with a single new algorithm that simply distributed the rxqs to the 
"least-loaded" non-isolated PMD (based on #rxqs or cycles assigned so far). 
With pinning=isolation, this exactly reproduced the round-robin behavior and 
created at least as good balance as the original cycles algorithm. In that new 
algorithm any pinned rxqs were already considered and didn't lead to distorted 
results.
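
Conceptually it was something like the following stand-alone sketch (made-up
names, not the actual code from our patch set):

    #include <stddef.h>
    #include <stdint.h>

    struct pmd_slot {
        uint64_t assigned;      /* #rxqs or cycles assigned so far */
    };

    /* Index of the PMD with the least load assigned so far. */
    static size_t
    least_loaded(const struct pmd_slot *pmds, size_t n_pmds)
    {
        size_t best = 0;

        for (size_t i = 1; i < n_pmds; i++) {
            if (pmds[i].assigned < pmds[best].assigned) {
                best = i;
            }
        }
        return best;
    }

    /* Distribute rxq loads one by one to the least-loaded PMD. Pinned rxqs
     * are pre-charged to their PMD before this loop, so they are counted
     * like any other rxq. */
    static void
    assign_all(struct pmd_slot *pmds, size_t n_pmds,
               const uint64_t *rxq_loads, size_t n_rxqs)
    {
        for (size_t i = 0; i < n_rxqs; i++) {
            pmds[least_loaded(pmds, n_pmds)].assigned += rxq_loads[i];
        }
    }

Counting each rxq as one unit of load reproduces the round-robin distribution,
while using measured cycles gives the cycle-based balance.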

BR, Jan


Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.

2021-07-13 Thread Jan Scheurich via dev
> -Original Message-
> From: dev  On Behalf Of Kevin Traynor
> Sent: Thursday, 8 July, 2021 15:54
> To: d...@openvswitch.org
> Cc: david.march...@redhat.com
> Subject: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance
> use common rxq scheduling.
> 
> PMD auto load balance had its own separate implementation of the rxq
> scheduling that it used for dry runs. This was done because previously the rxq
> scheduling was not made reusable for a dry run.
> 
> Apart from the code duplication (which is a good enough reason to replace it
> alone) this meant that if any further rxq scheduling changes or assignment
> types were added they would also have to be duplicated in the auto load
> balance code too.
> 
> This patch replaces the current PMD auto load balance rxq scheduling code to
> reuse the common rxq scheduling code.
> 
> The behaviour does not change from a user perspective, except the logs are
> updated to be more consistent.
> 
> As the dry run will compare the pmd load variances for current and estimated
> assignments, new functions are added to populate the current assignments and
> use the rxq scheduling data structs for variance calculations.
> 
> Now that the new rxq scheduling data structures are being used in PMD auto
> load balance, the older rr_* data structs and associated functions can be
> removed.
> 
> Signed-off-by: Kevin Traynor 
> ---
>  lib/dpif-netdev.c | 508 +++---
>  1 file changed, 161 insertions(+), 347 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index beafa00a0..338ffd971
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4903,138 +4903,4 @@ port_reconfigure(struct dp_netdev_port *port)  }
> 
> -struct rr_numa_list {
> -struct hmap numas;  /* Contains 'struct rr_numa' */
> -};
> -
> -struct rr_numa {
> -struct hmap_node node;
> -
> -int numa_id;
> -
> -/* Non isolated pmds on numa node 'numa_id' */
> -struct dp_netdev_pmd_thread **pmds;
> -int n_pmds;
> -
> -int cur_index;
> -bool idx_inc;
> -};
> -
> -static size_t
> -rr_numa_list_count(struct rr_numa_list *rr) -{
> -return hmap_count(&rr->numas);
> -}
> -
> -static struct rr_numa *
> -rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id) -{
> -struct rr_numa *numa;
> -
> -HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), &rr->numas) {
> -if (numa->numa_id == numa_id) {
> -return numa;
> -}
> -}
> -
> -return NULL;
> -}
> -
> -/* Returns the next node in numa list following 'numa' in round-robin 
> fashion.
> - * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
> - * Returns NULL if 'rr' numa list is empty. */ -static struct rr_numa * -
> rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa) -{
> -struct hmap_node *node = NULL;
> -
> -if (numa) {
> -node = hmap_next(&rr->numas, &numa->node);
> -}
> -if (!node) {
> -node = hmap_first(&rr->numas);
> -}
> -
> -return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
> -}
> -
> -static void
> -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr) -{
> -struct dp_netdev_pmd_thread *pmd;
> -struct rr_numa *numa;
> -
> -hmap_init(&rr->numas);
> -
> -CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> -if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> -continue;
> -}
> -
> -numa = rr_numa_list_lookup(rr, pmd->numa_id);
> -if (!numa) {
> -numa = xzalloc(sizeof *numa);
> -numa->numa_id = pmd->numa_id;
> -hmap_insert(&rr->numas, &numa->node, hash_int(pmd->numa_id, 0));
> -}
> -numa->n_pmds++;
> -numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds);
> -numa->pmds[numa->n_pmds - 1] = pmd;
> -/* At least one pmd so initialise curr_idx and idx_inc. */
> -numa->cur_index = 0;
> -numa->idx_inc = true;
> -}
> -}
> -
> -/*
> - * Returns the next pmd from the numa node.
> - *
> - * If 'updown' is 'true' it will alternate between selecting the next pmd in
> - * either an up or down walk, switching between up/down when the first or
> last
> - * core is reached. e.g. 1,2,3,3,2,1,1,2...
> - *
> - * If 'updown' is 'false' it will select the next pmd wrapping around when 
> last
> - * core reached. e.g. 1,2,3,1,2,3,1,2...
> - */
> -static struct dp_netdev_pmd_thread *
> -rr_numa_get_pmd(struct rr_numa *numa, bool updown) -{
> -int numa_idx = numa->cur_index;
> -
> -if (numa->idx_inc == true) {
> -/* Incrementing through list of pmds. */
> -if (numa->cur_index == numa->n_pmds-1) {
> -/* Reached the last pmd. */
> -if (updown) {
> -numa->idx_inc = false;
> -} else {
> -numa->cur_index = 0;
> -}
> -} else {
> -numa->cur_index++;
> -}
> -} 

Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.

2021-07-13 Thread Jan Scheurich via dev


> -Original Message-
> From: dev  On Behalf Of Kevin Traynor
> Sent: Thursday, 8 July, 2021 15:54
> To: d...@openvswitch.org
> Cc: david.march...@redhat.com
> Subject: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate
> PMD.
> 
> Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons
> such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from 
> a
> port are handled on different PMDs.
> 
> Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be
> assigned to it by OVS. There may be cases where there is unused cycles on
> those pmds and the user would like other rxqs to also be able to be assigned 
> to
> it by OVS.
> 
> Add an option to pin the rxq and non-isolate the PMD. The default behaviour is
> unchanged, which is pin and isolate the PMD.
> 
> In order to pin and non-isolate:
> ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
> 
> Note this is available only with group assignment type, as pinning conflicts 
> with
> the operation of the other rxq assignment algorithms.
> 
> Signed-off-by: Kevin Traynor 
> ---
>  Documentation/topics/dpdk/pmd.rst |   9 ++-
>  NEWS  |   3 +
>  lib/dpif-netdev.c |  34 --
>  tests/pmd.at  | 105 ++
>  vswitchd/vswitch.xml  |  19 ++
>  5 files changed, 162 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst
> b/Documentation/topics/dpdk/pmd.rst
> index 29ba53954..30040d703 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -102,6 +102,11 @@ like so:
>  - Queue #3 pinned to core 8
> 
> -PMD threads on cores where Rx queues are *pinned* will become *isolated*.
> This -means that this thread will only poll the *pinned* Rx queues.
> +PMD threads on cores where Rx queues are *pinned* will become
> +*isolated* by default. This means that this thread will only poll the 
> *pinned*
> Rx queues.
> +
> +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can be
> +*non-isolated* by setting::
> +
> +  $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

Is there any specific reason why the new pmd-rxq-isolate option should be 
limited to the "group" scheduling algorithm? In my view it would make more 
sense and simplify documentation and code if these aspects of scheduling were 
kept orthogonal.

Regards, Jan


Re: [ovs-dev] [PATCH] tests: Fixed L3 over patch port tests

2021-06-09 Thread Jan Scheurich via dev
LGTM.
Acked-by: Jan Scheurich 

> -Original Message-
> From: Martin Varghese 
> Sent: Wednesday, 9 June, 2021 15:36
> To: d...@openvswitch.org; i.maxim...@ovn.org; Jan Scheurich
> 
> Cc: Martin Varghese 
> Subject: [PATCH] tests: Fixed L3 over patch port tests
> 
> From: Martin Varghese 
> 
> Normal action is replaced with output to GRE port for sending
> l3 packets over GRE tunnel. Normal action cannot be used with
> l3 packets.
> 
> Fixes: d03d0cf2b71b ("tests: Extend PTAP unit tests with decap action")
> Signed-off-by: Martin Varghese 
> ---
>  tests/packet-type-aware.at | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at index
> 540cf98f3..73aa14cea 100644
> --- a/tests/packet-type-aware.at
> +++ b/tests/packet-type-aware.at
> @@ -697,7 +697,7 @@ AT_CHECK([
>  ovs-ofctl del-flows br1 &&
>  ovs-ofctl del-flows br2 &&
>  ovs-ofctl add-flow br0 in_port=n0,actions=decap,output=p0 -OOpenFlow13
> &&
> -ovs-ofctl add-flow br1 in_port=p1,actions=NORMAL &&
> +ovs-ofctl add-flow br1 in_port=p1,actions=output=gre1 &&
>  ovs-ofctl add-flow br2 in_port=LOCAL,actions=output=n2  ], [0])
> 
> @@ -708,7 +708,7 @@ AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br0 |
> ofctl_strip | grep actions],
> 
>  AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br1 | ofctl_strip | grep
> actions],  [0], [dnl
> - reset_counts in_port=20 actions=NORMAL
> + reset_counts in_port=20 actions=output:100
>  ])
> 
>  AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br2 | ofctl_strip | grep
> actions], @@ -726,7 +726,7 @@ ovs-appctl time/warp 1000  AT_CHECK([
>  ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used |
> grep -v ipv6 | sort  ], [0], [flow-dump from the main thread:
> -
> recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:09:9c:ab,dst=1
> e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag=no), packets:1,
> bytes:98, used:0.0s,
> actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst
> =de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=10.0.0.1,dst
> =10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out
> _port(br2)),n2)
> +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(t
> +os=0/0x3,frag=no), packets:1, bytes:98, used:0.0s,
> +actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,
> +eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(sr
> +c=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0
> +x0,proto=0x800))),out_port(br2)),n2)
>  ])
> 
>  AT_CHECK([
> --
> 2.18.4



Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action with NSH Decap

2021-06-07 Thread Jan Scheurich via dev
> -Original Message-
> From: Martin Varghese 
> Sent: Monday, 7 June, 2021 16:47
> To: Ilya Maximets 
> Cc: d...@openvswitch.org; echau...@redhat.com; Jan Scheurich
> ; Martin Varghese
> 
> Subject: Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action
> with NSH Decap
> 
> On Wed, May 19, 2021 at 12:26:40PM +0200, Ilya Maximets wrote:
> > On 5/19/21 5:26 AM, Martin Varghese wrote:
> > > On Tue, May 18, 2021 at 10:03:39PM +0200, Ilya Maximets wrote:
> > >> On 5/17/21 3:45 PM, Martin Varghese wrote:
> > >>> From: Martin Varghese 
> > >>>
> > >>> When a decap action is applied on an NSH header encapsulating an
> > >>> ethernet packet, a redundant set mac address action is programmed
> > >>> to the datapath.
> > >>>
> > >>> Fixes: f839892a206a ("OF support and translation of generic encap
> > >>> and decap")
> > >>> Signed-off-by: Martin Varghese 
> > >>> Acked-by: Jan Scheurich 
> > >>> Acked-by: Eelco Chaudron 
> > >>> ---
> > >>> Changes in v2:
> > >>>   - Fixed code styling
> > >>>   - Added Ack from jan.scheur...@ericsson.com
> > >>>   - Added Ack from echau...@redhat.com
> > >>>
> > >>
> > >> Hi, Martin.
> > >> For some reason this patch triggers frequent failures of the
> > >> following unit test:
> > >>
> > >> 2314. packet-type-aware.at:619: testing ptap - L3 over patch port
> > >> ...
> 
> The test is failing because, during revalidation, the NORMAL action is dropping 
> packets.
> With these changes, the mac addresses in the flow structure get cleared by the decap
> action. Hence the NORMAL action drops the packet assuming a loop (SRC and
> DST mac address are zero). I assume NORMAL action handling in
> xlate_push_stats_entry is not adapted for l3 packets. The timing at which
> the revalidator gets triggered explains the sporadic nature of the issue. The issue
> was never seen before because the MAC addresses in the flow structure were not
> cleared by decap.
> 
> So can we use NORMAL action with a L3 packet ?  Does OVS handle all the L3
> use cases with Normal action ? If not, shouldn't we not use NORMAL action in
> this test case
> 
> Comments?
> 

Good catch! Normal flow L2 bridging is of course nonsense for the use case of 
forwarding an L3 packet. I am surprised that the packet was forwarded at all in 
the first place. That in itself can be considered a bug. Correctly, a Normal 
flow should drop non-Ethernet packets, I would say.

To fix the test case I suggest replacing the Normal action in br1 with 
"output:gre1" in line 700.

> 
> > >> stdout:
> > >> warped
> > >> ./packet-type-aware.at:726:
> > >> ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy |
> > >> strip_used | grep -v ipv6 | sort
> > >>
> > >> --- -   2021-05-18 21:57:56.810513366 +0200
> > >> +++ /home/i.maximets/work/git/ovs/tests/testsuite.dir/at-
> groups/2314/stdout 2021-05-18 21:57:56.806609814 +0200
> > >> @@ -1,3 +1,3 @@
> > >>  flow-dump from the main thread:
> > >> -recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0
> > >> 9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag
> > >> =no), packets:1, bytes:98, used:0.0s,
> > >> actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,typ
> > >> e=3,eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800)
> > >> ,ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),
> > >> gre((flags=0x0,proto=0x800))),out_port(br2)),n2)
> > >> +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0
> > >> +9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,fra
> > >> +g=no), packets:1, bytes:98, used:0.0s, actions:drop
> > >>
> > >>
> > >> It fails very frequently in GitHub Actions, but it's harder to make
> > >> it fail on my local machine.  Following change to the test allows
> > >> to reproduce the failure almost always on my local machine:
> > >>
> > >> diff --git a/tests/packet-type-aware.at
> > >> b/tests/packet-type-aware.at index 540cf98f3..01dbc8030 100644
> > >> --- a/tests/packet-type-aware.at
> > >> +++ b/tests/packet-type-aware.at
> > >> @@ -721,7 +721,7 @@ AT_CHECK([
> > >>  ovs-appctl netdev-dummy/receive n0
> > >>
> 1e2ce92a669e3a6

Re: [ovs-dev] [PATCH] Fix redundant datapath set ethernet action with NSH Decap

2021-05-07 Thread Jan Scheurich via dev
Hi Martin,

Somehow I didn’t receive this patch email via the ovs-dev mailing list, perhaps 
one of the many spam filters on the way interfered. Don't know if this response 
email will be recognized by ovs patchwork.

The nsh.at lines are wrapped incorrectly in this email, but they look OK in 
patchwork.
Otherwise, LGTM.

Acked-by: Jan Scheurich 

/Jan

> -Original Message-
> From: Varghese, Martin (Nokia - IN/Bangalore)
> 
> Sent: Monday, 3 May, 2021 15:25
> To: Jan Scheurich 
> Subject: RE: [PATCH] Fix redundant datapath set ethernet action with NSH
> Decap
> 
> Hi Jan,
> 
> Could you please review this patch.
> 
> Regards,
> Martin
> 
> -Original Message-
> From: Martin Varghese 
> Sent: Tuesday, April 27, 2021 6:13 PM
> To: d...@openvswitch.org; echau...@redhat.com;
> jan.scheur...@ericsson.com
> Cc: Varghese, Martin (Nokia - IN/Bangalore) 
> Subject: [PATCH] Fix redundant datapath set ethernet action with NSH Decap
> 
> From: Martin Varghese 
> 
> When a decap action is applied on an NSH header encapsulating an ethernet
> packet, a redundant set mac address action is programmed to the datapath.
> 
> Fixes: f839892a206a ("OF support and translation of generic encap and
> decap")
> Signed-off-by: Martin Varghese 
> ---
>  lib/odp-util.c   | 3 ++-
>  ofproto/ofproto-dpif-xlate.c | 2 ++
>  tests/nsh.at | 8 
>  3 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/odp-util.c b/lib/odp-util.c index e1199d1da..9d558082f 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -7830,7 +7830,8 @@ commit_set_ether_action(const struct flow *flow,
> struct flow *base_flow,
>  struct offsetof_sizeof ovs_key_ethernet_offsetof_sizeof_arr[] =
>  OVS_KEY_ETHERNET_OFFSETOF_SIZEOF_ARR;
> 
> -if (flow->packet_type != htonl(PT_ETH)) {
> +if ((flow->packet_type != htonl(PT_ETH)) ||
> +(base_flow->packet_type != htonl(PT_ETH))) {
>  return;
>  }
> 
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index
> 7108c8a30..a6f4ea334 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -6549,6 +6549,8 @@ xlate_generic_decap_action(struct xlate_ctx *ctx,
>   * Delay generating pop_eth to the next commit. */
>  flow->packet_type = htonl(PACKET_TYPE(OFPHTN_ETHERTYPE,
>ntohs(flow->dl_type)));
> +flow->dl_src = eth_addr_zero;
> +flow->dl_dst = eth_addr_zero;
>  ctx->wc->masks.dl_type = OVS_BE16_MAX;
>  }
>  return false;
> diff --git a/tests/nsh.at b/tests/nsh.at index d5c772ff0..e84134e42 100644
> --- a/tests/nsh.at
> +++ b/tests/nsh.at
> @@ -105,7 +105,7 @@ bridge("br0")
> 
>  Final flow:
> in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:6
> 6,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=1,nsh_np=3,nsh_spi=0
> x1234,nsh_si=255,nsh_c1=0x11223344,nsh_c2=0x0,nsh_c3=0x0,nsh_c4=0x0,
> nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
>  Megaflow:
> recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
> -Datapath actions:
> push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c
> 2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:6
> 6),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x1)
> +Datapath actions:
> +push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,
> c
> +2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:
> +66),pop_eth,pop_nsh(),recirc(0x1)
>  ])
> 
>  AT_CHECK([
> @@ -139,7 +139,7 @@ ovs-appctl time/warp 1000  AT_CHECK([
>  ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v
> ipv6 | sort  ], [0], [flow-dump from the main thread:
> -
> recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth
> _type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s,
> actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x112
> 23344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:
> 44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x3)
> +recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9
> +e),eth_type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s,
> +actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11
> +223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:
> 3
> +3:44:55:66),pop_eth,pop_nsh(),recirc(0x3)
> 
> recirc_id(0x3),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=
> no), p

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-17 Thread Jan Scheurich via dev


> -Original Message-
> From: Martin Varghese 
> Sent: Tuesday, 13 April, 2021 16:20
> To: Jan Scheurich 
> Cc: Eelco Chaudron ; d...@openvswitch.org;
> pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> On Wed, Apr 07, 2021 at 03:49:07PM +, Jan Scheurich wrote:
> > Hi Martin,
> >
> > I guess you are aware of the original design document we wrote for generic
> encap/decap and NSH support:
> > https://protect2.fireeye.com/v1/url?k=993ba795-c6a09d8c-993be70e-8682a
> > aa22bc0-3c9b4464027ca7bf=1=f89fc25e-8dc0-45bf-bdd9-
> 2d0ca03b5686=
> >
> https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1oWMYUH8sjZJzWa72
> o2q9kU
> > 0N6pNE-rwZcLH3-kbbDR8%2Fedit%23
> >
> > It is no longer 100% aligned with the final implementation in OVS but still 
> > a
> good reference for understanding the design principles behind the
> implementation and some specifics for Ethernet and NSH encap/decap use
> cases.
> >
> > Please find some more answers/comments below.
> >
> > BR, Jan
> >
> > > -Original Message-
> > > From: Martin Varghese 
> > > Sent: Wednesday, 7 April, 2021 10:43
> > > To: Jan Scheurich 
> > > Cc: Eelco Chaudron ; d...@openvswitch.org;
> > > pshe...@ovn.org; martin.vargh...@nokia.com
> > > Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> > >
> > > On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote:
> > > > Hi,
> > > >
> > > > Thanks for the heads up. The interaction with MPLS push/pop is a
> > > > use case
> > > that was likely not tested during the NSH and generic encap/decap
> > > design. It's complex code and a long time ago. I'm willing to help,
> > > but I will need some time to go back and have a look.
> > > >
> > > > It would definitely help, if you could provide a minimal example
> > > > for
> > > reproducing the problem.
> > > >
> > >
> > > Hi Jan ,
> > >
> > > Thanks for your help.
> > >
> > > I was trying to implement ENCAP/DECAP support for MPLS.
> > >
> > > The programming of datapath flow for the below  userspace rule fails
> > > as there is set(eth() action between pop_mpls and recirc ovs-ofctl
> > > -O OpenFlow13 add- flow br_mpls2
> > > "in_port=$egress_port,dl_type=0x8847
> > > actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1
> > >
> > > 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs-
> > > system: failed to put[create] (Invalid argument)
> > > ufid:1dddb0ba-27fe-44ea- 9a99-5815764b4b9c
> > > recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0)
> > > ,ct_state
> > > (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00:
> > > 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8
> > > 847) ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1),
> > > actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45)
> > >
> >
> > Conceptually, what should happen in this scenario is that, after the second
> decap(packet_type(ns=0,type=0) action, OVS processes the unchanged inner
> packet as packet type PT_ETH, i.e. as L2 Ethernet frame. Overwriting the
> existing Ethernet header with zero values  through set(eth()) is clearly
> incorrect. That is a logical error inside the ofproto-dpif-xlate module (see
> below).
> >
> > I believe the netdev userspace datapath would still have accepted the
> incorrect datapath flow. I have too little experience with the kernel 
> datapath to
> explain why that rejects the datapath flow as invalid.
> >
> > Unlike in the Ethernet and NSH cases, the MPLS header does not contain any
> indication about the inner packet type. That is why the packet_type must be
> provided by the SDN controller as part of the decap() action.  And the 
> ofproto-
> dpif-xlate module must consider the specified inner packet type when
> continuing the translation. In the general case, a decap() action should 
> trigger
> recirculation for reparsing of the inner packet, so the new packet type must 
> be
> set before recirculation. (Exceptions to the general recirculation rule are 
> those
> where OVS has already parsed further into the packet and ofproto can modify
> the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all but the
> last bottom of stack label).
> >
> > I have had a look at your new code for encap/decap o

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-07 Thread Jan Scheurich via dev
Hi Martin,

I guess you are aware of the original design document we wrote for generic 
encap/decap and NSH support:
https://docs.google.com/document/d/1oWMYUH8sjZJzWa72o2q9kU0N6pNE-rwZcLH3-kbbDR8/edit#

It is no longer 100% aligned with the final implementation in OVS but still a 
good reference for understanding the design principles behind the 
implementation and some specifics for Ethernet and NSH encap/decap use cases.

Please find some more answers/comments below.

BR, Jan 

> -Original Message-
> From: Martin Varghese 
> Sent: Wednesday, 7 April, 2021 10:43
> To: Jan Scheurich 
> Cc: Eelco Chaudron ; d...@openvswitch.org;
> pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote:
> > Hi,
> >
> > Thanks for the heads up. The interaction with MPLS push/pop is a use case
> that was likely not tested during the NSH and generic encap/decap design. It's
> complex code and a long time ago. I'm willing to help, but I will need some
> time to go back and have a look.
> >
> > It would definitely help, if you could provide a minimal example for
> reproducing the problem.
> >
> 
> Hi Jan ,
> 
> Thanks for your help.
> 
> I was trying to implement ENCAP/DECAP support for MPLS.
> 
> The programming of datapath flow for the below  userspace rule fails as there
> is set(eth() action between pop_mpls and recirc ovs-ofctl -O OpenFlow13 add-
> flow br_mpls2 "in_port=$egress_port,dl_type=0x8847
> actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1
> 
> 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs-
> system: failed to put[create] (Invalid argument) ufid:1dddb0ba-27fe-44ea-
> 9a99-5815764b4b9c
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0),ct_state
> (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00:
> 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8847)
> ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1),
> actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45)
> 

Conceptually, what should happen in this scenario is that, after the second
decap(packet_type(ns=0,type=0)) action, OVS processes the unchanged inner packet
as packet type PT_ETH, i.e. as an L2 Ethernet frame. Overwriting the existing
Ethernet header with zero values through set(eth()) is clearly incorrect. That
is a logical error inside the ofproto-dpif-xlate module (see below).

I believe the netdev userspace datapath would still have accepted the incorrect 
datapath flow. I have too little experience with the kernel datapath to explain 
why that rejects the datapath flow as invalid.

Unlike in the Ethernet and NSH cases, the MPLS header does not contain any 
indication about the inner packet type. That is why the packet_type must be 
provided by the SDN controller as part of the decap() action.  And the 
ofproto-dpif-xlate module must consider the specified inner packet type when 
continuing the translation. In the general case, a decap() action should 
trigger recirculation for reparsing of the inner packet, so the new packet type 
must be set before recirculation. (Exceptions to the general recirculation rule 
are those where OVS has already parsed further into the packet and ofproto can 
modify the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all 
but the last bottom of stack label).

I have had a look at your new code for encap/decap of MPLS headers, but I must
admit I cannot fully judge to what extent the existing translation functions
for MPLS label stacks, written for the legacy push/pop_mpls case (i.e.
manipulating a label stack between the L2 and the L3 headers of a PT_ETH
packet), can be re-used in the new context.

BTW: Do you support multiple MPLS label encap or decap actions with your patch? 
Have you tested that? 

I am uncertain about the handling of the ethertype of the decapsulated inner
packet. In the design base, the ethertype that is set in the existing L2 header
of the packet after pop_mpls of the last label comes from the pop_mpls
action, while in the decap(packet_type(0,0)) case the entire inner packet
should be recirculated as-is with packet_type PT_ETH.

case PT_MPLS: {
    int n;
    ovs_be16 ethertype;

    flow->packet_type = decap->new_pkt_type;
    ethertype = pt_ns_type_be(flow->packet_type);

    n = flow_count_mpls_labels(flow, ctx->wc);
    flow_pop_mpls(flow, n, ethertype, ctx->wc);
    if (!ctx->xbridge->support.add_mpls) {
        ctx->xout->slow |= SLOW_ACTION;
    }
    ctx->pending_decap = true;
    return true;

In the example scenario the new_pkt_type is P

Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.

2021-04-06 Thread Jan Scheurich via dev
Hi,

Thanks for the heads up. The interaction with MPLS push/pop is a use case that 
was likely not tested during the NSH and generic encap/decap design. It's 
complex code and a long time ago. I'm willing to help, but I will need some 
time to go back and have a look.

It would definitely help, if you could provide a minimal example for 
reproducing the problem.

BR, Jan

> -Original Message-
> From: Eelco Chaudron 
> Sent: Tuesday, 6 April, 2021 10:55
> To: Martin Varghese ; Jan Scheurich
> 
> Cc: d...@openvswitch.org; pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> 
> 
> 
> On 6 Apr 2021, at 10:27, Martin Varghese wrote:
> 
> > On Thu, Apr 01, 2021 at 11:32:06AM +0200, Eelco Chaudron wrote:
> >>
> >>
> >> On 1 Apr 2021, at 11:28, Martin Varghese wrote:
> >>
> >>> On Thu, Apr 01, 2021 at 11:17:14AM +0200, Eelco Chaudron wrote:
> >>>>
> >>>>
> >>>> On 1 Apr 2021, at 11:09, Martin Varghese wrote:
> >>>>
> >>>>> On Thu, Apr 01, 2021 at 10:54:42AM +0200, Eelco Chaudron wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 1 Apr 2021, at 10:35, Martin Varghese wrote:
> >>>>>>
> >>>>>>> On Thu, Apr 01, 2021 at 08:59:27AM +0200, Eelco Chaudron wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 1 Apr 2021, at 6:10, Martin Varghese wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Mar 31, 2021 at 03:59:40PM +0200, Eelco Chaudron
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 26 Mar 2021, at 7:21, Martin Varghese wrote:
> >>>>>>>>>>
> >>>>>>>>>>> From: Martin Varghese 
> >>>>>>>>>>>
> >>>>>>>>>>> The encap & decap actions are extended to support MPLS
> >>>>>>>>>>> packet type.
> >>>>>>>>>>> Encap & decap actions adds and removes MPLS header at start
> >>>>>>>>>>> of the packet.
> >>>>>>>>>>
> >>>>>>>>>> Hi Martin,
> >>>>>>>>>>
> >>>>>>>>>> I’m trying to do some real-life testing, and I’m running into
> >>>>>>>>>> issues. This might be me setting it up wrongly but just
> >>>>>>>>>> wanting to confirm…
> >>>>>>>>>>
> >>>>>>>>>> I’m sending an MPLS packet that contains an ARP packet into a
> >>>>>>>>>> physical port.
> >>>>>>>>>> This is the packet:
> >>>>>>>>>>
> >>>>>>>>>> Frame 4: 64 bytes on wire (512 bits), 64 bytes captured (512
> >>>>>>>>>> bits)
> >>>>>>>>>> Encapsulation type: Ethernet (1)
> >>>>>>>>>> [Protocols in frame: eth:ethertype:mpls:data] Ethernet
> >>>>>>>>>> II, Src: 00:00:00_00:00:01 (00:00:00:00:00:01), Dst:
> >>>>>>>>>> 00:00:00_00:00:02 (00:00:00:00:00:02)
> >>>>>>>>>> Destination: 00:00:00_00:00:02 (00:00:00:00:00:02)
> >>>>>>>>>> Address: 00:00:00_00:00:02 (00:00:00:00:00:02)
> >>>>>>>>>>  ..0.     = LG bit: Globally
> >>>>>>>>>> unique address (factory default)
> >>>>>>>>>>  ...0     = IG bit:
> >>>>>>>>>> Individual address
> >>>>>>>>>> (unicast)
> >>>>>>>>>> Source: 00:00:00_00:00:01 (00:00:00:00:00:01)
> >>>>>>>>>> Address: 00:00:00_00:00:01 (00:00:00:00:00:01)
> >>>>>>>>>>  ..0.     = LG bit: Globally
> >>>>>>>>>> unique address (factory default)
> >>>>>>>>>>  ...0     = IG bit:
> >>>>>>>>>> Individual address
> >>>&

Re: [ovs-dev] [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
LGTM. Please back-port to stable branches.

Acked-by: Jan Scheurich 

/Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Wednesday, 14 October, 2020 18:14
> To: ovs-dev@openvswitch.org; Jan Scheurich 
> Cc: Ben Pfaff ; Ilya Maximets 
> Subject: [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH
> encap actions.
> 
> OVS uses memcmp to compare actions of existing and new flows, but 'struct
> ofp_ed_prop_nsh_md_type' and corresponding ofpact structure has
> 3 bytes of padding that never initialized and passed around within OF data
> structures and messages.
> 
>   Uninitialized bytes in MemcmpInterceptorCommon
> at offset 21 inside [0x709003f8, 136)
>   WARNING: MemorySanitizer: use-of-uninitialized-value
> #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e)
> #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31
> #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37
> #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13
> #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17
> #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d)
> 
>   Uninitialized value was stored to memory at
> #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd)
> #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5
> #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11
> #3 0x55d419 in add_flow_init ofproto/ofproto.c:5123:17
> #4 0x54841f in ofproto_flow_mod_init ofproto/ofproto.c:7987:17
> #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> 
>   Uninitialized value was created by an allocation of 'ofpacts_stub'
>   in the stack frame of function 'handle_flow_mod'
> #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170
> 
> This could cause issues with flow modifications or other operations.
> 
> To reproduce, some NSH tests could be run under valgrind or clang
> MemorySantizer. Ex. "nsh - md1 encap over a veth link" test.
> 
> Fix that by clearing padding bytes while encoding and decoding.
> OVS will still accept OF messages with non-zero padding from controllers.
> 
> New tests added to tests/ofp-actions.at.
> 
> Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/ofp-ed-props.c   |  3 ++-
>  tests/ofp-actions.at | 11 +++
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> 28382e012..02a9235d5 100644
> --- a/lib/ofp-ed-props.c
> +++ b/lib/ofp-ed-props.c
> @@ -49,7 +49,7 @@ decode_ed_prop(const struct ofp_ed_prop_header
> **ofp_prop,
>  return OFPERR_NXBAC_BAD_ED_PROP;
>  }
>  struct ofpact_ed_prop_nsh_md_type *pnmt =
> -ofpbuf_put_uninit(out, sizeof(*pnmt));
> +ofpbuf_put_zeros(out, sizeof *pnmt);
>  pnmt->header.prop_class = prop_class;
>  pnmt->header.type = prop_type;
>  pnmt->header.len = len;
> @@ -108,6 +108,7 @@ encode_ed_prop(const struct ofpact_ed_prop
> **prop,
>  opnmt->header.len =
>  offsetof(struct ofp_ed_prop_nsh_md_type, pad);
>  opnmt->md_type = pnmt->md_type;
> +memset(opnmt->pad, 0, sizeof opnmt->pad);
>  prop_len = sizeof(*pnmt);
>  break;
>   

Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
> >> Fix that by clearing padding bytes while encoding, and checking that
> >> these bytes are all zeros on decoding.
> >
> > Is the latter strictly necessary? It may break existing controllers that do 
> > not
> initialize the padding bytes to zero.
> > Wouldn't it be sufficient to just zero the padding bytes at reception?
> 
> I do not have a strong opinion.  I guess, we could not fail OF request if
> padding is not all zeroes for backward compatibility.
> Anyway, it seems like I missed one part of this change (see inline).
> 
> On the other hand, AFAIU, NXOXM_NSH_ is not standardized, so, technically,
> we could change the rules here.  As an option, we could apply the patch
> without checking for all-zeroes padding and backport it this way to stable
> branches.  Afterwards, we could introduce the 'is_all_zeros' check and
> mention this change in release notes for the new version.  Anyway OpenFlow
> usually requires paddings to be all-zeroes for most of matches and actions, so
> this should be a sane requirement for controllers.
> What do you think?
> 

I think there is little to gain by enforcing strict rules on zeroed padding
bytes in a future release. It just creates grief for users of OVS by
unnecessarily breaking backward compatibility without any benefit for OVS, no
matter whether OVS has the right to do so or not.

> >> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> >> 28382e012..5a4b12d9f 100644
> >> --- a/lib/ofp-ed-props.c
> >> +++ b/lib/ofp-ed-props.c
> >> @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header
> >> **ofp_prop,
> >>  if (len > sizeof(*opnmt) || len > *remaining) {
> >>  return OFPERR_NXBAC_BAD_ED_PROP;
> >>  }
> >> +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) {
> >> +return OFPERR_NXBRC_MUST_BE_ZERO;
> >> +}
> >>  struct ofpact_ed_prop_nsh_md_type *pnmt =
> >>  ofpbuf_put_uninit(out, sizeof(*pnmt));
> 
> This should be 'ofpbuf_put_zeroes' because 'struct
> ofpact_ed_prop_nsh_md_type'
> contains padding too that must be cleared while constructing ofpacts.
> Since OVS compares decoded ofpacts' and not the original OF messages, this
> should do the trick.

Agree.

> 
> I'll send v2 with this change and will remove 'is_all_zeros' check for this 
> fix.

Thanks, Jan



Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.

2020-10-14 Thread Jan Scheurich via dev
Hi Ilya,

Good catch. One comment below.

/Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, 13 October, 2020 21:02
> To: ovs-dev@openvswitch.org; Jan Scheurich 
> Cc: Ben Pfaff ; Yi Yang ; Ilya Maximets
> 
> Subject: [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap
> actions.
> 
> OVS uses memcmp to compare actions of existing and new flows, but 'struct
> ofp_ed_prop_nsh_md_type' has 3 bytes of padding that never initialized and
> passed around within OF data structures and messages.
> 
>   Uninitialized bytes in MemcmpInterceptorCommon
> at offset 21 inside [0x709003f8, 136)
>   WARNING: MemorySanitizer: use-of-uninitialized-value
> #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e)
> #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31
> #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37
> #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13
> #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17
> #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d)
> 
>   Uninitialized value was stored to memory at
> #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd)
> #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5
> #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11
> #3 0x55d419 in add_flow_init ofproto/ofproto.c:5123:17
> #4 0x54841f in ofproto_flow_mod_init ofproto/ofproto.c:7987:17
> #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13
> #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17
> #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16
> #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21
> #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13
> #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9
> #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5
> #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9
> #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5
> #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9
> #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6)
> 
>   Uninitialized value was created by an allocation of 'ofpacts_stub'
>   in the stack frame of function 'handle_flow_mod'
> #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170
> 
> This could cause issues with flow modifications or other operations.
> 
> To reproduce, some NSH tests could be run under valgrind or clang
> MemorySantizer. Ex. "nsh - md1 encap over a veth link" test.
> 
> Fix that by clearing padding bytes while encoding, and checking that these
> bytes are all zeros on decoding.

Is the latter strictly necessary? It may break existing controllers that do not 
initialize the padding bytes to zero.
Wouldn't it be sufficient to just zero the padding bytes at reception?

> 
> New tests added to tests/ofp-actions.at.
> 
> Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/ofp-ed-props.c   |  4 
>  tests/ofp-actions.at | 11 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index
> 28382e012..5a4b12d9f 100644
> --- a/lib/ofp-ed-props.c
> +++ b/lib/ofp-ed-props.c
> @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header
> **ofp_prop,
>  if (len > sizeof(*opnmt) || len > *remaining) {
>  return OFPERR_NXBAC_BAD_ED_PROP;
>  }
> +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) {
> +return OFPERR_NXBRC_MUST_BE_ZERO;
> +}
>  struct ofpact_ed_prop_nsh_md_type *pnmt =
>  ofpbuf_put_uninit(out, sizeof(*pnmt));
>  pnmt->header.prop_class = prop_class; @@ -108,6 +111,7 @@
> encode_ed_prop(const struct ofpact_ed_prop **prop,
>  opnmt->header.len =
>  offsetof(struct ofp_ed_prop_nsh_md_type, pad);
>  opnmt->md_type = pnmt->md_type

Re: [ovs-dev] [PATCH] userspace: Switch default cache from EMC to SMC.

2020-09-22 Thread Jan Scheurich via dev
> -Original Message-
> From: Flavio Leitner 
> On Tue, Sep 22, 2020 at 01:22:58PM +0200, Ilya Maximets wrote:
> > On 9/19/20 3:07 PM, Flavio Leitner wrote:
> > > The EMC is not large enough for current production cases and they
> > > are scaling up, so this change switches over from EMC to SMC by
> > > default, which provides better results.
> > >
> > > The EMC is still available and could be used when only a few number
> > > of flows is used.


> 
> I am curious to find out what others think about this change, so going to 
> wait a
> bit before following up with the next version if that sounds OK.
> 

For production deployments of OVS-DPDK in NFVI we also recommend switching SMC 
on and EMC off.
No problem with making that configuration the default in the future.
As is to be expected, SMC only provides acceleration over DPCLS if the avg.
number of sub-table lookups is > 1.

BR, Jan


Re: [ovs-dev] [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash itself.

2019-10-24 Thread Jan Scheurich via dev
An even simpler solution to the problem.
Acked-by: Jan Scheurich 

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Thursday, 24 October, 2019 14:32
> To: ovs-dev@openvswitch.org
> Cc: Ian Stokes ; Kevin Traynor ;
> Jan Scheurich ; ychen103...@163.com; Ilya
> Maximets 
> Subject: [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash
> itself.
> 
> Mixing of RSS hash with recirculation depth is useful for flow lookup because
> same packet after recirculation should match with different datapath rule.
> Setting of the mixed value back to the packet is completely unnecessary
> because recirculation depth is different on each recirculation, i.e. we will 
> have
> different packet hash for flow lookup anyway.
> 
> This should fix the issue that packets from the same flow could be directed to
> different buckets based on a dp_hash or different ports of a balanced bonding
> in case they were recirculated different number of times (e.g. due to 
> conntrack
> rules).
> With this change, the original RSS hash will remain the same making it 
> possible
> to calculate equal dp_hash values for such packets.
> 
> Reported-at: https://protect2.fireeye.com/v1/url?k=0a51a6c3-56db840c-
> 0a51e658-0cc47ad93ea4-b7f1e9be8f7bbef8=1=c9f55798-de3c-45f4-afeb-
> 9a87d3d594ca=https%3A%2F%2Fmail.openvswitch.org%2Fpipermail%2Fovs-
> dev%2F2019-September%2F363127.html
> Fixes: 048963aa8507 ("dpif-netdev: Reset RSS hash when recirculating.")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/dpif-netdev.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4546b55e8..c09b8fd95
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -6288,7 +6288,6 @@ dpif_netdev_packet_get_rss_hash(struct dp_packet
> *packet,
>  recirc_depth = *recirc_depth_get_unsafe();
>  if (OVS_UNLIKELY(recirc_depth)) {
>  hash = hash_finish(hash, recirc_depth);
> -dp_packet_set_rss_hash(packet, hash);
>  }
>  return hash;
>  }
> --
> 2.17.1



Re: [ovs-dev] group dp_hash method works incorrectly when using snat

2019-09-30 Thread Jan Scheurich via dev
Hi,

You have pointed out an interesting issue in the netdev datapath implementation
(not sure to what extent the same also applies to the kernel datapath).

Conceptually, the dp_hash of a packet should be based on the current packet's 
flow. It should not change if the headers remain unchanged.

For performance reasons, the actual implementation of the dp_hash action in 
odp_execute.c bases the dp_hash value on the current RSS hash of the packet, if 
it has one, and only computes it from the actual packet content if not.

However, the RSS hash of the packet is updated with every recirculation in 
order to improve the EMC lookup success rate. So even if initially the RSS hash 
was a suitable base for dp_hash (that itself is uncertain, as the 
implemantation of the RSS hash is dependent on the NIC HW and might not satisfy 
the algorithm specified as part of the dp_hash action), its volatility at 
recirculation destroys the required property of the dp_hash.

What we could do is something like the following (not even compiler-tested):

diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 563ad1da8..1937bb1e6 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -820,16 +820,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 uint32_t hash;

 DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-/* RSS hash can be used here instead of 5tuple for
- * performance reasons. */
-if (dp_packet_rss_valid(packet)) {
-hash = dp_packet_get_rss_hash(packet);
-hash = hash_int(hash, hash_act->hash_basis);
-} else {
-flow_extract(packet, &flow);
-hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+if (packet->md.dp_hash == 0) {
+if (packet->md.recirc_id == 0 &&
+dp_packet_rss_valid(packet)) {
+/* RSS hash is used here instead of 5tuple for
+ * performance reasons. */
+hash = dp_packet_get_rss_hash(packet);
+hash = hash_int(hash, hash_act->hash_basis);
+} else {
+flow_extract(packet, &flow);
+hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+}
+if (unlikely(hash == 0)) {
+hash = 1;
+}
+packet->md.dp_hash = hash;
 }
-packet->md.dp_hash = hash;
 }
 break;
 }
@@ -842,6 +848,9 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 hash = flow_hash_symmetric_l3l4(&flow,
 hash_act->hash_basis,
 false);
+if (unlikely(hash == 0)) {
+hash = 1;
+}
 packet->md.dp_hash = hash;
 }
 break;
diff --git a/lib/packets.c b/lib/packets.c
index ab0b1a36d..a03a3ab61 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -391,6 +391,8 @@ push_mpls(struct dp_packet *packet, ovs_be16 ethtype, 
ovs_be32 lse)
 header = dp_packet_resize_l2_5(packet, MPLS_HLEN);
 memmove(header, header + MPLS_HLEN, len);
 memcpy(header + len, &lse, sizeof lse);
+/* Invalidate dp_hash */
+packet->md.dp_hash = 0;
 }

 /* If 'packet' is an MPLS packet, removes its outermost MPLS label stack entry.
@@ -411,6 +413,8 @@ pop_mpls(struct dp_packet *packet, ovs_be16 ethtype)
 /* Shift the l2 header forward. */
 memmove((char*)dp_packet_data(packet) + MPLS_HLEN, 
dp_packet_data(packet), len);
 dp_packet_resize_l2_5(packet, -MPLS_HLEN);
+/* Invalidate dp_hash */
+packet->md.dp_hash = 0;
 }
 }

@@ -444,6 +448,8 @@ push_nsh(struct dp_packet *packet, const struct nsh_hdr 
*nsh_hdr_src)
 packet->packet_type = htonl(PT_NSH);
 dp_packet_reset_offsets(packet);
 packet->l3_ofs = 0;
+/* Invalidate dp_hash */
+packet->md.dp_hash = 0;
 }

 bool
@@ -474,6 +480,8 @@ pop_nsh(struct dp_packet *packet)

 length = nsh_hdr_len(nsh);
 dp_packet_reset_packet(packet, length);
+/* Invalidate dp_hash */
+packet->md.dp_hash = 0;
 packet->packet_type = htonl(next_pt);
 /* Packet must be recirculated for further processing. */
 }
diff --git a/lib/packets.h b/lib/packets.h
index a4bee3819..8691fa0c2 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -98,8 +98,7 @@ PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
 uint32_t recirc_id; /* Recirculation id carried with the
recirculating 

Re: [ovs-dev] [RFC] dpif-netdev: only poll enabled vhost queues

2019-04-10 Thread Jan Scheurich
> >
> > I am afraid it is not a valid assumption that there will be similarly large
> number of OVS PMD threads as there are queues.
> >
> > In OpenStack deployments the OVS is typically statically configured to use a
> few dedicated host CPUs for PMDs (perhaps 2-8).
> >
> > Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or
> even more). If they enable an instance for multi-queue in Nova, Nova (in its
> eternal wisdom) will set up every vhostuser port with #vCPU queue pairs.
> 
> For me, it's an issue of Nova. It's pretty easy to limit the maximum number of
> queue pairs to some sane value (the value that could be handled by your
> number of available PMD threads).
> It'll be a one config and a small patch to nova-compute. With a bit more work
> you could make this per-port configurable and finally stop wasting HW
> resources.

OK, I fully agree. 
The OpenStack community is slow, though, when it comes to these kinds of changes.
Do we have contacts we could push?

> 
> > A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs,
> even if only one or two high-traffic ports can actually profit from 
> multi-queue.
> Even on those ports is it unlikely that the application will use all 16 
> queues. And
> often there would be another such VM on the second NUMA node.
> 
> With limiting the number of queues in Nova (like I described above) to 4 
> you'll
> have just
> 24 queues for 6 ports. If you'll make it per-port, you'll be able to limit 
> this to
> even more sane values.

Yes, per port configuration in Neutron seems the logical thing for me to do,
rather than a global per instance parameter in the Nova flavor. A per server
setting in Nova compute to limit the number of acceptable queue pairs to
match the OVS configuration might still be useful on top.

> 
> >
> > So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast
> number of un-used queue pairs in OVS and it makes a lot of sense to minimize
> the run-time impact of having these around.
> 
> For me it seems like not an OVS, DPDK or QEMU issue. The orchestrator should
> configure sane values first of all. It's totally unclear why we're changing 
> OVS
> instead of changing Nova.

The VNF orchestrator would request queues based on the application's needs. They
should not need to be aware of the configuration of the infrastructure (such as
the number of PMD threads in OVS). The OpenStack operator would have to make
sure that the instantiated queues are a good compromise between application
needs and infra capabilities.

> 
> >
> > We have had discussion earlier with RedHat as to how a vhostuser backend
> like OVS could negotiate the number of queue pairs with Qemu down to a
> reasonable value (e.g. the number PMDs available for polling) *before* Qemu
> would actually start the guest. The guest would then not have to guess on the
> optimal number of queue pairs to actually activate.
> >
> > BR, Jan
> >


Re: [ovs-dev] [RFC] dpif-netdev: only poll enabled vhost queues

2019-04-10 Thread Jan Scheurich
Hi Ilya,

> >
> > With a simple pvp setup of mine.
> > 1c/2t poll two physical ports.
> > 1c/2t poll four vhost ports with 16 queues each.
> >   Only one queue is enabled on each virtio device attached by the guest.
> >   The first two virtio devices are bound to the virtio kmod.
> >   The last two virtio devices are bound to vfio-pci and used to forward
> incoming traffic with testpmd.
> >
> > The forwarding zeroloss rate goes from 5.2Mpps (polling all 64 vhost queues)
> to 6.2Mpps (polling only the 4 enabled vhost queues).
> 
> That's interesting. However, this doesn't look like a realistic scenario.
> In practice you'll need much more PMD threads to handle so many queues.
> If you'll add more threads, zeroloss test could show even worse results if 
> one of
> idle VMs will periodically change the number of queues. Periodic latency 
> spikes
> will cause queue overruns and subsequent packet drops on hot Rx queues. This
> could be partially solved by allowing n_rxq to grow only.
> However, I'd be happy to have different solution that will not hide number of
> queues from the datapath.
> 

I am afraid it is not a valid assumption that there will be a similarly large
number of OVS PMD threads as there are queues.

In OpenStack deployments the OVS is typically statically configured to use a 
few dedicated host CPUs for PMDs (perhaps 2-8).

Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or even 
more). If they enable an instance for multi-queue in Nova, Nova (in its eternal 
wisdom) will set up every vhostuser port with #vCPU queue pairs. A (real world) 
VM with 20 vCPUs and 6 ports would have 120 queue pairs, even if only one or 
two high-traffic ports can actually profit from multi-queue. Even on those
ports it is unlikely that the application will use all 16 queues. And often
there would be another such VM on the second NUMA node.

So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast 
number of un-used queue pairs in OVS and it makes a lot of sense to minimize 
the run-time impact of having these around. 

We have had discussions earlier with Red Hat as to how a vhostuser backend like
OVS could negotiate the number of queue pairs with Qemu down to a reasonable
value (e.g. the number of PMDs available for polling) *before* Qemu would actually
start the guest. The guest would then not have to guess on the optimal number 
of queue pairs to actually activate.

BR, Jan


Re: [ovs-dev] [PATCH] dpif-netdev-perf: Fix millisecond stats precision with slower TSC.

2019-03-19 Thread Jan Scheurich
Hi Ilya, 

OK with me!

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Tuesday, 19 March, 2019 12:08
> To: ovs-dev@openvswitch.org; Ian Stokes 
> Cc: Kevin Traynor ; Ilya Maximets
> ; Jan Scheurich 
> Subject: [PATCH] dpif-netdev-perf: Fix millisecond stats precision with slower
> TSC.
> 
> Unlike x86 where TSC frequency usually matches with CPU frequency, another
> architectures could have much slower TSCs.
> For example, it's common for Arm SoCs to have 100 MHz TSC by default.
> In this case perf module will check for end of current millisecond each 10K
> cycles, i.e 10 times per millisecond. This could be not enough to collect 
> precise
> statistics.
> Fix that by taking current TSC frequency into account instead of hardcoding 
> the
> number of cycles.
> 
> CC: Jan Scheurich 
> Fixes: 79f368756ce8 ("dpif-netdev: Detailed performance stats for PMDs")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/dpif-netdev-perf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> 52324858d..e7ed49e7e 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -554,8 +554,8 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int
> rx_packets,
>  cum_ms = history_next(&s->milliseconds);
>  cum_ms->timestamp = now;
>  }
> -/* Do the next check after 10K cycles (4 us at 2.5 GHz TSC clock). */
> -s->next_check_tsc = cycles_counter_update(s) + 1;
> +/* Do the next check after 4 us (10K cycles at 2.5 GHz TSC clock). */
> +s->next_check_tsc = cycles_counter_update(s) + get_tsc_hz() /
> + 25;
>  }
>  }
> 
> --
> 2.17.1



Re: [ovs-dev] [PATCH] dpif-netdev-perf: Fix double update of perf histograms.

2019-03-18 Thread Jan Scheurich
Hi Ilya,

Thanks for spotting this. I believe your fix is correct.

BR, Jan

> -Original Message-
> From: Ilya Maximets 
> Sent: Monday, 18 March, 2019 14:01
> To: ovs-dev@openvswitch.org; Ian Stokes 
> Cc: Kevin Traynor ; Ilya Maximets
> ; Jan Scheurich 
> Subject: [PATCH] dpif-netdev-perf: Fix double update of perf histograms.
> 
> Real values of 'packets per batch' and 'cycles per upcall' already added to
> histograms in 'dpif-netdev' on receive. Adding the averages makes statistics
> wrong. We should not add to histograms values that never really appeared.
> 
> For exmaple, in current code following situation is possible:
> 
>   pmd thread numa_id 0 core_id 5:
>   ...
> Rx packets:  83  (0 Kpps, 13873 cycles/pkt)
> ...
> - Upcalls:3  (  3.6 %, 248.6 us/upcall)
> 
>   Histograms
> packets/it  pkts/batch   upcalls/it cycles/upcall
> 1 831 1661 3...
> 15848 2
> 19952 2
> ...
> 50118 2
> 
> i.e. all the packets counted twice in 'pkts/batch' column and all the upcalls
> counted twice in 'cycles/upcall' column.
> 
> CC: Jan Scheurich 
> Fixes: 79f368756ce8 ("dpif-netdev: Detailed performance stats for PMDs")
> Signed-off-by: Ilya Maximets 
> ---
>  lib/dpif-netdev-perf.c | 8 
>  1 file changed, 8 deletions(-)
> 
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index
> 8f0c9bc4f..52324858d 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -498,15 +498,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int
> rx_packets,
>  cycles_per_pkt = cycles / rx_packets;
>  histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
>  }
> -if (s->current.batches > 0) {
> -histogram_add_sample(&s->pkts_per_batch,
> - rx_packets / s->current.batches);
> -}
>  histogram_add_sample(&s->upcalls, s->current.upcalls);
> -if (s->current.upcalls > 0) {
> -histogram_add_sample(&s->cycles_per_upcall,
> - s->current.upcall_cycles / s->current.upcalls);
> -}
>  histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
> 
>  /* Add iteration samples to millisecond stats. */
> --
> 2.17.1



Re: [ovs-dev] [PATCH 1/2] odp-util: Fix a bug in parse_odp_push_nsh_action

2018-12-28 Thread Jan Scheurich
Hi Ben and Yfeng,

Looking at the current code on master I believe it is correct except that I 
wrongly used ofpbuf_push_zeros() where it should have been ofpbuf_put_zeros() 
to append the padding zeroes to the end. The ofpbuf should never be relocated 
through padding as it is allocated with the allowed maximum size from the start.
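
For illustration, the intended order of operations would be roughly the
following (a fragment-only sketch using the OVS ofpbuf helpers; it mirrors the
corrected patch below and assumes its variable names):

ofpbuf_use_stub(&b, metadata, sizeof metadata);
ofpbuf_put_hex(&b, buf, &mdlen);       /* append the decoded MD2 bytes       */
padding = PAD_SIZE(mdlen, 4);          /* pad the metadata to 4 bytes        */
if (padding > 0) {
    ofpbuf_put_zeros(&b, padding);     /* append, do not prepend, the zeroes */
}
md_size = mdlen + padding;
ofpbuf_uninit(&b);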

The entire MD2 metadata (1 or more TLVs) is treated as a single NLATTR. The 
reason is that for the purpose of push_nsh the datapath does not care about the 
internal structure. The whole MD2 complex is added as one binary blob of data. 
Remember we did not implement the possibility to match on specific MD2 TLV 
fields in OVS 2.8 as that would have required a generalization of the generic 
metadata TLV field infrastructure to match fields.

Admittedly, the possibility to specify arbitrary hex-encoded MD2 TLVs in a 
push_nsh action in ovs-dpctl is a bit raw as it trusts the hex data to be 
well-formed. As discussed, the datapath code doesn't really care, but the 
receiver of an NSH packet with malformed MD2 headers might choke. The 
alternatives would have been to not accept MD2 metadata in dpctl commands or to 
validate the hex data (at least from a TLV-structural perspective).

I believe for MD2 metadata specified in an OF encap_nsh action we do ensure the 
generated TLV structure is correct. 

I hope this helps.

BR, Jan

> -Original Message-
> From: Ben Pfaff 
> Sent: Thursday, 27 December, 2018 19:55
> To: Yifeng Sun ; Jan Scheurich
> 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH 1/2] odp-util: Fix a bug in
> parse_odp_push_nsh_action
> 
> On Wed, Dec 26, 2018 at 04:52:22PM -0800, Yifeng Sun wrote:
> > In this piece of code, 'struct ofpbuf b' should always point to
> > metadata so that metadata can be filled with values through ofpbuf
> > operations, like ofpbuf_put_hex and ofpbuf_push_zeros. However,
> > ofpbuf_push_zeros may change the data pointer of 'struct ofpbuf b'
> > and therefore, metadata will not contain the expected values. This
> > patch fixes it.
> >
> > Reported-at:
> > https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10863
> > Signed-off-by: Yifeng Sun 
> > ---
> >  lib/odp-util.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/odp-util.c b/lib/odp-util.c index
> > cb6a5f2047fd..af855873690c 100644
> > --- a/lib/odp-util.c
> > +++ b/lib/odp-util.c
> > @@ -2114,12 +2114,12 @@ parse_odp_push_nsh_action(const char *s,
> struct ofpbuf *actions)
> > >  if (ovs_scan_len(s, &n, "md2=0x%511[0-9a-fA-F]", buf)
> > >  && n/2 <= sizeof metadata) {
> > >  ofpbuf_use_stub(&b, metadata, sizeof metadata);
> > > -ofpbuf_put_hex(&b, buf, &mdlen);
> > >  /* Pad metadata to 4 bytes. */
> > >  padding = PAD_SIZE(mdlen, 4);
> > >  if (padding > 0) {
> > > -ofpbuf_push_zeros(&b, padding);
> > > +ofpbuf_put_zeros(&b, padding);
> > >  }
> > > +ofpbuf_put_hex(&b, buf, &mdlen);
> > >  md_size = mdlen + padding;
> > >  ofpbuf_uninit(&b);
> >  continue;
> 
> Yifeng, this fix looks wrong because it uses 'mdlen' in PAD_SIZE before
> initializing it.
> 
> This code is weird.  It adds padding to a 4-byte boundary even though I can't
> find any other code that checks for that or relies on it.
> Furthermore, it puts the padding **BEFORE** the metadata, which just seems
> super wrong.
> 
> When I look at datapath code for md2 I get even more confused.
> nsh_key_put_from_nlattr() seems to assume that an md2 attribute has well-
> formatted data in it, then nsh_hdr_from_nlattr() copies it without checking 
> into
> nh->md2, and then if it's not perfectly formatted then
> nsh->md2.length is going to be invalid.  If I'm reading it right, it
> also assumes there's exactly one TLV.
> 
> Jan, I think this is your code, can you help me understand this code?
> 
> Thanks,
> 
> Ben.


Re: [ovs-dev] Bug: select group with dp_hash causing recursive recirculation

2018-09-26 Thread Jan Scheurich
Hi Zang,

Thanks for reporting this bug. As I see it, the check on dp_hash != 0 in 
ofproto-dpif-xlate.c is there to guarantee that a dp_hash value has been 
computed for the packet once before, not necessarily that a new one is computed 
for each translated select group. That's why a check for a valid dp_hash is OK.

But all datapaths must adhere to the invariant that a valid dp_hash != 0. Indeed 
the kernel datapath implements this:

datapath/linux/actions.c:
   1071 static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key,
   1072  const struct nlattr *attr)
   1073 {
   1074 struct ovs_action_hash *hash_act = nla_data(attr);
   1075 u32 hash = 0;
   1076
   1077 /* OVS_HASH_ALG_L4 is the only possible hash algorithm.  */
   1078 hash = skb_get_hash(skb);
   1079 hash = jhash_1word(hash, hash_act->hash_basis);
   1080 if (!hash)
   1081 hash = 0x1;
   1082
   1083 key->ovs_flow_hash = hash;
   1084 }

The correct fix in my view would be to implement the same for the netdev 
datapath in lib/odp-execute.c.
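
A minimal sketch of what that userspace counterpart could look like
(illustrative only; it assumes OVS's hash_int() from lib/hash.h and is not the
actual odp-execute.c code):

/* Never let a computed dp_hash be 0, mirroring the kernel's execute_hash(),
 * so that dp_hash == 0 keeps meaning "not yet computed". */
static uint32_t
netdev_dp_hash(uint32_t rss_hash, uint32_t basis)
{
    uint32_t hash = hash_int(rss_hash, basis);
    return hash ? hash : 0x1;
}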

This requirement on the dp_hash action implementations should better be 
documented properly.

BR, Jan

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org  On 
> Behalf Of Zang MingJie
> Sent: Tuesday, 25 September, 2018 10:45
> To: ovs dev 
> Subject: [ovs-dev] Bug: select group with dp_hash causing recursive 
> recirculation
> 
> Hi, we found a serious problem where one pmd is stop working, I want to
> share the problem and find solution here.
> 
> vswitchd log:
> 
>   2018-09-13T23:36:44.377Z|40269235|dpif_netdev(pmd45)|WARN|Packet dropped.
> Max recirculation depth exceeded.
>   2018-09-13T23:36:44.387Z|40269236|dpif_netdev(pmd45)|WARN|Packet dropped.
> Max recirculation depth exceeded.
>   2018-09-13T23:36:44.391Z|40269237|dpif_netdev(pmd45)|WARN|Packet dropped.
> Max recirculation depth exceeded.
> 
> problematic datapath flows:
> 
> 
> ct_state(+new-est),recirc_id(0x143c893),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=443),
> packets:84573093, bytes:6308807903, used:0.009s,
> flags:SFPRU.ECN[200][400][800],
> actions:meter(306),hash(hash_l4(0)),recirc(0x237b09d)
> 
> 
> recirc_id(0x237b09d),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> packets:279713339, bytes:20890205186, used:0.007s,
> flags:SFPRU.ECN[200][400][800], actions:hash(hash_l4(0)),recirc(0x237b09d)
> 
> corresponding openflow:
> 
>   cookie=0x5b5ab65e000f0101, duration=4848269.642s, table=40,
> n_packets=974343805, n_bytes=72484367083,
> priority=10,tcp,metadata=0xf0100/0xff00,tp_dst=443
> actions=group:983297
> 
> 
> group_id=983297,type=select,selection_method=dp_hash,bucket=bucket_id:3057033848,weight:100,actions=ct(commit,table=70,zo
> ne=15,exec(nat(dst=10.177.251.203:443))),...``lots
> of buckets``...
> 
> 
> 
> Following explains how select group with dp_hash works.
> 
> To implement select group with dp_hash, two datapath flows are needed:
> 
>  1. calculate dp_hash, recirculate to second one
>  2. select group bucket by dp_hash
> 
> When encounter a datapath miss, openflow doesn't know which one is missing,
> so it depends on dp_hash value of the packet:
> 
>  if dp_hash == 0 generate first dp flow.
> if dp_hash != 0 generate second dp flow.
> 
> 
> Back to the problem.
> 
> Notice that second datapath flow is a dead loop, it recirculate to itself.
> The cause of the problem is here ofproto/ofproto-dpif-xlate.c#L4429[1]:
> 
> /* dp_hash value 0 is special since it means that the dp_hash has not
> been
>  * computed, as all computed dp_hash values are non-zero.  Therefore
>  * compare to zero can be used to decide if the dp_hash value is valid
>  * without masking the dp_hash field. */
> if (!dp_hash) {
> 
> The comment saying that `dp_hash` shouldn't be zero, but under DPDK, it can
> be zero, at lib/odp-execute.c#L747[2]
> 
>/* RSS hash can be used here instead of 5tuple for
> * performance reasons. */
>if (dp_packet_rss_valid(packet)) {
>hash = dp_packet_get_rss_hash(packet);
>hash = hash_int(hash, hash_act->hash_basis);
>} else {
>flow_extract(packet, );
>hash = flow_hash_5tuple(, hash_act->hash_basis);
>}
>packet->md.dp_hash = hash;
> 
> I don't know how small chance that `hash_int` returns 0, we have tested
> that if the final hash is 0, will definitely trigger the same bug. And due
> to the chance is extremely low, I'm also investigation that if there are
> other situation that will pass 0 hash to ofp.
> 
> 
> 
> IMO, it is silly to depends on dp_hash value, maybe we need a new mechanism
> which can pass data between ofp and odp freely. And a quick solution could
> be just change the 0 hash to 1.
> 
> 
> [1]
> https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4429
> [2] 

Re: [ovs-dev] [PATCH v4] Upcall/Slowpath rate-limiter for OVS

2018-06-08 Thread Jan Scheurich
> Have you considered making this token bucket per-port instead of
> per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
> particular PMD, possibly leading to an unfair performance for that PMD
> thread.  Am I just being overly paranoid?
> [manu] Yes, this is possible. But it can happen for both fast and slowpath 
> today, as PMDs sequentially iterate through ports. In order
> to keep it simple, its done per-PMD. It can be extended to per-port if needed.

The purpose of the upcall rate limiter for the netdev datapath is to protect a 
PMD from becoming bogged down by having to process an excessive number of 
upcalls. It is not to police the number of upcalls per port to some rate, 
especially not across multiple PMDs (in the case of RSS).

I think what you are after, Aaron, is some kind of fairness scheme that 
provides each rx queue with a minimum rate of upcalls even if the global PMD 
rate limit is reached? I don't believe simply partitioning the global PMD rate 
limit into a number of smaller rx queue buckets would be a good solution. But I 
don't have a better alternative either.

I agree with Manu that it should not stop us from implementing the PMD-level 
protection. We can add a fairness scheme later, if needed.
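
Just to illustrate the kind of per-PMD protection being discussed, a token
bucket for upcall admission could look roughly like the sketch below (names,
units and refill policy are my assumptions, not the actual patch):

#include <stdbool.h>
#include <stdint.h>

/* Per-PMD token bucket: each upcall consumes one token; tokens refill at
 * 'rate' per second up to a depth of 'burst'. */
struct upcall_tb {
    uint64_t tokens;
    uint64_t rate;
    uint64_t burst;
    long long int last_fill_ms;
};

static bool
upcall_tb_admit(struct upcall_tb *tb, long long int now_ms)
{
    uint64_t refill = (now_ms - tb->last_fill_ms) * tb->rate / 1000;

    if (refill > 0) {
        tb->tokens += refill;
        if (tb->tokens > tb->burst) {
            tb->tokens = tb->burst;
        }
        tb->last_fill_ms = now_ms;
    }
    if (tb->tokens == 0) {
        return false;   /* Over budget: skip the slow-path upcall. */
    }
    tb->tokens--;
    return true;
}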

BR, Jan


Re: [ovs-dev] [PATCH] Improved Packet Drop Statistics in OVS.

2018-06-06 Thread Jan Scheurich
The user-space part for packet drop stats should be generic and work with any 
dpif datapath. 
So, if someone implemented the equivalent drop stats functionality in the 
kernel datapath that would be very welcome.
We in Ericsson cannot do that currently due to license restrictions.

Regards, Jan

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org  On 
> Behalf Of Rohith Basavaraja
> Sent: Friday, 25 May, 2018 07:37
> To: Ben Pfaff 
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH] Improved Packet Drop Statistics in OVS.
> 
> Thanks Ben for the clarification. Yes this new stuff is used only in the
> DPDK datapath and it’s not used in the kernel datapath.
> 
> Thanks
> Rohith
> 
> On 25/05/18, 2:52 AM, "Ben Pfaff"  wrote:
> 
> On Thu, May 24, 2018 at 02:19:06AM +, Rohith Basavaraja wrote:
> > Only  changes in
> > datapath/linux/compat/include/linux/openvswitch.h
> > are related to OvS Kernel module.
> 
> On a second look, I see that the new stuff here is only for the DPDK
> datapath.  If you don't intend to add this feature to the kernel
> datapath, there should be no problem.  Never mind.
> 
> 


Re: [ovs-dev] OVS-DPDK public meeting

2018-06-06 Thread Jan Scheurich
> > I was planning to if the community agreed it was warranted.
> >
> > However the general feeling expressed at the past few community calls is
> > that the next move should be to DPDK 18.11 LTS and I tend to agree with
> > this.
> >
> > The main advantage of this is the DPDK LTS lifecycle provides bug fixes
> > for DPDK for 2 years from release. Moving to a non DPDK LTS becomes a pain
> > as critical bug fixes will not be backported on the DPDK side so are not
> > addressed in OVS with DPDK either, we've seen this with some of the CVE
> > fixes for vhost quite recently.
> >
> > 18.05 is also the largest DPDK release to date with a lot of code being
> > introduced in the later RC stages which IMO increases the risk rather than
> > the gain of moving to it.
> >
> > However I'm open to discussing if a move to 18.05 is warranted, are there
> > any critical features or usecases it enables that you had in mind?
> >
> 
> There are always the two big groups of users.
> - Those that want max stability for a huge Production setup (which would
> follow the pick LTS argument)
> - And those that want/need the very latest HW support and features (which
> would always prefer the latest version)

I subscribe to that statement.

> 
> I had no single critical feature in mind for 18.05, but especially your
> argument of "the largest DPDK release to date with a lot of code being
> introduced" makes it interesting for the second group.
> Actually I think there are also plenty of new devices which are not
> supported at all before 18.02/18.05.
> 
> So far my DPDK upgrade policy was "the last DPDK available which has at
> least one point release AND works with OVS".
> If OpenVswitch really changed to only support to each DPDK LTS version,
> then I might have to follow that.
> I must admit I already had the same thought to only pick .11 stable
> versions, so I'm not totally opposed if that is the way it is preferred for
> Openvswitch.
> 
> But if we can make this a toleration (saying it works with 17.11 AND newer
> 18.05) then this would be a great contrib to OVS and IMHO be warranted.
> If the latter would work it could be great to spot issues early on instead
> of having a super-big jump from 17.11 to 18.11 in one shot.
> But if you have to kill support for 17.11 to let it work with 18.05, then
> better not.
> 
> Interested what other opinions on this are.

In my eyes that would be the only way to satisfy both user groups' needs: Keep
default support for the associated DPDK LTS release and add optional support for
bleeding edge DPDK versions. The downside of this is that it will likely 
clutter the OVS code with conditional compiler directives to handle the DPDK 
API/ABI incompatibilities. Plus, someone must also clean these up at a later 
stage when they are no longer needed.

Today, OVS developers that really need the latest DPDK typically fork/branch 
OVS locally and maintain their fork until OVS master switches to the required 
DPDK version. That model doesn't burden the community with the problem.

BR, Jan


[ovs-dev] [PATCH v4 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-05-24 Thread Jan Scheurich
The current implementation of the "dp_hash" selection method suffers
from two deficiencies: 1. The hash mask and hence the number of dp_hash
values is just large enough to cover the number of group buckets, but
does not consider the case that buckets have different weights. 2. The
xlate-time selection of best bucket from the masked dp_hash value often
results in bucket load distributions that are quite different from the
bucket weights because the number of available masked dp_hash values
is too small (2-6 bits compared to 32 bits of a full hash in the default
hash selection method).

This commit provides a more accurate implementation of the dp_hash
select group by applying the well known Webster method for distributing
a small number of "seats" fairly over the weighted "parties"
(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
The dp_hash mask is automatically chosen large enough to provide good
enough accuracy even with widely differing weights.
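
For illustration, here is a minimal, self-contained sketch of such a Webster
(Sainte-Laguë) allocation loop. It is a hypothetical helper, not the code of
this patch; the patch's actual implementation lives in
group_setup_dp_hash_table() and additionally places the assigned hash values
in pseudo-random order.

#include <stdlib.h>

/* Sketch only: distribute 'n_slots' hash values over 'n' buckets with the
 * given integer weights using the Webster/Sainte-Laguë highest-averages
 * method: repeatedly award the next slot to the bucket with the largest
 * weight/divisor quotient, with divisors 1, 3, 5, ... */
static void
webster_distribute(const unsigned *weight, int n, int n_slots, int *slots)
{
    double *value = malloc(n * sizeof *value);
    int *divisor = malloc(n * sizeof *divisor);

    for (int i = 0; i < n; i++) {
        value[i] = weight[i];
        divisor[i] = 1;
        slots[i] = 0;
    }
    for (int s = 0; s < n_slots; s++) {
        int best = 0;
        for (int i = 1; i < n; i++) {
            if (value[i] > value[best]) {
                best = i;
            }
        }
        slots[best]++;           /* Bucket 'best' gets one more hash value. */
        divisor[best] += 2;      /* Divisor sequence 1, 3, 5, ... */
        value[best] = (double) weight[best] / divisor[best];
    }
    free(value);
    free(divisor);
}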

This distribution happens at group modification time and the resulting
table is stored with the group-dpif struct. At xlation time, we use the
masked dp_hash values as index to look up the assigned bucket.

If the assigned bucket is not live, we do a circular search over the
mapping table until we find the first live bucket. As the buckets in
the table are by construction in pseudo-random order with a frequency
according to their weight, this method maintains correct distribution
even if one or more buckets are non-live.

Xlation is further simplified by storing some derived select group state
at group construction in struct group-dpif in a form better suited for
xlation purposes.

Adapted the unit test case for dp_hash select group accordingly.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 lib/odp-util.c   |   4 +-
 ofproto/ofproto-dpif-xlate.c |  59 ++---
 ofproto/ofproto-dpif.c   | 150 +++
 ofproto/ofproto-dpif.h   |  13 
 tests/ofproto-dpif.at|  15 +++--
 5 files changed, 211 insertions(+), 30 deletions(-)

diff --git a/lib/odp-util.c b/lib/odp-util.c
index 105ac80..8d4afa0 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -595,7 +595,9 @@ format_odp_hash_action(struct ds *ds, const struct 
ovs_action_hash *hash_act)
 ds_put_format(ds, "hash(");
 
 if (hash_act->hash_alg == OVS_HASH_ALG_L4) {
-ds_put_format(ds, "hash_l4(%"PRIu32")", hash_act->hash_basis);
+ds_put_format(ds, "l4(%"PRIu32")", hash_act->hash_basis);
+} else if (hash_act->hash_alg == OVS_HASH_ALG_SYM_L4) {
+ds_put_format(ds, "sym_l4(%"PRIu32")", hash_act->hash_basis);
 } else {
 ds_put_format(ds, "Unknown hash algorithm(%"PRIu32")",
   hash_act->hash_alg);
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 9f7fca7..c990d8a 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4392,27 +4392,37 @@ pick_hash_fields_select_group(struct xlate_ctx *ctx, 
struct group_dpif *group)
 static struct ofputil_bucket *
 pick_dp_hash_select_group(struct xlate_ctx *ctx, struct group_dpif *group)
 {
+uint32_t dp_hash = ctx->xin->flow.dp_hash;
+
 /* dp_hash value 0 is special since it means that the dp_hash has not been
  * computed, as all computed dp_hash values are non-zero.  Therefore
  * compare to zero can be used to decide if the dp_hash value is valid
  * without masking the dp_hash field. */
-if (!ctx->xin->flow.dp_hash) {
-uint64_t param = group->up.props.selection_method_param;
-
-ctx_trigger_recirculate_with_hash(ctx, param >> 32, (uint32_t)param);
+if (!dp_hash) {
+enum ovs_hash_alg hash_alg = group->hash_alg;
+if (hash_alg > ctx->xbridge->support.max_hash_alg) {
+/* Algorithm supported by all datapaths. */
+hash_alg = OVS_HASH_ALG_L4;
+}
+ctx_trigger_recirculate_with_hash(ctx, hash_alg, group->hash_basis);
 return NULL;
 } else {
-uint32_t n_buckets = group->up.n_buckets;
-if (n_buckets) {
-/* Minimal mask to cover the number of buckets. */
-uint32_t mask = (1 << log_2_ceil(n_buckets)) - 1;
-/* Multiplier chosen to make the trivial 1 bit case to
- * actually distribute amongst two equal weight buckets. */
-uint32_t basis = 0xc2b73583 * (ctx->xin->flow.dp_hash & mask);
-
-ctx->wc->masks.dp_hash |= mask;
-return group_best_live_bucket(ctx, group, basis);
+uint32_t hash_mask = group->hash_mask;
+ctx->wc

[ovs-dev] [PATCH v4 3/3] ofproto-dpif: Use dp_hash as default selection method

2018-05-24 Thread Jan Scheurich
The dp_hash selection method for select groups overcomes the scalability
problems of the current default selection method which, due to L2-L4
hashing during xlation and un-wildcarding of the hashed fields,
basically requires an upcall to the slow path to load-balance every
L4 connection. The consequences are an explosion of datapath flows
(megaflows degenerate to miniflows) and a limitation of the connection
setup rate OVS can handle.

This commit changes the default selection method to dp_hash, provided the
bucket configuration is such that the dp_hash method can accurately
represent the bucket weights with up to 64 hash values. Otherwise we
stick to the original default hash method.

We use the new dp_hash algorithm OVS_HASH_L4_SYMMETRIC to maintain the
symmetry property of the old default hash method.

A controller can explicitly request the old default hash selection method
by specifying selection method "hash" with an empty list of fields in the
Group properties of the OpenFlow 1.5 Group Mod message.

Update the documentation about selection method in the ovs-ofctl man page.

Revise and complete the ofproto-dpif unit tests cases for select groups.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 NEWS   |   2 +
 lib/ofp-group.c|  15 ++-
 ofproto/ofproto-dpif.c |  30 +++--
 ofproto/ofproto-dpif.h |   1 +
 ofproto/ofproto-provider.h |   2 +-
 tests/mpls-xlate.at|  26 ++--
 tests/ofproto-dpif.at  | 316 +++--
 tests/ofproto-macros.at|   7 +-
 utilities/ovs-ofctl.8.in   |  47 ---
 9 files changed, 334 insertions(+), 112 deletions(-)

diff --git a/NEWS b/NEWS
index ec548b0..2b2be1e 100644
--- a/NEWS
+++ b/NEWS
@@ -17,6 +17,8 @@ Post-v2.9.0
  * OFPT_ROLE_STATUS is now available in OpenFlow 1.3.
  * OpenFlow 1.5 extensible statistics (OXS) now implemented.
  * New OpenFlow 1.0 extensions for group support.
+ * Default selection method for select groups is now dp_hash with improved
+   accuracy.
- Linux kernel 4.14
  * Add support for compiling OVS with the latest Linux 4.14 kernel
- ovn:
diff --git a/lib/ofp-group.c b/lib/ofp-group.c
index f5b0af8..697208f 100644
--- a/lib/ofp-group.c
+++ b/lib/ofp-group.c
@@ -1600,12 +1600,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf 
*payload,
 return OFPERR_OFPBPC_BAD_VALUE;
 }
 
-error = oxm_pull_field_array(payload->data, fields_len,
- >fields);
-if (error) {
-OFPPROP_LOG(, false,
+if (fields_len > 0) {
+error = oxm_pull_field_array(payload->data, fields_len,
+>fields);
+if (error) {
+OFPPROP_LOG(, false,
 "ntr selection method fields are invalid");
-return error;
+return error;
+}
+} else {
+/* Selection_method "hash" w/o fields means default hash method. */
+gp->fields.values_size = 0;
 }
 
 return 0;
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index c9c2e51..a45d6ea 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1,5 +1,4 @@
 /*
- * Copyright (c) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, 
Inc.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -4787,7 +4786,7 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 } *webster;
 
 if (n_buckets == 0) {
-VLOG_DBG("  Don't apply dp_hash method without buckets");
+VLOG_DBG("  Don't apply dp_hash method without buckets.");
 return false;
 }
 
@@ -4862,9 +4861,24 @@ group_set_selection_method(struct group_dpif *group)
 const struct ofputil_group_props *props = >up.props;
 const char *selection_method = props->selection_method;
 
+VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id);
 if (selection_method[0] == '\0') {
-VLOG_DBG("No selection method specified.");
-group->selection_method = SEL_METHOD_DEFAULT;
+VLOG_DBG("No selection method specified. Trying dp_hash.");
+/* If the controller has not specified a selection method, check if
+ * the dp_hash selection method with max 64 hash values is appropriate
+ * for the given bucket configuration. */
+if (group_setup_dp_hash_table(group, 64)) {
+/* Use dp_hash selection method with symmetric L4 hash. */
+group->selection_method = SEL_METHOD_DP_HASH;
+group->hash_alg = OVS_HASH_ALG_SYM_L4;
+group->hash_basis = 0;
+VLOG_DBG("Use dp_hash with %d hash

[ovs-dev] [PATCH v4 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm

2018-05-24 Thread Jan Scheurich
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in
the netdev datapath. It will be used as default hash algorithm for the
dp_hash-based select groups in a subsequent commit to maintain
compatibility with the symmetry property of the current default hash
selection method.

A new dpif_backer_support field 'max_hash_alg' is introduced to reflect
the highest hash algorithm a datapath supports in the dp_hash action.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |  4 ++
 lib/flow.c| 43 +-
 lib/flow.h|  1 +
 lib/odp-execute.c | 23 ++--
 ofproto/ofproto-dpif-xlate.c  |  7 +++-
 ofproto/ofproto-dpif.c| 45 +++
 ofproto/ofproto-dpif.h|  5 ++-
 7 files changed, 121 insertions(+), 7 deletions(-)

diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
b/datapath/linux/compat/include/linux/openvswitch.h
index 6f4fa01..5c1e238 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -724,6 +724,10 @@ struct ovs_action_push_vlan {
  */
 enum ovs_hash_alg {
OVS_HASH_ALG_L4,
+#ifndef __KERNEL__
+   OVS_HASH_ALG_SYM_L4,
+#endif
+   __OVS_HASH_MAX
 };
 
 /*
diff --git a/lib/flow.c b/lib/flow.c
index 136f060..75ca456 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -2124,6 +2124,45 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t 
basis)
 return jhash_bytes(, sizeof fields, basis);
 }
 
+/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. */
+uint32_t
+flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis)
+{
+union {
+struct {
+ovs_be16 eth_type;
+ovs_be16 vlan_tci;
+struct eth_addr eth_addr;
+ovs_be16 pad;
+};
+uint32_t word[3];
+} fields;
+
+uint32_t hash = basis;
+int i;
+
+if (flow->packet_type != htonl(PT_ETH)) {
+/* Cannot hash non-Ethernet flows */
+return 0;
+}
+
+for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) {
+fields.eth_addr.be16[i] =
+flow->dl_src.be16[i] ^ flow->dl_dst.be16[i];
+}
+fields.vlan_tci = 0;
+for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) {
+fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK);
+}
+fields.eth_type = flow->dl_type;
+fields.pad = 0;
+
+hash = hash_add(hash, fields.word[0]);
+hash = hash_add(hash, fields.word[1]);
+hash = hash_add(hash, fields.word[2]);
+return hash_finish(hash, basis);
+}
+
 /* Hashes 'flow' based on its L3 through L4 protocol information */
 uint32_t
 flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
@@ -2144,8 +2183,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, 
uint32_t basis,
 hash = hash_add64(hash, a[i] ^ b[i]);
 }
 } else {
-/* Cannot hash non-IP flows */
-return 0;
+/* Revert to hashing L2 headers */
+return flow_hash_symmetric_l2(flow, basis);
 }
 
 hash = hash_add(hash, flow->nw_proto);
diff --git a/lib/flow.h b/lib/flow.h
index 7a9e7d0..9de94b2 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port)
 
 uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis);
+uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
  bool inc_udp_ports );
 
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index c5080ea..5831d1f 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -730,14 +730,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 }
 
 switch ((enum ovs_action_attr) type) {
+
 case OVS_ACTION_ATTR_HASH: {
 const struct ovs_action_hash *hash_act = nl_attr_get(a);
 
-/* Calculate a hash value directly.  This might not match the
+/* Calculate a hash value directly. This might not match the
  * value computed by the datapath, but it is much less expensive,
  * and the current use case (bonding) does not require a strict
  * match to work properly. */
-if (hash_act->hash_alg == OVS_HASH_ALG_L4) {
+switch (hash_act->hash_alg) {
+case OVS_HASH_ALG_L4: {
 struct flow flow;
 uint32_t hash;
 
@@ -753,7 +755,22 @@ odp_execute_actions(

[ovs-dev] [PATCH v4 0/3] Use improved dp_hash select group by default

2018-05-24 Thread Jan Scheurich
The current default OpenFlow select group implementation sends every new L4 flow
to the slow path for the balancing decision and installs a 5-tuple "miniflow"
in the datapath to forward subsequent packets of the connection accordingly. 
Clearly this has major scalability issues with many parallel L4 flows and high
connection setup rates.

The dp_hash selection method for the OpenFlow select group was added to OVS
as an alternative. It avoids the scalability issues for the price of an 
additional recirculation in the datapath. The dp_hash method is only available
to OF1.5 SDN controllers speaking the Netronome Group Mod extension to 
configure the selection mechanism. This severely limited the applicability of
the dp_hash select group in the past.

Furthermore, testing revealed that the implemented dp_hash selection often
generated a very uneven distribution of flows over group buckets and didn't 
consider bucket weights at all.

The present patch set in a first step improves the dp_hash selection method to
much more accurately distribute flows over weighted group buckets and to
apply a symmetric dp_hash function to maintain the symmetry property of the
legacy hash function. In a second step it makes the improved dp_hash method
the default in OVS for select groups that can be accurately handled by dp_hash.
That should be the vast majority of cases. Otherwise we fall back to the
legacy slow-path selection method.

The Netronome extension can still be used to override the default decision and
require the legacy slow-path or the dp_hash selection method.

v3 -> v4:
- Rebased to master (commit 82d5b337cd).
- Implemented Ben's improvement suggestions for patch 2/3.
- Fixed machine dependency of one select group test case.

v2 -> v3:
- Fixed another corner case crash reported by Chen Yuefang.
- Fixed several sparse and clang warnings reported by Ben.
- Rewritten the select group unit tests to abstract the checks from
  the behavior of the system-specific hash function implementation.
- Added dpif_backer_support field for dp_hash algorithms to prevent
  using the new OVS_HASH_L4_SYMMETRIC algorithm if it is not 
  supported by the datapath.

v1 -> v2:
- Fixed crashes for corner cases reported by Chen Yuefang.
- Fixed group ref leakage with dp_hash reported by Chen Yuefang.
- Changed all xlation logging from INFO to DBG.
- Revised, completed and detailed select group unit test cases in 
ofproto-dpif.
- Updated selection_method documentation in ovs-ofctl man page.
- Added NEWS item.


Jan Scheurich (3):
  userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
  ofproto-dpif: Improve dp_hash selection method for select groups
  ofproto-dpif: Use dp_hash as default selection method

 NEWS  |   2 +
 datapath/linux/compat/include/linux/openvswitch.h |   4 +
 lib/flow.c|  43 ++-
 lib/flow.h|   1 +
 lib/odp-execute.c |  23 +-
 lib/odp-util.c|   4 +-
 lib/ofp-group.c   |  15 +-
 ofproto/ofproto-dpif-xlate.c  |  66 +++--
 ofproto/ofproto-dpif.c| 211 ++-
 ofproto/ofproto-dpif.h|  19 +-
 ofproto/ofproto-provider.h|   2 +-
 tests/mpls-xlate.at   |  26 +-
 tests/ofproto-dpif.at | 315 +-
 tests/ofproto-macros.at   |   7 +-
 utilities/ovs-ofctl.8.in  |  47 ++--
 15 files changed, 651 insertions(+), 134 deletions(-)

-- 
1.9.1



Re: [ovs-dev] [PATCH v3 3/3] ofproto-dpif: Use dp_hash as default selection method

2018-05-14 Thread Jan Scheurich
> 
> Thanks for working on this.
> 
> I get the following test failure with this applied (with or without the
> incremental changes I suggested for patch 2).
> 
> Will you take a look?
> 

The test should verify that only one of the buckets is hit when the packets 
have no entropy in the custom hash fields. Which bucket is hit depends on the 
hash function implementation and can differ between platforms. Will fix the 
check.

Regards, Jan

> Thanks,
> 
> Ben.
> 


Re: [ovs-dev] [PATCH v3 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-05-14 Thread Jan Scheurich
> 
> Thanks a lot.
> 
> I don't think that the new 'aux' member in ofputil_bucket is too
> useful.  It looks to me like the only use of it could be kept just as
> easily in struct webster.
> 
> group_setup_dp_hash_table() uses floating-point arithmetic for good
> reasons, but it seems to me that some of it is unnecessary, especially
> since we have DIV_ROUND_UP and ROUND_UP_POW2.
> 
> group_dp_hash_best_bucket() seems like it unnecessarily modifies its
> dp_hash parameter (and then never uses it again) and unnecessarily uses
> % when & would work.  I also saw a few ways to make the style better
> match what we most often do these days.
> 
> So here's an incremental that I suggest folding in for v4.  What do you
> think?

I agree with your suggestions. The incremental looks good to me. Will include 
it in v4.

Thanks, Jan

> 
> Thanks,
> 
> Ben.
> 
> --8<--cut here-->8--
> 
> diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h
> index af4033dc68e4..8d893a53fcb2 100644
> --- a/include/openvswitch/ofp-group.h
> +++ b/include/openvswitch/ofp-group.h
> @@ -47,7 +47,6 @@ struct bucket_counter {
>  /* Bucket for use in groups. */
>  struct ofputil_bucket {
>  struct ovs_list list_node;
> -uint16_t aux;   /* Padding. Also used for temporary data. */
>  uint16_t weight;/* Relative weight, for "select" groups. */
>  ofp_port_t watch_port;  /* Port whose state affects whether this 
> bucket
>   * is live. Only required for fast failover
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index e35582df0c37..1c78c2d7ca50 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4386,26 +4386,22 @@ group_dp_hash_best_bucket(struct xlate_ctx *ctx,
>const struct group_dpif *group,
>uint32_t dp_hash)
>  {
> -struct ofputil_bucket *bucket, *best_bucket = NULL;
> -uint32_t n_hash = group->hash_mask + 1;
> -
> -uint32_t hash = dp_hash &= group->hash_mask;
> -ctx->wc->masks.dp_hash |= group->hash_mask;
> +uint32_t hash_mask = group->hash_mask;
> +ctx->wc->masks.dp_hash |= hash_mask;
> 
>  /* Starting from the original masked dp_hash value iterate over the
>   * hash mapping table to find the first live bucket. As the buckets
>   * are quasi-randomly spread over the hash values, this maintains
>   * a distribution according to bucket weights even when some buckets
>   * are non-live. */
> -for (int i = 0; i < n_hash; i++) {
> -bucket = group->hash_map[(hash + i) % n_hash];
> -if (bucket_is_alive(ctx, bucket, 0)) {
> -best_bucket = bucket;
> -break;
> +for (int i = 0; i <= hash_mask; i++) {
> +struct ofputil_bucket *b = group->hash_map[(dp_hash + i) & 
> hash_mask];
> +if (bucket_is_alive(ctx, b, 0)) {
> +return b;
>  }
>  }
> 
> -return best_bucket;
> +return NULL;
>  }
> 
>  static void
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index f5ecd8be8d05..c9c2e5176e46 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -4777,13 +4777,13 @@ group_setup_dp_hash_table(struct group_dpif *group, 
> size_t max_hash)
>  {
>  struct ofputil_bucket *bucket;
>  uint32_t n_buckets = group->up.n_buckets;
> -double total_weight = 0.0;
> +uint64_t total_weight = 0;
>  uint16_t min_weight = UINT16_MAX;
> -uint32_t n_hash;
>  struct webster {
>  struct ofputil_bucket *bucket;
>  uint32_t divisor;
>  double value;
> +int hits;
>  } *webster;
> 
>  if (n_buckets == 0) {
> @@ -4794,7 +4794,6 @@ group_setup_dp_hash_table(struct group_dpif *group, 
> size_t max_hash)
>  webster = xcalloc(n_buckets, sizeof(struct webster));
>  int i = 0;
>  LIST_FOR_EACH (bucket, list_node, >up.buckets) {
> -bucket->aux = 0;
>  if (bucket->weight > 0 && bucket->weight < min_weight) {
>  min_weight = bucket->weight;
>  }
> @@ -4802,6 +4801,7 @@ group_setup_dp_hash_table(struct group_dpif *group, 
> size_t max_hash)
>  webster[i].bucket = bucket;
>  webster[i].divisor = 1;
>  webster[i].value = bucket->weight;
> +webster[i].hits = 0;
>  i++;
>  }
> 
> @@ -4810,19 +4810,19 @@ group_setup_dp_hash_table(struct group_dpif *group, 
> size_t max_hash)
>  free(webster);
>  return false;
>  }
> -VLOG_DBG("  Minimum weight: %d, total weight: %.0f",
> +VLOG_DBG("  Minimum weight: %d, total weight: %"PRIu64,
>   min_weight, total_weight);
> 
> -uint32_t min_slots = ceil(total_weight / min_weight);
> -n_hash = MAX(16, 1L << log_2_ceil(min_slots));
> -
> +uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight);
> +uint64_t 

Re: [ovs-dev] [PATCH v3] Upcall/Slowpath rate limiter for OVS

2018-05-08 Thread Jan Scheurich
Hi Manu,

Thanks for working on this. Two general comments:

1. Is there a chance to add unit test cases for this feature? I know it might 
be difficult due to the real-time character, but perhaps using very low 
parameter values?

2. I believe the number of RL-dropped packets must be accounted for in function 
dpif_netdev_get_stats() as stats->n_missed, otherwise the overall number of 
reported packets may not match the total number of packets processed.

Other comments in-line.

Regards, Jan

> From: Manohar Krishnappa Chidambaraswamy
> Sent: Monday, 07 May, 2018 12:45
> 
> Hi
> 
> Rebased to master and adapted to the new dpif-netdev-perf counters.
> As explained in v2 thread, OFPM_SLOWPATH meters cannot be used as is
> for rate-limiting upcalls, hence reverted back to the simpler method
> using token bucket.

I guess the question was not whether to use meter actions in the datapath to 
implement the upcall rate limiter in dpif-netdev but whether to allow 
configuration of the upcall rate limiter through OpenFlow Meter Mod command for 
special pre-defined meter OFPM_SLOWPATH. In any case, the decision was to drop 
that idea.
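
For context, the "simpler method using token bucket" mentioned above could be
sketched roughly as follows. The names and fields are invented for
illustration and are not taken from the patch; one such bucket would be kept
per PMD thread.

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: per-PMD token bucket limiting the upcall rate.  'rate' is in
 * packets per second, 'burst' is the maximum bucket depth in packets. */
struct upcall_tb {
    uint64_t tokens;            /* Current fill level, in packets. */
    uint64_t rate;              /* Refill rate, packets per second. */
    uint64_t burst;             /* Maximum fill level, in packets. */
    long long int last_ms;      /* Time of the last refill, in ms. */
};

static bool
upcall_tb_admit(struct upcall_tb *tb, long long int now_ms)
{
    uint64_t refill = (now_ms - tb->last_ms) * tb->rate / 1000;

    if (refill > 0) {
        tb->tokens += refill;
        if (tb->tokens > tb->burst) {
            tb->tokens = tb->burst;
        }
        tb->last_ms = now_ms;
    }
    if (tb->tokens > 0) {
        tb->tokens--;
        return true;            /* Admit the upcall. */
    }
    return false;               /* Drop and count the packet as rate-limited. */
}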

> 
> Could you please review this patch?
> 
> Thanx
> Manu
> 
> v2: https://patchwork.ozlabs.org/patch/860687/
> v1: https://patchwork.ozlabs.org/patch/836737/

Please add the list of main changes between the current and the previous 
version in the next revision of the patch. Put it below the '---' separator so 
that it is not part of the commit message.

> 
> Signed-off-by: Manohar K C
> <manohar.krishnappa.chidambarasw...@ericsson.com>
> CC: Jan Scheurich <jan.scheur...@ericsson.com>
> ---
>  Documentation/howto/dpdk.rst | 21 +++
>  lib/dpif-netdev-perf.h   |  1 +
>  lib/dpif-netdev.c| 83 
> 
>  vswitchd/vswitch.xml | 47 +
>  4 files changed, 146 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index 79b626c..bd1eaac 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -739,3 +739,24 @@ devices to bridge ``br0``. Once complete, follow the 
> below steps:
> Check traffic on multiple queues::
> 
> $ cat /proc/interrupts | grep virtio
> +
> +Upcall rate limiting
> +
> +ovs-vsctl can be used to enable and configure upcall rate limit parameters.
> +There are 2 configurable values ``upcall-rate`` and ``upcall-burst`` which
> +take effect when global enable knob ``upcall-rl`` is set to true.

Please explain why upcall rate limiting may be relevant in the context of DPDK 
datapath (upcalls executed in the context of the PMD and affecting datapath 
forwarding capacity). Worth noting here, perhaps, that this rate limiting is 
applied independently per PMD and not a global limit.

Replace "knob" by "configuration parameter"  and put the command to enable rate 
limiting:
 $ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true
before the commands to tune the token bucket parameters. Mention the default 
parameter values?

> +
> +Upcall rate should be set using ``upcall-rate`` in packets-per-sec. For
> +example::
> +
> +$ ovs-vsctl set Open_vSwitch . other_config:upcall-rate=2000
> +
> +Upcall burst should be set using ``upcall-burst`` in packets-per-sec. For
> +example::
> +
> +$ ovs-vsctl set Open_vSwitch . other_config:upcall-burst=2000
> +
> +Upcall ratelimit feature should be globally enabled using ``upcall-rl``. For
> +example::
> +
> +$ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true

> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 5993c25..189213c 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -64,6 +64,7 @@ enum pmd_stat_type {
>   * recirculation. */
>  PMD_STAT_SENT_PKTS, /* Packets that have been sent. */
>  PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
> +PMD_STAT_RATELIMIT_DROP,/* Packets dropped due to upcall policer. */

Name PMD_STAT_RL_DROP and move up in list after PMD_STAT_LOST. Add space before 
comment.

>  PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>  PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>  PMD_N_STATS
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index be31fd0..eebab89 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -101,6 +101,16 @@ static struct shash dp_netdevs 
> OVS_GUARDED_BY(dp_netdev_mutex)
> 
>  static struct vlog_rate_limit upcall_rl = VLOG_RATE_LIMIT_INIT(600, 600);
> 
> +/* Upcall rate-limit parameters */
> +static bool upcall_ratelimit;
> +static unsigned i

[ovs-dev] [PATCH v3 3/3] ofproto-dpif: Use dp_hash as default selection method

2018-04-27 Thread Jan Scheurich
The dp_hash selection method for select groups overcomes the scalability
problems of the current default selection method which, due to L2-L4
hashing during xlation and un-wildcarding of the hashed fields,
basically requires an upcall to the slow path to load-balance every
L4 connection. The consequences are an explosion of datapath flows
(megaflows degenerate to miniflows) and a limitation of the connection
setup rate OVS can handle.

This commit changes the default selection method to dp_hash, provided the
bucket configuration is such that the dp_hash method can accurately
represent the bucket weights with up to 64 hash values. Otherwise we
stick to the original default hash method.

We use the new dp_hash algorithm OVS_HASH_L4_SYMMETRIC to maintain the
symmetry property of the old default hash method.

A controller can explicitly request the old default hash selection method
by specifying selection method "hash" with an empty list of fields in the
Group properties of the OpenFlow 1.5 Group Mod message.

Update the documentation about selection method in the ovs-ofctl man page.

Revise and complete the ofproto-dpif unit tests cases for select groups.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 NEWS   |   2 +
 lib/ofp-group.c|  15 ++-
 ofproto/ofproto-dpif.c |  30 +++--
 ofproto/ofproto-dpif.h |   1 +
 ofproto/ofproto-provider.h |   2 +-
 tests/mpls-xlate.at|  26 ++--
 tests/ofproto-dpif.at  | 315 +++--
 tests/ofproto-macros.at|   7 +-
 utilities/ovs-ofctl.8.in   |  47 ---
 9 files changed, 334 insertions(+), 111 deletions(-)

diff --git a/NEWS b/NEWS
index cd4ffbb..fbd987f 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,8 @@ Post-v2.9.0
- ovs-vsctl: New commands "add-bond-iface" and "del-bond-iface".
- OpenFlow:
  * OFPT_ROLE_STATUS is now available in OpenFlow 1.3.
+ * Default selection method for select groups is now dp_hash with improved
+   accuracy.
- Linux kernel 4.14
  * Add support for compiling OVS with the latest Linux 4.14 kernel
- ovn:
diff --git a/lib/ofp-group.c b/lib/ofp-group.c
index 31b0437..c5ddc65 100644
--- a/lib/ofp-group.c
+++ b/lib/ofp-group.c
@@ -1518,12 +1518,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf 
*payload,
 return OFPERR_OFPBPC_BAD_VALUE;
 }
 
-error = oxm_pull_field_array(payload->data, fields_len,
- >fields);
-if (error) {
-OFPPROP_LOG(, false,
+if (fields_len > 0) {
+error = oxm_pull_field_array(payload->data, fields_len,
+>fields);
+if (error) {
+OFPPROP_LOG(, false,
 "ntr selection method fields are invalid");
-return error;
+return error;
+}
+} else {
+/* Selection_method "hash" w/o fields means default hash method. */
+gp->fields.values_size = 0;
 }
 
 return 0;
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index f5ecd8b..52282a8 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1,5 +1,4 @@
 /*
- * Copyright (c) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, 
Inc.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -4787,7 +4786,7 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 } *webster;
 
 if (n_buckets == 0) {
-VLOG_DBG("  Don't apply dp_hash method without buckets");
+VLOG_DBG("  Don't apply dp_hash method without buckets.");
 return false;
 }
 
@@ -4860,9 +4859,24 @@ group_set_selection_method(struct group_dpif *group)
 const struct ofputil_group_props *props = >up.props;
 const char *selection_method = props->selection_method;
 
+VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id);
 if (selection_method[0] == '\0') {
-VLOG_DBG("No selection method specified.");
-group->selection_method = SEL_METHOD_DEFAULT;
+VLOG_DBG("No selection method specified. Trying dp_hash.");
+/* If the controller has not specified a selection method, check if
+ * the dp_hash selection method with max 64 hash values is appropriate
+ * for the given bucket configuration. */
+if (group_setup_dp_hash_table(group, 64)) {
+/* Use dp_hash selection method with symmetric L4 hash. */
+group->selection_method = SEL_METHOD_DP_HASH;
+group->hash_alg = OVS_HASH_ALG_SYM_L4;
+group->hash_basis = 0;
+VLOG_DBG("Use dp_hash with %d hash values u

[ovs-dev] [PATCH v3 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm

2018-04-27 Thread Jan Scheurich
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in
the netdev datapath. It will be used as default hash algorithm for the
dp_hash-based select groups in a subsequent commit to maintain
compatibility with the symmetry property of the current default hash
selection method.

A new dpif_backer_support field 'max_hash_alg' is introduced to reflect
the highest hash algorithm a datapath supports in the dp_hash action.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |  4 ++
 lib/flow.c| 43 +-
 lib/flow.h|  1 +
 lib/odp-execute.c | 23 ++--
 ofproto/ofproto-dpif-xlate.c  |  7 +++-
 ofproto/ofproto-dpif.c| 45 +++
 ofproto/ofproto-dpif.h|  5 ++-
 7 files changed, 121 insertions(+), 7 deletions(-)

diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
b/datapath/linux/compat/include/linux/openvswitch.h
index 84ebcaf..2bb3cb2 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -720,6 +720,10 @@ struct ovs_action_push_vlan {
  */
 enum ovs_hash_alg {
OVS_HASH_ALG_L4,
+#ifndef __KERNEL__
+   OVS_HASH_ALG_SYM_L4,
+#endif
+   __OVS_HASH_MAX
 };
 
 /*
diff --git a/lib/flow.c b/lib/flow.c
index 09b66b8..c65b288 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -2108,6 +2108,45 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t 
basis)
 return jhash_bytes(, sizeof fields, basis);
 }
 
+/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. */
+uint32_t
+flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis)
+{
+union {
+struct {
+ovs_be16 eth_type;
+ovs_be16 vlan_tci;
+struct eth_addr eth_addr;
+ovs_be16 pad;
+};
+uint32_t word[3];
+} fields;
+
+uint32_t hash = basis;
+int i;
+
+if (flow->packet_type != htonl(PT_ETH)) {
+/* Cannot hash non-Ethernet flows */
+return 0;
+}
+
+for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) {
+fields.eth_addr.be16[i] =
+flow->dl_src.be16[i] ^ flow->dl_dst.be16[i];
+}
+fields.vlan_tci = 0;
+for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) {
+fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK);
+}
+fields.eth_type = flow->dl_type;
+fields.pad = 0;
+
+hash = hash_add(hash, fields.word[0]);
+hash = hash_add(hash, fields.word[1]);
+hash = hash_add(hash, fields.word[2]);
+return hash_finish(hash, basis);
+}
+
 /* Hashes 'flow' based on its L3 through L4 protocol information */
 uint32_t
 flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
@@ -2128,8 +2167,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, 
uint32_t basis,
 hash = hash_add64(hash, a[i] ^ b[i]);
 }
 } else {
-/* Cannot hash non-IP flows */
-return 0;
+/* Revert to hashing L2 headers */
+return flow_hash_symmetric_l2(flow, basis);
 }
 
 hash = hash_add(hash, flow->nw_proto);
diff --git a/lib/flow.h b/lib/flow.h
index af82931..900e8f8 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port)
 
 uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis);
+uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
  bool inc_udp_ports );
 
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 1969f02..c716c41 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -726,14 +726,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 }
 
 switch ((enum ovs_action_attr) type) {
+
 case OVS_ACTION_ATTR_HASH: {
 const struct ovs_action_hash *hash_act = nl_attr_get(a);
 
-/* Calculate a hash value directly.  This might not match the
+/* Calculate a hash value directly. This might not match the
  * value computed by the datapath, but it is much less expensive,
  * and the current use case (bonding) does not require a strict
  * match to work properly. */
-if (hash_act->hash_alg == OVS_HASH_ALG_L4) {
+switch (hash_act->hash_alg) {
+case OVS_HASH_ALG_L4: {
 struct flow flow;
 uint32_t hash;
 
@@ -749,7 +751,22 @@ odp_execute_actions(

[ovs-dev] [PATCH v3 0/3] Use improved dp_hash select group by default

2018-04-27 Thread Jan Scheurich
The current default OpenFlow select group implementation sends every new L4 flow
to the slow path for the balancing decision and installs a 5-tuple "miniflow"
in the datapath to forward subsequent packets of the connection accordingly. 
Clearly this has major scalability issues with many parallel L4 flows and high
connection setup rates.

The dp_hash selection method for the OpenFlow select group was added to OVS
as an alternative. It avoids the scalability issues for the price of an 
additional recirculation in the datapath. The dp_hash method is only available
to OF1.5 SDN controllers speaking the Netronome Group Mod extension to 
configure the selection mechanism. This severely limited the applicability of
the dp_hash select group in the past.

Furthermore, testing revealed that the implemented dp_hash selection often
generated a very uneven distribution of flows over group buckets and didn't 
consider bucket weights at all.

The present patch set in a first step improves the dp_hash selection method to
much more accurately distribute flows over weighted group buckets and to
apply a symmetric dp_hash function to maintain the symmetry property of the
legacy hash function. In a second step it makes the improved dp_hash method
the default in OVS for select groups that can be accurately handled by dp_hash.
That should be the vast majority of cases. Otherwise we fall back to the
legacy slow-path selection method.

The Netronome extension can still be used to override the default decision and
require the legacy slow-path or the dp_hash selection method.

v2 -> v3:
- Fixed another corner case crash reported by Chen Yuefang.
- Fixed several sparse and clang warnings reported by Ben.
- Rewritten the select group unit tests to abstract the checks from
  the behavior of the system-specific hash function implementation.
- Added dpif_backer_support field for dp_hash algorithms to prevent
  using the new OVS_HASH_L4_SYMMETRIC algorithm if it is not 
  supported by the datapath.

v1 -> v2:
- Fixed crashes for corner cases reported by Chen Yuefang.
- Fixed group ref leakage with dp_hash reported by Chen Yuefang.
- Changed all xlation logging from INFO to DBG.
- Revised, completed and detailed select group unit test cases in 
ofproto-dpif.
- Updated selection_method documentation in ovs-ofctl man page.
- Added NEWS item.


Jan Scheurich (3):
  userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
  ofproto-dpif: Improve dp_hash selection method for select groups
  ofproto-dpif: Use dp_hash as default selection method

 NEWS  |   2 +
 datapath/linux/compat/include/linux/openvswitch.h |   4 +
 include/openvswitch/ofp-group.h   |   1 +
 lib/flow.c|  43 ++-
 lib/flow.h|   1 +
 lib/odp-execute.c |  23 +-
 lib/odp-util.c|   4 +-
 lib/ofp-group.c   |  15 +-
 ofproto/ofproto-dpif-xlate.c  |  89 --
 ofproto/ofproto-dpif.c| 209 +-
 ofproto/ofproto-dpif.h|  19 +-
 ofproto/ofproto-provider.h|   2 +-
 tests/mpls-xlate.at   |  26 +-
 tests/ofproto-dpif.at | 314 +-
 tests/ofproto-macros.at   |   7 +-
 utilities/ovs-ofctl.8.in  |  47 ++--
 16 files changed, 667 insertions(+), 139 deletions(-)

-- 
1.9.1



[ovs-dev] [PATCH v3 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-27 Thread Jan Scheurich
The current implementation of the "dp_hash" selection method suffers
from two deficiencies: 1. The hash mask and hence the number of dp_hash
values is just large enough to cover the number of group buckets, but
does not consider the case that buckets have different weights. 2. The
xlate-time selection of best bucket from the masked dp_hash value often
results in bucket load distributions that are quite different from the
bucket weights because the number of available masked dp_hash values
is too small (2-6 bits compared to 32 bits of a full hash in the default
hash selection method).

This commit provides a more accurate implementation of the dp_hash
select group by applying the well known Webster method for distributing
a small number of "seats" fairly over the weighted "parties"
(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
The dp_hash mask is automatically chosen large enough to provide good
enough accuracy even with widely differing weights.

This distribution happens at group modification time and the resulting
table is stored with the group-dpif struct. At xlation time, we use the
masked dp_hash values as index to look up the assigned bucket.

If the assigned bucket is not live, we do a circular search over the
mapping table until we find the first live bucket. As the buckets in
the table are by construction in pseudo-random order with a frequency
according to their weight, this method maintains correct distribution
even if one or more buckets are non-live.

Xlation is further simplified by storing some derived select group state
at group construction in struct group-dpif in a form better suited for
xlation purposes.

Adapted the unit test case for dp_hash select group accordingly.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 include/openvswitch/ofp-group.h |   1 +
 lib/odp-util.c  |   4 +-
 ofproto/ofproto-dpif-xlate.c|  82 ++
 ofproto/ofproto-dpif.c  | 148 
 ofproto/ofproto-dpif.h  |  13 
 tests/ofproto-dpif.at   |  15 ++--
 6 files changed, 227 insertions(+), 36 deletions(-)

diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h
index 8d893a5..af4033d 100644
--- a/include/openvswitch/ofp-group.h
+++ b/include/openvswitch/ofp-group.h
@@ -47,6 +47,7 @@ struct bucket_counter {
 /* Bucket for use in groups. */
 struct ofputil_bucket {
 struct ovs_list list_node;
+uint16_t aux;   /* Padding. Also used for temporary data. */
 uint16_t weight;/* Relative weight, for "select" groups. */
 ofp_port_t watch_port;  /* Port whose state affects whether this bucket
  * is live. Only required for fast failover
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 6db241a..2db4e9d 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -595,7 +595,9 @@ format_odp_hash_action(struct ds *ds, const struct 
ovs_action_hash *hash_act)
 ds_put_format(ds, "hash(");
 
 if (hash_act->hash_alg == OVS_HASH_ALG_L4) {
-ds_put_format(ds, "hash_l4(%"PRIu32")", hash_act->hash_basis);
+ds_put_format(ds, "l4(%"PRIu32")", hash_act->hash_basis);
+} else if (hash_act->hash_alg == OVS_HASH_ALG_SYM_L4) {
+ds_put_format(ds, "sym_l4(%"PRIu32")", hash_act->hash_basis);
 } else {
 ds_put_format(ds, "Unknown hash algorithm(%"PRIu32")",
   hash_act->hash_alg);
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 05db090..e6d97a4 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4380,35 +4380,59 @@ xlate_hash_fields_select_group(struct xlate_ctx *ctx, 
struct group_dpif *group,
 }
 }
 
+static struct ofputil_bucket *
+group_dp_hash_best_bucket(struct xlate_ctx *ctx,
+  const struct group_dpif *group,
+  uint32_t dp_hash)
+{
+struct ofputil_bucket *bucket, *best_bucket = NULL;
+uint32_t n_hash = group->hash_mask + 1;
+
+uint32_t hash = dp_hash &= group->hash_mask;
+ctx->wc->masks.dp_hash |= group->hash_mask;
+
+/* Starting from the original masked dp_hash value iterate over the
+ * hash mapping table to find the first live bucket. As the buckets
+ * are quasi-randomly spread over the hash values, this maintains
+ * a distribution according to bucket weights even when some buckets
+ * are non-live. */
+for (int i = 0; i < n_hash; i++) {
+bucket = group->hash_map[(hash + i) % n_hash];
+if (bucket_is_alive(ctx, bucket, 0)) {
+best_bucket = bucket;
+bre

Re: [ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-04-26 Thread Jan Scheurich
> > I hope that Clang is intelligent enough to recognize this. If not, I
> > wouldn't know how to fix it other than by removing OVS_REQUIRES(s-
> > >stats_mutex) from pmd_perf_stats_clear_lock() and just rely on comments.
> >
> > BR, Jan
> 
> Thanks Jan, that resolves the issue and there's a clear travis build now. 
> I'll add this series as part of the next pull request.
> 
> Thanks all for the work reviewing and testing this.
> 
> Ian

Perfect!
Thank you too,
Jan


[ovs-dev] Mempool redesign for OVS 2.10

2018-04-26 Thread Jan Scheurich
Hi,



Thanks, everyone, for re-opening the discussion around the new packet mempool 
handling for 2.10.



Before we agree on what to actually implement I’d like to summarize my 
understanding of the requirements that have been discussed so far. Based on 
those I want to share some thoughts about how we can best address these 
requirements.



Requirements:



R1 (Backward compatibility):

The new mempool handling shall function at least as well as the OVS 2.9 
design base given any specific configuration of OVS-DPDK: hugepage memory, 
PMDs, ports, queues, MTU sizes, traffic flows. This is to ensure that we can 
upgrade OVS in existing deployments without risk of breaking anything.



R2 (Dimensioning for static deployments):

It shall be possible for an operator to calculate the amount of memory needed 
for packet mempools in a given static (maximum) configuration (PMDs, ethernet 
ports and queues, maximum number of vhost ports, MTU sizes) to reserve 
sufficient hugepages for OVS.



R3 (Safe operation):

If the mempools are dimensioned correctly, it shall not be possible that OVS 
runs out of mbufs for packet processing.



R4 (Minimal footprint):

The packet mempool size needed for safe operation of OVS should be as small as 
possible.



R5 (Dynamic mempool allocation):

It should be possible to automatically adjust the size of packet mempools at 
run-time when changing the OVS configuration e.g. adding PMDs, adding ports, 
adding rx/tx queues, changing the port MTU size. (Note: Shrinking the mempools 
with reducing OVS configuration is less important.)



Actual maximum mbuf consumption in OVS DPDK:


  1.  Phy rx queues: Sum over dpdk dev: (dev->requested_n_rxq * 
dev->requested_rxq_size)
Note: Normally the number of rx queues should not exceed the number of PMDs.
  2.  Phy tx queues: Sum over dpdk dev: (#active tx queues (=#PMDs) * 
dev->requested_txq_size)

Note 1: These are hogged because of DPDK PMD’s lazy release of transmitted 
mbufs.
Note 2: Stored mbufs in a tx queue are coming from all ports.

  3.  One rx batch per PMD during processing: #PMDs * 32
  4.  One batch per active tx queue for time-based batching: 32 * #devs * #PMDs



Assuming rx/tx queue size of 2K for physical ports and #rx queues = #PMDs 
(RSS), the upper limit for the used mbufs would be



(*1*) #dpdk devs * #PMDs * 4K   +  (#dpdk devs + #vhost devs) * #PMDs * 32  
+  #PMDs * 32



Examples:

  *   With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs ) 
this yields  32K + 17K = 49K mbufs
  *   For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs ) this 
would yield  128K + 66K = 194K mbufs

Roughly 1/3rd of the total mbufs are hogged in dpdk dev rx queues. The 
remaining 2/3rds are populated with an arbitrary mix of mbufs from all sources.
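
For illustration, formula (*1*) can be written as a small helper, a sketch
based on the assumptions above (2K rx/tx ring size per physical port,
#rx queues = #PMDs, batch size 32); the function name is hypothetical:

/* Sketch only: worst-case mbuf usage for the netdev datapath. */
static unsigned long
worst_case_mbufs(unsigned n_dpdk_devs, unsigned n_vhost_devs,
                 unsigned n_pmds, unsigned ring_size)
{
    /* Phy rx + tx queues: #dpdk devs * #PMDs * 2 * ring_size. */
    unsigned long queues = (unsigned long) n_dpdk_devs * n_pmds * 2 * ring_size;
    /* One tx batch per active tx queue: (#dpdk + #vhost devs) * #PMDs * 32. */
    unsigned long tx_batches =
        (unsigned long) (n_dpdk_devs + n_vhost_devs) * n_pmds * 32;
    /* One rx batch per PMD: #PMDs * 32. */
    unsigned long rx_batches = (unsigned long) n_pmds * 32;

    return queues + tx_batches + rx_batches;
}

/* worst_case_mbufs(2, 128, 4, 2048) and worst_case_mbufs(4, 256, 8, 2048)
 * yield roughly 49K and 194K mbufs, matching the two examples above. */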



Legacy shared mempool handling up to OVS 2.9:


  *   One mempool per NUMA node and used MTU size range.
  *   Each mempool has the maximum of (256K, 128K, 64K, 32K or 16K) mbufs 
available in DPDK at mempool creation.
  *   Each mempool is shared among all ports on its NUMA node with an MTU in 
its range.
 *   All rx queues of a port share the same mempool



The legacy code trivially satisfies R1. Its good feature is that the mempools 
are shared so that it avoids the bloating of dedicated mempools per port 
implied by the handling on master (see below).



Apart from that it does not fulfill any of the requirements.

  *   It swallows all available hugepage memory to allocate up to 256K mbufs 
per NUMA node, even though that is far more than typically needed (violating 
R4).

  *   The actual size of created mempools depends on the order of creation and 
the hugepage memory available. Early mempools are over-dimensioned, later 
mempools might be under-dimensioned. Operation is not at all safe (violating R3)
  *   It doesn’t provide any help for the operator to dimension and reserve 
hugepages for OVS (violating R2)
  *   The only dynamicity is that it creates additional mempools for new MTU 
size ranges only when they are needed. Due to greedy initial allocation these 
are likely to fail (violating R5).



My take is that even though the shared mempool concept is good, the legacy 
mempool handling should not be kept as is.



Mempool per port scheme (currently implemented on master):



From the above mbuf utilization calculation it is clear that only the dpdk rx 
queues are populated exclusively with mbufs from the port’s mempool. All other 
places are populated with mbufs from all ports, in the case of tx queues 
typically not even their own. As it is not possible to predict the assignment 
of rx queues to PMDs and the flow of packets between ports, safety requirement 
R3 implies that each port mempool must be dimensioned for the worst case, i.e.



[#PMDs * 2K ] +  #dpdk devs * #PMDs * 2K   +  (#dpdk devs + #vhost devs) * 
#PMDs * 32  +  #PMDs * 32



Even though the first term [#PMDs * 2K] is only needed for physical ports this 

Re: [ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-04-25 Thread Jan Scheurich
Hi Ian,

Thanks for checking this. I suggest addressing Clang's complaint with the 
following incremental:

diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index 47ce2c2..c7b8e7b 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -442,6 +442,7 @@ pmd_perf_stats_clear(struct pmd_perf_stats *s)

 inline void
 pmd_perf_start_iteration(struct pmd_perf_stats *s)
+OVS_REQUIRES(s->stats_mutex)
 {
 if (s->clear) {
 /* Clear the PMD stats before starting next iteration. */

The mutex pmd->perf_stats.stats_mutex is taken by the calling pmd_thread_main() 
while it is in the poll loop. So the pre-requisite for 
pmd_perf_start_iteration() is in place. It is essential for performance that 
pmd_perf_start_iteration() does not take the lock in each iteration.

I hope that Clang is intelligent enough to recognize this. If not, I wouldn't 
know how to fix it other than by removing OVS_REQUIRES(s->stats_mutex) from 
pmd_perf_stats_clear_lock() and just rely on comments.

BR, Jan

> -Original Message-
> From: Stokes, Ian [mailto:ian.sto...@intel.com]
> Sent: Wednesday, 25 April, 2018 11:55
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; i.maxim...@samsung.com; O Mahony, Billy 
> <billy.o.mah...@intel.com>
> Subject: RE: [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> > This patch instruments the dpif-netdev datapath to record detailed
> > statistics of what is happening in every iteration of a PMD thread.
> >
> > The collection of detailed statistics can be controlled by a new
> > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > By default it is disabled. The run-time overhead, when enabled, is
> > in the order of 1%.
> 
> Hi Jan, thanks for the patch, 1 comment below.
> 
> [snip]
> 
> > +
> > +/* This function can be called from the anywhere to clear the stats
> > + * of PMD and non-PMD threads. */
> > +void
> > +pmd_perf_stats_clear(struct pmd_perf_stats *s)
> > +{
> > +if (ovs_mutex_trylock(>stats_mutex) == 0) {
> > +/* Locking successful. PMD not polling. */
> > +pmd_perf_stats_clear_lock(s);
> > +ovs_mutex_unlock(>stats_mutex);
> > +} else {
> > +/* Request the polling PMD to clear the stats. There is no need
> > to
> > + * block here as stats retrieval is prevented during clearing. */
> > +s->clear = true;
> > +}
> > +}
> > +
> > +/* Functions recording PMD metrics per iteration. */
> > +
> > +inline void
> > +pmd_perf_start_iteration(struct pmd_perf_stats *s)
> > +{
> 
> Clang will complain that the mutex must be exclusively held for 
> s->stats_mutex in this function.
> I can add this with the following incremental
> 
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -417,6 +417,7 @@ pmd_perf_stats_clear(struct pmd_perf_stats *s)
>  inline void
>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>  {
> +ovs_mutex_lock(>stats_mutex);
>  if (s->clear) {
>  /* Clear the PMD stats before starting next iteration. */
>  pmd_perf_stats_clear_lock(s);
> @@ -433,6 +434,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s)
>  /* In case last_tsc has never been set before. */
>  s->start_tsc = cycles_counter_update(s);
>  }
> +ovs_mutex_unlock(>stats_mutex);
>  }
> 
> But in that case is the ovs_mutex_trylock in function pmd_perf_stats_clear() 
> made redundant?
> 
> Ian
> 
> > +if (s->clear) {
> > +/* Clear the PMD stats before starting next iteration. */
> > +pmd_perf_stats_clear_lock(s);
> > +}
> > +s->iteration_cnt++;
> > +/* Initialize the current interval stats. */
> > +memset(>current, 0, sizeof(struct iter_stats));
> > +if (OVS_LIKELY(s->last_tsc)) {
> > +/* We assume here that last_tsc was updated immediately prior at
> > + * the end of the previous iteration, or just before the first
> > + * iteration. */
> > +s->start_tsc = s->last_tsc;
> > +} else {
> > +/* In case last_tsc has never been set before. */
> > +s->start_tsc = cycles_counter_update(s);
> > +}
> > +}
> > +


[ovs-dev] [PATCH v12 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-04-20 Thread Jan Scheurich
If the caller provides a non-NULL qfill pointer and the netdev
implementation supports reading the rx queue fill level, the rxq_recv()
function returns the remaining number of packets in the rx queue after
reception of the packet burst to the caller. If the implementation does
not support this, it returns -ENOTSUP instead. Reading the remaining queue
fill level should not substantially slow down the recv() operation.

A first implementation is provided for ethernet and vhostuser DPDK ports
in netdev-dpdk.c.

This output parameter will be used in the upcoming commit for PMD
performance metrics to supervise the rx queue fill level for DPDK
vhostuser ports.
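
For illustration, a caller-side sketch of how the new output parameter might
be used (a hypothetical caller, not part of this patch):

#include <errno.h>
#include "dp-packet.h"
#include "netdev.h"

/* Sketch only: receive a burst and note how many packets were still queued. */
static void
poll_rxq_with_qfill(struct netdev_rxq *rx)
{
    struct dp_packet_batch batch;
    int qfill;

    dp_packet_batch_init(&batch);
    if (!netdev_rxq_recv(rx, &batch, &qfill)) {
        if (qfill >= 0) {
            /* 'qfill' packets remained in the rx queue after this burst,
             * e.g. to feed a max. vhostuser qlen metric. */
        } else {
            /* qfill == -ENOTSUP: this netdev cannot report the fill level. */
        }
        /* ... process 'batch' ... */
    }
}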

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 lib/dpif-netdev.c |  2 +-
 lib/netdev-bsd.c  |  8 +++-
 lib/netdev-dpdk.c | 41 -
 lib/netdev-dummy.c|  8 +++-
 lib/netdev-linux.c|  7 ++-
 lib/netdev-provider.h |  8 +++-
 lib/netdev.c  |  5 +++--
 lib/netdev.h  |  3 ++-
 8 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index be31fd0..7ce3943 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3277,7 +3277,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread 
*pmd,
 pmd->ctx.last_rxq = rxq;
 dp_packet_batch_init();
 
-error = netdev_rxq_recv(rxq->rx, );
+error = netdev_rxq_recv(rxq->rx, , NULL);
 if (!error) {
 /* At least one packet received. */
 *recirc_depth_get() = 0;
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index 05974c1..b70f327 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq, struct 
dp_packet *buffer)
 }
 
 static int
-netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
+netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+int *qfill)
 {
 struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_);
 struct netdev *netdev = rxq->up.netdev;
@@ -643,6 +644,11 @@ netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct 
dp_packet_batch *batch)
 batch->packets[0] = packet;
 batch->count = 1;
 }
+
+if (qfill) {
+*qfill = -ENOTSUP;
+}
+
 return retval;
 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ee39cbe..a4fc382 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1812,13 +1812,13 @@ netdev_dpdk_vhost_update_rx_counters(struct 
netdev_stats *stats,
  */
 static int
 netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
-   struct dp_packet_batch *batch)
+   struct dp_packet_batch *batch, int *qfill)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
 struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
 uint16_t nb_rx = 0;
 uint16_t dropped = 0;
-int qid = rxq->queue_id;
+int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ;
 int vid = netdev_dpdk_get_vid(dev);
 
 if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured
@@ -1826,14 +1826,23 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 return EAGAIN;
 }
 
-nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM + VIRTIO_TXQ,
-dev->mp,
+nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
 (struct rte_mbuf **) batch->packets,
 NETDEV_MAX_BURST);
 if (!nb_rx) {
 return EAGAIN;
 }
 
+if (qfill) {
+if (nb_rx == NETDEV_MAX_BURST) {
+/* The DPDK API returns a uint32_t which often has invalid bits in
+ * the upper 16-bits. Need to restrict the value to uint16_t. */
+*qfill = rte_vhost_rx_queue_count(vid, qid) & UINT16_MAX;
+} else {
+*qfill = 0;
+}
+}
+
 if (policer) {
 dropped = nb_rx;
 nb_rx = ingress_policer_run(policer,
@@ -1854,7 +1863,8 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 }
 
 static int
-netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
+netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
+ int *qfill)
 {
 struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
 struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
@@ -1891,6 +1901,14 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct 
dp_packet_batch *batch)
 batch->count = nb_rx;
 dp_packet_batch_init_packet_fields(batch);
 
+if (qfill) {
+if (nb_rx == NETDEV_MAX_BURST) {
+*qfill = rte_eth_rx_queue_count(rx->port_id, rxq->queue_id);
+} else {
+*qfill = 0;
+}
+}
+
 return 0;
 }
 
@@ -3172,6 +3190,19 @@ vr

[ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-04-20 Thread Jan Scheurich
This patch instruments the dpif-netdev datapath to record detailed
statistics of what is happening in every iteration of a PMD thread.

The collection of detailed statistics can be controlled by a new
Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
By default it is disabled. The run-time overhead, when enabled, is
in the order of 1%.

The covered metrics per iteration are:
  - cycles
  - packets
  - (rx) batches
  - packets/batch
  - max. vhostuser qlen
  - upcalls
  - cycles spent in upcalls

This raw recorded data is used threefold:

1. In histograms for each of the following metrics:
   - cycles/iteration (log.)
   - packets/iteration (log.)
   - cycles/packet
   - packets/batch
   - max. vhostuser qlen (log.)
   - upcalls
   - cycles/upcall (log)
   The histogram bins are divided linearly or logarithmically (see the
   log-binning sketch after this list).

2. A cyclic history of the above statistics for 999 iterations

3. A cyclic history of the cumulative/average values per millisecond
   wall clock for the last 1000 milliseconds:
   - number of iterations
   - avg. cycles/iteration
   - packets (Kpps)
   - avg. packets/batch
   - avg. max vhost qlen
   - upcalls
   - avg. cycles/upcall
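
The logarithmic histograms place their bin walls evenly on a log scale. A
minimal sketch of such a binning helper is shown below (assuming 32 walls;
the actual histogram_walls_set_log() in lib/dpif-netdev-perf.c may differ in
detail):

    #include <math.h>
    #include <stdint.h>

    /* Sketch: fill 'walls' with 32 values spaced evenly on a logarithmic
     * scale between 'min' and 'max'. */
    static void
    example_walls_set_log(uint32_t walls[32], uint32_t min, uint32_t max)
    {
        double lo = log(min ? min : 1);
        double hi = log(max);

        for (int i = 0; i < 32; i++) {
            /* Interpolate on the log scale, then convert back. */
            walls[i] = (uint32_t) exp(lo + (hi - lo) * i / 31);
        }
    }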

The gathered performance metrics can be printed at any time with the
new CLI command

ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
[-pmd core] [dp]

The options are

-nh:            Suppress the histograms
-it iter_len:   Display the last iter_len iteration stats
-ms ms_len: Display the last ms_len millisecond stats
-pmd core:  Display only the specified PMD

The performance statistics are reset with the existing
dpif-netdev/pmd-stats-clear command.

The output always contains the following global PMD statistics,
similar to the pmd-stats-show command:

Time: 15:24:55.270
Measurement duration: 1.008 s

pmd thread numa_id 0 core_id 1:

  Cycles:            2419034712  (2.40 GHz)
  Iterations:            572817  (1.76 us/it)
  - idle:                486808  (15.9 % cycles)
  - busy:                 86009  (84.1 % cycles)
  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
  Datapath passes:      3599415  (1.50 passes/pkt)
  - EMC hits:            336472  ( 9.3 %)
  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
  - Lost upcalls:             0  ( 0.0 %)
  Tx packets:           2399607  (2381 Kpps)
  Tx batches:            171400  (14.00 pkts/batch)

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 NEWS|   4 +
 lib/automake.mk |   1 +
 lib/dpif-netdev-perf.c  | 462 +++-
 lib/dpif-netdev-perf.h  | 197 ---
 lib/dpif-netdev-unixctl.man | 157 +++
 lib/dpif-netdev.c   | 187 --
 manpages.mk |   2 +
 vswitchd/ovs-vswitchd.8.in  |  27 +--
 vswitchd/vswitch.xml|  12 ++
 9 files changed, 985 insertions(+), 64 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

diff --git a/NEWS b/NEWS
index cd4ffbb..a665c7f 100644
--- a/NEWS
+++ b/NEWS
@@ -23,6 +23,10 @@ Post-v2.9.0
other IPv4/IPv6-based protocols whenever a reject ACL rule is hit.
  * ACL match conditions can now match on Port_Groups as well as address
sets that are automatically generated by Port_Groups.
+   - Userspace datapath:
+ * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
+ * Detailed PMD performance metrics available with new command
+ ovs-appctl dpif-netdev/pmd-perf-show
 
 v2.9.0 - 19 Feb 2018
 
diff --git a/lib/automake.mk b/lib/automake.mk
index 915a33b..3276aaa 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -491,6 +491,7 @@ MAN_FRAGMENTS += \
lib/dpctl.man \
lib/memory-unixctl.man \
lib/netdev-dpdk-unixctl.man \
+   lib/dpif-netdev-unixctl.man \
lib/ofp-version.man \
lib/ovs.tmac \
lib/service.man \
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index f06991a..caa0e27 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -15,18 +15,333 @@
  */
 
 #include 
+#include 
 
+#include "dpif-netdev-perf.h"
 #include "openvswitch/dynamic-string.h"
 #include "openvswitch/vlog.h"
-#include "dpif-netdev-perf.h"
+#include "ovs-thread.h"
 #include "timeval.h"
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#ifdef DPDK_NETDEV
+static uint64_t
+get_tsc_hz(void)
+{
+return rte_get_tsc_hz();
+}
+#else
+/* This function is only invoked from PMD threads which depend on DPDK.
+ * A dummy function is sufficient when building without DPDK_NETDEV. */
+static uint64_t
+get_tsc_hz(void)
+{
+return 1;
+}
+#endif
+
+/* Histogram functions. */
+
+static void
+histogram_walls_set_lin(struct histogram *hist, uint32_t 

[ovs-dev] [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-20 Thread Jan Scheurich
This patch enhances dpif-netdev-perf to detect iterations with
suspicious statistics according to the following criteria:

- iteration lasts longer than US_THR microseconds (default 250).
  This can be used to capture events where a PMD is blocked or
  interrupted for such a period of time that there is a risk for
  dropped packets on any of its Rx queues.

- max vhost qlen exceeds a threshold Q_THR (default 128). This can
  be used to infer virtio queue overruns and dropped packets inside
  a VM, which are not visible in OVS otherwise.

Such suspicious iterations can be logged together with their iteration
statistics to be able to correlate them to packet drop or other events
outside OVS.
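
A minimal sketch of how these two criteria could be evaluated at the end of
an iteration is given below. The names and the cycle conversion are
assumptions for illustration; the actual check is part of
pmd_perf_end_iteration().

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: flag an iteration as suspicious if it exceeded the duration
     * threshold (us_thr, in microseconds) or the vhost queue fill threshold
     * (q_thr).  'tsc_hz' is the TSC frequency used to convert microseconds
     * into cycles. */
    static bool
    example_iteration_is_suspicious(uint64_t iter_cycles, uint64_t tsc_hz,
                                    int max_vhost_qfill,
                                    uint32_t us_thr, uint32_t q_thr)
    {
        uint64_t cycle_thr = tsc_hz / 1000000 * us_thr;

        return iter_cycles > cycle_thr || max_vhost_qfill > (int) q_thr;
    }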

A new command is introduced to enable/disable logging at run-time and
to adjust the above thresholds for suspicious iterations:

ovs-appctl dpif-netdev/pmd-perf-log-set on | off
[-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]

Turn logging on or off at run-time (on|off).

-b before:  The number of iterations before the suspicious iteration to
be logged (default 5).
-a after:   The number of iterations after the suspicious iteration to
be logged (default 5).
-e: Extend logging interval if another suspicious iteration is
detected before logging occurs.
-ne:        Do not extend logging interval (default).
-q qlen:Suspicious vhost queue fill level threshold. Increase this
to 512 if the Qemu supports 1024 virtio queue length.
(default 128).
-us usec:   change the duration threshold for a suspicious iteration
(default 250 us).

Note: Logging of suspicious iterations itself consumes a considerable
amount of processing cycles of a PMD which may be visible in the iteration
history. In the worst case this can lead OVS to detect another
suspicious iteration caused by logging.

If more than 100 iterations around a suspicious iteration have been
logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne)
to prevent the logging itself from causing continuous further logging.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 NEWS|   2 +
 lib/dpif-netdev-perf.c  | 223 
 lib/dpif-netdev-perf.h  |  21 +
 lib/dpif-netdev-unixctl.man |  59 
 lib/dpif-netdev.c   |   5 +
 5 files changed, 310 insertions(+)

diff --git a/NEWS b/NEWS
index a665c7f..7259492 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,8 @@ Post-v2.9.0
  * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
  * Detailed PMD performance metrics available with new command
  ovs-appctl dpif-netdev/pmd-perf-show
+ * Supervision of PMD performance metrics and logging of suspicious
+   iterations
 
 v2.9.0 - 19 Feb 2018
 
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index caa0e27..47ce2c2 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -25,6 +25,24 @@
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
+   in microseconds. */
+#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
+#define LOG_IT_BEFORE 5 /* Number of iterations to log before
+   suspicious iteration. */
+#define LOG_IT_AFTER 5  /* Number of iterations to log after
+   suspicious iteration. */
+
+bool log_enabled = false;
+bool log_extend = false;
+static uint32_t log_it_before = LOG_IT_BEFORE;
+static uint32_t log_it_after = LOG_IT_AFTER;
+static uint32_t log_us_thr = ITER_US_THRESHOLD;
+uint32_t log_q_thr = VHOST_QUEUE_FULL;
+uint64_t iter_cycle_threshold;
+
+static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600, 600);
+
 #ifdef DPDK_NETDEV
 static uint64_t
 get_tsc_hz(void)
@@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s)
 histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
 s->iteration_cnt = 0;
 s->start_ms = time_msec();
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 }
 
 void
@@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
 history_init(&s->milliseconds);
 s->start_ms = time_msec();
 s->milliseconds.sample[0].timestamp = s->start_ms;
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 /* Clearing finished. */
 s->clear = false;
 ovs_mutex_unlock(&s->clear_mutex);
@@ -442,6 +468,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int 
rx_packets,
 uint64_t now_tsc = cycles_counter_update(s);
 struct iter_stats *cum_ms;
 uint64_t cycles, cycles_per

[ovs-dev] [PATCH v12 0/3] dpif-netdev: Detailed PMD performance metrics and supervision

2018-04-20 Thread Jan Scheurich
The run-time performance of PMDs is often difficult to understand and 
trouble-shoot. The existing PMD statistics counters only provide a coarse 
grained average picture. At packet rates of several Mpps sporadic drops of
packet bursts happen at sub-millisecond time scales and are impossible to
capture and analyze with existing tools.

This patch collects a large number of important PMD performance metrics
per PMD iteration, maintaining histograms and circular histories for
iteration metrics and millisecond averages. To capture sporadic drop
events, the patch set can be configured to monitor iterations for suspicious
metrics and to log the neighborhood of such iterations for off-line analysis.

The extra cost for the performance metric collection and the supervision has
been measured to be in the order of 1% compared to the base commit in a PVP
setup with L3 pipeline over VXLAN tunnels. For that reason the metrics
collection is disabled by default and can be enabled at run-time through
configuration.

v11 -> v12:
* Rebased to master (commit 83c2757bd)
* Clarified meaning of recv() param *qfill in netdev-provider.h (Ben)

v10 -> v11:
* Rebased to master (commit 00a0a011d)
* Implemented comments on v10 by Ilya, Aaron and Ian.
* Replaced broken macro ATOMIC_LLONG_LOCK_FREE with working
  macro ATOMIC_ALWAYS_LOCK_FREE_8B.
* Changed iteration key in iteration history from TSC timestamp to
  iteration counter.
* Bugfix: Suspicious iteration logged was one off the actual suspicious
  iteration.

v9 -> v10:
* Implemented missed comment by Ilya on v8: use ATOMIC_LLONG_LOCK_FREE
* Fixed travis and checkpatch errors reported by Ian on v9.

v8 -> v9:
* Rebased to master (commit cb8cbbbe9)
* Implemented minor comments on v8 by Billy

v7 -> v8:
* Rebased on to master (commit 4e99b70df)
* Implemented comments from Ilya Maximets and Billy O'Mahony.
* Replaced netdev_rxq_length() introduced in v7 by optional out
  parameter for the remaining rx queue len in netdev_rxq_recv().
* Fixed thread synchronization issues in clearing PMD stats:
  - Use mutex to control whether to clear from main thread directly
or in PMD at start of next iteration.
  - Use mutex to prevent concurrent clearing and printing of metrics.
* Added tx packet and batch stats to pmd-perf-show output.
* Delay warning for suspicious iteration to the iteration in which
  we also log the neighborhood to not pollute the logged iteration
  stats with logging costs.
* Corrected the exact number of iterations logged before and after a
  suspicious iteration.
* Introduced options -e and -ne in pmd-perf-log-set to control whether
  to *extend* the range of logged iterations when additional suspicious
  iterations are detected before the scheduled end of logging interval
  is reached.
* Exclude logging cycles from the iteration stats to avoid confusing
  ghost peaks.
* Performance impact compared to master less than 1% even with
  supervision enabled.

v5 -> v7:
* Rebased on to dpdk_merge (commit e68)
  - New base contains earlier refactoring parts of series.
* Implemented comments from Ilya Maximets and Billy O'Mahony.
* Replaced piggybacking qlen on dp_packet_batch with a new netdev API
  netdev_rxq_length().
* Thread-safe clearing of pmd counters in pmd_perf_start_iteration().
* Fixed bug in reporting datapath stats.
* Work-around a bug in DPDK rte_vhost_rx_queue_count() which sometimes
  returns bogus values in the upper 16 bits of the uint32_t return value.

v4 -> v5:
* Rebased to master (commit e9de6c0)
* Implemented comments from Aaron Conole and Darrel Ball

v3 -> v4:
* Rebased to master (commit 4d0a31b)
  - Reverting changes to struct dp_netdev_pmd_thread.
* Make metrics collection configurable.
* Several bugfixes.

v2 -> v3:
* Rebased to OVS master (commit 3728b3b).
* Non-trivial adaptation to struct dp_netdev_pmd_thread.
  - refactored in commit a807c157 (Bhanu).
* No other changes compared to v2.

v1 -> v2:
* Rebased to OVS master (commit 7468ec788).
* No other changes compared to v1.

Jan Scheurich (3):
  netdev: Add optional qfill output parameter to rxq_recv()
  dpif-netdev: Detailed performance stats for PMDs
  dpif-netdev: Detection and logging of suspicious PMD iterations

 NEWS|   6 +
 lib/automake.mk |   1 +
 lib/dpif-netdev-perf.c  | 685 +++-
 lib/dpif-netdev-perf.h  | 218 --
 lib/dpif-netdev-unixctl.man | 216 ++
 lib/dpif-netdev.c   | 192 -
 lib/netdev-bsd.c|   8 +-
 lib/netdev-dpdk.c   |  41 ++-
 lib/netdev-dummy.c  |   8 +-
 lib/netdev-linux.c  |   7 +-
 lib/netdev-provider.h   |   8 +-
 lib/netdev.c|   5 +-
 lib/netdev.h|   3 +-
 manpages.mk |   2 +
 vswitchd/ovs-vswitchd.8.in  |  27 +-
 vswitchd/vswitch.xml|  12 +
 16 files changed, 1363 insertions(+), 76 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

Re: [ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-18 Thread Jan Scheurich
> How about this approach, which should cleanly eliminate the warning?
> 
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index e1a5c097f3aa..362339a4abb4 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -4780,22 +4780,17 @@ group_setup_dp_hash_table(struct group_dpif *group, 
> size_t max_hash)
> 
>  /* Use Webster method to distribute hash values over buckets. */
>  for (int hash = 0; hash < n_hash; hash++) {
> -double max_val = 0.0;
> -struct webster *winner;
> -for (i = 0; i < n_buckets; i++) {
> -if (webster[i].value > max_val) {
> -max_val = webster[i].value;
> +struct webster *winner = &webster[0];
> +for (i = 1; i < n_buckets; i++) {
> +if (webster[i].value > winner->value) {
>  winner = &webster[i];
>  }
>  }
> -#pragma GCC diagnostic push
> -#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
>  /* winner is a reference to a webster[] element initialized above. */
>  winner->divisor += 2;
>  winner->value = (double) winner->bucket->weight / winner->divisor;
>  group->hash_map[hash] = winner->bucket;
>  winner->bucket->aux++;
> -#pragma GCC diagnostic pop
>  }

Thank you, Ben, for your thorough checks. Yes, your approach is better and 
compiles w/o warnings. 
 
Regards, Jan




Re: [ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-17 Thread Jan Scheurich
Hi Ychen,

Thank you for finding yet another corner case. I will fix it in the next 
version with the following incremental:

diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 8f71083..674b3b5 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4762,6 +4762,11 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 VLOG_DBG("  Minimum weight: %d, total weight: %.0f",
  min_weight, total_weight);

+if (total_weight == 0) {
+VLOG_DBG("  Total weight is zero. No active buckets.");
+return false;
+}
+
 uint32_t min_slots = ceil(total_weight / min_weight);
 n_hash = MAX(16, 1L << log_2_ceil(min_slots));

I would like to acknowledge your contribution with a Tested-by: tag in the commit
messages. Would that be OK? What is the real name I should put?

BR, Jan


From: ychen [mailto:ychen103...@163.com]
Sent: Tuesday, 17 April, 2018 13:22
To: Jan Scheurich <jan.scheur...@ericsson.com>
Cc: d...@openvswitch.org; Nitin Katiyar <nitin.kati...@ericsson.com>; 
b...@ovn.org
Subject: Re:[PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for 
select groups

Hi, Jan:
I think the following code should also be modified
 + for (int hash = 0; hash < n_hash; hash++) {
+ double max_val = 0.0;
+ struct webster *winner;

+for (i = 0; i < n_buckets; i++) {

+if (webster[i].value > max_val) {  ===> if 
bucket->weight=0, and there is only one bucket with weight equal to 0, then 
winner will be null

+max_val = webster[i].value;

+winner = &webster[i];

+}

+}



   Test like this command:

   ovs-ofctl add-group br-int -O openflow15 
"group_id=2,type=select,selection_method=dp_hash,bucket=bucket_id=1,weight=0,actions=output:10"

  vswitchd crashed after the command was issued.





Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero

2018-04-17 Thread Jan Scheurich
> >
> > OK.
> >
> > I am going to sit on this for a few days and see whether anyone reports
> > unusual issues.  If nothing arises, I'll backport as far as reasonable.
> 
> I backported to branch-2.9 and branch-2.8.

Thanks, Ben.


[ovs-dev] [PATCH v2 3/3] ofproto-dpif: Use dp_hash as default selection method

2018-04-16 Thread Jan Scheurich
The dp_hash selection method for select groups overcomes the scalability
problems of the current default selection method which, due to L2-L4
hashing during xlation and un-wildcarding of the hashed fields,
basically requires an upcall to the slow path to load-balance every
L4 connection. The consequence are an explosion of datapath flows
(megaflows degenerate to miniflows) and a limitation of connection
setup rate OVS can handle.

This commit changes the default selection method to dp_hash, provided the
bucket configuration is such that the dp_hash method can accurately
represent the bucket weights with up to 64 hash values. Otherwise we
stick to original default hash method.

We use the new dp_hash algorithm OVS_HASH_L4_SYMMETRIC to maintain the
symmetry property of the old default hash method.

A controller can explicitly request the old default hash selection method
by specifying selection method "hash" with an empty list of fields in the
Group properties of the OpenFlow 1.5 Group Mod message.

Update the documentation about selection method in the ovs-ovctl man page.

Revise and complete the ofproto-dpif unit tests cases for select groups.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 NEWS   |   2 +
 lib/ofp-group.c|  15 ++-
 ofproto/ofproto-dpif.c |  35 +++--
 ofproto/ofproto-dpif.h |   1 +
 ofproto/ofproto-provider.h |   2 +-
 tests/mpls-xlate.at|  26 ++--
 tests/ofproto-dpif.at  | 322 +++--
 utilities/ovs-ofctl.8.in   |  47 ---
 8 files changed, 336 insertions(+), 114 deletions(-)

diff --git a/NEWS b/NEWS
index 58a7b58..4ea03bf 100644
--- a/NEWS
+++ b/NEWS
@@ -14,6 +14,8 @@ Post-v2.9.0
- ovs-vsctl: New commands "add-bond-iface" and "del-bond-iface".
- OpenFlow:
  * OFPT_ROLE_STATUS is now available in OpenFlow 1.3.
+ * Default selection method for select groups is now dp_hash with improved
+   accuracy.
- Linux kernel 4.14
  * Add support for compiling OVS with the latest Linux 4.14 kernel
- ovn:
diff --git a/lib/ofp-group.c b/lib/ofp-group.c
index 31b0437..c5ddc65 100644
--- a/lib/ofp-group.c
+++ b/lib/ofp-group.c
@@ -1518,12 +1518,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf 
*payload,
 return OFPERR_OFPBPC_BAD_VALUE;
 }
 
-error = oxm_pull_field_array(payload->data, fields_len,
- &gp->fields);
-if (error) {
-OFPPROP_LOG(, false,
+if (fields_len > 0) {
+error = oxm_pull_field_array(payload->data, fields_len,
+&gp->fields);
+if (error) {
+OFPPROP_LOG(, false,
 "ntr selection method fields are invalid");
-return error;
+return error;
+}
+} else {
+/* Selection_method "hash" w/o fields means default hash method. */
+gp->fields.values_size = 0;
 }
 
 return 0;
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index e1a5c09..8f71083 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4814,39 +4814,54 @@ group_set_selection_method(struct group_dpif *group)
 struct ofputil_group_props *props = &group->up.props;
 char *selection_method = props->selection_method;
 
+VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id);
 if (selection_method[0] == '\0') {
-VLOG_INFO("No selection method specified.");
-group->selection_method = SEL_METHOD_DEFAULT;
-
+VLOG_DBG("No selection method specified. Trying dp_hash.");
+/* If the controller has not specified a selection method, check if
+ * the dp_hash selection method with max 64 hash values is appropriate
+ * for the given bucket configuration. */
+if (group_setup_dp_hash_table(group, 64)) {
+/* Use dp_hash selection method with symmetric L4 hash. */
+VLOG_DBG("Use dp_hash with %d hash values.",
+ group->hash_mask + 1);
+group->selection_method = SEL_METHOD_DP_HASH;
+group->hash_alg = OVS_HASH_ALG_SYM_L4;
+group->hash_basis = 0xdeadbeef;
+} else {
+/* Fall back to original default hashing in slow path. */
+VLOG_DBG("Falling back to default hash method.");
+group->selection_method = SEL_METHOD_DEFAULT;
+}
 } else if (!strcmp(selection_method, "dp_hash")) {
-VLOG_INFO("Selection method specified: dp_hash.");
+VLOG_DBG("Selection method specified: dp_hash.");
 /* Try to use dp_hash if possible at all. */
 if (group_setup_dp_hash_table(group, 0)) {
 group-

[ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-16 Thread Jan Scheurich
The current implementation of the "dp_hash" selection method suffers
from two deficiencies: 1. The hash mask and hence the number of dp_hash
values is just large enough to cover the number of group buckets, but
does not consider the case that buckets have different weights. 2. The
xlate-time selection of best bucket from the masked dp_hash value often
results in bucket load distributions that are quite different from the
bucket weights because the number of available masked dp_hash values
is too small (2-6 bits compared to 32 bits of a full hash in the default
hash selection method).

This commit provides a more accurate implementation of the dp_hash
select group by applying the well known Webster method for distributing
a small number of "seats" fairly over the weighted "parties"
(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
The dp_hash mask is automatically chosen large enough to provide good
enough accuracy even with widely differing weights.
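
As an illustration, the Webster apportionment of hash slots to weighted
buckets can be sketched as the stand-alone function below. It is simplified
for clarity; the real group_setup_dp_hash_table() additionally sizes the hash
mask and records bucket pointers in group->hash_map.

    #include <stdint.h>

    /* Sketch: distribute 'n_hash' slots over 'n_buckets' in proportion to
     * 'weight[]' using the Webster/Sainte-Lague method (divisors 1, 3, 5,
     * ...).  'owner[h]' receives the bucket index assigned to hash value h. */
    static void
    example_webster_assign(const uint16_t weight[], int n_buckets,
                           int owner[], int n_hash)
    {
        struct { double value; int divisor; } w[n_buckets];

        for (int i = 0; i < n_buckets; i++) {
            w[i].divisor = 1;
            w[i].value = weight[i];                 /* weight / 1 */
        }
        for (int hash = 0; hash < n_hash; hash++) {
            int winner = 0;
            for (int i = 1; i < n_buckets; i++) {
                if (w[i].value > w[winner].value) {
                    winner = i;
                }
            }
            owner[hash] = winner;
            w[winner].divisor += 2;
            w[winner].value = (double) weight[winner] / w[winner].divisor;
        }
    }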

This distribution happens at group modification time and the resulting
table is stored with the group-dpif struct. At xlation time, we use the
masked dp_hash values as index to look up the assigned bucket.

If the bucket should not be live, we do a circular search over the
mapping table until we find the first live bucket. As the buckets in
the table are by construction in pseudo-random order with a frequency
according to their weight, this method maintains correct distribution
even if one or more buckets are non-live.

Xlation is further simplified by storing some derived select group state
at group construction in struct group-dpif in a form better suited for
xlation purposes.

Adapted the unit test case for dp_hash select group accordingly.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 include/openvswitch/ofp-group.h |   1 +
 ofproto/ofproto-dpif-xlate.c|  74 +---
 ofproto/ofproto-dpif.c  | 146 
 ofproto/ofproto-dpif.h  |  13 
 tests/ofproto-dpif.at   |  18 +++--
 5 files changed, 221 insertions(+), 31 deletions(-)

diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h
index 8d893a5..af4033d 100644
--- a/include/openvswitch/ofp-group.h
+++ b/include/openvswitch/ofp-group.h
@@ -47,6 +47,7 @@ struct bucket_counter {
 /* Bucket for use in groups. */
 struct ofputil_bucket {
 struct ovs_list list_node;
+uint16_t aux;   /* Padding. Also used for temporary data. */
 uint16_t weight;/* Relative weight, for "select" groups. */
 ofp_port_t watch_port;  /* Port whose state affects whether this bucket
  * is live. Only required for fast failover
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index c8baba1..df245c5 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4235,35 +4235,55 @@ xlate_hash_fields_select_group(struct xlate_ctx *ctx, 
struct group_dpif *group,
 }
 }
 
+static struct ofputil_bucket *
+group_dp_hash_best_bucket(struct xlate_ctx *ctx,
+  const struct group_dpif *group,
+  uint32_t dp_hash)
+{
+struct ofputil_bucket *bucket, *best_bucket = NULL;
+uint32_t n_hash = group->hash_mask + 1;
+
+uint32_t hash = dp_hash &= group->hash_mask;
+ctx->wc->masks.dp_hash |= group->hash_mask;
+
+/* Starting from the original masked dp_hash value iterate over the
+ * hash mapping table to find the first live bucket. As the buckets
+ * are quasi-randomly spread over the hash values, this maintains
+ * a distribution according to bucket weights even when some buckets
+ * are non-live. */
+for (int i = 0; i < n_hash; i++) {
+bucket = group->hash_map[(hash + i) % n_hash];
+if (bucket_is_alive(ctx, bucket, 0)) {
+best_bucket = bucket;
+break;
+}
+}
+
+return best_bucket;
+}
+
 static void
 xlate_dp_hash_select_group(struct xlate_ctx *ctx, struct group_dpif *group,
bool is_last_action)
 {
-struct ofputil_bucket *bucket;
-
 /* dp_hash value 0 is special since it means that the dp_hash has not been
  * computed, as all computed dp_hash values are non-zero.  Therefore
  * compare to zero can be used to decide if the dp_hash value is valid
  * without masking the dp_hash field. */
 if (!ctx->xin->flow.dp_hash) {
-uint64_t param = group->up.props.selection_method_param;
-
-ctx_trigger_recirculate_with_hash(ctx, param >> 32, (uint32_t)param);
+ctx_trigger_recirculate_with_hash(ctx, group->hash_alg,
+  group->has

[ovs-dev] [PATCH v2 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm

2018-04-16 Thread Jan Scheurich
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in
the netdev datapath. It will be used as default hash algorithm for the
dp_hash-based select groups in a subsequent commit to maintain
compatibility with the symmetry property of the current default hash
selection method.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |  4 +++
 lib/flow.c| 42 +--
 lib/flow.h|  1 +
 lib/odp-execute.c | 23 +++--
 4 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
b/datapath/linux/compat/include/linux/openvswitch.h
index 84ebcaf..2bb3cb2 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -720,6 +720,10 @@ struct ovs_action_push_vlan {
  */
 enum ovs_hash_alg {
OVS_HASH_ALG_L4,
+#ifndef __KERNEL__
+   OVS_HASH_ALG_SYM_L4,
+#endif
+   __OVS_HASH_MAX
 };
 
 /*
diff --git a/lib/flow.c b/lib/flow.c
index 09b66b8..9d8c1ca 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -2108,6 +2108,44 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t 
basis)
 return jhash_bytes(&fields, sizeof fields, basis);
 }
 
+/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. */
+uint32_t
+flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis)
+{
+union {
+struct {
+ovs_be16 eth_type;
+ovs_be16 vlan_tci;
+struct eth_addr eth_addr;
+ovs_be16 pad;
+};
+uint32_t word[3];
+} fields;
+
+uint32_t hash = basis;
+int i;
+
+if (flow->packet_type != htons(PT_ETH)) {
+/* Cannot hash non-Ethernet flows */
+return 0;
+}
+
+for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) {
+fields.eth_addr.be16[i] =
+flow->dl_src.be16[i] ^ flow->dl_dst.be16[i];
+}
+for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) {
+fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK);
+}
+fields.eth_type = flow->dl_type;
+fields.pad = 0;
+
+hash = hash_add(hash, fields.word[0]);
+hash = hash_add(hash, fields.word[1]);
+hash = hash_add(hash, fields.word[2]);
+return hash_finish(hash, basis);
+}
+
 /* Hashes 'flow' based on its L3 through L4 protocol information */
 uint32_t
 flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
@@ -2128,8 +2166,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, 
uint32_t basis,
 hash = hash_add64(hash, a[i] ^ b[i]);
 }
 } else {
-/* Cannot hash non-IP flows */
-return 0;
+/* Revert to hashing L2 headers */
+return flow_hash_symmetric_l2(flow, basis);
 }
 
 hash = hash_add(hash, flow->nw_proto);
diff --git a/lib/flow.h b/lib/flow.h
index af82931..900e8f8 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port)
 
 uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis);
+uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis);
 uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis,
  bool inc_udp_ports );
 
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 1969f02..c716c41 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -726,14 +726,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 }
 
 switch ((enum ovs_action_attr) type) {
+
 case OVS_ACTION_ATTR_HASH: {
 const struct ovs_action_hash *hash_act = nl_attr_get(a);
 
-/* Calculate a hash value directly.  This might not match the
+/* Calculate a hash value directly. This might not match the
  * value computed by the datapath, but it is much less expensive,
  * and the current use case (bonding) does not require a strict
  * match to work properly. */
-if (hash_act->hash_alg == OVS_HASH_ALG_L4) {
+switch (hash_act->hash_alg) {
+case OVS_HASH_ALG_L4: {
 struct flow flow;
 uint32_t hash;
 
@@ -749,7 +751,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
*batch, bool steal,
 }
 packet->md.dp_hash = hash;
 }
-} else {
+break;
+}
+case OVS_HASH_ALG_SYM_L4: {
+struct flow flow;
+uint32_t hash;
+
+DP_PACKET_BATCH_FOR_EACH (i, pa

[ovs-dev] [PATCH v2 0/3] Use improved dp_hash select group by default

2018-04-16 Thread Jan Scheurich
The current default OpenFlow select group implementation sends every new L4 flow
to the slow path for the balancing decision and installs a 5-tuple "miniflow"
in the datapath to forward subsequent packets of the connection accordingly. 
Clearly this has major scalability issues with many parallel L4 flows and high
connection setup rates.

The dp_hash selection method for the OpenFlow select group was added to OVS
as an alternative. It avoids the scalability issues for the price of an 
additional recirculation in the datapath. The dp_hash method is only available
to OF1.5 SDN controllers speaking the Netronome Group Mod extension to 
configure the selection mechanism. This severely limited the applicability of
the dp_hash select group in the past.

Furthermore, testing revealed that the implemented dp_hash selection often
generated a very uneven distribution of flows over group buckets and didn't 
consider bucket weights at all.

The present patch set in a first step improves the dp_hash selection method to
much more accurately distribute flows over weighted group buckets. In a second
step it makes the improved dp_hash method the default in OVS for select groups
that can be accurately handled by dp_hash. That should be the vast majority of
cases. Otherwise we fall back to the legacy slow-path selection method.

The Netronome extension can still be used to override the default decision and
require the legacy slow-path or the dp_hash selection method.

v1 -> v2:
- Fixed crashes for corner cases reported by Ychen
- Fixed group ref leakage with dp_hash reported by Ychen
- Changed all xlation logging from INFO to DBG
- Revised, completed and detailed select group unit test cases in 
ofproto-dpif
- Updated selection_method documentation in ovs-ofctl man page
- Added NEWS item

Jan Scheurich (3):
  userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
  ofproto-dpif: Improve dp_hash selection method for select groups
  ofproto-dpif: Use dp_hash as default selection method

 NEWS  |   2 +
 datapath/linux/compat/include/linux/openvswitch.h |   4 +
 include/openvswitch/ofp-group.h   |   1 +
 lib/flow.c|  42 ++-
 lib/flow.h|   1 +
 lib/odp-execute.c |  23 +-
 lib/ofp-group.c   |  15 +-
 ofproto/ofproto-dpif-xlate.c  |  74 +++--
 ofproto/ofproto-dpif.c| 161 +++
 ofproto/ofproto-dpif.h|  14 +
 ofproto/ofproto-provider.h|   2 +-
 tests/mpls-xlate.at   |  26 +-
 tests/ofproto-dpif.at | 314 +-
 utilities/ovs-ofctl.8.in  |  47 ++--
 14 files changed, 599 insertions(+), 127 deletions(-)

-- 
1.9.1



Re: [ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions

2018-04-16 Thread Jan Scheurich
Just sent the adjusted version for 2.8.
/Jan

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Friday, 13 April, 2018 20:19
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; yi.y.y...@intel.com
> Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap actions
> 
> Thanks for checking.  I applied both patches to branch-2.9.  For
> branch-2.8, would you mind submitting the fixed-up patches?  It would
> save me a few minutes.
> 
> Thanks,
> 
> Ben.
> 
> On Fri, Apr 06, 2018 at 05:37:20PM +, Jan Scheurich wrote:
> > Yes that fix should be applied to branches 2.9 and 2.8.
> >
> > I checked that it applies and passes all unit tests pass.
> >
> > On branch-2.8 the patch for nsh.at patch must be slightly retrofitted as 
> > the datapath action names changed from
> encap_nsh/decap_nsh to push_nsh/pop_nsh and the nsh_ttl field was introduced 
> in 2.9.
> >
> > diff --git a/tests/nsh.at b/tests/nsh.at
> > index 6ae71b5..6eb4637 100644
> > --- a/tests/nsh.at
> > +++ b/tests/nsh.at
> > @@ -351,7 +351,7 @@ bridge("br0")
> >
> >  Final flow: unchanged
> >  Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
> > -Datapath actions: 
> > push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x1122334
> > +Datapath actions: 
> > encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0
> >  ])
> >
> >  AT_CHECK([
> > @@ -370,7 +370,7 @@ bridge("br0")
> >
> >  Final flow: 
> > recirc_id=0x1,eth,in_port=4,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_ds
> >  Megaflow: 
> > recirc_id=0x1,packet_type=(1,0x894f),in_port=4,nsh_mdtype=1,nsh_np=3,nsh_spi
> > -Datapath actions: pop_nsh(),recirc(0x2)
> > +Datapath actions: decap_nsh(),recirc(0x2)
> >  ])
> >
> >  AT_CHECK([
> > @@ -407,8 +407,8 @@ ovs-appctl time/warp 1000
> >  AT_CHECK([
> >  ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v 
> > ipv6 | sort
> >  ], [0], [flow-dump from non-dpdk interfaces:
> > -recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
> > -recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
> > +recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
> > +recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
> >  
> > recirc_id(0x4),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
> >  packe
> >  ])
> >
> > Thanks, Jan
> >
> > > -Original Message-
> > > From: Ben Pfaff [mailto:b...@ovn.org]
> > > Sent: Friday, 06 April, 2018 18:36
> > > To: Jan Scheurich <jan.scheur...@ericsson.com>
> > > Cc: d...@openvswitch.org; yi.y.y...@intel.com
> > > Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap 
> > > actions
> > >
> > > On Fri, Apr 06, 2018 at 09:35:48AM -0700, Ben Pfaff wrote:
> > > > On Thu, Apr 05, 2018 at 04:11:02PM +0200, Jan Scheurich wrote:
> > > > > Recent tests with NSH encap have shown that the translation of 
> > > > > multiple
> > > > > subsequent encap() or decap() actions was incorrect. This patch set
> > > > > corrects the handling and adds a unit test for NSH to cover two NSH
> > > > > and one Ethernet encapsulation levels.
> > > >
> > > > Thanks.  Should this be applied to branch-2.9?
> > >
> > > To be clear, I applied it to master just now.


[ovs-dev] [PATCH branch-2.8 1/2] xlate: Correct handling of double encap() actions

2018-04-16 Thread Jan Scheurich
When the same encap() header was pushed twice onto a packet (e.g in the
case of NSH in NSH), the translation logic only generated a datapath push
action for the first encap() action. The second encap() did not emit a
push action because the packet type was unchanged.

commit_encap_decap_action() (renamed from commit_packet_type_change) must
solely rely on ctx->pending_encap to generate an datapath push action.

Similarly, the first decap() action on a double header packet does not
change the packet_type either. Add a corresponding ctx->pending_decap
flag and use that to trigger emitting a datapath pop action.

Fixes: f839892a2 ("OF support and translation of generic encap and decap")
Fixes: 1fc11c594 ("Generic encap and decap support for NSH")

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
---
 lib/odp-util.c   | 16 ++--
 lib/odp-util.h   |  1 +
 ofproto/ofproto-dpif-xlate.c |  7 ++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/lib/odp-util.c b/lib/odp-util.c
index 7f42b98..78cc903 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -6883,17 +6883,13 @@ odp_put_encap_nsh_action(struct ofpbuf *odp_actions,
 }
 
 static void
-commit_packet_type_change(const struct flow *flow,
+commit_encap_decap_action(const struct flow *flow,
   struct flow *base_flow,
   struct ofpbuf *odp_actions,
   struct flow_wildcards *wc,
-  bool pending_encap,
+  bool pending_encap, bool pending_decap,
   struct ofpbuf *encap_data)
 {
-if (flow->packet_type == base_flow->packet_type) {
-return;
-}
-
 if (pending_encap) {
 switch (ntohl(flow->packet_type)) {
 case PT_ETH: {
@@ -6918,7 +6914,7 @@ commit_packet_type_change(const struct flow *flow,
  * The check is done at action translation. */
 OVS_NOT_REACHED();
 }
-} else {
+} else if (pending_decap || flow->packet_type != base_flow->packet_type) {
 /* This is an explicit or implicit decap case. */
 if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE &&
 base_flow->packet_type == htonl(PT_ETH)) {
@@ -6957,14 +6953,14 @@ commit_packet_type_change(const struct flow *flow,
 enum slow_path_reason
 commit_odp_actions(const struct flow *flow, struct flow *base,
struct ofpbuf *odp_actions, struct flow_wildcards *wc,
-   bool use_masked, bool pending_encap,
+   bool use_masked, bool pending_encap, bool pending_decap,
struct ofpbuf *encap_data)
 {
 enum slow_path_reason slow1, slow2;
 bool mpls_done = false;
 
-commit_packet_type_change(flow, base, odp_actions, wc,
-  pending_encap, encap_data);
+commit_encap_decap_action(flow, base, odp_actions, wc,
+  pending_encap, pending_decap, encap_data);
 commit_set_ether_action(flow, base, odp_actions, wc, use_masked);
 /* Make packet a non-MPLS packet before committing L3/4 actions,
  * which would otherwise do nothing. */
diff --git a/lib/odp-util.h b/lib/odp-util.h
index 27c2ab4..9d6cc45 100644
--- a/lib/odp-util.h
+++ b/lib/odp-util.h
@@ -278,6 +278,7 @@ enum slow_path_reason commit_odp_actions(const struct flow 
*,
  struct flow_wildcards *wc,
  bool use_masked,
  bool pending_encap,
+ bool pending_decap,
  struct ofpbuf *encap_data);
 
 /* ofproto-dpif interface.
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 3890b2e..54fd06c 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -241,6 +241,8 @@ struct xlate_ctx {
  * true. */
 bool pending_encap; /* True when waiting to commit a pending
  * encap action. */
+bool pending_decap; /* True when waiting to commit a pending
+ * decap action. */
 struct ofpbuf *encap_data;  /* May contain a pointer to an ofpbuf with
  * context for the datapath encap action.*/
 
@@ -3537,8 +3539,9 @@ xlate_commit_actions(struct xlate_ctx *ctx)
 ctx->xout->slow |= commit_odp_actions(&ctx->xin->flow, &ctx->base_flow,
   ctx->odp_actions, ctx->wc,
   use_masked, ctx->pending_encap,
-  ctx->encap_data);
+  ctx->pending_decap, ctx->encap_data);
 ctx->pending_encap = false;
+ctx->pending_de

[ovs-dev] [PATCH branch-2.8 2/2] nsh: Add unit test for double NSH encap and decap

2018-04-16 Thread Jan Scheurich
The added test verifies that OVS correctly encapsulates an Ethernet
packet with two NSH (MD1) headers, sends it with an Ethernet header
over a patch port and decap the Ethernet and the two NSH headers on
the receiving bridge to reveal the original packet.

The test case performs the encap() operations in a sequence of three
chained groups to test the correct handling of encap() actions in
group buckets recently fixed in commit ce4a16ac0.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
---
 tests/nsh.at | 143 +++
 1 file changed, 143 insertions(+)

diff --git a/tests/nsh.at b/tests/nsh.at
index aa80a2a..6eb4637 100644
--- a/tests/nsh.at
+++ b/tests/nsh.at
@@ -274,6 +274,149 @@ AT_CLEANUP
 
 
 ### -
+###   Double NSH MD1 encapsulation using groups over veth link
+### -
+
+AT_SETUP([nsh - double encap over veth link using groups])
+
+OVS_VSWITCHD_START([])
+
+AT_CHECK([
+ovs-vsctl set bridge br0 datapath_type=dummy \
+protocols=OpenFlow10,OpenFlow13,OpenFlow14,OpenFlow15 -- \
+add-port br0 p1 -- set Interface p1 type=dummy ofport_request=1 -- \
+add-port br0 p2 -- set Interface p2 type=dummy ofport_request=2 -- \
+add-port br0 v3 -- set Interface v3 type=patch options:peer=v4 
ofport_request=3 -- \
+add-port br0 v4 -- set Interface v4 type=patch options:peer=v3 
ofport_request=4])
+
+AT_DATA([flows.txt], [dnl
+table=0,in_port=1,ip,actions=group:100
+
table=0,in_port=4,packet_type=(0,0),dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:1
+
table=1,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:2
+
table=2,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344,actions=decap(),output:2
+])
+
+AT_DATA([groups.txt], [dnl
+add 
group_id=100,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x1234->nsh_spi,set_field:0x11223344->nsh_c1,group:200
+add 
group_id=200,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x5678->nsh_spi,set_field:0x55667788->nsh_c1,group:300
+add 
group_id=300,type=indirect,bucket=actions=encap(ethernet),set_field:11:22:33:44:55:66->dl_dst,3
+])
+
+AT_CHECK([
+ovs-ofctl del-flows br0
+ovs-ofctl -Oopenflow13 add-groups br0 groups.txt
+ovs-ofctl -Oopenflow13 add-flows br0 flows.txt
+ovs-ofctl -Oopenflow13 dump-flows br0 | ofctl_strip | sort | grep actions
+], [0], [dnl
+ in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 
actions=decap(),goto_table:1
+ ip,in_port=1 actions=group:100
+ table=1, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 
actions=decap(),goto_table:2
+ table=2, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344 
actions=decap(),output:2
+])
+
+# TODO:
+# The fields nw_proto, nw_tos, nw_ecn, nw_ttl in final flow seem unnecessary. 
Can they be avoided?
+# The match on dl_dst=66:77:88:99:aa:bb in the Megaflow is a side effect of 
setting the dl_dst in the pushed outer
+# Ethernet header. It is a consequence of using wc->masks both for tracking 
matched and set bits and seems hard to
+# avoid except by using separate masks for both purposes.
+
+AT_CHECK([
+ovs-appctl ofproto/trace br0 
'in_port=1,icmp,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_dst=10.10.10.10,nw_src=20.20.20.20'
+], [0], [dnl
+Flow: 
icmp,in_port=1,vlan_tci=0x,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_src=20.20.20.20,nw_dst=10.10.10.10,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0
+
+bridge("br0")
+-
+ 0. ip,in_port=1, priority 32768
+group:100
+encap(nsh(md_type=1))
+set_field:0x1234->nsh_spi
+set_field:0x11223344->nsh_c1
+group:200
+encap(nsh(md_type=1))
+set_field:0x5678->nsh_spi
+set_field:0x55667788->nsh_c1
+group:300
+encap(ethernet)
+set_field:11:22:33:44:55:66->eth_dst
+output:3
+
+bridge("br0")
+-
+ 0. in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, 
priority 32768
+decap()
+goto_table:1
+ 1. packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, 
priority 32768
+decap()
+
+Final flow: unchanged
+Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
+Datapath actions: 
encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),encap_nsh(flags=0,mdtype=1,np=4,spi=0x5678,si=255,c1=0x55667788,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,decap_nsh(),recirc(0x1)
+])
+
+AT_CHECK([
+ovs-appctl ofproto/trace br0 
'recirc_id=1,in_port=4,packet_type=(1,0x894f),nsh_mdtype=1,nsh_np=3,nsh_spi=0x1234,nsh_c1=0x11223344'
+], [0], [dnl
+Flow: re

[ovs-dev] [PATCH branch-2.8 0/2] Correct handling of double encap and decap actions

2018-04-16 Thread Jan Scheurich
Recent tests with NSH encap have shown that the translation of multiple
subsequent encap() or decap() actions was incorrect. This patch set
corrects the handling and adds a unit test for NSH to cover two NSH
and one Ethernet encapsulation levels.

This patch retrofits the new NSH test in commit a5b3e2a6f2 to the preliminary 
NSH 
support in OVS 2.8.

Jan Scheurich (2):
  xlate: Correct handling of double encap() actions
  nsh: Add unit test for double NSH encap and decap

 lib/odp-util.c   |  16 ++---
 lib/odp-util.h   |   1 +
 ofproto/ofproto-dpif-xlate.c |   7 ++-
 tests/nsh.at | 143 +++
 4 files changed, 156 insertions(+), 11 deletions(-)

-- 
1.9.1



Re: [ovs-dev] [PATCH v11 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-04-12 Thread Jan Scheurich
> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Thursday, 12 April, 2018 18:37
> 
> On Thu, Apr 12, 2018 at 05:32:11PM +0200, Jan Scheurich wrote:
> > If the caller provides a non-NULL qfill pointer and the netdev
> > implementation supports reading the rx queue fill level, the rxq_recv()
> > function returns the remaining number of packets in the rx queue after
> > reception of the packet burst to the caller. If the implementation does
> > not support this, it returns -ENOTSUP instead. Reading the remaining queue
> > fill level should not substantially slow down the recv() operation.
> >
> > A first implementation is provided for ethernet and vhostuser DPDK ports
> > in netdev-dpdk.c.
> >
> > This output parameter will be used in the upcoming commit for PMD
> > performance metrics to supervise the rx queue fill level for DPDK
> > vhostuser ports.
> 
> Thanks for working on the generic netdev layer.
> 
> I wasn't sure what a qfill was, so I looked at the comment on the
> function that returned it and it says that it is a "queue fill level".
> I can kind of guess what that is, but maybe it should be spelled out a
> little more.  For example, is it the number of packets currently waiting
> to be received?  (Maybe it is the number of bytes, who knows.)  So, I
> suggest making the comment just a little more explicit.

Sure, will do that.


Re: [ovs-dev] [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-12 Thread Jan Scheurich
> > I would not say this is expected behavior.
> >
> > It seems that you are executing on a somewhat slower system (tsc clock 
> > seems to be 100/us = 0.1 GHz) and that, even with only 5
> lines logged before and after,  the logging output is causing so much slow 
> down of the PMD that it continues to cause iterations using
> excessive cycles (362000 = 3.62 ms!) due to logging.
> 
> The system is slower than usual, but not so much.
> This behaviour captured on ARMv8. The TSC frequency on ARM is usually around 
> 100MHz
> without using PMU which is not available from userspace by default.
> Meantime, CPU frequency is 2GHz.

On my x86_64 server (2.4 GHz) I could reproduce the periodic logging if I go 
down with the us_thr to values as low as 50 us under high PMD load. But it 
always stops when I increase the threshold to a value higher than 80 us. It 
seems you ARM system is much slower when logging to file. The threshold that 
should be reasonably applied may depend on the system under test.

> 
> >
> > The actual iteration with logging is not flagged as suspicious, but the 
> > subsequent iteration gets the hit of the massive cycles that have
> passed on the TSC clock. The "phantom" duration of 0 us shown is probably a 
> side effect of this.
> 
> I guess, you just have some bug in your calculation of execution time.
> Possibly you're mixing up the TSC and CPU frequencies.
> Zero ms duration is normal for printing so small amount of data.

It was actually a bug in the code: The iteration logged as suspicious was the 
one directly after the actually suspicious iteration. That bug is fixed in v11 
I just sent out.

BR, Jan



[ovs-dev] [PATCH v11 0/3] dpif-netdev: Detailed PMD performance metrics and supervision

2018-04-12 Thread Jan Scheurich
The run-time performance of PMDs is often difficult to understand and 
trouble-shoot. The existing PMD statistics counters only provide a coarse 
grained average picture. At packet rates of several Mpps sporadic drops of
packet bursts happen at sub-millisecond time scales and are impossible to
capture and analyze with existing tools.

This patch collects a large number of important PMD performance metrics
per PMD iteration, maintaining histograms and circular histories for
iteration metrics and millisecond averages. To capture sporadic drop
events, the patch set can be configured to monitor iterations for suspicious
metrics and to log the neighborhood of such iterations for off-line analysis.

The extra cost for the performance metric collection and the supervision has
been measured to be in the order of 1% compared to the base commit in a PVP
setup with L3 pipeline over VXLAN tunnels. For that reason the metrics
collection is disabled by default and can be enabled at run-time through
configuration.

v10 -> v11:
* Rebased to master (commit 00a0a011d)
* Implemented comments on v10 by Ilya, Aaron and Ian.
* Replaced broken macro ATOMIC_LLONG_LOCK_FREE with working
  macro ATOMIC_ALWAYS_LOCK_FREE_8B.
* Changed iteration key in iteration history from TSC timestamp to
  iteration counter.
* Bugfix: Suspicious iteration logged was one off the actual suspicious
  iteration.

v9 -> v10:
* Implemented missed comment by Ilya on v8: use ATOMIC_LLONG_LOCK_FREE
* Fixed travis and checkpatch errors reported by Ian on v9.

v8 -> v9:
* Rebased to master (commit cb8cbbbe9)
* Implemented minor comments on v8 by Billy

v7 -> v8:
* Rebased on to master (commit 4e99b70df)
* Implemented comments from Ilya Maximets and Billy O'Mahony.
* Replaced netdev_rxq_length() introduced in v7 by optional out
  parameter for the remaining rx queue len in netdev_rxq_recv().
* Fixed thread synchronization issues in clearing PMD stats:
  - Use mutex to control whether to clear from main thread directly
or in PMD at start of next iteration.
  - Use mutex to prevent concurrent clearing and printing of metrics.
* Added tx packet and batch stats to pmd-perf-show output.
* Delay warning for suspicious iteration to the iteration in which
  we also log the neighborhood to not pollute the logged iteration
  stats with logging costs.
* Corrected the exact number of iterations logged before and after a
  suspicious iteration.
* Introduced options -e and -ne in pmd-perf-log-set to control whether
  to *extend* the range of logged iterations when additional suspicious
  iterations are detected before the scheduled end of logging interval
  is reached.
* Exclude logging cycles from the iteration stats to avoid confusing
  ghost peaks.
* Performance impact compared to master less than 1% even with
  supervision enabled.

v5 -> v7:
* Rebased on to dpdk_merge (commit e68)
  - New base contains earlier refactoring parts of series.
* Implemented comments from Ilya Maximets and Billy O'Mahony.
* Replaced piggybacking qlen on dp_packet_batch with a new netdev API
  netdev_rxq_length().
* Thread-safe clearing of pmd counters in pmd_perf_start_iteration().
* Fixed bug in reporting datapath stats.
* Work-around a bug in DPDK rte_vhost_rx_queue_count() which sometimes
  returns bogus values in the upper 16 bits of the uint32_t return value.

v4 -> v5:
* Rebased to master (commit e9de6c0)
* Implemented comments from Aaron Conole and Darrel Ball

v3 -> v4:
* Rebased to master (commit 4d0a31b)
  - Reverting changes to struct dp_netdev_pmd_thread.
* Make metrics collection configurable.
* Several bugfixes.

v2 -> v3:
* Rebased to OVS master (commit 3728b3b).
* Non-trivial adaptation to struct dp_netdev_pmd_thread.
  - refactored in commit a807c157 (Bhanu).
* No other changes compared to v2.

v1 -> v2:
* Rebased to OVS master (commit 7468ec788).
* No other changes compared to v1.

Jan Scheurich (3):
  netdev: Add optional qfill output parameter to rxq_recv()
  dpif-netdev: Detailed performance stats for PMDs
  dpif-netdev: Detection and logging of suspicious PMD iterations

 NEWS|   6 +
 lib/automake.mk |   1 +
 lib/dpif-netdev-perf.c  | 685 +++-
 lib/dpif-netdev-perf.h  | 218 --
 lib/dpif-netdev-unixctl.man | 216 ++
 lib/dpif-netdev.c   | 192 -
 lib/netdev-bsd.c|   8 +-
 lib/netdev-dpdk.c   |  41 ++-
 lib/netdev-dummy.c  |   8 +-
 lib/netdev-linux.c  |   7 +-
 lib/netdev-provider.h   |   7 +-
 lib/netdev.c|   5 +-
 lib/netdev.h|   3 +-
 manpages.mk |   2 +
 vswitchd/ovs-vswitchd.8.in  |  27 +-
 vswitchd/vswitch.xml|  12 +
 16 files changed, 1362 insertions(+), 76 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

-- 
1.9.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [PATCH v11 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-04-12 Thread Jan Scheurich
If the caller provides a non-NULL qfill pointer and the netdev
implementation supports reading the rx queue fill level, the rxq_recv()
function returns the remaining number of packets in the rx queue after
reception of the packet burst to the caller. If the implementation does
not support this, it returns -ENOTSUP instead. Reading the remaining queue
fill level should not substantially slow down the recv() operation.

A first implementation is provided for ethernet and vhostuser DPDK ports
in netdev-dpdk.c.

This output parameter will be used in the upcoming commit for PMD
performance metrics to supervise the rx queue fill level for DPDK
vhostuser ports.
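
For illustration, a minimal sketch of how a caller can consume the new
parameter (the actual consumer arrives with the PMD performance metrics
patch of this series; names here are only illustrative):

static void
poll_rxq_sketch(struct netdev_rxq *rx)
{
    struct dp_packet_batch batch;
    int qfill;
    int error;

    dp_packet_batch_init(&batch);
    error = netdev_rxq_recv(rx, &batch, &qfill);
    if (!error) {
        if (qfill == -ENOTSUP) {
            /* This netdev cannot report its rx queue fill level. */
        } else {
            /* 'qfill' packets were still waiting in the rx queue after
             * this burst, e.g. feed it into the PMD perf statistics. */
        }
        /* ... process 'batch' ... */
    }
}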

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 lib/dpif-netdev.c |  2 +-
 lib/netdev-bsd.c  |  8 +++-
 lib/netdev-dpdk.c | 41 -
 lib/netdev-dummy.c|  8 +++-
 lib/netdev-linux.c|  7 ++-
 lib/netdev-provider.h |  7 ++-
 lib/netdev.c  |  5 +++--
 lib/netdev.h  |  3 ++-
 8 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index be31fd0..7ce3943 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3277,7 +3277,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread 
*pmd,
 pmd->ctx.last_rxq = rxq;
 dp_packet_batch_init();
 
-error = netdev_rxq_recv(rxq->rx, );
+error = netdev_rxq_recv(rxq->rx, , NULL);
 if (!error) {
 /* At least one packet received. */
 *recirc_depth_get() = 0;
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index 05974c1..b70f327 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq, struct 
dp_packet *buffer)
 }
 
 static int
-netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch)
+netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+int *qfill)
 {
 struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_);
 struct netdev *netdev = rxq->up.netdev;
@@ -643,6 +644,11 @@ netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct 
dp_packet_batch *batch)
 batch->packets[0] = packet;
 batch->count = 1;
 }
+
+if (qfill) {
+*qfill = -ENOTSUP;
+}
+
 return retval;
 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index ee39cbe..a4fc382 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1812,13 +1812,13 @@ netdev_dpdk_vhost_update_rx_counters(struct 
netdev_stats *stats,
  */
 static int
 netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
-   struct dp_packet_batch *batch)
+   struct dp_packet_batch *batch, int *qfill)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
 struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
 uint16_t nb_rx = 0;
 uint16_t dropped = 0;
-int qid = rxq->queue_id;
+int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ;
 int vid = netdev_dpdk_get_vid(dev);
 
 if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured
@@ -1826,14 +1826,23 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 return EAGAIN;
 }
 
-nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM + VIRTIO_TXQ,
-dev->mp,
+nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
 (struct rte_mbuf **) batch->packets,
 NETDEV_MAX_BURST);
 if (!nb_rx) {
 return EAGAIN;
 }
 
+if (qfill) {
+if (nb_rx == NETDEV_MAX_BURST) {
+/* The DPDK API returns a uint32_t which often has invalid bits in
+ * the upper 16-bits. Need to restrict the value to uint16_t. */
+*qfill = rte_vhost_rx_queue_count(vid, qid) & UINT16_MAX;
+} else {
+*qfill = 0;
+}
+}
+
 if (policer) {
 dropped = nb_rx;
 nb_rx = ingress_policer_run(policer,
@@ -1854,7 +1863,8 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
 }
 
 static int
-netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
+netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
+ int *qfill)
 {
 struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
 struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
@@ -1891,6 +1901,14 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct 
dp_packet_batch *batch)
 batch->count = nb_rx;
 dp_packet_batch_init_packet_fields(batch);
 
+if (qfill) {
+if (nb_rx == NETDEV_MAX_BURST) {
+*qfill = rte_eth_rx_queue_count(rx->port_id, rxq->queue_id);
+} else {
+*qfill = 0;
+}
+}
+
 return 0;
 }
 
@@ -3172,6 +3190,19 @@ vrin

[ovs-dev] [PATCH v11 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-12 Thread Jan Scheurich
This patch enhances dpif-netdev-perf to detect iterations with
suspicious statistics according to the following criteria:

- iteration lasts longer than US_THR microseconds (default 250).
  This can be used to capture events where a PMD is blocked or
  interrupted for such a period of time that there is a risk for
  dropped packets on any of its Rx queues.

- max vhost qlen exceeds a threshold Q_THR (default 128). This can
  be used to infer virtio queue overruns and dropped packets inside
  a VM, which are not visible in OVS otherwise.
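
In pseudo-code, the end-of-iteration supervision boils down to something
like the following sketch (simplified; names follow this patch, and the
real check is part of the end-of-iteration handling in dpif-netdev-perf.c):

static void
check_iteration_sketch(struct pmd_perf_stats *s,
                       uint64_t duration_us, uint32_t max_vhost_qlen)
{
    if (duration_us > log_us_thr || max_vhost_qlen > log_q_thr) {
        /* Remember the suspicious iteration and schedule logging of its
         * neighborhood: log_it_before iterations back and log_it_after
         * iterations ahead. */
        s->log_susp_it = s->iteration_cnt;
        s->log_end_it = s->iteration_cnt + log_it_after;
    }
}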

Such suspicious iterations can be logged together with their iteration
statistics so that they can be correlated with packet drops or other events
outside OVS.

A new command is introduced to enable/disable logging at run-time and
to adjust the above thresholds for suspicious iterations:

ovs-appctl dpif-netdev/pmd-perf-log-set on | off
[-b before] [-a after] [-e|-ne] [-us usec] [-q qlen]

Turn logging on or off at run-time (on|off).

-b before:  The number of iterations before the suspicious iteration to
            be logged (default 5).
-a after:   The number of iterations after the suspicious iteration to
            be logged (default 5).
-e:         Extend logging interval if another suspicious iteration is
            detected before logging occurs.
-ne:        Do not extend logging interval (default).
-q qlen:    Suspicious vhost queue fill level threshold. Increase this
            to 512 if Qemu supports a virtio queue length of 1024
            (default 128).
-us usec:   Change the duration threshold for a suspicious iteration
            (default 250 us).
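
For example, to log 10 iterations before and after each suspicious
iteration, extend the logging window when needed, and tighten the duration
threshold to 200 us (values purely illustrative):

  ovs-appctl dpif-netdev/pmd-perf-log-set on -b 10 -a 10 -e -us 200 -q 512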

Note: Logging of suspicious iterations itself consumes a considerable
amount of processing cycles of a PMD which may be visible in the iteration
history. In the worst case this can lead OVS to detect another
suspicious iteration caused by logging.

If more than 100 iterations around a suspicious iteration have been
logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne)
to avoid that logging itself causes continuous further logging.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 NEWS|   2 +
 lib/dpif-netdev-perf.c  | 223 
 lib/dpif-netdev-perf.h  |  21 +
 lib/dpif-netdev-unixctl.man |  59 
 lib/dpif-netdev.c   |   5 +
 5 files changed, 310 insertions(+)

diff --git a/NEWS b/NEWS
index ff81a9f..f11ef64 100644
--- a/NEWS
+++ b/NEWS
@@ -23,6 +23,8 @@ Post-v2.9.0
  * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
  * Detailed PMD performance metrics available with new command
  ovs-appctl dpif-netdev/pmd-perf-show
+ * Supervision of PMD performance metrics and logging of suspicious
+   iterations
 
 v2.9.0 - 19 Feb 2018
 
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index caa0e27..47ce2c2 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -25,6 +25,24 @@
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#define ITER_US_THRESHOLD 250   /* Warning threshold for iteration duration
+   in microseconds. */
+#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */
+#define LOG_IT_BEFORE 5 /* Number of iterations to log before
+   suspicious iteration. */
+#define LOG_IT_AFTER 5  /* Number of iterations to log after
+   suspicious iteration. */
+
+bool log_enabled = false;
+bool log_extend = false;
+static uint32_t log_it_before = LOG_IT_BEFORE;
+static uint32_t log_it_after = LOG_IT_AFTER;
+static uint32_t log_us_thr = ITER_US_THRESHOLD;
+uint32_t log_q_thr = VHOST_QUEUE_FULL;
+uint64_t iter_cycle_threshold;
+
+static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600, 600);
+
 #ifdef DPDK_NETDEV
 static uint64_t
 get_tsc_hz(void)
@@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s)
 histogram_walls_set_log(>max_vhost_qfill, 0, 512);
 s->iteration_cnt = 0;
 s->start_ms = time_msec();
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 }
 
 void
@@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s)
 history_init(>milliseconds);
 s->start_ms = time_msec();
 s->milliseconds.sample[0].timestamp = s->start_ms;
+s->log_susp_it = UINT32_MAX;
+s->log_begin_it = UINT32_MAX;
+s->log_end_it = UINT32_MAX;
+s->log_reason = NULL;
 /* Clearing finished. */
 s->clear = false;
 ovs_mutex_unlock(>clear_mutex);
@@ -442,6 +468,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int 
rx_packets,
 uint64_t now_tsc = cycles_counter_update(s);
 struct iter_stats *cum_ms;
 uint64_t cycles, cycles_per

[ovs-dev] [PATCH v11 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-04-12 Thread Jan Scheurich
This patch instruments the dpif-netdev datapath to record detailed
statistics of what is happening in every iteration of a PMD thread.

The collection of detailed statistics can be controlled by a new
Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
By default it is disabled. The run-time overhead, when enabled, is
in the order of 1%.

The covered metrics per iteration are:
  - cycles
  - packets
  - (rx) batches
  - packets/batch
  - max. vhostuser qlen
  - upcalls
  - cycles spent in upcalls

This raw recorded data is used threefold:

1. In histograms for each of the following metrics:
   - cycles/iteration (log.)
   - packets/iteration (log.)
   - cycles/packet
   - packets/batch
   - max. vhostuser qlen (log.)
   - upcalls
   - cycles/upcall (log)
   The histogram bins are spaced either linearly or logarithmically
   (see the sketch after this list).

2. A cyclic history of the above statistics for 999 iterations

3. A cyclic history of the cumulative/average values per millisecond
   wall clock for the last 1000 milliseconds:
   - number of iterations
   - avg. cycles/iteration
   - packets (Kpps)
   - avg. packets/batch
   - avg. max vhost qlen
   - upcalls
   - avg. cycles/upcall
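
Regarding the logarithmic histogram bins mentioned under 1. above, a sketch
of what logarithmic wall placement can look like (illustration only; the bin
count and rounding are assumptions, the real code is
histogram_walls_set_log() in this patch):

#include <math.h>
#include <stdint.h>

#define N_BINS 32                       /* Assumed number of bins. */

/* Place N_BINS histogram walls between 'min' and 'max' so that consecutive
 * walls grow by a constant factor; a zero minimum is mapped to 1 for the
 * log scale. */
static void
walls_set_log_sketch(uint32_t wall[N_BINS], uint32_t min, uint32_t max)
{
    double lo = log(min ? min : 1);
    double hi = log(max);

    for (int i = 0; i < N_BINS; i++) {
        double frac = (double) i / (N_BINS - 1);
        wall[i] = (uint32_t) round(exp(lo + frac * (hi - lo)));
    }
}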

The gathered performance metrics can be printed at any time with the
new CLI command

ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
[-pmd core] [dp]

The options are

-nh:            Suppress the histograms
-it iter_len:   Display the last iter_len iteration stats
-ms ms_len: Display the last ms_len millisecond stats
-pmd core:  Display only the specified PMD
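
For example, to show only the last 20 iteration stats of the PMD on core 1,
without histograms (illustrative values):

  ovs-appctl dpif-netdev/pmd-perf-show -nh -it 20 -pmd 1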

The performance statistics are reset with the existing
dpif-netdev/pmd-stats-clear command.

The output always contains the following global PMD statistics,
similar to the pmd-stats-show command:

Time: 15:24:55.270
Measurement duration: 1.008 s

pmd thread numa_id 0 core_id 1:

  Cycles:            2419034712  (2.40 GHz)
  Iterations:            572817  (1.76 us/it)
  - idle:                486808  (15.9 % cycles)
  - busy:                 86009  (84.1 % cycles)
  Rx packets:           2399607  (2381 Kpps, 848 cycles/pkt)
  Datapath passes:      3599415  (1.50 passes/pkt)
  - EMC hits:            336472  ( 9.3 %)
  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
  - Lost upcalls:             0  ( 0.0 %)
  Tx packets:           2399607  (2381 Kpps)
  Tx batches:            171400  (14.00 pkts/batch)
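
As a cross-check of how the derived values hang together (not part of the
tool output): 2419034712 cycles at 2.40 GHz is the 1.008 s measurement
duration, 2419034712 / 572817 iterations is about 4223 cycles or 1.76 us per
iteration, and 84.1 % busy cycles spread over 2399607 Rx packets gives
0.841 * 2419034712 / 2399607 ≈ 848 cycles/pkt.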

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 NEWS|   4 +
 lib/automake.mk |   1 +
 lib/dpif-netdev-perf.c  | 462 +++-
 lib/dpif-netdev-perf.h  | 197 ---
 lib/dpif-netdev-unixctl.man | 157 +++
 lib/dpif-netdev.c   | 187 --
 manpages.mk |   2 +
 vswitchd/ovs-vswitchd.8.in  |  27 +--
 vswitchd/vswitch.xml|  12 ++
 9 files changed, 985 insertions(+), 64 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

diff --git a/NEWS b/NEWS
index 757d648..ff81a9f 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,10 @@ Post-v2.9.0
  * implemented icmp4/icmp6/tcp_reset actions in order to drop the packet
and reply with a RST for TCP or ICMPv4/ICMPv6 unreachable message for
other IPv4/IPv6-based protocols whenever a reject ACL rule is hit.
+   - Userspace datapath:
+ * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
+ * Detailed PMD performance metrics available with new command
+ ovs-appctl dpif-netdev/pmd-perf-show
 
 v2.9.0 - 19 Feb 2018
 
diff --git a/lib/automake.mk b/lib/automake.mk
index 915a33b..3276aaa 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -491,6 +491,7 @@ MAN_FRAGMENTS += \
lib/dpctl.man \
lib/memory-unixctl.man \
lib/netdev-dpdk-unixctl.man \
+   lib/dpif-netdev-unixctl.man \
lib/ofp-version.man \
lib/ovs.tmac \
lib/service.man \
diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
index f06991a..caa0e27 100644
--- a/lib/dpif-netdev-perf.c
+++ b/lib/dpif-netdev-perf.c
@@ -15,18 +15,333 @@
  */
 
 #include 
+#include 
 
+#include "dpif-netdev-perf.h"
 #include "openvswitch/dynamic-string.h"
 #include "openvswitch/vlog.h"
-#include "dpif-netdev-perf.h"
+#include "ovs-thread.h"
 #include "timeval.h"
 
 VLOG_DEFINE_THIS_MODULE(pmd_perf);
 
+#ifdef DPDK_NETDEV
+static uint64_t
+get_tsc_hz(void)
+{
+return rte_get_tsc_hz();
+}
+#else
+/* This function is only invoked from PMD threads which depend on DPDK.
+ * A dummy function is sufficient when building without DPDK_NETDEV. */
+static uint64_t
+get_tsc_hz(void)
+{
+return 1;
+}
+#endif
+
+/* Histogram functions. */
+
+static void
+histogram_walls_set_lin(struct histogram *hist, ui

Re: [ovs-dev] [PATCH net-next 1/6] netdev-dpdk: Allow vswitchd to parse devargs as dpdk-bond args

2018-04-12 Thread Jan Scheurich
> > The bond of openvswitch has not good performance.
> 
> Any examples?

For example, balance-tcp bond mode for L34 load sharing still requires a 
recirculation after dp_hash.

I believe that it would definitely be interesting to compare bond performance 
between DPDK bonding and OVS bonding with DPDK datapath for various bond modes 
and traffic patterns.

Another interesting performance metric would be link failover times and packet 
drop (at link down and link up) in static and dynamic (LACP) bond 
configurations. That is an area where we have repeatedly seen problems with OVS 
bonding.

> 
> > In some
> > cases we would recommend that you use Linux bonds instead
> > of Open vSwitch bonds. In userspace datapath, we wants use
> > bond to improve bandwidth. The DPDK has implemented it as lib.
> 
> You could use OVS bonding for userspace datapath and it has
> good performance, especially after TX batching patch-set.
> 
> DPDK bonding has a variety of limitations like the requirement
> to call rte_eth_tx_burst and rte_eth_rx_burst with intervals
> period of less than 100ms for link aggregation modes.
> OVS could not assure that.

A periodic dummy tx burst every 100 ms is something that could easily be added
to the dpif-netdev PMD for bonded dpdk netdevs.
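
Something along these lines could do it (a rough sketch only; names,
placement and the zero-packet burst trick are assumptions, not tested code;
rte_eth_tx_burst() comes from rte_ethdev.h):

#define BOND_KEEPALIVE_INTERVAL_MS 100   /* Assumed interval. */

/* Called from the PMD loop for a bonded DPDK port; 'last_ms' would live in
 * per-port/per-PMD state in a real implementation. */
static void
bond_keepalive_sketch(uint16_t dpdk_port_id, long long now_ms,
                      long long *last_ms)
{
    if (now_ms - *last_ms >= BOND_KEEPALIVE_INTERVAL_MS) {
        /* A zero-packet burst should be enough to let the DPDK bond PMD
         * run its LACP state machine. */
        rte_eth_tx_burst(dpdk_port_id, 0, NULL, 0);
        *last_ms = now_ms;
    }
}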

> 
> >
> > These patches base DPDK bond to implement the dpdk-bond
> > device as a vswitchd interface.
> >
> > If users set the interface options with multi-pci or device names
> > with ',' as a separator, we try to parse it as dpdk-bond args.
> > For example, set an interface as:
> >
> > ovs-vsctl add-port br0 dpdk0 -- \
> > set Interface dpdk0 type=dpdk \
> > options:dpdk-devargs=:06:00.0,:06:00.1
> >
> > And now these patch support to set bond mode, such as round
> > robin, active_backup and balance and so on. Later some features
> > of bond will be supported.
> 
> Hmm, but you're already have ability to add any virtual dpdk device
> including bond devices like this:
> 
> ovs-vsctl add-port br0 bond0 -- \
> set Interface dpdk0 type=dpdk \
> 
> options:dpdk-devargs="eth_bond0,mode=2,slave=:05:00.0,slave=:05:00.1,xmit_policy=l34"
>
> So, what is the profit of this patch-set?

Thanks for the pointer. That is a valid question. 

I guess special handling like periodic dummy tx burst might have to be enabled 
based on dpdk-devargs bond configuration.

BR, Jan
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next 0/6] Add dpdk-bond support

2018-04-12 Thread Jan Scheurich
Hi Tonghao,

Thanks for working on this. That was on my backlog to try out for a while.

One immediate feedback: This is a pure OVS user space patch. Please remove the 
"net-next" tag from your patches in the next version. "net-next" is reserved 
for OVS kernel module patches that are first submitted upstream to the Linux 
"net-next" repository.

Regards, Jan

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org 
> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of xiangxia.m@gmail.com
> Sent: Thursday, 12 April, 2018 14:53
> To: d...@openvswitch.org
> Subject: [ovs-dev] [PATCH net-next 0/6] Add dpdk-bond support
> 
> From: Tonghao Zhang 
> 
> The bond of openvswitch has not good performance. In some
> cases we would recommend that you use Linux bonds instead
> of Open vSwitch bonds. In userspace datapath, we wants use
> bond to improve bandwidth. The DPDK has implemented it as lib.
> 
> These patches base DPDK bond to implement the dpdk-bond
> device as a vswitchd interface.
> 
> If users set the interface options with multi-pci or device names
> with ',' as a separator, we try to parse it as dpdk-bond args.
> For example, set an interface as:
> 
> ovs-vsctl add-port br0 dpdk0 -- \
>   set Interface dpdk0 type=dpdk \
>   options:dpdk-devargs=:06:00.0,:06:00.1
> 
> And now these patch support to set bond mode, such as round
> robin, active_backup and balance and so on. Later some features
> of bond will be supported.
> 
> These patches are RFC, any proposal will be welcome. Ignore the doc,
> if these pathes is ok for openvswitch the doc will be posted.
> 
> There are somes shell scripts, which can help us to test the patches.
> https://github.com/nickcooper-zhangtonghao/ovs-bond-tests
> 
> Tonghao Zhang (6):
>   netdev-dpdk: Allow vswitchd to parse devargs as dpdk-bond args
>   netdev-dpdk: Allow dpdk-ethdev not support setting mtu
>   netdev-dpdk: Add netdev_dpdk_bond struct
>   netdev-dpdk: Add dpdk-bond support
>   netdev-dpdk: Add check whether dpdk-port is used
>   netdev-dpdk: Add dpdk-bond mode setting
> 
>  lib/netdev-dpdk.c | 304 
> +-
>  1 file changed, 299 insertions(+), 5 deletions(-)
> 
> --
> 1.8.3.1
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations

2018-04-11 Thread Jan Scheurich
Hi Ilya,

I would not say this is expected behavior.

It seems that you are executing on a somewhat slower system (the tsc clock seems
to be 100/us = 0.1 GHz) and that, even with only 5 lines logged before and after,
the logging output is causing so much slowdown of the PMD that it continues to
cause iterations with excessive cycles (362000 = 3.62 ms!) due to logging.

The actual iteration with logging is not flagged as suspicious, but the 
subsequent iteration gets the hit of the massive cycles that have passed on the 
TSC clock. The "phantom" duration of 0 us shown is probably a side effect of 
this. 

I will try to reproduce and investigate. I will have a look at the detection 
logic to see if this can be avoided.

BR, Jan


> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, 27 March, 2018 16:05
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; ian.sto...@intel.com; billy.o.mah...@intel.com
> Subject: Re: [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious 
> PMD iterations
> 
> I see following behaviour:
> 
> 1. Configure low -us (like 100)
> 2. After that I see many logs about suspicious iterations (expected).
> 
> 2018-03-27T13:58:27Z|03574|pmd_perf(pmd7)|WARN|Suspicious iteration (Excessive total cycles): tsc=520415762246435 duration=106 us
> 2018-03-27T13:58:27Z|03575|pmd_perf(pmd7)|WARN|Neighborhood of suspicious iteration:
>    tsc                cycles  packets  cycles/pkt  pkts/batch  vhost qlen  upcalls  cycles/upcall
>    520415762297985      9711       32         303          32         424        0              0
>    520415762287041     10667       32         333          32         419        0              0
>    520415762277319      9722       32         303          32         429        0              0
>    520415762267083      9971       32         311          32         443        0              0
>    520415762257413      9670       32         302          32         451        0              0
>    520415762246435     10699       32         334          32         448        0              0
>    520415762235033     11109       32         347          32         455        0              0
>    520415762180220      9826       32         307          32         399        0              0
>    520415762169792     10229       32         319          32         413        0              0
>    520415762160385      9407       32         293          32         408        0              0
>    520415762150221      9891       32         309          32         434        0              0
> 2018-03-27T13:58:27Z|03576|pmd_perf(pmd7)|WARN|Suspicious iteration (Excessive total cycles): tsc=520415762469997 duration=104 us
> 2018-03-27T13:58:27Z|03577|pmd_perf(pmd7)|WARN|Neighborhood of suspicious iteration:
>    tsc                cycles  packets  cycles/pkt  pkts/batch  vhost qlen  upcalls  cycles/upcall
>    520415762519119      9462       32         295          32         505        0              0
>    520415762509595      9319       32         291          32         537        0              0
>    520415762500154      9283       32         290          32         569        0              0
>    520415762490585      9287       32         290          32         601        0              0
>    520415762480693      9730       32         304          32         633        0              0
>    520415762469997     10414       32         325          32         665        0              0
>    520415762459348     10342       32         323          32         697        0              0
>    520415762297985      9711       32         303          32         424        0              0
>    520415762287041     10667       32         333          32         419        0              0
>    520415762277319      9722       32         303          32         429        0              0
>    520415762267083      9971       32         311          32         443        0              0
> 
> 3. Configure back high -us (like 1000).
> 4. Logs are still there with zero duration. Logs printed every second like 
> this:
> 
> 2018-03-27T14:02:08Z|04140|pmd_perf(pmd7)|WARN|Suspicious iteration (Excessive total cycles): tsc=520437806368099 duration=0 us
> [Thread 0x7fb56f2910 (LWP 19754) exited]
> [New Thread 0x7fb56f2910 (LWP 19755)]
> 2018-03-27T14:02:08Z|04141|pmd_perf(pmd7)|WARN|Neighb

Re: [ovs-dev] [PATCH 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

2018-04-11 Thread Jan Scheurich
Hi Ychen,

Thanks a lot for your tests of corner cases and suggested bug fixes. I will 
include fixes in the next version, possibly also unit test cases for those.

A bucket weight of zero should in my eyes imply no traffic to that bucket. I 
will check how to achieve that.

I will also look into your ofproto_group_unref question.

Regards, Jan


From: ychen [mailto:ychen103...@163.com]
Sent: Wednesday, 11 April, 2018 06:16
To: Jan Scheurich <jan.scheur...@ericsson.com>
Cc: d...@openvswitch.org; Nitin Katiyar <nitin.kati...@ericsson.com>
Subject: Re:[PATCH 2/3] ofproto-dpif: Improve dp_hash selection method for 
select groups

Hi, Jan:
When I test dp_hash with the new patch, vswitchd was killed by segment 
fault in some conditions.
1. add group with no buckets, then winner will be NULL
2. add buckets with weight with 0, then winner will also be NULL

I did little modify to the patch, will you help to check whether it is correct?

diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 8f6070d..b3a9639 100755
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4773,6 +4773,8 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 webster[i].value = bucket->weight;
 i++;
 }
+//consider bucket weight equal to 0
+if (!min_weight) min_weight = 1;

 uint32_t min_slots = ceil(total_weight / min_weight);
 n_hash = MAX(16, 1L << log_2_ceil(min_slots));
@@ -4794,11 +4796,12 @@ group_setup_dp_hash_table(struct group_dpif *group, 
size_t max_hash)
 for (int hash = 0; hash < n_hash; hash++) {
 VLOG_DBG("Hash value: %d", hash);
 double max_val = 0.0;
-struct webster *winner;
+struct webster *winner = NULL;
 for (i = 0; i < n_buckets; i++) {
 VLOG_DBG("Webster[%d]: divisor=%d value=%.2f",
  i, webster[i].divisor, webster[i].value);
-if (webster[i].value > max_val) {
+// use >= in condition there is only one bucket with weight 0
+if (webster[i].value >= max_val) {
 max_val = webster[i].value;
 winner = [i];
 }
@@ -4827,7 +4830,8 @@ group_set_selection_method(struct group_dpif *group)
 group->selection_method = SEL_METHOD_DEFAULT;
 } else if (!strcmp(selection_method, "dp_hash")) {
 /* Try to use dp_hash if possible at all. */
-if (group_setup_dp_hash_table(group, 64)) {
+uint32_t n_buckets = group->up.n_buckets;
+if (n_buckets && group_setup_dp_hash_table(group, 64)) {
 group->selection_method = SEL_METHOD_DP_HASH;
 group->hash_alg = props->selection_method_param >> 32;
 if (group->hash_alg >= __OVS_HASH_MAX) {


Another question, I found in function xlate_default_select_group and 
xlate_hash_fields_select_group,
when group_best_live_bucket is NULL, it will call ofproto_group_unref,
why dp_hash function no need to call it when there is no best bucket 
found?(exp: group with no buckets)



At 2018-03-21 02:16:17, "Jan Scheurich" <jan.scheur...@ericsson.com> wrote:

>The current implementation of the "dp_hash" selection method suffers
>from two deficiences: 1. The hash mask and hence the number of dp_hash
>values is just large enough to cover the number of group buckets, but
>does not consider the case that buckets have different weights. 2. The
>xlate-time selection of best bucket from the masked dp_hash value often
>results in bucket load distributions that are quite different from the
>bucket weights because the number of available masked dp_hash values
>is too small (2-6 bits compared to 32 bits of a full hash in the default
>hash selection method).
>
>This commit provides a more accurate implementation of the dp_hash
>select group by applying the well known Webster method for distributing
>a small number of "seats" fairly over the weighted "parties"
>(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
>The dp_hash mask is autmatically chosen large enough to provide good
>enough accuracy even with widely differing weights.
>
>This distribution happens at group modification time and the resulting
>table is stored with the group-dpif struct. At xlation time, we use the
>masked dp_hash values as index to look up the assigned bucket.
>
>If the bucket should not be live, we do a circular search over the
>mapping table until we find the first live bucket. As the buckets in
>the table are by construction in pseudo-random order with a frequency
>according to their weight, this method maintains correct distribution
>even if one or more buckets are non-live.
>
>Xlation is further simplif
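
For anyone following the thread, a minimal sketch of the Webster/Sainte-Laguë
allocation described above (illustration only; the real implementation is
group_setup_dp_hash_table(), part of which is visible in the diff earlier in
this mail):

#include <stdint.h>

struct webster_sketch {
    double value;            /* weight / divisor. */
    int divisor;             /* Odd divisors 1, 3, 5, ... */
};

/* Distribute 'n_hash' hash slots over 'n_buckets' buckets with the given
 * weights using the Webster/Sainte-Lague method. */
static void
webster_allocate_sketch(const uint32_t *weight, int n_buckets,
                        int n_hash, int *slot_to_bucket)
{
    struct webster_sketch w[n_buckets];

    for (int i = 0; i < n_buckets; i++) {
        w[i].divisor = 1;
        w[i].value = weight[i];
    }
    for (int hash = 0; hash < n_hash; hash++) {
        int winner = 0;
        for (int i = 1; i < n_buckets; i++) {
            if (w[i].value > w[winner].value) {
                winner = i;
            }
        }
        slot_to_bucket[hash] = winner;
        w[winner].divisor += 2;
        w[winner].value = (double) weight[winner] / w[winner].divisor;
    }
    /* Note: a bucket with weight 0 never outranks one with a positive
     * weight, which relates to the corner case discussed above. */
}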

Re: [ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set

2018-04-09 Thread Jan Scheurich
Hi Yi,

The assertion failure is indeed caused by the incorrect implementation of 
double encap() and should be fixed by the patch you mention (which is merged to 
master by now).

Prior to the below fix this happened with every encap(nsh) in a group bucket. 

I can't say why it still happens periodically every few minutes in your test. 
You'd need to carefully analyze a crash dump to try to understand the packet 
processing history that leads to a double encap() or perhaps decap().

It is definitely worth trying whether the problem is already resolved on the 
latest master.

BR, Jan

> -Original Message-
> From: Yang, Yi Y [mailto:yi.y.y...@intel.com]
> Sent: Sunday, 08 April, 2018 10:27
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Subject: RE: [PATCH] ofp-actions: Correct execution of encap/decap actions in 
> action set
> 
> Hi, Jan
> 
> Sangfor guy tried this one, he still encountered assert issue after ovs ran 
> for about 20 minutes, moreover it appeared periodically. I'm
> not sure if https://patchwork.ozlabs.org/patch/895405/ is helpful for this 
> issue. Do you think what the root cause is?
> 
> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Monday, March 26, 2018 3:36 PM
> To: d...@openvswitch.org
> Cc: Yang, Yi Y <yi.y.y...@intel.com>; Jan Scheurich 
> <jan.scheur...@ericsson.com>
> Subject: [PATCH] ofp-actions: Correct execution of encap/decap actions in 
> action set
> 
> The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field 
> actions in ofpact_is_set_or_move_action(). This caused
> them to be executed twice in the action set or a group bucket, once 
> explicitly in
> ofpacts_execute_action_set() and once again as part of the list of set_field 
> or move actions.
> 
> Fixes: f839892a ("OF support and translation of generic encap and decap")
> Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action")
> 
> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> 
> ---
> 
> The fix should be backported to OVS 2.9 and OVS 2.8 (without the case for 
> OFPACT_DEC_NSH_TTL introduced in 2.9).
> 
> 
>  lib/ofp-actions.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c index db85716..87797bc 
> 100644
> --- a/lib/ofp-actions.c
> +++ b/lib/ofp-actions.c
> @@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
>  case OFPACT_SET_TUNNEL:
>  case OFPACT_SET_VLAN_PCP:
>  case OFPACT_SET_VLAN_VID:
> -case OFPACT_ENCAP:
> -case OFPACT_DECAP:
> -case OFPACT_DEC_NSH_TTL:
>  return true;
>  case OFPACT_BUNDLE:
>  case OFPACT_CLEAR_ACTIONS:
> @@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
>  case OFPACT_WRITE_METADATA:
>  case OFPACT_DEBUG_RECIRC:
>  case OFPACT_DEBUG_SLOW:
> +case OFPACT_ENCAP:
> +case OFPACT_DECAP:
> +case OFPACT_DEC_NSH_TTL:
>  return false;
>  default:
>  OVS_NOT_REACHED();
> --
> 1.9.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions

2018-04-06 Thread Jan Scheurich
Yes that fix should be applied to branches 2.9 and 2.8.

I checked that it applies and that all unit tests pass.

On branch-2.8 the nsh.at patch must be slightly retrofitted, as the
datapath action names changed from encap_nsh/decap_nsh to push_nsh/pop_nsh and
the nsh_ttl field was only introduced in 2.9.

diff --git a/tests/nsh.at b/tests/nsh.at
index 6ae71b5..6eb4637 100644
--- a/tests/nsh.at
+++ b/tests/nsh.at
@@ -351,7 +351,7 @@ bridge("br0")

 Final flow: unchanged
 Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
-Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x1122334
+Datapath actions: 
encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0
 ])

 AT_CHECK([
@@ -370,7 +370,7 @@ bridge("br0")

 Final flow: 
recirc_id=0x1,eth,in_port=4,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_ds
 Megaflow: 
recirc_id=0x1,packet_type=(1,0x894f),in_port=4,nsh_mdtype=1,nsh_np=3,nsh_spi
-Datapath actions: pop_nsh(),recirc(0x2)
+Datapath actions: decap_nsh(),recirc(0x2)
 ])

 AT_CHECK([
@@ -407,8 +407,8 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v ipv6 | 
sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
-recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
+recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
+recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
 
recirc_id(0x4),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
 packe
 ])

Thanks, Jan

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Friday, 06 April, 2018 18:36
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; yi.y.y...@intel.com
> Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap actions
> 
> On Fri, Apr 06, 2018 at 09:35:48AM -0700, Ben Pfaff wrote:
> > On Thu, Apr 05, 2018 at 04:11:02PM +0200, Jan Scheurich wrote:
> > > Recent tests with NSH encap have shown that the translation of multiple
> > > subsequent encap() or decap() actions was incorrect. This patch set
> > > corrects the handling and adds a unit test for NSH to cover two NSH
> > > and one Ethernet encapsulation levels.
> >
> > Thanks.  Should this be applied to branch-2.9?
> 
> To be clear, I applied it to master just now.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v10 1/3] netdev: Add optional qfill output parameter to rxq_recv()

2018-04-06 Thread Jan Scheurich
> > @@ -1846,11 +1846,24 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> >  batch->count = nb_rx;
> >  dp_packet_batch_init_packet_fields(batch);
> >
> > +if (qfill) {
> > +if (nb_rx == NETDEV_MAX_BURST) {
> > +/* The DPDK API returns a uint32_t which often has invalid 
> > bits in
> > + * the upper 16-bits. Need to restrict the value to uint16_t. 
> > */
> > +*qfill = rte_vhost_rx_queue_count(netdev_dpdk_get_vid(dev),
> 
> I lost count of how many times I talked about this. Please, don't obtain the
> 'vid' twice. You have to check the result of 'netdev_dpdk_get_vid()' always.
> Otherwise this could lead to crash.
> 
> Details, as usual, here:
> daf22bf7a826 ("netdev-dpdk: Fix calling vhost API with negative vid.")
> 
> I believe, that I already wrote this comment to one of the previous versions
> of this patch-set.

Yes, sorry I missed that one. I will for sure fix it. 

As this is fairly non-obvious from looking at the code it might be a good idea 
to add some warning comment to the function 'netdev_dpdk_get_vid()' and/or the 
places where it is used from the PMD.
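
For reference, the corrected pattern obtains and checks the vid once at the
top of the receive function and then reuses it; roughly (sketch, mirroring
the fix that went into the next revision):

    int vid = netdev_dpdk_get_vid(dev);

    if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured)) {
        return EAGAIN;          /* Guest not (yet) connected. */
    }
    /* ... rte_vhost_dequeue_burst() fills 'batch' and sets 'nb_rx' ... */
    if (qfill) {
        if (nb_rx == NETDEV_MAX_BURST) {
            /* Reuse the already checked 'vid' rather than calling
             * netdev_dpdk_get_vid() a second time. */
            *qfill = rte_vhost_rx_queue_count(vid, qid) & UINT16_MAX;
        } else {
            *qfill = 0;
        }
    }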

/Jan

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero

2018-04-06 Thread Jan Scheurich
> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Wednesday, 04 April, 2018 22:28
> 
> Oh, that's weird.  It's as if I didn't read the patch.  Maybe I just
> read some preliminary version in another thread.
> 
> Anyway, you're totally right.  I applied this to master.  If you're
> seeing problems in another branch, let me know and I will backport.

Thanks!

I think the issue was introduced into OVS by the following commit a long time 
ago. 

commit 23597df052262dec961fd86eb7c54d10984a1ec0
Author: Joe Stringer 
Date:   Fri Jul 25 13:54:24 2014 +1200

It's a temporary glitch that can cause unexpected behavior only within the 
first few hundred milliseconds after datapath flow creation. It is most likely 
to affect "reactive" controller use cases (MAC learning, ARP handling), like 
the OVN test case that now failed with a small change of timing. So it is 
possible that one could notice short packet drops or duplicate PACKET_INs in 
real SDN deployments when looking closely enough.

My preference would be to backport it all the way to OVS 2.5. But of course I 
don't have proof that there are actual problems out in the field that it would 
solve.

One could also do a systematic search of unit test cases that apply "sleep" or 
"time/warp" work-arounds for the issue and simplify these on master branch. But 
I fear I won't have time for that.

Regards, Jan

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] xlate: Correct handling of double encap() actions

2018-04-05 Thread Jan Scheurich
> >
> > This fix should be backported OVS 2.8 and 2.9
> 
> This seems tricky.  Do you plan to write a test?

Hi Ben,

I have just posted v2 where I have added an NSH unit test for double NSH plus 
Ethernet encapsulation.
The test also covers encap() actions in group buckets.

BR, Jan
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions

2018-04-05 Thread Jan Scheurich
Recent tests with NSH encap have shown that the translation of multiple
subsequent encap() or decap() actions was incorrect. This patch set
corrects the handling and adds a unit test for NSH to cover two NSH
and one Ethernet encapsulation levels.

v1->v2:
  - Rebased to master (commit 4b337e489)
  - Added NSH unit test with double encap

Jan Scheurich (2):
  xlate: Correct handling of double encap() actions
  nsh: Add unit test for double NSH encap and decap

 lib/odp-util.c   |  16 ++---
 lib/odp-util.h   |   1 +
 ofproto/ofproto-dpif-xlate.c |   7 ++-
 tests/nsh.at | 143 +++
 4 files changed, 156 insertions(+), 11 deletions(-)

-- 
1.9.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 2/2] nsh: Add unit test for double NSH encap and decap

2018-04-05 Thread Jan Scheurich
The added test verifies that OVS correctly encapsulates an Ethernet
packet with two NSH (MD1) headers, sends it with an Ethernet header
over a patch port and decaps the Ethernet and the two NSH headers on
the receiving bridge to reveal the original packet.

The test case performs the encap() operations in a sequence of three
chained groups to test the correct handling of encap() actions in
group buckets recently fixed in commit ce4a16ac0.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>

---
 tests/nsh.at | 143 +++
 1 file changed, 143 insertions(+)

diff --git a/tests/nsh.at b/tests/nsh.at
index e6a8345..7539e91 100644
--- a/tests/nsh.at
+++ b/tests/nsh.at
@@ -276,6 +276,149 @@ AT_CLEANUP
 
 
 ### -
+###   Double NSH MD1 encapsulation using groups over veth link
+### -
+
+AT_SETUP([nsh - double encap over veth link using groups])
+
+OVS_VSWITCHD_START([])
+
+AT_CHECK([
+ovs-vsctl set bridge br0 datapath_type=dummy \
+protocols=OpenFlow10,OpenFlow13,OpenFlow14,OpenFlow15 -- \
+add-port br0 p1 -- set Interface p1 type=dummy ofport_request=1 -- \
+add-port br0 p2 -- set Interface p2 type=dummy ofport_request=2 -- \
+add-port br0 v3 -- set Interface v3 type=patch options:peer=v4 
ofport_request=3 -- \
+add-port br0 v4 -- set Interface v4 type=patch options:peer=v3 
ofport_request=4])
+
+AT_DATA([flows.txt], [dnl
+table=0,in_port=1,ip,actions=group:100
+
table=0,in_port=4,packet_type=(0,0),dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:1
+
table=1,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:2
+
table=2,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344,actions=decap(),output:2
+])
+
+AT_DATA([groups.txt], [dnl
+add 
group_id=100,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x1234->nsh_spi,set_field:0x11223344->nsh_c1,group:200
+add 
group_id=200,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x5678->nsh_spi,set_field:0x55667788->nsh_c1,group:300
+add 
group_id=300,type=indirect,bucket=actions=encap(ethernet),set_field:11:22:33:44:55:66->dl_dst,3
+])
+
+AT_CHECK([
+ovs-ofctl del-flows br0
+ovs-ofctl -Oopenflow13 add-groups br0 groups.txt
+ovs-ofctl -Oopenflow13 add-flows br0 flows.txt
+ovs-ofctl -Oopenflow13 dump-flows br0 | ofctl_strip | sort | grep actions
+], [0], [dnl
+ in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 
actions=decap(),goto_table:1
+ ip,in_port=1 actions=group:100
+ table=1, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 
actions=decap(),goto_table:2
+ table=2, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344 
actions=decap(),output:2
+])
+
+# TODO:
+# The fields nw_proto, nw_tos, nw_ecn, nw_ttl in final flow seem unnecessary. 
Can they be avoided?
+# The match on dl_dst=66:77:88:99:aa:bb in the Megaflow is a side effect of 
setting the dl_dst in the pushed outer
+# Ethernet header. It is a consequence of using wc->masks both for tracking 
matched and set bits and seems hard to
+# avoid except by using separate masks for both purposes.
+
+AT_CHECK([
+ovs-appctl ofproto/trace br0 
'in_port=1,icmp,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_dst=10.10.10.10,nw_src=20.20.20.20'
+], [0], [dnl
+Flow: 
icmp,in_port=1,vlan_tci=0x,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_src=20.20.20.20,nw_dst=10.10.10.10,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0
+
+bridge("br0")
+-
+ 0. ip,in_port=1, priority 32768
+group:100
+encap(nsh(md_type=1))
+set_field:0x1234->nsh_spi
+set_field:0x11223344->nsh_c1
+group:200
+encap(nsh(md_type=1))
+set_field:0x5678->nsh_spi
+set_field:0x55667788->nsh_c1
+group:300
+encap(ethernet)
+set_field:11:22:33:44:55:66->eth_dst
+output:3
+
+bridge("br0")
+-
+ 0. in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, 
priority 32768
+decap()
+goto_table:1
+ 1. packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, 
priority 32768
+decap()
+
+Final flow: unchanged
+Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
+Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_nsh(flags=0,ttl=63,mdtype=1,np=4,spi=0x5678,si=255,c1=0x55667788,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),recirc(0x1)
+])
+
+AT_CHECK([
+ovs-appctl ofproto/trace br0 
'recirc_id=1,in_port=4,packet_type=(1,0x894f),nsh_mdtype=1,nsh_np=3,nsh_spi=0x1234,nsh_c1=0x11223344'
+], [0], [dnl
+Flow: re

[ovs-dev] [PATCH v2 1/2] xlate: Correct handling of double encap() actions

2018-04-05 Thread Jan Scheurich
When the same encap() header was pushed twice onto a packet (e.g. in the
case of NSH in NSH), the translation logic only generated a datapath push
action for the first encap() action. The second encap() did not emit a
push action because the packet type was unchanged.

commit_encap_decap_action() (renamed from commit_packet_type_change) must
solely rely on ctx->pending_encap to generate a datapath push action.

Similarly, the first decap() action on a double header packet does not
change the packet_type either. Add a corresponding ctx->pending_decap
flag and use that to trigger emitting a datapath pop action.

Fixes: f839892a2 ("OF support and translation of generic encap and decap")
Fixes: 1fc11c594 ("Generic encap and decap support for NSH")

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
---
 lib/odp-util.c   | 16 ++--
 lib/odp-util.h   |  1 +
 ofproto/ofproto-dpif-xlate.c |  7 ++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/lib/odp-util.c b/lib/odp-util.c
index 8743503..6db241a 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -7446,17 +7446,13 @@ odp_put_push_nsh_action(struct ofpbuf *odp_actions,
 }
 
 static void
-commit_packet_type_change(const struct flow *flow,
+commit_encap_decap_action(const struct flow *flow,
   struct flow *base_flow,
   struct ofpbuf *odp_actions,
   struct flow_wildcards *wc,
-  bool pending_encap,
+  bool pending_encap, bool pending_decap,
   struct ofpbuf *encap_data)
 {
-if (flow->packet_type == base_flow->packet_type) {
-return;
-}
-
 if (pending_encap) {
 switch (ntohl(flow->packet_type)) {
 case PT_ETH: {
@@ -7481,7 +7477,7 @@ commit_packet_type_change(const struct flow *flow,
  * The check is done at action translation. */
 OVS_NOT_REACHED();
 }
-} else {
+} else if (pending_decap || flow->packet_type != base_flow->packet_type) {
 /* This is an explicit or implicit decap case. */
 if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE &&
 base_flow->packet_type == htonl(PT_ETH)) {
@@ -7520,14 +7516,14 @@ commit_packet_type_change(const struct flow *flow,
 enum slow_path_reason
 commit_odp_actions(const struct flow *flow, struct flow *base,
struct ofpbuf *odp_actions, struct flow_wildcards *wc,
-   bool use_masked, bool pending_encap,
+   bool use_masked, bool pending_encap, bool pending_decap,
struct ofpbuf *encap_data)
 {
 enum slow_path_reason slow1, slow2;
 bool mpls_done = false;
 
-commit_packet_type_change(flow, base, odp_actions, wc,
-  pending_encap, encap_data);
+commit_encap_decap_action(flow, base, odp_actions, wc,
+  pending_encap, pending_decap, encap_data);
 commit_set_ether_action(flow, base, odp_actions, wc, use_masked);
 /* Make packet a non-MPLS packet before committing L3/4 actions,
  * which would otherwise do nothing. */
diff --git a/lib/odp-util.h b/lib/odp-util.h
index 1fad159..6fcd1bb 100644
--- a/lib/odp-util.h
+++ b/lib/odp-util.h
@@ -283,6 +283,7 @@ enum slow_path_reason commit_odp_actions(const struct flow 
*,
  struct flow_wildcards *wc,
  bool use_masked,
  bool pending_encap,
+ bool pending_decap,
  struct ofpbuf *encap_data);
 
 /* ofproto-dpif interface.
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 42ac118..c8baba1 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -243,6 +243,8 @@ struct xlate_ctx {
  * true. */
 bool pending_encap; /* True when waiting to commit a pending
  * encap action. */
+bool pending_decap; /* True when waiting to commit a pending
+ * decap action. */
 struct ofpbuf *encap_data;  /* May contain a pointer to an ofpbuf with
  * context for the datapath encap action.*/
 
@@ -3477,8 +3479,9 @@ xlate_commit_actions(struct xlate_ctx *ctx)
 ctx->xout->slow |= commit_odp_actions(>xin->flow, >base_flow,
   ctx->odp_actions, ctx->wc,
   use_masked, ctx->pending_encap,
-  ctx->encap_data);
+  ctx->pending_decap, ctx->encap_data);
 ctx->pending_encap = false;
+ctx->pending_de

[ovs-dev] [PATCH v4 2/2] xlate: Move tnl_neigh_snoop() to terminate_native_tunnel()

2018-04-05 Thread Jan Scheurich
From: Zoltan Balogh <zoltan.balogh@gmail.com>

Currently OVS snoops any ARP or ND packets in any bridge and populates
the tunnel neighbor cache with the retrieved data. For instance, when
an ARP reply originated by a tenant is received in an overlay bridge, the
ARP packet is snooped and tunnel neighbor cache is filled with tenant
address information. This is at best useless as tunnel endpoints can only
reside on an underlay bridge.

The real problem starts if different tenants on the overlay bridge have
overlapping IP addresses such that they keep overwriting each other's
pseudo tunnel neighbor entries. These frequent updates are treated as
configuration changes and trigger revalidation each time, thus causing
a lot of useless revalidation load on the system.

To keep the ARP neighbor cache clean, this patch moves tunnel neighbor
snooping from the generic function do_xlate_actions() to the specific
function terminate_native_tunnel() in compose_output_action(). Thus,
only ARP and Neighbor Advertisement packets addressing a local
tunnel endpoint (on the LOCAL port of the underlay bridge) are snooped.

In order to achieve this, IP addresses of the bridge ports are retrieved
and then stored in xbridge by calling xlate_xbridge_set(). The
destination address extracted from the ARP or Neighbor Advertisement
packet is then matched against the known xbridge addresses in
is_neighbor_reply_correct() to filter the snooped packets further.

Signed-off-by: Zoltan Balogh <zoltan.balogh@gmail.com>
Co-authored-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
---
 include/sparse/netinet/in.h   |  10 +++
 ofproto/ofproto-dpif-xlate.c  | 147 --
 tests/tunnel-push-pop-ipv6.at |  68 ++-
 tests/tunnel-push-pop.at  |  67 ++-
 4 files changed, 282 insertions(+), 10 deletions(-)

diff --git a/include/sparse/netinet/in.h b/include/sparse/netinet/in.h
index 6abdb23..eea41bd 100644
--- a/include/sparse/netinet/in.h
+++ b/include/sparse/netinet/in.h
@@ -123,6 +123,16 @@ struct sockaddr_in6 {
  (X)->s6_addr[10] == 0xff &&\
  (X)->s6_addr[11] == 0xff)
 
+#define IN6_IS_ADDR_MC_LINKLOCAL(a) \
+(((const uint8_t *) (a))[0] == 0xff &&  \
+ (((const uint8_t *) (a))[1] & 0xf) == 0x2)
+
+# define IN6_ARE_ADDR_EQUAL(a,b)  \
+const uint32_t *) (a))[0] == ((const uint32_t *) (b))[0]) &&  \
+ (((const uint32_t *) (a))[1] == ((const uint32_t *) (b))[1]) &&  \
+ (((const uint32_t *) (a))[2] == ((const uint32_t *) (b))[2]) &&  \
+ (((const uint32_t *) (a))[3] == ((const uint32_t *) (b))[3]))
+
 #define INET_ADDRSTRLEN 16
 #define INET6_ADDRSTRLEN 46
 
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 42ac118..f593a6e 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -91,6 +91,16 @@ VLOG_DEFINE_THIS_MODULE(ofproto_dpif_xlate);
  * recursive or not. */
 #define MAX_RESUBMITS (MAX_DEPTH * MAX_DEPTH)
 
+/* The structure holds an array of IP addresses assigned to a bridge and the
+ * number of elements in the array. These data are mutable and are evaluated
+ * when ARP or Neighbor Advertisement packets received on a native tunnel
+ * port are xlated. So 'ref_cnt' and RCU are used for synchronization. */
+struct xbridge_addr {
+struct in6_addr *addr;/* Array of IP addresses of xbridge. */
+int n_addr;   /* Number of IP addresses. */
+struct ovs_refcount ref_cnt;
+};
+
 struct xbridge {
 struct hmap_node hmap_node;   /* Node in global 'xbridges' map. */
 struct ofproto_dpif *ofproto; /* Key in global 'xbridges' map. */
@@ -114,6 +124,8 @@ struct xbridge {
 
 /* Datapath feature support. */
 struct dpif_backer_support support;
+
+struct xbridge_addr *addr;
 };
 
 struct xbundle {
@@ -582,7 +594,8 @@ static void xlate_xbridge_set(struct xbridge *, struct dpif 
*,
   const struct dpif_ipfix *,
   const struct netflow *,
   bool forward_bpdu, bool has_in_band,
-  const struct dpif_backer_support *);
+  const struct dpif_backer_support *,
+  const struct xbridge_addr *);
 static void xlate_xbundle_set(struct xbundle *xbundle,
   enum port_vlan_mode vlan_mode,
   uint16_t qinq_ethtype, int vlan,
@@ -836,6 +849,56 @@ xlate_xport_init(struct xlate_cfg *xcfg, struct xport 
*xport)
 uuid_hash(>uuid));
 }
 
+static struct xbridge_addr *
+xbridge_addr_create(struct xbridge *xbridge)
+{
+struct xbridge_addr *xbridge_addr = xbridge->addr;
+struct in6_addr *addr = N

[ovs-dev] [PATCH v4 1/2] tests: Inject ARP replies for snoop tests on different port

2018-04-05 Thread Jan Scheurich
From: Zoltan Balogh <zoltan.balogh@gmail.com>

The ARP replies injected into the underlay bridge 'br0' to trigger
ARP snooping should be destined to the bridge's LOCAL port. So far
the tests injected them on the LOCAL port 'br0' itself, which didn't matter
as OVS snooped on all ARP packets passing the bridge.

This patch injects the ARP replies on a different port in preparation for
an upcoming commit that will make OVS only snoop on ARP packets output
to the LOCAL port.

The clone() wrapper must be added to the generated datapath flows now as
the traced packets would actually be transmitted through the tunnel port.
Previously the underlay bridge dropped the packets as the learned egress
port for the tunnel nexthop was the LOCAL port, which also served as
virtual ingress port for the encapsulated traffic. The translation
end result was an expensive way to say 'drop'.

Signed-off-by: Zoltan Balogh <zoltan.balogh@gmail.com>
Co-authored-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
---
 tests/tunnel-push-pop-ipv6.at | 14 +++---
 tests/tunnel-push-pop.at  | 24 
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/tests/tunnel-push-pop-ipv6.at b/tests/tunnel-push-pop-ipv6.at
index 7ca522a..29bc1f3 100644
--- a/tests/tunnel-push-pop-ipv6.at
+++ b/tests/tunnel-push-pop-ipv6.at
@@ -55,9 +55,9 @@ AT_CHECK([cat p0.pcap.txt | grep 
93aa55aa5586dd60203aff2001cafe | un
 ])
 
 dnl Check ARP Snoop
-AT_CHECK([ovs-appctl netdev-dummy/receive br0 
'in_port(100),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::92,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::92,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b6)'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'in_port(1),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::92,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::92,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b6)'])
 
-AT_CHECK([ovs-appctl netdev-dummy/receive br0 
'in_port(100),eth(src=f8:bc:12:44:34:b7,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::93,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::93,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b7)'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'in_port(1),eth(src=f8:bc:12:44:34:b7,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::93,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::93,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b7)'])
 
 AT_CHECK([ovs-appctl tnl/arp/show | tail -n+3 | sort], [0], [dnl
 2001:cafe::92 f8:bc:12:44:34:b6   br0
@@ -93,28 +93,28 @@ dnl Check VXLAN tunnel push
 AT_CHECK([ovs-ofctl add-flow int-br action=2])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 
'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:01),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'],
 [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: 
tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7b)),out_port(100))
+  [Datapath actions: 
clone(tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7b)),out_port(100)),1)
 ])
 
 dnl Check VXLAN tunnel push set tunnel id by flow and checksum
 AT_CHECK([ovs-ofctl add-flow int-br "actions=set_tunnel:124,4"])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 
'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:01),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'],
 [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: 
tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b7,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::93,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7c)),out_port(100))
+  [Datapath actions: 
clone(tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b7,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::93,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7c)),out_port(100)),1)
 ])
 
 dnl Check GRE tunnel push
 AT_CHECK([ovs-ofctl add-flow int-br action=3])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 
'in_port(2),eth(src=f8

[ovs-dev] [PATCH v4 0/2] Fix tunnel neighbor cache population

2018-04-05 Thread Jan Scheurich
Currently, OVS snoops any ARP or ND packets in any bridge and populates
the tunnel neighbor cache with the retrieved data. For instance, when
an ARP reply originated by a tenant is received on an overlay bridge, the
ARP packet is snooped and the tunnel neighbor cache is filled with tenant
addresses, even though only actual tunnel neighbor data should be stored
there. In the worst case tunnel peer data could be overwritten in the cache.

This series resolves the issue by limiting the range of ARP and ND
packets being snooped to only those that are addressed to potential
local tunnel endpoints.

v3 -> v4:
  - Rebased to master (commit 4b337e489b)
  - Failing unit test case with v3 fixed by commit 8f0e86f84
  - Improved commit messages

Zoltan Balogh (2):
  tests: Inject ARP replies for snoop tests on different port
  xlate: Move tnl_neigh_snoop() to terminate_native_tunnel()

 include/sparse/netinet/in.h   |  10 +++
 ofproto/ofproto-dpif-xlate.c  | 147 --
 tests/tunnel-push-pop-ipv6.at |  78 --
 tests/tunnel-push-pop.at  |  91 ++
 4 files changed, 299 insertions(+), 27 deletions(-)

-- 
1.9.1



Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero

2018-04-04 Thread Jan Scheurich
Thanks Ben,

I hope what my patch does is precisely what you suggest. I had the same 
thoughts when I had a closer look at the code.

Regards, Jan

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Wednesday, 04 April, 2018 19:20
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com>; 
> jpet...@ovn.org
> Subject: Re: [PATCH] ofproto-dpif: Init ukey->dump_seq to zero
> 
> On Wed, Apr 04, 2018 at 01:26:02PM +0200, Jan Scheurich wrote:
> > In the current implementation the dump_seq of a new datapath flow ukey
> > is set to seq_read(udpif->dump_seq). This implies that any revalidation
> > during the current dump_seq period (up to 500 ms) is skipped.
> >
> > This can trigger incorrect behavior, for example when the creation of a
> > datapath flow triggers a PACKET_IN to the controller, whereupon the
> > controller installs a new flow entry that should invalidate the
> > original datapath flow.
> >
> > Initializing ukey->dump_seq to zero implies that the first dump of the
> > flow, be it for revalidation or dumping statistics, will always be
> > executed as zero is not a valid value of the ovs_seq.
> >
> > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> 
> If we are going to do this, then we should delete the 'dump_seq' member
> of struct upcall, because it will always be zero.  It is also worth
> considering whether the other caller of ukey_create__() should pass 0,
> and if so then we can delete the 'dump_seq' parameter of
> ukey_create__().
> 
> Thanks,
> 
> Ben.


[ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero

2018-04-04 Thread Jan Scheurich
In the current implementation the dump_seq of a new datapath flow ukey
is set to seq_read(udpif->dump_seq). This implies that any revalidation
during the current dump_seq period (up to 500 ms) is skipped.

This can trigger incorrect behavior, for example when the creation of a
datapath flow triggers a PACKET_IN to the controller, whereupon the
controller installs a new flow entry that should invalidate the
original datapath flow.

Initializing ukey->dump_seq to zero implies that the first dump of the
flow, be it for revalidation or dumping statistics, will always be
executed as zero is not a valid value of the ovs_seq.
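
For illustration, the revalidator-side check this relies on looks roughly as
follows (simplified sketch, not the literal code in ofproto-dpif-upcall.c):

    /* seq_read() never returns zero, so a ukey created with dump_seq == 0
     * can never be skipped on its first dump. */
    uint64_t dump_seq = seq_read(udpif->dump_seq);

    if (ukey->dump_seq == dump_seq) {
        /* Flow was already handled during this dump period: skip it. */
        return;
    }
    /* ... revalidate the flow and/or push its statistics ... */
    ukey->dump_seq = dump_seq;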

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>

---
 ofproto/ofproto-dpif-upcall.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 7bfeedd..00160e1 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -231,7 +231,6 @@ struct upcall {
 bool ukey_persists;/* Set true to keep 'ukey' beyond the
   lifetime of this upcall. */
 
-uint64_t dump_seq; /* udpif->dump_seq at translation time. */
 uint64_t reval_seq;/* udpif->reval_seq at translation time. */
 
 /* Not used by the upcall callback interface. */
@@ -1159,7 +1158,6 @@ upcall_xlate(struct udpif *udpif, struct upcall *upcall,
  * with pushing its stats eventually. */
 }
 
-upcall->dump_seq = seq_read(udpif->dump_seq);
 upcall->reval_seq = seq_read(udpif->reval_seq);
 
 xerr = xlate_actions(, >xout);
@@ -1633,7 +1631,7 @@ ukey_create__(const struct nlattr *key, size_t key_len,
   const struct nlattr *mask, size_t mask_len,
   bool ufid_present, const ovs_u128 *ufid,
   const unsigned pmd_id, const struct ofpbuf *actions,
-  uint64_t dump_seq, uint64_t reval_seq, long long int used,
+  uint64_t reval_seq, long long int used,
   uint32_t key_recirc_id, struct xlate_out *xout)
 OVS_NO_THREAD_SAFETY_ANALYSIS
 {
@@ -1654,7 +1652,7 @@ ukey_create__(const struct nlattr *key, size_t key_len,
 ukey_set_actions(ukey, actions);
 
 ovs_mutex_init(>mutex);
-ukey->dump_seq = dump_seq;
+ukey->dump_seq = 0; /* Not yet dumped */
 ukey->reval_seq = reval_seq;
 ukey->state = UKEY_CREATED;
 ukey->state_thread = ovsthread_id_self();
@@ -1704,8 +1702,7 @@ ukey_create_from_upcall(struct upcall *upcall, struct 
flow_wildcards *wc)
 
 return ukey_create__(keybuf.data, keybuf.size, maskbuf.data, maskbuf.size,
  true, upcall->ufid, upcall->pmd_id,
- >put_actions, upcall->dump_seq,
- upcall->reval_seq, 0,
+ >put_actions, upcall->reval_seq, 0,
  upcall->have_recirc_ref ? upcall->recirc->id : 0,
  >xout);
 }
@@ -1717,7 +1714,7 @@ ukey_create_from_dpif_flow(const struct udpif *udpif,
 {
 struct dpif_flow full_flow;
 struct ofpbuf actions;
-uint64_t dump_seq, reval_seq;
+uint64_t reval_seq;
 uint64_t stub[DPIF_FLOW_BUFSIZE / 8];
 const struct nlattr *a;
 unsigned int left;
@@ -1754,12 +1751,11 @@ ukey_create_from_dpif_flow(const struct udpif *udpif,
 }
 }
 
-dump_seq = seq_read(udpif->dump_seq);
 reval_seq = seq_read(udpif->reval_seq) - 1; /* Ensure revalidation. */
 ofpbuf_use_const(, >actions, flow->actions_len);
 *ukey = ukey_create__(flow->key, flow->key_len,
   flow->mask, flow->mask_len, flow->ufid_present,
-  >ufid, flow->pmd_id, , dump_seq,
+  >ufid, flow->pmd_id, ,
   reval_seq, flow->stats.used, 0, NULL);
 
 return 0;
-- 
1.9.1



Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from terminate_native_tunnel()

2018-04-04 Thread Jan Scheurich
d.c:157
2018-03-28T18:41:24.965Z|00012|poll_loop(urcu2)|DBG|wakeup due to [POLLIN] on 
fd 24 (FIFO pipe:[16413512]) at lib/ovs-rcu.c:235
2018-03-28T18:41:24.982Z|00328|poll_loop|DBG|wakeup due to [POLLIN] on fd 46 
(/opt/ovs/tests/testsuite.dir/2487/hv1/br-int.mgmt<->) at lib/stream-fd.c:157
2018-03-28T18:41:24.983Z|00329|vconn|DBG|unix#3: received: OFPT_PACKET_OUT 
(OF1.3) (xid=0xd3): in_port=CONTROLLER 
actions=set_field:0xa02->reg0,set_field:0xac100101->reg1,set_field:0x1->reg10,set_field:0x5->reg11,set_field:0x7->reg12,set_field:0x1->reg14,set_field:0x2->reg15,set_field:0x1->metadata,set_field:ff:ff:ff:ff:ff:ff->eth_dst,move:NXM_NX_XXREG0[64..95]->NXM_OF_ARP_SPA[],move:NXM_NX_XXREG0[96..127]->NXM_OF_ARP_TPA[],set_field:1->arp_op,resubmit(,32)
 data_len=42
arp,vlan_tci=0x,dl_src=00:00:00:01:02:04,dl_dst=00:00:00:00:00:00,arp_spa=192.168.1.2,arp_tpa=10.0.0.2,arp_op=1,arp_sha=00:00:00:01:02:04,arp_tha=00:00:00:00:00:00
2018-03-28T18:41:24.983Z|00330|dpif_netdev|DBG|ovs-system: action upcall:
skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),packet_type(ns=0,id=0),eth(src=f0:00:00:01:02:04,dst=00:00:00:01:02:04),eth_type(0x0806),arp(sip=10.0.0.2,tip=172.16.1.1,op=2,sha=f0:00:00:01:02:04,tha=00:00:00:01:02:04)
arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04
2018-03-28T18:41:24.983Z|00331|dpif|DBG|system@ovs-system: execute 
userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=2,rule_cookie=0xd8bfa54f,controller_id=0,max_len=65535))
 on packet 
arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04
 with metadata skb_priority(0),skb_mark(0) mtu 0
2018-03-28T18:41:24.983Z|00332|dpif|DBG|system@ovs-system: sub-execute 
userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=2,rule_cookie=0xd8bfa54f,controller_id=0,max_len=65535))
 on packet 
arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04
 with metadata skb_priority(0),skb_mark(0) mtu 0
2018-03-28T18:41:24.983Z|00333|poll_loop|DBG|wakeup due to 0-ms timeout at 
ofproto/ofproto-dpif.c:1713
2018-03-28T18:41:24.984Z|00334|vconn|DBG|unix#3: sent (Success): NXT_PACKET_IN2 
(OF1.3) (xid=0x0): table_id=9 cookie=0xd8bfa54f total_len=42 
reg0=0xa02,reg11=0x5,reg12=0x7,reg14=0x2,metadata=0x1,in_port=0 (via 
action) data_len=42 (unbuffered)
 userdata=00.00.00.01.00.00.00.00
arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04
2018-03-28T18:41:24.996Z|00335|poll_loop|DBG|wakeup due to [POLLIN] on fd 45 
(/opt/ovs/tests/testsuite.dir/2487/hv1/br-int.mgmt<->) at lib/stream-fd.c:157
2018-03-28T18:41:24.996Z|00336|vconn|DBG|unix#2: received: OFPT_FLOW_MOD 
(OF1.3) (xid=0xd4): ADD table:66 
priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 
actions=set_field:f0:00:00:01:02:04->eth_dst
2018-03-28T18:41:24.996Z|00337|poll_loop|DBG|wakeup due to 0-ms timeout at 
ofproto/ofproto-dpif.c:1709
2018-03-28T18:41:24.996Z|00013|poll_loop(urcu2)|DBG|wakeup due to [POLLIN] on 
fd 24 (FIFO pipe:[16413512]) at lib/ovs-rcu.c:358
2018-03-28T18:41:24.996Z|00024|poll_loop(revalidator6)|DBG|wakeup due to 
[POLLIN] on fd 35 (FIFO pipe:[16413516]) at ofproto/ofproto-dpif-upcall.c:948
2018-03-28T18:41:24.996Z|00025|dpif(revalidator6)|DBG|system@ovs-system: 
flow_dump ufid:de19ea9b-110b-4476-b8f7-df5fac3d07c5 , packets:0, 
bytes:0, used:never
2018-03-28T18:41:24.996Z|00026|ofproto_dpif_upcall(revalidator6)|WARN|Flow 
already dumped
2018-03-28T18:41:24.996Z|00027|dpif(revalidator6)|DBG|system@ovs-system: dumped 
all flows
2018-03-28T18:41:24.997Z|00028|dpif(revalidator6)|DBG|system@ovs-system: 
flow_dump_destroy success



> -Original Message-
> From: Jan Scheurich
> Sent: Wednesday, 04 April, 2018 01:52
> To: Zoltán Balogh <zoltan.bal...@ericsson.com>; Justin Pettit 
> <jpet...@ovn.org>; Ben Pfaff <b...@ovn.org>
> Cc: g...@ovn.org; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from 
> terminate_native_tunnel()
> 
> Hi,
> 
> I took this over from Zoltan and investigated the failing unit test a bit 
> further.
> 
> The essential difference between master and Zoltan's (rebased) patches is 
> that the dynamic "ARP cache" flow entry
> 
>table=66,priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 
> actions=set_field:f0:00:00:01:02:04->eth_dst
> 
> (which OVN installs in response to the ARP reply it receives from OVS in 
> PACKET_IN after the first injected IP pa

Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from terminate_native_tunnel()

2018-04-03 Thread Jan Scheurich
Hi,

I took this over from Zoltan and investigated the failing unit test a bit 
further.

The essential difference between master and Zoltan's (rebased) patches is that 
the dynamic "ARP cache" flow entry 

   table=66,priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 
actions=set_field:f0:00:00:01:02:04->eth_dst

(which OVN installs in response to the ARP reply it receives from OVS in 
PACKET_IN after the first injected IP packet) triggers different behavior in 
the subsequent revalidation:

In the case of master the existing datapath flow entry
 
recirc_id(0),in_port(3),packet_type(ns=0,id=0),eth(src=f0:00:00:01:02:03,dst=00:00:00:01:02:03),eth_type(0x0800),ipv4(src=192.168.1.2/255.255.255.254,dst=10.0.0.2/248.0.0.0,ttl=64,frag=no),
 
actions:ct_clear,set(eth(src=00:00:00:01:02:04,dst=00:00:00:00:00:00)),set(ipv4(src=192.168.1.2/255.255.255.254,dst=8.0.0.0/248.0.0.0,ttl=63)),userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=1,rule_cookie=0x62066318,controller_id=0,max_len=65535))

is correctly deleted during revalidation due to the presence of the new rule in 
table 66, while with Zoltan's patch that same datapath flow entry remains 
untouched. It seems as if with the patch the revalidation of the datapath flow 
does not reach the new rule in table 66. 

Consequently the next IP packet injected by netdev-dummy/receive then matches 
the datapath flow and is sent up to OVN again, where it triggers the same ARP 
query procedure as before.

However, if I add a "sleep 11" before injecting the second IP packet to let the 
datapath flow entry time out, the test succeeds, proving that the OF pipeline 
behaves correctly for a real packet but not when revalidating the datapath flow 
entry with the userspace(controller(...)) action.

I still do not understand how the (passive) tunnel ARP snooping patch can 
possibly change the translation behavior for an IP packet in such a way that it 
affects the revalidation result.

Any idea is welcome.

Regards, Jan


BTW: The “pseudo” tunnel neighbor cache entry we get in this test on master for 
the tenant IP address 10.0.0.2

IP              MAC                 Bridge
==========================================
10.0.0.2        f0:00:00:01:02:04   br-int
192.168.0.2     5e:97:6a:82:7d:41   br-phys

is a good example why we need Zoltan's patch. Any IP address in an ARP reply is 
blindly inserted into the tunnel neighbor cache. Overlapping IP addresses among 
tenants can cause frequent overwriting of cache entries, in the worst case 
leading to continuous configuration changes and revalidation.


> -Original Message-
> From: Zoltán Balogh
> Sent: Friday, 02 February, 2018 15:42
> To: Justin Pettit <jpet...@ovn.org>; Ben Pfaff <b...@ovn.org>
> Cc: g...@ovn.org; d...@openvswitch.org; Jan Scheurich 
> <jan.scheur...@ericsson.com>
> Subject: RE: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from 
> terminate_native_tunnel()
> 
> Hi Justin,
> 
> I rebased the patches to recent master. Please find them attached.
> 
> Best regards,
> Zoltan
> 
> > -Original Message-
> > From: Justin Pettit [mailto:jpet...@ovn.org]
> > Sent: Friday, February 02, 2018 12:00 AM
> > To: Ben Pfaff <mailto:b...@ovn.org>
> > Cc: Zoltán Balogh <mailto:zoltan.bal...@ericsson.com>; mailto:g...@ovn.org; 
> > mailto:d...@openvswitch.org; Jan Scheurich
> > <mailto:jan.scheur...@ericsson.com>
> > Subject: Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from 
> > terminate_native_tunnel()
> >
> > I wasn't able to get this patch to apply to the tip of master.  Zoltan, can 
> > you rebase this patch and repost?
> >
> > The main thing my patch series does is make it so that packets that have a 
> > controller action aren't processed
> > entirely in userspace.  If, for example, the patches expect packets to be 
> > in userspace without an explicit slow-path
> > request when generating the datapath flow, then that would be a problem.
> >
> > --Justin
> >
> >
> > > On Feb 1, 2018, at 2:17 PM, Ben Pfaff <mailto:b...@ovn.org> wrote:
> > >
> > > Justin, I think this is mainly a question about your patches, can you
> > > take a look?
> > >
> > > On Fri, Jan 26, 2018 at 01:08:35PM +, Zoltán Balogh wrote:
> > >> Hi,
> > >>
> > >> I've been investigating the failing unit test. I can confirm, it does 
> > >> fail
> > >> with my series on master. However, when I created the series and sent it 
> > >> to
> > >> the mailing list it did not.
> > >>
> > >> I've rebased my series to this commit before I sent it to the mailing 
> 

Re: [ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set

2018-04-03 Thread Jan Scheurich
> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Tuesday, 03 April, 2018 20:36
> 
> On Mon, Mar 26, 2018 at 09:36:27AM +0200, Jan Scheurich wrote:
> > The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field
> > actions in ofpact_is_set_or_move_action(). This caused them to be executed
> > twice in the action set or a group bucket, once explicitly in
> > ofpacts_execute_action_set() and once again as part of the list of
> > set_field or move actions.
> >
> > Fixes: f839892a ("OF support and translation of generic encap and decap")
> > Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action")
> >
> > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> >
> > ---
> >
> > The fix should be backported to OVS 2.9 and OVS 2.8 (without the case
> > for OFPACT_DEC_NSH_TTL introduced in 2.9).
> 
> Thanks, I applied this to master and backported to branch-2.9 and
> branch-2.8.

Thanks!

> I want to encourage you to add a test that uses one or both of these
> actions in an action set, so that we get some test coverage.

Will do when I find the time. I thought about that but didn't want to delay the 
fix.

Regards, Jan



Re: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer VTEP mac changed

2018-03-29 Thread Jan Scheurich
Hi Ychen,

If your tunnel NH is moving IP addresses between MAC addresses or changing the 
MAC address of an interface hosting the NH IP, I think it should send a GARP to 
inform the connected subnet about this change. Otherwise the neighbors will 
blackhole traffic by sending to the wrong MAC address until they refresh their 
ARP cache.
What kind of tunnel NH are you using? A Linux host?

OVS has no means to buffer packets while it is waiting for the ARP reply for an 
ARP request it has sent out to resolve the tunnel NH. The OVS datapath 
(including the slowpath) is essentially stateless. This is quite different from 
the Linux kernel.

Your suggestion to let OVS snoop on any incoming tunnel packet to refresh the 
ARP cache for the tunnel NH seems impractical to me. Normally the tunnel 
packets are matched by megaflows in the netdev datapath, the tunnel headers are 
stripped and the packet is recirculated. All this happens down in the datapath, 
while the ARP cache lives in the ofproto layer.

A safe way would be to trigger an ARP refresh in the ofproto layer for every 
known tunnel neighbor some time before the expiry of its cache entry, similar 
to what real IP stacks do. But that would duplicate the already available 
function on the host. I'd rather try to find out why the host is not refreshing 
the ARP entry or why OVS does not detect the reply to the ARP refresh.

BR, Jan

From: ychen [mailto:ychen103...@163.com]
Sent: Wednesday, 28 March, 2018 06:17
To: Jan Scheurich <jan.scheur...@ericsson.com>
Cc: d...@openvswitch.org; Manohar Krishnappa Chidambaraswamy 
<manohar.krishnappa.chidambarasw...@ericsson.com>
Subject: RE: [ovs-dev] can not update userspace vxlan tunnel neigh mac when 
peer VTEP mac changed


Hi Jan,
  Thanks for your reply.
  We have already modified the code to snoop on the GARP packets, but these 2 problems
still exist.
   I think the main problem is that GARP packets are not sent from
interfaces when we change the NIC MAC address or IP address (reading the Linux kernel
code, there is no such process),
   so we must depend on a data packet to trigger the ARP request.
  I know that in the Linux kernel, when an ARP request is triggered, data packets are
cached for a specified time, so the first data packet can still be sent out
when the ARP reply is received.

  For the second problem, can we update the tunnel neigh cache when we receive a data
packet from the remote VTEP, since we can fetch tun_src and the outer MAC SA from the
data packet?




At 2018-03-28 04:41:12, "Jan Scheurich" 
<jan.scheur...@ericsson.com<mailto:jan.scheur...@ericsson.com>> wrote:

>Hi Ychen,

>

>Funny! Again we are already working on a solution for problem 1.

>

>In our scenario the situation arises with a tunnel next hop being a VRRP 
>switch pair. The switch sends periodic gratuitous ARPs (GARPs) to announce the 
>VRRP IP but OVS native tunneling doesn't snoop on GARPs, only on ARP 
>replies. The host IP stack, on the other hand, accepts these GARPs and stops 
>sending refresh ARP requests itself. Hence nothing for OVS to snoop upon.

>

>The solution is to make OVS snoop on GARP requests also.

>

>It is quite possible that this will also fix your problem 2. If you also have 
>a VRRP tunnel next hop which just moves its VRRP IP address but not the MAC 
>address, it should send a GARP with the new IP/MAC mapping when it moves the IP 
>address, which would now update OVS' tunnel neighbor cache.

>

>@Mano: Can you submit the GARP patch in the near future?

>

>BR, Jan

>

>> -Original Message-

>> From: 
>> ovs-dev-boun...@openvswitch.org<mailto:ovs-dev-boun...@openvswitch.org> 
>> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of ychen

>> Sent: Tuesday, 27 March, 2018 14:44

>> To: d...@openvswitch.org<mailto:d...@openvswitch.org>

>> Subject: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer 
>> VTEP mac changed

>>

>> Hi,

>>I found that sometime userspace vxlan can not work happily.

>>1.  first data packet loss

>> when tunnel neigh cache is empty, then the first data packet 
>> triggered  sending ARP packet to peer VTEP, and the data packet

>> dropped,

>> tunnel neigh cache added this entry when receive ARP reply packet.

>>

>> err = tnl_neigh_lookup(out_dev->xbridge->name, _ip6, );

>>if (err) {

>> xlate_report(ctx, OFT_DETAIL,

>>  "neighbor cache miss for %s on bridge %s, "

>>  "sending %s request",

>>  buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND");

>> if (d_ip) {

>> tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip);

>> } else

Re: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer VTEP mac changed

2018-03-27 Thread Jan Scheurich
Hi Ychen,

Funny! Again we are already working on a solution for problem 1. 

In our scenario the situation arises with a tunnel next hop being a VRRP switch 
pair. The switch sends periodic gratuitous ARPs (GARPs) to announce the VRRP 
IP but OVS native tunneling doesn't snoop on GARPs, only on ARP replies. 
The host IP stack, on the other hand, accepts these GARPs and stops sending 
refresh ARP requests itself. Hence nothing for OVS to snoop upon.

The solution is to make OVS snoop on GARP requests also.
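
A rough sketch of such a check (assumed logic, not the submitted patch; for ARP
packets OVS stores the SPA/TPA in flow->nw_src/nw_dst and the opcode in
flow->nw_proto):

    /* Treat a gratuitous ARP request (sender IP == target IP) like an ARP
     * reply for the purpose of tunnel neighbor learning. */
    static bool
    is_gratuitous_arp(const struct flow *flow)
    {
        return flow->dl_type == htons(ETH_TYPE_ARP)
               && flow->nw_proto == ARP_OP_REQUEST
               && flow->nw_src == flow->nw_dst;
    }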
 
It is quite possible that this will also fix your problem 2. If you also have a 
VRRP tunnel next hop which just moves its VRRP IP address but not the MAC 
address, it should send a GARP with the new IP/MAC mapping when it moves the IP 
address, which would now update OVS' tunnel neighbor cache.

@Mano: Can you submit the GARP patch in the near future?

BR, Jan

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org 
> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of ychen
> Sent: Tuesday, 27 March, 2018 14:44
> To: d...@openvswitch.org
> Subject: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer 
> VTEP mac changed
> 
> Hi,
>I found that sometime userspace vxlan can not work happily.
>1.  first data packet loss
> when tunnel neigh cache is empty, then the first data packet 
> triggered  sending ARP packet to peer VTEP, and the data packet
> dropped,
> tunnel neigh cache added this entry when receive ARP reply packet.
> 
> err = tnl_neigh_lookup(out_dev->xbridge->name, _ip6, );
>if (err) {
> xlate_report(ctx, OFT_DETAIL,
>  "neighbor cache miss for %s on bridge %s, "
>  "sending %s request",
>  buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND");
> if (d_ip) {
> tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip);
> } else {
> tnl_send_nd_request(ctx, out_dev, smac, _ip6, _ip6);
> }
> return err;
> }
> 
> 
> 2. connection lost when peer VTEP mac changed
> when VTEP mac is already in tunnel neigh cache,   exp:
> 10.182.6.81   fa:eb:26:c3:16:a5   br-phy
> 
> so when data packet come in,  it will use this mac for encaping outer 
> VXLAN header.
> but VTEP 10.182.6.81  mac changed from  fa:eb:26:c3:16:a5 to  
> 24:eb:26:c3:16:a5 because of NIC changed.
> 
> data packet continue sending with the old mac  fa:eb:26:c3:16:a5, but the 
> peer VTEP will not accept these packets because of mac
> not match.
> the wrong tunnel neigh entry aging until the data packet stop sending.
> 
> 
>if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
> tnl_neigh_snoop(flow, wc, ctx->xbridge->name);
> }
> 
> 
> 3. is there anybody has working for these problems?
> 
> 
> 


Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-03-27 Thread Jan Scheurich
> -Original Message-
> From: Stokes, Ian [mailto:ian.sto...@intel.com]
> Sent: Tuesday, 27 March, 2018 16:21
> To: Ilya Maximets <i.maxim...@samsung.com>; Jan Scheurich 
> <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; O Mahony, Billy <billy.o.mah...@intel.com>
> Subject: RE: [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> 
> > Comments inline.
> >
> > Best regards, Ilya Maximets.
> >
> > On 18.03.2018 20:55, Jan Scheurich wrote:
> > > This patch instruments the dpif-netdev datapath to record detailed
> > > statistics of what is happening in every iteration of a PMD thread.
> > >
> > > The collection of detailed statistics can be controlled by a new
> > > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > > By default it is disabled. The run-time overhead, when enabled, is
> > > in the order of 1%.
> > >
> 
> [snip]
> 
> > > +}
> > > +if (tx_packets > 0) {
> > > +ds_put_format(str,
> > > +"  Tx packets:  %12"PRIu64"  (%.0f Kpps)\n"
> > > +"  Tx batches:  %12"PRIu64"  (%.2f pkts/batch)"
> > > +"\n",
> > > +tx_packets, (tx_packets / duration) / 1000,
> > > +tx_batches, 1.0 * tx_packets / tx_batches);
> > > +} else {
> > > +ds_put_format(str,
> > > +"  Tx packets:  %12"PRIu64"\n"
> > > +"\n",
> > > +0ULL);
> >
> > I have a few interesting warnings on 64bit ARMv8.
> >
> > Clang:
> >
> > lib/dpif-netdev-perf.c:216:17: error: format specifies type 'unsigned
> > long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
> > 0ULL);
> > ^~~~
> > lib/dpif-netdev-perf.c:229:17: error: format specifies type 'unsigned
> > long' but the argument has type 'unsigned long long' [-Werror,-Wformat]
> > 0ULL);
> > ^~~~
> >
> > GCC:
> >
> > lib/dpif-netdev-perf.c: In function ‘pmd_perf_format_overall_stats’:
> > lib/dpif-netdev-perf.c:215:17: error: format ‘%lu’ expects argument of
> > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> > [-Werror=format=]
> >  "  Rx packets:  %12"PRIu64"\n",
> >  ^
> > lib/dpif-netdev-perf.c:227:17: error: format ‘%lu’ expects argument of
> > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’
> > [-Werror=format=]
> >  "  Tx packets:  %12"PRIu64"\n"
> >  ^
> >
> > Both are coming from the fact that PRIu64 expands to '%lu'.
> > Why we need this printing at all? Can we just print 0 in a string?
> > Otherwise, the only way to fix these warnings is to cast 0 directly to
> > uint64_t.
> 
> I see the same in Travis.
> 
> In the v9 of the series the format used was 0UL. This allowed compilation in 
> Travis except for when compiling OVS with the 32 bit flag.
> From the logs the introduction of 0ULL seems to avoid the issue for 32 bit 
> compilation but introduces the problem for 64 bit
> compilation.
> 
> I don’t see a way around it either without casting.
> 
> Ian

I'll work around this by printing "0" as a string :-)
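
For reference, two portable ways to write that line without the 0ULL literal
(sketch only, same ds_put_format() call as in the patch):

    /* Keep the 12-character column but print the zero as a string ... */
    ds_put_format(str, "  Tx packets:  %12s\n\n", "0");
    /* ... or cast the literal explicitly to the type PRIu64 expects. */
    ds_put_format(str, "  Tx packets:  %12"PRIu64"\n\n", (uint64_t) 0);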



Re: [ovs-dev] [PATCH v6] Configurable Link State Change (LSC) detection mode

2018-03-27 Thread Jan Scheurich
Hi Ilya,

This patch is the upstream version of a fix we implemented downstream a year 
ago to fix the issue with massive packet drop of OVS-DPDK on Fortville NICs.

The root cause of this packet drop was the extended blocking of the 
ovs-vswitchd by the i40e PMD during the rte_eth_link_get_nowait() function, 
which caused the PMDs to hang for up to 40ms during upcalls.

At the time switching to LSC interrupt was the only viable solution. OVS still 
polls the links state from DPDK with rte_eth_link_get_nowait() but DPDK returns 
the locally buffered link state, updated through LSC interrupt, instead of 
going through the FVL's admin queue.
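
For context, enabling LSC interrupt mode in DPDK boils down to one device
configuration bit; a minimal sketch using the standard DPDK API (port_id,
n_rxq and n_txq stand for the usual device/queue parameters; this is not the
OVS patch itself):

    #include <rte_ethdev.h>
    #include <string.h>

    struct rte_eth_conf conf;

    memset(&conf, 0, sizeof conf);
    conf.intr_conf.lsc = 1;      /* Deliver link state changes via interrupt. */
    rte_eth_dev_configure(port_id, n_rxq, n_txq, &conf);
    /* rte_eth_link_get_nowait() then returns the interrupt-updated, locally
     * cached link state instead of querying the NIC. */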

Now, the new i40e PMD fix in DPDK bypassing the admin queue should also solve 
the problem with Fortville NICs. It would have to be backported to older DPDK 
releases to be useful as a fix for OVS 2.6, 2.7, 2.8 and 2.9, whereas the LSC 
interrupt solution in OVS would work as a back-port for all OVS versions since 
2.6.

That's why we still think there is value in pursuing the LSC interrupt track.

BR, Jan

> -Original Message-
> From: ovs-dev-boun...@openvswitch.org 
> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
> Sent: Tuesday, 27 March, 2018 13:17
> To: Stokes, Ian ; Róbert Mulik 
> ; d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v6] Configurable Link State Change (LSC) 
> detection mode
> 
> On 27.03.2018 13:19, Stokes, Ian wrote:
> >> It is possible to change LSC detection mode to polling or interrupt mode
> >> for DPDK interfaces. The default is polling mode. To set interrupt mode,
> >> option dpdk-lsc-interrupt has to be set to true.
> >>
> >> In polling mode more processor time is needed, since the OVS repeatedly
> >> reads the link state with a short period. It can lead to packet loss for
> >> certain systems.
> >>
> >> In interrupt mode the hardware itself triggers an interrupt when link
> >> state change happens, so less processing time needs for the OVS.
> >>
> >> For detailed description and usage see the dpdk install documentation.
> 
> Could you, please, better describe why we need this change?
> Because we're not removing the polling thread. OVS will still
> poll the link states periodically. This config option has
> no effect on that side. Also, link state polling in OVS uses
> 'rte_eth_link_get_nowait()' function which will be called in both
> cases and should not wait for hardware reply in any implementation.
> 
> There was recent bug fix for intel NICs that fixes waiting of an
> admin queue on link state requests despite of 'no_wait' flag:
> http://dpdk.org/ml/archives/dev/2018-March/092156.html
> Will this fix your target case?
> 
> So, the difference of execution time of 'rte_eth_link_get_nowait()'
> with enabled and disabled interrupts should be not so significant.
> Do you have performance measurements? Measurement with above fix applied?
> 
> 
> >
> > Thanks for working on this Robert.
> >
> > I've completed some testing including the case where LSC is not supported, 
> > in which case the port will remain in a down state and
> fail rx/tx traffic. This behavior conforms to the netdev_reconfigure 
> expectations in the fail case so that's ok.
> 
> I'm not sure if this is acceptable. For example, we're not failing
> reconfiguration in case of issues with number of queues. We're trying
> different numbers until we have working configuration.
> Maybe we need the same fall-back mechanism in case of not supported LSC
> interrupts? (MTU setup errors are really uncommon unlike LSC interrupts'
> support in PMDs).
> 
> >
> > I'm a bit late to the thread but I have a few other comments below.
> >
> > I'd like to get this patch in the next pull request if possible so I'd 
> > appreciate if others can give any comments on the patch also.
> >
> > Thanks
> > Ian
> >
> >>
> >> Signed-off-by: Robert Mulik 
> >> ---
> >> v5 -> v6:
> >> - DPDK install documentation updated.
> >> - Status of lsc_interrupt_mode of DPDK interfaces can be read by command
> >>   ovs-appctl dpif/show.
> >> - It was suggested to check if the HW supports interrupt mode, but it is
> >> not
> >>   possible to do without DPDK code change, so it is skipped from this
> >> patch.
> >> ---
> >>  Documentation/intro/install/dpdk.rst | 33
> >> +
> >>  lib/netdev-dpdk.c| 24 ++--
> >>  vswitchd/vswitch.xml | 17 +
> >>  3 files changed, 72 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/intro/install/dpdk.rst
> >> b/Documentation/intro/install/dpdk.rst
> >> index ed358d5..eb1bc7b 100644
> >> --- a/Documentation/intro/install/dpdk.rst
> >> +++ b/Documentation/intro/install/dpdk.rst
> >> @@ -628,6 +628,39 @@ The average number of packets per output batch can be
> >> checked in PMD stats::
> >>
> >>  $ ovs-appctl dpif-netdev/pmd-stats-show
> >>
> >> +Link State 

Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs

2018-03-26 Thread Jan Scheurich
Hi Aaron,

Thanks for the feedback. A few good suggestions are always welcome.
I will include fixes for your comments in the (hopefully) final version.

Regards, Jan 

> -Original Message-
> From: Aaron Conole [mailto:acon...@redhat.com]
> Sent: Monday, 26 March, 2018 23:27
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; i.maxim...@samsung.com
> Subject: Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance 
> stats for PMDs
> 
> Hi Jan,
> 
> Some stylistic type comments follow.  Sorry to jump in at the end - but
> you asked for checkpatch changes, so I improved and ran it against your
> patch and found some stuff for which I have an opinion. :)  Maybe
> nothing to hold up merging but cleanup stuff.
> 
> Jan Scheurich <jan.scheur...@ericsson.com> writes:
> 
> > This patch instruments the dpif-netdev datapath to record detailed
> > statistics of what is happening in every iteration of a PMD thread.
> >
> > The collection of detailed statistics can be controlled by a new
> > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
> > By default it is disabled. The run-time overhead, when enabled, is
> > in the order of 1%.
> >
> > The covered metrics per iteration are:
> >   - cycles
> >   - packets
> >   - (rx) batches
> >   - packets/batch
> >   - max. vhostuser qlen
> >   - upcalls
> >   - cycles spent in upcalls
> >
> > This raw recorded data is used threefold:
> >
> > 1. In histograms for each of the following metrics:
> >- cycles/iteration (log.)
> >- packets/iteration (log.)
> >- cycles/packet
> >- packets/batch
> >- max. vhostuser qlen (log.)
> >- upcalls
> >- cycles/upcall (log)
> >The histograms bins are divided linear or logarithmic.
> >
> > 2. A cyclic history of the above statistics for 999 iterations
> >
> > 3. A cyclic history of the cummulative/average values per millisecond
> >wall clock for the last 1000 milliseconds:
> >- number of iterations
> >- avg. cycles/iteration
> >- packets (Kpps)
> >- avg. packets/batch
> >- avg. max vhost qlen
> >- upcalls
> >- avg. cycles/upcall
> >
> > The gathered performance metrics can be printed at any time with the
> > new CLI command
> >
> > ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
> > [-pmd core] [dp]
> >
> > The options are
> >
> > -nh:Suppress the histograms
> > -it iter_len:   Display the last iter_len iteration stats
> > -ms ms_len: Display the last ms_len millisecond stats
> > -pmd core:  Display only the specified PMD
> >
> > The performance statistics are reset with the existing
> > dpif-netdev/pmd-stats-clear command.
> >
> > The output always contains the following global PMD statistics,
> > similar to the pmd-stats-show command:
> >
> > Time: 15:24:55.270
> > Measurement duration: 1.008 s
> >
> > pmd thread numa_id 0 core_id 1:
> >
> >   Cycles:2419034712  (2.40 GHz)
> >   Iterations:572817  (1.76 us/it)
> >   - idle:486808  (15.9 % cycles)
> >   - busy: 86009  (84.1 % cycles)
> >   Rx packets:   2399607  (2381 Kpps, 848 cycles/pkt)
> >   Datapath passes:  3599415  (1.50 passes/pkt)
> >   - EMC hits:336472  ( 9.3 %)
> >   - Megaflow hits:  3262943  (90.7 %, 1.00 subtbl lookups/hit)
> >   - Upcalls:  0  ( 0.0 %, 0.0 us/upcall)
> >   - Lost upcalls: 0  ( 0.0 %)
> >   Tx packets:   2399607  (2381 Kpps)
> >   Tx batches:171400  (14.00 pkts/batch)
> >
> > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
> > Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > ---
> >  NEWS|   3 +
> >  lib/automake.mk |   1 +
> >  lib/dpif-netdev-perf.c  | 350 
> > +++-
> >  lib/dpif-netdev-perf.h  | 258 ++--
> >  lib/dpif-netdev-unixctl.man | 157 
> >  lib/dpif-netdev.c   | 183 +--
> >  manpages.mk |   2 +
> >  vswitchd/ovs-vswitchd.8.in  |  27 +---
> >  vswitchd/vswitch.xml|  12 ++
> >  9 files changed, 940 insertions(+), 53 deletions(-)
> >  create mode 100644 lib/dpif-netdev-unixctl.man
> >
> > diff --git a/NEWS b/NEWS

[ovs-dev] [PATCH] xlate: Correct handling of double encap() actions

2018-03-26 Thread Jan Scheurich
When the same encap() header was pushed twice onto a packet (e.g. in the
case of NSH in NSH), the translation logic only generated a datapath push
action for the first encap() action. The second encap() did not emit a
push action because the packet type was unchanged.

commit_encap_decap_action() (renamed from commit_packet_type_change) must
solely rely on ctx->pending_encap to generate a datapath push action.

Similarly, the first decap() action on a double header packet does not
change the packet_type either. Add a corresponding ctx->pending_decap
flag and use that to trigger emitting a datapath pop action.

Fixes: f839892a2 ("OF support and translation of generic encap and decap")
Fixes: 1fc11c594 ("Generic encap and decap support for NSH")

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>

---

This fix should be backported OVS 2.8 and 2.9


 lib/odp-util.c   | 16 ++--
 lib/odp-util.h   |  1 +
 ofproto/ofproto-dpif-xlate.c |  7 ++-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/lib/odp-util.c b/lib/odp-util.c
index 8743503..6db241a 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -7446,17 +7446,13 @@ odp_put_push_nsh_action(struct ofpbuf *odp_actions,
 }
 
 static void
-commit_packet_type_change(const struct flow *flow,
+commit_encap_decap_action(const struct flow *flow,
   struct flow *base_flow,
   struct ofpbuf *odp_actions,
   struct flow_wildcards *wc,
-  bool pending_encap,
+  bool pending_encap, bool pending_decap,
   struct ofpbuf *encap_data)
 {
-if (flow->packet_type == base_flow->packet_type) {
-return;
-}
-
 if (pending_encap) {
 switch (ntohl(flow->packet_type)) {
 case PT_ETH: {
@@ -7481,7 +7477,7 @@ commit_packet_type_change(const struct flow *flow,
  * The check is done at action translation. */
 OVS_NOT_REACHED();
 }
-} else {
+} else if (pending_decap || flow->packet_type != base_flow->packet_type) {
 /* This is an explicit or implicit decap case. */
 if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE &&
 base_flow->packet_type == htonl(PT_ETH)) {
@@ -7520,14 +7516,14 @@ commit_packet_type_change(const struct flow *flow,
 enum slow_path_reason
 commit_odp_actions(const struct flow *flow, struct flow *base,
struct ofpbuf *odp_actions, struct flow_wildcards *wc,
-   bool use_masked, bool pending_encap,
+   bool use_masked, bool pending_encap, bool pending_decap,
struct ofpbuf *encap_data)
 {
 enum slow_path_reason slow1, slow2;
 bool mpls_done = false;
 
-commit_packet_type_change(flow, base, odp_actions, wc,
-  pending_encap, encap_data);
+commit_encap_decap_action(flow, base, odp_actions, wc,
+  pending_encap, pending_decap, encap_data);
 commit_set_ether_action(flow, base, odp_actions, wc, use_masked);
 /* Make packet a non-MPLS packet before committing L3/4 actions,
  * which would otherwise do nothing. */
diff --git a/lib/odp-util.h b/lib/odp-util.h
index 1fad159..6fcd1bb 100644
--- a/lib/odp-util.h
+++ b/lib/odp-util.h
@@ -283,6 +283,7 @@ enum slow_path_reason commit_odp_actions(const struct flow 
*,
  struct flow_wildcards *wc,
  bool use_masked,
  bool pending_encap,
+ bool pending_decap,
  struct ofpbuf *encap_data);
 
 /* ofproto-dpif interface.
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index bc6429c..326c088 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -243,6 +243,8 @@ struct xlate_ctx {
  * true. */
 bool pending_encap; /* True when waiting to commit a pending
  * encap action. */
+bool pending_decap; /* True when waiting to commit a pending
+ * decap action. */
 struct ofpbuf *encap_data;  /* May contain a pointer to an ofpbuf with
  * context for the datapath encap action.*/
 
@@ -3477,8 +3479,9 @@ xlate_commit_actions(struct xlate_ctx *ctx)
 ctx->xout->slow |= commit_odp_actions(>xin->flow, >base_flow,
   ctx->odp_actions, ctx->wc,
   use_masked, ctx->pending_encap,
-  ctx->encap_data);
+  ctx->pending_decap, ctx->encap_data);
 c

[ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set

2018-03-26 Thread Jan Scheurich
The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field
actions in ofpact_is_set_or_move_action(). This caused them to be executed
twice in the action set or a group bucket, once explicitly in
ofpacts_execute_action_set() and once again as part of the list of
set_field or move actions.

Fixes: f839892a ("OF support and translation of generic encap and decap")
Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action")

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>

---

The fix should be backported to OVS 2.9 and OVS 2.8 (without the case
for OFPACT_DEC_NSH_TTL introduced in 2.9).


 lib/ofp-actions.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
index db85716..87797bc 100644
--- a/lib/ofp-actions.c
+++ b/lib/ofp-actions.c
@@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
 case OFPACT_SET_TUNNEL:
 case OFPACT_SET_VLAN_PCP:
 case OFPACT_SET_VLAN_VID:
-case OFPACT_ENCAP:
-case OFPACT_DECAP:
-case OFPACT_DEC_NSH_TTL:
 return true;
 case OFPACT_BUNDLE:
 case OFPACT_CLEAR_ACTIONS:
@@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
 case OFPACT_WRITE_METADATA:
 case OFPACT_DEBUG_RECIRC:
 case OFPACT_DEBUG_SLOW:
+case OFPACT_ENCAP:
+case OFPACT_DECAP:
+case OFPACT_DEC_NSH_TTL:
 return false;
 default:
 OVS_NOT_REACHED();
-- 
1.9.1



Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket of group

2018-03-26 Thread Jan Scheurich
Thanks for the confirmation Yi. I will post the fix straight away.
The other fix for double encap() is also ready.

BR, Jan

> -Original Message-
> From: Yang, Yi [mailto:yi.y.y...@intel.com]
> Sent: Monday, 26 March, 2018 03:42
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com>
> Subject: Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket 
> of group
> 
> I tried the below fix patch you mentioned, it did fix this issue.
> 
> diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
> index db85716..87797bc 100644
> --- a/lib/ofp-actions.c
> +++ b/lib/ofp-actions.c
> @@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
>  case OFPACT_SET_TUNNEL:
>  case OFPACT_SET_VLAN_PCP:
>  case OFPACT_SET_VLAN_VID:
> -case OFPACT_ENCAP:
> -case OFPACT_DECAP:
> -case OFPACT_DEC_NSH_TTL:
>  return true;
>  case OFPACT_BUNDLE:
>  case OFPACT_CLEAR_ACTIONS:
> @@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a)
>  case OFPACT_WRITE_METADATA:
>  case OFPACT_DEBUG_RECIRC:
>  case OFPACT_DEBUG_SLOW:
> +case OFPACT_ENCAP:
> +case OFPACT_DECAP:
> +case OFPACT_DEC_NSH_TTL:
>  return false;
>  default:
>  OVS_NOT_REACHED();
> 
> On Mon, Mar 26, 2018 at 12:45:46AM +, Yang, Yi Y wrote:
> > Jan, thank you so much, very exhaustive analysis :), I'll double check your 
> > fix patch.
> >
> > From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> > Sent: Sunday, March 25, 2018 9:09 AM
> > To: Yang, Yi Y <yi.y.y...@intel.com>
> > Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com>
> > Subject: RE: OVS will hit an assert if encap(nsh) is done in bucket of group
> >
> >
> > Hi Yi,
> >
> >
> >
> > Part of the seemingly strange behavior of the encap(nsh) action in a group 
> > is caused by the (often forgotten) fact that group buckets
> do not contain action *lists* but action *sets*. I have no idea why it was 
> defined like this when groups were first introduced in
> OpenFlow 1.1. In my view it was a bad decision and causes a lot of limitation 
> for using groups. But that's the way it is.
> >
> >
> >
> > In action sets there can only be one action of a kind (except for 
> > set_field, where there can be one action per target field). If there
> are multiple actions of the same kind specified, only the last one taken, the 
> earlier  ones ignored.
> >
> >
> >
> > Furthermore, the order of execution of the actions in the action set is not 
> > given by the order in which they are listed but defined by
> the OpenFlow standard (see chapter 5.6 of OF spec 1.5.1). Of course the 
> generic encap() and decap() actions are not standardized yet,
> so the OF spec doesn't specify where to put them in the sequence. We had to 
> implement something that follows the spirit of the
> specification, knowing that whatever we chose may fit some but won't fit many 
> other legitimate use cases.
> >
> >
> >
> > OVS's order is defined in ofpacts_execute_action_set() in ofp-actions.c:
> >
> > OFPACT_STRIP_VLAN
> >
> > OFPACT_POP_MPLS
> >
> > OFPACT_DECAP
> >
> > OFPACT_ENCAP
> >
> > OFPACT_PUSH_MPLS
> >
> > OFPACT_PUSH_VLAN
> >
> > OFPACT_DEC_TTL
> >
> > OFPACT_DEC_MPLS_TTL
> >
> > OFPACT_DEC_NSH_TTL
> >
> > All OFP_ACT SET_FIELD and OFP_ACT_MOVE (target)
> >
> > OFPACT_SET_QUEUE
> >
> >
> >
> > Now, your specific group bucket use case:
> >
> >
> >
> >encap(nsh),set_field:->nsh_xxx,output:vxlan_gpe_port
> >
> >
> >
> > should be a lucky fit and execute as expected, whereas the analogous use 
> > case
> >
> >
> >
> >encap(nsh),set_field:->nsh_xxx,encap(ethernet), output:ethernet_port
> >
> >
> >
> > fails with the error
> >
> >
> >
> >Dropping packet as encap(ethernet) is not supported for packet type 
> > ethernet.
> >
> >
> >
> > because the second encap(ethernet) action replaces the encap(nsh) in the 
> > action set and is executed first on the original received
> Ethernet packet. Boom!
> >
> >
> >
> > So, why does your valid use case cause an assertion failure? It's a 
> > consequence of two faults:
> >
> >
> > 1.  In the conversion of the group bucket's action list to the bucket 
> > acti

Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket of group

2018-03-24 Thread Jan Scheurich
Hi Yi,



Part of the seemingly strange behavior of the encap(nsh) action in a group is 
caused by the (often forgotten) fact that group buckets do not contain action 
*lists* but action *sets*. I have no idea why it was defined like this when 
groups were first introduced in OpenFlow 1.1. In my view it was a bad decision 
and causes a lot of limitation for using groups. But that's the way it is.



In action sets there can only be one action of a kind (except for set_field, 
where there can be one action per target field). If there are multiple actions 
of the same kind specified, only the last one is taken and the earlier ones are ignored.



Furthermore, the order of execution of the actions in the action set is not 
given by the order in which they are listed but defined by the OpenFlow 
standard (see chapter 5.6 of OF spec 1.5.1). Of course the generic encap() and 
decap() actions are not standardized yet, so the OF spec doesn't specify where 
to put them in the sequence. We had to implement something that follows the 
spirit of the specification, knowing that whatever we chose may fit some but 
won't fit many other legitimate use cases.



OVS's order is defined in ofpacts_execute_action_set() in ofp-actions.c:

OFPACT_STRIP_VLAN

OFPACT_POP_MPLS

OFPACT_DECAP

OFPACT_ENCAP

OFPACT_PUSH_MPLS

OFPACT_PUSH_VLAN

OFPACT_DEC_TTL

OFPACT_DEC_MPLS_TTL

OFPACT_DEC_NSH_TTL

All OFP_ACT SET_FIELD and OFP_ACT_MOVE (target)

OFPACT_SET_QUEUE



Now, your specific group bucket use case:



   encap(nsh),set_field:->nsh_xxx,output:vxlan_gpe_port



should be a lucky fit and execute as expected, whereas the analogous use case



   encap(nsh),set_field:->nsh_xxx,encap(ethernet), output:ethernet_port



fails with the error



   Dropping packet as encap(ethernet) is not supported for packet type ethernet.



because the second encap(ethernet) action replaces the encap(nsh) in the action 
set and is executed first on the original received Ethernet packet. Boom!



So, why does your valid use case cause an assertion failure? It's a consequence 
of two faults:



  1.  In the conversion of the group bucket's action list to the bucket action 
set in ofpacts_execute_action_set() the action list is filtered with 
ofpact_is_set_or_move_action() to select the set_field actions. This function 
incorrectly flagged OFPACT_ENCAP, OFPACT_DECAP and OFPACT_DEC_NSH_TTL as 
set_field actions. That's why the encap(nsh) action is wrongly copied twice to 
the action set.

  2.  The translation of the second encap(nsh) action in the action set doesn't 
change the packet_type as it is already (1,0x894f). Hence, the 
commit_packet_type_change() triggered at output to the vxlan_gpe port fails to 
generate a second encap_nsh datapath action. The logic here is obviously not 
complete to cover the NSH in NSH use case that we intended to support and must 
be enhanced.


The commit of the changes to the NSH header in commit_set_nsh_action() then 
triggers assertion failure because the translation of the second encap(nsh) 
action did overwrite the original nsh_np (0x3 for Ethernet in NSH) in the flow 
with 0x4 (for NSH in NSH). Since it is not allowed to modify the nsh_np with 
set_field this is what triggers the assertion.


I believe this assertion to be correct. It did detect the combination of the 
above two faults.



The solution to 1 is trivial. I'll post a bug fix straight away. That should 
suffice for your problem.

The solution to 2 requires a bit more thinking. I will send a fix when I have 
found it.



BR, Jan



> -Original Message-

> From: Yang, Yi [mailto:yi.y.y...@intel.com]

> Sent: Friday, 23 March, 2018 08:55

> To: Jan Scheurich <jan.scheur...@ericsson.com>

> Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com>

> Subject: Re: OVS will hit an assert if encap(nsh) is done in bucket of group

>

> On Fri, Mar 23, 2018 at 07:51:45AM +, Jan Scheurich wrote:

> > Hi Yi,

> >

> > Could you please provide the OF pipeline (flows and groups) and an 
> > ofproto/trace command that triggers that fault?

> >

> > Thanks, Jan

>

> Hi, Jan

>

> my br-int has the below ports:

>

>  1(dpdk0): addr:08:00:27:c6:9f:ff

>  config: 0

>  state:  LIVE

>  current:1GB-FD AUTO_NEG

>  speed: 1000 Mbps now, 0 Mbps max

>  2(vxlangpe1): addr:16:04:0c:e5:f1:2c

>  config: 0

>  state:  LIVE

>  speed: 0 Mbps now, 0 Mbps max

>  3(vxlan1): addr:da:1e:fb:2b:c8:63

>  config: 0

>  state:  LIVE

>  speed: 0 Mbps now, 0 Mbps max

>  4(veth-br): addr:92:3d:e0:ab:c2:85

>  config: 0

>  state:  LIVE

>  current:10GB-FD COPPER

>  speed: 1 Mbps now, 0 Mbps max

>  LOCAL(br-int): addr:08:00:27:c6:9f:ff

>  config: 0

>  state:  LIVE

>  current:10MB-FD
