Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding datapath flows.
Hi guys,

I am afraid that commit is too long ago for me to remember any details of what caused us to change the code in beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline"). What I vaguely remember is that I couldn't comprehend the original code and that it was not working correctly in some of the cases we needed/tested. But perhaps the changes we introduced also had corner cases we didn't consider.

A question, though:

> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.

That sounds strange to me. I always believed the datapath only supports non-overlapping flows, i.e. a packet can match at most one datapath flow. Only with that prerequisite can the dpcls classifier work without priorities. Have I been wrong in this?

What would be the semantics of adding a wider flow to the datapath? To my knowledge there is no guarantee that the dpcls subtables are visited in any specific order that would honor the mask width, and the first match will win. Please clarify this. And in which sense does OVS rely on this behavior?

BR, Jan

> -Original Message-
> From: Ilya Maximets
> Sent: Tuesday, 18 October 2022 21:40
> To: Eelco Chaudron ; d...@openvswitch.org
> Cc: i.maxim...@ovn.org; Jan Scheurich
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Use unmasked key when adding
> datapath flows.
>
> On 10/18/22 18:42, Eelco Chaudron wrote:
> > The datapath supports installing wider flows, and OVS relies on this
> > behavior. For example if ipv4(src=1.1.1.1/192.0.0.0,
> > dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of
> > ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/128.0.0.0) is allowed to be
> > added.
> >
> > However, if we try to add a wildcard rule, the installation fails:
> >
> >   # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >     ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
> >   # ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
> >     ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
> >   ovs-vswitchd: updating flow table (File exists)
> >
> > The reason is that the key used to determine if the flow is already
> > present in the system uses the original key ANDed with the mask.
> > This results in the IP address not being part of the (miniflow) key,
> > i.e., being substituted with an all-zero value. When doing the actual
> > lookup, this results in the key wrongfully matching the first flow,
> > and therefore the flow does not get installed. The solution is to use
> > the unmasked key for the existence check, the same way this is handled
> > in the userspace datapath.
> >
> > Signed-off-by: Eelco Chaudron
> > ---
> >  lib/dpif-netdev.c    | 33 +
> >  tests/dpif-netdev.at | 14 ++
> >  2 files changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index a45b46014..daa00aa2f 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -3321,6 +3321,28 @@ netdev_flow_key_init_masked(struct netdev_flow_key *dst,
> >                          (dst_u64 - miniflow_get_values(&dst->mf))
> >                          * 8);
> >  }
> >
> > +/* Initializes 'dst' as a copy of 'flow'.
> > + */
> > +static inline void
> > +netdev_flow_key_init(struct netdev_flow_key *key,
> > +                     const struct flow *flow)
> > +{
> > +    uint64_t *dst = miniflow_values(&key->mf);
> > +    uint32_t hash = 0;
> > +    uint64_t value;
> > +
> > +    miniflow_map_init(&key->mf, flow);
> > +    miniflow_init(&key->mf, flow);
> > +
> > +    size_t n = dst - miniflow_get_values(&key->mf);
> > +
> > +    FLOW_FOR_EACH_IN_MAPS (value, flow, key->mf.map) {
> > +        hash = hash_add64(hash, value);
> > +    }
> > +
> > +    key->hash = hash_finish(hash, n * 8);
> > +    key->len = netdev_flow_key_size(n);
> > +}
> > +
> >  static inline void
> >  emc_change_entry(struct emc_entry *ce, struct dp_netdev_flow *flow,
> >                   const struct netdev_flow_key *key)
> > @@ -4195,7 +4217,7 @@ static int
> >  dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
> >  {
> >      struct dp_netdev *dp = get_dp_netdev(dpif);
> > -    struct netdev_flow_key key, mask;
> > +    struct netdev_flow_key key;
> >      struct dp_netdev_pmd_thread *pmd;
> >
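Jan's question about overlapping datapath flows can be illustrated with a toy model. This is hypothetical illustration code, not the real dpcls implementation: one rule per "subtable" mask, subtables visited in arbitrary order, first match wins, no priorities. The overlapping rule pair below (`1.0.0.0/8` vs. a full wildcard) is my own example, not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a dpcls-style lookup (NOT the real OVS code): one rule
 * per "subtable" mask, subtables visited in arbitrary (here: array)
 * order, first match wins, no priorities. */
struct toy_rule {
    uint32_t ip_src_mask;   /* subtable mask */
    uint32_t ip_src_key;    /* pre-masked match value */
};

/* Returns the mask of the first matching rule, 0xffffffff if none. */
static uint32_t
toy_lookup(const struct toy_rule *rules, size_t n, uint32_t ip_src)
{
    for (size_t i = 0; i < n; i++) {
        if ((ip_src & rules[i].ip_src_mask) == rules[i].ip_src_key) {
            return rules[i].ip_src_mask;
        }
    }
    return 0xffffffffu;
}

/* A packet with src 1.2.3.4 matches both a 1.0.0.0/8 flow and a
 * wildcard flow (a hypothetical overlapping pair); the winner depends
 * only on which subtable is visited first. */
static uint32_t
toy_winner(int narrow_first)
{
    struct toy_rule narrow = { 0xff000000u, 0x01000000u }; /* 1.0.0.0/8 */
    struct toy_rule wide   = { 0x00000000u, 0x00000000u }; /* wildcard  */
    struct toy_rule tbl[2];

    tbl[0] = narrow_first ? narrow : wide;
    tbl[1] = narrow_first ? wide : narrow;
    return toy_lookup(tbl, 2, 0x01020304u); /* src 1.2.3.4 */
}
```

For the same packet, `toy_winner(1)` returns the /8 mask and `toy_winner(0)` returns the wildcard mask, which is why either non-overlapping flows or explicit priorities are needed for deterministic lookup results.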
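The collision described in the commit message can be reduced to a few lines. This is a hedged sketch, not the real miniflow code: ANDing the key with its mask before the duplicate-flow check makes two genuinely different flows indistinguishable, while their unmasked keys still differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the existence-check problem (hypothetical code, not the
 * real OVS structures): a "key" here is just the ipv4 src field. */
static uint32_t
masked_key(uint32_t ip_src, uint32_t mask)
{
    return ip_src & mask;   /* what a masked-key init stores */
}

/* Existing flow: src=1.1.1.1/192.0.0.0; new flow: src=192.1.1.1/0.0.0.0.
 * Returns true if the masked keys collide even though the flows differ. */
static bool
masked_keys_collide(void)
{
    uint32_t existing = masked_key(0x01010101u, 0xc0000000u); /* 1.1.1.1/192.0.0.0   */
    uint32_t added    = masked_key(0xc0010101u, 0x00000000u); /* 192.1.1.1/0.0.0.0   */
    return existing == added;   /* both collapse to 0 -> "File exists" */
}

/* The unmasked keys, as used after the fix, do not collide. */
static bool
unmasked_keys_collide(void)
{
    return 0x01010101u == 0xc0010101u;
}
```

Both masked keys collapse to zero, so a masked-key existence check wrongly reports a duplicate, while the unmasked keys used after the fix remain distinct.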
Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports
Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node is slightly higher than from a local NUMA node, so cross-NUMA polling of rx queues has some overhead. However, the packet processing cost is much more influenced by the location of the target vhostuser ports. If the majority of the rx queue traffic is going to a VM on the other NUMA node, it is actually *better* to poll the packets in a PMD on the VM's NUMA node.

Long story short, OVS doesn't have sufficient data to correctly predict the actual rxq load when the rxq is assigned to another PMD in a different queue configuration. The rxq processing cycles measured on the current PMD are the best estimate we have for balancing the overall load on the PMDs. We need to live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the load over PMDs is *not* to minimize the total cycles spent by PMDs on processing packets. The PMDs run in a busy loop anyhow and burn all cycles of their CPUs. The purpose is to prevent that some PMD unnecessarily gets congested (i.e. load > 95%) while others have a lot of spare capacity and could take over some rxqs.

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable tool in helping OVS's cycle-based rxq-balancing algorithm do its job, and I strongly suggest we allow the proposed per-port opt-in option.

BR, Jan

From: Anurag Agarwal
Sent: Thursday, 21 July 2022 07:15
To: lic...@chinatelecom.cn
Cc: Jan Scheurich
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hello Cheng,

With cross-numa enabled, we flatten the PMD list across NUMAs and select the least loaded PMD. Thus I would not like to consider the case below.
Regards, Anurag

From: lic...@chinatelecom.cn
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal
Cc: Jan Scheurich
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

"If the local numa has bandwidth for an rxq, we are not supposed to assign the rxq to a remote pmd." Would you like to consider this case? If not, I think we don't have to resolve the cycles measurement issue for the cross-numa case.

李成

From: Anurag Agarwal
Date: 2022-07-21 10:21
To: lic...@chinatelecom.cn
CC: Jan Scheurich
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

+ Jan

Hello Cheng,

Thanks for your insightful comments. Please find my inputs inline.

Regards, Anurag

From: lic...@chinatelecom.cn
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

Hi Anurag,

Sorry for the late reply, I was busy on a task the last two weeks.

I think your proposal can cover the case I reported. It looks good to me.

>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to address. From my test, cross-numa polling has at least 10% worse performance than NUMA-affine polling. So if the local numa has bandwidth for an rxq, we are not supposed to assign the rxq to a remote pmd. Unfortunately, we can't tell whether a pmd is out of bandwidth from its assigned rxq cycles, because the rx batch size impacts the rxq cycles a lot in my test:

  rx batch   cycles per pkt
  1.00       5738
  5.00       2353
  12.15      1770
  32.00      1533

The faster packets come, the larger the rx batch size.
The more rxqs a pmd is assigned, the larger the rx batch size. Imagine that pmd pA has only one rxq assigned and packets arrive at 1 pkt per 5738 cycles; the rxq's rx batch size is 1.00. Now pA has 2 rxqs assigned, each with packets arriving at 1 pkt per 5738 cycles. The pmd spends 5738 cycles processing the first rxq, and then the second. After the second rxq is processed, the pmd comes back to the first rxq, which now has 2 pkts ready (because 2*5738 cycles have passed). The rx batch size becomes 2.

>> Ok. Do you think it is a more generic problem with cycles measurement and
>> PMD utilization? Not specific to the cross-numa feature..

So it's hard to say whether a pmd is overloaded from the rxq cycles.

Finally, I think the cross-numa feature is very nice. I will make an effort on this as well to cover cases in our company. Let's keep in sync on progress :)

>> Thanks

李成

From: Anurag Agarwal
Date: 2022-06-29 14:12
To: lic...@chinatelecom.cn
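Cheng's measurements above are roughly consistent with a simple amortization model (my own back-of-envelope fit, not anything from the OVS code): a fixed per-poll overhead F spread over the batch size b plus a per-packet cost P, i.e. cycles/pkt ≈ F/b + P. The constants below are fitted from the batch=1 and batch=32 rows and are purely illustrative.

```c
/* Hypothetical amortization model for the rx-batch measurements:
 * cycles_per_pkt(b) = fixed_per_poll / b + per_pkt.
 * TOY_F and TOY_P are fitted from the batch=1 and batch=32 rows of
 * the table in the mail; they are illustrative, not measured. */
static double
cycles_per_pkt(double fixed_per_poll, double per_pkt, double batch)
{
    return fixed_per_poll / batch + per_pkt;
}

#define TOY_F 4340.0   /* fitted fixed per-poll cycles */
#define TOY_P 1398.0   /* fitted per-packet cycles */
```

With these assumed constants the model gives roughly 2266 cycles/pkt at b=5 and 1755 at b=12.15, close to the measured 2353 and 1770, which supports Cheng's point that per-packet cycle cost alone says little about the remaining headroom of a pmd.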
Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
> I found this one though:
> https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-45444731-805e1f47a2a28e92=1=1fbc0307-e0af-4087-98ef-d9dbff40359a=https%3A%2F%2Fpatchwork.ozlabs.org%2Fproject%2Fopenvswitch%2Fpatch%2F20220403222617.31688-1-jan.scheurich%40web.de%2F
>
> It was held back by the mail list and appears to be better formatted.
> Let's see if it works.
>
> P.S. I marked that email address for acceptance, so the mail list
> should no longer block it.

Thanks Ilya, that is a good work-around for me until I find a solution for SMTP with my Ericsson mail.

Regards, Jan

___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
From: Jan Scheurich

A packet received from a tunnel port with legacy_l3 packet-type (e.g. lisp, L3 gre, gtpu) is conceptually wrapped in a dummy Ethernet header for processing in an OF pipeline that is not packet-type-aware. Before transmission of the packet to another legacy_l3 tunnel port, the dummy Ethernet header is stripped again.

In ofproto-xlate, wrapping in the dummy Ethernet header is done by simply changing the packet_type to PT_ETH. The generation of the push_eth datapath action is deferred until the packet's flow changes need to be committed, for example at output to a normal port. The deferred Ethernet encapsulation is marked in the pending_encap flag.

This patch fixes a bug in the translation of the output action to a legacy_l3 tunnel port, where the packet_type of the flow is reverted from PT_ETH to PT_IPV4 or PT_IPV6 (depending on the dl_type) to remove its Ethernet header, without clearing the pending_encap flag if it was set. At the subsequent commit of the flow changes, the unexpected combination of pending_encap == true with a PT_IPV4 or PT_IPV6 packet_type hit the OVS_NOT_REACHED() abortion clause. The pending_encap flag is now cleared in this situation.

Reported-by: Dincer Beken
Signed-off-by: Jan Scheurich
Signed-off-by: Dincer Beken

v1->v2: A specific test has been added to tunnel-push-pop.at to verify the correct behavior.

---
 ofproto/ofproto-dpif-xlate.c | 4
 tests/tunnel-push-pop.at | 22 ++
 2 files changed, 26 insertions(+)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index cc9c1c628..1843a5d66 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4195,6 +4195,10 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
     if (xport->pt_mode == NETDEV_PT_LEGACY_L3) {
         flow->packet_type = PACKET_TYPE_BE(OFPHTN_ETHERTYPE,
                                            ntohs(flow->dl_type));
+        if (ctx->pending_encap) {
+            /* The Ethernet header was not actually added yet.
 */
+            ctx->pending_encap = false;
+        }
     }
 }

diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at
index 57589758f..70462d905 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -546,6 +546,28 @@ AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port [[37]]' | sort], [0], [dnl
 port  7: rx pkts=5, bytes=434, drop=?, errs=?, frame=?, over=?, crc=?
])

+dnl Send out packets received from L3GRE tunnel back to L3GRE tunnel
+AT_CHECK([ovs-ofctl del-flows int-br])
+AT_CHECK([ovs-ofctl add-flow int-br "in_port=7,actions=set_field:3->in_port,7"])
+AT_CHECK([ovs-vsctl -- set Interface br0 options:pcap=br0.pcap])
+
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+
+ovs-appctl time/warp 1000
+
+AT_CHECK([ovs-pcap p0.pcap > p0.pcap.txt 2>&1])
+AT_CHECK([tail -6 p0.pcap.txt], [0], [dnl
+aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+aa55aa55001b213cab640800457079464000402fba630101025c01010258280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637
+001b213cab64aa55aa55080045704000402f33aa010102580101025c280001c84554ba20400184861e011e024227e75400030af31955f2650100101112131415161718191a1b1c1d1e1f2021222
Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
Hi Ilya,

Sorry for spamming, but I again have problems sending correctly formatted patches to the ovs-dev list. My previous SMTP server for git send-email no longer works, and patches I send through my private SMTP provider do not reach the mailing list. Resending from Outlook obviously still screws up the patch ☹.

Any chance that you could pick up the PATCH v2 manually and feed it through the process? I doubt that I will find a solution to my mailing issues soon.

Thanks, Jan

> -Original Message-
> From: 0-day Robot
> Sent: Monday, 4 April, 2022 09:56
> To: Jan Scheurich
> Cc: d...@openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v2] ofproto-xlate: Fix crash when forwarding
> packet between legacy_l3 tunnels
>
> Bleep bloop. Greetings Jan Scheurich, I am a robot and I have tried out your patch.
> Thanks for your contribution.
>
> I encountered some error that I wasn't expecting. See the details below.
>
> git-am:
> error: corrupt patch at line 85
> error: could not build fake ancestor
> hint: Use 'git am --show-current-patch' to see the failed patch
> Patch failed at 0001 ofproto-xlate: Fix crash when forwarding packet between legacy_l3 tunnels
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> Patch skipped due to previous failure.
>
> Please check this out. If you feel there has been an error, please email acon...@redhat.com
>
> Thanks,
> 0-day Robot
Re: [ovs-dev] [PATCH v2] dpif-netdev: Allow cross-NUMA polling on selected ports
Hi Kevin,

This was a bit of a misunderstanding. We didn't check your RFC patch carefully enough to realize that you had meant to encompass our cross-numa-polling function in that RFC patch. Sorry for the confusion.

I wouldn't say we are particularly keen on upstreaming exactly our implementation of cross-numa-polling for ALB, as long as we get the functionality with a per-interface configuration option (preferably as in our patch, so that we can maintain backward compatibility with our downstream solution). I suggest we have a closer look at your RFC and come back with comments on that.

> 'roundrobin' and 'cycles' are dependent on RR of cores per numa, so I agree it
> makes sense in those cases to still RR the numas. Otherwise a mix of cross-
> numa enabled and cross-numa disabled interfaces would conflict with each
> other when selecting a core. So even though the user has selected to ignore
> numa for this interface, we don't have a choice but to still RR the numa.
>
> For 'group' it is about finding the lowest loaded pmd core. In that case we don't
> need to RR numa. We can just find the lowest loaded pmd core from any numa.
>
> This is better because the user is choosing to remove numa-based selection for
> the interface, and as the rxq scheduling algorithm is not dependent on it, we
> can fully remove it too by checking pmds from all numas. I have done this in my
> RFC.
>
> It is also better to do this where possible because there may not be the same
> amount of pmd cores on each numa, or one numa could already be more
> heavily loaded than the other.
>
> Another difference is that with the above for 'group' I added a tiebreaker so that a
> local-to-interface numa pmd is selected if multiple pmds from different numas are
> available with the same load. This is most likely to be helpful for the initial selection
> when there is no load on any pmds.

At first glance I agree with your reasoning.
Choosing the least-loaded PMD from all NUMAs, using NUMA-locality as a tie-breaker, makes sense in the 'group' algorithm. How do you select between equally loaded PMDs on the same NUMA node? Just pick any?

BR, Jan
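The selection scheme discussed above can be sketched as follows. This is an illustrative model, not the actual OVS rxq-scheduling code: it assumes a flattened PMD list across NUMAs and uses NUMA-locality only to break ties between equally loaded PMDs.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 'group'-style PMD selection over a flattened PMD list:
 * pick the least-loaded pmd from any NUMA; on equal load prefer the
 * NUMA local to the port.  Hypothetical sketch, not OVS code. */
struct toy_pmd {
    int numa_id;
    double load;    /* e.g. fraction of busy cycles */
};

static int
select_pmd(const struct toy_pmd *pmds, size_t n, int port_numa)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0
            || pmds[i].load < pmds[best].load
            || (pmds[i].load == pmds[best].load
                && pmds[i].numa_id == port_numa
                && pmds[best].numa_id != port_numa)) {
            best = (int) i;
        }
    }
    return best;
}

/* Two equally loaded pmds on numa 1 and numa 0; for a port on numa 0
 * the numa-0 pmd wins the tie-break. */
static int
tiebreak_demo(void)
{
    struct toy_pmd pmds[] = {
        { 0, 0.5 },   /* busier, skipped */
        { 1, 0.2 },
        { 0, 0.2 },   /* same load, local numa: wins */
    };
    return select_pmd(pmds, 3, 0);
}
```

Between equally loaded PMDs on the same NUMA node this sketch simply keeps the first one found; any deterministic choice would do.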
Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas
> Thanks for sharing your experience with it. My fear with the proposal is that
> someone turns this on and then tells us performance is worse and/or OVS
> assignments/ALB are broken, because it has an impact on their case.
>
> In terms of limiting possible negative effects,
> - it can be opt-in and recommended only for phy ports
> - could print a warning when it is enabled
> - ALB is currently disabled with cross-numa polling (except a limited case)
>   but it's clear you want to remove that restriction too
> - for ALB, a user could increase the improvement threshold to account for any
>   reassignments triggered by inaccuracies

[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) ports in ALB "group" mode as an opt-in config option (default off). Based on our observations we are not too concerned about the loss of ALB prediction accuracy, but increasing the threshold may be a way of taking that into account, if wanted.

> There are also some improvements that can be made to the proposed method
> when used with group assignment,
> - we can prefer the local numa where there is no difference between pmd cores
>   (e.g. two unused cores available, pick the local numa one)
> - we can flatten the list of pmds, so the best pmd can be selected. This will remove
>   issues with RR numa when there are different numbers of pmd cores or loads per numa.
> - I wrote an RFC that does these two items; I can post it when(/if!) consensus is
>   reached on the broader topic

[Jan] In our alternative version of the current upstream "group" ALB [1] we already maintained a flat list of PMDs. So we would support that feature. Using NUMA-locality as a tie-breaker makes sense.
[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html

> In summary, it's a trade-off.
>
> With no cross-numa polling (current):
> - won't have any impact on OVS assignment or ALB accuracy
> - there could be a bottleneck on one numa's pmds while the other numa's pmd cores are idle and unused
>
> With cross-numa rx pinning (current):
> - will have access to pmd cores on all numas
> - may require more cycles for some traffic paths
> - won't have any impact on OVS assignment or ALB accuracy
> - >1 pinned rxqs per core may cause a bottleneck depending on traffic
>
> With cross-numa interface setting (proposed):
> - will have access to all pmd cores on all numas (i.e. no unused pmd cores during highest load)
> - will require more cycles for some traffic paths
> - will impact OVS assignment and ALB accuracy
>
> Anything missing above, or is it a reasonable summary?

I think that is a reasonable summary, albeit I would have characterized the third option a bit more positively:

- Gives ALB maximum freedom to balance the load of PMDs on all NUMA nodes (in the likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for better utilization of free PMD cycles
- Mostly suitable for phy ports due to the limited cycle increase for cross-NUMA polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios

We will post a new version of our patch [2] for cross-NUMA polling on selected ports adapted to the current OVS master shortly.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html

Thanks, Jan
Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas
Hi Kevin,

> > We have done extensive benchmarking and found that we get better overall
> > PMD load balance and resulting OVS performance when we do not statically
> > pin any rx queues and instead let the auto-load-balancing find the optimal
> > distribution of phy rx queues over both NUMA nodes to balance an asymmetric
> > load of vhu rx queues (polled only on the local NUMA node).
> >
> > Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
> > to cross-NUMA access to volatile virtio ring pointers in every iteration (not only
> > when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
> > have a similar issue.
>
> I agree that for vhost rxq polling, it always causes a performance penalty when
> there is cross-numa polling.
>
> For polling a phy rxq, when phy and vhost are in different numas, I don't see any
> additional penalty for cross-numa polling the phy rxq.
>
> For the case where phy and vhost are both in the same numa, if I change to poll
> the phy rxq cross-numa, then I see about a >20% tput drop for traffic from
> phy -> vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the packets across the UPI bus to the virtio buffers on the other NUMA, not because of polling the phy rxq on the other NUMA.

> Also, the fact that a different numa can poll the phy rxq after every rebalance
> means that the ability of the auto-load-balancer to estimate and trigger a
> rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue creates when it is moved to another NUMA node. So far we have not seen that as a practical problem.

> It seems like simply pinning some phy rxqs cross-numa would avoid all the
> issues above and give most of the benefit of cross-numa polling for phy rxqs.

That is what we have done in the past (for a lack of alternatives). But any static pinning reduces the ability of the auto-load balancer to do its job.
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of entropy (examples are IP-IP encapsulated traffic, e.g. Calico, or MPLSoGRE encapsulated traffic).
2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily uneven PMD load and loss of overall capacity.

> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
> still assign other rxqs to those cores which have pinned phy rxqs and
> properly adjust the assignments based on the load from the pinned rxqs.

Yes, sometimes the vhu rxq load is distributed such that it can be used to balance the PMDs, but not always. Sometimes the balance is just better when phy rxqs are not pinned.

> New assignments or auto-load-balance would not change the numa polling
> those rxqs, so it would have no impact on ALB or the ability to assign based on load.

In our practical experience the new "group" algorithm for load-based rxq distribution is able to balance the PMD load best when none of the rxqs are pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the prediction error when doing auto-lb dry-runs cannot be significant. In our experience we consistently get the best PMD balance and OVS throughput when we give the auto-lb free hands (no cross-NUMA polling of vhu rxqs, though).

BR, Jan
Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas
> > We do acknowledge the benefit of non-pinned polling of phy rx queues by
> > PMD threads on all NUMA nodes. It gives the auto-load balancer much better
> > options to utilize spare capacity on PMDs on all NUMA nodes.
> >
> > Our patch proposed in
> > https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> > indeed covers the difference between phy and vhu ports.
> > One has to explicitly enable cross-NUMA-polling for individual interfaces with:
> >
> >    ovs-vsctl set interface other_config:cross-numa-polling=true
> >
> > This would typically only be done by static configuration for the fixed set of
> > physical ports. There is no code in OpenStack's os-vif handler to apply such
> > configuration for dynamically created vhu ports.
> >
> > I would strongly suggest that cross-numa-polling be introduced as a per-
> > interface option as in our patch rather than as a per-datapath option as in your
> > patch. Why not adapt our original patch to the latest OVS code base? We can
> > help you with that.
> >
> > BR, Jan

Hi, Jan Scheurich

We can achieve the static setting of pinning a phy port by combining pmd-rxq-isolate and pmd-rxq-affinity. This setting can get the same result, and we have seen the benefits.

The new issue is the polling of vhu on one numa. Under heavy traffic, polling vhu + phy will make the pmds reach 100% usage, while other pmds on the other numa with only phy ports reach 70% usage. Enabling cross-numa polling for a vhu port would give us more benefits in this case. Overloads of different pmds on both numas would be balanced.

As you have mentioned, there is no code to apply this config for vhus while creating them. A global setting would save us from dynamically detecting the vhu name or any new creation.
Hi Wan Junjie,

We have done extensive benchmarking and found that we get better overall PMD load balance and resulting OVS performance when we do not statically pin any rx queues and instead let the auto-load-balancing find the optimal distribution of phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to cross-NUMA access to volatile virtio ring pointers in every iteration (not only when actually copying packets). Cross-NUMA polling of phy rx queues doesn't have a similar issue.

BR, Jan
Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas
> > > > Btw, this patch is similar in functionality to the one posted by
> > > > Anurag [0] and there was also some discussion about this approach
> > > > here [1].
> >
> Thanks for pointing this out.
> IMO, setting cross-numa per interface would be good for a phy port but not
> for vhu, since vhu ports can be destroyed and created relatively frequently.
> But yes, the main idea is the same.

We do acknowledge the benefit of non-pinned polling of phy rx queues by PMD
threads on all NUMA nodes. It gives the auto-load balancer much better options
to utilize spare capacity on PMDs on all NUMA nodes.

Our patch proposed in
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html indeed
covers the difference between phy and vhu ports. One has to explicitly enable
cross-NUMA-polling for individual interfaces with:

    ovs-vsctl set interface other_config:cross-numa-polling=true

This would typically only be done by static configuration for the fixed set of
physical ports. There is no code in OpenStack's os-vif handler to apply such
configuration for dynamically created vhu ports.

I would strongly suggest that cross-numa-polling be introduced as a
per-interface option as in our patch rather than as a per-datapath option as
in your patch. Why not adapt our original patch to the latest OVS code base?
We can help you with that.

BR, Jan
Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.
> > In our patch series we decided to skip the check on cross-numa polling
> > during auto-load balancing. The rationale is as follows:
> >
> > If the estimated PMD-rxq distribution includes cross-NUMA rxq assignments,
> > the same must apply for the current distribution, as none of the
> > scheduling algorithms would voluntarily assign rxqs across NUMA nodes. So,
> > current and estimated rxq assignments are comparable and it makes sense to
> > consider rebalancing when the variance improves.
> >
> > Please consider removing this check.
> >
> The first thing is that this patch is not changing any behaviour, just
> re-implementing to reuse the common code, so it would not be the place to
> change this functionality.

Fair enough. We should address this in a separate patch.

> About the proposed change itself, just to be clear what is allowed
> currently: it will allow rebalance when there are local pmds, OR there are
> no local pmds and there is one other NUMA node with pmds available for
> cross-numa polling.
>
> The rationale for not doing a rebalance when there are no local pmds but
> multiple other NUMAs are available for cross-NUMA polling is that the
> estimate may be incorrect due to a different cross-NUMA node being chosen
> for an Rxq than is currently used.
>
> I thought about some things like making an Rxq sticky with a particular
> cross-NUMA node etc. for this case, but that brings a whole new set of
> problems, e.g. what happens if that NUMA gets overloaded or has its cores
> reduced, how can it ever be reset, etc., so I decided not to pursue it as I
> think it is probably a corner case (at least for now).

We currently don't see any scenarios with more than two NUMA nodes, but
different CPU/server architectures may perhaps have more NUMA nodes than CPU
sockets.

> I know the case of no local pmd and one NUMA with pmds is not a corner case,
> as I'm aware of users doing that.

Agree, such configurations are a must to support with auto-lb.
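The dry-run decision discussed in this exchange reduces to a variance comparison of per-PMD loads: keep the current assignment unless the estimated one is measurably more balanced. A minimal sketch, not OVS code — only the 25% default mirrors OVS's pmd-auto-lb-improvement-threshold, the helper names are made up:

```python
def variance(loads):
    """Population variance of per-PMD load percentages."""
    mean = sum(loads) / len(loads)
    return sum((l - mean) ** 2 for l in loads) / len(loads)

def should_rebalance(current, estimated, improvement_threshold=25):
    """Mimic the auto-load-balance dry run: accept the estimated
    assignment only if it reduces the load variance by at least the
    configured improvement percentage."""
    cur, est = variance(current), variance(estimated)
    if cur == 0:
        return False          # already perfectly balanced
    return (cur - est) / cur * 100 >= improvement_threshold
```

For example, with current loads [90, 10] and estimated loads [50, 50] the variance drops from 1600 to 0, a 100% improvement, so a rebalance would be triggered.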
> We can discuss further about the multiple non-local NUMA case and maybe
> there's some improvements we can think of, or maybe I've made some wrong
> assumptions, but it would be a follow-on from the current patchset.

Our main use case for cross-NUMA balancing comes with the additional freedom
to allow cross-NUMA polling for selected ports that we introduce with the
fourth patch:

dpif-netdev: Allow cross-NUMA polling on selected ports

Today dpif-netdev considers PMD threads on a non-local NUMA node for automatic
assignment of the rxqs of a port only if there are no local, non-isolated
PMDs. On typical servers with both physical ports on one NUMA node, this often
leaves the PMDs on the other NUMA node under-utilized, wasting CPU resources.
The alternative, to manually pin the rxqs to PMDs on remote NUMA nodes, also
has drawbacks as it limits OVS' ability to auto load-balance the rxqs.

This patch introduces a new interface configuration option to allow ports to
be automatically polled by PMDs on any NUMA node:

    ovs-vsctl set interface other_config:cross-numa-polling=true

If this option is not present or set to false, legacy behaviour applies.

We indeed use this for our physical ports to be polled by non-isolated PMDs on
both NUMAs. The observed capacity improvement is very substantial, so we plan
to port this feature on top of your patches once they are merged. This can
only fly if the auto load-balancing is allowed to activate rxq assignments
with cross-NUMA polling also in the case where there are local non-isolated
PMDs. Anyway, we can take this up later in our upcoming patch that introduces
this option.

BR, Jan
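The placement rule under discussion boils down to a candidate-selection step: which PMDs may poll a given port's rxqs at all. A sketch of that logic as described in the thread (illustrative Python, not the dpif-netdev implementation; the data layout is invented):

```python
def candidate_pmds(port_numa, pmds, cross_numa_polling=False):
    """Return the PMDs eligible to poll a port's rxqs.

    pmds: list of (core_id, numa_id, isolated) tuples.
    Legacy rule: only non-isolated PMDs on the port's own NUMA node;
    remote PMDs are considered only when no local one exists.
    With the proposed per-interface cross-numa-polling option,
    non-isolated PMDs on all NUMA nodes become eligible.
    """
    usable = [p for p in pmds if not p[2]]           # drop isolated PMDs
    if cross_numa_polling:
        return usable
    local = [p for p in usable if p[1] == port_numa]
    return local if local else usable
```

With two NUMA nodes and one non-isolated PMD on each, the legacy rule confines the port to its local PMD, while the option opens up both — which is exactly what lets the auto-load balancer spread phy rxqs across nodes.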
Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.
> >> +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can be
> >> +*non-isolated* by setting::
> >> +
> >> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
> >
> > Is there any specific reason why the new pmd-rxq-isolate option should be
> > limited to the "group" scheduling algorithm? In my view it would make more
> > sense and simplify documentation and code if these aspects of scheduling
> > were kept orthogonal.
> >
> David had a similar comment on an earlier version. I will add the fuller
> reply below. In summary, pinning and the other algorithms (particularly if
> there were multiple pinnings) conflict in how they operate, because they are
> based on round-robining PMDs to add an equal number of Rxqs per PMD.
>
> ---
> Yes, the main issue is that the other algorithms are based on a pmd order,
> on the assumption that they start from empty. For 'roundrobin' it is to
> equally distribute the number of rxqs over the pmds RR - if we pin several
> rxqs on one pmd, it is not clear what to do etc. Even worse for 'cycles', it
> is based on placing rxqs in order of busy cycles on the pmds RR. If we pin
> an rxq initially we are conflicting with how the algorithm operates.
>
> Because the 'group' algorithm is not tied to a defined pmd order, it is
> better suited to place a pinned rxq initially and to consider that pmd on
> its merits along with the others later.
> ---

Makes sense. The legacy round-robin and cycles algorithms are not really
prepared for pinned rxqs.

In our patch set we had replaced the original round-robin and cycles
algorithms with a single new algorithm that simply distributed the rxqs to the
"least-loaded" non-isolated PMD (based on #rxqs or cycles assigned so far).
With pinning=isolation, this exactly reproduced the round-robin behavior and
created at least as good a balance as the original cycles algorithm. In that
new algorithm any pinned rxqs were already considered and didn't lead to
distorted results.
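The "least-loaded" scheme sketched in the last paragraph — seat any pinned rxqs first, then place each remaining rxq (heaviest first) on the PMD with the least load assigned so far — can be illustrated like this (a hypothetical Python sketch, not the actual patch code):

```python
def assign_least_loaded(rxqs, pmds, pinned=None):
    """Greedy rxq-to-PMD assignment.

    rxqs:   dict rxq name -> measured cycles (the load metric)
    pmds:   list of non-isolated PMD core ids
    pinned: dict rxq name -> PMD core id (pmd-rxq-affinity style pins)
    """
    pinned = pinned or {}
    load = {pmd: 0 for pmd in pmds}
    assignment = {}
    # Pinned rxqs are placed first, so their load counts against the PMD.
    for rxq, pmd in pinned.items():
        assignment[rxq] = pmd
        load[pmd] += rxqs[rxq]
    # Remaining rxqs, heaviest first, each go to the least-loaded PMD.
    for rxq in sorted((r for r in rxqs if r not in pinned),
                      key=rxqs.get, reverse=True):
        pmd = min(load, key=load.get)
        assignment[rxq] = pmd
        load[pmd] += rxqs[rxq]
    return assignment
```

Unlike a round-robin over a fixed pmd order, a pinned rxq here simply pre-loads its PMD, and the remaining rxqs flow around it instead of distorting the result.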
BR, Jan
Re: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.
> -Original Message- > From: dev On Behalf Of Kevin Traynor > Sent: Thursday, 8 July, 2021 15:54 > To: d...@openvswitch.org > Cc: david.march...@redhat.com > Subject: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance > use common rxq scheduling. > > PMD auto load balance had its own separate implementation of the rxq > scheduling that it used for dry runs. This was done because previously the rxq > scheduling was not made reusable for a dry run. > > Apart from the code duplication (which is a good enough reason to replace it > alone) this meant that if any further rxq scheduling changes or assignment > types were added they would also have to be duplicated in the auto load > balance code too. > > This patch replaces the current PMD auto load balance rxq scheduling code to > reuse the common rxq scheduling code. > > The behaviour does not change from a user perspective, except the logs are > updated to be more consistent. > > As the dry run will compare the pmd load variances for current and estimated > assignments, new functions are added to populate the current assignments and > use the rxq scheduling data structs for variance calculations. > > Now that the new rxq scheduling data structures are being used in PMD auto > load balance, the older rr_* data structs and associated functions can be > removed. 
> >
> > Signed-off-by: Kevin Traynor
> > ---
> >  lib/dpif-netdev.c | 508 +++---
> >  1 file changed, 161 insertions(+), 347 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index beafa00a0..338ffd971 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -4903,138 +4903,4 @@ port_reconfigure(struct dp_netdev_port *port)
> >  }
> >
> > -struct rr_numa_list {
> > -    struct hmap numas;  /* Contains 'struct rr_numa' */
> > -};
> > -
> > -struct rr_numa {
> > -    struct hmap_node node;
> > -
> > -    int numa_id;
> > -
> > -    /* Non isolated pmds on numa node 'numa_id' */
> > -    struct dp_netdev_pmd_thread **pmds;
> > -    int n_pmds;
> > -
> > -    int cur_index;
> > -    bool idx_inc;
> > -};
> > -
> > -static size_t
> > -rr_numa_list_count(struct rr_numa_list *rr)
> > -{
> > -    return hmap_count(&rr->numas);
> > -}
> > -
> > -static struct rr_numa *
> > -rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
> > -{
> > -    struct rr_numa *numa;
> > -
> > -    HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), &rr->numas) {
> > -        if (numa->numa_id == numa_id) {
> > -            return numa;
> > -        }
> > -    }
> > -
> > -    return NULL;
> > -}
> > -
> > -/* Returns the next node in numa list following 'numa' in round-robin fashion.
> > - * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
> > - * Returns NULL if 'rr' numa list is empty. */
> > -static struct rr_numa *
> > -rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
> > -{
> > -    struct hmap_node *node = NULL;
> > -
> > -    if (numa) {
> > -        node = hmap_next(&rr->numas, &numa->node);
> > -    }
> > -    if (!node) {
> > -        node = hmap_first(&rr->numas);
> > -    }
> > -
> > -    return (node) ?
CONTAINER_OF(node, struct rr_numa, node) : NULL;
> > -}
> > -
> > -static void
> > -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
> > -{
> > -    struct dp_netdev_pmd_thread *pmd;
> > -    struct rr_numa *numa;
> > -
> > -    hmap_init(&rr->numas);
> > -
> > -    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> > -        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
> > -            continue;
> > -        }
> > -
> > -        numa = rr_numa_list_lookup(rr, pmd->numa_id);
> > -        if (!numa) {
> > -            numa = xzalloc(sizeof *numa);
> > -            numa->numa_id = pmd->numa_id;
> > -            hmap_insert(&rr->numas, &numa->node, hash_int(pmd->numa_id, 0));
> > -        }
> > -        numa->n_pmds++;
> > -        numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds);
> > -        numa->pmds[numa->n_pmds - 1] = pmd;
> > -        /* At least one pmd so initialise curr_idx and idx_inc. */
> > -        numa->cur_index = 0;
> > -        numa->idx_inc = true;
> > -    }
> > -}
> > -
> > -/*
> > - * Returns the next pmd from the numa node.
> > - *
> > - * If 'updown' is 'true' it will alternate between selecting the next pmd in
> > - * either an up or down walk, switching between up/down when the first or last
> > - * core is reached. e.g. 1,2,3,3,2,1,1,2...
> > - *
> > - * If 'updown' is 'false' it will select the next pmd wrapping around when last
> > - * core reached. e.g. 1,2,3,1,2,3,1,2...
> > - */
> > -static struct dp_netdev_pmd_thread *
> > -rr_numa_get_pmd(struct rr_numa *numa, bool updown)
> > -{
> > -    int numa_idx = numa->cur_index;
> > -
> > -    if (numa->idx_inc == true) {
> > -        /* Incrementing through list of pmds. */
> > -        if (numa->cur_index == numa->n_pmds - 1) {
> > -            /* Reached the last pmd. */
> > -            if (updown) {
> > -                numa->idx_inc = false;
> > -            } else {
> > -                numa->cur_index = 0;
> > -            }
> > -        } else {
> > -            numa->cur_index++;
> > -        }
> > -    }
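For reference, the up/down walk that the removed rr_numa_get_pmd() implemented (1,2,3,3,2,1,... with updown, 1,2,3,1,2,3,... without) boils down to this small generator — a zero-indexed Python sketch of the quoted C, not code from the patch:

```python
def rr_walk(n_pmds, updown):
    """Yield PMD indices the way rr_numa_get_pmd() stepped cur_index.

    updown=True : walk up and down, repeating the end indices
                  (0,1,2,2,1,0,0,1,...).
    updown=False: plain wrap-around round robin (0,1,2,0,1,2,...).
    """
    i, inc = 0, True
    while True:
        yield i
        if inc:
            if i == n_pmds - 1:       # reached the last pmd
                if updown:
                    inc = False       # turn around, repeat the end index
                else:
                    i = 0             # wrap to the first pmd
            else:
                i += 1
        else:
            if i == 0:                # reached the first pmd
                inc = True            # turn around, repeat the start index
            else:
                i -= 1
```

The repetition of the end indices in updown mode is intentional: it is what made the original scheme distribute rxqs symmetrically instead of favoring middle cores.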
Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate PMD.
> -Original Message- > From: dev On Behalf Of Kevin Traynor > Sent: Thursday, 8 July, 2021 15:54 > To: d...@openvswitch.org > Cc: david.march...@redhat.com > Subject: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Allow pin rxq and non-isolate > PMD. > > Pinning an rxq to a PMD with pmd-rxq-affinity may be done for various reasons > such as reserving a full PMD for an rxq, or to ensure that multiple rxqs from > a > port are handled on different PMDs. > > Previously pmd-rxq-affinity always isolated the PMD so no other rxqs could be > assigned to it by OVS. There may be cases where there is unused cycles on > those pmds and the user would like other rxqs to also be able to be assigned > to > it by OVS. > > Add an option to pin the rxq and non-isolate the PMD. The default behaviour is > unchanged, which is pin and isolate the PMD. > > In order to pin and non-isolate: > ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false > > Note this is available only with group assignment type, as pinning conflicts > with > the operation of the other rxq assignment algorithms. > > Signed-off-by: Kevin Traynor > --- > Documentation/topics/dpdk/pmd.rst | 9 ++- > NEWS | 3 + > lib/dpif-netdev.c | 34 -- > tests/pmd.at | 105 ++ > vswitchd/vswitch.xml | 19 ++ > 5 files changed, 162 insertions(+), 8 deletions(-) > > diff --git a/Documentation/topics/dpdk/pmd.rst > b/Documentation/topics/dpdk/pmd.rst > index 29ba53954..30040d703 100644 > --- a/Documentation/topics/dpdk/pmd.rst > +++ b/Documentation/topics/dpdk/pmd.rst > @@ -102,6 +102,11 @@ like so: > - Queue #3 pinned to core 8 > > -PMD threads on cores where Rx queues are *pinned* will become *isolated*. > This -means that this thread will only poll the *pinned* Rx queues. > +PMD threads on cores where Rx queues are *pinned* will become > +*isolated* by default. This means that this thread will only poll the > *pinned* > Rx queues. 
> + > +If using ``pmd-rxq-assign=group`` PMD threads with *pinned* Rxqs can be > +*non-isolated* by setting:: > + > + $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false Is there any specific reason why the new pmd-rxq-isolate option should be limited to the "group" scheduling algorithm? In my view it would make more sense and simplify documentation and code if these aspects of scheduling were kept orthogonal. Regards, Jan
Re: [ovs-dev] [PATCH] tests: Fixed L3 over patch port tests
LGTM. Acked-by: Jan Scheurich > -Original Message- > From: Martin Varghese > Sent: Wednesday, 9 June, 2021 15:36 > To: d...@openvswitch.org; i.maxim...@ovn.org; Jan Scheurich > > Cc: Martin Varghese > Subject: [PATCH] tests: Fixed L3 over patch port tests > > From: Martin Varghese > > Normal action is replaced with output to GRE port for sending > l3 packets over GRE tunnel. Normal action cannot be used with > l3 packets. > > Fixes: d03d0cf2b71b ("tests: Extend PTAP unit tests with decap action") > Signed-off-by: Martin Varghese > --- > tests/packet-type-aware.at | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/tests/packet-type-aware.at b/tests/packet-type-aware.at index > 540cf98f3..73aa14cea 100644 > --- a/tests/packet-type-aware.at > +++ b/tests/packet-type-aware.at > @@ -697,7 +697,7 @@ AT_CHECK([ > ovs-ofctl del-flows br1 && > ovs-ofctl del-flows br2 && > ovs-ofctl add-flow br0 in_port=n0,actions=decap,output=p0 -OOpenFlow13 > && > -ovs-ofctl add-flow br1 in_port=p1,actions=NORMAL && > +ovs-ofctl add-flow br1 in_port=p1,actions=output=gre1 && > ovs-ofctl add-flow br2 in_port=LOCAL,actions=output=n2 ], [0]) > > @@ -708,7 +708,7 @@ AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br0 | > ofctl_strip | grep actions], > > AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br1 | ofctl_strip | grep > actions], [0], [dnl > - reset_counts in_port=20 actions=NORMAL > + reset_counts in_port=20 actions=output:100 > ]) > > AT_CHECK([ovs-ofctl -OOpenFlow13 dump-flows br2 | ofctl_strip | grep > actions], @@ -726,7 +726,7 @@ ovs-appctl time/warp 1000 AT_CHECK([ > ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | strip_used | > grep -v ipv6 | sort ], [0], [flow-dump from the main thread: > - > recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:09:9c:ab,dst=1 > e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag=no), packets:1, > bytes:98, used:0.0s, > actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3,eth(dst > 
=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(src=10.0.0.1,dst =10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0x0,proto=0x800))),out _port(br2)),n2) > +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(t > +os=0/0x3,frag=no), packets:1, bytes:98, used:0.0s, > +actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,type=3, > +eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800),ipv4(sr > +c=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000),gre((flags=0 > +x0,proto=0x800))),out_port(br2)),n2) > ]) > > AT_CHECK([ > -- > 2.18.4
Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action with NSH Decap
> -Original Message-
> From: Martin Varghese
> Sent: Monday, 7 June, 2021 16:47
> To: Ilya Maximets
> Cc: d...@openvswitch.org; echau...@redhat.com; Jan Scheurich
> ; Martin Varghese
> Subject: Re: [ovs-dev] [PATCH v2] Fix redundant datapath set ethernet action
> with NSH Decap
>
> On Wed, May 19, 2021 at 12:26:40PM +0200, Ilya Maximets wrote:
> > On 5/19/21 5:26 AM, Martin Varghese wrote:
> > > On Tue, May 18, 2021 at 10:03:39PM +0200, Ilya Maximets wrote:
> > >> On 5/17/21 3:45 PM, Martin Varghese wrote:
> > >>> From: Martin Varghese
> > >>>
> > >>> When a decap action is applied on an NSH header encapsulating an
> > >>> ethernet packet, a redundant set mac address action is programmed
> > >>> to the datapath.
> > >>>
> > >>> Fixes: f839892a206a ("OF support and translation of generic encap
> > >>> and decap")
> > >>> Signed-off-by: Martin Varghese
> > >>> Acked-by: Jan Scheurich
> > >>> Acked-by: Eelco Chaudron
> > >>> ---
> > >>> Changes in v2:
> > >>>   - Fixed code styling
> > >>>   - Added Ack from jan.scheur...@ericsson.com
> > >>>   - Added Ack from echau...@redhat.com
> > >>>
> > >>
> > >> Hi, Martin.
> > >> For some reason this patch triggers frequent failures of the
> > >> following unit test:
> > >>
> > >> 2314. packet-type-aware.at:619: testing ptap - L3 over patch port ...
>
> The test is failing because, during revalidation, the NORMAL action is
> dropping packets. With these changes, the mac addresses in the flow
> structure get cleared by the decap action, so the NORMAL action drops the
> packet assuming a loop (SRC and DST mac addresses are zero). I assume NORMAL
> action handling in xlate_push_stats_entry is not adapted for l3 packets. The
> timing at which the revalidator gets triggered explains the sporadicity of
> the issue. The issue was never seen before because the MAC addresses in the
> flow structure were not cleared by decap.
>
> So can we use the NORMAL action with an L3 packet? Does OVS handle all the
> L3 use cases with the Normal action?
If not, shouldn't we not use NORMAL action in > this test case > > Comments? > Good catch! Normal flow L2 bridging is of course nonsense for the use case of forwarding an L3 packet. I am surprised that the packet was forwarded at all in the first place. That in itself can be considered a bug. Correctly, a Normal flow should drop non-Ethernet packets, I would say. To fix the test case I suggest to replace the Normal action in br1 with "output:gre1" in line 700. > > > >> stdout: > > >> warped > > >> ./packet-type-aware.at:726: > > >> ovs-appctl dpctl/dump-flows --names dummy@ovs-dummy | > > >> strip_used | grep -v ipv6 | sort > > >> > > >> --- - 2021-05-18 21:57:56.810513366 +0200 > > >> +++ /home/i.maximets/work/git/ovs/tests/testsuite.dir/at- > groups/2314/stdout 2021-05-18 21:57:56.806609814 +0200 > > >> @@ -1,3 +1,3 @@ > > >> flow-dump from the main thread: > > >> -recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0 > > >> 9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,frag > > >> =no), packets:1, bytes:98, used:0.0s, > > >> actions:pop_eth,clone(tnl_push(tnl_port(gre_sys),header(size=38,typ > > >> e=3,eth(dst=de:af:be:ef:ba:be,src=aa:55:00:00:00:02,dl_type=0x0800) > > >> ,ipv4(src=10.0.0.1,dst=10.0.0.2,proto=47,tos=0,ttl=64,frag=0x4000), > > >> gre((flags=0x0,proto=0x800))),out_port(br2)),n2) > > >> +recirc_id(0),in_port(n0),packet_type(ns=0,id=0),eth(src=3a:6d:d2:0 > > >> +9:9c:ab,dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(tos=0/0x3,fra > > >> +g=no), packets:1, bytes:98, used:0.0s, actions:drop > > >> > > >> > > >> It fails very frequently in GitHub Actions, but it's harder to make > > >> it fail on my local machine. 
Following change to the test allows > > >> to reproduce the failure almost always on my local machine: > > >> > > >> diff --git a/tests/packet-type-aware.at > > >> b/tests/packet-type-aware.at index 540cf98f3..01dbc8030 100644 > > >> --- a/tests/packet-type-aware.at > > >> +++ b/tests/packet-type-aware.at > > >> @@ -721,7 +721,7 @@ AT_CHECK([ > > >> ovs-appctl netdev-dummy/receive n0 > > >> > 1e2ce92a669e3a6
Re: [ovs-dev] [PATCH] Fix redundant datapath set ethernet action with NSH Decap
Hi Martin,

Somehow I didn't receive this patch email via the ovs-dev mailing list;
perhaps one of the many spam filters on the way interfered. Don't know if this
response email will be recognized by ovs patchwork.

The nsh.at lines are broken wrongly in this email, but they look OK in
patchwork. Otherwise, LGTM.

Acked-by: Jan Scheurich

/Jan

> -Original Message-
> From: Varghese, Martin (Nokia - IN/Bangalore)
> Sent: Monday, 3 May, 2021 15:25
> To: Jan Scheurich
> Subject: RE: [PATCH] Fix redundant datapath set ethernet action with NSH
> Decap
>
> Hi Jan,
>
> Could you please review this patch.
>
> Regards,
> Martin
>
> -Original Message-
> From: Martin Varghese
> Sent: Tuesday, April 27, 2021 6:13 PM
> To: d...@openvswitch.org; echau...@redhat.com; jan.scheur...@ericsson.com
> Cc: Varghese, Martin (Nokia - IN/Bangalore)
> Subject: [PATCH] Fix redundant datapath set ethernet action with NSH Decap
>
> From: Martin Varghese
>
> When a decap action is applied on an NSH header encapsulating an ethernet
> packet, a redundant set mac address action is programmed to the datapath.
> > Fixes: f839892a206a ("OF support and translation of generic encap and > decap") > Signed-off-by: Martin Varghese > --- > lib/odp-util.c | 3 ++- > ofproto/ofproto-dpif-xlate.c | 2 ++ > tests/nsh.at | 8 > 3 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/lib/odp-util.c b/lib/odp-util.c index e1199d1da..9d558082f 100644 > --- a/lib/odp-util.c > +++ b/lib/odp-util.c > @@ -7830,7 +7830,8 @@ commit_set_ether_action(const struct flow *flow, > struct flow *base_flow, > struct offsetof_sizeof ovs_key_ethernet_offsetof_sizeof_arr[] = > OVS_KEY_ETHERNET_OFFSETOF_SIZEOF_ARR; > > -if (flow->packet_type != htonl(PT_ETH)) { > +if ((flow->packet_type != htonl(PT_ETH)) || > +(base_flow->packet_type != htonl(PT_ETH))) { > return; > } > > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index > 7108c8a30..a6f4ea334 100644 > --- a/ofproto/ofproto-dpif-xlate.c > +++ b/ofproto/ofproto-dpif-xlate.c > @@ -6549,6 +6549,8 @@ xlate_generic_decap_action(struct xlate_ctx *ctx, > * Delay generating pop_eth to the next commit. 
*/ > flow->packet_type = htonl(PACKET_TYPE(OFPHTN_ETHERTYPE, >ntohs(flow->dl_type))); > +flow->dl_src = eth_addr_zero; > +flow->dl_dst = eth_addr_zero; > ctx->wc->masks.dl_type = OVS_BE16_MAX; > } > return false; > diff --git a/tests/nsh.at b/tests/nsh.at index d5c772ff0..e84134e42 100644 > --- a/tests/nsh.at > +++ b/tests/nsh.at > @@ -105,7 +105,7 @@ bridge("br0") > > Final flow: > in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:6 > 6,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=1,nsh_np=3,nsh_spi=0 > x1234,nsh_si=255,nsh_c1=0x11223344,nsh_c2=0x0,nsh_c3=0x0,nsh_c4=0x0, > nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0 > Megaflow: > recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no > -Datapath actions: > push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c > 2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:6 > 6),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x1) > +Datapath actions: > +push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344, > c > +2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55: > +66),pop_eth,pop_nsh(),recirc(0x1) > ]) > > AT_CHECK([ > @@ -139,7 +139,7 @@ ovs-appctl time/warp 1000 AT_CHECK([ > ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v > ipv6 | sort ], [0], [flow-dump from the main thread: > - > recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth > _type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s, > actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x112 > 23344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33: > 44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x3) > +recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9 > +e),eth_type(0x0800),ipv4(frag=no), packets:1, bytes:98, used:0.0s, > +actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11 > 
+223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22: > 3 > +3:44:55:66),pop_eth,pop_nsh(),recirc(0x3) > > recirc_id(0x3),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag= > no), p
Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> -Original Message-
> From: Martin Varghese
> Sent: Tuesday, 13 April, 2021 16:20
> To: Jan Scheurich
> Cc: Eelco Chaudron ; d...@openvswitch.org;
> pshe...@ovn.org; martin.vargh...@nokia.com
> Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
>
> On Wed, Apr 07, 2021 at 03:49:07PM +, Jan Scheurich wrote:
> > Hi Martin,
> >
> > I guess you are aware of the original design document we wrote for generic
> > encap/decap and NSH support:
> > https://docs.google.com/document/d/1oWMYUH8sjZJzWa72o2q9kU0N6pNE-rwZcLH3-kbbDR8/edit#
> >
> > It is no longer 100% aligned with the final implementation in OVS but
> > still a good reference for understanding the design principles behind the
> > implementation and some specifics for Ethernet and NSH encap/decap use
> > cases.
> >
> > Please find some more answers/comments below.
> >
> > BR, Jan
> >
> > > -Original Message-
> > > From: Martin Varghese
> > > Sent: Wednesday, 7 April, 2021 10:43
> > > To: Jan Scheurich
> > > Cc: Eelco Chaudron ; d...@openvswitch.org;
> > > pshe...@ovn.org; martin.vargh...@nokia.com
> > > Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
> > >
> > > On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote:
> > > > Hi,
> > > >
> > > > Thanks for the heads up. The interaction with MPLS push/pop is a use
> > > > case that was likely not tested during the NSH and generic encap/decap
> > > > design. It's complex code and a long time ago. I'm willing to help,
> > > > but I will need some time to go back and have a look.
> > > >
> > > > It would definitely help, if you could provide a minimal example for
> > > > reproducing the problem.
> > > >
> > >
> > > Hi Jan ,
> > >
> > > Thanks for your help.
> > >
> > > I was trying to implement ENCAP/DECAP support for MPLS.
> > > > > > The programming of datapath flow for the below userspace rule fails > > > as there is set(eth() action between pop_mpls and recirc ovs-ofctl > > > -O OpenFlow13 add- flow br_mpls2 > > > "in_port=$egress_port,dl_type=0x8847 > > > actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1 > > > > > > 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs- > > > system: failed to put[create] (Invalid argument) > > > ufid:1dddb0ba-27fe-44ea- 9a99-5815764b4b9c > > > recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0) > > > ,ct_state > > > (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00: > > > 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8 > > > 847) ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1), > > > actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45) > > > > > > > Conceptually, what should happen in this scenario is that, after the second > decap(packet_type(ns=0,type=0) action, OVS processes the unchanged inner > packet as packet type PT_ETH, i.e. as L2 Ethernet frame. Overwriting the > existing Ethernet header with zero values through set(eth()) is clearly > incorrect. That is a logical error inside the ofproto-dpif-xlate module (see > below). > > > > I believe the netdev userspace datapath would still have accepted the > incorrect datapath flow. I have too little experience with the kernel > datapath to > explain why that rejects the datapath flow as invalid. > > > > Unlike in the Ethernet and NSH cases, the MPLS header does not contain any > indication about the inner packet type. That is why the packet_type must be > provided by the SDN controller as part of the decap() action. And the > ofproto- > dpif-xlate module must consider the specified inner packet type when > continuing the translation. In the general case, a decap() action should > trigger > recirculation for reparsing of the inner packet, so the new packet type must > be > set before recirculation. 
(Exceptions to the general recirculation rule are > those > where OVS has already parsed further into the packet and ofproto can modify > the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all but the > last bottom of stack label). > > > > I have had a look at your new code for encap/decap o
Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
Hi Martin, I guess you are aware of the original design document we wrote for generic encap/decap and NSH support: https://docs.google.com/document/d/1oWMYUH8sjZJzWa72o2q9kU0N6pNE-rwZcLH3-kbbDR8/edit# It is no longer 100% aligned with the final implementation in OVS but still a good reference for understanding the design principles behind the implementation and some specifics for Ethernet and NSH encap/decap use cases. Please find some more answers/comments below. BR, Jan > -Original Message- > From: Martin Varghese > Sent: Wednesday, 7 April, 2021 10:43 > To: Jan Scheurich > Cc: Eelco Chaudron ; d...@openvswitch.org; > pshe...@ovn.org; martin.vargh...@nokia.com > Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type. > > On Tue, Apr 06, 2021 at 09:00:16AM +, Jan Scheurich wrote: > > Hi, > > > > Thanks for the heads up. The interaction with MPLS push/pop is a use case > that was likely not tested during the NSH and generic encap/decap design. It's > complex code and a long time ago. I'm willing to help, but I will need some > time to go back and have a look. > > > > It would definitely help, if you could provide a minimal example for > reproducing the problem. > > > > Hi Jan , > > Thanks for your help. > > I was trying to implement ENCAP/DECAP support for MPLS. 
> > The programming of datapath flow for the below userspace rule fails as there > is set(eth() action between pop_mpls and recirc ovs-ofctl -O OpenFlow13 add- > flow br_mpls2 "in_port=$egress_port,dl_type=0x8847 > actions=decap(),decap(packet_type(ns=0,type=0),goto_table:1 > > 2021-04-05T05:46:49.192Z|00068|dpif(handler51)|WARN|system@ovs- > system: failed to put[create] (Invalid argument) ufid:1dddb0ba-27fe-44ea- > 9a99-5815764b4b9c > recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(6),skb_mark(0/0),ct_state > (0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:01/00: > 00:00:00:00:00,dst=00:00:00:00:00:02/00:00:00:00:00:00),eth_type(0x8847) > ,mpls(label=2/0x0,tc=0/0,ttl=64/0x0,bos=1/1), > actions:pop_eth,pop_mpls(eth_type=0x6558),set(eth()),recirc(0x45) > Conceptually, what should happen in this scenario is that, after the second decap(packet_type(ns=0,type=0) action, OVS processes the unchanged inner packet as packet type PT_ETH, i.e. as L2 Ethernet frame. Overwriting the existing Ethernet header with zero values through set(eth()) is clearly incorrect. That is a logical error inside the ofproto-dpif-xlate module (see below). I believe the netdev userspace datapath would still have accepted the incorrect datapath flow. I have too little experience with the kernel datapath to explain why that rejects the datapath flow as invalid. Unlike in the Ethernet and NSH cases, the MPLS header does not contain any indication about the inner packet type. That is why the packet_type must be provided by the SDN controller as part of the decap() action. And the ofproto-dpif-xlate module must consider the specified inner packet type when continuing the translation. In the general case, a decap() action should trigger recirculation for reparsing of the inner packet, so the new packet type must be set before recirculation. 
(Exceptions to the general recirculation rule are those cases where OVS has already parsed further into the packet and ofproto can modify the flow on the fly: decap(Ethernet) and possibly decap(MPLS) for all but the last bottom-of-stack label.)

I have had a look at your new code for encap/decap of MPLS headers, but I must admit I cannot fully judge to what extent the existing translation functions for MPLS label stacks, written for the legacy push/pop_mpls case (i.e. manipulating a label stack between the L2 and the L3 headers of a PT_ETH packet), can be re-used in the new context.

BTW: Do you support multiple MPLS label encap or decap actions with your patch? Have you tested that?

I am uncertain about the handling of the ethertype of the decapsulated inner packet. In the design base, the ethertype that is set in the existing L2 header of the packet after pop_mpls of the last label comes from the pop_mpls action, while in the decap(packet_type(0,0)) case the entire inner packet should be recirculated as is with packet_type PT_ETH.

    case PT_MPLS: {
        int n;
        ovs_be16 ethertype;

        flow->packet_type = decap->new_pkt_type;
        ethertype = pt_ns_type_be(flow->packet_type);

        n = flow_count_mpls_labels(flow, ctx->wc);
        flow_pop_mpls(flow, n, ethertype, ctx->wc);
        if (!ctx->xbridge->support.add_mpls) {
            ctx->xout->slow |= SLOW_ACTION;
        }
        ctx->pending_decap = true;
        return true;

In the example scenario the new_pkt_type is P
Re: [ovs-dev] [PATCH v4 1/2] Encap & Decap actions for MPLS packet type.
Hi, Thanks for the heads up. The interaction with MPLS push/pop is a use case that was likely not tested during the NSH and generic encap/decap design. It's complex code and a long time ago. I'm willing to help, but I will need some time to go back and have a look. It would definitely help, if you could provide a minimal example for reproducing the problem. BR, Jan > -Original Message- > From: Eelco Chaudron > Sent: Tuesday, 6 April, 2021 10:55 > To: Martin Varghese ; Jan Scheurich > > Cc: d...@openvswitch.org; pshe...@ovn.org; martin.vargh...@nokia.com > Subject: Re: [PATCH v4 1/2] Encap & Decap actions for MPLS packet type. > > > > On 6 Apr 2021, at 10:27, Martin Varghese wrote: > > > On Thu, Apr 01, 2021 at 11:32:06AM +0200, Eelco Chaudron wrote: > >> > >> > >> On 1 Apr 2021, at 11:28, Martin Varghese wrote: > >> > >>> On Thu, Apr 01, 2021 at 11:17:14AM +0200, Eelco Chaudron wrote: > >>>> > >>>> > >>>> On 1 Apr 2021, at 11:09, Martin Varghese wrote: > >>>> > >>>>> On Thu, Apr 01, 2021 at 10:54:42AM +0200, Eelco Chaudron wrote: > >>>>>> > >>>>>> > >>>>>> On 1 Apr 2021, at 10:35, Martin Varghese wrote: > >>>>>> > >>>>>>> On Thu, Apr 01, 2021 at 08:59:27AM +0200, Eelco Chaudron wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> On 1 Apr 2021, at 6:10, Martin Varghese wrote: > >>>>>>>> > >>>>>>>>> On Wed, Mar 31, 2021 at 03:59:40PM +0200, Eelco Chaudron > >>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 26 Mar 2021, at 7:21, Martin Varghese wrote: > >>>>>>>>>> > >>>>>>>>>>> From: Martin Varghese > >>>>>>>>>>> > >>>>>>>>>>> The encap & decap actions are extended to support MPLS > >>>>>>>>>>> packet type. > >>>>>>>>>>> Encap & decap actions adds and removes MPLS header at start > >>>>>>>>>>> of the packet. > >>>>>>>>>> > >>>>>>>>>> Hi Martin, > >>>>>>>>>> > >>>>>>>>>> I’m trying to do some real-life testing, and I’m running into > >>>>>>>>>> issues. 
This might be me setting it up wrongly but just > >>>>>>>>>> wanting to confirm… > >>>>>>>>>> > >>>>>>>>>> I’m sending an MPLS packet that contains an ARP packet into a > >>>>>>>>>> physical port. > >>>>>>>>>> This is the packet: > >>>>>>>>>> > >>>>>>>>>> Frame 4: 64 bytes on wire (512 bits), 64 bytes captured (512 > >>>>>>>>>> bits) > >>>>>>>>>> Encapsulation type: Ethernet (1) > >>>>>>>>>> [Protocols in frame: eth:ethertype:mpls:data] Ethernet > >>>>>>>>>> II, Src: 00:00:00_00:00:01 (00:00:00:00:00:01), Dst: > >>>>>>>>>> 00:00:00_00:00:02 (00:00:00:00:00:02) > >>>>>>>>>> Destination: 00:00:00_00:00:02 (00:00:00:00:00:02) > >>>>>>>>>> Address: 00:00:00_00:00:02 (00:00:00:00:00:02) > >>>>>>>>>> ..0. = LG bit: Globally > >>>>>>>>>> unique address (factory default) > >>>>>>>>>> ...0 = IG bit: > >>>>>>>>>> Individual address > >>>>>>>>>> (unicast) > >>>>>>>>>> Source: 00:00:00_00:00:01 (00:00:00:00:00:01) > >>>>>>>>>> Address: 00:00:00_00:00:01 (00:00:00:00:00:01) > >>>>>>>>>> ..0. = LG bit: Globally > >>>>>>>>>> unique address (factory default) > >>>>>>>>>> ...0 = IG bit: > >>>>>>>>>> Individual address > >>>&
Re: [ovs-dev] [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.
LGTM. Please back-port to stable branches. Acked-by: Jan Scheurich /Jan > -Original Message- > From: Ilya Maximets > Sent: Wednesday, 14 October, 2020 18:14 > To: ovs-dev@openvswitch.org; Jan Scheurich > Cc: Ben Pfaff ; Ilya Maximets > Subject: [PATCH v2] ofp-ed-props: Fix using uninitialized padding for NSH > encap actions. > > OVS uses memcmp to compare actions of existing and new flows, but 'struct > ofp_ed_prop_nsh_md_type' and corresponding ofpact structure has > 3 bytes of padding that never initialized and passed around within OF data > structures and messages. > > Uninitialized bytes in MemcmpInterceptorCommon > at offset 21 inside [0x709003f8, 136) > WARNING: MemorySanitizer: use-of-uninitialized-value > #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e) > #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31 > #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37 > #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13 > #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17 > #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17 > #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17 > #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16 > #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21 > #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13 > #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9 > #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5 > #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9 > #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5 > #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9 > #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6) > #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d) > > Uninitialized value was stored to memory at > #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd) > #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5 > #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11 > #3 0x55d419 in add_flow_init 
ofproto/ofproto.c:5123:17 > #4 0x54841f in ofproto_flow_mod_init ofproto/ofproto.c:7987:17 > #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13 > #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17 > #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16 > #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21 > #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13 > #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9 > #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5 > #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9 > #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5 > #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9 > #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6) > > Uninitialized value was created by an allocation of 'ofpacts_stub' > in the stack frame of function 'handle_flow_mod' > #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170 > > This could cause issues with flow modifications or other operations. > > To reproduce, some NSH tests could be run under valgrind or clang > MemorySantizer. Ex. "nsh - md1 encap over a veth link" test. > > Fix that by clearing padding bytes while encoding and decoding. > OVS will still accept OF messages with non-zero padding from controllers. > > New tests added to tests/ofp-actions.at. 
> > Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH") > Signed-off-by: Ilya Maximets > --- > lib/ofp-ed-props.c | 3 ++- > tests/ofp-actions.at | 11 +++ > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index > 28382e012..02a9235d5 100644 > --- a/lib/ofp-ed-props.c > +++ b/lib/ofp-ed-props.c > @@ -49,7 +49,7 @@ decode_ed_prop(const struct ofp_ed_prop_header > **ofp_prop, > return OFPERR_NXBAC_BAD_ED_PROP; > } > struct ofpact_ed_prop_nsh_md_type *pnmt = > -ofpbuf_put_uninit(out, sizeof(*pnmt)); > +ofpbuf_put_zeros(out, sizeof *pnmt); > pnmt->header.prop_class = prop_class; > pnmt->header.type = prop_type; > pnmt->header.len = len; > @@ -108,6 +108,7 @@ encode_ed_prop(const struct ofpact_ed_prop > **prop, > opnmt->header.len = > offsetof(struct ofp_ed_prop_nsh_md_type, pad); > opnmt->md_type = pnmt->md_type; > +memset(opnmt->pad, 0, sizeof opnmt->pad); > prop_len = sizeof(*pnmt); > break; >
Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.
> >> Fix that by clearing padding bytes while encoding, and checking that
> >> these bytes are all zeros on decoding.
> >
> > Is the latter strictly necessary? It may break existing controllers that
> > do not initialize the padding bytes to zero.
> > Wouldn't it be sufficient to just zero the padding bytes at reception?
>
> I do not have a strong opinion. I guess we could avoid failing the OF
> request if padding is not all zeroes, for backward compatibility.
> Anyway, it seems like I missed one part of this change (see inline).
>
> On the other hand, AFAIU, NXOXM_NSH_ is not standardized, so, technically,
> we could change the rules here. As an option, we could apply the patch
> without checking for all-zeroes padding and backport it this way to stable
> branches. Afterwards, we could introduce the 'is_all_zeros' check and
> mention this change in the release notes for the new version. Anyway,
> OpenFlow usually requires paddings to be all-zeroes for most matches and
> actions, so this should be a sane requirement for controllers.
> What do you think?

I think there is little to gain by enforcing strict rules on zeroed padding bytes in a future release. It just creates grief for users of OVS by unnecessarily breaking backward compatibility without any benefit for OVS, no matter whether OVS has the right to do so or not.
> >> diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index > >> 28382e012..5a4b12d9f 100644 > >> --- a/lib/ofp-ed-props.c > >> +++ b/lib/ofp-ed-props.c > >> @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header > >> **ofp_prop, > >> if (len > sizeof(*opnmt) || len > *remaining) { > >> return OFPERR_NXBAC_BAD_ED_PROP; > >> } > >> +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) { > >> +return OFPERR_NXBRC_MUST_BE_ZERO; > >> +} > >> struct ofpact_ed_prop_nsh_md_type *pnmt = > >> ofpbuf_put_uninit(out, sizeof(*pnmt)); > > This should be 'ofpbuf_put_zeroes' because 'struct > ofpact_ed_prop_nsh_md_type' > contains padding too that must be cleared while constructing ofpacts. > Since OVS compares decoded ofpacts' and not the original OF messages, this > should do the trick. Agree. > > I'll send v2 with this change and will remove 'is_all_zeros' check for this > fix. Thanks, Jan ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap actions.
Hi Ilya, Good catch. One comment below. /Jan > -Original Message- > From: Ilya Maximets > Sent: Tuesday, 13 October, 2020 21:02 > To: ovs-dev@openvswitch.org; Jan Scheurich > Cc: Ben Pfaff ; Yi Yang ; Ilya Maximets > > Subject: [PATCH] ofp-ed-props: Fix using uninitialized padding for NSH encap > actions. > > OVS uses memcmp to compare actions of existing and new flows, but 'struct > ofp_ed_prop_nsh_md_type' has 3 bytes of padding that never initialized and > passed around within OF data structures and messages. > > Uninitialized bytes in MemcmpInterceptorCommon > at offset 21 inside [0x709003f8, 136) > WARNING: MemorySanitizer: use-of-uninitialized-value > #0 0x4a184e in bcmp (vswitchd/ovs-vswitchd+0x4a184e) > #1 0x896c8a in ofpacts_equal lib/ofp-actions.c:9121:31 > #2 0x564403 in replace_rule_finish ofproto/ofproto.c:5650:37 > #3 0x563462 in add_flow_finish ofproto/ofproto.c:5218:13 > #4 0x54a1ff in ofproto_flow_mod_finish ofproto/ofproto.c:8091:17 > #5 0x5433b2 in handle_flow_mod__ ofproto/ofproto.c:6216:17 > #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17 > #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16 > #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21 > #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13 > #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9 > #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5 > #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9 > #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5 > #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9 > #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6) > #16 0x46b92d in _start (vswitchd/ovs-vswitchd+0x46b92d) > > Uninitialized value was stored to memory at > #0 0x4745aa in __msan_memcpy.part.0 (vswitchd/ovs-vswitchd) > #1 0x54529f in rule_actions_create ofproto/ofproto.c:3134:5 > #2 0x54915e in ofproto_rule_create ofproto/ofproto.c:5284:11 > #3 0x55d419 in add_flow_init ofproto/ofproto.c:5123:17 > #4 0x54841f in ofproto_flow_mod_init 
ofproto/ofproto.c:7987:17 > #5 0x543250 in handle_flow_mod__ ofproto/ofproto.c:6206:13 > #6 0x56a2fc in handle_flow_mod ofproto/ofproto.c:6190:17 > #7 0x565bda in handle_single_part_openflow ofproto/ofproto.c:8504:16 > #8 0x540b25 in handle_openflow ofproto/ofproto.c:8685:21 > #9 0x6697fd in ofconn_run ofproto/connmgr.c:1329:13 > #10 0x668e6e in connmgr_run ofproto/connmgr.c:356:9 > #11 0x53f1bc in ofproto_run ofproto/ofproto.c:1890:5 > #12 0x4ead0c in bridge_run__ vswitchd/bridge.c:3250:9 > #13 0x4e9bc8 in bridge_run vswitchd/bridge.c:3309:5 > #14 0x51c072 in main vswitchd/ovs-vswitchd.c:127:9 > #15 0x7f23a99011a2 in __libc_start_main (/lib64/libc.so.6) > > Uninitialized value was created by an allocation of 'ofpacts_stub' > in the stack frame of function 'handle_flow_mod' > #0 0x569e80 in handle_flow_mod ofproto/ofproto.c:6170 > > This could cause issues with flow modifications or other operations. > > To reproduce, some NSH tests could be run under valgrind or clang > MemorySantizer. Ex. "nsh - md1 encap over a veth link" test. > > Fix that by clearing padding bytes while encoding, and checking that these > bytes are all zeros on decoding. Is the latter strictly necessary? It may break existing controllers that do not initialize the padding bytes to zero. Wouldn't it be sufficient to just zero the padding bytes at reception? > > New tests added to tests/ofp-actions.at. 
> > Fixes: 1fc11c5948cf ("Generic encap and decap support for NSH") > Signed-off-by: Ilya Maximets > --- > lib/ofp-ed-props.c | 4 > tests/ofp-actions.at | 11 +++ > 2 files changed, 15 insertions(+) > > diff --git a/lib/ofp-ed-props.c b/lib/ofp-ed-props.c index > 28382e012..5a4b12d9f 100644 > --- a/lib/ofp-ed-props.c > +++ b/lib/ofp-ed-props.c > @@ -48,6 +48,9 @@ decode_ed_prop(const struct ofp_ed_prop_header > **ofp_prop, > if (len > sizeof(*opnmt) || len > *remaining) { > return OFPERR_NXBAC_BAD_ED_PROP; > } > +if (!is_all_zeros(opnmt->pad, sizeof opnmt->pad)) { > +return OFPERR_NXBRC_MUST_BE_ZERO; > +} > struct ofpact_ed_prop_nsh_md_type *pnmt = > ofpbuf_put_uninit(out, sizeof(*pnmt)); > pnmt->header.prop_class = prop_class; @@ -108,6 +111,7 @@ > encode_ed_prop(const struct ofpact_ed_prop **prop, > opnmt->header.len = > offsetof(struct ofp_ed_prop_nsh_md_type, pad); > opnmt->md_type = pnmt->md_type
Re: [ovs-dev] [PATCH] userspace: Switch default cache from EMC to SMC.
> -----Original Message-----
> From: Flavio Leitner
>
> On Tue, Sep 22, 2020 at 01:22:58PM +0200, Ilya Maximets wrote:
> > On 9/19/20 3:07 PM, Flavio Leitner wrote:
> > > The EMC is not large enough for current production cases and they
> > > are scaling up, so this change switches over from EMC to SMC by
> > > default, which provides better results.
> > >
> > > The EMC is still available and could be used when only a small number
> > > of flows is used.
>
> I am curious to find out what others think about this change, so I am going
> to wait a bit before following up with the next version, if that sounds OK.

For production deployments of OVS-DPDK in NFVI we also recommend switching SMC on and EMC off. No problem with making that configuration the default in the future. As is to be expected, SMC only provides acceleration over DPCLS if the avg. number of sub-table lookups is > 1.

BR, Jan
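For deployments that want to follow this recommendation today, the switch-over is two configuration knobs. This is a sketch assuming a recent OVS with the userspace datapath; see the dpif-netdev documentation for the exact option semantics:

```shell
# Enable the signature match cache (SMC).
ovs-vsctl set Open_vSwitch . other_config:smc-enable=true

# Effectively disable the EMC: an inverse insertion probability of 0
# means no new flows are ever inserted into the EMC.
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0
```

Both settings take effect without restarting ovs-vswitchd.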
Re: [ovs-dev] [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash itself.
Even simpler solution to the problem. Acked-by: Jan Scheurich BR, Jan > -Original Message- > From: Ilya Maximets > Sent: Thursday, 24 October, 2019 14:32 > To: ovs-dev@openvswitch.org > Cc: Ian Stokes ; Kevin Traynor ; > Jan Scheurich ; ychen103...@163.com; Ilya > Maximets > Subject: [PATCH] dpif-netdev: Do not mix recirculation depth into RSS hash > itself. > > Mixing of RSS hash with recirculation depth is useful for flow lookup because > same packet after recirculation should match with different datapath rule. > Setting of the mixed value back to the packet is completely unnecessary > because recirculation depth is different on each recirculation, i.e. we will > have > different packet hash for flow lookup anyway. > > This should fix the issue that packets from the same flow could be directed to > different buckets based on a dp_hash or different ports of a balanced bonding > in case they were recirculated different number of times (e.g. due to > conntrack > rules). > With this change, the original RSS hash will remain the same making it > possible > to calculate equal dp_hash values for such packets. 
>
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-September/363127.html
> Fixes: 048963aa8507 ("dpif-netdev: Reset RSS hash when recirculating.")
> Signed-off-by: Ilya Maximets
> ---
>  lib/dpif-netdev.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 4546b55e8..c09b8fd95 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -6288,7 +6288,6 @@ dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>          recirc_depth = *recirc_depth_get_unsafe();
>          if (OVS_UNLIKELY(recirc_depth)) {
>              hash = hash_finish(hash, recirc_depth);
> -            dp_packet_set_rss_hash(packet, hash);
>          }
>      return hash;
>  }
> --
> 2.17.1
Re: [ovs-dev] group dp_hash method works incorrectly when using snat
Hi,

You have pointed out an interesting issue in the netdev datapath implementation (not sure to what extent the same applies to the kernel datapath).

Conceptually, the dp_hash of a packet should be based on the current packet's flow. It should not change if the headers remain unchanged. For performance reasons, the actual implementation of the dp_hash action in odp_execute.c bases the dp_hash value on the current RSS hash of the packet, if it has one, and only computes it from the actual packet content if not.

However, the RSS hash of the packet is updated with every recirculation in order to improve the EMC lookup success rate. So even if initially the RSS hash was a suitable base for dp_hash (that itself is uncertain, as the implementation of the RSS hash depends on the NIC HW and might not satisfy the algorithm specified as part of the dp_hash action), its volatility at recirculation destroys the required property of the dp_hash.

What we could do is something like the following (not even compiler-tested):

diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 563ad1da8..1937bb1e6 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -820,16 +820,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
             uint32_t hash;

             DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-                /* RSS hash can be used here instead of 5tuple for
-                 * performance reasons. */
-                if (dp_packet_rss_valid(packet)) {
-                    hash = dp_packet_get_rss_hash(packet);
-                    hash = hash_int(hash, hash_act->hash_basis);
-                } else {
-                    flow_extract(packet, &flow);
-                    hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                if (packet->md.dp_hash == 0) {
+                    if (packet->md.recirc_id == 0 &&
+                        dp_packet_rss_valid(packet)) {
+                        /* RSS hash is used here instead of 5tuple for
+                         * performance reasons. */
+                        hash = dp_packet_get_rss_hash(packet);
+                        hash = hash_int(hash, hash_act->hash_basis);
+                    } else {
+                        flow_extract(packet, &flow);
+                        hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                    }
+                    if (unlikely(hash == 0)) {
+                        hash = 1;
+                    }
+                    packet->md.dp_hash = hash;
                 }
-                packet->md.dp_hash = hash;
             }
             break;
         }
@@ -842,6 +848,9 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
                 hash = flow_hash_symmetric_l3l4(&flow, hash_act->hash_basis,
                                                 false);
+                if (unlikely(hash == 0)) {
+                    hash = 1;
+                }
                 packet->md.dp_hash = hash;
             }
             break;
diff --git a/lib/packets.c b/lib/packets.c
index ab0b1a36d..a03a3ab61 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -391,6 +391,8 @@ push_mpls(struct dp_packet *packet, ovs_be16 ethtype, ovs_be32 lse)
     header = dp_packet_resize_l2_5(packet, MPLS_HLEN);
     memmove(header, header + MPLS_HLEN, len);
     memcpy(header + len, &lse, sizeof lse);
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 /* If 'packet' is an MPLS packet, removes its outermost MPLS label stack entry.
@@ -411,6 +413,8 @@ pop_mpls(struct dp_packet *packet, ovs_be16 ethtype)
         /* Shift the l2 header forward. */
         memmove((char*)dp_packet_data(packet) + MPLS_HLEN,
                 dp_packet_data(packet), len);
         dp_packet_resize_l2_5(packet, -MPLS_HLEN);
+        /* Invalidate dp_hash */
+        packet->md.dp_hash = 0;
     }
 }

@@ -444,6 +448,8 @@ push_nsh(struct dp_packet *packet, const struct nsh_hdr *nsh_hdr_src)
     packet->packet_type = htonl(PT_NSH);
     dp_packet_reset_offsets(packet);
     packet->l3_ofs = 0;
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 bool
@@ -474,6 +480,8 @@ pop_nsh(struct dp_packet *packet)
         length = nsh_hdr_len(nsh);
         dp_packet_reset_packet(packet, length);
+        /* Invalidate dp_hash */
+        packet->md.dp_hash = 0;
         packet->packet_type = htonl(next_pt);
         /* Packet must be recirculated for further processing. */
     }
 }
diff --git a/lib/packets.h b/lib/packets.h
index a4bee3819..8691fa0c2 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -98,8 +98,7 @@ PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
     uint32_t recirc_id;         /* Recirculation id carried with the recirculating
Re: [ovs-dev] [RFC] dpif-netdev: only poll enabled vhost queues
> >
> > I am afraid it is not a valid assumption that there will be a similarly
> > large number of OVS PMD threads as there are queues.
> >
> > In OpenStack deployments the OVS is typically statically configured to
> > use a few dedicated host CPUs for PMDs (perhaps 2-8).
> >
> > Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or
> > even more). If they enable an instance for multi-queue in Nova, Nova (in
> > its eternal wisdom) will set up every vhostuser port with #vCPU queue
> > pairs.
>
> For me, it's an issue of Nova. It's pretty easy to limit the maximum number
> of queue pairs to some sane value (the value that could be handled by your
> number of available PMD threads).
> It'll be one config option and a small patch to nova-compute. With a bit
> more work you could make this per-port configurable and finally stop
> wasting HW resources.

OK, I fully agree. The OpenStack community is slow, though, when it comes to these kinds of changes. Do we have contacts we could push?

> > A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs,
> > even if only one or two high-traffic ports can actually profit from
> > multi-queue. Even on those ports it is unlikely that the application will
> > use all 16 queues. And often there would be another such VM on the second
> > NUMA node.
>
> With limiting the number of queues in Nova (like I described above) to 4
> you'll have just 24 queues for 6 ports. If you make it per-port, you'll be
> able to limit this to even more sane values.

Yes, per-port configuration in Neutron seems the logical thing for me to do, rather than a global per-instance parameter in the Nova flavor. A per-server setting in Nova compute to limit the number of acceptable queue pairs to match the OVS configuration might still be useful on top.
> >
> > So, as soon as a VNF enables MQ in OpenStack, there will typically be a
> > vast number of un-used queue pairs in OVS, and it makes a lot of sense to
> > minimize the run-time impact of having these around.
>
> For me it seems like not an OVS, DPDK or QEMU issue. The orchestrator
> should configure sane values first of all. It's totally unclear why we're
> changing OVS instead of changing Nova.

The VNF orchestrator would request queues based on the application's needs. It should not need to be aware of the configuration of the infrastructure (such as the number of PMD threads in OVS). The OpenStack operator would have to make sure that the instantiated queues are a good compromise between application needs and infra capabilities.

> > We have had discussion earlier with RedHat as to how a vhostuser backend
> > like OVS could negotiate the number of queue pairs with Qemu down to a
> > reasonable value (e.g. the number of PMDs available for polling) *before*
> > Qemu would actually start the guest. The guest would then not have to
> > guess at the optimal number of queue pairs to actually activate.
> >
> > BR, Jan
Re: [ovs-dev] [RFC] dpif-netdev: only poll enabled vhost queues
Hi Ilya,

> > With a simple pvp setup of mine:
> > 1c/2t poll two physical ports.
> > 1c/2t poll four vhost ports with 16 queues each.
> > Only one queue is enabled on each virtio device attached by the guest.
> > The first two virtio devices are bound to the virtio kmod.
> > The last two virtio devices are bound to vfio-pci and used to forward
> > incoming traffic with testpmd.
> >
> > The forwarding zeroloss rate goes from 5.2 Mpps (polling all 64 vhost
> > queues) to 6.2 Mpps (polling only the 4 enabled vhost queues).
>
> That's interesting. However, this doesn't look like a realistic scenario.
> In practice you'll need many more PMD threads to handle so many queues.
> If you add more threads, a zeroloss test could show even worse results if
> one of the idle VMs periodically changes its number of queues. Periodic
> latency spikes will cause queue overruns and subsequent packet drops on hot
> Rx queues. This could be partially solved by allowing n_rxq to grow only.
> However, I'd be happy to have a different solution that does not hide the
> number of queues from the datapath.

I am afraid it is not a valid assumption that there will be a similarly large number of OVS PMD threads as there are queues.

In OpenStack deployments the OVS is typically statically configured to use a few dedicated host CPUs for PMDs (perhaps 2-8). Typical Telco VNF VMs, on the other hand, are very large (12-20 vCPUs or even more). If they enable an instance for multi-queue in Nova, Nova (in its eternal wisdom) will set up every vhostuser port with #vCPU queue pairs.

A (real world) VM with 20 vCPUs and 6 ports would have 120 queue pairs, even if only one or two high-traffic ports can actually profit from multi-queue. Even on those ports it is unlikely that the application will use all 16 queues. And often there would be another such VM on the second NUMA node.
So, as soon as a VNF enables MQ in OpenStack, there will typically be a vast number of un-used queue pairs in OVS, and it makes a lot of sense to minimize the run-time impact of having these around.

We have had discussion earlier with RedHat as to how a vhostuser backend like OVS could negotiate the number of queue pairs with Qemu down to a reasonable value (e.g. the number of PMDs available for polling) *before* Qemu would actually start the guest. The guest would then not have to guess at the optimal number of queue pairs to actually activate.

BR, Jan
Re: [ovs-dev] [PATCH] dpif-netdev-perf: Fix millisecond stats precision with slower TSC.
Hi Ilya,

OK with me!

BR, Jan

> -----Original Message-----
> From: Ilya Maximets
> Sent: Tuesday, 19 March, 2019 12:08
> To: ovs-dev@openvswitch.org; Ian Stokes
> Cc: Kevin Traynor ; Ilya Maximets ; Jan Scheurich
> Subject: [PATCH] dpif-netdev-perf: Fix millisecond stats precision with
> slower TSC.
>
> Unlike x86, where the TSC frequency usually matches the CPU frequency,
> other architectures can have much slower TSCs. For example, it's common
> for Arm SoCs to have a 100 MHz TSC by default. In this case the perf
> module will check for the end of the current millisecond every 10K cycles,
> i.e. only 10 times per millisecond. This may not be enough to collect
> precise statistics. Fix that by taking the current TSC frequency into
> account instead of hardcoding the number of cycles.
>
> CC: Jan Scheurich
> Fixes: 79f368756ce8 ("dpif-netdev: Detailed performance stats for PMDs")
> Signed-off-by: Ilya Maximets
> ---
>  lib/dpif-netdev-perf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
> index 52324858d..e7ed49e7e 100644
> --- a/lib/dpif-netdev-perf.c
> +++ b/lib/dpif-netdev-perf.c
> @@ -554,8 +554,8 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>              cum_ms = history_next(&s->milliseconds);
>              cum_ms->timestamp = now;
>          }
> -        /* Do the next check after 10K cycles (4 us at 2.5 GHz TSC clock). */
> -        s->next_check_tsc = cycles_counter_update(s) + 10000;
> +        /* Do the next check after 4 us (10K cycles at 2.5 GHz TSC clock). */
> +        s->next_check_tsc = cycles_counter_update(s) + get_tsc_hz() / 250000;
>      }
> }
> --
> 2.17.1
Re: [ovs-dev] [PATCH] dpif-netdev-perf: Fix double update of perf histograms.
Hi Ilya, Thanks for spotting this. I believe your fix is correct. BR, Jan > -Original Message- > From: Ilya Maximets > Sent: Monday, 18 March, 2019 14:01 > To: ovs-dev@openvswitch.org; Ian Stokes > Cc: Kevin Traynor ; Ilya Maximets > ; Jan Scheurich > Subject: [PATCH] dpif-netdev-perf: Fix double update of perf histograms. > > Real values of 'packets per batch' and 'cycles per upcall' already added to > histograms in 'dpif-netdev' on receive. Adding the averages makes statistics > wrong. We should not add to histograms values that never really appeared. > > For exmaple, in current code following situation is possible: > > pmd thread numa_id 0 core_id 5: > ... > Rx packets: 83 (0 Kpps, 13873 cycles/pkt) > ... > - Upcalls:3 ( 3.6 %, 248.6 us/upcall) > > Histograms > packets/it pkts/batch upcalls/it cycles/upcall > 1 831 1661 3... > 15848 2 > 19952 2 > ... > 50118 2 > > i.e. all the packets counted twice in 'pkts/batch' column and all the upcalls > counted twice in 'cycles/upcall' column. > > CC: Jan Scheurich > Fixes: 79f368756ce8 ("dpif-netdev: Detailed performance stats for PMDs") > Signed-off-by: Ilya Maximets > --- > lib/dpif-netdev-perf.c | 8 > 1 file changed, 8 deletions(-) > > diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index > 8f0c9bc4f..52324858d 100644 > --- a/lib/dpif-netdev-perf.c > +++ b/lib/dpif-netdev-perf.c > @@ -498,15 +498,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int > rx_packets, > cycles_per_pkt = cycles / rx_packets; > histogram_add_sample(>cycles_per_pkt, cycles_per_pkt); > } > -if (s->current.batches > 0) { > -histogram_add_sample(>pkts_per_batch, > - rx_packets / s->current.batches); > -} > histogram_add_sample(>upcalls, s->current.upcalls); > -if (s->current.upcalls > 0) { > -histogram_add_sample(>cycles_per_upcall, > - s->current.upcall_cycles / s->current.upcalls); > -} > histogram_add_sample(>max_vhost_qfill, s->current.max_vhost_qfill); > > /* Add iteration samples to millisecond stats. 
*/ > -- > 2.17.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
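The double counting fixed above can be illustrated with a minimal, self-contained sketch (the histogram struct and helper are simplified stand-ins, not the actual OVS structures): once the real per-batch values have been added as samples on receive, adding the iteration average again inflates the sample count beyond the number of real batches.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for OVS's perf histograms: each bin counts how
 * often a value was observed, n_samples counts all observations.  The
 * bug fixed above: real per-batch packet counts were already added as
 * samples on receive, and pmd_perf_end_iteration() then added the
 * iteration *average* again, so batches were effectively counted
 * twice. */
#define N_BINS 64

struct histogram {
    unsigned int bins[N_BINS];
    unsigned int n_samples;
};

static void
histogram_add_sample(struct histogram *h, unsigned int val)
{
    h->bins[val < N_BINS ? val : N_BINS - 1]++;
    h->n_samples++;
}

/* Correct accounting: exactly one sample per real batch. */
static unsigned int
count_batches(struct histogram *h, const unsigned int *batch_sizes,
              size_t n_batches)
{
    for (size_t i = 0; i < n_batches; i++) {
        histogram_add_sample(h, batch_sizes[i]);
    }
    return h->n_samples;
}
```

With three real batches, the histogram must hold exactly three samples; adding the removed average sample on top is what skewed the statistics.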
Re: [ovs-dev] [PATCH 1/2] odp-util: Fix a bug in parse_odp_push_nsh_action
Hi Ben and Yifeng, Looking at the current code on master I believe it is correct except that I wrongly used ofpbuf_push_zeros() where it should have been ofpbuf_put_zeros() to append the padding zeroes to the end. The ofpbuf should never be relocated through padding as it is allocated with the allowed maximum size from the start. The entire MD2 metadata (1 or more TLVs) is treated as a single NLATTR. The reason is that for the purpose of push_nsh the datapath does not care about the internal structure. The whole MD2 complex is added as one binary blob of data. Remember we did not implement the possibility to match on specific MD2 TLV fields in OVS 2.8 as that would have required a generalization of the generic metadata TLV field infrastructure to match fields. Admittedly, the possibility to specify arbitrary hex-encoded MD2 TLVs in a push_nsh action in ovs-dpctl is a bit raw as it trusts the hex data to be well-formed. As discussed, the datapath code doesn't really care, but the receiver of an NSH packet with malformed MD2 headers might choke. The alternatives would have been to not accept MD2 metadata in dpctl commands or to validate the hex data (at least from a TLV-structural perspective). I believe for MD2 metadata specified in an OF encap_nsh action we do ensure the generated TLV structure is correct. I hope this helps. BR, Jan > -Original Message- > From: Ben Pfaff > Sent: Thursday, 27 December, 2018 19:55 > To: Yifeng Sun ; Jan Scheurich > > Cc: d...@openvswitch.org > Subject: Re: [ovs-dev] [PATCH 1/2] odp-util: Fix a bug in > parse_odp_push_nsh_action > > On Wed, Dec 26, 2018 at 04:52:22PM -0800, Yifeng Sun wrote: > > In this piece of code, 'struct ofpbuf b' should always point to > > metadata so that metadata can be filled with values through ofpbuf > > operations, like ofpbuf_put_hex and ofpbuf_push_zeros. However, > > ofpbuf_push_zeros may change the data pointer of 'struct ofpbuf b' > > and therefore, metadata will not contain the expected values. This > > patch fixes it. > > > > Reported-at: > > https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10863 > > Signed-off-by: Yifeng Sun > > --- > > lib/odp-util.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/lib/odp-util.c b/lib/odp-util.c index > > cb6a5f2047fd..af855873690c 100644 > > --- a/lib/odp-util.c > > +++ b/lib/odp-util.c > > @@ -2114,12 +2114,12 @@ parse_odp_push_nsh_action(const char *s, > struct ofpbuf *actions) > > if (ovs_scan_len(s, &n, "md2=0x%511[0-9a-fA-F]", buf) > > && n/2 <= sizeof metadata) { > > ofpbuf_use_stub(&b, metadata, sizeof metadata); > > -ofpbuf_put_hex(&b, buf, &mdlen); > > /* Pad metadata to 4 bytes. */ > > padding = PAD_SIZE(mdlen, 4); > > if (padding > 0) { > > -ofpbuf_push_zeros(&b, padding); > > +ofpbuf_put_zeros(&b, padding); > > } > > +ofpbuf_put_hex(&b, buf, &mdlen); > > md_size = mdlen + padding; > > ofpbuf_uninit(&b); > > continue; > > Yifeng, this fix looks wrong because it uses 'mdlen' in PAD_SIZE before > initializing it. > > This code is weird. It adds padding to a 4-byte boundary even though I can't > find any other code that checks for that or relies on it. > Furthermore, it puts the padding **BEFORE** the metadata, which just seems > super wrong. > > When I look at datapath code for md2 I get even more confused. > nsh_key_put_from_nlattr() seems to assume that an md2 attribute has well- > formatted data in it, then nsh_hdr_from_nlattr() copies it without checking into > nh->md2, and then if it's not perfectly formatted then > nsh->md2.length is going to be invalid. If I'm reading it right, it > also assumes there's exactly one TLV. > > Jan, I think this is your code, can you help me understand this code? > > Thanks, > > Ben.
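The put/push confusion at the heart of this fix is easy to reproduce with a toy buffer (names loosely borrowed from OVS's ofpbuf, but this is a simplified sketch, not the real API): "put" appends at the tail while "push" prepends at the head, so using push_zeros for trailing padding lands the zeroes *before* the metadata.

```c
#include <assert.h>
#include <string.h>

/* Toy buffer sketch: data lives in the middle of a fixed array so that
 * both ends can grow.  buf_put() appends at the tail (like
 * ofpbuf_put*); buf_push() prepends at the head (like ofpbuf_push*). */
struct buf {
    char data[64];
    size_t head;   /* offset of first used byte */
    size_t size;   /* number of bytes in use */
};

static void
buf_init(struct buf *b)
{
    b->head = 32;
    b->size = 0;
}

static void
buf_put(struct buf *b, const void *p, size_t n)
{
    memcpy(b->data + b->head + b->size, p, n);   /* append at tail */
    b->size += n;
}

static void
buf_push(struct buf *b, const void *p, size_t n)
{
    b->head -= n;                                /* grow toward head */
    memcpy(b->data + b->head, p, n);
    b->size += n;
}
```

With this model, "push_zeros then data" produces padding in front of the data, which is exactly the misplaced-padding behavior Ben describes; appending padding with "put" after the data is the intended layout.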
Re: [ovs-dev] Bug: select group with dp_hash causing recursive recirculation
Hi Zang, Thanks for reporting this bug. As I see it, the check on dp_hash != 0 in ofproto-dpif-xlate.c is there to guarantee that a dp_hash value has been computed for the packet once before, not necessarily that a new one is computed for each translated select group. That's why a check for a valid dp_hash is OK. But all datapaths must adhere to the invariant that a valid dp_hash is != 0. Indeed the kernel datapath implements this in datapath/linux/actions.c: static void execute_hash(struct sk_buff *skb, struct sw_flow_key *key, const struct nlattr *attr) { struct ovs_action_hash *hash_act = nla_data(attr); u32 hash = 0; /* OVS_HASH_ALG_L4 is the only possible hash algorithm. */ hash = skb_get_hash(skb); hash = jhash_1word(hash, hash_act->hash_basis); if (!hash) hash = 0x1; key->ovs_flow_hash = hash; } The correct fix in my view would be to implement the same for the netdev datapath in lib/odp-execute.c. This requirement on the dp_hash action implementations should better be documented properly. BR, Jan > -Original Message- > From: ovs-dev-boun...@openvswitch.org On > Behalf Of Zang MingJie > Sent: Tuesday, 25 September, 2018 10:45 > To: ovs dev > Subject: [ovs-dev] Bug: select group with dp_hash causing recursive > recirculation > > Hi, we found a serious problem where one pmd stops working. I want to > share the problem and find a solution here. > > vswitchd log: > > 2018-09-13T23:36:44.377Z|40269235|dpif_netdev(pmd45)|WARN|Packet dropped. > Max recirculation depth exceeded. > 2018-09-13T23:36:44.387Z|40269236|dpif_netdev(pmd45)|WARN|Packet dropped. > Max recirculation depth exceeded. > 2018-09-13T23:36:44.391Z|40269237|dpif_netdev(pmd45)|WARN|Packet dropped. > Max recirculation depth exceeded. 
> > problematic datapath flows: > > > ct_state(+new-est),recirc_id(0x143c893),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=443), > packets:84573093, bytes:6308807903, used:0.009s, > flags:SFPRU.ECN[200][400][800], > actions:meter(306),hash(hash_l4(0)),recirc(0x237b09d) > > > recirc_id(0x237b09d),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > packets:279713339, bytes:20890205186, used:0.007s, > flags:SFPRU.ECN[200][400][800], actions:hash(hash_l4(0)),recirc(0x237b09d) > > corresponding openflow: > > cookie=0x5b5ab65e000f0101, duration=4848269.642s, table=40, > n_packets=974343805, n_bytes=72484367083, > priority=10,tcp,metadata=0xf0100/0xff00,tp_dst=443 > actions=group:983297 > > > group_id=983297,type=select,selection_method=dp_hash,bucket=bucket_id:3057033848,weight:100,actions=ct(commit,table=70,zo > ne=15,exec(nat(dst=10.177.251.203:443))),...``lots > of buckets``... > > > > Following explains how select group with dp_hash works. > > To implement select group with dp_hash, two datapath flows are needed: > > 1. calculate dp_hash, recirculate to second one > 2. select group bucket by dp_hash > > When encounter a datapath miss, openflow doesn't know which one is missing, > so it depends on dp_hash value of the packet: > > if dp_hash == 0 generate first dp flow. > if dp_hash != 0 generate second dp flow. > > > Back to the problem. > > Notice that second datapath flow is a dead loop, it recirculate to itself. > The cause of the problem is here ofproto/ofproto-dpif-xlate.c#L4429[1]: > > /* dp_hash value 0 is special since it means that the dp_hash has not > been > * computed, as all computed dp_hash values are non-zero. Therefore > * compare to zero can be used to decide if the dp_hash value is valid > * without masking the dp_hash field. 
*/ > if (!dp_hash) { > > The comment says that `dp_hash` shouldn't be zero, but under DPDK, it can > be zero, at lib/odp-execute.c#L747[2] > >/* RSS hash can be used here instead of 5tuple for > * performance reasons. */ >if (dp_packet_rss_valid(packet)) { >hash = dp_packet_get_rss_hash(packet); >hash = hash_int(hash, hash_act->hash_basis); >} else { >flow_extract(packet, &flow); >hash = flow_hash_5tuple(&flow, hash_act->hash_basis); >} >packet->md.dp_hash = hash; > > I don't know how small the chance is that `hash_int` returns 0, but we have tested > that if the final hash is 0, it will definitely trigger the same bug. And because > the chance is extremely low, I'm also investigating whether there are > other situations that will pass a 0 hash to ofp. > > > > IMO, it is silly to depend on the dp_hash value; maybe we need a new mechanism > which can pass data between ofp and odp freely. And a quick solution could > be to just change a 0 hash to 1. > > > [1] > https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4429 > [2]
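A sketch of the fix proposed above for lib/odp-execute.c (the helper name is hypothetical, not actual OVS code): remap a computed hash of 0 to 0x1 before storing it in the packet metadata, mirroring the kernel's execute_hash(), so that dp_hash == 0 keeps meaning "not yet computed" and the recirculation loop cannot occur.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper enforcing the dp_hash invariant discussed above:
 * all *computed* dp_hash values must be non-zero, because
 * ofproto-dpif-xlate uses dp_hash == 0 to mean "hash not yet computed"
 * and would otherwise keep emitting hash+recirc actions forever. */
static uint32_t
dp_hash_finish(uint32_t hash)
{
    return hash ? hash : 0x1;
}
```

Any hash computation in the netdev datapath (RSS-based or 5-tuple-based) would pass its result through such a helper before assigning packet->md.dp_hash.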
Re: [ovs-dev] [PATCH v4] Upcall/Slowpath rate-limiter for OVS
> Have you considered making this token bucket per-port instead of > per-pmd? As I read it, a greedy port can exhaust all the tokens from a > particular PMD, possibly leading to an unfair performance for that PMD > thread. Am I just being overly paranoid? > [manu] Yes, this is possible. But it can happen for both fast and slowpath > today, as PMDs sequentially iterate through ports. In order > to keep it simple, it's done per-PMD. It can be extended to per-port if needed. The purpose of the upcall rate limiter for the netdev datapath is to protect a PMD from becoming clogged by having to process an excessive number of upcalls. It is not to police the number of upcalls per port to some rate, especially not across multiple PMDs (in the case of RSS). I think what you are after, Aaron, is some kind of fairness scheme that provides each rx queue with a minimum rate of upcalls even if the global PMD rate limit is reached? I don't believe simply partitioning the global PMD rate limit into a number of smaller rx queue buckets would be a good solution. But I don't have a better alternative either. I agree with Manu that it should not stop us from implementing the PMD-level protection. We can add a fairness scheme later, if needed. BR, Jan
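For illustration, a per-PMD upcall token bucket along the lines discussed above could look like this (a sketch with hypothetical names, not the actual patch): tokens refill with elapsed time up to a burst size, each admitted upcall consumes one token, and upcalls are dropped once the bucket is empty, protecting the PMD from being clogged by upcall processing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-PMD token bucket (hypothetical, simplified). */
struct upcall_tb {
    uint64_t rate;          /* Tokens added per second. */
    uint64_t burst;         /* Bucket capacity. */
    uint64_t tokens;        /* Tokens currently available. */
    uint64_t last_fill_us;  /* Time of the last refill. */
};

static void
tb_refill(struct upcall_tb *tb, uint64_t now_us)
{
    uint64_t added = (now_us - tb->last_fill_us) * tb->rate / 1000000;
    if (added) {
        tb->tokens = tb->tokens + added > tb->burst
                     ? tb->burst : tb->tokens + added;
        tb->last_fill_us = now_us;
    }
}

/* Returns true if the PMD may execute this upcall, false if it should
 * be dropped because the PMD's upcall budget is exhausted. */
static bool
tb_admit_upcall(struct upcall_tb *tb, uint64_t now_us)
{
    tb_refill(tb, now_us);
    if (tb->tokens) {
        tb->tokens--;
        return true;
    }
    return false;
}
```

A per-port or per-rxq fairness scheme, as raised in the thread, would amount to maintaining several such buckets per PMD, at the cost of the simplicity of a single global budget.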
Re: [ovs-dev] [PATCH] Improved Packet Drop Statistics in OVS.
The user-space part for packet drop stats should be generic and work with any dpif datapath. So, if someone implemented the equivalent drop stats functionality in the kernel datapath that would be very welcome. We in Ericsson cannot do that currently due to license restrictions. Regards, Jan > -Original Message- > From: ovs-dev-boun...@openvswitch.org On > Behalf Of Rohith Basavaraja > Sent: Friday, 25 May, 2018 07:37 > To: Ben Pfaff > Cc: d...@openvswitch.org > Subject: Re: [ovs-dev] [PATCH] Improved Packet Drop Statistics in OVS. > > Thanks Ben for the clarification. Yes this new stuff is used only in the > DPDK datapath and it’s not used in the kernel datapath. > > Thanks > Rohith > > On 25/05/18, 2:52 AM, "Ben Pfaff" wrote: > > On Thu, May 24, 2018 at 02:19:06AM +, Rohith Basavaraja wrote: > > Only changes in > > datapath/linux/compat/include/linux/openvswitch.h > > are related to OvS Kernel module. > > On a second look, I see that the new stuff here is only for the DPDK > datapath. If you don't intend to add this feature to the kernel > datapath, there should be no problem. Never mind.
Re: [ovs-dev] OVS-DPDK public meeting
> > I was planning to if the community agreed it was warranted. > > > > However the general feeling expressed at the past few community calls is > > that the next move should be to DPDK 18.11 LTS and I tend to agree with > > this. > > > > The main advantage of this is the DPDK LTS lifecycle provides bug fixes > > for DPDK for 2 years from release. Moving to a non DPDK LTS becomes a pain > > as critical bug fixes will not be backported on the DPDK side so are not > > addressed in OVS with DPDK either, we've seen this with some of the CVE > > fixes for vhost quite recently. > > > > 18.05 is also the largest DPDK release to date with a lot of code being > > introduced in the later RC stages which IMO increases the risk rather than > > the gain of moving to it. > > > > However I'm open to discussing if a move to 18.05 is warranted, are there > > any critical features or usecases it enables that you had in mind? > > > > There are always the two big groups of users. > - Those that want max stability for a huge Production setup (which would > follow the pick LTS argument) > - And those that want/need the very latest HW support and features (which > would always prefer the latest version) I subscribe to that statement. > > I had no single critical feature in mind for 18.05, but especially your > argument of "the largest DPDK release to date with a lot of code being > introduced" makes it interesting for the second group. > Actually I think there are also plenty of new devices which are not > supported at all before 18.02/18.05. > > So far my DPDK upgrade policy was "the last DPDK available which has at > least one point release AND works with OVS". > If OpenVswitch really changed to only support to each DPDK LTS version, > then I might have to follow that. > I must admit I already had the same thought to only pick .11 stable > versions, so I'm not totally opposed if that is the way it is preferred for > Openvswitch. 
> > But if we can make this a toleration (saying it works with 17.11 AND newer > 18.05) then this would be a great contrib to OVS and IMHO be warranted. > If the latter would work it could be great to spot issues early on instead > of having a super-big jump from 17.11 to 18.11 in one shot. > But if you have to kill support for 17.11 to let it work with 18.05, then > better not. > > Interested what other opinions on this are. In my eyes that would be the only way to satisfy both user groups' needs: Keep default support for the associated DPDK LTS release and add optional support for bleeding edge DPDK versions. The downside of this is that it will likely clutter the OVS code with conditional compiler directives to handle the DPDK API/ABI incompatibilities. Plus, someone must also clean these up at a later stage when they are no longer needed. Today, OVS developers that really need the latest DPDK typically fork/branch OVS locally and maintain their fork until OVS master switches to the required DPDK version. That model doesn't burden the community with the problem. BR, Jan
[ovs-dev] [PATCH v4 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
The current implementation of the "dp_hash" selection method suffers from two deficiencies: 1. The hash mask and hence the number of dp_hash values is just large enough to cover the number of group buckets, but does not consider the case that buckets have different weights. 2. The xlate-time selection of the best bucket from the masked dp_hash value often results in bucket load distributions that are quite different from the bucket weights because the number of available masked dp_hash values is too small (2-6 bits compared to 32 bits of a full hash in the default hash selection method). This commit provides a more accurate implementation of the dp_hash select group by applying the well-known Webster method for distributing a small number of "seats" fairly over the weighted "parties" (see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method). The dp_hash mask is automatically chosen large enough to provide good enough accuracy even with widely differing weights. This distribution happens at group modification time and the resulting table is stored with the group-dpif struct. At xlation time, we use the masked dp_hash value as an index to look up the assigned bucket. If that bucket is not live, we do a circular search over the mapping table until we find the first live bucket. As the buckets in the table are by construction in pseudo-random order with a frequency according to their weight, this method maintains a correct distribution even if one or more buckets are non-live. Xlation is further simplified by storing some derived select group state at group construction in struct group-dpif in a form better suited for xlation purposes. Adapted the unit test case for dp_hash select group accordingly. 
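The Webster/Sainte-Laguë allocation described above can be sketched as follows (a simplified illustration, not the OVS implementation): the hash-table slots play the role of "seats" and the buckets the role of "parties"; each slot goes to the bucket with the currently highest weight/divisor quotient, and each win advances that bucket's divisor by 2 (1, 3, 5, ...).

```c
#include <assert.h>

/* Sketch of Webster/Sainte-Laguë seat allocation: distribute n_slots
 * hash values over n_buckets weighted buckets.  Assumes
 * n_buckets <= MAX_BUCKETS and small integer weights (the quotient
 * comparison uses cross-multiplication to stay in integers). */
#define MAX_BUCKETS 8

static void
webster_allocate(const int weights[], int n_buckets, int n_slots,
                 int slots_out[])
{
    int divisor[MAX_BUCKETS];

    for (int i = 0; i < n_buckets; i++) {
        divisor[i] = 1;
        slots_out[i] = 0;
    }
    for (int s = 0; s < n_slots; s++) {
        int best = 0;
        for (int i = 1; i < n_buckets; i++) {
            /* True iff weights[i]/divisor[i] > weights[best]/divisor[best]. */
            if (weights[i] * divisor[best] > weights[best] * divisor[i]) {
                best = i;
            }
        }
        slots_out[best]++;       /* Bucket 'best' wins this slot. */
        divisor[best] += 2;      /* Divisor sequence 1, 3, 5, ... */
    }
}
```

With weights 2:1:1 and 16 slots this yields 8/4/4 slots, i.e. the slot counts match the bucket weights exactly, which is the accuracy improvement the commit message claims over the old minimal-mask scheme.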
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- lib/odp-util.c | 4 +- ofproto/ofproto-dpif-xlate.c | 59 ++--- ofproto/ofproto-dpif.c | 150 +++ ofproto/ofproto-dpif.h | 13 tests/ofproto-dpif.at| 15 +++-- 5 files changed, 211 insertions(+), 30 deletions(-) diff --git a/lib/odp-util.c b/lib/odp-util.c index 105ac80..8d4afa0 100644 --- a/lib/odp-util.c +++ b/lib/odp-util.c @@ -595,7 +595,9 @@ format_odp_hash_action(struct ds *ds, const struct ovs_action_hash *hash_act) ds_put_format(ds, "hash("); if (hash_act->hash_alg == OVS_HASH_ALG_L4) { -ds_put_format(ds, "hash_l4(%"PRIu32")", hash_act->hash_basis); +ds_put_format(ds, "l4(%"PRIu32")", hash_act->hash_basis); +} else if (hash_act->hash_alg == OVS_HASH_ALG_SYM_L4) { +ds_put_format(ds, "sym_l4(%"PRIu32")", hash_act->hash_basis); } else { ds_put_format(ds, "Unknown hash algorithm(%"PRIu32")", hash_act->hash_alg); diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 9f7fca7..c990d8a 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -4392,27 +4392,37 @@ pick_hash_fields_select_group(struct xlate_ctx *ctx, struct group_dpif *group) static struct ofputil_bucket * pick_dp_hash_select_group(struct xlate_ctx *ctx, struct group_dpif *group) { +uint32_t dp_hash = ctx->xin->flow.dp_hash; + /* dp_hash value 0 is special since it means that the dp_hash has not been * computed, as all computed dp_hash values are non-zero. Therefore * compare to zero can be used to decide if the dp_hash value is valid * without masking the dp_hash field. 
*/ -if (!ctx->xin->flow.dp_hash) { -uint64_t param = group->up.props.selection_method_param; - -ctx_trigger_recirculate_with_hash(ctx, param >> 32, (uint32_t)param); +if (!dp_hash) { +enum ovs_hash_alg hash_alg = group->hash_alg; +if (hash_alg > ctx->xbridge->support.max_hash_alg) { +/* Algorithm supported by all datapaths. */ +hash_alg = OVS_HASH_ALG_L4; +} +ctx_trigger_recirculate_with_hash(ctx, hash_alg, group->hash_basis); return NULL; } else { -uint32_t n_buckets = group->up.n_buckets; -if (n_buckets) { -/* Minimal mask to cover the number of buckets. */ -uint32_t mask = (1 << log_2_ceil(n_buckets)) - 1; -/* Multiplier chosen to make the trivial 1 bit case to - * actually distribute amongst two equal weight buckets. */ -uint32_t basis = 0xc2b73583 * (ctx->xin->flow.dp_hash & mask); - -ctx->wc->masks.dp_hash |= mask; -return group_best_live_bucket(ctx, group, basis); +uint32_t hash_mask = group->hash_mask; +ctx->wc
[ovs-dev] [PATCH v4 3/3] ofproto-dpif: Use dp_hash as default selection method
The dp_hash selection method for select groups overcomes the scalability problems of the current default selection method which, due to L2-L4 hashing during xlation and un-wildcarding of the hashed fields, basically requires an upcall to the slow path to load-balance every L4 connection. The consequence are an explosion of datapath flows (megaflows degenerate to miniflows) and a limitation of connection setup rate OVS can handle. This commit changes the default selection method to dp_hash, provided the bucket configuration is such that the dp_hash method can accurately represent the bucket weights with up to 64 hash values. Otherwise we stick to original default hash method. We use the new dp_hash algorithm OVS_HASH_L4_SYMMETRIC to maintain the symmetry property of the old default hash method. A controller can explicitly request the old default hash selection method by specifying selection method "hash" with an empty list of fields in the Group properties of the OpenFlow 1.5 Group Mod message. Update the documentation about selection method in the ovs-ovctl man page. Revise and complete the ofproto-dpif unit tests cases for select groups. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- NEWS | 2 + lib/ofp-group.c| 15 ++- ofproto/ofproto-dpif.c | 30 +++-- ofproto/ofproto-dpif.h | 1 + ofproto/ofproto-provider.h | 2 +- tests/mpls-xlate.at| 26 ++-- tests/ofproto-dpif.at | 316 +++-- tests/ofproto-macros.at| 7 +- utilities/ovs-ofctl.8.in | 47 --- 9 files changed, 334 insertions(+), 112 deletions(-) diff --git a/NEWS b/NEWS index ec548b0..2b2be1e 100644 --- a/NEWS +++ b/NEWS @@ -17,6 +17,8 @@ Post-v2.9.0 * OFPT_ROLE_STATUS is now available in OpenFlow 1.3. * OpenFlow 1.5 extensible statistics (OXS) now implemented. * New OpenFlow 1.0 extensions for group support. 
+ * Default selection method for select groups is now dp_hash with improved + accuracy. - Linux kernel 4.14 * Add support for compiling OVS with the latest Linux 4.14 kernel - ovn: diff --git a/lib/ofp-group.c b/lib/ofp-group.c index f5b0af8..697208f 100644 --- a/lib/ofp-group.c +++ b/lib/ofp-group.c @@ -1600,12 +1600,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf *payload, return OFPERR_OFPBPC_BAD_VALUE; } -error = oxm_pull_field_array(payload->data, fields_len, - >fields); -if (error) { -OFPPROP_LOG(, false, +if (fields_len > 0) { +error = oxm_pull_field_array(payload->data, fields_len, +>fields); +if (error) { +OFPPROP_LOG(, false, "ntr selection method fields are invalid"); -return error; +return error; +} +} else { +/* Selection_method "hash: w/o fields means default hash method. */ +gp->fields.values_size = 0; } return 0; diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index c9c2e51..a45d6ea 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -1,5 +1,4 @@ /* - * Copyright (c) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -4787,7 +4786,7 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash) } *webster; if (n_buckets == 0) { -VLOG_DBG(" Don't apply dp_hash method without buckets"); +VLOG_DBG(" Don't apply dp_hash method without buckets."); return false; } @@ -4862,9 +4861,24 @@ group_set_selection_method(struct group_dpif *group) const struct ofputil_group_props *props = >up.props; const char *selection_method = props->selection_method; +VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id); if (selection_method[0] == '\0') { -VLOG_DBG("No selection method specified."); -group->selection_method = SEL_METHOD_DEFAULT; +VLOG_DBG("No selection method specified. 
Trying dp_hash."); +/* If the controller has not specified a selection method, check if + * the dp_hash selection method with max 64 hash values is appropriate + * for the given bucket configuration. */ +if (group_setup_dp_hash_table(group, 64)) { +/* Use dp_hash selection method with symmetric L4 hash. */ +group->selection_method = SEL_METHOD_DP_HASH; +group->hash_alg = OVS_HASH_ALG_SYM_L4; +group->hash_basis = 0; +VLOG_DBG("Use dp_hash with %d hash
[ovs-dev] [PATCH v4 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in the netdev datapath. It will be used as the default hash algorithm for the dp_hash-based select groups in a subsequent commit to maintain compatibility with the symmetry property of the current default hash selection method. A new dpif_backer_support field 'max_hash_alg' is introduced to reflect the highest hash algorithm a datapath supports in the dp_hash action. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- datapath/linux/compat/include/linux/openvswitch.h | 4 ++ lib/flow.c| 43 +- lib/flow.h| 1 + lib/odp-execute.c | 23 ++-- ofproto/ofproto-dpif-xlate.c | 7 +++- ofproto/ofproto-dpif.c| 45 +++ ofproto/ofproto-dpif.h| 5 ++- 7 files changed, 121 insertions(+), 7 deletions(-) diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h index 6f4fa01..5c1e238 100644 --- a/datapath/linux/compat/include/linux/openvswitch.h +++ b/datapath/linux/compat/include/linux/openvswitch.h @@ -724,6 +724,10 @@ struct ovs_action_push_vlan { */ enum ovs_hash_alg { OVS_HASH_ALG_L4, +#ifndef __KERNEL__ + OVS_HASH_ALG_SYM_L4, +#endif + __OVS_HASH_MAX }; /* diff --git a/lib/flow.c b/lib/flow.c index 136f060..75ca456 100644 --- a/lib/flow.c +++ b/lib/flow.c @@ -2124,6 +2124,45 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis) return jhash_bytes(&fields, sizeof fields, basis); } +/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. 
*/ +uint32_t +flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis) +{ +union { +struct { +ovs_be16 eth_type; +ovs_be16 vlan_tci; +struct eth_addr eth_addr; +ovs_be16 pad; +}; +uint32_t word[3]; +} fields; + +uint32_t hash = basis; +int i; + +if (flow->packet_type != htonl(PT_ETH)) { +/* Cannot hash non-Ethernet flows */ +return 0; +} + +for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) { +fields.eth_addr.be16[i] = +flow->dl_src.be16[i] ^ flow->dl_dst.be16[i]; +} +fields.vlan_tci = 0; +for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) { +fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK); +} +fields.eth_type = flow->dl_type; +fields.pad = 0; + +hash = hash_add(hash, fields.word[0]); +hash = hash_add(hash, fields.word[1]); +hash = hash_add(hash, fields.word[2]); +return hash_finish(hash, basis); +} + /* Hashes 'flow' based on its L3 through L4 protocol information */ uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, @@ -2144,8 +2183,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, hash = hash_add64(hash, a[i] ^ b[i]); } } else { -/* Cannot hash non-IP flows */ -return 0; +/* Revert to hashing L2 headers */ +return flow_hash_symmetric_l2(flow, basis); } hash = hash_add(hash, flow->nw_proto); diff --git a/lib/flow.h b/lib/flow.h index 7a9e7d0..9de94b2 100644 --- a/lib/flow.h +++ b/lib/flow.h @@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port) uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis); +uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, bool inc_udp_ports ); diff --git a/lib/odp-execute.c b/lib/odp-execute.c index c5080ea..5831d1f 100644 --- a/lib/odp-execute.c +++ b/lib/odp-execute.c @@ -730,14 +730,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal, } switch ((enum 
ovs_action_attr) type) { + case OVS_ACTION_ATTR_HASH: { const struct ovs_action_hash *hash_act = nl_attr_get(a); -/* Calculate a hash value directly. This might not match the +/* Calculate a hash value directly. This might not match the * value computed by the datapath, but it is much less expensive, * and the current use case (bonding) does not require a strict * match to work properly. */ -if (hash_act->hash_alg == OVS_HASH_ALG_L4) { +switch (hash_act->hash_alg) { +case OVS_HASH_ALG_L4: { struct flow flow; uint32_t hash; @@ -753,7 +755,22 @@ odp_execute_actions(
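The symmetry property this patch preserves can be demonstrated with a small stand-alone sketch (the mixing function is a stand-in, not OVS's actual hash): XOR-folding the source and destination fields before hashing makes both directions of a connection hash identically, so forward and reverse packets of the same flow land in the same select-group bucket.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit mixer (arbitrary constants, not OVS's jhash). */
static uint32_t
mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1;
    return h ^ (h >> 16);
}

/* Sketch of a symmetric L4 hash: because src and dst are XOR-folded
 * before mixing, swapping them cannot change the result. */
static uint32_t
sym_l4_hash(uint32_t ip_a, uint32_t ip_b,
            uint16_t port_a, uint16_t port_b, uint32_t basis)
{
    uint32_t h = basis;

    h = mix(h, ip_a ^ ip_b);                    /* order-independent */
    h = mix(h, (uint32_t) (port_a ^ port_b));   /* order-independent */
    return h;
}
```

This is the same idea as the flow_hash_symmetric_* family in lib/flow.c, reduced to its essence: symmetry comes from folding paired fields, not from the choice of mixing function.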
[ovs-dev] [PATCH v4 0/3] Use improved dp_hash select group by default
The current default OpenFlow select group implementation sends every new L4 flow to the slow path for the balancing decision and installs a 5-tuple "miniflow" in the datapath to forward subsequent packets of the connection accordingly. Clearly this has major scalability issues with many parallel L4 flows and high connection setup rates. The dp_hash selection method for the OpenFlow select group was added to OVS as an alternative. It avoids the scalability issues for the price of an additional recirculation in the datapath. The dp_hash method is only available to OF1.5 SDN controllers speaking the Netronome Group Mod extension to configure the selection mechanism. This severely limited the applicability of the dp_hash select group in the past. Furthermore, testing revealed that the implemented dp_hash selection often generated a very uneven distribution of flows over group buckets and didn't consider bucket weights at all. The present patch set in a first step improves the dp_hash selection method to much more accurately distribute flows over weighted group buckets and to apply a symmetric dp_hash function to maintain the symmetry property of the legacy hash function. In a second step it makes the improved dp_hash method the default in OVS for select groups that can be accurately handled by dp_hash. That should be the vast majority of cases. Otherwise we fall back to the legacy slow-path selection method. The Netronome extension can still be used to override the default decision and require the legacy slow-path or the dp_hash selection method. v3 -> v4: - Rebased to master (commit 82d5b337cd). - Implemented Ben's improvement suggestions for patch 2/3. - Fixed machine dependency of one select group test case. v2 -> v3: - Fixed another corner case crash reported by Chen Yuefang. - Fixed several sparse and clang warnings reported by Ben. - Rewritten the select group unit tests to abstract the checks from the behavior of the system-specific hash function implementation. 
- Added dpif_backer_support field for dp_hash algorithms to prevent using the new OVS_HASH_L4_SYMMETRIC algorithm if it is not supported by the datapath. v1 -> v2: - Fixed crashes for corner cases reported by Chen Yuefang. - Fixed group ref leakage with dp_hash reported by Chen Yuefang. - Changed all xlation logging from INFO to DBG. - Revised, completed and detailed select group unit test cases in ofproto-dpif. - Updated selection_method documentation in ovs-ofctl man page. - Added NEWS item. Jan Scheurich (3): userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm ofproto-dpif: Improve dp_hash selection method for select groups ofproto-dpif: Use dp_hash as default selection method NEWS | 2 + datapath/linux/compat/include/linux/openvswitch.h | 4 + lib/flow.c| 43 ++- lib/flow.h| 1 + lib/odp-execute.c | 23 +- lib/odp-util.c| 4 +- lib/ofp-group.c | 15 +- ofproto/ofproto-dpif-xlate.c | 66 +++-- ofproto/ofproto-dpif.c| 211 ++- ofproto/ofproto-dpif.h| 19 +- ofproto/ofproto-provider.h| 2 +- tests/mpls-xlate.at | 26 +- tests/ofproto-dpif.at | 315 +- tests/ofproto-macros.at | 7 +- utilities/ovs-ofctl.8.in | 47 ++-- 15 files changed, 651 insertions(+), 134 deletions(-) -- 1.9.1
Re: [ovs-dev] [PATCH v3 3/3] ofproto-dpif: Use dp_hash as default selection method
> > Thanks for working on this. > > I get the following test failure with this applied (with or without the > incremental changes I suggested for patch 2). > > Will you take a look? > The test should verify that only one of the buckets is hit when the packets have no entropy in the custom hash fields. Which bucket is hit depends on the hash function implementation and can differ between platforms. Will fix the check. Regards, Jan > Thanks, > > Ben.
Re: [ovs-dev] [PATCH v3 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
> > Thanks a lot. > > I don't think that the new 'aux' member in ofputil_bucket is too > useful. It looks to me like the only use of it could be kept just as > easily in struct webster. > > group_setup_dp_hash_table() uses floating-point arithmetic for good > reasons, but it seems to me that some of it is unnecessary, especially > since we have DIV_ROUND_UP and ROUND_UP_POW2. > > group_dp_hash_best_bucket() seems like it unnecessarily modifies its > dp_hash parameter (and then never uses it again) and unnecessarily uses > % when & would work. I also saw a few ways to make the style better > match what we most often do these days. > > So here's an incremental that I suggest folding in for v4. What do you > think? I agree with your suggestions. The incremental looks good to me. Will include it in v4. Thanks, Jan > > Thanks, > > Ben. > > --8<--cut here-->8-- > > diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h > index af4033dc68e4..8d893a53fcb2 100644 > --- a/include/openvswitch/ofp-group.h > +++ b/include/openvswitch/ofp-group.h > @@ -47,7 +47,6 @@ struct bucket_counter { > /* Bucket for use in groups. */ > struct ofputil_bucket { > struct ovs_list list_node; > -uint16_t aux; /* Padding. Also used for temporary data. */ > uint16_t weight;/* Relative weight, for "select" groups. */ > ofp_port_t watch_port; /* Port whose state affects whether this > bucket > * is live. 
Only required for fast failover > diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c > index e35582df0c37..1c78c2d7ca50 100644 > --- a/ofproto/ofproto-dpif-xlate.c > +++ b/ofproto/ofproto-dpif-xlate.c > @@ -4386,26 +4386,22 @@ group_dp_hash_best_bucket(struct xlate_ctx *ctx, >const struct group_dpif *group, >uint32_t dp_hash) > { > -struct ofputil_bucket *bucket, *best_bucket = NULL; > -uint32_t n_hash = group->hash_mask + 1; > - > -uint32_t hash = dp_hash &= group->hash_mask; > -ctx->wc->masks.dp_hash |= group->hash_mask; > +uint32_t hash_mask = group->hash_mask; > +ctx->wc->masks.dp_hash |= hash_mask; > > /* Starting from the original masked dp_hash value iterate over the > * hash mapping table to find the first live bucket. As the buckets > * are quasi-randomly spread over the hash values, this maintains > * a distribution according to bucket weights even when some buckets > * are non-live. */ > -for (int i = 0; i < n_hash; i++) { > -bucket = group->hash_map[(hash + i) % n_hash]; > -if (bucket_is_alive(ctx, bucket, 0)) { > -best_bucket = bucket; > -break; > +for (int i = 0; i <= hash_mask; i++) { > +struct ofputil_bucket *b = group->hash_map[(dp_hash + i) & > hash_mask]; > +if (bucket_is_alive(ctx, b, 0)) { > +return b; > } > } > > -return best_bucket; > +return NULL; > } > > static void > diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c > index f5ecd8be8d05..c9c2e5176e46 100644 > --- a/ofproto/ofproto-dpif.c > +++ b/ofproto/ofproto-dpif.c > @@ -4777,13 +4777,13 @@ group_setup_dp_hash_table(struct group_dpif *group, > size_t max_hash) > { > struct ofputil_bucket *bucket; > uint32_t n_buckets = group->up.n_buckets; > -double total_weight = 0.0; > +uint64_t total_weight = 0; > uint16_t min_weight = UINT16_MAX; > -uint32_t n_hash; > struct webster { > struct ofputil_bucket *bucket; > uint32_t divisor; > double value; > +int hits; > } *webster; > > if (n_buckets == 0) { > @@ -4794,7 +4794,6 @@ group_setup_dp_hash_table(struct 
group_dpif *group, > size_t max_hash) > webster = xcalloc(n_buckets, sizeof(struct webster)); > int i = 0; > LIST_FOR_EACH (bucket, list_node, >up.buckets) { > -bucket->aux = 0; > if (bucket->weight > 0 && bucket->weight < min_weight) { > min_weight = bucket->weight; > } > @@ -4802,6 +4801,7 @@ group_setup_dp_hash_table(struct group_dpif *group, > size_t max_hash) > webster[i].bucket = bucket; > webster[i].divisor = 1; > webster[i].value = bucket->weight; > +webster[i].hits = 0; > i++; > } > > @@ -4810,19 +4810,19 @@ group_setup_dp_hash_table(struct group_dpif *group, > size_t max_hash) > free(webster); > return false; > } > -VLOG_DBG(" Minimum weight: %d, total weight: %.0f", > +VLOG_DBG(" Minimum weight: %d, total weight: %"PRIu64, > min_weight, total_weight); > > -uint32_t min_slots = ceil(total_weight / min_weight); > -n_hash = MAX(16, 1L << log_2_ceil(min_slots)); > - > +uint64_t min_slots = DIV_ROUND_UP(total_weight, min_weight); > +uint64_t
Re: [ovs-dev] [PATCH v3] Upcall/Slowpath rate limiter for OVS
Hi Manu, Thanks for working on this. Two general comments: 1. Is there a chance to add unit test cases for this feature? I know it might be difficult due to the real-time character, but perhaps using very low parameter values? 2. I believe the number RL-dropped packets must be accounted for in function dpif_netdev_get_stats() as stats->n_missed, otherwise the overall number of reported packets may not match the total number of packets processed. Other comments in-line. Regards, Jan > From: Manohar Krishnappa Chidambaraswamy > Sent: Monday, 07 May, 2018 12:45 > > Hi > > Rebased to master and adapted to the new dpif-netdev-perf counters. > As explained in v2 thread, OFPM_SLOWPATH meters cannot be used as is > for rate-limiting upcalls, hence reverted back to the simpler method > using token bucket. I guess the question was not whether to use meter actions in the datapath to implement the upcall rate limiter in dpif-netdev but whether to allow configuration of the upcall rate limiter through OpenFlow Meter Mod command for special pre-defined meter OFPM_SLOWPATH. In any case, the decision was to drop that idea. > > Could you please review this patch? > > Thanx > Manu > > v2: https://patchwork.ozlabs.org/patch/860687/ > v1: https://patchwork.ozlabs.org/patch/836737/ Please add the list of main changes between the current and the previous version in the next revision of the patch. Put it below the '---' separator so that is not part of the commit message. 
> > Signed-off-by: Manohar K C > <manohar.krishnappa.chidambarasw...@ericsson.com> > CC: Jan Scheurich <jan.scheur...@ericsson.com> > --- > Documentation/howto/dpdk.rst | 21 +++ > lib/dpif-netdev-perf.h | 1 + > lib/dpif-netdev.c| 83 > > vswitchd/vswitch.xml | 47 + > 4 files changed, 146 insertions(+), 6 deletions(-) > > diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst > index 79b626c..bd1eaac 100644 > --- a/Documentation/howto/dpdk.rst > +++ b/Documentation/howto/dpdk.rst > @@ -739,3 +739,24 @@ devices to bridge ``br0``. Once complete, follow the > below steps: > Check traffic on multiple queues:: > > $ cat /proc/interrupts | grep virtio > + > +Upcall rate limiting > + > +ovs-vsctl can be used to enable and configure upcall rate limit parameters. > +There are 2 configurable values ``upcall-rate`` and ``upcall-burst`` which > +take effect when global enable knob ``upcall-rl`` is set to true. Please explain why upcall rate limiting may be relevant in the context of DPDK datapath (upcalls executed in the context of the PMD and affecting datapath forwarding capacity). Worth noting here, perhaps, that this rate limiting is independently per PMD and not a global limit. Replace "knob" by "configuration parameter" and put the command to enable rate limiting: $ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true before the commands to tune the token bucket parameters. Mention the default parameter values? > + > +Upcall rate should be set using ``upcall-rate`` in packets-per-sec. For > +example:: > + > +$ ovs-vsctl set Open_vSwitch . other_config:upcall-rate=2000 > + > +Upcall burst should be set using ``upcall-burst`` in packets-per-sec. For > +example:: > + > +$ ovs-vsctl set Open_vSwitch . other_config:upcall-burst=2000 > + > +Upcall ratelimit feature should be globally enabled using ``upcall-rl``. For > +example:: > + > +$ ovs-vsctl set Open_vSwitch . 
other_config:upcall-rl=true > diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h > index 5993c25..189213c 100644 > --- a/lib/dpif-netdev-perf.h > +++ b/lib/dpif-netdev-perf.h > @@ -64,6 +64,7 @@ enum pmd_stat_type { > * recirculation. */ > PMD_STAT_SENT_PKTS, /* Packets that have been sent. */ > PMD_STAT_SENT_BATCHES, /* Number of batches sent. */ > +PMD_STAT_RATELIMIT_DROP,/* Packets dropped due to upcall policer. */ Name PMD_STAT_RL_DROP and move up in list after PMD_STAT_LOST. Add space before comment. > PMD_CYCLES_ITER_IDLE, /* Cycles spent in idle iterations. */ > PMD_CYCLES_ITER_BUSY, /* Cycles spent in busy iterations. */ > PMD_N_STATS > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c > index be31fd0..eebab89 100644 > --- a/lib/dpif-netdev.c > +++ b/lib/dpif-netdev.c > @@ -101,6 +101,16 @@ static struct shash dp_netdevs > OVS_GUARDED_BY(dp_netdev_mutex) > > static struct vlog_rate_limit upcall_rl = VLOG_RATE_LIMIT_INIT(600, 600); > > +/* Upcall rate-limit parameters */ > +static bool upcall_ratelimit; > +static unsigned i
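The token-bucket admission scheme discussed in this thread can be sketched as below. This is an illustrative standalone sketch, not the patch's code: the struct and function names (`upcall_tb`, `upcall_tb_admit`) are hypothetical, and OVS's actual implementation uses its own token-bucket helper and per-PMD TSC-based timing rather than a floating-point clock.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-PMD token bucket: 'rate' tokens (packets) are added per
 * second, capped at 'burst'.  One token is consumed per upcall; when the
 * bucket is empty the packet is dropped and would be counted against the
 * rate-limit drop statistic. */
struct upcall_tb {
    uint64_t rate;    /* tokens (packets) per second */
    uint64_t burst;   /* bucket capacity */
    double tokens;    /* current fill level */
    double last_sec;  /* time of last refill, in seconds */
};

static void
upcall_tb_refill(struct upcall_tb *tb, double now_sec)
{
    tb->tokens += (now_sec - tb->last_sec) * tb->rate;
    if (tb->tokens > tb->burst) {
        tb->tokens = tb->burst;   /* Unused credit is capped at 'burst'. */
    }
    tb->last_sec = now_sec;
}

/* Returns true if the upcall may proceed, false if it must be dropped. */
static bool
upcall_tb_admit(struct upcall_tb *tb, double now_sec)
{
    upcall_tb_refill(tb, now_sec);
    if (tb->tokens >= 1.0) {
        tb->tokens -= 1.0;
        return true;
    }
    return false;
}
```

Note that, as Jan points out above, with one bucket per PMD this limits the upcall rate independently per PMD thread, not globally.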
[ovs-dev] [PATCH v3 3/3] ofproto-dpif: Use dp_hash as default selection method
The dp_hash selection method for select groups overcomes the scalability
problems of the current default selection method which, due to L2-L4 hashing
during xlation and un-wildcarding of the hashed fields, basically requires an
upcall to the slow path to load-balance every L4 connection. The consequences
are an explosion of datapath flows (megaflows degenerate to miniflows) and a
limitation of the connection setup rate OVS can handle.

This commit changes the default selection method to dp_hash, provided the
bucket configuration is such that the dp_hash method can accurately represent
the bucket weights with up to 64 hash values. Otherwise we stick to the
original default hash method. We use the new dp_hash algorithm
OVS_HASH_L4_SYMMETRIC to maintain the symmetry property of the old default
hash method.

A controller can explicitly request the old default hash selection method by
specifying selection method "hash" with an empty list of fields in the Group
properties of the OpenFlow 1.5 Group Mod message.

Update the documentation about selection method in the ovs-ofctl man page.

Revise and complete the ofproto-dpif unit test cases for select groups.

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com>
Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com>
---
 NEWS                       |   2 +
 lib/ofp-group.c            |  15 ++-
 ofproto/ofproto-dpif.c     |  30 +++--
 ofproto/ofproto-dpif.h     |   1 +
 ofproto/ofproto-provider.h |   2 +-
 tests/mpls-xlate.at        |  26 ++--
 tests/ofproto-dpif.at      | 315 +++--
 tests/ofproto-macros.at    |   7 +-
 utilities/ovs-ofctl.8.in   |  47 ---
 9 files changed, 334 insertions(+), 111 deletions(-)

diff --git a/NEWS b/NEWS
index cd4ffbb..fbd987f 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,8 @@ Post-v2.9.0
    - ovs-vsctl: New commands "add-bond-iface" and "del-bond-iface".
    - OpenFlow:
      * OFPT_ROLE_STATUS is now available in OpenFlow 1.3.
+     * Default selection method for select groups is now dp_hash with improved
+       accuracy.
- Linux kernel 4.14 * Add support for compiling OVS with the latest Linux 4.14 kernel - ovn: diff --git a/lib/ofp-group.c b/lib/ofp-group.c index 31b0437..c5ddc65 100644 --- a/lib/ofp-group.c +++ b/lib/ofp-group.c @@ -1518,12 +1518,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf *payload, return OFPERR_OFPBPC_BAD_VALUE; } -error = oxm_pull_field_array(payload->data, fields_len, - >fields); -if (error) { -OFPPROP_LOG(, false, +if (fields_len > 0) { +error = oxm_pull_field_array(payload->data, fields_len, +>fields); +if (error) { +OFPPROP_LOG(, false, "ntr selection method fields are invalid"); -return error; +return error; +} +} else { +/* Selection_method "hash: w/o fields means default hash method. */ +gp->fields.values_size = 0; } return 0; diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index f5ecd8b..52282a8 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -1,5 +1,4 @@ /* - * Copyright (c) 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017 Nicira, Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -4787,7 +4786,7 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash) } *webster; if (n_buckets == 0) { -VLOG_DBG(" Don't apply dp_hash method without buckets"); +VLOG_DBG(" Don't apply dp_hash method without buckets."); return false; } @@ -4860,9 +4859,24 @@ group_set_selection_method(struct group_dpif *group) const struct ofputil_group_props *props = >up.props; const char *selection_method = props->selection_method; +VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id); if (selection_method[0] == '\0') { -VLOG_DBG("No selection method specified."); -group->selection_method = SEL_METHOD_DEFAULT; +VLOG_DBG("No selection method specified. 
Trying dp_hash."); +/* If the controller has not specified a selection method, check if + * the dp_hash selection method with max 64 hash values is appropriate + * for the given bucket configuration. */ +if (group_setup_dp_hash_table(group, 64)) { +/* Use dp_hash selection method with symmetric L4 hash. */ +group->selection_method = SEL_METHOD_DP_HASH; +group->hash_alg = OVS_HASH_ALG_SYM_L4; +group->hash_basis = 0; +VLOG_DBG("Use dp_hash with %d hash values u
[ovs-dev] [PATCH v3 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in the netdev datapath. It will be used as default hash algorithm for the dp_hash-based select groups in a subsequent commit to maintain compatibility with the symmetry property of the current default hash selection method. A new dpif_backer_support field 'max_hash_alg' is introduced to reflect the highest hash algorithm a datapath supports in the dp_hash action. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- datapath/linux/compat/include/linux/openvswitch.h | 4 ++ lib/flow.c| 43 +- lib/flow.h| 1 + lib/odp-execute.c | 23 ++-- ofproto/ofproto-dpif-xlate.c | 7 +++- ofproto/ofproto-dpif.c| 45 +++ ofproto/ofproto-dpif.h| 5 ++- 7 files changed, 121 insertions(+), 7 deletions(-) diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h index 84ebcaf..2bb3cb2 100644 --- a/datapath/linux/compat/include/linux/openvswitch.h +++ b/datapath/linux/compat/include/linux/openvswitch.h @@ -720,6 +720,10 @@ struct ovs_action_push_vlan { */ enum ovs_hash_alg { OVS_HASH_ALG_L4, +#ifndef __KERNEL__ + OVS_HASH_ALG_SYM_L4, +#endif + __OVS_HASH_MAX }; /* diff --git a/lib/flow.c b/lib/flow.c index 09b66b8..c65b288 100644 --- a/lib/flow.c +++ b/lib/flow.c @@ -2108,6 +2108,45 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis) return jhash_bytes(, sizeof fields, basis); } +/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. 
*/ +uint32_t +flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis) +{ +union { +struct { +ovs_be16 eth_type; +ovs_be16 vlan_tci; +struct eth_addr eth_addr; +ovs_be16 pad; +}; +uint32_t word[3]; +} fields; + +uint32_t hash = basis; +int i; + +if (flow->packet_type != htonl(PT_ETH)) { +/* Cannot hash non-Ethernet flows */ +return 0; +} + +for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) { +fields.eth_addr.be16[i] = +flow->dl_src.be16[i] ^ flow->dl_dst.be16[i]; +} +fields.vlan_tci = 0; +for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) { +fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK); +} +fields.eth_type = flow->dl_type; +fields.pad = 0; + +hash = hash_add(hash, fields.word[0]); +hash = hash_add(hash, fields.word[1]); +hash = hash_add(hash, fields.word[2]); +return hash_finish(hash, basis); +} + /* Hashes 'flow' based on its L3 through L4 protocol information */ uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, @@ -2128,8 +2167,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, hash = hash_add64(hash, a[i] ^ b[i]); } } else { -/* Cannot hash non-IP flows */ -return 0; +/* Revert to hashing L2 headers */ +return flow_hash_symmetric_l2(flow, basis); } hash = hash_add(hash, flow->nw_proto); diff --git a/lib/flow.h b/lib/flow.h index af82931..900e8f8 100644 --- a/lib/flow.h +++ b/lib/flow.h @@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port) uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis); +uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, bool inc_udp_ports ); diff --git a/lib/odp-execute.c b/lib/odp-execute.c index 1969f02..c716c41 100644 --- a/lib/odp-execute.c +++ b/lib/odp-execute.c @@ -726,14 +726,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal, } switch ((enum 
ovs_action_attr) type) { + case OVS_ACTION_ATTR_HASH: { const struct ovs_action_hash *hash_act = nl_attr_get(a); -/* Calculate a hash value directly. This might not match the +/* Calculate a hash value directly. This might not match the * value computed by the datapath, but it is much less expensive, * and the current use case (bonding) does not require a strict * match to work properly. */ -if (hash_act->hash_alg == OVS_HASH_ALG_L4) { +switch (hash_act->hash_alg) { +case OVS_HASH_ALG_L4: { struct flow flow; uint32_t hash; @@ -749,7 +751,22 @@ odp_execute_actions(
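The symmetry property that OVS_HASH_L4_SYMMETRIC preserves — both directions of a connection hash to the same value — comes from XOR-combining source and destination fields before hashing, as in `flow_hash_symmetric_l2()`/`flow_hash_symmetric_l3l4()` above. A toy illustration of the principle (the mix function here is made up for brevity; the real code feeds the XOR-ed fields through jhash/hash_add):

```c
#include <stdint.h>

/* Toy symmetric flow hash: XOR src and dst fields together before mixing,
 * so swapping them cannot change the result.  Illustrates the symmetry
 * trick only -- this is NOT the hash function OVS uses. */
static uint32_t
toy_sym_hash(uint32_t src_ip, uint32_t dst_ip,
             uint16_t src_port, uint16_t dst_port, uint32_t basis)
{
    uint32_t h = basis;
    h ^= src_ip ^ dst_ip;                               /* order-independent */
    h = h * 2654435761u + (uint32_t) (src_port ^ dst_port);
    return h;
}
```

The consequence for select groups is that a reply packet lands in the same bucket as the request, which the old default (slow-path) hash method also guaranteed.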
[ovs-dev] [PATCH v3 0/3] Use improved dp_hash select group by default
The current default OpenFlow select group implementation sends every new L4 flow to the slow path for the balancing decision and installs a 5-tuple "miniflow" in the datapath to forward subsequent packets of the connection accordingly. Clearly this has major scalability issues with many parallel L4 flows and high connection setup rates. The dp_hash selection method for the OpenFlow select group was added to OVS as an alternative. It avoids the scalability issues for the price of an additional recirculation in the datapath. The dp_hash method is only available to OF1.5 SDN controllers speaking the Netronome Group Mod extension to configure the selection mechanism. This severely limited the applicability of the dp_hash select group in the past. Furthermore, testing revealed that the implemented dp_hash selection often generated a very uneven distribution of flows over group buckets and didn't consider bucket weights at all. The present patch set in a first step improves the dp_hash selection method to much more accurately distribute flows over weighted group buckets and to apply a symmetric dp_hash function to maintain the symmetry property of the legacy hash function. In a second step it makes the improved dp_hash method the default in OVS for select groups that can be accurately handled by dp_hash. That should be the vast majority of cases. Otherwise we fall back to the legacy slow-path selection method. The Netronome extension can still be used to override the default decision and require the legacy slow-path or the dp_hash selection method. v2 -> v3: - Fixed another corner case crash reported by Chen Yuefang. - Fixed several sparse and clang warnings reported by Ben. - Rewritten the select group unit tests to abstract the checks from the behavior of the system-specific hash function implementation. - Added dpif_backer_support field for dp_hash algorithms to prevent using the new OVS_HASH_L4_SYMMETRIC algorithm if it is not supported by the datapath. 
v1 -> v2:
- Fixed crashes for corner cases reported by Chen Yuefang.
- Fixed group ref leakage with dp_hash reported by Chen Yuefang.
- Changed all xlation logging from INFO to DBG.
- Revised, completed and detailed select group unit test cases in
  ofproto-dpif.
- Updated selection_method documentation in ovs-ofctl man page.
- Added NEWS item.

Jan Scheurich (3):
  userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
  ofproto-dpif: Improve dp_hash selection method for select groups
  ofproto-dpif: Use dp_hash as default selection method

 NEWS                                              |   2 +
 datapath/linux/compat/include/linux/openvswitch.h |   4 +
 include/openvswitch/ofp-group.h                   |   1 +
 lib/flow.c                                        |  43 ++-
 lib/flow.h                                        |   1 +
 lib/odp-execute.c                                 |  23 +-
 lib/odp-util.c                                    |   4 +-
 lib/ofp-group.c                                   |  15 +-
 ofproto/ofproto-dpif-xlate.c                      |  89 --
 ofproto/ofproto-dpif.c                            | 209 +-
 ofproto/ofproto-dpif.h                            |  19 +-
 ofproto/ofproto-provider.h                        |   2 +-
 tests/mpls-xlate.at                               |  26 +-
 tests/ofproto-dpif.at                             | 314 +-
 tests/ofproto-macros.at                           |   7 +-
 utilities/ovs-ofctl.8.in                          |  47 ++--
 16 files changed, 667 insertions(+), 139 deletions(-)

--
1.9.1
[ovs-dev] [PATCH v3 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
The current implementation of the "dp_hash" selection method suffers from two
deficiencies:

1. The hash mask and hence the number of dp_hash values is just large enough
   to cover the number of group buckets, but does not consider the case that
   buckets have different weights.

2. The xlate-time selection of the best bucket from the masked dp_hash value
   often results in bucket load distributions that are quite different from
   the bucket weights because the number of available masked dp_hash values
   is too small (2-6 bits compared to 32 bits of a full hash in the default
   hash selection method).

This commit provides a more accurate implementation of the dp_hash select
group by applying the well-known Webster method for distributing a small
number of "seats" fairly over the weighted "parties" (see
https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method). The dp_hash
mask is automatically chosen large enough to provide good enough accuracy
even with widely differing weights. This distribution happens at group
modification time and the resulting table is stored with the group-dpif
struct.

At xlation time, we use the masked dp_hash value as an index to look up the
assigned bucket. If the bucket should not be live, we do a circular search
over the mapping table until we find the first live bucket. As the buckets in
the table are by construction in pseudo-random order with a frequency
according to their weight, this method maintains correct distribution even if
one or more buckets are non-live.

Xlation is further simplified by storing some derived select group state at
group construction in struct group-dpif in a form better suited for xlation
purposes.

Adapted the unit test case for dp_hash select group accordingly.
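The Webster (Sainte-Laguë) apportionment described above can be sketched as follows. This is the textbook form of the method, not the patch's `group_setup_dp_hash_table()` code: each of the `n_hash` table slots goes to the bucket with the currently largest weight/divisor quotient, and that bucket's divisor then advances 1 -> 3 -> 5 -> ...

```c
#include <stddef.h>

/* Sainte-Laguë apportionment of 'n_hash' hash-table slots over up to 64
 * weighted buckets.  'slots[i]' receives the number of hash values
 * assigned to bucket i; proportions follow the weights as closely as the
 * integer slot count allows. */
static void
webster_apportion(const unsigned *weights, size_t n_buckets,
                  unsigned n_hash, unsigned *slots)
{
    unsigned divisor[64];   /* assumes n_buckets <= 64, as in the patch */

    for (size_t i = 0; i < n_buckets; i++) {
        divisor[i] = 1;
        slots[i] = 0;
    }
    for (unsigned s = 0; s < n_hash; s++) {
        size_t best = 0;
        double best_q = -1.0;

        /* Give the next slot to the highest weight/divisor quotient. */
        for (size_t i = 0; i < n_buckets; i++) {
            double q = (double) weights[i] / divisor[i];
            if (q > best_q) {
                best_q = q;
                best = i;
            }
        }
        slots[best]++;
        divisor[best] += 2;   /* odd divisors: 1, 3, 5, ... */
    }
}
```

For weights in exact proportion to the slot count (e.g. 2:1:1 over 16 slots) the method reproduces the proportions exactly; for awkward weights it minimizes the per-bucket distortion, which is why a modest hash mask already gives a good weight approximation.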
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- include/openvswitch/ofp-group.h | 1 + lib/odp-util.c | 4 +- ofproto/ofproto-dpif-xlate.c| 82 ++ ofproto/ofproto-dpif.c | 148 ofproto/ofproto-dpif.h | 13 tests/ofproto-dpif.at | 15 ++-- 6 files changed, 227 insertions(+), 36 deletions(-) diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h index 8d893a5..af4033d 100644 --- a/include/openvswitch/ofp-group.h +++ b/include/openvswitch/ofp-group.h @@ -47,6 +47,7 @@ struct bucket_counter { /* Bucket for use in groups. */ struct ofputil_bucket { struct ovs_list list_node; +uint16_t aux; /* Padding. Also used for temporary data. */ uint16_t weight;/* Relative weight, for "select" groups. */ ofp_port_t watch_port; /* Port whose state affects whether this bucket * is live. Only required for fast failover diff --git a/lib/odp-util.c b/lib/odp-util.c index 6db241a..2db4e9d 100644 --- a/lib/odp-util.c +++ b/lib/odp-util.c @@ -595,7 +595,9 @@ format_odp_hash_action(struct ds *ds, const struct ovs_action_hash *hash_act) ds_put_format(ds, "hash("); if (hash_act->hash_alg == OVS_HASH_ALG_L4) { -ds_put_format(ds, "hash_l4(%"PRIu32")", hash_act->hash_basis); +ds_put_format(ds, "l4(%"PRIu32")", hash_act->hash_basis); +} else if (hash_act->hash_alg == OVS_HASH_ALG_SYM_L4) { +ds_put_format(ds, "sym_l4(%"PRIu32")", hash_act->hash_basis); } else { ds_put_format(ds, "Unknown hash algorithm(%"PRIu32")", hash_act->hash_alg); diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 05db090..e6d97a4 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -4380,35 +4380,59 @@ xlate_hash_fields_select_group(struct xlate_ctx *ctx, struct group_dpif *group, } } +static struct ofputil_bucket * +group_dp_hash_best_bucket(struct xlate_ctx *ctx, + const struct group_dpif *group, + uint32_t dp_hash) +{ 
+struct ofputil_bucket *bucket, *best_bucket = NULL; +uint32_t n_hash = group->hash_mask + 1; + +uint32_t hash = dp_hash &= group->hash_mask; +ctx->wc->masks.dp_hash |= group->hash_mask; + +/* Starting from the original masked dp_hash value iterate over the + * hash mapping table to find the first live bucket. As the buckets + * are quasi-randomly spread over the hash values, this maintains + * a distribution according to bucket weights even when some buckets + * are non-live. */ +for (int i = 0; i < n_hash; i++) { +bucket = group->hash_map[(hash + i) % n_hash]; +if (bucket_is_alive(ctx, bucket, 0)) { +best_bucket = bucket; +bre
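The circular search in `group_dp_hash_best_bucket()` above reduces to the following standalone sketch (hypothetical names; the real code walks `group->hash_map` and tests liveness with `bucket_is_alive()`):

```c
/* Starting at the masked hash index, scan the mapping table circularly
 * and return the index of the first live bucket, or -1 if none is live.
 * 'n_hash' must be a power of two so that '& (n_hash - 1)' masks the
 * index, mirroring the patch's use of group->hash_mask. */
static int
first_live_bucket(const int *live, unsigned n_hash, unsigned start)
{
    for (unsigned i = 0; i < n_hash; i++) {
        unsigned idx = (start + i) & (n_hash - 1);
        if (live[idx]) {
            return (int) idx;
        }
    }
    return -1;
}
```

Because the buckets are spread quasi-randomly over the table in proportion to their weights, flows whose first-choice bucket is down are redistributed over the remaining buckets roughly according to those same weights.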
Re: [ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs
> > I hope that Clang is intelligent enough to recognize this. If not, I
> > wouldn't know how to fix it other than by removing
> > OVS_REQUIRES(s->stats_mutex) from pmd_perf_stats_clear_lock() and just
> > rely on comments.
> >
> > BR, Jan
>
> Thanks Jan, that resolves the issue and there's a clear travis build now.
> I'll add this series as part of the next pull request.
>
> Thanks all for the work reviewing and testing this.
>
> Ian

Perfect! Thank you too,
Jan
[ovs-dev] Mempool redesign for OVS 2.10
Hi,

Thanks, everyone, for re-opening the discussion around the new packet mempool
handling for 2.10. Before we agree on what to actually implement I’d like to
summarize my understanding of the requirements that have been discussed so
far. Based on those I want to share some thoughts about how we can best
address these requirements.

Requirements:

R1 (Backward compatibility): The new mempool handling shall be able to
function equally well as the OVS 2.9 design base given any specific
configuration of OVS-DPDK: hugepage memory, PMDs, ports, queues, MTU sizes,
traffic flows. This is to ensure that we can upgrade OVS in existing
deployments without risk of breaking anything.

R2 (Dimensioning for static deployments): It shall be possible for an
operator to calculate the amount of memory needed for packet mempools in a
given static (maximum) configuration (PMDs, ethernet ports and queues,
maximum number of vhost ports, MTU sizes) to reserve sufficient hugepages
for OVS.

R3 (Safe operation): If the mempools are dimensioned correctly, it shall not
be possible that OVS runs out of mbufs for packet processing.

R4 (Minimal footprint): The packet mempool size needed for safe operation of
OVS should be as small as possible.

R5 (Dynamic mempool allocation): It should be possible to automatically
adjust the size of packet mempools at run-time when changing the OVS
configuration, e.g. adding PMDs, adding ports, adding rx/tx queues, changing
the port MTU size. (Note: Shrinking the mempools with a reducing OVS
configuration is less important.)

Actual maximum mbuf consumption in OVS DPDK:

1. Phy rx queues: Sum over dpdk dev:
   (dev->requested_n_rxq * dev->requested_rxq_size)
   Note: Normally the number of rx queues should not exceed the number of
   PMDs.

2. Phy tx queues: Sum over dpdk dev:
   (#active tx queues (=#PMDs) * dev->requested_txq_size)
   Note 1: These are hogged because of the DPDK PMD’s lazy release of
   transmitted mbufs.
   Note 2: Stored mbufs in a tx queue are coming from all ports.

3. One rx batch per PMD during processing: #PMDs * 32

4. One batch per active tx queue for time-based batching:
   32 * #devs * #PMDs

Assuming an rx/tx queue size of 2K for physical ports and #rx queues = #PMDs
(RSS), the upper limit for the used mbufs would be:

  #dpdk devs * #PMDs * 4K
  + (#dpdk devs + #vhost devs) * #PMDs * 32
  + #PMDs * 32

Examples:

* With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs) this
  yields 32K + 17K = 49K mbufs.

* For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs) this
  would yield 128K + 66K = 194K mbufs.

Roughly 1/3rd of the total mbufs are hogged in dpdk dev rx queues. The
remaining 2/3rds are populated with an arbitrary mix of mbufs from all
sources.

Legacy shared mempool handling up to OVS 2.9:

* One mempool per NUMA node and used MTU size range.
* Each mempool has the maximum of (256K, 128K, 64K, 32K or 16K) mbufs
  available in DPDK at mempool creation.
* Each mempool is shared among all ports on its NUMA node with an MTU in its
  range.
* All rx queues of a port share the same mempool.

The legacy code trivially satisfies R1. Its good feature is that the
mempools are shared, so that it avoids the bloating of dedicated mempools
per port implied by the handling on master (see below). Apart from that it
does not fulfill any of the requirements:

* It swallows all available hugepage memory to allocate up to 256K mbufs per
  NUMA node, even though that is far more than typically needed (violating
  R4).
* The actual size of created mempools depends on the order of creation and
  the hugepage memory available. Early mempools are over-dimensioned, later
  mempools might be under-dimensioned. Operation is not at all safe
  (violating R3).
* It doesn’t provide any help for the operator to dimension and reserve
  hugepages for OVS (violating R2).
* The only dynamicity is that it creates additional mempools for new MTU
  size ranges only when they are needed. Due to greedy initial allocation
  these are likely to fail (violating R5).
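The worst-case formula above can be written as a small helper; the function name is made up for illustration, but the arithmetic follows the text (queue size 2048 gives the "4K per dev per PMD" term, batch size 32). The exact results round to the ~49K and ~194K figures quoted for the two deployments.

```c
/* Worst-case mbuf consumption per NUMA node:
 *   phy rx + tx queues:  n_dpdk * n_pmd * (rxq_size + txq_size)
 *   per-tx-queue batch:  (n_dpdk + n_vhost) * n_pmd * 32
 *   per-PMD rx batch:    n_pmd * 32                             */
static unsigned
worst_case_mbufs(unsigned n_dpdk, unsigned n_vhost, unsigned n_pmd,
                 unsigned rxq_size, unsigned txq_size)
{
    return n_dpdk * n_pmd * (rxq_size + txq_size)
           + (n_dpdk + n_vhost) * n_pmd * 32
           + n_pmd * 32;
}
```

This kind of closed-form ceiling is exactly what requirement R2 asks for: an operator can plug in the static maximum configuration and reserve hugepages accordingly.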
My take is that even though the shared mempool concept is good, the legacy
mempool handling should not be kept as is.

Mempool per port scheme (currently implemented on master):

From the above mbuf utilization calculation it is clear that only the dpdk
rx queues are populated exclusively with mbufs from the port’s mempool. All
other places are populated with mbufs from all ports, in the case of tx
queues typically not even their own. As it is not possible to predict the
assignment of rx queues to PMDs and the flow of packets between ports,
safety requirement R3 implies that each port mempool must be dimensioned for
the worst case, i.e.

  [#PMDs * 2K]
  + #dpdk devs * #PMDs * 2K
  + (#dpdk devs + #vhost devs) * #PMDs * 32
  + #PMDs * 32

Even though the first term [#PMDs * 2K] is only needed for physical ports this
Re: [ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs
Hi Ian, Thanks for checking this. I suggest to address Clang's compliant by the following incremental: diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index 47ce2c2..c7b8e7b 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -442,6 +442,7 @@ pmd_perf_stats_clear(struct pmd_perf_stats *s) inline void pmd_perf_start_iteration(struct pmd_perf_stats *s) +OVS_REQUIRES(s->stats_mutex) { if (s->clear) { /* Clear the PMD stats before starting next iteration. */ The mutex pmd->perf_stats.stats_mutex is taken by the calling pmd_thread_main() while it is in the poll loop. So the pre-requisite for pmd_perf_start_iteration() is in place. It is essential for performance that pmd_perf_start_iteration () does not take the lock in each iteration. I hope that Clang is intelligent enough to recognize this. If not, I wouldn't know how to fix it other than by removing OVS_REQUIRES(s->stats_mutex) from pmd_perf_stats_clear_lock() and just rely on comments. BR, Jan > -Original Message- > From: Stokes, Ian [mailto:ian.sto...@intel.com] > Sent: Wednesday, 25 April, 2018 11:55 > To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org > Cc: ktray...@redhat.com; i.maxim...@samsung.com; O Mahony, Billy > <billy.o.mah...@intel.com> > Subject: RE: [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs > > > This patch instruments the dpif-netdev datapath to record detailed > > statistics of what is happening in every iteration of a PMD thread. > > > > The collection of detailed statistics can be controlled by a new > > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". > > By default it is disabled. The run-time overhead, when enabled, is > > in the order of 1%. > > Hi Jan, thanks for the patch, 1 comment below. > > [snip] > > > + > > +/* This function can be called from the anywhere to clear the stats > > + * of PMD and non-PMD threads. 
*/ > > +void > > +pmd_perf_stats_clear(struct pmd_perf_stats *s) > > +{ > > +if (ovs_mutex_trylock(>stats_mutex) == 0) { > > +/* Locking successful. PMD not polling. */ > > +pmd_perf_stats_clear_lock(s); > > +ovs_mutex_unlock(>stats_mutex); > > +} else { > > +/* Request the polling PMD to clear the stats. There is no need > > to > > + * block here as stats retrieval is prevented during clearing. */ > > +s->clear = true; > > +} > > +} > > + > > +/* Functions recording PMD metrics per iteration. */ > > + > > +inline void > > +pmd_perf_start_iteration(struct pmd_perf_stats *s) > > +{ > > Clang will complain that the mutex must be exclusively held for > s->stats_mutex in this function. > I can add this with the following incremental > > --- a/lib/dpif-netdev-perf.c > +++ b/lib/dpif-netdev-perf.c > @@ -417,6 +417,7 @@ pmd_perf_stats_clear(struct pmd_perf_stats *s) > inline void > pmd_perf_start_iteration(struct pmd_perf_stats *s) > { > +ovs_mutex_lock(>stats_mutex); > if (s->clear) { > /* Clear the PMD stats before starting next iteration. */ > pmd_perf_stats_clear_lock(s); > @@ -433,6 +434,7 @@ pmd_perf_start_iteration(struct pmd_perf_stats *s) > /* In case last_tsc has never been set before. */ > s->start_tsc = cycles_counter_update(s); > } > +ovs_mutex_unlock(>stats_mutex); > } > > But in that case is the ovs_mutex_trylock in function pmd_perf_stats_clear() > made redundant? > > Ian > > > +if (s->clear) { > > +/* Clear the PMD stats before starting next iteration. */ > > +pmd_perf_stats_clear_lock(s); > > +} > > +s->iteration_cnt++; > > +/* Initialize the current interval stats. */ > > +memset(>current, 0, sizeof(struct iter_stats)); > > +if (OVS_LIKELY(s->last_tsc)) { > > +/* We assume here that last_tsc was updated immediately prior at > > + * the end of the previous iteration, or just before the first > > + * iteration. */ > > +s->start_tsc = s->last_tsc; > > +} else { > > +/* In case last_tsc has never been set before. 
*/ > > +s->start_tsc = cycles_counter_update(s); > > +} > > +} > > + ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH v12 1/3] netdev: Add optional qfill output parameter to rxq_recv()
If the caller provides a non-NULL qfill pointer and the netdev implementation supports reading the rx queue fill level, the rxq_recv() function returns the remaining number of packets in the rx queue after reception of the packet burst to the caller. If the implementation does not support this, it returns -ENOTSUP instead. Reading the remaining queue fill level should not substantially slow down the recv() operation. A first implementation is provided for Ethernet and vhostuser DPDK ports in netdev-dpdk.c. This output parameter will be used in the upcoming commit for PMD performance metrics to supervise the rx queue fill level for DPDK vhostuser ports. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> --- lib/dpif-netdev.c | 2 +- lib/netdev-bsd.c | 8 +++- lib/netdev-dpdk.c | 41 - lib/netdev-dummy.c| 8 +++- lib/netdev-linux.c| 7 ++- lib/netdev-provider.h | 8 +++- lib/netdev.c | 5 +++-- lib/netdev.h | 3 ++- 8 files changed, 69 insertions(+), 13 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index be31fd0..7ce3943 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -3277,7 +3277,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd, pmd->ctx.last_rxq = rxq; dp_packet_batch_init(&batch); -error = netdev_rxq_recv(rxq->rx, &batch); +error = netdev_rxq_recv(rxq->rx, &batch, NULL); if (!error) { /* At least one packet received. 
*/ *recirc_depth_get() = 0; diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 05974c1..b70f327 100644 --- a/lib/netdev-bsd.c +++ b/lib/netdev-bsd.c @@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq, struct dp_packet *buffer) } static int -netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch) +netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, +int *qfill) { struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_); struct netdev *netdev = rxq->up.netdev; @@ -643,6 +644,11 @@ netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch) batch->packets[0] = packet; batch->count = 1; } + +if (qfill) { +*qfill = -ENOTSUP; +} + return retval; } diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index ee39cbe..a4fc382 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -1812,13 +1812,13 @@ netdev_dpdk_vhost_update_rx_counters(struct netdev_stats *stats, */ static int netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, - struct dp_packet_batch *batch) + struct dp_packet_batch *batch, int *qfill) { struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev); struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev); uint16_t nb_rx = 0; uint16_t dropped = 0; -int qid = rxq->queue_id; +int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ; int vid = netdev_dpdk_get_vid(dev); if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured @@ -1826,14 +1826,23 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, return EAGAIN; } -nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM + VIRTIO_TXQ, -dev->mp, +nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp, (struct rte_mbuf **) batch->packets, NETDEV_MAX_BURST); if (!nb_rx) { return EAGAIN; } +if (qfill) { +if (nb_rx == NETDEV_MAX_BURST) { +/* The DPDK API returns a uint32_t which often has invalid bits in + * the upper 16-bits. Need to restrict the value to uint16_t. 
*/ +*qfill = rte_vhost_rx_queue_count(vid, qid) & UINT16_MAX; +} else { +*qfill = 0; +} +} + if (policer) { dropped = nb_rx; nb_rx = ingress_policer_run(policer, @@ -1854,7 +1863,8 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, } static int -netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch) +netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch, + int *qfill) { struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq); struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev); @@ -1891,6 +1901,14 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch) batch->count = nb_rx; dp_packet_batch_init_packet_fields(batch); +if (qfill) { +if (nb_rx == NETDEV_MAX_BURST) { +*qfill = rte_eth_rx_queue_count(rx->port_id, rxq->queue_id); +} else { +*qfill = 0; +} +} + return 0; } @@ -3172,6 +3190,19 @@ vr
[ovs-dev] [PATCH v12 2/3] dpif-netdev: Detailed performance stats for PMDs
This patch instruments the dpif-netdev datapath to record detailed statistics of what is happening in every iteration of a PMD thread. The collection of detailed statistics can be controlled by a new Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". By default it is disabled. The run-time overhead, when enabled, is in the order of 1%. The covered metrics per iteration are: - cycles - packets - (rx) batches - packets/batch - max. vhostuser qlen - upcalls - cycles spent in upcalls This raw recorded data is used threefold: 1. In histograms for each of the following metrics: - cycles/iteration (log.) - packets/iteration (log.) - cycles/packet - packets/batch - max. vhostuser qlen (log.) - upcalls - cycles/upcall (log) The histogram bins are divided linearly or logarithmically. 2. A cyclic history of the above statistics for 999 iterations 3. A cyclic history of the cumulative/average values per millisecond wall clock for the last 1000 milliseconds: - number of iterations - avg. cycles/iteration - packets (Kpps) - avg. packets/batch - avg. max vhost qlen - upcalls - avg. cycles/upcall The gathered performance metrics can be printed at any time with the new CLI command ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len] [-pmd core] [dp] The options are -nh: Suppress the histograms -it iter_len: Display the last iter_len iteration stats -ms ms_len: Display the last ms_len millisecond stats -pmd core: Display only the specified PMD The performance statistics are reset with the existing dpif-netdev/pmd-stats-clear command. 
The output always contains the following global PMD statistics, similar to the pmd-stats-show command: Time: 15:24:55.270 Measurement duration: 1.008 s pmd thread numa_id 0 core_id 1: Cycles:2419034712 (2.40 GHz) Iterations:572817 (1.76 us/it) - idle:486808 (15.9 % cycles) - busy: 86009 (84.1 % cycles) Rx packets: 2399607 (2381 Kpps, 848 cycles/pkt) Datapath passes: 3599415 (1.50 passes/pkt) - EMC hits:336472 ( 9.3 %) - Megaflow hits: 3262943 (90.7 %, 1.00 subtbl lookups/hit) - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) - Lost upcalls: 0 ( 0.0 %) Tx packets: 2399607 (2381 Kpps) Tx batches:171400 (14.00 pkts/batch) Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> --- NEWS| 4 + lib/automake.mk | 1 + lib/dpif-netdev-perf.c | 462 +++- lib/dpif-netdev-perf.h | 197 --- lib/dpif-netdev-unixctl.man | 157 +++ lib/dpif-netdev.c | 187 -- manpages.mk | 2 + vswitchd/ovs-vswitchd.8.in | 27 +-- vswitchd/vswitch.xml| 12 ++ 9 files changed, 985 insertions(+), 64 deletions(-) create mode 100644 lib/dpif-netdev-unixctl.man diff --git a/NEWS b/NEWS index cd4ffbb..a665c7f 100644 --- a/NEWS +++ b/NEWS @@ -23,6 +23,10 @@ Post-v2.9.0 other IPv4/IPv6-based protocols whenever a reject ACL rule is hit. * ACL match conditions can now match on Port_Groups as well as address sets that are automatically generated by Port_Groups. 
+ - Userspace datapath: + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD + * Detailed PMD performance metrics available with new command + ovs-appctl dpif-netdev/pmd-perf-show v2.9.0 - 19 Feb 2018 diff --git a/lib/automake.mk b/lib/automake.mk index 915a33b..3276aaa 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -491,6 +491,7 @@ MAN_FRAGMENTS += \ lib/dpctl.man \ lib/memory-unixctl.man \ lib/netdev-dpdk-unixctl.man \ + lib/dpif-netdev-unixctl.man \ lib/ofp-version.man \ lib/ovs.tmac \ lib/service.man \ diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index f06991a..caa0e27 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -15,18 +15,333 @@ */ #include +#include +#include "dpif-netdev-perf.h" #include "openvswitch/dynamic-string.h" #include "openvswitch/vlog.h" -#include "dpif-netdev-perf.h" +#include "ovs-thread.h" #include "timeval.h" VLOG_DEFINE_THIS_MODULE(pmd_perf); +#ifdef DPDK_NETDEV +static uint64_t +get_tsc_hz(void) +{ +return rte_get_tsc_hz(); +} +#else +/* This function is only invoked from PMD threads which depend on DPDK. + * A dummy function is sufficient when building without DPDK_NETDEV. */ +static uint64_t +get_tsc_hz(void) +{ +return 1; +} +#endif + +/* Histogram functions. */ + +static void +histogram_walls_set_lin(struct histogram *hist, uint32_t
[ovs-dev] [PATCH v12 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations
This patch enhances dpif-netdev-perf to detect iterations with suspicious statistics according to the following criteria: - iteration lasts longer than US_THR microseconds (default 250). This can be used to capture events where a PMD is blocked or interrupted for such a period of time that there is a risk for dropped packets on any of its Rx queues. - max vhost qlen exceeds a threshold Q_THR (default 128). This can be used to infer virtio queue overruns and dropped packets inside a VM, which are not visible in OVS otherwise. Such suspicious iterations can be logged together with their iteration statistics to be able to correlate them to packet drop or other events outside OVS. A new command is introduced to enable/disable logging at run-time and to adjust the above thresholds for suspicious iterations: ovs-appctl dpif-netdev/pmd-perf-log-set on | off [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen] Turn logging on or off at run-time (on|off). -b before: The number of iterations before the suspicious iteration to be logged (default 5). -a after: The number of iterations after the suspicious iteration to be logged (default 5). -e: Extend logging interval if another suspicious iteration is detected before logging occurs. -ne: Do not extend logging interval (default). -q qlen: Suspicious vhost queue fill level threshold. Increase this to 512 if Qemu supports 1024 virtio queue length. (default 128). -us usec: Change the duration threshold for a suspicious iteration (default 250 us). Note: Logging of suspicious iterations itself consumes a considerable amount of processing cycles of a PMD which may be visible in the iteration history. In the worst case this can lead OVS to detect another suspicious iteration caused by logging. If more than 100 iterations around a suspicious iteration have been logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that logging itself causes continuous further logging. 
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> --- NEWS| 2 + lib/dpif-netdev-perf.c | 223 lib/dpif-netdev-perf.h | 21 + lib/dpif-netdev-unixctl.man | 59 lib/dpif-netdev.c | 5 + 5 files changed, 310 insertions(+) diff --git a/NEWS b/NEWS index a665c7f..7259492 100644 --- a/NEWS +++ b/NEWS @@ -27,6 +27,8 @@ Post-v2.9.0 * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD * Detailed PMD performance metrics available with new command ovs-appctl dpif-netdev/pmd-perf-show + * Supervision of PMD performance metrics and logging of suspicious + iterations v2.9.0 - 19 Feb 2018 diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index caa0e27..47ce2c2 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -25,6 +25,24 @@ VLOG_DEFINE_THIS_MODULE(pmd_perf); +#define ITER_US_THRESHOLD 250 /* Warning threshold for iteration duration + in microseconds. */ +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */ +#define LOG_IT_BEFORE 5 /* Number of iterations to log before + suspicious iteration. */ +#define LOG_IT_AFTER 5 /* Number of iterations to log after + suspicious iteration. 
*/ + +bool log_enabled = false; +bool log_extend = false; +static uint32_t log_it_before = LOG_IT_BEFORE; +static uint32_t log_it_after = LOG_IT_AFTER; +static uint32_t log_us_thr = ITER_US_THRESHOLD; +uint32_t log_q_thr = VHOST_QUEUE_FULL; +uint64_t iter_cycle_threshold; + +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600, 600); + #ifdef DPDK_NETDEV static uint64_t get_tsc_hz(void) @@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s) histogram_walls_set_log(&s->max_vhost_qfill, 0, 512); s->iteration_cnt = 0; s->start_ms = time_msec(); +s->log_susp_it = UINT32_MAX; +s->log_begin_it = UINT32_MAX; +s->log_end_it = UINT32_MAX; +s->log_reason = NULL; } void @@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s) history_init(&s->milliseconds); s->start_ms = time_msec(); s->milliseconds.sample[0].timestamp = s->start_ms; +s->log_susp_it = UINT32_MAX; +s->log_begin_it = UINT32_MAX; +s->log_end_it = UINT32_MAX; +s->log_reason = NULL; /* Clearing finished. */ s->clear = false; ovs_mutex_unlock(&s->clear_mutex); @@ -442,6 +468,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, uint64_t now_tsc = cycles_counter_update(s); struct iter_stats *cum_ms; uint64_t cycles, cycles_per
[ovs-dev] [PATCH v12 0/3] dpif-netdev: Detailed PMD performance metrics and supervision
The run-time performance of PMDs is often difficult to understand and trouble-shoot. The existing PMD statistics counters only provide a coarse-grained average picture. At packet rates of several Mpps sporadic drops of packet bursts happen at sub-millisecond time scales and are impossible to capture and analyze with existing tools. This patch collects a large number of important PMD performance metrics per PMD iteration, maintaining histograms and circular histories for iteration metrics and millisecond averages. To capture sporadic drop events, the patch set can be configured to monitor iterations for suspicious metrics and to log the neighborhood of such iterations for off-line analysis. The extra cost for the performance metric collection and the supervision has been measured to be in the order of 1% compared to the base commit in a PVP setup with L3 pipeline over VXLAN tunnels. For that reason the metrics collection is disabled by default and can be enabled at run-time through configuration. v11 -> v12: * Rebased to master (commit 83c2757bd) * Clarified meaning of recv() param *qfill in netdev-provider.h (Ben) v10 -> v11: * Rebased to master (commit 00a0a011d) * Implemented comments on v10 by Ilya, Aaron and Ian. * Replaced broken macro ATOMIC_LLONG_LOCK_FREE with working macro ATOMIC_ALWAYS_LOCK_FREE_8B. * Changed iteration key in iteration history from TSC timestamp to iteration counter. * Bugfix: Suspicious iteration logged was one off the actual suspicious iteration. v9 -> v10: * Implemented missed comment by Ilya on v8: use ATOMIC_LLONG_LOCK_FREE * Fixed travis and checkpatch errors reported by Ian on v9. v8 -> v9: * Rebased to master (commit cb8cbbbe9) * Implemented minor comments on v8 by Billy v7 -> v8: * Rebased on to master (commit 4e99b70df) * Implemented comments from Ilya Maximets and Billy O'Mahony. * Replaced netdev_rxq_length() introduced in v7 by optional out parameter for the remaining rx queue len in netdev_rxq_recv(). 
* Fixed thread synchronization issues in clearing PMD stats: - Use mutex to control whether to clear from main thread directly or in PMD at start of next iteration. - Use mutex to prevent concurrent clearing and printing of metrics. * Added tx packet and batch stats to pmd-perf-show output. * Delay warning for suspicious iteration to the iteration in which we also log the neighborhood to not pollute the logged iteration stats with logging costs. * Corrected the exact number of iterations logged before and after a suspicious iteration. * Introduced options -e and -ne in pmd-perf-log-set to control whether to *extend* the range of logged iterations when additional suspicious iterations are detected before the scheduled end of logging interval is reached. * Exclude logging cycles from the iteration stats to avoid confusing ghost peaks. * Performance impact compared to master less than 1% even with supervision enabled. v5 -> v7: * Rebased on to dpdk_merge (commit e68) - New base contains earlier refactoring parts of series. * Implemented comments from Ilya Maximets and Billy O'Mahony. * Replaced piggybacking qlen on dp_packet_batch with a new netdev API netdev_rxq_length(). * Thread-safe clearing of pmd counters in pmd_perf_start_iteration(). * Fixed bug in reporting datapath stats. * Work-around a bug in DPDK rte_vhost_rx_queue_count() which sometimes returns bogus values in the upper 16 bits of the uint32_t return value. v4 -> v5: * Rebased to master (commit e9de6c0) * Implemented comments from Aaron Conole and Darrel Ball v3 -> v4: * Rebased to master (commit 4d0a31b) - Reverting changes to struct dp_netdev_pmd_thread. * Make metrics collection configurable. * Several bugfixes. v2 -> v3: * Rebased to OVS master (commit 3728b3b). * Non-trivial adaptation to struct dp_netdev_pmd_thread. - refactored in commit a807c157 (Bhanu). * No other changes compared to v2. v1 -> v2: * Rebased to OVS master (commit 7468ec788). * No other changes compared to v1. 
Jan Scheurich (3): netdev: Add optional qfill output parameter to rxq_recv() dpif-netdev: Detailed performance stats for PMDs dpif-netdev: Detection and logging of suspicious PMD iterations NEWS| 6 + lib/automake.mk | 1 + lib/dpif-netdev-perf.c | 685 +++- lib/dpif-netdev-perf.h | 218 -- lib/dpif-netdev-unixctl.man | 216 ++ lib/dpif-netdev.c | 192 - lib/netdev-bsd.c| 8 +- lib/netdev-dpdk.c | 41 ++- lib/netdev-dummy.c | 8 +- lib/netdev-linux.c | 7 +- lib/netdev-provider.h | 8 +- lib/netdev.c| 5 +- lib/netdev.h| 3 +- manpages.mk | 2 + vswitchd/ovs-vswitchd.8.in | 27 +- vswitchd/vswitch.xml| 12 + 16 files changed, 1363 insertions(+), 76 deletions(-) create mode 1006
Re: [ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
> How about this approach, which should cleanly eliminate the warning? > > diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c > index e1a5c097f3aa..362339a4abb4 100644 > --- a/ofproto/ofproto-dpif.c > +++ b/ofproto/ofproto-dpif.c > @@ -4780,22 +4780,17 @@ group_setup_dp_hash_table(struct group_dpif *group, > size_t max_hash) > > /* Use Webster method to distribute hash values over buckets. */ > for (int hash = 0; hash < n_hash; hash++) { > -double max_val = 0.0; > -struct webster *winner; > -for (i = 0; i < n_buckets; i++) { > -if (webster[i].value > max_val) { > -max_val = webster[i].value; > +struct webster *winner = &webster[0]; > +for (i = 1; i < n_buckets; i++) { > +if (webster[i].value > winner->value) { > winner = &webster[i]; > } > } > -#pragma GCC diagnostic push > -#pragma GCC diagnostic ignored "-Wmaybe-uninitialized" > /* winner is a reference to a webster[] element initialized above. */ > winner->divisor += 2; > winner->value = (double) winner->bucket->weight / winner->divisor; > group->hash_map[hash] = winner->bucket; > winner->bucket->aux++; > -#pragma GCC diagnostic pop > } Thank you, Ben, for your thorough checks. Yes, your approach is better and compiles w/o warnings. Regards, Jan
Re: [ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
Hi Ychen, Thank you for finding yet another corner case. I will fix it in the next version with the following incremental: diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index 8f71083..674b3b5 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -4762,6 +4762,11 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash) VLOG_DBG(" Minimum weight: %d, total weight: %.0f", min_weight, total_weight); +if (total_weight == 0) { +VLOG_DBG(" Total weight is zero. No active buckets."); +return false; +} + uint32_t min_slots = ceil(total_weight / min_weight); n_hash = MAX(16, 1L << log_2_ceil(min_slots)); I would like to mention your contribution with a Tested-By: tag in the commit messages. Would that be ok? Which real name should I put? BR, Jan From: ychen [mailto:ychen103...@163.com] Sent: Tuesday, 17 April, 2018 13:22 To: Jan Scheurich <jan.scheur...@ericsson.com> Cc: d...@openvswitch.org; Nitin Katiyar <nitin.kati...@ericsson.com>; b...@ovn.org Subject: Re: [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups Hi, Jan: I think the following code should also be modified + for (int hash = 0; hash < n_hash; hash++) { + double max_val = 0.0; + struct webster *winner; +for (i = 0; i < n_buckets; i++) { +if (webster[i].value > max_val) { ===> if bucket->weight=0, and there is only one bucket with weight equal to 0, then winner will be NULL +max_val = webster[i].value; +winner = &webster[i]; +} +} Test like this command: ovs-ofctl add-group br-int -O openflow15 "group_id=2,type=select,selection_method=dp_hash,bucket=bucket_id=1,weight=0,actions=output:10" vswitchd crashed after this command was issued.
Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero
> > > > OK. > > > > I am going to sit on this for a few days and see whether anyone reports > > unusual issues. If nothing arises, I'll backport as far as reasonable. > > I backported to branch-2.9 and branch-2.8. Thanks, Ben.
[ovs-dev] [PATCH v2 3/3] ofproto-dpif: Use dp_hash as default selection method
The dp_hash selection method for select groups overcomes the scalability problems of the current default selection method which, due to L2-L4 hashing during xlation and un-wildcarding of the hashed fields, basically requires an upcall to the slow path to load-balance every L4 connection. The consequences are an explosion of datapath flows (megaflows degenerate to miniflows) and a limitation of the connection setup rate OVS can handle. This commit changes the default selection method to dp_hash, provided the bucket configuration is such that the dp_hash method can accurately represent the bucket weights with up to 64 hash values. Otherwise we stick to the original default hash method. We use the new dp_hash algorithm OVS_HASH_L4_SYMMETRIC to maintain the symmetry property of the old default hash method. A controller can explicitly request the old default hash selection method by specifying selection method "hash" with an empty list of fields in the Group properties of the OpenFlow 1.5 Group Mod message. Update the documentation about selection method in the ovs-ofctl man page. Revise and complete the ofproto-dpif unit test cases for select groups. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- NEWS | 2 + lib/ofp-group.c| 15 ++- ofproto/ofproto-dpif.c | 35 +++-- ofproto/ofproto-dpif.h | 1 + ofproto/ofproto-provider.h | 2 +- tests/mpls-xlate.at| 26 ++-- tests/ofproto-dpif.at | 322 +++-- utilities/ovs-ofctl.8.in | 47 --- 8 files changed, 336 insertions(+), 114 deletions(-) diff --git a/NEWS b/NEWS index 58a7b58..4ea03bf 100644 --- a/NEWS +++ b/NEWS @@ -14,6 +14,8 @@ Post-v2.9.0 - ovs-vsctl: New commands "add-bond-iface" and "del-bond-iface". - OpenFlow: * OFPT_ROLE_STATUS is now available in OpenFlow 1.3. + * Default selection method for select groups is now dp_hash with improved + accuracy. 
- Linux kernel 4.14 * Add support for compiling OVS with the latest Linux 4.14 kernel - ovn: diff --git a/lib/ofp-group.c b/lib/ofp-group.c index 31b0437..c5ddc65 100644 --- a/lib/ofp-group.c +++ b/lib/ofp-group.c @@ -1518,12 +1518,17 @@ parse_group_prop_ntr_selection_method(struct ofpbuf *payload, return OFPERR_OFPBPC_BAD_VALUE; } -error = oxm_pull_field_array(payload->data, fields_len, - >fields); -if (error) { -OFPPROP_LOG(, false, +if (fields_len > 0) { +error = oxm_pull_field_array(payload->data, fields_len, +>fields); +if (error) { +OFPPROP_LOG(, false, "ntr selection method fields are invalid"); -return error; +return error; +} +} else { +/* Selection_method "hash: w/o fields means default hash method. */ +gp->fields.values_size = 0; } return 0; diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c index e1a5c09..8f71083 100644 --- a/ofproto/ofproto-dpif.c +++ b/ofproto/ofproto-dpif.c @@ -4814,39 +4814,54 @@ group_set_selection_method(struct group_dpif *group) struct ofputil_group_props *props = >up.props; char *selection_method = props->selection_method; +VLOG_DBG("Constructing select group %"PRIu32, group->up.group_id); if (selection_method[0] == '\0') { -VLOG_INFO("No selection method specified."); -group->selection_method = SEL_METHOD_DEFAULT; - +VLOG_DBG("No selection method specified. Trying dp_hash."); +/* If the controller has not specified a selection method, check if + * the dp_hash selection method with max 64 hash values is appropriate + * for the given bucket configuration. */ +if (group_setup_dp_hash_table(group, 64)) { +/* Use dp_hash selection method with symmetric L4 hash. */ +VLOG_DBG("Use dp_hash with %d hash values.", + group->hash_mask + 1); +group->selection_method = SEL_METHOD_DP_HASH; +group->hash_alg = OVS_HASH_ALG_SYM_L4; +group->hash_basis = 0xdeadbeef; +} else { +/* Fall back to original default hashing in slow path. 
*/ +VLOG_DBG("Falling back to default hash method."); +group->selection_method = SEL_METHOD_DEFAULT; +} } else if (!strcmp(selection_method, "dp_hash")) { -VLOG_INFO("Selection method specified: dp_hash."); +VLOG_DBG("Selection method specified: dp_hash."); /* Try to use dp_hash if possible at all. */ if (group_setup_dp_hash_table(group, 0)) { group-
[ovs-dev] [PATCH v2 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
The current implementation of the "dp_hash" selection method suffers from two deficiencies: 1. The hash mask and hence the number of dp_hash values is just large enough to cover the number of group buckets, but does not consider the case that buckets have different weights. 2. The xlate-time selection of the best bucket from the masked dp_hash value often results in bucket load distributions that are quite different from the bucket weights because the number of available masked dp_hash values is too small (2-6 bits compared to 32 bits of a full hash in the default hash selection method). This commit provides a more accurate implementation of the dp_hash select group by applying the well-known Webster method for distributing a small number of "seats" fairly over the weighted "parties" (see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method). The dp_hash mask is automatically chosen large enough to provide good enough accuracy even with widely differing weights. This distribution happens at group modification time and the resulting table is stored with the group-dpif struct. At xlation time, we use the masked dp_hash values as index to look up the assigned bucket. If the bucket should not be live, we do a circular search over the mapping table until we find the first live bucket. As the buckets in the table are by construction in pseudo-random order with a frequency according to their weight, this method maintains correct distribution even if one or more buckets are non-live. Xlation is further simplified by storing some derived select group state at group construction in struct group-dpif in a form better suited for xlation purposes. Adapted the unit test case for dp_hash select group accordingly. 
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- include/openvswitch/ofp-group.h | 1 + ofproto/ofproto-dpif-xlate.c| 74 +--- ofproto/ofproto-dpif.c | 146 ofproto/ofproto-dpif.h | 13 tests/ofproto-dpif.at | 18 +++-- 5 files changed, 221 insertions(+), 31 deletions(-) diff --git a/include/openvswitch/ofp-group.h b/include/openvswitch/ofp-group.h index 8d893a5..af4033d 100644 --- a/include/openvswitch/ofp-group.h +++ b/include/openvswitch/ofp-group.h @@ -47,6 +47,7 @@ struct bucket_counter { /* Bucket for use in groups. */ struct ofputil_bucket { struct ovs_list list_node; +uint16_t aux; /* Padding. Also used for temporary data. */ uint16_t weight;/* Relative weight, for "select" groups. */ ofp_port_t watch_port; /* Port whose state affects whether this bucket * is live. Only required for fast failover diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index c8baba1..df245c5 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -4235,35 +4235,55 @@ xlate_hash_fields_select_group(struct xlate_ctx *ctx, struct group_dpif *group, } } +static struct ofputil_bucket * +group_dp_hash_best_bucket(struct xlate_ctx *ctx, + const struct group_dpif *group, + uint32_t dp_hash) +{ +struct ofputil_bucket *bucket, *best_bucket = NULL; +uint32_t n_hash = group->hash_mask + 1; + +uint32_t hash = dp_hash &= group->hash_mask; +ctx->wc->masks.dp_hash |= group->hash_mask; + +/* Starting from the original masked dp_hash value iterate over the + * hash mapping table to find the first live bucket. As the buckets + * are quasi-randomly spread over the hash values, this maintains + * a distribution according to bucket weights even when some buckets + * are non-live. 
*/ +for (int i = 0; i < n_hash; i++) { +bucket = group->hash_map[(hash + i) % n_hash]; +if (bucket_is_alive(ctx, bucket, 0)) { +best_bucket = bucket; +break; +} +} + +return best_bucket; +} + static void xlate_dp_hash_select_group(struct xlate_ctx *ctx, struct group_dpif *group, bool is_last_action) { -struct ofputil_bucket *bucket; - /* dp_hash value 0 is special since it means that the dp_hash has not been * computed, as all computed dp_hash values are non-zero. Therefore * compare to zero can be used to decide if the dp_hash value is valid * without masking the dp_hash field. */ if (!ctx->xin->flow.dp_hash) { -uint64_t param = group->up.props.selection_method_param; - -ctx_trigger_recirculate_with_hash(ctx, param >> 32, (uint32_t)param); +ctx_trigger_recirculate_with_hash(ctx, group->hash_alg, + group->has
[ovs-dev] [PATCH v2 1/3] userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm
This commit implements a new dp_hash algorithm OVS_HASH_L4_SYMMETRIC in the netdev datapath. It will be used as default hash algorithm for the dp_hash-based select groups in a subsequent commit to maintain compatibility with the symmetry property of the current default hash selection method. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.kati...@ericsson.com> Co-authored-by: Nitin Katiyar <nitin.kati...@ericsson.com> --- datapath/linux/compat/include/linux/openvswitch.h | 4 +++ lib/flow.c| 42 +-- lib/flow.h| 1 + lib/odp-execute.c | 23 +++-- 4 files changed, 65 insertions(+), 5 deletions(-) diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h index 84ebcaf..2bb3cb2 100644 --- a/datapath/linux/compat/include/linux/openvswitch.h +++ b/datapath/linux/compat/include/linux/openvswitch.h @@ -720,6 +720,10 @@ struct ovs_action_push_vlan { */ enum ovs_hash_alg { OVS_HASH_ALG_L4, +#ifndef __KERNEL__ + OVS_HASH_ALG_SYM_L4, +#endif + __OVS_HASH_MAX }; /* diff --git a/lib/flow.c b/lib/flow.c index 09b66b8..9d8c1ca 100644 --- a/lib/flow.c +++ b/lib/flow.c @@ -2108,6 +2108,44 @@ flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis) return jhash_bytes(, sizeof fields, basis); } +/* Symmetrically Hashes non-IP 'flow' based on its L2 headers. 
*/ +uint32_t +flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis) +{ +union { +struct { +ovs_be16 eth_type; +ovs_be16 vlan_tci; +struct eth_addr eth_addr; +ovs_be16 pad; +}; +uint32_t word[3]; +} fields; + +uint32_t hash = basis; +int i; + +if (flow->packet_type != htons(PT_ETH)) { +/* Cannot hash non-Ethernet flows */ +return 0; +} + +for (i = 0; i < ARRAY_SIZE(fields.eth_addr.be16); i++) { +fields.eth_addr.be16[i] = +flow->dl_src.be16[i] ^ flow->dl_dst.be16[i]; +} +for (i = 0; i < FLOW_MAX_VLAN_HEADERS; i++) { +fields.vlan_tci ^= flow->vlans[i].tci & htons(VLAN_VID_MASK); +} +fields.eth_type = flow->dl_type; +fields.pad = 0; + +hash = hash_add(hash, fields.word[0]); +hash = hash_add(hash, fields.word[1]); +hash = hash_add(hash, fields.word[2]); +return hash_finish(hash, basis); +} + /* Hashes 'flow' based on its L3 through L4 protocol information */ uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, @@ -2128,8 +2166,8 @@ flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, hash = hash_add64(hash, a[i] ^ b[i]); } } else { -/* Cannot hash non-IP flows */ -return 0; +/* Revert to hashing L2 headers */ +return flow_hash_symmetric_l2(flow, basis); } hash = hash_add(hash, flow->nw_proto); diff --git a/lib/flow.h b/lib/flow.h index af82931..900e8f8 100644 --- a/lib/flow.h +++ b/lib/flow.h @@ -236,6 +236,7 @@ hash_odp_port(odp_port_t odp_port) uint32_t flow_hash_5tuple(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l4(const struct flow *flow, uint32_t basis); +uint32_t flow_hash_symmetric_l2(const struct flow *flow, uint32_t basis); uint32_t flow_hash_symmetric_l3l4(const struct flow *flow, uint32_t basis, bool inc_udp_ports ); diff --git a/lib/odp-execute.c b/lib/odp-execute.c index 1969f02..c716c41 100644 --- a/lib/odp-execute.c +++ b/lib/odp-execute.c @@ -726,14 +726,16 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal, } switch ((enum ovs_action_attr) type) { + case 
OVS_ACTION_ATTR_HASH: { const struct ovs_action_hash *hash_act = nl_attr_get(a); -/* Calculate a hash value directly. This might not match the +/* Calculate a hash value directly. This might not match the * value computed by the datapath, but it is much less expensive, * and the current use case (bonding) does not require a strict * match to work properly. */ -if (hash_act->hash_alg == OVS_HASH_ALG_L4) { +switch (hash_act->hash_alg) { +case OVS_HASH_ALG_L4: { struct flow flow; uint32_t hash; @@ -749,7 +751,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal, } packet->md.dp_hash = hash; } -} else { +break; +} +case OVS_HASH_ALG_SYM_L4: { +struct flow flow; +uint32_t hash; + +DP_PACKET_BATCH_FOR_EACH (i, pa
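The symmetry property this patch introduces (both directions of a connection hash to the same value) comes from XOR-folding the source and destination fields before mixing, as the `flow->dl_src.be16[i] ^ flow->dl_dst.be16[i]` and `a[i] ^ b[i]` terms above show. The following toy sketch demonstrates the property in isolation; the function name and mixing constant are made up for illustration and are not the OVS implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of the symmetry behind OVS_HASH_ALG_SYM_L4: XOR-folding
 * src/dst fields before mixing makes the hash invariant under
 * direction reversal.  Illustrative only; the real code lives in
 * flow_hash_symmetric_l3l4() / flow_hash_symmetric_l2(). */
static uint32_t
toy_sym_hash(uint32_t src_ip, uint32_t dst_ip,
             uint16_t src_port, uint16_t dst_port, uint8_t proto)
{
    uint32_t h = src_ip ^ dst_ip;               /* symmetric in IPs */

    h = h * 0x9e3779b9u + (uint32_t) (src_port ^ dst_port);
    h = h * 0x9e3779b9u + proto;
    return h;
}
```

Because every src/dst pair enters the hash only through an XOR, swapping the two endpoints cannot change the result, which is what keeps both directions of a flow on the same group bucket.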
[ovs-dev] [PATCH v2 0/3] Use improved dp_hash select group by default
The current default OpenFlow select group implementation sends every new L4 flow to the slow path for the balancing decision and installs a 5-tuple "miniflow" in the datapath to forward subsequent packets of the connection accordingly. Clearly this has major scalability issues with many parallel L4 flows and high connection setup rates. The dp_hash selection method for the OpenFlow select group was added to OVS as an alternative. It avoids the scalability issues for the price of an additional recirculation in the datapath. The dp_hash method is only available to OF1.5 SDN controllers speaking the Netronome Group Mod extension to configure the selection mechanism. This severely limited the applicability of the dp_hash select group in the past. Furthermore, testing revealed that the implemented dp_hash selection often generated a very uneven distribution of flows over group buckets and didn't consider bucket weights at all. The present patch set in a first step improves the dp_hash selection method to much more accurately distribute flows over weighted group buckets. In a second step it makes the improved dp_hash method the default in OVS for select groups that can be accurately handled by dp_hash. That should be the vast majority of cases. Otherwise we fall back to the legacy slow-path selection method. The Netronome extension can still be used to override the default decision and require the legacy slow-path or the dp_hash selection method. 
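The claim that the improved dp_hash method "accurately distribute[s] flows over weighted group buckets" boils down to precomputing a power-of-two table of hash slots whose entries are assigned to buckets in proportion to their weights, and then selecting a bucket with a single masked lookup on the packet's dp_hash. The sketch below shows one plausible allocation scheme; the names, table size, and the smooth largest-remainder allocation are illustrative assumptions, not the actual ofproto-dpif code.

```c
#include <assert.h>
#include <stdint.h>

#define N_SLOTS 16                      /* must be a power of two */

/* Assign N_SLOTS hash slots to buckets in proportion to their weights:
 * every step each bucket earns weight/total credit and the richest
 * bucket buys the next slot (smooth weighted round-robin). */
static void
fill_slot_table(const uint16_t *weights, int n_buckets, uint8_t *slots)
{
    double credit[n_buckets];
    uint32_t total = 0;

    for (int b = 0; b < n_buckets; b++) {
        credit[b] = 0.0;
        total += weights[b];
    }
    for (int s = 0; s < N_SLOTS; s++) {
        int best = 0;
        for (int b = 0; b < n_buckets; b++) {
            credit[b] += (double) weights[b] / total;
            if (credit[b] > credit[best]) {
                best = b;
            }
        }
        credit[best] -= 1.0;
        slots[s] = (uint8_t) best;
    }
}

/* Datapath-side selection is then one masked table lookup, which is
 * why dp_hash avoids the slow path entirely. */
static int
select_bucket(uint32_t dp_hash, const uint8_t *slots)
{
    return slots[dp_hash & (N_SLOTS - 1)];
}
```

With weights 3:1 this allocation hands bucket 0 exactly 12 of the 16 slots, which is the kind of weight-accurate distribution the legacy dp_hash implementation lacked.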
v1 -> v2: - Fixed crashes for corner cases reported by Ychen - Fixed group ref leakage with dp_hash reported by Ychen - Changed all xlation logging from INFO to DBG - Revised, completed and detailed select group unit test cases in ofproto-dpif - Updated selection_method documentation in ovs-ofctl man page - Added NEWS item Jan Scheurich (3): userspace datapath: Add OVS_HASH_L4_SYMMETRIC dp_hash algorithm ofproto-dpif: Improve dp_hash selection method for select groups ofproto-dpif: Use dp_hash as default selection method NEWS | 2 + datapath/linux/compat/include/linux/openvswitch.h | 4 + include/openvswitch/ofp-group.h | 1 + lib/flow.c| 42 ++- lib/flow.h| 1 + lib/odp-execute.c | 23 +- lib/ofp-group.c | 15 +- ofproto/ofproto-dpif-xlate.c | 74 +++-- ofproto/ofproto-dpif.c| 161 +++ ofproto/ofproto-dpif.h| 14 + ofproto/ofproto-provider.h| 2 +- tests/mpls-xlate.at | 26 +- tests/ofproto-dpif.at | 314 +- utilities/ovs-ofctl.8.in | 47 ++-- 14 files changed, 599 insertions(+), 127 deletions(-) -- 1.9.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions
Just sent the adjusted version for 2.8. /Jan > -Original Message- > From: Ben Pfaff [mailto:b...@ovn.org] > Sent: Friday, 13 April, 2018 20:19 > To: Jan Scheurich <jan.scheur...@ericsson.com> > Cc: d...@openvswitch.org; yi.y.y...@intel.com > Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap actions > > Thanks for checking. I applied both patches to branch-2.9. For > branch-2.8, would you mind submitting the fixed-up patches? It would > save me a few minutes. > > Thanks, > > Ben. > > On Fri, Apr 06, 2018 at 05:37:20PM +, Jan Scheurich wrote: > > Yes that fix should be applied to branches 2.9 and 2.8. > > > > I checked that it applies and passes all unit tests pass. > > > > On branch-2.8 the patch for nsh.at patch must be slightly retrofitted as > > the datapath action names changed from > encap_nsh/decap_nsh to push_nsh/pop_nsh and the nsh_ttl field was introduced > in 2.9. > > > > diff --git a/tests/nsh.at b/tests/nsh.at > > index 6ae71b5..6eb4637 100644 > > --- a/tests/nsh.at > > +++ b/tests/nsh.at > > @@ -351,7 +351,7 @@ bridge("br0") > > > > Final flow: unchanged > > Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no > > -Datapath actions: > > push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x1122334 > > +Datapath actions: > > encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0 > > ]) > > > > AT_CHECK([ > > @@ -370,7 +370,7 @@ bridge("br0") > > > > Final flow: > > recirc_id=0x1,eth,in_port=4,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_ds > > Megaflow: > > recirc_id=0x1,packet_type=(1,0x894f),in_port=4,nsh_mdtype=1,nsh_np=3,nsh_spi > > -Datapath actions: pop_nsh(),recirc(0x2) > > +Datapath actions: decap_nsh(),recirc(0x2) > > ]) > > > > AT_CHECK([ > > @@ -407,8 +407,8 @@ ovs-appctl time/warp 1000 > > AT_CHECK([ > > ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v > > ipv6 | sort > > ], [0], [flow-dump from non-dpdk interfaces: > > 
-recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0 > > -recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1= > > +recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0 > > +recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1= > > > > recirc_id(0x4),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), > > packe > > ]) > > > > Thanks, Jan > > > > > -Original Message- > > > From: Ben Pfaff [mailto:b...@ovn.org] > > > Sent: Friday, 06 April, 2018 18:36 > > > To: Jan Scheurich <jan.scheur...@ericsson.com> > > > Cc: d...@openvswitch.org; yi.y.y...@intel.com > > > Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap > > > actions > > > > > > On Fri, Apr 06, 2018 at 09:35:48AM -0700, Ben Pfaff wrote: > > > > On Thu, Apr 05, 2018 at 04:11:02PM +0200, Jan Scheurich wrote: > > > > > Recent tests with NSH encap have shown that the translation of > > > > > multiple > > > > > subsequent encap() or decap() actions was incorrect. This patch set > > > > > corrects the handling and adds a unit test for NSH to cover two NSH > > > > > and one Ethernet encapsulation levels. > > > > > > > > Thanks. Should this be applied to branch-2.9? > > > > > > To be clear, I applied it to master just now. ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] [PATCH branch-2.8 1/2] xlate: Correct handling of double encap() actions
When the same encap() header was pushed twice onto a packet (e.g in the case of NSH in NSH), the translation logic only generated a datapath push action for the first encap() action. The second encap() did not emit a push action because the packet type was unchanged. commit_encap_decap_action() (renamed from commit_packet_type_change) must solely rely on ctx->pending_encap to generate an datapath push action. Similarly, the first decap() action on a double header packet does not change the packet_type either. Add a corresponding ctx->pending_decap flag and use that to trigger emitting a datapath pop action. Fixes: f839892a2 ("OF support and translation of generic encap and decap") Fixes: 1fc11c594 ("Generic encap and decap support for NSH") Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- lib/odp-util.c | 16 ++-- lib/odp-util.h | 1 + ofproto/ofproto-dpif-xlate.c | 7 ++- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/lib/odp-util.c b/lib/odp-util.c index 7f42b98..78cc903 100644 --- a/lib/odp-util.c +++ b/lib/odp-util.c @@ -6883,17 +6883,13 @@ odp_put_encap_nsh_action(struct ofpbuf *odp_actions, } static void -commit_packet_type_change(const struct flow *flow, +commit_encap_decap_action(const struct flow *flow, struct flow *base_flow, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool pending_encap, + bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { -if (flow->packet_type == base_flow->packet_type) { -return; -} - if (pending_encap) { switch (ntohl(flow->packet_type)) { case PT_ETH: { @@ -6918,7 +6914,7 @@ commit_packet_type_change(const struct flow *flow, * The check is done at action translation. */ OVS_NOT_REACHED(); } -} else { +} else if (pending_decap || flow->packet_type != base_flow->packet_type) { /* This is an explicit or implicit decap case. 
*/ if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE && base_flow->packet_type == htonl(PT_ETH)) { @@ -6957,14 +6953,14 @@ commit_packet_type_change(const struct flow *flow, enum slow_path_reason commit_odp_actions(const struct flow *flow, struct flow *base, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool use_masked, bool pending_encap, + bool use_masked, bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { enum slow_path_reason slow1, slow2; bool mpls_done = false; -commit_packet_type_change(flow, base, odp_actions, wc, - pending_encap, encap_data); +commit_encap_decap_action(flow, base, odp_actions, wc, + pending_encap, pending_decap, encap_data); commit_set_ether_action(flow, base, odp_actions, wc, use_masked); /* Make packet a non-MPLS packet before committing L3/4 actions, * which would otherwise do nothing. */ diff --git a/lib/odp-util.h b/lib/odp-util.h index 27c2ab4..9d6cc45 100644 --- a/lib/odp-util.h +++ b/lib/odp-util.h @@ -278,6 +278,7 @@ enum slow_path_reason commit_odp_actions(const struct flow *, struct flow_wildcards *wc, bool use_masked, bool pending_encap, + bool pending_decap, struct ofpbuf *encap_data); /* ofproto-dpif interface. diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 3890b2e..54fd06c 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -241,6 +241,8 @@ struct xlate_ctx { * true. */ bool pending_encap; /* True when waiting to commit a pending * encap action. */ +bool pending_decap; /* True when waiting to commit a pending + * decap action. */ struct ofpbuf *encap_data; /* May contain a pointer to an ofpbuf with * context for the datapath encap action.*/ @@ -3537,8 +3539,9 @@ xlate_commit_actions(struct xlate_ctx *ctx) ctx->xout->slow |= commit_odp_actions(>xin->flow, >base_flow, ctx->odp_actions, ctx->wc, use_masked, ctx->pending_encap, - ctx->encap_data); + ctx->pending_decap, ctx->encap_data); ctx->pending_encap = false; +ctx->pending_de
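The core of the fix above is that a commit triggered only by a packet-type change misses the second push/pop of a double encapsulation, because NSH-in-NSH leaves `packet_type` unchanged. The heavily simplified model below captures just that control flow; the struct and enum names are illustrative, not the real xlate context.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of commit_encap_decap_action(): explicit pending flags
 * drive the datapath push/pop actions, with an implicit-decap fallback
 * on a packet-type change.  Illustrative only. */
enum toy_pt { TOY_PT_ETH, TOY_PT_NSH };

struct toy_ctx {
    enum toy_pt cur, base;              /* current vs. committed type */
    bool pending_encap, pending_decap;
    int pushes, pops;                   /* datapath actions emitted */
};

static void
toy_commit(struct toy_ctx *ctx)
{
    if (ctx->pending_encap) {
        ctx->pushes++;                  /* explicit encap: always push */
    } else if (ctx->pending_decap || ctx->cur != ctx->base) {
        ctx->pops++;                    /* explicit or implicit decap */
    }
    ctx->pending_encap = false;
    ctx->pending_decap = false;
    ctx->base = ctx->cur;
}
```

Without the `pending_decap` flag the first commit below would see `cur == base` and emit nothing, which is exactly the bug the patch fixes for the encap side via `pending_encap`.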
[ovs-dev] [PATCH branch-2.8 2/2] nsh: Add unit test for double NSH encap and decap
The added test verifies that OVS correctly encapsulates an Ethernet packet with two NSH (MD1) headers, sends it with an Ethernet header over a patch port and decap the Ethernet and the two NSH headers on the receiving bridge to reveal the original packet. The test case performs the encap() operations in a sequence of three chained groups to test the correct handling of encap() actions in group buckets recently fixed in commit ce4a16ac0. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- tests/nsh.at | 143 +++ 1 file changed, 143 insertions(+) diff --git a/tests/nsh.at b/tests/nsh.at index aa80a2a..6eb4637 100644 --- a/tests/nsh.at +++ b/tests/nsh.at @@ -274,6 +274,149 @@ AT_CLEANUP ### - +### Double NSH MD1 encapsulation using groups over veth link +### - + +AT_SETUP([nsh - double encap over veth link using groups]) + +OVS_VSWITCHD_START([]) + +AT_CHECK([ +ovs-vsctl set bridge br0 datapath_type=dummy \ +protocols=OpenFlow10,OpenFlow13,OpenFlow14,OpenFlow15 -- \ +add-port br0 p1 -- set Interface p1 type=dummy ofport_request=1 -- \ +add-port br0 p2 -- set Interface p2 type=dummy ofport_request=2 -- \ +add-port br0 v3 -- set Interface v3 type=patch options:peer=v4 ofport_request=3 -- \ +add-port br0 v4 -- set Interface v4 type=patch options:peer=v3 ofport_request=4]) + +AT_DATA([flows.txt], [dnl +table=0,in_port=1,ip,actions=group:100 + table=0,in_port=4,packet_type=(0,0),dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:1 + table=1,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:2 + table=2,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344,actions=decap(),output:2 +]) + +AT_DATA([groups.txt], [dnl +add group_id=100,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x1234->nsh_spi,set_field:0x11223344->nsh_c1,group:200 +add 
group_id=200,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x5678->nsh_spi,set_field:0x55667788->nsh_c1,group:300 +add group_id=300,type=indirect,bucket=actions=encap(ethernet),set_field:11:22:33:44:55:66->dl_dst,3 +]) + +AT_CHECK([ +ovs-ofctl del-flows br0 +ovs-ofctl -Oopenflow13 add-groups br0 groups.txt +ovs-ofctl -Oopenflow13 add-flows br0 flows.txt +ovs-ofctl -Oopenflow13 dump-flows br0 | ofctl_strip | sort | grep actions +], [0], [dnl + in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 actions=decap(),goto_table:1 + ip,in_port=1 actions=group:100 + table=1, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 actions=decap(),goto_table:2 + table=2, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344 actions=decap(),output:2 +]) + +# TODO: +# The fields nw_proto, nw_tos, nw_ecn, nw_ttl in final flow seem unnecessary. Can they be avoided? +# The match on dl_dst=66:77:88:99:aa:bb in the Megaflow is a side effect of setting the dl_dst in the pushed outer +# Ethernet header. It is a consequence of using wc->masks both for tracking matched and set bits and seems hard to +# avoid except by using separate masks for both purposes. + +AT_CHECK([ +ovs-appctl ofproto/trace br0 'in_port=1,icmp,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_dst=10.10.10.10,nw_src=20.20.20.20' +], [0], [dnl +Flow: icmp,in_port=1,vlan_tci=0x,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_src=20.20.20.20,nw_dst=10.10.10.10,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0 + +bridge("br0") +- + 0. ip,in_port=1, priority 32768 +group:100 +encap(nsh(md_type=1)) +set_field:0x1234->nsh_spi +set_field:0x11223344->nsh_c1 +group:200 +encap(nsh(md_type=1)) +set_field:0x5678->nsh_spi +set_field:0x55667788->nsh_c1 +group:300 +encap(ethernet) +set_field:11:22:33:44:55:66->eth_dst +output:3 + +bridge("br0") +- + 0. 
in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, priority 32768 +decap() +goto_table:1 + 1. packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, priority 32768 +decap() + +Final flow: unchanged +Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no +Datapath actions: encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),encap_nsh(flags=0,mdtype=1,np=4,spi=0x5678,si=255,c1=0x55667788,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,decap_nsh(),recirc(0x1) +]) + +AT_CHECK([ +ovs-appctl ofproto/trace br0 'recirc_id=1,in_port=4,packet_type=(1,0x894f),nsh_mdtype=1,nsh_np=3,nsh_spi=0x1234,nsh_c1=0x11223344' +], [0], [dnl +Flow: re
[ovs-dev] [PATCH branch-2.8 0/2] Correct handling of double encap and decap actions
Recent tests with NSH encap have shown that the translation of multiple subsequent encap() or decap() actions was incorrect. This patch set corrects the handling and adds a unit test for NSH to cover two NSH and one Ethernet encapsulation levels. This patch retrofits the new NSH test in commit a5b3e2a6f2 to the preliminary NSH support in OVS 2.8. Jan Scheurich (2): xlate: Correct handling of double encap() actions nsh: Add unit test for double NSH encap and decap lib/odp-util.c | 16 ++--- lib/odp-util.h | 1 + ofproto/ofproto-dpif-xlate.c | 7 ++- tests/nsh.at | 143 +++ 4 files changed, 156 insertions(+), 11 deletions(-) -- 1.9.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v11 1/3] netdev: Add optional qfill output parameter to rxq_recv()
> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Thursday, 12 April, 2018 18:37
>
> On Thu, Apr 12, 2018 at 05:32:11PM +0200, Jan Scheurich wrote:
> > If the caller provides a non-NULL qfill pointer and the netdev
> > implementation supports reading the rx queue fill level, the rxq_recv()
> > function returns the remaining number of packets in the rx queue after
> > reception of the packet burst to the caller. If the implementation does
> > not support this, it returns -ENOTSUP instead. Reading the remaining queue
> > fill level should not substantially slow down the recv() operation.
> >
> > A first implementation is provided for ethernet and vhostuser DPDK ports
> > in netdev-dpdk.c.
> >
> > This output parameter will be used in the upcoming commit for PMD
> > performance metrics to supervise the rx queue fill level for DPDK
> > vhostuser ports.
>
> Thanks for working on the generic netdev layer.
>
> I wasn't sure what a qfill was, so I looked at the comment on the
> function that returned it and it says that it is a "queue fill level".
> I can kind of guess what that is, but maybe it should be spelled out a
> little more. For example, is it the number of packets currently waiting
> to be received? (Maybe it is the number of bytes, who knows.) So, I
> suggest making the comment just a little more explicit.

Sure, will do that.

___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations
> > I would not say this is expected behavior.
> >
> > It seems that you are executing on a somewhat slower system (tsc clock
> > seems to be 100/us = 0.1 GHz) and that, even with only 5 lines logged
> > before and after, the logging output is causing so much slow down of the
> > PMD that it continues to cause iterations using excessive cycles
> > (362000 = 3.62 ms!) due to logging.
>
> The system is slower than usual, but not so much.
> This behaviour was captured on ARMv8. The TSC frequency on ARM is usually
> around 100MHz without using the PMU, which is not available from userspace
> by default. Meantime, the CPU frequency is 2GHz.

On my x86_64 server (2.4 GHz) I could reproduce the periodic logging if I go down with the us_thr to values as low as 50 us under high PMD load. But it always stops when I increase the threshold to a value higher than 80 us. It seems your ARM system is much slower when logging to file. The threshold that should reasonably be applied may depend on the system under test.

> > The actual iteration with logging is not flagged as suspicious, but the
> > subsequent iteration gets the hit of the massive cycles that have passed
> > on the TSC clock. The "phantom" duration of 0 us shown is probably a
> > side effect of this.
>
> I guess, you just have some bug in your calculation of execution time.
> Possibly you're mixing up the TSC and CPU frequencies.
> Zero ms duration is normal for printing so small amount of data.

It was actually a bug in the code: the iteration logged as suspicious was the one directly after the actually suspicious iteration. That bug is fixed in v11 I just sent out.

BR, Jan

___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
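The frequency mix-up discussed here is easy to make: converting a TSC delta to wall-clock time must use the timestamp counter's own frequency, not the CPU core clock. On the ARMv8 system above the generic timer runs at roughly 100 MHz while the cores run at 2 GHz, so mixing the two skews durations by a factor of about 20. A minimal sketch of the correct conversion (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TSC cycle delta to microseconds.  The divisor must be the
 * TSC frequency, which on ARMv8 is typically ~100 MHz and unrelated to
 * the 2 GHz core clock. */
static double
cycles_to_us(uint64_t cycles, uint64_t tsc_hz)
{
    return cycles * 1e6 / tsc_hz;
}
```

At a 100 MHz TSC, the 362000-cycle iteration mentioned above indeed works out to 3.62 ms, while the same cycle count at 2.4 GHz would only be about 151 us.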
[ovs-dev] [PATCH v11 0/3] dpif-netdev: Detailed PMD performance metrics and supervision
The run-time performance of PMDs is often difficult to understand and trouble-shoot. The existing PMD statistics counters only provide a coarse grained average picture. At packet rates of several Mpps sporadic drops of packet bursts happen at sub-millisecond time scales and are impossible to capture and analyze with existing tools. This patch collects a large number of important PMD performance metrics per PMD iteration, maintaining histograms and circular histories for iteration metrics and millisecond averages. To capture sporadic drop events, the patch set can be configured to monitor iterations for suspicious metrics and to log the neighborhood of such iterations for off-line analysis. The extra cost for the performance metric collection and the supervision has been measured to be in the order of 1% compared to the base commit in a PVP setup with L3 pipeline over VXLAN tunnels. For that reason the metrics collection is disabled by default and can be enabled at run-time through configuration. v9 -> v10: * Rebased to master (commit 00a0a011d) * Implemented comments on v10 by Ilya, Aaron and Ian. * Replaced broken macro ATOMIC_LLONG_LOCK_FREE with working macro ATOMIC_ALWAYS_LOCK_FREE_8B. * Changed iteration key in iteration history from TSC timetamp to iteration counter. * Bugfix: Suspicious iteration logged was one off the actual suspicious iteration. v9 -> v10: * Implemented missed comment by Ilya on v8: use ATOMIC_LLONG_LOCK_FREE * Fixed travis and checkpatch errors reported by Ian on v9. v8 -> v9: * Rebased to master (commit cb8cbbbe9) * Implemented minor comments on v8 by Billy v7 -> v8: * Rebased on to master (commit 4e99b70df) * Implemented comments from Ilya Maximets and Billy O'Mahony. * Replaced netdev_rxq_length() introduced in v7 by optional out parameter for the remaining rx queue len in netdev_rxq_recv(). 
* Fixed thread synchronization issues in clearing PMD stats: - Use mutex to control whether to clear from main thread directly or in PMD at start of next iteration. - Use mutex to prevent concurrent clearing and printing of metrics. * Added tx packet and batch stats to pmd-perf-show output. * Delay warning for suspicious iteration to the iteration in which we also log the neighborhood to not pollute the logged iteration stats with logging costs. * Corrected the exact number of iterations logged before and after a supicious iteration. * Introduced options -e and -ne in pmd-perf-log-set to control whether to *extend* the range of logged iterations when additional supicious iterations are detected before the scheduled end of logging interval is reached. * Exclude logging cycles from the iteration stats to avoid confusing ghost peaks. * Performance impact compared to master less than 1% even with supervision enabled. v5 -> v7: * Rebased on to dpdk_merge (commit e68) - New base contains earlier refactoring parts of series. * Implemented comments from Ilya Maximets and Billy O'Mahony. * Replaced piggybacking qlen on dp_packet_batch with a new netdev API netdev_rxq_length(). * Thread-safe clearing of pmd counters in pmd_perf_start_iteration(). * Fixed bug in reporting datapath stats. * Work-around a bug in DPDK rte_vhost_rx_queue_count() which sometimes returns bogus in the upper 16 bits of the uint32_t return value. v4 -> v5: * Rebased to master (commit e9de6c0) * Implemented comments from Aaron Conole and Darrel Ball v3 -> v4: * Rebased to master (commit 4d0a31b) - Reverting changes to struct dp_netdev_pmd_thread. * Make metrics collection configurable. * Several bugfixes. v2 -> v3: * Rebased to OVS master (commit 3728b3b). * Non-trivial adaptation to struct dp_netdev_pmd_thread. - refactored in commit a807c157 (Bhanu). * No other changes compared to v2. v1 -> v2: * Rebased to OVS master (commit 7468ec788). * No other changes compared to v1. 
Jan Scheurich (3): netdev: Add optional qfill output parameter to rxq_recv() dpif-netdev: Detailed performance stats for PMDs dpif-netdev: Detection and logging of suspicious PMD iterations NEWS| 6 + lib/automake.mk | 1 + lib/dpif-netdev-perf.c | 685 +++- lib/dpif-netdev-perf.h | 218 -- lib/dpif-netdev-unixctl.man | 216 ++ lib/dpif-netdev.c | 192 - lib/netdev-bsd.c| 8 +- lib/netdev-dpdk.c | 41 ++- lib/netdev-dummy.c | 8 +- lib/netdev-linux.c | 7 +- lib/netdev-provider.h | 7 +- lib/netdev.c| 5 +- lib/netdev.h| 3 +- manpages.mk | 2 + vswitchd/ovs-vswitchd.8.in | 27 +- vswitchd/vswitch.xml| 12 + 16 files changed, 1362 insertions(+), 76 deletions(-) create mode 100644 lib/dpif-netdev-unixctl.man -- 1.9.1 ___ dev mailing list d...@openvswitch.org h
[ovs-dev] [PATCH v11 1/3] netdev: Add optional qfill output parameter to rxq_recv()
If the caller provides a non-NULL qfill pointer and the netdev implemementation supports reading the rx queue fill level, the rxq_recv() function returns the remaining number of packets in the rx queue after reception of the packet burst to the caller. If the implementation does not support this, it returns -ENOTSUP instead. Reading the remaining queue fill level should not substantilly slow down the recv() operation. A first implementation is provided for ethernet and vhostuser DPDK ports in netdev-dpdk.c. This output parameter will be used in the upcoming commit for PMD performance metrics to supervise the rx queue fill level for DPDK vhostuser ports. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> --- lib/dpif-netdev.c | 2 +- lib/netdev-bsd.c | 8 +++- lib/netdev-dpdk.c | 41 - lib/netdev-dummy.c| 8 +++- lib/netdev-linux.c| 7 ++- lib/netdev-provider.h | 7 ++- lib/netdev.c | 5 +++-- lib/netdev.h | 3 ++- 8 files changed, 68 insertions(+), 13 deletions(-) diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index be31fd0..7ce3943 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -3277,7 +3277,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd, pmd->ctx.last_rxq = rxq; dp_packet_batch_init(); -error = netdev_rxq_recv(rxq->rx, ); +error = netdev_rxq_recv(rxq->rx, , NULL); if (!error) { /* At least one packet received. 
*/ *recirc_depth_get() = 0; diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 05974c1..b70f327 100644 --- a/lib/netdev-bsd.c +++ b/lib/netdev-bsd.c @@ -618,7 +618,8 @@ netdev_rxq_bsd_recv_tap(struct netdev_rxq_bsd *rxq, struct dp_packet *buffer) } static int -netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch) +netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, +int *qfill) { struct netdev_rxq_bsd *rxq = netdev_rxq_bsd_cast(rxq_); struct netdev *netdev = rxq->up.netdev; @@ -643,6 +644,11 @@ netdev_bsd_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch) batch->packets[0] = packet; batch->count = 1; } + +if (qfill) { +*qfill = -ENOTSUP; +} + return retval; } diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index ee39cbe..a4fc382 100644 --- a/lib/netdev-dpdk.c +++ b/lib/netdev-dpdk.c @@ -1812,13 +1812,13 @@ netdev_dpdk_vhost_update_rx_counters(struct netdev_stats *stats, */ static int netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, - struct dp_packet_batch *batch) + struct dp_packet_batch *batch, int *qfill) { struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev); struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev); uint16_t nb_rx = 0; uint16_t dropped = 0; -int qid = rxq->queue_id; +int qid = rxq->queue_id * VIRTIO_QNUM + VIRTIO_TXQ; int vid = netdev_dpdk_get_vid(dev); if (OVS_UNLIKELY(vid < 0 || !dev->vhost_reconfigured @@ -1826,14 +1826,23 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, return EAGAIN; } -nb_rx = rte_vhost_dequeue_burst(vid, qid * VIRTIO_QNUM + VIRTIO_TXQ, -dev->mp, +nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp, (struct rte_mbuf **) batch->packets, NETDEV_MAX_BURST); if (!nb_rx) { return EAGAIN; } +if (qfill) { +if (nb_rx == NETDEV_MAX_BURST) { +/* The DPDK API returns a uint32_t which often has invalid bits in + * the upper 16-bits. Need to restrict the value to uint16_t. 
*/ +*qfill = rte_vhost_rx_queue_count(vid, qid) & UINT16_MAX; +} else { +*qfill = 0; +} +} + if (policer) { dropped = nb_rx; nb_rx = ingress_policer_run(policer, @@ -1854,7 +1863,8 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq, } static int -netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch) +netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch, + int *qfill) { struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq); struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev); @@ -1891,6 +1901,14 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch) batch->count = nb_rx; dp_packet_batch_init_packet_fields(batch); +if (qfill) { +if (nb_rx == NETDEV_MAX_BURST) { +*qfill = rte_eth_rx_queue_count(rx->port_id, rxq->queue_id); +} else { +*qfill = 0; +} +} + return 0; } @@ -3172,6 +3190,19 @@ vrin
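The diff above encodes a small contract worth spelling out: the fill level is queried only after a *full* burst (a short burst implies the queue drained, so 0 is reported without touching the device), and a NULL pointer means the caller does not care. The toy model below mirrors that logic; the ring abstraction and names are illustrative, not the OVS/DPDK API, and the real vhost path additionally masks the result to 16 bits and can report -ENOTSUP.

```c
#include <assert.h>
#include <errno.h>

#define MAX_BURST 32                    /* stands in for NETDEV_MAX_BURST */

struct toy_rxq {
    int len;                            /* packets currently queued */
};

/* Toy model of the qfill contract added to netdev_rxq_recv(). */
static int
toy_rxq_recv(struct toy_rxq *q, int *qfill)
{
    int nb_rx = q->len < MAX_BURST ? q->len : MAX_BURST;

    q->len -= nb_rx;
    if (qfill) {
        /* Querying the device costs cycles, so only do it when the
         * burst was full and packets may have been left behind. */
        *qfill = nb_rx == MAX_BURST ? q->len : 0;
    }
    return nb_rx ? 0 : EAGAIN;
}
```

This is why the metric is usable as a vhost overload indicator: a non-zero qfill after a full burst means the PMD is not keeping up with the queue.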
[ovs-dev] [PATCH v11 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations
This patch enhances dpif-netdev-perf to detect iterations with suspicious statistics according to the following criteria: - iteration lasts longer than US_THR microseconds (default 250). This can be used to capture events where a PMD is blocked or interrupted for such a period of time that there is a risk for dropped packets on any of its Rx queues. - max vhost qlen exceeds a threshold Q_THR (default 128). This can be used to infer virtio queue overruns and dropped packets inside a VM, which are not visible in OVS otherwise. Such suspicious iterations can be logged together with their iteration statistics to be able to correlate them to packet drop or other events outside OVS. A new command is introduced to enable/disable logging at run-time and to adjust the above thresholds for suspicious iterations: ovs-appctl dpif-netdev/pmd-perf-log-set on | off [-b before] [-a after] [-e|-ne] [-us usec] [-q qlen] Turn logging on or off at run-time (on|off). -b before: The number of iterations before the suspicious iteration to be logged (default 5). -a after: The number of iterations after the suspicious iteration to be logged (default 5). -e: Extend logging interval if another suspicious iteration is detected before logging occurs. -ne:Do not extend logging interval (default). -q qlen:Suspicious vhost queue fill level threshold. Increase this to 512 if the Qemu supports 1024 virtio queue length. (default 128). -us usec: change the duration threshold for a suspicious iteration (default 250 us). Note: Logging of suspicious iterations itself consumes a considerable amount of processing cycles of a PMD which may be visible in the iteration history. In the worst case this can lead OVS to detect another suspicious iteration caused by logging. If more than 100 iterations around a suspicious iteration have been logged once, OVS falls back to the safe default values (-b 5/-a 5/-ne) to avoid that logging itself causes continuos further logging. 
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> --- NEWS| 2 + lib/dpif-netdev-perf.c | 223 lib/dpif-netdev-perf.h | 21 + lib/dpif-netdev-unixctl.man | 59 lib/dpif-netdev.c | 5 + 5 files changed, 310 insertions(+) diff --git a/NEWS b/NEWS index ff81a9f..f11ef64 100644 --- a/NEWS +++ b/NEWS @@ -23,6 +23,8 @@ Post-v2.9.0 * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD * Detailed PMD performance metrics available with new command ovs-appctl dpif-netdev/pmd-perf-show + * Supervision of PMD performance metrics and logging of suspicious + iterations v2.9.0 - 19 Feb 2018 diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index caa0e27..47ce2c2 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -25,6 +25,24 @@ VLOG_DEFINE_THIS_MODULE(pmd_perf); +#define ITER_US_THRESHOLD 250 /* Warning threshold for iteration duration + in microseconds. */ +#define VHOST_QUEUE_FULL 128/* Size of the virtio TX queue. */ +#define LOG_IT_BEFORE 5 /* Number of iterations to log before + suspicious iteration. */ +#define LOG_IT_AFTER 5 /* Number of iterations to log after + suspicious iteration. 
*/ + +bool log_enabled = false; +bool log_extend = false; +static uint32_t log_it_before = LOG_IT_BEFORE; +static uint32_t log_it_after = LOG_IT_AFTER; +static uint32_t log_us_thr = ITER_US_THRESHOLD; +uint32_t log_q_thr = VHOST_QUEUE_FULL; +uint64_t iter_cycle_threshold; + +static struct vlog_rate_limit latency_rl = VLOG_RATE_LIMIT_INIT(600, 600); + #ifdef DPDK_NETDEV static uint64_t get_tsc_hz(void) @@ -141,6 +159,10 @@ pmd_perf_stats_init(struct pmd_perf_stats *s) histogram_walls_set_log(>max_vhost_qfill, 0, 512); s->iteration_cnt = 0; s->start_ms = time_msec(); +s->log_susp_it = UINT32_MAX; +s->log_begin_it = UINT32_MAX; +s->log_end_it = UINT32_MAX; +s->log_reason = NULL; } void @@ -391,6 +413,10 @@ pmd_perf_stats_clear_lock(struct pmd_perf_stats *s) history_init(>milliseconds); s->start_ms = time_msec(); s->milliseconds.sample[0].timestamp = s->start_ms; +s->log_susp_it = UINT32_MAX; +s->log_begin_it = UINT32_MAX; +s->log_end_it = UINT32_MAX; +s->log_reason = NULL; /* Clearing finished. */ s->clear = false; ovs_mutex_unlock(>clear_mutex); @@ -442,6 +468,7 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets, uint64_t now_tsc = cycles_counter_update(s); struct iter_stats *cum_ms; uint64_t cycles, cycles_per
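The two detection criteria from the commit message reduce to a cycle budget derived from the `-us` threshold plus a queue-length comparison. The sketch below shows the arithmetic; it is simplified from what dpif-netdev-perf does (which caches `iter_cycle_threshold` once per threshold change) and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Translate the -us threshold into a TSC cycle budget. */
static uint64_t
us_to_cycles(uint32_t us_thr, uint64_t tsc_hz)
{
    return (uint64_t) us_thr * tsc_hz / 1000000;
}

/* An iteration is suspicious if it blew the cycle budget or if any
 * vhost rx queue fill level exceeded the -q threshold. */
static bool
iteration_suspicious(uint64_t it_cycles, uint32_t max_vhost_qfill,
                     uint64_t cycle_thr, uint32_t q_thr)
{
    return it_cycles > cycle_thr || max_vhost_qfill > q_thr;
}
```

On a 2.4 GHz TSC the default 250 us threshold corresponds to a budget of 600000 cycles per iteration.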
[ovs-dev] [PATCH v11 2/3] dpif-netdev: Detailed performance stats for PMDs
This patch instruments the dpif-netdev datapath to record detailed statistics
of what is happening in every iteration of a PMD thread. The collection of
detailed statistics can be controlled by a new Open_vSwitch configuration
parameter "other_config:pmd-perf-metrics". By default it is disabled. The
run-time overhead, when enabled, is in the order of 1%.

The covered metrics per iteration are:
  - cycles
  - packets
  - (rx) batches
  - packets/batch
  - max. vhostuser qlen
  - upcalls
  - cycles spent in upcalls

This raw recorded data is used threefold:

1. In histograms for each of the following metrics:
   - cycles/iteration (log.)
   - packets/iteration (log.)
   - cycles/packet
   - packets/batch
   - max. vhostuser qlen (log.)
   - upcalls
   - cycles/upcall (log.)
   The histogram bins are divided linearly or logarithmically.
2. A cyclic history of the above statistics for 999 iterations.
3. A cyclic history of the cumulative/average values per millisecond
   wall clock for the last 1000 milliseconds:
   - number of iterations
   - avg. cycles/iteration
   - packets (Kpps)
   - avg. packets/batch
   - avg. max vhost qlen
   - upcalls
   - avg. cycles/upcall

The gathered performance metrics can be printed at any time with the
new CLI command

  ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len] [-pmd core] [dp]

The options are:
  -nh:          Suppress the histograms
  -it iter_len: Display the last iter_len iteration stats
  -ms ms_len:   Display the last ms_len millisecond stats
  -pmd core:    Display only the specified PMD

The performance statistics are reset with the existing
dpif-netdev/pmd-stats-clear command.
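The linear/logarithmic binning mentioned above can be sketched with a small self-contained example. This is a simplified illustration, not the actual lib/dpif-netdev-perf.c code: the real struct histogram and histogram_walls_set_log() support arbitrary min/max ranges, while this sketch uses a plain doubling factor and a reduced bin count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_BINS 12  /* Illustrative; the real struct histogram is larger. */

struct histogram {
    uint32_t wall[NUM_BINS];  /* Upper bound of each bin. */
    uint64_t bin[NUM_BINS];   /* Sample count per bin. */
};

/* Logarithmic bin walls 1, 2, 4, ..., 'max', plus a catch-all bin, so that
 * small values get fine-grained bins while large outliers still fit. */
static void
histogram_walls_set_log(struct histogram *hist, uint32_t max)
{
    uint32_t wall = 1;
    for (int i = 0; i < NUM_BINS - 1; i++) {
        hist->wall[i] = wall < max ? wall : max;
        wall *= 2;
    }
    hist->wall[NUM_BINS - 1] = UINT32_MAX;  /* Catch-all for outliers. */
}

/* Count 'val' in the first bin whose wall is large enough. */
static void
histogram_add_sample(struct histogram *hist, uint32_t val)
{
    for (int i = 0; i < NUM_BINS; i++) {
        if (val <= hist->wall[i]) {
            hist->bin[i]++;
            return;
        }
    }
}
```

With logarithmic walls, a metric like the vhost queue fill level gets fine resolution near zero, where most samples land, while a rare spike to thousands still falls into the final bin instead of being dropped.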
The output always contains the following global PMD statistics, similar to
the pmd-stats-show command:

Time: 15:24:55.270
Measurement duration: 1.008 s

pmd thread numa_id 0 core_id 1:

  Cycles:          2419034712 (2.40 GHz)
  Iterations:      572817 (1.76 us/it)
  - idle:          486808 (15.9 % cycles)
  - busy:          86009 (84.1 % cycles)
  Rx packets:      2399607 (2381 Kpps, 848 cycles/pkt)
  Datapath passes: 3599415 (1.50 passes/pkt)
  - EMC hits:      336472 ( 9.3 %)
  - Megaflow hits: 3262943 (90.7 %, 1.00 subtbl lookups/hit)
  - Upcalls:       0 ( 0.0 %, 0.0 us/upcall)
  - Lost upcalls:  0 ( 0.0 %)
  Tx packets:      2399607 (2381 Kpps)
  Tx batches:      171400 (14.00 pkts/batch)

Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mah...@intel.com>
---
 NEWS                        | 4 +
 lib/automake.mk             | 1 +
 lib/dpif-netdev-perf.c      | 462 +++-
 lib/dpif-netdev-perf.h      | 197 ---
 lib/dpif-netdev-unixctl.man | 157 +++
 lib/dpif-netdev.c           | 187 --
 manpages.mk                 | 2 +
 vswitchd/ovs-vswitchd.8.in  | 27 +-
 vswitchd/vswitch.xml        | 12 ++
 9 files changed, 985 insertions(+), 64 deletions(-)
 create mode 100644 lib/dpif-netdev-unixctl.man

diff --git a/NEWS b/NEWS
index 757d648..ff81a9f 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,10 @@ Post-v2.9.0
      * implemented icmp4/icmp6/tcp_reset actions in order to drop the packet
        and reply with a RST for TCP or ICMPv4/ICMPv6 unreachable message for
        other IPv4/IPv6-based protocols whenever a reject ACL rule is hit.
+ - Userspace datapath: + * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD + * Detailed PMD performance metrics available with new command + ovs-appctl dpif-netdev/pmd-perf-show v2.9.0 - 19 Feb 2018 diff --git a/lib/automake.mk b/lib/automake.mk index 915a33b..3276aaa 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -491,6 +491,7 @@ MAN_FRAGMENTS += \ lib/dpctl.man \ lib/memory-unixctl.man \ lib/netdev-dpdk-unixctl.man \ + lib/dpif-netdev-unixctl.man \ lib/ofp-version.man \ lib/ovs.tmac \ lib/service.man \ diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c index f06991a..caa0e27 100644 --- a/lib/dpif-netdev-perf.c +++ b/lib/dpif-netdev-perf.c @@ -15,18 +15,333 @@ */ #include +#include +#include "dpif-netdev-perf.h" #include "openvswitch/dynamic-string.h" #include "openvswitch/vlog.h" -#include "dpif-netdev-perf.h" +#include "ovs-thread.h" #include "timeval.h" VLOG_DEFINE_THIS_MODULE(pmd_perf); +#ifdef DPDK_NETDEV +static uint64_t +get_tsc_hz(void) +{ +return rte_get_tsc_hz(); +} +#else +/* This function is only invoked from PMD threads which depend on DPDK. + * A dummy function is sufficient when building without DPDK_NETDEV. */ +static uint64_t +get_tsc_hz(void) +{ +return 1; +} +#endif + +/* Histogram functions. */ + +static void +histogram_walls_set_lin(struct histogram *hist, ui
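As a cross-check, the derived figures in the pmd-perf-show sample output earlier in this patch description (Kpps and cycles/pkt) follow from the raw counters by simple arithmetic. The helper names below are illustrative, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles spent per received packet, rounded to nearest. */
static uint64_t
cycles_per_packet(uint64_t busy_cycles, uint64_t rx_packets)
{
    return rx_packets ? (busy_cycles + rx_packets / 2) / rx_packets : 0;
}

/* Packet rate in Kpps over the measurement duration. */
static uint64_t
rx_kpps(uint64_t rx_packets, double duration_s)
{
    /* +0.5 for round-to-nearest without pulling in libm. */
    return (uint64_t) (rx_packets / duration_s / 1000.0 + 0.5);
}
```

With the sample counters (2419034712 cycles over 1.008 s, 84.1 % of them busy, 2399607 rx packets) this reproduces the printed 2381 Kpps and 848 cycles/pkt.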
Re: [ovs-dev] [PATCH net-next 1/6] netdev-dpdk: Allow vswitchd to parse devargs as dpdk-bond args
> > The bond of openvswitch has not good performance. > > Any examples? For example, balance-tcp bond mode for L34 load sharing still requires a recirculation after dp_hash. I believe that it would definitely be interesting to compare bond performance between DPDK bonding and OVS bonding with DPDK datapath for various bond modes and traffic patterns. Another interesting performance metric would be link failover times and packet drop (at link down and link up) in static and dynamic (LACP) bond configurations. That is an area where we have repeatedly seen problems with OVS bonding. > > > In some > > cases we would recommend that you use Linux bonds instead > > of Open vSwitch bonds. In userspace datapath, we wants use > > bond to improve bandwidth. The DPDK has implemented it as lib. > > You could use OVS bonding for userspace datapath and it has > good performance, especially after TX batching patch-set. > > DPDK bonding has a variety of limitations like the requirement > to call rte_eth_tx_burst and rte_eth_rx_burst with intervals > period of less than 100ms for link aggregation modes. > OVS could not assure that. A periodic dummy tx burst every 100 ms is something that could easily be added to dpif-netdev PMD for bonded dpdk netdevs. > > > > > These patches base DPDK bond to implement the dpdk-bond > > device as a vswitchd interface. > > > > If users set the interface options with multi-pci or device names > > with ',' as a separator, we try to parse it as dpdk-bond args. > > For example, set an interface as: > > > > ovs-vsctl add-port br0 dpdk0 -- \ > > set Interface dpdk0 type=dpdk \ > > options:dpdk-devargs=:06:00.0,:06:00.1 > > > > And now these patch support to set bond mode, such as round > > robin, active_backup and balance and so on. Later some features > > of bond will be supported. 
> > Hmm, but you're already have ability to add any virtual dpdk device > including bond devices like this: > > ovs-vsctl add-port br0 bond0 -- \ > set Interface dpdk0 type=dpdk \ > > options:dpdk-devargs="eth_bond0,mode=2,slave=:05:00.0,slave=:05:00.1,xmit_policy=l34" > > So, what is the profit of this patch-set? Thanks for the pointer. That is a valid question. I guess special handling like periodic dummy tx burst might have to be enabled based on dpdk-devargs bond configuration. BR, Jan ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
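The "periodic dummy tx burst" idea mentioned above could be reduced to a small scheduling check in the PMD loop. This is a hypothetical sketch, not part of any posted patch; the helper name bond_keepalive_due() and the placement are assumptions, and only the 100 ms interval comes from the discussion (DPDK link aggregation modes require rte_eth_rx_burst/rte_eth_tx_burst to be invoked at least that often):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BOND_KEEPALIVE_INTERVAL_MS 100  /* DPDK bond modes need rx/tx bursts
                                         * at least this often. */

/* Returns true when the PMD should issue a dummy rte_eth_tx_burst() of
 * zero packets on a bonded dpdk port, so that DPDK's link aggregation
 * state machine keeps running even when the port carries no traffic.
 * '*last_burst_ms' is updated when a keepalive is due. */
static bool
bond_keepalive_due(uint64_t now_ms, uint64_t *last_burst_ms)
{
    if (now_ms - *last_burst_ms >= BOND_KEEPALIVE_INTERVAL_MS) {
        *last_burst_ms = now_ms;
        return true;
    }
    return false;
}
```

In dpif-netdev this check could run once per PMD iteration per bonded tx port, which is cheap compared to the cost of the burst calls themselves.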
Re: [ovs-dev] [PATCH net-next 0/6] Add dpdk-bond support
Hi Tonghao, Thanks for working on this. That was on my backlog to try out for a while. One immediate feedback: This is a pure OVS user space patch. Please remove the "net-next" tag from your patches in the next version. "net-next" is reserved for OVS kernel module patches that are first submitted upstream to the Linux "net-next" repository. Regards, Jan > -Original Message- > From: ovs-dev-boun...@openvswitch.org > [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of xiangxia.m@gmail.com > Sent: Thursday, 12 April, 2018 14:53 > To: d...@openvswitch.org > Subject: [ovs-dev] [PATCH net-next 0/6] Add dpdk-bond support > > From: Tonghao Zhang> > The bond of openvswitch has not good performance. In some > cases we would recommend that you use Linux bonds instead > of Open vSwitch bonds. In userspace datapath, we wants use > bond to improve bandwidth. The DPDK has implemented it as lib. > > These patches base DPDK bond to implement the dpdk-bond > device as a vswitchd interface. > > If users set the interface options with multi-pci or device names > with ',' as a separator, we try to parse it as dpdk-bond args. > For example, set an interface as: > > ovs-vsctl add-port br0 dpdk0 -- \ > set Interface dpdk0 type=dpdk \ > options:dpdk-devargs=:06:00.0,:06:00.1 > > And now these patch support to set bond mode, such as round > robin, active_backup and balance and so on. Later some features > of bond will be supported. > > These patches are RFC, any proposal will be welcome. Ignore the doc, > if these pathes is ok for openvswitch the doc will be posted. > > There are somes shell scripts, which can help us to test the patches. 
> https://github.com/nickcooper-zhangtonghao/ovs-bond-tests > > Tonghao Zhang (6): > netdev-dpdk: Allow vswitchd to parse devargs as dpdk-bond args > netdev-dpdk: Allow dpdk-ethdev not support setting mtu > netdev-dpdk: Add netdev_dpdk_bond struct > netdev-dpdk: Add dpdk-bond support > netdev-dpdk: Add check whether dpdk-port is used > netdev-dpdk: Add dpdk-bond mode setting > > lib/netdev-dpdk.c | 304 > +- > 1 file changed, 299 insertions(+), 5 deletions(-) > > -- > 1.8.3.1 > > ___ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious PMD iterations
Hi Ilya,

I would not say this is expected behavior. It seems that you are executing
on a somewhat slower system (the TSC clock appears to run at 100 cycles/us
= 0.1 GHz) and that, even with only 5 lines logged before and after, the
logging output slows the PMD down so much that the logging itself keeps
producing iterations with excessive cycles (362000 cycles = 3.62 ms!). The
iteration that does the logging is not flagged as suspicious, but the
subsequent iteration takes the hit of the massive number of cycles that
have passed on the TSC clock in the meantime. The "phantom" duration of
0 us shown is probably a side effect of this.

I will try to reproduce this and have a look at the detection logic to see
if it can be avoided.

BR, Jan

> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Tuesday, 27 March, 2018 16:05
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Cc: ktray...@redhat.com; ian.sto...@intel.com; billy.o.mah...@intel.com
> Subject: Re: [PATCH v10 3/3] dpif-netdev: Detection and logging of suspicious
> PMD iterations
>
> I see following behaviour:
>
> 1. Configure low -us (like 100)
> 2. After that I see many logs about suspicious iterations (expected).
> > 2018-03-27T13:58:27Z|03574|pmd_perf(pmd7)|WARN|Suspicious iteration > (Excessive total cycles): tsc=520415762246435 > duration=106 us > 2018-03-27T13:58:27Z|03575|pmd_perf(pmd7)|WARN|Neighborhood of suspicious > iteration: >tsc cycles packets cycles/pkt pkts/batch > vhost qlen upcalls cycles/upcall >520415762297985 9711 32 303 32 > 424 00 >520415762287041 1066732 333 32 > 419 00 >520415762277319 9722 32 303 32 > 429 00 >520415762267083 9971 32 311 32 > 443 00 >520415762257413 9670 32 302 32 > 451 00 >520415762246435 1069932 334 32 > 448 00 >520415762235033 1110932 347 32 > 455 00 >520415762180220 9826 32 307 32 > 399 00 >520415762169792 1022932 319 32 > 413 00 >520415762160385 9407 32 293 32 > 408 00 >520415762150221 9891 32 309 32 > 434 00 > 2018-03-27T13:58:27Z|03576|pmd_perf(pmd7)|WARN|Suspicious iteration > (Excessive total cycles): tsc=520415762469997 > duration=104 us > 2018-03-27T13:58:27Z|03577|pmd_perf(pmd7)|WARN|Neighborhood of suspicious > iteration: >tsc cycles packets cycles/pkt pkts/batch > vhost qlen upcalls cycles/upcall >520415762519119 9462 32 295 32 > 505 00 >520415762509595 9319 32 291 32 > 537 00 >520415762500154 9283 32 290 32 > 569 00 >520415762490585 9287 32 290 32 > 601 00 >520415762480693 9730 32 304 32 > 633 00 >520415762469997 1041432 325 32 > 665 00 >520415762459348 1034232 323 32 > 697 00 >520415762297985 9711 32 303 32 > 424 00 >520415762287041 1066732 333 32 > 419 00 >520415762277319 9722 32 303 32 > 429 00 >520415762267083 9971 32 311 32 > 443 00 > > 3. Configure back high -us (like 1000). > 4. Logs are still there with zero duration. Logs printed every second like > this: > > 2018-03-27T14:02:08Z|04140|pmd_perf(pmd7)|WARN|Suspicious iteration > (Excessive total cycles): tsc=520437806368099 duration=0 > us > [Thread 0x7fb56f2910 (LWP 19754) exited] > [New Thread 0x7fb56f2910 (LWP 19755)] > 2018-03-27T14:02:08Z|04141|pmd_perf(pmd7)|WARN|Neighb
Re: [ovs-dev] [PATCH 2/3] ofproto-dpif: Improve dp_hash selection method for select groups
Hi Ychen,

Thanks a lot for your tests of corner cases and suggested bug fixes. I will
include fixes in the next version, possibly also unit test cases for those.

A bucket weight of zero should in my eyes imply no traffic to that bucket.
I will check how to achieve that. I will also look into your
ofproto_group_unref question.

Regards, Jan

From: ychen [mailto:ychen103...@163.com]
Sent: Wednesday, 11 April, 2018 06:16
To: Jan Scheurich <jan.scheur...@ericsson.com>
Cc: d...@openvswitch.org; Nitin Katiyar <nitin.kati...@ericsson.com>
Subject: Re:[PATCH 2/3] ofproto-dpif: Improve dp_hash selection method for select groups

Hi, Jan:

When I tested dp_hash with the new patch, vswitchd was killed by a
segmentation fault under some conditions:
1. adding a group with no buckets, so that winner stays NULL;
2. adding buckets with weight 0, so that winner is also NULL.

I made some small modifications to the patch; could you help check whether
they are correct?

diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 8f6070d..b3a9639 100755
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -4773,6 +4773,8 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
         webster[i].value = bucket->weight;
         i++;
     }
+    //consider bucket weight equal to 0
+    if (!min_weight) min_weight = 1;

     uint32_t min_slots = ceil(total_weight / min_weight);
     n_hash = MAX(16, 1L << log_2_ceil(min_slots));
@@ -4794,11 +4796,12 @@ group_setup_dp_hash_table(struct group_dpif *group, size_t max_hash)
     for (int hash = 0; hash < n_hash; hash++) {
         VLOG_DBG("Hash value: %d", hash);
         double max_val = 0.0;
-        struct webster *winner;
+        struct webster *winner = NULL;
         for (i = 0; i < n_buckets; i++) {
             VLOG_DBG("Webster[%d]: divisor=%d value=%.2f", i,
                      webster[i].divisor, webster[i].value);
-            if (webster[i].value > max_val) {
+            // use >= in case there is only one bucket with weight 0
+            if (webster[i].value >= max_val) {
                 max_val = webster[i].value;
                 winner = &webster[i];
             }
@@ -4827,7 +4830,8 @@ group_set_selection_method(struct
group_dpif *group)
         group->selection_method = SEL_METHOD_DEFAULT;
     } else if (!strcmp(selection_method, "dp_hash")) {
         /* Try to use dp_hash if possible at all. */
-        if (group_setup_dp_hash_table(group, 64)) {
+        uint32_t n_buckets = group->up.n_buckets;
+        if (n_buckets && group_setup_dp_hash_table(group, 64)) {
             group->selection_method = SEL_METHOD_DP_HASH;
             group->hash_alg = props->selection_method_param >> 32;
             if (group->hash_alg >= __OVS_HASH_MAX) {

Another question: I found that in the functions xlate_default_select_group
and xlate_hash_fields_select_group, when group_best_live_bucket returns
NULL, ofproto_group_unref is called. Why does the dp_hash function not need
to call it when there is no best bucket found (e.g. a group with no
buckets)?

At 2018-03-21 02:16:17, "Jan Scheurich" <jan.scheur...@ericsson.com> wrote:
>The current implementation of the "dp_hash" selection method suffers
>from two deficiencies: 1. The hash mask and hence the number of dp_hash
>values is just large enough to cover the number of group buckets, but
>does not consider the case that buckets have different weights. 2. The
>xlate-time selection of best bucket from the masked dp_hash value often
>results in bucket load distributions that are quite different from the
>bucket weights because the number of available masked dp_hash values
>is too small (2-6 bits compared to 32 bits of a full hash in the default
>hash selection method).
>
>This commit provides a more accurate implementation of the dp_hash
>select group by applying the well known Webster method for distributing
>a small number of "seats" fairly over the weighted "parties"
>(see https://en.wikipedia.org/wiki/Webster/Sainte-Lagu%C3%AB_method).
>The dp_hash mask is automatically chosen large enough to provide good
>enough accuracy even with widely differing weights.
>
>This distribution happens at group modification time and the resulting
>table is stored with the group-dpif struct.
At xlation time, we use the
>masked dp_hash values as index to look up the assigned bucket.
>
>If the bucket should not be live, we do a circular search over the
>mapping table until we find the first live bucket. As the buckets in
>the table are by construction in pseudo-random order with a frequency
>according to their weight, this method maintains correct distribution
>even if one or more buckets are non-live.
>
>Xlation is further simplif
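The Webster/Sainte-Laguë distribution described in the quoted commit message can be sketched as a small standalone function. This is a simplified version for illustration only; the actual patch builds the table in group_setup_dp_hash_table() with a dynamically chosen n_hash and stores the result with the group-dpif struct:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distribute 'n_slots' hash values over weighted buckets with the Webster
 * (Sainte-Lague) method: each slot goes to the bucket with the currently
 * largest weight/divisor quotient, and the winner's divisor then advances
 * by 2 (1, 3, 5, ...).  Supports up to 16 buckets for brevity. */
static void
webster_assign(const uint32_t *weight, uint32_t *slots, size_t n_buckets,
               size_t n_slots)
{
    double value[16];
    uint32_t divisor[16];

    for (size_t i = 0; i < n_buckets; i++) {
        divisor[i] = 1;
        value[i] = weight[i];
        slots[i] = 0;
    }
    for (size_t s = 0; s < n_slots; s++) {
        size_t winner = 0;
        for (size_t i = 1; i < n_buckets; i++) {
            if (value[i] > value[winner]) {
                winner = i;
            }
        }
        slots[winner]++;
        divisor[winner] += 2;
        value[winner] = (double) weight[winner] / divisor[winner];
    }
}
```

Because each slot is awarded to the currently "most underserved" bucket, the resulting slot counts track the bucket weights far more closely than simply masking the hash down to the number of buckets.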
Re: [ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set
Hi Yi,

The assertion failure is indeed caused by the incorrect implementation of
double encap() and should be fixed by the patch you mention (which is merged
to master by now). Prior to the below fix this happened with every
encap(nsh) in a group bucket.

I can't say why it still happens periodically every few minutes in your
test. You'd need to carefully analyze a crash dump to try to understand the
packet processing history that leads to a double encap() or perhaps decap().
It is definitely worth trying whether the problem is already resolved on the
latest master.

BR, Jan

> -Original Message-
> From: Yang, Yi Y [mailto:yi.y.y...@intel.com]
> Sent: Sunday, 08 April, 2018 10:27
> To: Jan Scheurich <jan.scheur...@ericsson.com>; d...@openvswitch.org
> Subject: RE: [PATCH] ofp-actions: Correct execution of encap/decap actions in
> action set
>
> Hi, Jan
>
> Sangfor guy tried this one, he still encountered assert issue after ovs ran
> for about 20 minutes, moreover it appeared periodically. I'm
> not sure if https://patchwork.ozlabs.org/patch/895405/ is helpful for this
> issue. Do you think what the root cause is?
>
> -Original Message-
> From: Jan Scheurich [mailto:jan.scheur...@ericsson.com]
> Sent: Monday, March 26, 2018 3:36 PM
> To: d...@openvswitch.org
> Cc: Yang, Yi Y <yi.y.y...@intel.com>; Jan Scheurich
> <jan.scheur...@ericsson.com>
> Subject: [PATCH] ofp-actions: Correct execution of encap/decap actions in
> action set
>
> The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field
> actions in ofpact_is_set_or_move_action(). This caused
> them to be executed twice in the action set or a group bucket, once
> explicitly in
> ofpacts_execute_action_set() and once again as part of the list of set_field
> or move actions.
> > Fixes: f839892a ("OF support and translation of generic encap and decap") > Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action") > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> > > --- > > The fix should be backported to OVS 2.9 and OVS 2.8 (without the case for > OFPACT_DEC_NSH_TTL introduced in 2.9). > > > lib/ofp-actions.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c index db85716..87797bc > 100644 > --- a/lib/ofp-actions.c > +++ b/lib/ofp-actions.c > @@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a) > case OFPACT_SET_TUNNEL: > case OFPACT_SET_VLAN_PCP: > case OFPACT_SET_VLAN_VID: > -case OFPACT_ENCAP: > -case OFPACT_DECAP: > -case OFPACT_DEC_NSH_TTL: > return true; > case OFPACT_BUNDLE: > case OFPACT_CLEAR_ACTIONS: > @@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a) > case OFPACT_WRITE_METADATA: > case OFPACT_DEBUG_RECIRC: > case OFPACT_DEBUG_SLOW: > +case OFPACT_ENCAP: > +case OFPACT_DECAP: > +case OFPACT_DEC_NSH_TTL: > return false; > default: > OVS_NOT_REACHED(); > -- > 1.9.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions
Yes, that fix should be applied to branches 2.9 and 2.8. I checked that it
applies and that all unit tests pass.

On branch-2.8 the nsh.at patch must be slightly retrofitted, as the datapath
action names changed from encap_nsh/decap_nsh to push_nsh/pop_nsh and the
nsh_ttl field was introduced in 2.9.

diff --git a/tests/nsh.at b/tests/nsh.at
index 6ae71b5..6eb4637 100644
--- a/tests/nsh.at
+++ b/tests/nsh.at
@@ -351,7 +351,7 @@ bridge("br0")
 Final flow: unchanged
 Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
-Datapath actions: push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x1122334
+Datapath actions: encap_nsh(flags=0,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0
 ])
 AT_CHECK([
@@ -370,7 +370,7 @@ bridge("br0")
 Final flow: recirc_id=0x1,eth,in_port=4,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_ds
 Megaflow: recirc_id=0x1,packet_type=(1,0x894f),in_port=4,nsh_mdtype=1,nsh_np=3,nsh_spi
-Datapath actions: pop_nsh(),recirc(0x2)
+Datapath actions: decap_nsh(),recirc(0x2)
 ])
 AT_CHECK([
@@ -407,8 +407,8 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v ipv6 | sort
 ], [0], [flow-dump from non-dpdk interfaces:
-recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
-recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
+recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0
+recirc_id(0x3),in_port(1),packet_type(ns=1,id=0x894f),nsh(mdtype=1,np=3,spi=0x1234,c1=
 recirc_id(0x4),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packe
 ])

Thanks, Jan

> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Friday, 06 April, 2018 18:36
> To: Jan Scheurich <jan.scheur...@ericsson.com>
> Cc: d...@openvswitch.org; yi.y.y...@intel.com
> Subject: Re: [PATCH v2 0/2] Correct handling of double encap and decap actions
>
> On Fri, Apr 06, 2018 at 09:35:48AM -0700,
Ben Pfaff wrote:
> > On Thu, Apr 05, 2018 at 04:11:02PM +0200, Jan Scheurich wrote:
> > > Recent tests with NSH encap have shown that the translation of multiple
> > > subsequent encap() or decap() actions was incorrect. This patch set
> > > corrects the handling and adds a unit test for NSH to cover two NSH
> > > and one Ethernet encapsulation levels.
> >
> > Thanks. Should this be applied to branch-2.9?
>
> To be clear, I applied it to master just now.
Re: [ovs-dev] [PATCH v10 1/3] netdev: Add optional qfill output parameter to rxq_recv()
> > @@ -1846,11 +1846,24 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> > batch->count = nb_rx;
> > dp_packet_batch_init_packet_fields(batch);
> >
> > +if (qfill) {
> > +if (nb_rx == NETDEV_MAX_BURST) {
> > +/* The DPDK API returns a uint32_t which often has invalid bits in
> > + * the upper 16-bits. Need to restrict the value to uint16_t. */
> > +*qfill = rte_vhost_rx_queue_count(netdev_dpdk_get_vid(dev),
>
> I lost count of how many times I talked about this. Please, don't obtain the
> 'vid' twice. You have to check the result of 'netdev_dpdk_get_vid()' always.
> Otherwise this could lead to crash.
>
> Details, as usual, here:
> daf22bf7a826 ("netdev-dpdk: Fix calling vhost API with negative vid.")
>
> I believe, that I already wrote this comment to one of the previous versions
> of this patch-set.

Yes, sorry I missed that one. I will for sure fix it. As this is fairly
non-obvious from looking at the code, it might be a good idea to add a
warning comment to the function 'netdev_dpdk_get_vid()' and/or the places
where it is used from the PMD.

/Jan
Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero
> -Original Message-
> From: Ben Pfaff [mailto:b...@ovn.org]
> Sent: Wednesday, 04 April, 2018 22:28
>
> Oh, that's weird. It's as if I didn't read the patch. Maybe I just
> read some preliminary version in another thread.
>
> Anyway, you're totally right. I applied this to master. If you're
> seeing problems in another branch, let me know and I will backport.

Thanks! I think the issue was introduced into OVS by the following commit a
long time ago.

commit 23597df052262dec961fd86eb7c54d10984a1ec0
Author: Joe Stringer
Date: Fri Jul 25 13:54:24 2014 +1200

It's a temporary glitch that can cause unexpected behavior only within the
first few hundred milliseconds after datapath flow creation. It is most
likely to affect "reactive" controller use cases (MAC learning, ARP
handling), like the OVN test case that now failed with a small change of
timing. So it is possible that one could notice short packet drops or
duplicate PACKET_INs in real SDN deployments when looking closely enough.

My preference would be to backport it all the way to OVS 2.5. But of course
I don't have proof that there are actual problems out in the field that it
would solve.

One could also do a systematic search of unit test cases that apply "sleep"
or "time/warp" work-arounds for the issue and simplify these on the master
branch. But I fear I won't have time for that.

Regards, Jan
Re: [ovs-dev] [PATCH] xlate: Correct handling of double encap() actions
> > > > This fix should be backported to OVS 2.8 and 2.9
> >
> > This seems tricky. Do you plan to write a test?

Hi Ben,

I have just posted v2 where I have added an NSH unit test for double NSH
plus Ethernet encapsulation. The test also covers encap() actions in group
buckets.

BR, Jan
[ovs-dev] [PATCH v2 0/2] Correct handling of double encap and decap actions
Recent tests with NSH encap have shown that the translation of multiple
subsequent encap() or decap() actions was incorrect. This patch set
corrects the handling and adds a unit test for NSH to cover two NSH and
one Ethernet encapsulation levels.

v1->v2:
- Rebased to master (commit 4b337e489)
- Added NSH unit test with double encap

Jan Scheurich (2):
  xlate: Correct handling of double encap() actions
  nsh: Add unit test for double NSH encap and decap

 lib/odp-util.c               | 16 ++---
 lib/odp-util.h               | 1 +
 ofproto/ofproto-dpif-xlate.c | 7 ++-
 tests/nsh.at                 | 143 +++
 4 files changed, 156 insertions(+), 11 deletions(-)

--
1.9.1
[ovs-dev] [PATCH v2 2/2] nsh: Add unit test for double NSH encap and decap
The added test verifies that OVS correctly encapsulates an Ethernet packet with two NSH (MD1) headers, sends it with an Ethernet header over a patch port and decaps the Ethernet and the two NSH headers on the receiving bridge to reveal the original packet. The test case performs the encap() operations in a sequence of three chained groups to test the correct handling of encap() actions in group buckets recently fixed in commit ce4a16ac0. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- tests/nsh.at | 143 +++ 1 file changed, 143 insertions(+) diff --git a/tests/nsh.at b/tests/nsh.at index e6a8345..7539e91 100644 --- a/tests/nsh.at +++ b/tests/nsh.at @@ -276,6 +276,149 @@ AT_CLEANUP ### - +### Double NSH MD1 encapsulation using groups over veth link +### - + +AT_SETUP([nsh - double encap over veth link using groups]) + +OVS_VSWITCHD_START([]) + +AT_CHECK([ +ovs-vsctl set bridge br0 datapath_type=dummy \ +protocols=OpenFlow10,OpenFlow13,OpenFlow14,OpenFlow15 -- \ +add-port br0 p1 -- set Interface p1 type=dummy ofport_request=1 -- \ +add-port br0 p2 -- set Interface p2 type=dummy ofport_request=2 -- \ +add-port br0 v3 -- set Interface v3 type=patch options:peer=v4 ofport_request=3 -- \ +add-port br0 v4 -- set Interface v4 type=patch options:peer=v3 ofport_request=4]) + +AT_DATA([flows.txt], [dnl +table=0,in_port=1,ip,actions=group:100 + table=0,in_port=4,packet_type=(0,0),dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:1 + table=1,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788,actions=decap(),goto_table:2 + table=2,packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344,actions=decap(),output:2 +]) + +AT_DATA([groups.txt], [dnl +add group_id=100,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x1234->nsh_spi,set_field:0x11223344->nsh_c1,group:200 +add 
group_id=200,type=indirect,bucket=actions=encap(nsh(md_type=1)),set_field:0x5678->nsh_spi,set_field:0x55667788->nsh_c1,group:300 +add group_id=300,type=indirect,bucket=actions=encap(ethernet),set_field:11:22:33:44:55:66->dl_dst,3 +]) + +AT_CHECK([ +ovs-ofctl del-flows br0 +ovs-ofctl -Oopenflow13 add-groups br0 groups.txt +ovs-ofctl -Oopenflow13 add-flows br0 flows.txt +ovs-ofctl -Oopenflow13 dump-flows br0 | ofctl_strip | sort | grep actions +], [0], [dnl + in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 actions=decap(),goto_table:1 + ip,in_port=1 actions=group:100 + table=1, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788 actions=decap(),goto_table:2 + table=2, packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x1234,nsh_c1=0x11223344 actions=decap(),output:2 +]) + +# TODO: +# The fields nw_proto, nw_tos, nw_ecn, nw_ttl in final flow seem unnecessary. Can they be avoided? +# The match on dl_dst=66:77:88:99:aa:bb in the Megaflow is a side effect of setting the dl_dst in the pushed outer +# Ethernet header. It is a consequence of using wc->masks both for tracking matched and set bits and seems hard to +# avoid except by using separate masks for both purposes. + +AT_CHECK([ +ovs-appctl ofproto/trace br0 'in_port=1,icmp,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_dst=10.10.10.10,nw_src=20.20.20.20' +], [0], [dnl +Flow: icmp,in_port=1,vlan_tci=0x,dl_src=00:11:22:33:44:55,dl_dst=66:77:88:99:aa:bb,nw_src=20.20.20.20,nw_dst=10.10.10.10,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0 + +bridge("br0") +- + 0. ip,in_port=1, priority 32768 +group:100 +encap(nsh(md_type=1)) +set_field:0x1234->nsh_spi +set_field:0x11223344->nsh_c1 +group:200 +encap(nsh(md_type=1)) +set_field:0x5678->nsh_spi +set_field:0x55667788->nsh_c1 +group:300 +encap(ethernet) +set_field:11:22:33:44:55:66->eth_dst +output:3 + +bridge("br0") +- + 0. 
in_port=4,dl_type=0x894f,nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, priority 32768 +decap() +goto_table:1 + 1. packet_type=(1,0x894f),nsh_mdtype=1,nsh_spi=0x5678,nsh_c1=0x55667788, priority 32768 +decap() + +Final flow: unchanged +Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no +Datapath actions: push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_nsh(flags=0,ttl=63,mdtype=1,np=4,spi=0x5678,si=255,c1=0x55667788,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),recirc(0x1) +]) + +AT_CHECK([ +ovs-appctl ofproto/trace br0 'recirc_id=1,in_port=4,packet_type=(1,0x894f),nsh_mdtype=1,nsh_np=3,nsh_spi=0x1234,nsh_c1=0x11223344' +], [0], [dnl +Flow: re
[ovs-dev] [PATCH v2 1/2] xlate: Correct handling of double encap() actions
When the same encap() header was pushed twice onto a packet (e.g in the case of NSH in NSH), the translation logic only generated a datapath push action for the first encap() action. The second encap() did not emit a push action because the packet type was unchanged. commit_encap_decap_action() (renamed from commit_packet_type_change) must solely rely on ctx->pending_encap to generate an datapath push action. Similarly, the first decap() action on a double header packet does not change the packet_type either. Add a corresponding ctx->pending_decap flag and use that to trigger emitting a datapath pop action. Fixes: f839892a2 ("OF support and translation of generic encap and decap") Fixes: 1fc11c594 ("Generic encap and decap support for NSH") Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- lib/odp-util.c | 16 ++-- lib/odp-util.h | 1 + ofproto/ofproto-dpif-xlate.c | 7 ++- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/lib/odp-util.c b/lib/odp-util.c index 8743503..6db241a 100644 --- a/lib/odp-util.c +++ b/lib/odp-util.c @@ -7446,17 +7446,13 @@ odp_put_push_nsh_action(struct ofpbuf *odp_actions, } static void -commit_packet_type_change(const struct flow *flow, +commit_encap_decap_action(const struct flow *flow, struct flow *base_flow, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool pending_encap, + bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { -if (flow->packet_type == base_flow->packet_type) { -return; -} - if (pending_encap) { switch (ntohl(flow->packet_type)) { case PT_ETH: { @@ -7481,7 +7477,7 @@ commit_packet_type_change(const struct flow *flow, * The check is done at action translation. */ OVS_NOT_REACHED(); } -} else { +} else if (pending_decap || flow->packet_type != base_flow->packet_type) { /* This is an explicit or implicit decap case. 
*/ if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE && base_flow->packet_type == htonl(PT_ETH)) { @@ -7520,14 +7516,14 @@ commit_packet_type_change(const struct flow *flow, enum slow_path_reason commit_odp_actions(const struct flow *flow, struct flow *base, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool use_masked, bool pending_encap, + bool use_masked, bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { enum slow_path_reason slow1, slow2; bool mpls_done = false; -commit_packet_type_change(flow, base, odp_actions, wc, - pending_encap, encap_data); +commit_encap_decap_action(flow, base, odp_actions, wc, + pending_encap, pending_decap, encap_data); commit_set_ether_action(flow, base, odp_actions, wc, use_masked); /* Make packet a non-MPLS packet before committing L3/4 actions, * which would otherwise do nothing. */ diff --git a/lib/odp-util.h b/lib/odp-util.h index 1fad159..6fcd1bb 100644 --- a/lib/odp-util.h +++ b/lib/odp-util.h @@ -283,6 +283,7 @@ enum slow_path_reason commit_odp_actions(const struct flow *, struct flow_wildcards *wc, bool use_masked, bool pending_encap, + bool pending_decap, struct ofpbuf *encap_data); /* ofproto-dpif interface. diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 42ac118..c8baba1 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -243,6 +243,8 @@ struct xlate_ctx { * true. */ bool pending_encap; /* True when waiting to commit a pending * encap action. */ +bool pending_decap; /* True when waiting to commit a pending + * decap action. */ struct ofpbuf *encap_data; /* May contain a pointer to an ofpbuf with * context for the datapath encap action.*/ @@ -3477,8 +3479,9 @@ xlate_commit_actions(struct xlate_ctx *ctx) ctx->xout->slow |= commit_odp_actions(>xin->flow, >base_flow, ctx->odp_actions, ctx->wc, use_masked, ctx->pending_encap, - ctx->encap_data); + ctx->pending_decap, ctx->encap_data); ctx->pending_encap = false; +ctx->pending_de
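The core of the fix above — deciding on explicit pending flags instead of a packet_type comparison — can be illustrated with a small standalone sketch. The functions and constants below are simplified stand-ins for the OVS logic, not the actual code from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified packet-type constants (PT_NSH uses the NSH Ethertype). */
enum { PT_ETH = 0, PT_NSH = 0x894f };

/* Old logic: bail out when the packet type is unchanged, so the
 * second of two identical encap() actions never emits a push.
 * Returns the number of datapath push actions emitted. */
static int
commit_old(uint32_t flow_pt, uint32_t base_pt, bool pending_encap)
{
    if (flow_pt == base_pt) {
        return 0;               /* Second NSH-in-NSH encap lands here. */
    }
    return pending_encap ? 1 : 0;
}

/* Fixed logic: the pending_encap flag alone decides whether a push
 * action must be emitted, regardless of the packet type. */
static int
commit_new(uint32_t flow_pt, uint32_t base_pt, bool pending_encap)
{
    (void) flow_pt;
    (void) base_pt;
    return pending_encap ? 1 : 0;
}
```

For NSH-in-NSH, the first encap() changes PT_ETH to PT_NSH and is committed either way; the second encap() leaves the packet type at PT_NSH, so only the flag-based variant emits its push action.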
[ovs-dev] [PATCH v4 2/2] xlate: Move tnl_neigh_snoop() to terminate_native_tunnel()
From: Zoltan Balogh <zoltan.balogh@gmail.com> Currently OVS snoops any ARP or ND packets in any bridge and populates the tunnel neighbor cache with the retrieved data. For instance, when an ARP reply originated by a tenant is received in an overlay bridge, the ARP packet is snooped and the tunnel neighbor cache is filled with tenant address information. This is at best useless as tunnel endpoints can only reside on an underlay bridge. The real problem starts if different tenants on the overlay bridge have overlapping IP addresses such that they keep overwriting each other's pseudo tunnel neighbor entries. These frequent updates are treated as configuration changes and trigger revalidation each time, thus causing a lot of useless revalidation load on the system. To keep the ARP neighbor cache clean, this patch moves tunnel neighbor snooping from the generic function do_xlate_actions() to the specific function terminate_native_tunnel() in compose_output_action(). Thus, only ARP and Neighbor Advertisement packets addressing a local tunnel endpoint (on the LOCAL port of the underlay bridge) are snooped. In order to achieve this, the IP addresses of the bridge ports are retrieved and then stored in the xbridge by calling xlate_xbridge_set(). The destination address extracted from the ARP or Neighbor Advertisement packet is then matched against the known xbridge addresses in is_neighbor_reply_correct() to filter the snooped packets further. 
Signed-off-by: Zoltan Balogh <zoltan.balogh@gmail.com> Co-authored-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- include/sparse/netinet/in.h | 10 +++ ofproto/ofproto-dpif-xlate.c | 147 -- tests/tunnel-push-pop-ipv6.at | 68 ++- tests/tunnel-push-pop.at | 67 ++- 4 files changed, 282 insertions(+), 10 deletions(-) diff --git a/include/sparse/netinet/in.h b/include/sparse/netinet/in.h index 6abdb23..eea41bd 100644 --- a/include/sparse/netinet/in.h +++ b/include/sparse/netinet/in.h @@ -123,6 +123,16 @@ struct sockaddr_in6 { (X)->s6_addr[10] == 0xff &&\ (X)->s6_addr[11] == 0xff) +#define IN6_IS_ADDR_MC_LINKLOCAL(a) \ +(((const uint8_t *) (a))[0] == 0xff && \ + (((const uint8_t *) (a))[1] & 0xf) == 0x2) + +# define IN6_ARE_ADDR_EQUAL(a,b) \ +const uint32_t *) (a))[0] == ((const uint32_t *) (b))[0]) && \ + (((const uint32_t *) (a))[1] == ((const uint32_t *) (b))[1]) && \ + (((const uint32_t *) (a))[2] == ((const uint32_t *) (b))[2]) && \ + (((const uint32_t *) (a))[3] == ((const uint32_t *) (b))[3])) + #define INET_ADDRSTRLEN 16 #define INET6_ADDRSTRLEN 46 diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index 42ac118..f593a6e 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -91,6 +91,16 @@ VLOG_DEFINE_THIS_MODULE(ofproto_dpif_xlate); * recursive or not. */ #define MAX_RESUBMITS (MAX_DEPTH * MAX_DEPTH) +/* The structure holds an array of IP addresses assigned to a bridge and the + * number of elements in the array. These data are mutable and are evaluated + * when ARP or Neighbor Advertisement packets received on a native tunnel + * port are xlated. So 'ref_cnt' and RCU are used for synchronization. */ +struct xbridge_addr { +struct in6_addr *addr;/* Array of IP addresses of xbridge. */ +int n_addr; /* Number of IP addresses. */ +struct ovs_refcount ref_cnt; +}; + struct xbridge { struct hmap_node hmap_node; /* Node in global 'xbridges' map. 
*/ struct ofproto_dpif *ofproto; /* Key in global 'xbridges' map. */ @@ -114,6 +124,8 @@ struct xbridge { /* Datapath feature support. */ struct dpif_backer_support support; + +struct xbridge_addr *addr; }; struct xbundle { @@ -582,7 +594,8 @@ static void xlate_xbridge_set(struct xbridge *, struct dpif *, const struct dpif_ipfix *, const struct netflow *, bool forward_bpdu, bool has_in_band, - const struct dpif_backer_support *); + const struct dpif_backer_support *, + const struct xbridge_addr *); static void xlate_xbundle_set(struct xbundle *xbundle, enum port_vlan_mode vlan_mode, uint16_t qinq_ethtype, int vlan, @@ -836,6 +849,56 @@ xlate_xport_init(struct xlate_cfg *xcfg, struct xport *xport) uuid_hash(>uuid)); } +static struct xbridge_addr * +xbridge_addr_create(struct xbridge *xbridge) +{ +struct xbridge_addr *xbridge_addr = xbridge->addr; +struct in6_addr *addr = N
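The per-bridge address filter described in the commit message above can be sketched roughly as follows. This is a simplified, self-contained version: the struct mirrors the patch, but is_neighbor_reply_correct() here is illustrative, not the literal patch code:

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative sketch of the snoop filter: an ARP/ND reply is only
 * snooped if its destination IP matches one of the bridge's own
 * addresses, i.e. a potential local tunnel endpoint. */
struct xbridge_addr {
    struct in6_addr *addr;      /* Array of IP addresses of the bridge. */
    int n_addr;                 /* Number of IP addresses. */
};

static bool
is_neighbor_reply_correct(const struct xbridge_addr *xa,
                          const struct in6_addr *dst)
{
    for (int i = 0; i < xa->n_addr; i++) {
        if (!memcmp(&xa->addr[i], dst, sizeof *dst)) {
            return true;        /* Addressed to a local endpoint: snoop. */
        }
    }
    return false;               /* Tenant traffic: leave the cache alone. */
}
```

With this check in terminate_native_tunnel(), ARP replies between tenants on an overlay bridge no longer pollute the tunnel neighbor cache, because their destination never matches an underlay bridge address.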
[ovs-dev] [PATCH v4 1/2] tests: Inject ARP replies for snoop tests on different port
From: Zoltan Balogh <zoltan.balogh@gmail.com> The ARP replies injected into the underlay bridge 'br0' to trigger ARP snooping should be destined to the bridge's LOCAL port. So far the tests injected them on LOCAL port 'br0' itself, which didn't matter as OVS snooped on all ARP packets passing the bridge. This patch injects the ARP replies on a different port in preparation for an upcoming commit that will make OVS only snoop on ARP packets output to the LOCAL port. The clone() wrapper must be added to the generated datapath flows now as the traced packets would actually be transmitted through the tunnel port. Previously the underlay bridge dropped the packets as the learned egress port for the tunnel nexthop was the LOCAL port, which also served as virtual ingress port for the encapsulated traffic. The translation end result was an expensive way to say 'drop'. Signed-off-by: Zoltan Balogh <zoltan.balogh@gmail.com> Co-authored-by: Jan Scheurich <jan.scheur...@ericsson.com> Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- tests/tunnel-push-pop-ipv6.at | 14 +++--- tests/tunnel-push-pop.at | 24 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/tests/tunnel-push-pop-ipv6.at b/tests/tunnel-push-pop-ipv6.at index 7ca522a..29bc1f3 100644 --- a/tests/tunnel-push-pop-ipv6.at +++ b/tests/tunnel-push-pop-ipv6.at @@ -55,9 +55,9 @@ AT_CHECK([cat p0.pcap.txt | grep 93aa55aa5586dd60203aff2001cafe | un ]) dnl Check ARP Snoop -AT_CHECK([ovs-appctl netdev-dummy/receive br0 'in_port(100),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::92,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::92,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b6)']) +AT_CHECK([ovs-appctl netdev-dummy/receive p0 
'in_port(1),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::92,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::92,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b6)']) -AT_CHECK([ovs-appctl netdev-dummy/receive br0 'in_port(100),eth(src=f8:bc:12:44:34:b7,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::93,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::93,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b7)']) +AT_CHECK([ovs-appctl netdev-dummy/receive p0 'in_port(1),eth(src=f8:bc:12:44:34:b7,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::93,dst=2001:cafe::94,label=0,proto=58,tclass=0,hlimit=255,frag=no),icmpv6(type=136,code=0),nd(target=2001:cafe::93,sll=00:00:00:00:00:00,tll=f8:bc:12:44:34:b7)']) AT_CHECK([ovs-appctl tnl/arp/show | tail -n+3 | sort], [0], [dnl 2001:cafe::92 f8:bc:12:44:34:b6 br0 @@ -93,28 +93,28 @@ dnl Check VXLAN tunnel push AT_CHECK([ovs-ofctl add-flow int-br action=2]) AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:01),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout]) AT_CHECK([tail -1 stdout], [0], - [Datapath actions: tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7b)),out_port(100)) + [Datapath actions: clone(tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7b)),out_port(100)),1) ]) dnl Check VXLAN tunnel push set tunnel id by flow and checksum AT_CHECK([ovs-ofctl add-flow int-br 
"actions=set_tunnel:124,4"]) AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:01),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout]) AT_CHECK([tail -1 stdout], [0], - [Datapath actions: tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b7,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::93,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7c)),out_port(100)) + [Datapath actions: clone(tnl_push(tnl_port(4789),header(size=70,type=4,eth(dst=f8:bc:12:44:34:b7,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::93,label=0,proto=17,tclass=0x0,hlimit=64),udp(src=0,dst=4789,csum=0x),vxlan(flags=0x800,vni=0x7c)),out_port(100)),1) ]) dnl Check GRE tunnel push AT_CHECK([ovs-ofctl add-flow int-br action=3]) AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth(src=f8
[ovs-dev] [PATCH v4 0/2] Fix tunnel neighbor cache population
Currently, OVS snoops any ARP or ND packets in any bridge and populates the tunnel neighbor cache with the retrieved data. For instance, when an ARP reply originated by a tenant is received on an overlay bridge, the ARP packet is snooped and the tunnel neighbor cache is filled with tenant addresses; however, only actual tunnel neighbor data should be stored there. In the worst case, tunnel peer data could be overwritten in the cache. This series resolves the issue by limiting the range of ARP and ND packets being snooped to only those that are addressed to potential local tunnel endpoints. v3 -> v4: - Rebased to master (commit 4b337e489b) - Failing unit test case with v3 fixed by commit 8f0e86f84 - Improved commit messages Zoltan Balogh (2): tests: Inject ARP replies for snoop tests on different port xlate: Move tnl_neigh_snoop() to terminate_native_tunnel() include/sparse/netinet/in.h | 10 +++ ofproto/ofproto-dpif-xlate.c | 147 -- tests/tunnel-push-pop-ipv6.at | 78 -- tests/tunnel-push-pop.at | 91 ++ 4 files changed, 299 insertions(+), 27 deletions(-) -- 1.9.1 ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero
Thanks Ben, I hope what my patch does is precisely what you suggest. I had the same thoughts when I had a closer look at the code. Regards, Jan > -----Original Message----- > From: Ben Pfaff [mailto:b...@ovn.org] > Sent: Wednesday, 04 April, 2018 19:20 > To: Jan Scheurich <jan.scheur...@ericsson.com> > Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com>; > jpet...@ovn.org > Subject: Re: [PATCH] ofproto-dpif: Init ukey->dump_seq to zero > > On Wed, Apr 04, 2018 at 01:26:02PM +0200, Jan Scheurich wrote: > > In the current implementation the dump_seq of a new datapath flow ukey > > is set to seq_read(udpif->dump_seq). This implies that any revalidation > > during the current dump_seq period (up to 500 ms) is skipped. > > > > This can trigger incorrect behavior, for example when the creation of a > > datapath flow triggers a PACKET_IN to the controller, upon which > > the controller installs a new flow entry that should invalidate the > > original datapath flow. > > > > Initializing ukey->dump_seq to zero implies that the first dump of the > > flow, be it for revalidation or dumping statistics, will always be > > executed as zero is not a valid value of the ovs_seq. > > > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> > > If we are going to do this, then we should delete the 'dump_seq' member > of struct upcall, because it will always be zero. It is also worth > considering whether the other caller of ukey_create__() should pass 0, > and if so then we can delete the 'dump_seq' parameter of > ukey_create__(). > > Thanks, > > Ben.
[ovs-dev] [PATCH] ofproto-dpif: Init ukey->dump_seq to zero
In the current implementation the dump_seq of a new datapath flow ukey is set to seq_read(udpif->dump_seq). This implies that any revalidation during the current dump_seq period (up to 500 ms) is skipped. This can trigger incorrect behavior, for example when the creation of a datapath flow triggers a PACKET_IN to the controller, upon which the controller installs a new flow entry that should invalidate the original datapath flow. Initializing ukey->dump_seq to zero implies that the first dump of the flow, be it for revalidation or dumping statistics, will always be executed as zero is not a valid value of the ovs_seq. Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- ofproto/ofproto-dpif-upcall.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c index 7bfeedd..00160e1 100644 --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -231,7 +231,6 @@ struct upcall { bool ukey_persists;/* Set true to keep 'ukey' beyond the lifetime of this upcall. */ -uint64_t dump_seq; /* udpif->dump_seq at translation time. */ uint64_t reval_seq;/* udpif->reval_seq at translation time. */ /* Not used by the upcall callback interface. */ @@ -1159,7 +1158,6 @@ upcall_xlate(struct udpif *udpif, struct upcall *upcall, * with pushing its stats eventually. 
*/ } -upcall->dump_seq = seq_read(udpif->dump_seq); upcall->reval_seq = seq_read(udpif->reval_seq); xerr = xlate_actions(, >xout); @@ -1633,7 +1631,7 @@ ukey_create__(const struct nlattr *key, size_t key_len, const struct nlattr *mask, size_t mask_len, bool ufid_present, const ovs_u128 *ufid, const unsigned pmd_id, const struct ofpbuf *actions, - uint64_t dump_seq, uint64_t reval_seq, long long int used, + uint64_t reval_seq, long long int used, uint32_t key_recirc_id, struct xlate_out *xout) OVS_NO_THREAD_SAFETY_ANALYSIS { @@ -1654,7 +1652,7 @@ ukey_create__(const struct nlattr *key, size_t key_len, ukey_set_actions(ukey, actions); ovs_mutex_init(>mutex); -ukey->dump_seq = dump_seq; +ukey->dump_seq = 0; /* Not yet dumped */ ukey->reval_seq = reval_seq; ukey->state = UKEY_CREATED; ukey->state_thread = ovsthread_id_self(); @@ -1704,8 +1702,7 @@ ukey_create_from_upcall(struct upcall *upcall, struct flow_wildcards *wc) return ukey_create__(keybuf.data, keybuf.size, maskbuf.data, maskbuf.size, true, upcall->ufid, upcall->pmd_id, - >put_actions, upcall->dump_seq, - upcall->reval_seq, 0, + >put_actions, upcall->reval_seq, 0, upcall->have_recirc_ref ? upcall->recirc->id : 0, >xout); } @@ -1717,7 +1714,7 @@ ukey_create_from_dpif_flow(const struct udpif *udpif, { struct dpif_flow full_flow; struct ofpbuf actions; -uint64_t dump_seq, reval_seq; +uint64_t reval_seq; uint64_t stub[DPIF_FLOW_BUFSIZE / 8]; const struct nlattr *a; unsigned int left; @@ -1754,12 +1751,11 @@ ukey_create_from_dpif_flow(const struct udpif *udpif, } } -dump_seq = seq_read(udpif->dump_seq); reval_seq = seq_read(udpif->reval_seq) - 1; /* Ensure revalidation. 
*/ ofpbuf_use_const(, >actions, flow->actions_len); *ukey = ukey_create__(flow->key, flow->key_len, flow->mask, flow->mask_len, flow->ufid_present, - >ufid, flow->pmd_id, , dump_seq, + >ufid, flow->pmd_id, , reval_seq, flow->stats.used, 0, NULL); return 0; -- 1.9.1
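The sequencing argument in the commit message above can be reduced to a small sketch: a ukey is only revalidated when its dump_seq differs from the current dump sequence, and since 0 is never a valid ovs_seq value, initializing it to 0 guarantees the first dump is never skipped. The functions below are an illustrative model, not the actual ofproto-dpif-upcall.c code:

```c
#include <stdbool.h>
#include <stdint.h>

/* A ukey whose dump_seq equals the current dump sequence was already
 * handled in this dump period and is skipped by the revalidator. */
static bool
revalidate_needed(uint64_t ukey_dump_seq, uint64_t current_dump_seq)
{
    return ukey_dump_seq != current_dump_seq;
}

/* Old initialization: the ukey starts with the current dump_seq, so it
 * is skipped for the remainder of this dump period (up to 500 ms). */
static uint64_t
init_dump_seq_old(uint64_t current_dump_seq)
{
    return current_dump_seq;
}

/* Fixed initialization: 0 is never a valid ovs_seq value, so the first
 * dump always revalidates (or dumps statistics for) the flow. */
static uint64_t
init_dump_seq_new(void)
{
    return 0;
}
```

This is exactly the window in which a controller-installed flow (triggered by the PACKET_IN) could fail to invalidate the freshly created datapath flow under the old initialization.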
Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from terminate_native_tunnel()
d.c:157 2018-03-28T18:41:24.965Z|00012|poll_loop(urcu2)|DBG|wakeup due to [POLLIN] on fd 24 (FIFO pipe:[16413512]) at lib/ovs-rcu.c:235 2018-03-28T18:41:24.982Z|00328|poll_loop|DBG|wakeup due to [POLLIN] on fd 46 (/opt/ovs/tests/testsuite.dir/2487/hv1/br-int.mgmt<->) at lib/stream-fd.c:157 2018-03-28T18:41:24.983Z|00329|vconn|DBG|unix#3: received: OFPT_PACKET_OUT (OF1.3) (xid=0xd3): in_port=CONTROLLER actions=set_field:0xa02->reg0,set_field:0xac100101->reg1,set_field:0x1->reg10,set_field:0x5->reg11,set_field:0x7->reg12,set_field:0x1->reg14,set_field:0x2->reg15,set_field:0x1->metadata,set_field:ff:ff:ff:ff:ff:ff->eth_dst,move:NXM_NX_XXREG0[64..95]->NXM_OF_ARP_SPA[],move:NXM_NX_XXREG0[96..127]->NXM_OF_ARP_TPA[],set_field:1->arp_op,resubmit(,32) data_len=42 arp,vlan_tci=0x,dl_src=00:00:00:01:02:04,dl_dst=00:00:00:00:00:00,arp_spa=192.168.1.2,arp_tpa=10.0.0.2,arp_op=1,arp_sha=00:00:00:01:02:04,arp_tha=00:00:00:00:00:00 2018-03-28T18:41:24.983Z|00330|dpif_netdev|DBG|ovs-system: action upcall: skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),packet_type(ns=0,id=0),eth(src=f0:00:00:01:02:04,dst=00:00:00:01:02:04),eth_type(0x0806),arp(sip=10.0.0.2,tip=172.16.1.1,op=2,sha=f0:00:00:01:02:04,tha=00:00:00:01:02:04) arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04 2018-03-28T18:41:24.983Z|00331|dpif|DBG|system@ovs-system: execute userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=2,rule_cookie=0xd8bfa54f,controller_id=0,max_len=65535)) on packet arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04 with metadata skb_priority(0),skb_mark(0) mtu 0 2018-03-28T18:41:24.983Z|00332|dpif|DBG|system@ovs-system: sub-execute 
userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=2,rule_cookie=0xd8bfa54f,controller_id=0,max_len=65535)) on packet arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04 with metadata skb_priority(0),skb_mark(0) mtu 0 2018-03-28T18:41:24.983Z|00333|poll_loop|DBG|wakeup due to 0-ms timeout at ofproto/ofproto-dpif.c:1713 2018-03-28T18:41:24.984Z|00334|vconn|DBG|unix#3: sent (Success): NXT_PACKET_IN2 (OF1.3) (xid=0x0): table_id=9 cookie=0xd8bfa54f total_len=42 reg0=0xa02,reg11=0x5,reg12=0x7,reg14=0x2,metadata=0x1,in_port=0 (via action) data_len=42 (unbuffered) userdata=00.00.00.01.00.00.00.00 arp,vlan_tci=0x,dl_src=f0:00:00:01:02:04,dl_dst=00:00:00:01:02:04,arp_spa=10.0.0.2,arp_tpa=172.16.1.1,arp_op=2,arp_sha=f0:00:00:01:02:04,arp_tha=00:00:00:01:02:04 2018-03-28T18:41:24.996Z|00335|poll_loop|DBG|wakeup due to [POLLIN] on fd 45 (/opt/ovs/tests/testsuite.dir/2487/hv1/br-int.mgmt<->) at lib/stream-fd.c:157 2018-03-28T18:41:24.996Z|00336|vconn|DBG|unix#2: received: OFPT_FLOW_MOD (OF1.3) (xid=0xd4): ADD table:66 priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 actions=set_field:f0:00:00:01:02:04->eth_dst 2018-03-28T18:41:24.996Z|00337|poll_loop|DBG|wakeup due to 0-ms timeout at ofproto/ofproto-dpif.c:1709 2018-03-28T18:41:24.996Z|00013|poll_loop(urcu2)|DBG|wakeup due to [POLLIN] on fd 24 (FIFO pipe:[16413512]) at lib/ovs-rcu.c:358 2018-03-28T18:41:24.996Z|00024|poll_loop(revalidator6)|DBG|wakeup due to [POLLIN] on fd 35 (FIFO pipe:[16413516]) at ofproto/ofproto-dpif-upcall.c:948 2018-03-28T18:41:24.996Z|00025|dpif(revalidator6)|DBG|system@ovs-system: flow_dump ufid:de19ea9b-110b-4476-b8f7-df5fac3d07c5 , packets:0, bytes:0, used:never 2018-03-28T18:41:24.996Z|00026|ofproto_dpif_upcall(revalidator6)|WARN|Flow already dumped 2018-03-28T18:41:24.996Z|00027|dpif(revalidator6)|DBG|system@ovs-system: dumped all flows 
2018-03-28T18:41:24.997Z|00028|dpif(revalidator6)|DBG|system@ovs-system: flow_dump_destroy success > -Original Message- > From: Jan Scheurich > Sent: Wednesday, 04 April, 2018 01:52 > To: Zoltán Balogh <zoltan.bal...@ericsson.com>; Justin Pettit > <jpet...@ovn.org>; Ben Pfaff <b...@ovn.org> > Cc: g...@ovn.org; d...@openvswitch.org > Subject: RE: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from > terminate_native_tunnel() > > Hi, > > I took this over from Zoltan and investigated the failing unit test a bit > further. > > The essential difference between master and Zoltan's (rebased) patches is > that the dynamic "ARP cache" flow entry > >table=66,priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 > actions=set_field:f0:00:00:01:02:04->eth_dst > > (which OVN installs in response to the ARP reply it receives from OVS in > PACKET_IN after the first injected IP pa
Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from terminate_native_tunnel()
Hi, I took this over from Zoltan and investigated the failing unit test a bit further. The essential difference between master and Zoltan's (rebased) patches is that the dynamic "ARP cache" flow entry table=66,priority=100,reg0=0xa02,reg15=0x2,metadata=0x1 actions=set_field:f0:00:00:01:02:04->eth_dst (which OVN installs in response to the ARP reply it receives from OVS in PACKET_IN after the first injected IP packet) triggers different behavior in the subsequent revalidation: In the case of master the existing datapath flow entry recirc_id(0),in_port(3),packet_type(ns=0,id=0),eth(src=f0:00:00:01:02:03,dst=00:00:00:01:02:03),eth_type(0x0800),ipv4(src=192.168.1.2/255.255.255.254,dst=10.0.0.2/248.0.0.0,ttl=64,frag=no), actions:ct_clear,set(eth(src=00:00:00:01:02:04,dst=00:00:00:00:00:00)),set(ipv4(src=192.168.1.2/255.255.255.254,dst=8.0.0.0/248.0.0.0,ttl=63)),userspace(pid=0,controller(reason=1,dont_send=0,continuation=0,recirc_id=1,rule_cookie=0x62066318,controller_id=0,max_len=65535)) is correctly deleted during revalidation due to the presence of the new rule in table 66, while with Zoltan's patch that same datapath flow entry remains untouched. It seems as if with the patch the revalidation of the datapath flow does not reach the new rule in table 66. Consequently the next IP packet injected by netdev-dummy/receive then matches the datapath flow and is sent up to OVN again, where it triggers the same ARP query procedure as before. However, if I add a "sleep 11" before injecting the second IP packet to let the datapath flow entry time out, the test succeeds, proving that the OF pipeline behaves correctly for a real packet but not when revalidating the datapath flow entry with the userspace(controller(...)) action. I still do not understand how the (passive) tunnel ARP snooping patch can possibly change the translation behavior for an IP packet in such a way that it affects the revalidation result. Any idea is welcome. 
Regards, Jan

BTW: The “pseudo” tunnel neighbor cache entry we get in this test on master for the tenant IP address 10.0.0.2

IP           MAC                Bridge
======================================
10.0.0.2     f0:00:00:01:02:04  br-int
192.168.0.2  5e:97:6a:82:7d:41  br-phys

is a good example why we need Zoltan's patch. Any IP address in an ARP reply is blindly inserted into the tunnel neighbor cache. Overlapping IP addresses among tenants can cause frequent overwriting of cache entries, in the worst case leading to continuous configuration changes and revalidation. > -----Original Message----- > From: Zoltán Balogh > Sent: Friday, 02 February, 2018 15:42 > To: Justin Pettit <jpet...@ovn.org>; Ben Pfaff <b...@ovn.org> > Cc: g...@ovn.org; d...@openvswitch.org; Jan Scheurich > <jan.scheur...@ericsson.com> > Subject: RE: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from > terminate_native_tunnel() > > Hi Justin, > > I rebased the patches to recent master. Please find them attached. > > Best regards, > Zoltan > > > -----Original Message----- > > From: Justin Pettit [mailto:jpet...@ovn.org] > > Sent: Friday, February 02, 2018 12:00 AM > > To: Ben Pfaff <mailto:b...@ovn.org> > > Cc: Zoltán Balogh <mailto:zoltan.bal...@ericsson.com>; mailto:g...@ovn.org; > > mailto:d...@openvswitch.org; Jan Scheurich > > <mailto:jan.scheur...@ericsson.com> > > Subject: Re: [ovs-dev] [PATCH v3 3/3] xlate: call tnl_neigh_snoop() from > > terminate_native_tunnel() > > > > I wasn't able to get this patch to apply to the tip of master. Zoltan, can > > you rebase this patch and repost? > > > > The main thing my patch series does is make it so that packets that have a > > controller action aren't processed > > entirely in userspace. If, for example, the patches expect packets to be > > in userspace without an explicit slow-path > > request when generating the datapath flow, then that would be a problem. 
> > > > --Justin > > > > > > > On Feb 1, 2018, at 2:17 PM, Ben Pfaff <mailto:b...@ovn.org> wrote: > > > > > > Justin, I think this is mainly a question about your patches, can you > > > take a look? > > > > > > On Fri, Jan 26, 2018 at 01:08:35PM +, Zoltán Balogh wrote: > > >> Hi, > > >> > > >> I've been investigating the failing unit test. I can confirm, it does > > >> fail > > >> with my series on master. However, when I created the series and sent it > > >> to > > >> the mailing list it did not. > > >> > > >> I've rebased my series to this commit before I sent it to the mailing >
Re: [ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set
> -----Original Message----- > From: Ben Pfaff [mailto:b...@ovn.org] > Sent: Tuesday, 03 April, 2018 20:36 > > On Mon, Mar 26, 2018 at 09:36:27AM +0200, Jan Scheurich wrote: > > The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field > > actions in ofpact_is_set_or_move_action(). This caused them to be executed > > twice in the action set or a group bucket, once explicitly in > > ofpacts_execute_action_set() and once again as part of the list of > > set_field or move actions. > > > > Fixes: f839892a ("OF support and translation of generic encap and decap") > > Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action") > > > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> > > > > --- > > > > The fix should be backported to OVS 2.9 and OVS 2.8 (without the case > > for OFPACT_DEC_NSH_TTL introduced in 2.9). > > Thanks, I applied this to master and backported to branch-2.9 and > branch-2.8. Thanks! > I want to encourage you to add a test that uses one or both of these > actions in an action set, so that we get some test coverage. Will do when I find the time. I thought about that but didn't want to delay the fix. Regards, Jan
Re: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer VTEP mac changed
Hi Ychen, If your tunnel NH is moving IP addresses between MAC addresses or changing the MAC address of an interface hosting the NH IP, I think it should send a GARP to inform the connected subnet about this change. Otherwise the neighbors will blackhole traffic by sending to the wrong MAC address until they refresh their ARP cache. What kind of tunnel NH are you using? A Linux host? OVS has no means to buffer packets while it is waiting for the ARP reply for an ARP request it has sent out to resolve the tunnel NH. The OVS datapath (including the slowpath) is essentially stateless. This is quite different from the Linux kernel. Your suggestion to let OVS snoop on any incoming tunnel packet to refresh the ARP cache for the tunnel NH seems impractical to me. Normally the tunnel packets are matched by megaflows in the netdev datapath, the tunnel headers are stripped and the packet is recirculated. All this happens down in the datapath, while the ARP cache lives in the ofproto layer. A safe way would be to trigger an ARP refresh in the ofproto layer for every known tunnel neighbor some time before the expiry of its cache entry, similar to what real IP stacks do. But that would duplicate the already available function on the host. I'd rather try to find out why the host is not refreshing the ARP entry or why OVS does not detect the reply to the ARP refresh. BR, Jan From: ychen [mailto:ychen103...@163.com] Sent: Wednesday, 28 March, 2018 06:17 To: Jan Scheurich <jan.scheur...@ericsson.com> Cc: d...@openvswitch.org; Manohar Krishnappa Chidambaraswamy <manohar.krishnappa.chidambarasw...@ericsson.com> Subject: RE: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer VTEP mac changed Hi, Jan, Thanks for your reply. We have already modified the code to snoop on the GARP packets, but these 2 problems still exist. 
I think the main problem is that GARP packets are not sent from interfaces when we change the NIC MAC address or IP address (reading the linux kernel code, there is no such process), so we must depend on a data packet to trigger the ARP request. I know that in the linux kernel, when an ARP request is triggered, data packets are cached for a specified time, so the first data packet can still be sent out when the ARP reply is received. For the second problem, can we update the tunnel neigh cache when we receive a data packet from the remote VTEP? Since we can fetch tun_src and the outer MAC SA from the data packet. At 2018-03-28 04:41:12, "Jan Scheurich" <jan.scheur...@ericsson.com<mailto:jan.scheur...@ericsson.com>> wrote: >Hi Ychen, > >Funny! Again we are already working on a solution for problem 1. > >In our scenario the situation arises with a tunnel next hop being a VRRP >switch pair. The switch sends periodic gratuitous ARPs (GARPs) to announce the >VRRP IP but OVS native tunneling doesn't snoop on GARPs, only on ARP >replies. The host IP stack, on the other hand, accepts these GARPs and stops >sending refresh ARP requests itself. Hence nothing for OVS to snoop upon. > >The solution is to make OVS snoop on GARP requests also. > >It is quite possible that this will also fix your problem 2. If you also have >a VRRP tunnel next hop which just moves its VRRP IP address but not the MAC >address, it should send a GARP with the new IP/MAC mapping when it moves the IP >address, which would now update OVS' tunnel neighbor cache. > >@Mano: Can you submit the GARP patch in the near future? 
> >BR, Jan > >> -Original Message- >> From: >> ovs-dev-boun...@openvswitch.org<mailto:ovs-dev-boun...@openvswitch.org> >> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of ychen >> Sent: Tuesday, 27 March, 2018 14:44 >> To: d...@openvswitch.org<mailto:d...@openvswitch.org> >> Subject: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer >> VTEP mac changed >> >> Hi, >>I found that sometime userspace vxlan can not work happily. >>1. first data packet loss >> when tunnel neigh cache is empty, then the first data packet >> triggered sending ARP packet to peer VTEP, and the data packet >> dropped, >> tunnel neigh cache added this entry when receive ARP reply packet. >> >> err = tnl_neigh_lookup(out_dev->xbridge->name, _ip6, ); >>if (err) { >> xlate_report(ctx, OFT_DETAIL, >> "neighbor cache miss for %s on bridge %s, " >> "sending %s request", >> buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND"); >> if (d_ip) { >> tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip); >> } else
Re: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer VTEP mac changed
Hi Ychen, Funny! Again we are already working on a solution for problem 1. In our scenario the situation arises with a tunnel next hop being a VRRP switch pair. The switch sends periodic gratuitous ARPs (GARPs) to announce the VRRP IP, but OVS native tunneling doesn't snoop on GARPs, only on ARP replies. The host IP stack, on the other hand, accepts these GARPs and stops sending refresh ARP requests itself. Hence there is nothing for OVS to snoop upon. The solution is to make OVS snoop on GARP requests also. It is quite possible that this will also fix your problem 2. If you also have a VRRP tunnel next hop which just moves its VRRP IP address but not the MAC address, it should send a GARP with the new IP/MAC mapping when it moves the IP address, which would now update OVS' tunnel neighbor cache. @Mano: Can you submit the GARP patch in the near future? BR, Jan > -Original Message- > From: ovs-dev-boun...@openvswitch.org > [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of ychen > Sent: Tuesday, 27 March, 2018 14:44 > To: d...@openvswitch.org > Subject: [ovs-dev] can not update userspace vxlan tunnel neigh mac when peer > VTEP mac changed > > Hi, > I found that sometimes userspace vxlan does not work properly. > 1. first data packet loss > when the tunnel neigh cache is empty, the first data packet > triggers sending an ARP packet to the peer VTEP, and the data packet > is dropped; > the tunnel neigh cache adds this entry when the ARP reply packet is received. > > err = tnl_neigh_lookup(out_dev->xbridge->name, &d_ip6, &dmac); > if (err) { > xlate_report(ctx, OFT_DETAIL, > "neighbor cache miss for %s on bridge %s, " > "sending %s request", > buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND"); > if (d_ip) { > tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip); > } else { > tnl_send_nd_request(ctx, out_dev, smac, &s_ip6, &d_ip6); > } > return err; > } > > > 2.
connection lost when peer VTEP mac changed > when the VTEP MAC is already in the tunnel neigh cache, e.g.: > 10.182.6.81 fa:eb:26:c3:16:a5 br-phy > > so when a data packet comes in, it will use this MAC for encapsulating the outer > VXLAN header. > but the VTEP 10.182.6.81 MAC changed from fa:eb:26:c3:16:a5 to > 24:eb:26:c3:16:a5 because the NIC changed. > > data packets continue to be sent with the old MAC fa:eb:26:c3:16:a5, but the > peer VTEP will not accept these packets because the MAC > does not match. > the wrong tunnel neigh entry does not age out until the data packets stop being sent. > > > if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) { > tnl_neigh_snoop(flow, wc, ctx->xbridge->name); > } > > > 3. is anybody working on these problems? > > > > ___ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev ___ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
> -Original Message- > From: Stokes, Ian [mailto:ian.sto...@intel.com] > Sent: Tuesday, 27 March, 2018 16:21 > To: Ilya Maximets <i.maxim...@samsung.com>; Jan Scheurich > <jan.scheur...@ericsson.com>; d...@openvswitch.org > Cc: ktray...@redhat.com; O Mahony, Billy <billy.o.mah...@intel.com> > Subject: RE: [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs > > > Comments inline. > > > > Best regards, Ilya Maximets. > > > > On 18.03.2018 20:55, Jan Scheurich wrote: > > > This patch instruments the dpif-netdev datapath to record detailed > > > statistics of what is happening in every iteration of a PMD thread. > > > > > > The collection of detailed statistics can be controlled by a new > > > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". > > > By default it is disabled. The run-time overhead, when enabled, is > > > in the order of 1%. > > > > > [snip] > > > > +} > > > +if (tx_packets > 0) { > > > +ds_put_format(str, > > > +" Tx packets: %12"PRIu64" (%.0f Kpps)\n" > > > +" Tx batches: %12"PRIu64" (%.2f pkts/batch)" > > > +"\n", > > > +tx_packets, (tx_packets / duration) / 1000, > > > +tx_batches, 1.0 * tx_packets / tx_batches); > > > +} else { > > > +ds_put_format(str, > > > +" Tx packets: %12"PRIu64"\n" > > > +"\n", > > > +0ULL); > > > > I have a few interesting warnings on 64bit ARMv8. 
> > > > Clang: > > > > lib/dpif-netdev-perf.c:216:17: error: format specifies type 'unsigned > > long' but the argument has type 'unsigned long long' [-Werror,-Wformat] > > 0ULL); > > ^~~~ > > lib/dpif-netdev-perf.c:229:17: error: format specifies type 'unsigned > > long' but the argument has type 'unsigned long long' [-Werror,-Wformat] > > 0ULL); > > ^~~~ > > > > GCC: > > > > lib/dpif-netdev-perf.c: In function ‘pmd_perf_format_overall_stats’: > > lib/dpif-netdev-perf.c:215:17: error: format ‘%lu’ expects argument of > > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’ > > [-Werror=format=] > > " Rx packets: %12"PRIu64"\n", > > ^ > > lib/dpif-netdev-perf.c:227:17: error: format ‘%lu’ expects argument of > > type ‘long unsigned int’, but argument 3 has type ‘long long unsigned int’ > > [-Werror=format=] > > " Tx packets: %12"PRIu64"\n" > > ^ > > > > Both are coming from the fact that PRIu64 expands to '%lu'. > > Why we need this printing at all? Can we just print 0 in a string? > > Otherwise, the only way to fix these warnings is to cast 0 directly to > > uint64_t. > > I see the same in Travis. > > In the v9 of the series the format used was 0UL. This allowed compilation in > Travis except for when compiling OVS with the 32 bit flag. > From the logs the introduction of 0ULL seems to avoid the issue for 32 bit > compilation but introduces the problem for 64 bit > compilation. > > I don’t see a way around it either without casting. > > Ian I'll work around this by printing "0" as a string :-)
Re: [ovs-dev] [PATCH v6] Configurable Link State Change (LSC) detection mode
Hi Ilya, This patch is the upstream version of a fix we implemented downstream a year ago to fix the issue with massive packet drop of OVS-DPDK on Fortville NICs. The root cause of this packet drop was the extended blocking of the ovs-vswitchd by the i40e PMD during the rte_eth_link_get_nowait() function, which caused the PMDs to hang for up to 40ms during upcalls. At the time switching to LSC interrupt was the only viable solution. OVS still polls the link state from DPDK with rte_eth_link_get_nowait() but DPDK returns the locally buffered link state, updated through LSC interrupt, instead of going through the FVL's admin queue. Now, the new i40e PMD fix in DPDK bypassing the admin queue should also solve the problem with Fortville NICs. It would have to be backported to older DPDK releases to be useful as a fix for OVS 2.6, 2.7, 2.8 and 2.9, whereas the LSC interrupt solution in OVS would work as a back-port for all OVS versions since 2.6. That's why we still think there is value in pursuing the LSC interrupt track. BR, Jan > -Original Message- > From: ovs-dev-boun...@openvswitch.org > [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets > Sent: Tuesday, 27 March, 2018 13:17 > To: Stokes, Ian; Róbert Mulik > ; d...@openvswitch.org > Subject: Re: [ovs-dev] [PATCH v6] Configurable Link State Change (LSC) > detection mode > > On 27.03.2018 13:19, Stokes, Ian wrote: > >> It is possible to change LSC detection mode to polling or interrupt mode > >> for DPDK interfaces. The default is polling mode. To set interrupt mode, > >> option dpdk-lsc-interrupt has to be set to true. > >> > >> In polling mode more processor time is needed, since the OVS repeatedly > >> reads the link state with a short period. It can lead to packet loss for > >> certain systems. > >> > >> In interrupt mode the hardware itself triggers an interrupt when link > >> state change happens, so less processing time is needed for the OVS.
> >> > >> For detailed description and usage see the dpdk install documentation. > > Could you, please, better describe why we need this change? > Because we're not removing the polling thread. OVS will still > poll the link states periodically. This config option has > no effect on that side. Also, link state polling in OVS uses > 'rte_eth_link_get_nowait()' function which will be called in both > cases and should not wait for hardware reply in any implementation. > > There was recent bug fix for intel NICs that fixes waiting of an > admin queue on link state requests despite of 'no_wait' flag: > http://dpdk.org/ml/archives/dev/2018-March/092156.html > Will this fix your target case? > > So, the difference of execution time of 'rte_eth_link_get_nowait()' > with enabled and disabled interrupts should be not so significant. > Do you have performance measurements? Measurement with above fix applied? > > > > > > Thanks for working on this Robert. > > > > I've completed some testing including the case where LSC is not supported, > > in which case the port will remain in a down state and > fail rx/tx traffic. This behavior conforms to the netdev_reconfigure > expectations in the fail case so that's ok. > > I'm not sure if this is acceptable. For example, we're not failing > reconfiguration in case of issues with number of queues. We're trying > different numbers until we have working configuration. > Maybe we need the same fall-back mechanism in case of not supported LSC > interrupts? (MTU setup errors are really uncommon unlike LSC interrupts' > support in PMDs). > > > > > I'm a bit late to the thread but I have a few other comments below. > > > > I'd like to get this patch in the next pull request if possible so I'd > > appreciate if others can give any comments on the patch also. > > > > Thanks > > Ian > > > >> > >> Signed-off-by: Robert Mulik > >> --- > >> v5 -> v6: > >> - DPDK install documentation updated. 
> >> - Status of lsc_interrupt_mode of DPDK interfaces can be read by command > >> ovs-appctl dpif/show. > >> - It was suggested to check if the HW supports interrupt mode, but it is > >> not > >> possible to do without DPDK code change, so it is skipped from this > >> patch. > >> --- > >> Documentation/intro/install/dpdk.rst | 33 > >> + > >> lib/netdev-dpdk.c| 24 ++-- > >> vswitchd/vswitch.xml | 17 + > >> 3 files changed, 72 insertions(+), 2 deletions(-) > >> > >> diff --git a/Documentation/intro/install/dpdk.rst > >> b/Documentation/intro/install/dpdk.rst > >> index ed358d5..eb1bc7b 100644 > >> --- a/Documentation/intro/install/dpdk.rst > >> +++ b/Documentation/intro/install/dpdk.rst > >> @@ -628,6 +628,39 @@ The average number of packets per output batch can be > >> checked in PMD stats:: > >> > >> $ ovs-appctl dpif-netdev/pmd-stats-show > >> > >> +Link State
Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance stats for PMDs
Hi Aaron, Thanks for the feedback. A few good suggestions are always welcome. I will include fixes for your comments in the (hopefully) final version. Regards, Jan > -Original Message- > From: Aaron Conole [mailto:acon...@redhat.com] > Sent: Monday, 26 March, 2018 23:27 > To: Jan Scheurich <jan.scheur...@ericsson.com> > Cc: d...@openvswitch.org; i.maxim...@samsung.com > Subject: Re: [ovs-dev] [PATCH v10 2/3] dpif-netdev: Detailed performance > stats for PMDs > > Hi Jan, > > Some stylistic type comments follow. Sorry to jump in at the end - but > you asked for checkpatch changes, so I improved and ran it against your > patch and found some stuff for which I have an opinion. :) Maybe > nothing to hold up merging but cleanup stuff. > > Jan Scheurich <jan.scheur...@ericsson.com> writes: > > > This patch instruments the dpif-netdev datapath to record detailed > > statistics of what is happening in every iteration of a PMD thread. > > > > The collection of detailed statistics can be controlled by a new > > Open_vSwitch configuration parameter "other_config:pmd-perf-metrics". > > By default it is disabled. The run-time overhead, when enabled, is > > in the order of 1%. > > > > The covered metrics per iteration are: > > - cycles > > - packets > > - (rx) batches > > - packets/batch > > - max. vhostuser qlen > > - upcalls > > - cycles spent in upcalls > > > > This raw recorded data is used threefold: > > > > 1. In histograms for each of the following metrics: > >- cycles/iteration (log.) > >- packets/iteration (log.) > >- cycles/packet > >- packets/batch > >- max. vhostuser qlen (log.) > >- upcalls > >- cycles/upcall (log) > >The histogram bins are divided linearly or logarithmically. > > > > 2. A cyclic history of the above statistics for 999 iterations > > > > 3. A cyclic history of the cumulative/average values per millisecond > >wall clock for the last 1000 milliseconds: > >- number of iterations > >- avg. cycles/iteration > >- packets (Kpps) > >- avg.
packets/batch > >- avg. max vhost qlen > >- upcalls > >- avg. cycles/upcall > > > > The gathered performance metrics can be printed at any time with the > > new CLI command > > > > ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len] > > [-pmd core] [dp] > > > > The options are > > > > -nh:Suppress the histograms > > -it iter_len: Display the last iter_len iteration stats > > -ms ms_len: Display the last ms_len millisecond stats > > -pmd core: Display only the specified PMD > > > > The performance statistics are reset with the existing > > dpif-netdev/pmd-stats-clear command. > > > > The output always contains the following global PMD statistics, > > similar to the pmd-stats-show command: > > > > Time: 15:24:55.270 > > Measurement duration: 1.008 s > > > > pmd thread numa_id 0 core_id 1: > > > > Cycles:2419034712 (2.40 GHz) > > Iterations:572817 (1.76 us/it) > > - idle:486808 (15.9 % cycles) > > - busy: 86009 (84.1 % cycles) > > Rx packets: 2399607 (2381 Kpps, 848 cycles/pkt) > > Datapath passes: 3599415 (1.50 passes/pkt) > > - EMC hits:336472 ( 9.3 %) > > - Megaflow hits: 3262943 (90.7 %, 1.00 subtbl lookups/hit) > > - Upcalls: 0 ( 0.0 %, 0.0 us/upcall) > > - Lost upcalls: 0 ( 0.0 %) > > Tx packets: 2399607 (2381 Kpps) > > Tx batches:171400 (14.00 pkts/batch) > > > > Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> > > Acked-by: Billy O'Mahony <billy.o.mah...@intel.com> > > --- > > NEWS| 3 + > > lib/automake.mk | 1 + > > lib/dpif-netdev-perf.c | 350 > > +++- > > lib/dpif-netdev-perf.h | 258 ++-- > > lib/dpif-netdev-unixctl.man | 157 > > lib/dpif-netdev.c | 183 +-- > > manpages.mk | 2 + > > vswitchd/ovs-vswitchd.8.in | 27 +--- > > vswitchd/vswitch.xml| 12 ++ > > 9 files changed, 940 insertions(+), 53 deletions(-) > > create mode 100644 lib/dpif-netdev-unixctl.man > > > > diff --git a/NEWS b/NEWS
[ovs-dev] [PATCH] xlate: Correct handling of double encap() actions
When the same encap() header was pushed twice onto a packet (e.g. in the case of NSH in NSH), the translation logic only generated a datapath push action for the first encap() action. The second encap() did not emit a push action because the packet type was unchanged. commit_encap_decap_action() (renamed from commit_packet_type_change) must solely rely on ctx->pending_encap to generate a datapath push action. Similarly, the first decap() action on a double header packet does not change the packet_type either. Add a corresponding ctx->pending_decap flag and use that to trigger emitting a datapath pop action. Fixes: f839892a2 ("OF support and translation of generic encap and decap") Fixes: 1fc11c594 ("Generic encap and decap support for NSH") Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- This fix should be backported to OVS 2.8 and 2.9 lib/odp-util.c | 16 ++-- lib/odp-util.h | 1 + ofproto/ofproto-dpif-xlate.c | 7 ++- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/lib/odp-util.c b/lib/odp-util.c index 8743503..6db241a 100644 --- a/lib/odp-util.c +++ b/lib/odp-util.c @@ -7446,17 +7446,13 @@ odp_put_push_nsh_action(struct ofpbuf *odp_actions, } static void -commit_packet_type_change(const struct flow *flow, +commit_encap_decap_action(const struct flow *flow, struct flow *base_flow, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool pending_encap, + bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { -if (flow->packet_type == base_flow->packet_type) { -return; -} - if (pending_encap) { switch (ntohl(flow->packet_type)) { case PT_ETH: { @@ -7481,7 +7477,7 @@ commit_packet_type_change(const struct flow *flow, * The check is done at action translation. */ OVS_NOT_REACHED(); } -} else { +} else if (pending_decap || flow->packet_type != base_flow->packet_type) { /* This is an explicit or implicit decap case.
*/ if (pt_ns(flow->packet_type) == OFPHTN_ETHERTYPE && base_flow->packet_type == htonl(PT_ETH)) { @@ -7520,14 +7516,14 @@ enum slow_path_reason commit_odp_actions(const struct flow *flow, struct flow *base, struct ofpbuf *odp_actions, struct flow_wildcards *wc, - bool use_masked, bool pending_encap, + bool use_masked, bool pending_encap, bool pending_decap, struct ofpbuf *encap_data) { enum slow_path_reason slow1, slow2; bool mpls_done = false; -commit_packet_type_change(flow, base, odp_actions, wc, - pending_encap, encap_data); +commit_encap_decap_action(flow, base, odp_actions, wc, + pending_encap, pending_decap, encap_data); commit_set_ether_action(flow, base, odp_actions, wc, use_masked); /* Make packet a non-MPLS packet before committing L3/4 actions, * which would otherwise do nothing. */ diff --git a/lib/odp-util.h b/lib/odp-util.h index 1fad159..6fcd1bb 100644 --- a/lib/odp-util.h +++ b/lib/odp-util.h @@ -283,6 +283,7 @@ enum slow_path_reason commit_odp_actions(const struct flow *, struct flow_wildcards *wc, bool use_masked, bool pending_encap, + bool pending_decap, struct ofpbuf *encap_data); /* ofproto-dpif interface. diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c index bc6429c..326c088 100644 --- a/ofproto/ofproto-dpif-xlate.c +++ b/ofproto/ofproto-dpif-xlate.c @@ -243,6 +243,8 @@ struct xlate_ctx { * true. */ bool pending_encap; /* True when waiting to commit a pending * encap action. */ +bool pending_decap; /* True when waiting to commit a pending + * decap action. */ struct ofpbuf *encap_data; /* May contain a pointer to an ofpbuf with * context for the datapath encap action.*/ @@ -3477,8 +3479,9 @@ xlate_commit_actions(struct xlate_ctx *ctx) ctx->xout->slow |= commit_odp_actions(&ctx->xin->flow, &ctx->base_flow, ctx->odp_actions, ctx->wc, use_masked, ctx->pending_encap, - ctx->encap_data); + ctx->pending_decap, ctx->encap_data); c
[ovs-dev] [PATCH] ofp-actions: Correct execution of encap/decap actions in action set
The actions encap, decap and dec_nsh_ttl were wrongly flagged as set_field actions in ofpact_is_set_or_move_action(). This caused them to be executed twice in the action set or a group bucket, once explicitly in ofpacts_execute_action_set() and once again as part of the list of set_field or move actions. Fixes: f839892a ("OF support and translation of generic encap and decap") Fixes: 491e05c2 ("nsh: add dec_nsh_ttl action") Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com> --- The fix should be backported to OVS 2.9 and OVS 2.8 (without the case for OFPACT_DEC_NSH_TTL introduced in 2.9). lib/ofp-actions.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c index db85716..87797bc 100644 --- a/lib/ofp-actions.c +++ b/lib/ofp-actions.c @@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a) case OFPACT_SET_TUNNEL: case OFPACT_SET_VLAN_PCP: case OFPACT_SET_VLAN_VID: -case OFPACT_ENCAP: -case OFPACT_DECAP: -case OFPACT_DEC_NSH_TTL: return true; case OFPACT_BUNDLE: case OFPACT_CLEAR_ACTIONS: @@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a) case OFPACT_WRITE_METADATA: case OFPACT_DEBUG_RECIRC: case OFPACT_DEBUG_SLOW: +case OFPACT_ENCAP: +case OFPACT_DECAP: +case OFPACT_DEC_NSH_TTL: return false; default: OVS_NOT_REACHED();
-- 1.9.1
Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket of group
Thanks for the confirmation Yi. I will post the fix straight away. The other fix for double encap() is also ready. BR, Jan > -Original Message- > From: Yang, Yi [mailto:yi.y.y...@intel.com] > Sent: Monday, 26 March, 2018 03:42 > To: Jan Scheurich <jan.scheur...@ericsson.com> > Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com> > Subject: Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket > of group > > I tried the below fix patch you mentioned, it did fix this issue. > > diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c > index db85716..87797bc 100644 > --- a/lib/ofp-actions.c > +++ b/lib/ofp-actions.c > @@ -6985,9 +6985,6 @@ ofpact_is_set_or_move_action(const struct ofpact *a) > case OFPACT_SET_TUNNEL: > case OFPACT_SET_VLAN_PCP: > case OFPACT_SET_VLAN_VID: > -case OFPACT_ENCAP: > -case OFPACT_DECAP: > -case OFPACT_DEC_NSH_TTL: > return true; > case OFPACT_BUNDLE: > case OFPACT_CLEAR_ACTIONS: > @@ -7025,6 +7022,9 @@ ofpact_is_set_or_move_action(const struct ofpact *a) > case OFPACT_WRITE_METADATA: > case OFPACT_DEBUG_RECIRC: > case OFPACT_DEBUG_SLOW: > +case OFPACT_ENCAP: > +case OFPACT_DECAP: > +case OFPACT_DEC_NSH_TTL: > return false; > default: > OVS_NOT_REACHED(); > > On Mon, Mar 26, 2018 at 12:45:46AM +, Yang, Yi Y wrote: > > Jan, thank you so much, very exhaustive analysis :), I'll double check your > > fix patch. > > > > From: Jan Scheurich [mailto:jan.scheur...@ericsson.com] > > Sent: Sunday, March 25, 2018 9:09 AM > > To: Yang, Yi Y <yi.y.y...@intel.com> > > Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com> > > Subject: RE: OVS will hit an assert if encap(nsh) is done in bucket of group > > > > > > Hi Yi, > > > > > > > > Part of the seemingly strange behavior of the encap(nsh) action in a group > > is caused by the (often forgotten) fact that group buckets > do not contain action *lists* but action *sets*. 
I have no idea why it was > defined like this when groups were first introduced in > OpenFlow 1.1. In my view it was a bad decision and causes a lot of limitation > for using groups. But that's the way it is. > > > > > > > > In action sets there can only be one action of a kind (except for > > set_field, where there can be one action per target field). If there > are multiple actions of the same kind specified, only the last one taken, the > earlier ones ignored. > > > > > > > > Furthermore, the order of execution of the actions in the action set is not > > given by the order in which they are listed but defined by > the OpenFlow standard (see chapter 5.6 of OF spec 1.5.1). Of course the > generic encap() and decap() actions are not standardized yet, > so the OF spec doesn't specify where to put them in the sequence. We had to > implement something that follows the spirit of the > specification, knowing that whatever we chose may fit some but won't fit many > other legitimate use cases. > > > > > > > > OVS's order is defined in ofpacts_execute_action_set() in ofp-actions.c: > > > > OFPACT_STRIP_VLAN > > > > OFPACT_POP_MPLS > > > > OFPACT_DECAP > > > > OFPACT_ENCAP > > > > OFPACT_PUSH_MPLS > > > > OFPACT_PUSH_VLAN > > > > OFPACT_DEC_TTL > > > > OFPACT_DEC_MPLS_TTL > > > > OFPACT_DEC_NSH_TTL > > > > All OFP_ACT SET_FIELD and OFP_ACT_MOVE (target) > > > > OFPACT_SET_QUEUE > > > > > > > > Now, your specific group bucket use case: > > > > > > > >encap(nsh),set_field:->nsh_xxx,output:vxlan_gpe_port > > > > > > > > should be a lucky fit and execute as expected, whereas the analogous use > > case > > > > > > > >encap(nsh),set_field:->nsh_xxx,encap(ethernet), output:ethernet_port > > > > > > > > fails with the error > > > > > > > >Dropping packet as encap(ethernet) is not supported for packet type > > ethernet. 
> > > > > > > > because the second encap(ethernet) action replaces the encap(nsh) in the > > action set and is executed first on the original received > Ethernet packet. Boom! > > > > > > > > So, why does your valid use case cause an assertion failure? It's a > > consequence of two faults: > > > > > > 1. In the conversion of the group bucket's action list to the bucket > > acti
Re: [ovs-dev] OVS will hit an assert if encap(nsh) is done in bucket of group
Hi Yi, Part of the seemingly strange behavior of the encap(nsh) action in a group is caused by the (often forgotten) fact that group buckets do not contain action *lists* but action *sets*. I have no idea why it was defined like this when groups were first introduced in OpenFlow 1.1. In my view it was a bad decision and causes a lot of limitations for using groups. But that's the way it is. In action sets there can only be one action of a kind (except for set_field, where there can be one action per target field). If there are multiple actions of the same kind specified, only the last one is taken and the earlier ones are ignored. Furthermore, the order of execution of the actions in the action set is not given by the order in which they are listed but defined by the OpenFlow standard (see chapter 5.6 of OF spec 1.5.1). Of course the generic encap() and decap() actions are not standardized yet, so the OF spec doesn't specify where to put them in the sequence. We had to implement something that follows the spirit of the specification, knowing that whatever we chose may fit some but won't fit many other legitimate use cases. OVS's order is defined in ofpacts_execute_action_set() in ofp-actions.c: OFPACT_STRIP_VLAN OFPACT_POP_MPLS OFPACT_DECAP OFPACT_ENCAP OFPACT_PUSH_MPLS OFPACT_PUSH_VLAN OFPACT_DEC_TTL OFPACT_DEC_MPLS_TTL OFPACT_DEC_NSH_TTL All OFP_ACT SET_FIELD and OFP_ACT_MOVE (target) OFPACT_SET_QUEUE Now, your specific group bucket use case: encap(nsh),set_field:->nsh_xxx,output:vxlan_gpe_port should be a lucky fit and execute as expected, whereas the analogous use case encap(nsh),set_field:->nsh_xxx,encap(ethernet), output:ethernet_port fails with the error Dropping packet as encap(ethernet) is not supported for packet type ethernet. because the second encap(ethernet) action replaces the encap(nsh) in the action set and is executed first on the original received Ethernet packet. Boom! So, why does your valid use case cause an assertion failure?
It's a consequence of two faults: 1. In the conversion of the group bucket's action list to the bucket action set in ofpacts_execute_action_set() the action list is filtered with ofpact_is_set_or_move_action() to select the set_field actions. This function incorrectly flagged OFPACT_ENCAP, OFPACT_DECAP and OFPACT_DEC_NSH_TTL as set_field actions. That's why the encap(nsh) action is wrongly copied twice to the action set. 2. The translation of the second encap(nsh) action in the action set doesn't change the packet_type as it is already (1,0x894f). Hence, the commit_packet_type_change() triggered at output to the vxlan_gpe port fails to generate a second encap_nsh datapath action. The logic here is obviously not complete to cover the NSH in NSH use case that we intended to support and must be enhanced. The commit of the changes to the NSH header in commit_set_nsh_action() then triggers the assertion failure because the translation of the second encap(nsh) action did overwrite the original nsh_np (0x3 for Ethernet in NSH) in the flow with 0x4 (for NSH in NSH). Since it is not allowed to modify the nsh_np with set_field, this is what triggers the assertion. I believe this assertion to be correct. It did detect the combination of the above two faults. The solution to 1 is trivial. I'll post a bug fix straight away. That should suffice for your problem. The solution to 2 requires a bit more thinking. I will send a fix when I have found it. BR, Jan > -Original Message- > From: Yang, Yi [mailto:yi.y.y...@intel.com] > Sent: Friday, 23 March, 2018 08:55 > To: Jan Scheurich <jan.scheur...@ericsson.com> > Cc: d...@openvswitch.org; Zoltán Balogh <zoltan.bal...@ericsson.com> > Subject: Re: OVS will hit an assert if encap(nsh) is done in bucket of group > > On Fri, Mar 23, 2018 at 07:51:45AM +, Jan Scheurich wrote: > > Hi Yi, > > > > Could you please provide the OF pipeline (flows and groups) and an > > ofproto/trace command that triggers that fault?
> > > > Thanks, Jan > > Hi, Jan > > my br-int has the below ports: > > 1(dpdk0): addr:08:00:27:c6:9f:ff > config: 0 > state: LIVE > current:1GB-FD AUTO_NEG > speed: 1000 Mbps now, 0 Mbps max > 2(vxlangpe1): addr:16:04:0c:e5:f1:2c > config: 0 > state: LIVE > speed: 0 Mbps now, 0 Mbps max > 3(vxlan1): addr:da:1e:fb:2b:c8:63 > config: 0 > state: LIVE > speed: 0 Mbps now, 0 Mbps max > 4(veth-br): addr:92:3d:e0:ab:c2:85 > config: 0 > state: LIVE > current:10GB-FD COPPER > speed: 1 Mbps now, 0 Mbps max > LOCAL(br-int): addr:08:00:27:c6:9f:ff > config: 0 > state: LIVE > current:10MB-FD