Re: [ovs-dev] Conntrack performance drop in OVS 2.8

2018-07-04 Thread Nitin Katiyar
Hi,
Any suggestions/pointers on the following?

Regards,
Nitin

From: Nitin Katiyar
Sent: Friday, June 29, 2018 3:00 PM
To: ovs-dev@openvswitch.org
Subject: Conntrack performance drop in OVS 2.8

Hi,
The performance of OVS 2.8 (with DPDK 17.05.02) with conntrack configured
has dropped significantly (especially in the single-flow case) compared to
OVS 2.6 (with DPDK 16.11.4).

Following is the comparison between 2.6.2 and 2.8.2.
PKT Size   # of Flows   MPPS (OVS 2.6)   MPPS (OVS 2.8)   % Drop
64         1            3.37             1.86             44.71
128        1            3.09             1.74             43.52
256        1            2.66             1.15             56.84
64         1            1.73             1.51             13.03
128        1            1.68             1.46             12.65
256        1            1.55             1.34             13.60


OVS is configured with 2 DPDK ports (10G Intel 82599) bonded in bond-slb mode
and 1 vhost-user (VHU) port. The VM runs testpmd, which echoes the UDP packets.

I used the following OF rules:

ovs-ofctl add-flow br-int "table=0,priority=10,ct_state=-trk,udp,actions=ct(table=1)"
ovs-ofctl add-flow br-int "table=1,priority=1000,ct_state=+new+trk,udp,in_port=10,actions=strip_vlan,ct(commit),output:101"
ovs-ofctl add-flow br-int "table=1,priority=900,ct_state=+est+trk,in_port=10,actions=strip_vlan,output:101"
ovs-ofctl add-flow br-int "table=1,priority=900,ct_state=+est+trk,in_port=101,actions=push_vlan:0x8100,mod_vlan_vid:4044,output:10"

There are 2 bridges configured in OVS, connected through patch ports.
Port 10 above is the patch port and 101 is the VHU port.

The generator (running on a different server) sends UDP traffic, varying
the UDP source port.

Has anyone else experienced similar behavior?

Regards,
Nitin

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [iovisor-dev] [RFC PATCH 00/11] OVS eBPF datapath.

2018-07-04 Thread William Tu
On Tue, Jul 3, 2018 at 10:56 AM, Alexei Starovoitov
 wrote:
> On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
>> Hi Alexei,
>>
>> Thanks a lot for the feedback!
>>
>> On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
>>  wrote:
>> > On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:
>> >>
>> >> Discussion
>> >> ==========
>> >> We are still actively working on finishing the feature; currently
>> >> basic forwarding and the tunnel features work, but they are still under
>> >> heavy debugging and development.  The purpose of this RFC is to
>> >> get some early feedback and direction for finishing the complete
>> >> feature set in the existing kernel OVS datapath (net/openvswitch/*).
>> >
>> > Thank you for sharing the patches.
>> >
>> >> Three major issues we are worried:
>> >>   a. Megaflow support in BPF.
>> >>   b. Connection Tracking support in BPF.
>> >
>> > my opinion on the above two didn't change.
>> > To recap:
>> > A. A non-scalable megaflow map is a no-go. I'd like to see a packet
>> > classification algorithm like HiCuts or EffiCuts implemented instead,
>> > since it can be shared by generic bpf, bpfilter, ovs and likely others.
>>
>> We did try the decision tree approach using DPDK's ACL library. The lookup
>> speed is 6 times faster than the megaflow using tuple space search.
>> However, update/insertion requires rebuilding/re-balancing the decision
>> tree, so it's way too slow. I think HiCuts or EffiCuts suffer the same issue.
>> So decision tree algorithms are scalable only for lookup operations, due to
>> their optimization of tree depth, but not scalable under
>> update/insert/delete operations.
>>
>> On a customer's system we see a megaflow update/insert rate of around 10 rules/sec;
>> this makes decision trees unusable, unless we invent something to optimize the
>> update/insert time or an incremental update for these decision tree algorithms.
>
> is this a typo? you probably meant 10K rule updates a second ?
I mean "new" rules being added at 10 rules/sec.
Update rate might be much higher.

> Last time I've dealt with these algorithms we had 100K acl updates a second.
> It was an important metric that we were optimizing for.
> I'm pretty sure '*cuts' algos do many thousands per second non optimized.

When adding a new rule, do these algorithms require rebuilding the
entire tree?

In our evaluation, updating an existing entry in the decision tree
performs OK, because it is equivalent to lookup-and-replace: lookup
is fast, and the update is just an atomic swap. But inserting a new rule is
slow, because it requires re-building the tree over all existing rules.
And we see new rules being added at a rate of 10 rules per second,
so we are constantly rebuilding the entire tree.

If the entire tree has 100k rules, it takes around 2 seconds to rebuild
with the DPDK ACL library.  Without an incremental algorithm, adding 1 new
rule triggers a rebuild over all 100k rules, which is too slow.

Reading through HyperCuts and EffiCuts, I'm not sure how they support
incrementally adding a new rule without rebuilding the entire tree.
http://ccr.sigcomm.org/online/files/p207.pdf
http://cseweb.ucsd.edu/~susingh/papers/hyp-sigcomm03.pdf

The HyperCuts paper says:
"A fast update algorithm can also be implemented; however we do not
go into the details of incremental update in this paper"

>
>> >>   c. Verifier limitation.
>> >
>> > Not sure what limitations you're concerned about.
>> >
>>
>> Mostly related to the stack.  The flow key OVS uses (struct sw_flow_key)
>> is 464 bytes. We trim a lot, now down to around 300 bytes, but that is
>> still huge, considering the BPF stack limit is 512 bytes.
>
> have you tried using per-cpu array of one element with large value
> instead of stack?

yes, now we store the flow key in percpu array with 1 element.

> In the latest verifier most of the operations that can be done with the stack
> pointer can be done with pointer to map value too.
>
Once the flow key is stored in a map, another eBPF program
needs to use that key to look up the flow table (another map).
So we have to store the flow key on the stack first, in order to
use it as the key for the flow table map lookup.
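As we understand it, Alexei's suggestion would look roughly like the
following (an illustrative eBPF C sketch only, not the actual ovs-ebpf
code: the map names and the trimmed key/actions structs are invented, and
whether the verifier accepts a map-value pointer as a lookup key depends
on the kernel version):

```c
/* One-element per-CPU array used as scratch space for the ~300B flow key,
 * instead of the 512B-limited BPF stack. */
struct bpf_map_def SEC("maps") flow_key_scratch = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(struct flow_key),     /* hypothetical trimmed key */
    .max_entries = 1,
};

struct bpf_map_def SEC("maps") flow_table = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(struct flow_key),
    .value_size  = sizeof(struct flow_actions), /* hypothetical */
    .max_entries = 65536,
};

SEC("xdp")
int parse_and_lookup(struct xdp_md *ctx)
{
    __u32 zero = 0;
    struct flow_key *key = bpf_map_lookup_elem(&flow_key_scratch, &zero);

    if (!key)
        return XDP_ABORTED;

    /* ... parse packet headers from ctx into *key ... */

    /* On recent verifiers, a pointer to a map value is accepted wherever a
     * stack pointer is, so 'key' can be passed straight to the second
     * lookup without first copying it back onto the stack: */
    struct flow_actions *acts = bpf_map_lookup_elem(&flow_table, key);
    if (!acts)
        return XDP_PASS;

    /* ... apply actions ... */
    return XDP_PASS;
}
```

If that map-value-as-key path is accepted, the large key never needs to
live on the stack at all.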

Is there a way to work around it?
Thanks
William


[ovs-dev] Postulaciones Abiertas

2018-07-04 Thread Nuevos Doctorados 2018-2019


[ovs-dev] [PATCH v2 13/14] dpdk-tests: Accept other configs in OVS_DPDK_START

2018-07-04 Thread Tiago Lam
As it stands, OVS_DPDK_START() doesn't allow other configs to be set
before starting the ovs-vswitchd daemon. This is a problem since some
configs, such as "dpdk-multi-seg-mbufs=true" for enabling
multi-segment mbufs, need to be set prior to starting OvS.

To support other options, OVS_DPDK_START() has been modified to accept
extra configs in the form "$config_name=$config_value". It then uses
ovs-vsctl to set the configs.
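For example, a test needing multi-segment mbufs could then start OvS as
follows (hypothetical invocation; since the patch iterates with
m4_foreach_w, each whitespace-separated "$config_name=$config_value" pair
is applied via "ovs-vsctl --no-wait set Open_vSwitch . other_config:..."):

```
OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
```

Several options can also be given at once, e.g.
OVS_DPDK_START([dpdk-multi-seg-mbufs=true some-other-opt=1]), where
"some-other-opt" is just an illustrative name.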

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk-macros.at | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tests/system-dpdk-macros.at b/tests/system-dpdk-macros.at
index 0762ee0..485a0f3 100644
--- a/tests/system-dpdk-macros.at
+++ b/tests/system-dpdk-macros.at
@@ -21,7 +21,7 @@ m4_define([OVS_DPDK_PRE_CHECK],
 ])
 
 
-# OVS_DPDK_START()
+# OVS_DPDK_START([other-conf-args])
 #
 # Create an empty database and start ovsdb-server. Add special configuration
 # dpdk-init to enable DPDK functionality. Start ovs-vswitchd connected to that
@@ -48,6 +48,11 @@ m4_define([OVS_DPDK_START],
AT_CHECK([lscpu], [], [stdout])
AT_CHECK([cat stdout | grep "NUMA node(s)" | awk '{c=1; while (c++<$(3)) 
{printf "1024,"}; print "1024"}' > SOCKET_MEM])
AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . 
other_config:dpdk-socket-mem="$(cat SOCKET_MEM)"])
+   dnl Iterate through $other-conf-args list and include them
+   m4_foreach_w(opt, $1, [
+   AT_CHECK([ovs-vsctl --no-wait set Open_vSwitch . other_config:opt])
+   ])
+   dnl m4_if([$1], [], [], [AT_CHECK(["$1"])])
 
dnl Start ovs-vswitchd.
AT_CHECK([ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn 
-vofproto_dpif -vunixctl], [0], [stdout], [stderr])
-- 
2.7.4



[ovs-dev] [PATCH v2 14/14] dpdk-tests: End-to-end tests for multi-seg mbufs.

2018-07-04 Thread Tiago Lam
The following tests are added to the DPDK testsuite to add some
coverage for the multi-segment mbufs:
- Check that multi-segment mbufs are disabled by default;
- Check that providing `other_config:dpdk-multi-seg-mbufs=true` indeed
  enables multi-segment mbufs;
- Using a DPDK port, send a random packet out and check that `ofctl
  dump-flows` shows the correct number of packets and bytes sent.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/system-dpdk.at | 65 
 1 file changed, 65 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3d21b01..7468d49 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -71,3 +71,68 @@ OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel 
module is probably n
 ")
 AT_CLEANUP
 dnl --
+
+AT_SETUP([Jumbo frames - Multi-segment disabled by default])
+OVS_DPDK_START()
+
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [1], [])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment enabled])
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+AT_CHECK([grep "multi-segment mbufs enabled" ovs-vswitchd.log], [], [stdout])
+OVS_VSWITCHD_STOP("/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
+
+AT_SETUP([Jumbo frames - Multi-segment mbufs Tx])
+OVS_DPDK_PRE_CHECK()
+OVS_DPDK_START([dpdk-multi-seg-mbufs=true])
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br10 dpdk0 \
+-- set Interface dpdk0 type=dpdk options:dpdk-devargs=$(cat PCI_ADDR) \
+-- set Interface dpdk0 mtu_request=9000], [], [stdout], [stderr])
+
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+dnl Add flows to send packets out from the 'dpdk0' port
+AT_CHECK([
+ovs-ofctl del-flows br10
+ovs-ofctl add-flow br10 in_port=LOCAL,actions=output:dpdk0
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [], [stdout])
+
+dnl Send a packet out of the 'dpdk0' port
+AT_CHECK([
+ARP_HEADER="09000B0009000A000806000108000604000100010A\
+0100020A02"
+dnl Build a random hex string to append to the ARP_HEADER
+RANDOM_BODY=$(printf '0102030405%.0s' {1..1750})
+dnl 8792B ARP packet
+RANDOM_ARP="$ARP_HEADER$RANDOM_BODY"
+
+ovs-ofctl packet-out br10 "packet=$RANDOM_ARP,action=resubmit:LOCAL"
+], [], [stdout])
+
+AT_CHECK([ovs-ofctl dump-flows br10], [0], [stdout])
+
+dnl Confirm the single packet has been sent with the correct size
+AT_CHECK([ovs-ofctl dump-flows br10 | ofctl_strip | grep in_port], [0], [dnl
+ n_packets=1, n_bytes=8792, in_port=LOCAL actions=output:1
+])
+
+dnl Clean up
+OVS_VSWITCHD_STOP("/does not exist. The Open vSwitch kernel module is probably 
not loaded./d
+/Failed to enable flow control/d
+/failed to connect to \/tmp\/dpdkvhostclient0: No such file or directory/d
+/Global register is changed during/d
+/EAL: No free hugepages reported in hugepages-1048576kB/d
+")
+AT_CLEANUP
-- 
2.7.4



[ovs-dev] [PATCH v2 12/14] dpdk-tests: Add uni-tests for multi-seg mbufs.

2018-07-04 Thread Tiago Lam
In order to create a minimal environment that allows the tests to get
mbufs from an existing mempool, the following approach is taken:
- EAL is initialised (by using the main dpdk_init()) and a (very) small
  mempool is instantiated (mimicking the logic in dpdk_mp_create()).
  This mempool instance is global and used by all the tests;
- Packets are then allocated from the instantiated mempool, and tested
  on, by running some operations on them and manipulating data.

The tests introduced focus on testing DPDK dp_packets (where
source=DPBUF_DPDK), linked with a single or multiple mbufs, across
several operations, such as:
- dp_packet_put();
- dp_packet_shift();
- dp_packet_reserve();
- dp_packet_push_uninit();
- dp_packet_clear();
- And as a consequence of some of these, dp_packet_put_uninit() and
  dp_packet_resize__().

Finally, this has also been integrated with the new DPDK testsuite.
Thus, when running `$sudo make check-dpdk` one will also be running
these tests.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 tests/automake.mk  |  10 +-
 tests/dpdk-packet-mbufs.at |   7 +
 tests/system-dpdk-testsuite.at |   1 +
 tests/test-dpdk-mbufs.c| 518 +
 4 files changed, 535 insertions(+), 1 deletion(-)
 create mode 100644 tests/dpdk-packet-mbufs.at
 create mode 100644 tests/test-dpdk-mbufs.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 8224e5a..b3941d0 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -134,7 +134,8 @@ SYSTEM_DPDK_TESTSUITE_AT = \
tests/system-common-macros.at \
tests/system-dpdk-macros.at \
tests/system-dpdk-testsuite.at \
-   tests/system-dpdk.at
+   tests/system-dpdk.at \
+   tests/dpdk-packet-mbufs.at
 
 check_SCRIPTS += tests/atlocal
 
@@ -391,6 +392,10 @@ tests_ovstest_SOURCES = \
tests/test-vconn.c \
tests/test-aa.c \
tests/test-stopwatch.c
+if DPDK_NETDEV
+tests_ovstest_SOURCES = \
+   tests/test-dpdk-mbufs.c
+endif
 
 if !WIN32
 tests_ovstest_SOURCES += \
@@ -403,6 +408,9 @@ tests_ovstest_SOURCES += \
 endif
 
 tests_ovstest_LDADD = lib/libopenvswitch.la ovn/lib/libovn.la
+if DPDK_NETDEV
+tests_ovstest_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
+endif
 
 noinst_PROGRAMS += tests/test-strtok_r
 tests_test_strtok_r_SOURCES = tests/test-strtok_r.c
diff --git a/tests/dpdk-packet-mbufs.at b/tests/dpdk-packet-mbufs.at
new file mode 100644
index 000..f28e4fc
--- /dev/null
+++ b/tests/dpdk-packet-mbufs.at
@@ -0,0 +1,7 @@
+AT_BANNER([OVS-DPDK dp_packet unit tests])
+
+AT_SETUP([OVS-DPDK dp_packet - mbufs allocation])
+AT_KEYWORDS([dp_packet, multi-seg, mbufs])
+AT_CHECK(ovstest test-dpdk-packet, [], [ignore], [ignore])
+
+AT_CLEANUP
diff --git a/tests/system-dpdk-testsuite.at b/tests/system-dpdk-testsuite.at
index 382f09e..f5edf58 100644
--- a/tests/system-dpdk-testsuite.at
+++ b/tests/system-dpdk-testsuite.at
@@ -23,3 +23,4 @@ m4_include([tests/system-common-macros.at])
 m4_include([tests/system-dpdk-macros.at])
 
 m4_include([tests/system-dpdk.at])
+m4_include([tests/dpdk-packet-mbufs.at])
diff --git a/tests/test-dpdk-mbufs.c b/tests/test-dpdk-mbufs.c
new file mode 100644
index 000..1c77038
--- /dev/null
+++ b/tests/test-dpdk-mbufs.c
@@ -0,0 +1,518 @@
+/*
+ * Copyright (c) 2018 Intel Corporation
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "dp-packet.h"
+#include "ovstest.h"
+#include "dpdk.h"
+#include "smap.h"
+
+#define N_MBUFS 1024
+#define MBUF_DATA_LEN 2048
+
+int num_tests = 0;
+
+/* Global var to hold a mempool instance, "test-mp", used in all of the tests
+ * below. This instance is instantiated in dpdk_setup_eal_with_mp(). */
+static struct rte_mempool *mp;
+
+/* Test data used to fill the packets with data. Note that this isn't a string
+ * that represents a valid packet, by any means. The pattern is generated in
+ * set_testing_pattern_str() and its sole purpose is to verify the data remains
+ * the same after inserting and operating on multi-segment mbufs. */
+static char *test_str;
+
+/* Asserts a dp_packet that holds a single mbuf, where:
+ * - nb_segs must be 1;
+ * - pkt_len must be equal to data_len which in turn must equal the provided
+ *   'pkt_len';
+ * - data_off must start at the provided 'data_ofs';
+ * - next must be NULL. */
+static void
+assert_single_mbuf(struct dp_packet *pkt

[ovs-dev] [PATCH v2 11/14] netdev-dpdk: support multi-segment jumbo frames

2018-07-04 Thread Tiago Lam
From: Mark Kavanagh 

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 Documentation/topics/dpdk/jumbo-frames.rst | 51 +++
 NEWS   |  1 +
 lib/dpdk.c |  8 
 lib/netdev-dpdk.c  | 65 --
 lib/netdev-dpdk.h  |  2 +
 vswitchd/vswitch.xml   | 23 +++
 6 files changed, 146 insertions(+), 4 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst 
b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..d1fb111 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,54 @@ Jumbo frame support has been validated against 9728B frames, 
which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+---
+Multi-segment mbufs
+---
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together instead. In this approach each
+mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach may bring more flexibility in use cases where the maximum packet
+length may be hard to guess. For example, in cases where packets originate from
+sources marked for offload (such as TSO), each packet may be larger than the
+MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
+enough to hold all of the packet's data.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide on which approach to adopt on initialisation. If multi-segment
+mbufs is to be enabled, it can be done so with the following command::
+
+$ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs still remain the default when using OvS-DPDK, and the
+above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~
+Performance notes
+~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to the non-contiguous nature of the data. As a result this can
+hurt performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will vary between PMDs, and depending on the architecture in use.
+
+Tests performed with the i40e PMD driver only showed this limitation for 64B
+packets, and the same rate was observed when comparing multi-segment mbufs and
+single-segment mbuf for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised to be used with smaller
+packet sizes, such as 64B.
+
+Also, note that using multi-segment mbufs won't imp

[ovs-dev] [PATCH v2 10/14] netdev-dpdk: copy large packet to multi-seg. mbufs

2018-07-04 Thread Tiago Lam
From: Mark Kavanagh 

Currently, packets are only copied to a single segment in the function
dpdk_do_tx_copy(). This could be an issue in the case of jumbo frames,
particularly when multi-segment mbufs are involved.

This patch calculates the number of segments needed by a packet and
copies the data to each segment.

A new function, dpdk_buf_alloc(), has also been introduced as a wrapper
around the nonpmd_mp_mutex to serialise allocations from a non-pmd
context.

Co-authored-by: Michael Qiu 
Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Michael Qiu 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 78 ---
 1 file changed, 69 insertions(+), 9 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index b9003fa..1b59123 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -522,6 +522,22 @@ dpdk_rte_mzalloc(size_t sz)
 return rte_zmalloc(OVS_VPORT_DPDK, sz, OVS_CACHE_LINE_SIZE);
 }
 
+static struct rte_mbuf *
+dpdk_buf_alloc(struct rte_mempool *mp)
+{
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+}
+
+struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mp);
+
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+}
+
+return mbuf;
+}
+
 void
 free_dpdk_buf(struct dp_packet *packet)
 {
@@ -2199,6 +2215,49 @@ out:
 }
 }
 
+static int
+dpdk_copy_dp_packet_to_mbuf(struct dp_packet *packet, struct rte_mbuf **head,
+struct rte_mempool *mp)
+{
+struct rte_mbuf *mbuf, *fmbuf;
+uint32_t size = dp_packet_size(packet);
+uint16_t max_data_len;
+uint32_t nb_segs = 0;
+
+/* Allocate first mbuf to know the size of data available */
+fmbuf = mbuf = *head = dpdk_buf_alloc(mp);
+if (OVS_UNLIKELY(!mbuf)) {
+return ENOMEM;
+}
+
+/* All new allocated mbuf's max data len is the same */
+max_data_len = mbuf->buf_len - mbuf->data_off;
+
+/* Calculate # of output mbufs. */
+nb_segs = size / max_data_len;
+if (size % max_data_len) {
+nb_segs = nb_segs + 1;
+}
+
+/* Allocate additional mbufs, less the one alredy allocated above */
+for (int i = 1; i < nb_segs; i++) {
+mbuf->next = dpdk_buf_alloc(mp);
+if (!mbuf->next) {
+free_dpdk_buf(CONTAINER_OF(fmbuf, struct dp_packet, mbuf));
+fmbuf = NULL;
+return ENOMEM;
+}
+mbuf = mbuf->next;
+}
+
+fmbuf->nb_segs = nb_segs;
+fmbuf->pkt_len = size;
+
+dp_packet_mbuf_write(fmbuf, 0, size, dp_packet_data(packet));
+
+return 0;
+}
+
 /* Tx function. Transmit packets indefinitely */
 static void
 dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
@@ -2215,6 +2274,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
 uint32_t cnt = batch_cnt;
 uint32_t dropped = 0;
+uint32_t i;
 
 if (dev->type != DPDK_DEV_VHOST) {
 /* Check if QoS has been configured for this netdev. */
@@ -2225,28 +2285,28 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct 
dp_packet_batch *batch)
 
 uint32_t txcnt = 0;
 
-for (uint32_t i = 0; i < cnt; i++) {
+for (i = 0; i < cnt; i++) {
 struct dp_packet *packet = batch->packets[i];
 uint32_t size = dp_packet_size(packet);
+int err = 0;
 
 if (OVS_UNLIKELY(size > dev->max_packet_len)) {
 VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
  size, dev->max_packet_len);
-
 dropped++;
 continue;
 }
 
-pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
-if (OVS_UNLIKELY(!pkts[txcnt])) {
+err = dpdk_copy_dp_packet_to_mbuf(packet, &pkts[txcnt], dev->mp);
+if (err != 0) {
+if (err == ENOMEM) {
+VLOG_ERR_RL(&rl, "Failed to alloc mbufs! %u packets dropped",
+cnt - i);
+}
+
 dropped += cnt - i;
 break;
 }
-
-/* We have to do a copy for now */
-memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
-   dp_packet_data(packet), size);
-dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
 dp_packet_copy_mbuf_flags((struct dp_packet *)pkts[txcnt], packet);
 
 txcnt++;
-- 
2.7.4



[ovs-dev] [PATCH v2 06/14] dp-packet: Handle multi-seg mbufs in helper funcs.

2018-07-04 Thread Tiago Lam
Most helper functions in dp-packet assume that the data held by a
dp_packet is contiguous, and perform operations such as pointer
arithmetic under that assumption. However, with the introduction of
multi-segment mbufs, where data is non-contiguous, such assumptions are
no longer valid. Examples of such helper functions are
dp_packet_tail(), dp_packet_tailroom(), dp_packet_end(),
dp_packet_get_allocated() and dp_packet_at().

Thus, instead of assuming contiguous data in a dp_packet, they now
iterate over the (non-contiguous) data in mbufs to perform their
calculations.

Finally, dp_packet_use__() has also been modified to perform the
initialisation of the packet (and setting the source) before continuing
to set its size and data length, which now depends on the type of
packet.

Co-authored-by: Mark Kavanagh 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c |   4 +-
 lib/dp-packet.h | 114 +---
 2 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 782e7c2..2aaeaae 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -41,11 +41,11 @@ static void
 dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
  enum dp_packet_source source)
 {
+dp_packet_init__(b, allocated, source);
+
 dp_packet_set_base(b, base);
 dp_packet_set_data(b, base);
 dp_packet_set_size(b, 0);
-
-dp_packet_init__(b, allocated, source);
 }
 
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index fcbaf60..4880fed 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -185,9 +185,25 @@ dp_packet_delete(struct dp_packet *b)
 static inline void *
 dp_packet_at(const struct dp_packet *b, size_t offset, size_t size)
 {
-return offset + size <= dp_packet_size(b)
-   ? (char *) dp_packet_data(b) + offset
-   : NULL;
+if (offset + size > dp_packet_size(b)) {
+return NULL;
+}
+
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+while (buf && offset > buf->data_len) {
+offset -= buf->data_len;
+
+buf = buf->next;
+}
+
+return buf ? rte_pktmbuf_mtod_offset(buf, char *, offset) : NULL;
+}
+#endif
+
+return (char *) dp_packet_data(b) + offset;
 }
 
 /* Returns a pointer to byte 'offset' in 'b', which must contain at least
@@ -196,13 +212,23 @@ static inline void *
 dp_packet_at_assert(const struct dp_packet *b, size_t offset, size_t size)
 {
 ovs_assert(offset + size <= dp_packet_size(b));
-return ((char *) dp_packet_data(b)) + offset;
+return dp_packet_at(b, offset, size);
 }
 
 /* Returns a pointer to byte following the last byte of data in use in 'b'. */
 static inline void *
 dp_packet_tail(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+/* Find last segment where data ends, meaning the tail of the chained
+ *  mbufs must be there */
+buf = rte_pktmbuf_lastseg(buf);
+
+return rte_pktmbuf_mtod_offset(buf, void *, buf->data_len);
+}
+#endif
 return (char *) dp_packet_data(b) + dp_packet_size(b);
 }
 
@@ -211,6 +237,15 @@ dp_packet_tail(const struct dp_packet *b)
 static inline void *
 dp_packet_end(const struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *buf = CONST_CAST(struct rte_mbuf *, &(b->mbuf));
+
+buf = rte_pktmbuf_lastseg(buf);
+
+return (char *) buf->buf_addr + buf->buf_len;
+}
+#endif
 return (char *) dp_packet_base(b) + dp_packet_get_allocated(b);
 }
 
@@ -236,6 +271,15 @@ dp_packet_tailroom(const struct dp_packet *b)
 static inline void
 dp_packet_clear(struct dp_packet *b)
 {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+/* sets pkt_len and data_len to zero and frees unused mbufs */
+dp_packet_set_size(b, 0);
+rte_pktmbuf_reset(&b->mbuf);
+
+return;
+}
+#endif
 dp_packet_set_data(b, dp_packet_base(b));
 dp_packet_set_size(b, 0);
 }
@@ -252,12 +296,33 @@ dp_packet_pull(struct dp_packet *b, size_t size)
 return data;
 }
 
+#ifdef DPDK_NETDEV
+/* Similar to dp_packet_try_pull() but doesn't actually pull any data, only
+ * checks if it could and returns true or false accordingly.
+ *
+ * Valid for dp_packets carrying mbufs only. */
+static inline bool
+dp_packet_mbuf_may_pull(const struct dp_packet *b, size_t size) {
+if (size > b->mbuf.data_len) {
+return false;
+}
+
+return true;
+}
+#endif
+
 /* If 'b' has at least 'size' bytes of data, removes that many bytes from the
  * head end of 'b' and returns the first byte removed.  Otherwise, returns a
  * null pointer without mo

[ovs-dev] [PATCH v2 08/14] dp-packet: Handle multi-seg mbufs in resize__().

2018-07-04 Thread Tiago Lam
When enabled with DPDK, OvS relies on mbufs allocated from mempools to
receive and output data on DPDK ports. Until now, each OvS dp_packet has
had only one mbuf associated with it, allocated with the maximum
possible size, taking the MTU into account. This approach, however,
doesn't allow us to increase the allocated size of an mbuf if needed,
since an mbuf is allocated and initialised upon mempool creation. Thus,
in the current implementation this is dealt with by calling
OVS_NOT_REACHED() and terminating OvS.

To avoid this, and to make better use of the (already) allocated space,
dp_packet_resize__() now tries to use the available room, both the
tailroom and the headroom, to make enough space for the new data. Since
this happens for packets of source DPBUF_DPDK, the single-segment mbuf
case mentioned above is also covered by this new approach in resize__().

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 48 ++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index d6e19eb..87af459 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -237,9 +237,51 @@ dp_packet_resize__(struct dp_packet *b, size_t 
new_headroom, size_t new_tailroom
 new_allocated = new_headroom + dp_packet_size(b) + new_tailroom;
 
 switch (b->source) {
+/* When resizing mbufs, for both a single mbuf and multi-segment mbufs (where
+ * data is not contiguously held in memory), both the headroom and the
+ * tailroom available will be used to make more space for where data needs
+ * to be inserted. I.e. if there's not enough headroom, data may be shifted
+ * right if there's enough tailroom.
+ * However, this is not bulletproof and in some cases the space available
+ * won't be enough - in those cases, an error should be returned and the
+ * packet dropped. */
 case DPBUF_DPDK:
-OVS_NOT_REACHED();
+{
+size_t miss_len;
+
+if (new_headroom == dp_packet_headroom(b)) {
+/* This is a tailroom adjustment. Since there's no tailroom space
+ * left, try and shift data towards the head to free up tail space,
+ * if there's enough headroom */
+
+miss_len = new_tailroom - dp_packet_tailroom(b);
+
+if (miss_len <= new_headroom) {
+dp_packet_shift(b, -miss_len);
+} else {
+/* XXX: Handle error case and report error to caller */
+OVS_NOT_REACHED();
+}
+} else {
+/* Otherwise, this is a headroom adjustment. Try to shift data
+ * towards the tail to free up head space, if there's enough
+ * tailroom */
+
+miss_len = new_headroom - dp_packet_headroom(b);
 
+
+if (miss_len <= new_tailroom) {
+dp_packet_shift(b, miss_len);
+} else {
+/* XXX: Handle error case and report error to caller */
+OVS_NOT_REACHED();
+}
+}
+
+new_base = dp_packet_base(b);
+
+break;
+}
 case DPBUF_MALLOC:
 if (new_headroom == dp_packet_headroom(b)) {
 new_base = xrealloc(dp_packet_base(b), new_allocated);
@@ -263,7 +305,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
 OVS_NOT_REACHED();
 }
 
-dp_packet_set_allocated(b, new_allocated);
+if (b->source != DPBUF_DPDK) {
+dp_packet_set_allocated(b, new_allocated);
+}
 dp_packet_set_base(b, new_base);
 
 new_data = (char *) new_base + new_headroom;
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 09/14] dp-packet: copy data from multi-seg. DPDK mbuf

2018-07-04 Thread Tiago Lam
From: Michael Qiu 

When cloning a packet whose source is a DPDK driver, multi-segment
mbufs must be considered, and each segment's data must be copied one
by one.

Also, a lot of the DPDK mbuf's info is lost during a copy, such as
packet type, ol_flags, etc.  That information is very important for
DPDK's packet processing.

Co-authored-by: Mark Kavanagh 
Co-authored-by: Tiago Lam 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 71 ++-
 lib/dp-packet.h   |  3 +++
 lib/netdev-dpdk.c |  1 +
 3 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 87af459..ae060e2 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -48,6 +48,23 @@ dp_packet_use__(struct dp_packet *b, void *base, size_t allocated,
 dp_packet_set_size(b, 0);
 }
 
+#ifdef DPDK_NETDEV
+void
+dp_packet_copy_mbuf_flags(struct dp_packet *dst, const struct dp_packet *src)
+{
+ovs_assert(dst != NULL && src != NULL);
+struct rte_mbuf *buf_dst = &(dst->mbuf);
+struct rte_mbuf buf_src = src->mbuf;
+
+buf_dst->nb_segs = buf_src.nb_segs;
+buf_dst->ol_flags = buf_src.ol_flags;
+buf_dst->packet_type = buf_src.packet_type;
+buf_dst->tx_offload = buf_src.tx_offload;
+}
+#else
+#define dp_packet_copy_mbuf_flags(arg1, arg2)
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should be the first byte of a region
  * obtained from malloc().  It will be freed (with free()) if 'b' is resized or
@@ -158,6 +175,44 @@ dp_packet_clone(const struct dp_packet *buffer)
 return dp_packet_clone_with_headroom(buffer, 0);
 }
 
+#ifdef DPDK_NETDEV
+struct dp_packet *
+dp_packet_clone_with_headroom(const struct dp_packet *b, size_t headroom) {
+struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(b);
+
+/* copy multi-seg data */
+if (b->source == DPBUF_DPDK && b->mbuf.nb_segs > 1) {
+void *dst = NULL;
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+
+new_buffer = dp_packet_new_with_headroom(pkt_len, headroom);
+dst = dp_packet_data(new_buffer);
+dp_packet_set_size(new_buffer, pkt_len);
+
+if (!rte_pktmbuf_read(mbuf, 0, pkt_len, dst)) {
+return NULL;
+}
+} else {
+new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(b),
+dp_packet_size(b),
+headroom);
+}
+
+/* Copy the following fields into the returned buffer: l2_pad_size,
+ * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
+memcpy(&new_buffer->l2_pad_size, &b->l2_pad_size,
+   sizeof(struct dp_packet) -
+   offsetof(struct dp_packet, l2_pad_size));
+
+dp_packet_copy_mbuf_flags(new_buffer, b);
+if (dp_packet_rss_valid(new_buffer)) {
+new_buffer->mbuf.hash.rss = b->mbuf.hash.rss;
+}
+
+return new_buffer;
+}
+#else
 /* Creates and returns a new dp_packet whose data are copied from 'buffer'.
  * The returned dp_packet will additionally have 'headroom' bytes of
  * headroom. */
@@ -165,32 +220,25 @@ struct dp_packet *
 dp_packet_clone_with_headroom(const struct dp_packet *buffer, size_t headroom)
 {
 struct dp_packet *new_buffer;
+uint32_t pkt_len = dp_packet_size(buffer);
 
 new_buffer = dp_packet_clone_data_with_headroom(dp_packet_data(buffer),
- dp_packet_size(buffer),
- headroom);
+ pkt_len, headroom);
+
 /* Copy the following fields into the returned buffer: l2_pad_size,
  * l2_5_ofs, l3_ofs, l4_ofs, cutlen, packet_type and md. */
 memcpy(&new_buffer->l2_pad_size, &buffer->l2_pad_size,
 sizeof(struct dp_packet) -
 offsetof(struct dp_packet, l2_pad_size));
 
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.ol_flags = buffer->mbuf.ol_flags;
-#else
 new_buffer->rss_hash_valid = buffer->rss_hash_valid;
-#endif
-
 if (dp_packet_rss_valid(new_buffer)) {
-#ifdef DPDK_NETDEV
-new_buffer->mbuf.hash.rss = buffer->mbuf.hash.rss;
-#else
 new_buffer->rss_hash = buffer->rss_hash;
-#endif
 }
 
 return new_buffer;
 }
+#endif
 
 /* Creates and returns a new dp_packet that initially contains a copy of the
  * 'size' bytes of data starting at 'data' with no headroom or tailroom. */
@@ -374,6 +422,7 @@ dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
 len -= len_copy;
 ofs = 0;
 
+mbuf->data_len = len_copy;
 mbuf = mbuf->next;
 }
 }
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index e46628d..595be5d 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -124,6 +124,9 @@ void 

[ovs-dev] [PATCH v2 07/14] dp-packet: Handle multi-seg mbufs in shift() func.

2018-07-04 Thread Tiago Lam
In its current implementation dp_packet_shift() is also unaware of
multi-seg mbufs (which hold data in memory non-contiguously) and assumes
that data exists contiguously in memory, memmove'ing data to perform the
shift.

To add support for multi-seg mbufs, a new set of functions has been
introduced, dp_packet_mbuf_shift() and dp_packet_mbuf_write(). These
functions are used by dp_packet_shift(), when handling multi-seg mbufs,
to shift and write data within a chain of mbufs.
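The write-side walk can be sketched with a simplified model: `struct seg` below stands in for `struct rte_mbuf`, with `cap` playing the role of the writable room in each segment (these names are illustrative, not the patch's API).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct seg {
    char data[8];
    size_t cap;            /* writable bytes in this segment */
    struct seg *next;
};

/* Write 'len' bytes into a chain of segments, starting 'ofs' bytes into
 * the first one - the same walk dp_packet_mbuf_write() performs over
 * real mbufs.  memmove() is used since, as in the patch, 'src' may
 * overlap the destination when shifting within the same chain. */
static void
seg_chain_write(struct seg *s, size_t ofs, const char *src, size_t len)
{
    while (s && len > 0) {
        size_t room = s->cap - ofs;
        size_t n = len < room ? len : room;
        memmove(s->data + ofs, src, n);
        src += n;
        len -= n;
        ofs = 0;           /* only the first segment honours 'ofs' */
        s = s->next;
    }
}
```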

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c | 97 +
 lib/dp-packet.h | 10 ++
 2 files changed, 107 insertions(+)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 2aaeaae..d6e19eb 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -294,6 +294,97 @@ dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
 }
 }
 
+#ifdef DPDK_NETDEV
+/* Writes 'len' bytes of data to an mbuf chain, starting at the given offset.
+ *
+ * 'mbuf', pointer to the mbuf where 'ofs' lies, and hence the mbuf where
+ * the data will first be written.
+ * 'ofs', the offset within the provided 'mbuf' where 'data' is to be written.
+ * 'len', the size of the 'data' to be written.
+ * 'data', pointer to the bytes to be written.
+ *
+ * XXX: This function is the counterpart of the `rte_pktmbuf_read()` function
+ * available in DPDK's rte_mbuf.h */
+void
+dp_packet_mbuf_write(struct rte_mbuf *mbuf, int16_t ofs, uint32_t len,
+ const void *data)
+{
+char *dst_addr;
+uint16_t data_len;
+int len_copy;
+while (mbuf) {
+if (len == 0) {
+break;
+}
+
+dst_addr = rte_pktmbuf_mtod_offset(mbuf, char *, ofs);
+data_len = MBUF_BUF_END(mbuf->buf_addr, mbuf->buf_len) - dst_addr;
+
+len_copy = MIN(len, data_len);
+/* We don't know if 'data' is the result of a rte_pktmbuf_read() call,
+ * in which case we may end up writing to the same region of memory we
+ * are reading from and overlapping. Hence the use of memmove() here */
+memmove(dst_addr, data, len_copy);
+
+data = ((char *) data) + len_copy;
+len -= len_copy;
+ofs = 0;
+
+mbuf = mbuf->next;
+}
+}
+
+static void
+dp_packet_mbuf_shift_(struct rte_mbuf *dbuf, int16_t dst_ofs,
+  const struct rte_mbuf *sbuf, uint16_t src_ofs, int len)
+{
+char rd[len];
+const char *wd = rte_pktmbuf_read(sbuf, src_ofs, len, rd);
+
+ovs_assert(wd);
+
+dp_packet_mbuf_write(dbuf, dst_ofs, len, wd);
+}
+
+/* Similarly to dp_packet_shift(), shifts the data within the mbufs of a
+ * dp_packet of DPBUF_DPDK source by 'delta' bytes.
+ * Caller must make sure of the following conditions:
+ * - When shifting left, delta can't be bigger than the data_len available in
+ *   the last mbuf;
+ * - When shifting right, delta can't be bigger than the space available in the
+ *   first mbuf (buf_len - data_off).
+ * Both these conditions guarantee that a shift operation doesn't fall outside
+ * the bounds of the existing mbufs, so that the first and last mbufs (when
+ * using multi-segment mbufs), remain the same. */
+static void
+dp_packet_mbuf_shift(struct dp_packet *b, int delta)
+{
+uint16_t src_ofs;
+int16_t dst_ofs;
+
+struct rte_mbuf *mbuf = CONST_CAST(struct rte_mbuf *, &b->mbuf);
+struct rte_mbuf *tmbuf = rte_pktmbuf_lastseg(mbuf);
+
+if (delta < 0) {
+ovs_assert(-delta <= tmbuf->data_len);
+} else {
+ovs_assert(delta < (mbuf->buf_len - mbuf->data_off));
+}
+
+/* Set the destination and source offsets to copy to */
+dst_ofs = delta;
+src_ofs = 0;
+
+/* Shift data from src mbuf and offset to dst mbuf and offset */
+dp_packet_mbuf_shift_(mbuf, dst_ofs, mbuf, src_ofs,
+  rte_pktmbuf_pkt_len(mbuf));
+
+/* Update mbufs' properties, and if using multi-segment mbufs, first and
+ * last mbuf's data_len also needs to be adjusted */
+mbuf->data_off = mbuf->data_off + dst_ofs;
+}
+#endif
+
 /* Shifts all of the data within the allocated space in 'b' by 'delta' bytes.
  * For example, a 'delta' of 1 would cause each byte of data to move one byte
  * forward (from address 'p' to 'p+1'), and a 'delta' of -1 would cause each
@@ -306,6 +397,12 @@ dp_packet_shift(struct dp_packet *b, int delta)
: true);
 
 if (delta != 0) {
+#ifdef DPDK_NETDEV
+if (b->source == DPBUF_DPDK) {
+dp_packet_mbuf_shift(b, delta);
+return;
+}
+#endif
 char *dst = (char *) dp_packet_data(b) + delta;
 memmove(dst, dp_packet_data(b), dp_packet_size(b));
 dp_packet_set_data(b, dst);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 4880fed..e46628d 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -80,6 +80,11 @@ struct dp_packet {
 };
 };
 
+#ifdef DPDK_NETDEV
+#define MBUF_BUF_END(BUF_ADDR, BUF_LEN) \
+(char *) (((char *) BUF_ADDR

[ovs-dev] [PATCH v2 05/14] dp-packet: Fix data_len handling multi-seg mbufs.

2018-07-04 Thread Tiago Lam
When a dp_packet is from a DPDK source and contains multi-segment
mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
the data_len of each mbuf in the chain should be considered while
distributing the new (provided) size.

To account for the above, dp_packet_set_size() has been changed so that,
in the multi-segment mbufs case, only the data_len on the last mbuf of
the chain and the total size of the packet, pkt_len, are changed. The
data_len on the intermediate mbufs preceding the last mbuf is not
changed by dp_packet_set_size(). Furthermore, in some cases
dp_packet_set_size() may be used to set a smaller size than the current
packet size, thus effectively trimming the end of the packet. In the
multi-segment mbufs case this may lead to lingering mbufs that need
freeing.

__dp_packet_set_data() now also updates an mbuf's data_len after setting
the data offset, so that both fields are always in sync for each
mbuf in a chain.
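The trimming walk can be sketched in isolation. This is illustrative only: `struct seg` stands in for `struct rte_mbuf`, and freeing of the detached tail (done via free_dpdk_buf() in the patch) is elided.

```c
#include <assert.h>
#include <stddef.h>

struct seg {
    size_t data_len;
    struct seg *next;
};

/* Distribute 'new_len' bytes across the chain: each segment keeps at
 * most its current data_len, and once the budget is exhausted the rest
 * of the chain is detached (the real code frees those mbufs).
 * Returns the resulting pkt_len, which must equal the requested size. */
static size_t
chain_set_size(struct seg *s, size_t new_len)
{
    size_t pkt_len = 0;
    for (; s; s = s->next) {
        size_t keep = new_len < s->data_len ? new_len : s->data_len;
        s->data_len = keep;
        new_len -= keep;
        pkt_len += keep;
        if (new_len == 0) {
            s->next = NULL;   /* detach (and, in OvS, free) the tail */
            break;
        }
    }
    return pkt_len;
}
```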

Co-authored-by: Michael Qiu 
Co-authored-by: Mark Kavanagh 
Co-authored-by: Przemyslaw Lal 
Co-authored-by: Marcin Ksiadz 
Co-authored-by: Yuanhan Liu 

Signed-off-by: Michael Qiu 
Signed-off-by: Mark Kavanagh 
Signed-off-by: Przemyslaw Lal 
Signed-off-by: Marcin Ksiadz 
Signed-off-by: Yuanhan Liu 
Signed-off-by: Tiago Lam 
---
 lib/dp-packet.h | 76 -
 1 file changed, 64 insertions(+), 12 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index c612ad4..fcbaf60 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -429,17 +429,49 @@ dp_packet_size(const struct dp_packet *b)
 static inline void
 dp_packet_set_size(struct dp_packet *b, uint32_t v)
 {
-/* netdev-dpdk does not currently support segmentation; consequently, for
- * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' (32 bit) may
- * be used interchangably.
- *
- * On the datapath, it is expected that the size of packets
- * (and thus 'v') will always be <= UINT16_MAX; this means that there is no
- * loss of accuracy in assigning 'v' to 'data_len'.
- */
-b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
-b->mbuf.pkt_len = v; /* Total length of all segments linked to
-  * this segment. */
+if (b->source == DPBUF_DPDK) {
+struct rte_mbuf *mbuf = &b->mbuf;
+uint16_t new_len = v;
+uint16_t data_len;
+uint16_t nb_segs = 0;
+uint16_t pkt_len = 0;
+
+/* Trim the packet down to 'v' bytes in length, freeing any mbufs that
+ * may be left dangling at the end of the chain */
+while (mbuf) {
+data_len = MIN(new_len, mbuf->data_len);
+mbuf->data_len = data_len;
+
+if (new_len - data_len <= 0) {
+/* Free the rest of chained mbufs */
+free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
+   mbuf));
+mbuf->next = NULL;
+} else if (!mbuf->next) {
+/* Don't assign more than what we have available */
+mbuf->data_len = MIN(new_len,
+ mbuf->buf_len - mbuf->data_off);
+}
+
+new_len -= data_len;
+nb_segs += 1;
+pkt_len += mbuf->data_len;
+mbuf = mbuf->next;
+}
+
+/* pkt_len != v would effectively mean that pkt_len is less than 'v' (being
+ * bigger is logically impossible). Being less than 'v' would mean the 'v'
+ * provided was bigger than the available room; it is the caller's
+ * responsibility to make sure there is enough room */
+ovs_assert(pkt_len == v);
+
+b->mbuf.nb_segs = nb_segs;
+b->mbuf.pkt_len = pkt_len;
+} else {
+b->mbuf.data_len = v;
+/* Total length of all segments linked to this segment. */
+b->mbuf.pkt_len = v;
+}
 }
 
 static inline uint16_t
@@ -451,7 +483,27 @@ __packet_data(const struct dp_packet *b)
 static inline void
 __packet_set_data(struct dp_packet *b, uint16_t v)
 {
-b->mbuf.data_off = v;
+if (b->source == DPBUF_DPDK) {
+/* Moving data_off away from the first mbuf in the chain is not a
+ * possibility using DPBUF_DPDK dp_packets */
+ovs_assert(v == UINT16_MAX || v <= b->mbuf.buf_len - b->mbuf.data_off);
+
+uint16_t prev_ofs = b->mbuf.data_off;
+b->mbuf.data_off = v;
+int16_t ofs_diff = prev_ofs - b->mbuf.data_off;
+
+/* When dealing with DPDK mbufs, keep data_off and data_len in sync.
+ * Thus, update data_len if the length changes with the move of
+ * data_off. However, if data_len is 0, there's no data to move and
+ * data_len should remain 0. */
+
+if (b->mbuf.data_len != 0) {
+b->mbuf.data_len = MIN(b->mbuf.data_len + ofs_diff,
+   b->mbuf.buf_len - b->mbuf.d

[ovs-dev] [PATCH v2 04/14] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2018-07-04 Thread Tiago Lam
A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
allocation and free operations by non-pmd threads on a given mempool.

free_dpdk_buf() has been modified to make use of the introduced mutex.
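The locking pattern is roughly the following sketch using a POSIX mutex; `thread_is_pmd()` is a stand-in for the `rte_lcore_id() != NON_PMD_CORE_ID` check the patch uses, and `serialised_free()` is a hypothetical name, not the patch's API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t nonpmd_mp_mutex = PTHREAD_MUTEX_INITIALIZER;

/* In OvS this distinguishes pinned pmd threads from everything else. */
static bool
thread_is_pmd(void)
{
    return false;   /* assume a non-pmd caller for this sketch */
}

/* Non-pmd threads may be preempted mid-operation, so they serialise
 * mempool free operations behind a mutex; pmd threads, pinned to their
 * own cores, skip the lock entirely. */
static void
serialised_free(void *pkt, void (*do_free)(void *))
{
    if (!thread_is_pmd()) {
        pthread_mutex_lock(&nonpmd_mp_mutex);
    }
    do_free(pkt);
    if (!thread_is_pmd()) {
        pthread_mutex_unlock(&nonpmd_mp_mutex);
    }
}
```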

Signed-off-by: Tiago Lam 
---
 lib/netdev-dpdk.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 675dd48..6c9f0e7 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -294,6 +294,16 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
 static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
 = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
 
+/* This mutex must be used by non pmd threads when allocating or freeing
+ * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, in struct
+ * dp_netdev.
+ * The reason, as pointed out in the "Known Issues" section in DPDK's EAL docs,
+ * is that the implementation the mempool is based on is non-preemptible.
+ * Since non-pmds may end up not being pinned this could lead to the preemption
+ * between non-pmds performing operations on the same mempool, which could lead
+ * to memory corruption. */
+static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
+
 /* Wrapper for a mempool released but not yet freed. */
 struct dpdk_mp {
  struct rte_mempool *mp;
@@ -462,6 +472,8 @@ struct netdev_rxq_dpdk {
 dpdk_port_t port_id;
 };
 
+static bool dpdk_thread_is_pmd(void);
+
 static void netdev_dpdk_destruct(struct netdev *netdev);
 static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
 
@@ -495,6 +507,12 @@ dpdk_buf_size(int mtu)
  NETDEV_DPDK_MBUF_ALIGN);
 }
 
+static bool
+dpdk_thread_is_pmd(void)
+{
+ return rte_lcore_id() != NON_PMD_CORE_ID;
+}
+
 /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
  *
  * Unlike xmalloc(), this function can return NULL on failure. */
@@ -505,11 +523,17 @@ dpdk_rte_mzalloc(size_t sz)
 }
 
 void
-free_dpdk_buf(struct dp_packet *p)
+free_dpdk_buf(struct dp_packet *packet)
 {
-struct rte_mbuf *pkt = (struct rte_mbuf *) p;
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_lock(&nonpmd_mp_mutex);
+}
 
-rte_pktmbuf_free(pkt);
+rte_pktmbuf_free(&packet->mbuf);
+
+if (!dpdk_thread_is_pmd()) {
+ovs_mutex_unlock(&nonpmd_mp_mutex);
+}
 }
 
 static void
-- 
2.7.4



[ovs-dev] [PATCH v2 03/14] dp-packet: Fix allocated size on DPDK init.

2018-07-04 Thread Tiago Lam
When enabled with DPDK, OvS deals with two types of packets: the ones
coming from a mempool and the ones locally created by OvS - which are
copied to mempool mbufs before output. For the latter, the space is
allocated from the system, while for the former the mbufs are allocated
from a mempool, which takes care of initialising them appropriately.

In the current implementation, during mempool's initialisation of mbufs,
dp_packet_set_allocated() is called from dp_packet_init_dpdk() without
considering that the allocated space, in the case of multi-segment
mbufs, might be greater than a single mbuf.  Furthermore, given that
dp_packet_init_dpdk() is on the code path that's called upon mempool's
initialisation, a call to dp_packet_set_allocated() is redundant, since
mempool takes care of initialising it.

To fix this, dp_packet_set_allocated() is no longer called after
initialisation of a mempool, only in dp_packet_init__(), which is still
called by OvS when initialising locally created packets.

Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.c   | 3 +--
 lib/dp-packet.h   | 2 +-
 lib/netdev-dpdk.c | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 443c225..782e7c2 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -99,9 +99,8 @@ dp_packet_use_const(struct dp_packet *b, const void *data, size_t size)
  * buffer.  Here, non-transient ovs dp-packet fields are initialized for
  * packets that are part of a DPDK memory pool. */
 void
-dp_packet_init_dpdk(struct dp_packet *b, size_t allocated)
+dp_packet_init_dpdk(struct dp_packet *b)
 {
-dp_packet_set_allocated(b, allocated);
 b->source = DPBUF_DPDK;
 }
 
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index cfe0824..c612ad4 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -114,7 +114,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 
-void dp_packet_init_dpdk(struct dp_packet *, size_t allocated);
+void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
 void dp_packet_uninit(struct dp_packet *);
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8fb859e..675dd48 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -520,7 +520,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
 {
 struct rte_mbuf *pkt = _p;
 
-dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
+dp_packet_init_dpdk((struct dp_packet *) pkt);
 }
 
 static int
-- 
2.7.4



[ovs-dev] [PATCH v2 01/14] netdev-dpdk: fix mbuf sizing

2018-07-04 Thread Tiago Lam
From: Mark Kavanagh 

There are numerous factors that must be considered when calculating
the size of an mbuf:
- the data portion of the mbuf must be sized in accordance with Rx
  buffer alignment (typically 1024B). So, for example, in order to
  successfully receive and capture a 1500B packet, mbufs with a
  data portion of size 2048B must be used.
- in OvS, the elements that comprise an mbuf are:
  * the dp packet, which includes a struct rte mbuf (704B)
  * RTE_PKTMBUF_HEADROOM (128B)
  * packet data (aligned to 1k, as previously described)
  * RTE_PKTMBUF_TAILROOM (typically 0)

Some PMDs require that the total mbuf size (i.e. the total sum of all
of the above-listed components' lengths) is cache-aligned. To satisfy
this requirement, it may be necessary to round up the total mbuf size
with respect to cacheline size. In doing so, it's possible that the
dp_packet's data portion is inadvertently increased in size, such that
it no longer adheres to Rx buffer alignment. Consequently, the
following property of the mbuf no longer holds true:

mbuf.data_len == mbuf.buf_len - mbuf.data_off

This creates a problem in the case of multi-segment mbufs, where that
assumption is assumed to be true for all but the final segment in an
mbuf chain. Resolve this issue by adjusting the size of the mbuf's
private data portion, as opposed to the packet data portion when
aligning mbuf size to cachelines.
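With illustrative numbers, the arithmetic looks like the sketch below. The 712B struct size is hypothetical, chosen only to produce a non-zero pad; the real sizes depend on the build.

```c
#include <assert.h>

#define ROUND_UP(x, y)  (((x) + (y) - 1) / (y) * (y))

/* How many bytes of padding end up in the mbuf's *private* area when the
 * total mbuf size is rounded up to a 64B cacheline.  The point of the
 * patch: this pad is absorbed by the private data portion, so the
 * 1024B-aligned packet data portion is untouched and
 * data_len == buf_len - data_off keeps holding. */
static unsigned
mbuf_priv_pad(unsigned dp_packet_sz, unsigned headroom, unsigned frame_len)
{
    unsigned data = ROUND_UP(frame_len, 1024);   /* Rx buffer alignment */
    unsigned total = dp_packet_sz + headroom + data;
    return ROUND_UP(total, 64) - total;          /* pad, kept out of 'data' */
}
```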

Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
CC: Santosh Shukla 
Signed-off-by: Mark Kavanagh 
Acked-by: Santosh Shukla 
Acked-by: Eelco Chaudron 
---
 lib/netdev-dpdk.c | 53 +++--
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index c223e5b..8fb859e 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -82,12 +82,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  + (2 * VLAN_HEADER_LEN))
 #define MTU_TO_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
 #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
-#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)\
- - ETHER_HDR_LEN - ETHER_CRC_LEN)
-#define MBUF_SIZE(mtu)  ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
- + sizeof(struct dp_packet) \
- + RTE_PKTMBUF_HEADROOM),   \
- RTE_CACHE_LINE_SIZE)
 #define NETDEV_DPDK_MBUF_ALIGN  1024
 #define NETDEV_DPDK_MAX_PKT_LEN 9728
 
@@ -494,7 +488,7 @@ is_dpdk_class(const struct netdev_class *class)
  * behaviour, which reduces performance. To prevent this, use a buffer size
  * that is closest to 'mtu', but which satisfies the aforementioned criteria.
  */
-static uint32_t
+static uint16_t
 dpdk_buf_size(int mtu)
 {
 return ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM),
@@ -586,7 +580,7 @@ dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
  *  - a new mempool was just created;
  *  - a matching mempool already exists. */
 static struct rte_mempool *
-dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
+dpdk_mp_create(struct netdev_dpdk *dev, uint16_t mbuf_pkt_data_len)
 {
 char mp_name[RTE_MEMPOOL_NAMESIZE];
 const char *netdev_name = netdev_get_name(&dev->up);
@@ -594,6 +588,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
 uint32_t n_mbufs;
 uint32_t hash = hash_string(netdev_name, 0);
 struct rte_mempool *mp = NULL;
+uint16_t mbuf_size, aligned_mbuf_size, mbuf_priv_data_len;
 
 /*
  * XXX: rough estimation of number of mbufs required for this port:
@@ -612,24 +607,46 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
 /* Full DPDK memory pool name must be unique and cannot be
  * longer than RTE_MEMPOOL_NAMESIZE. */
 int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
-   "ovs%08x%02d%05d%07u",
-   hash, socket_id, mtu, n_mbufs);
+   "ovs%08x%02d%05u%07u",
+   hash, socket_id, mbuf_pkt_data_len, n_mbufs);
 if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
 VLOG_DBG("snprintf returned %d. "
  "Failed to generate a mempool name for \"%s\". "
- "Hash:0x%x, socket_id: %d, mtu:%d, mbufs:%u.",
- ret, netdev_name, hash, socket_id, mtu, n_mbufs);
+ "Hash:0x%x, socket_id: %d, pkt data room:%u, mbufs:%u.",
+ ret, netdev_name, hash, socket_id, mbuf_pkt_data_len,
+ n_mbufs);
 break;
 }
 
 VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
-  "on socket %d for %d Rx and %d Tx queues.",
+  "on 

[ovs-dev] [PATCH v2 02/14] dp-packet: Init specific mbuf fields.

2018-07-04 Thread Tiago Lam
From: Mark Kavanagh 

dp_packets are created using xmalloc(); in the case of OvS-DPDK, it's
possible that the resultant mbuf portion of the dp_packet contains
random data. For some mbuf fields, specifically those related to
multi-segment mbufs and/or offload features, random values may cause
unexpected behaviour, should the dp_packet's contents be later copied
to a DPDK mbuf. It is critical therefore, that these fields should be
initialized to 0.

This patch ensures that the following mbuf fields are initialized to
appropriate values on creation of a new dp_packet:
   - ol_flags=0
   - nb_segs=1
   - tx_offload=0
   - packet_type=0
   - next=NULL

Adapted from an idea by Michael Qiu :
https://patchwork.ozlabs.org/patch/777570/
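A minimal model of the issue and the fix: `struct fake_mbuf` below is a stand-in for the relevant subset of `struct rte_mbuf`, and the test simulates the garbage that xmalloc()'d memory may carry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct fake_mbuf {
    uint64_t ol_flags;
    uint64_t tx_offload;
    uint32_t packet_type;
    uint16_t nb_segs;
    struct fake_mbuf *next;
};

/* Memory from xmalloc() carries arbitrary contents, so any field that
 * DPDK later interprets must be set explicitly. */
static void
fake_mbuf_init(struct fake_mbuf *m)
{
    m->ol_flags = 0;
    m->tx_offload = 0;
    m->packet_type = 0;
    m->nb_segs = 1;        /* a fresh packet is a single segment */
    m->next = NULL;        /* and has no chained successor */
}
```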

Co-authored-by: Tiago Lam 

Signed-off-by: Mark Kavanagh 
Signed-off-by: Tiago Lam 
Acked-by: Eelco Chaudron 
---
 lib/dp-packet.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 596cfe6..cfe0824 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -625,14 +625,15 @@ dp_packet_mbuf_rss_flag_reset(struct dp_packet *p OVS_UNUSED)
 }
 
 /* This initialization is needed for packets that do not come
- * from DPDK interfaces, when vswitchd is built with --with-dpdk.
- * The DPDK rte library will still otherwise manage the mbuf.
- * We only need to initialize the mbuf ol_flags. */
+ * from DPDK interfaces, when vswitchd is built with --with-dpdk. */
 static inline void
 dp_packet_mbuf_init(struct dp_packet *p OVS_UNUSED)
 {
 #ifdef DPDK_NETDEV
-p->mbuf.ol_flags = 0;
+struct rte_mbuf *mbuf = &(p->mbuf);
+mbuf->ol_flags = mbuf->tx_offload = mbuf->packet_type = 0;
+mbuf->nb_segs = 1;
+mbuf->next = NULL;
 #endif
 }
 
-- 
2.7.4



[ovs-dev] [PATCH v2 00/14] Support multi-segment mbufs

2018-07-04 Thread Tiago Lam
Overview

This patchset introduces support for multi-segment mbufs to OvS-DPDK.
Multi-segment mbufs are typically used when the size of an mbuf is
insufficient to contain the entirety of a packet's data. Instead, the
data is split across numerous mbufs, each carrying a portion, or
'segment', of the packet data. mbufs are chained via their 'next'
attribute (an mbuf pointer).
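The chaining can be pictured with a toy model, where `struct seg` stands in for `struct rte_mbuf`: `pkt_len` is the sum of the per-segment `data_len` values and `nb_segs` the length of the chain, both kept on the first mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg {
    uint16_t data_len;     /* bytes held in this segment */
    struct seg *next;      /* next segment, NULL for the last one */
};

/* Walk the chain, recovering the two aggregate fields DPDK keeps on the
 * first mbuf of a chain. */
static uint32_t
chain_pkt_len(const struct seg *s, uint16_t *nb_segs)
{
    uint32_t pkt_len = 0;
    *nb_segs = 0;
    for (; s; s = s->next) {
        pkt_len += s->data_len;
        (*nb_segs)++;
    }
    return pkt_len;
}
```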

Use Cases
=
i.  Handling oversized (guest-originated) frames, which are marked
for hardware acceleration/offload (TSO, for example).

Packets which originate from a non-DPDK source may be marked for
offload; as such, they may be larger than the permitted ingress
interface's MTU, and may be stored in an oversized dp-packet. In
order to transmit such packets over a DPDK port, their contents
must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
its current implementation, that function only copies data into
a single mbuf; if the space available in the mbuf is exhausted,
but not all packet data has been copied, then it is lost.
Similarly, when cloning a DPDK mbuf, it must be considered
whether that mbuf contains multiple segments. Both issues are
resolved within this patchset.

ii. Handling jumbo frames.

While OvS already supports jumbo frames, it does so by increasing
mbuf size, such that the entirety of a jumbo frame may be handled
in a single mbuf. This is certainly the preferred, and most
performant approach (and remains the default).

Enabling multi-segment mbufs

Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide on which approach to adopt on init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.

This is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Performance notes (based on v8)
=
In order to test for regressions in performance, tests were run on top
of master 88125d6 and v8 of this patchset, both with the multi-segment
mbufs option enabled and disabled.

VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.

Test | Size | Master | Multi-seg disabled | Multi-seg enabled
-
p2p  |  64  | ~22.7  |  ~22.65|   ~18.3
p2p  | 1500 |  ~1.6  |~1.6|~1.6
p2p  | 7000 | ~0.36  |   ~0.36|   ~0.36
pvp  |  64  |  ~6.7  |~6.7|~6.3
pvp  | 1500 |  ~1.6  |~1.6|~1.6
pvp  | 7000 | ~0.36  |   ~0.36|   ~0.36

Packet size is in bytes, while all packet rates are reported in mpps
(aggregated).

No noticeable regression has been observed (certainly everything is
within the ± 5% margin of existing performance), aside from the 64B
packet size case when multi-segment mbufs are enabled. This is
expected, however, because Tx vectorized functions are
incompatible with multi-segment mbufs on some PMDs. The PMD in
use during these tests was the i40e (on an Intel X710 NIC), which
indeed doesn't support vectorized Tx functions with multi-segment
mbufs.

---
v2:
- Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
  error.");
- Address Eelco's comments:
- In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
  casting;
- Remove unneeded variable in dp_packet_set_size(), pointing to the
  first mbuf in the chain;
- Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
  always true for DPBUF_DPDK packets;
- Assert in __packet_set_data() to enforce that data_off never goes
  beyond the first mbuf in the chain.

v1:
- v8 should have been sent as v1 really, as that's usually the approach
  followed in OvS. That clearly didn't happen, so restarting the series
  now. This also helps making it clear it is no longer an RFC series;
- Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
  output.");
- Address Eelco's comments:
- Change dp_packet_size() so that for `DPBUF_DPDK` packets their
  `pkt_len` and `data_len` can't be set to values bigger than the
  available space. Also fix assigment to `data_len` which was
  incorrectly being set to just`pkt_len`;

Re: [ovs-dev] [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache

2018-07-04 Thread O Mahony, Billy
Hi,

I've checked the latest patch and the performance results I get are similar to 
the ones given in the previous patches. Also enabling/disabling the DFC on the 
fly works as expected.

The main query I have regards the slow sweep for SMC.

[[BO'M]] The slow sweep removes EMC entries that are no longer valid because 
the associated dpcls rule has changed or expired. Is there a mechanism to 
remove SMC entries associated with a changed/expired dpcls rule?

Further comments, queries inline. Review not complete yet, I'll try to finish 
off tomorrow.

Regards,
Billy.


> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Friday, June 29, 2018 6:53 PM
> To: d...@openvswitch.org
> Cc: Wang, Yipeng1 ; jan.scheur...@ericsson.com;
> Stokes, Ian ; O Mahony, Billy
> ; Loftus, Ciara 
> Subject: [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache
> 
> This patch adds a signature match cache (SMC) after exact match cache (EMC).
> The difference between SMC and EMC is that SMC only stores a signature of a
> flow and is thus much more memory efficient. With the same memory space, EMC
> can store 8k flows while SMC can store 1M flows. It is generally beneficial
> to turn on SMC but turn off EMC when the traffic flow count is much larger
> than the EMC size.
> 
> SMC cache will map a signature to a netdev_flow index in flow_table. Thus, we
> add two new APIs in cmap for lookup key by index and lookup index by key.
> 
> For now, SMC is an experimental feature that is turned off by default. One
> can turn it on using ovsdb options.
> 
> Signed-off-by: Yipeng Wang 
> ---
>  Documentation/topics/dpdk/bridge.rst |  15 ++
>  NEWS |   2 +
>  lib/cmap.c   |  73 +
>  lib/cmap.h   |   5 +
>  lib/dpif-netdev-perf.h   |   1 +
>  lib/dpif-netdev.c| 310 ++-
>  tests/pmd.at |   1 +
>  vswitchd/vswitch.xml |  13 ++
>  8 files changed, 383 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst
> b/Documentation/topics/dpdk/bridge.rst
> index 63f8a62..df74c02 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -102,3 +102,18 @@ For certain traffic profiles with many parallel flows, it's
>  recommended to set ``N`` to '0' to achieve higher forwarding performance.
> 
>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
> +
> +
> +SMC cache (experimental)
> +------------------------
> +
> +SMC cache or signature match cache is a new cache level after EMC cache.
> +The difference between SMC and EMC is SMC only stores a signature of a
> +flow thus it is much more memory efficient. With same memory space, EMC
> +can store 8k flows while SMC can store 1M flows. When traffic flow
> +count is much larger than EMC size, it is generally beneficial to turn
> +off EMC and turn on SMC. It is currently turned off by default and an
> +experimental feature.
> +
> +To turn on SMC::
> +
> +$ ovs-vsctl --no-wait set Open_vSwitch . other_config:smc-enable=true
> diff --git a/NEWS b/NEWS
> index cd15a33..26d6ef1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -40,6 +40,8 @@ Post-v2.9.0
>   ovs-appctl dpif-netdev/pmd-perf-show
>   * Supervision of PMD performance metrics and logging of suspicious
> iterations
> + * Add signature match cache (SMC) as experimental feature. When turned on,
> +   it improves throughput when traffic has many more flows than EMC size.
> - ERSPAN:
>   * Implemented ERSPAN protocol (draft-foschiano-erspan-00.txt) for
> both kernel datapath and userspace datapath.
> diff --git a/lib/cmap.c b/lib/cmap.c
> index 07719a8..db1c806 100644
> --- a/lib/cmap.c
> +++ b/lib/cmap.c
> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
> hash);
>  }
> 
> +/* Find a node by the index of the entry of cmap. For example, index 7 means
> + * the second bucket and the third item.
[[BO'M]] Is this assuming 4 for the bucket size? Maybe explicitly add where 
the bucket size is coming from - CMAP_K? If so, is that value not 5 for a 64 
bit system?
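As a reference for the arithmetic under discussion, the decomposition can be checked in isolation. This is an illustrative sketch only (not the OVS code); it assumes CMAP_K works out to 5, as the cmap.h constants do on a typical 64-bit build. With that value, index 7 does land in the second bucket, third slot, matching the comment in the patch.

```c
#include <stdint.h>

/* Assumed value: (CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE == 5 on a
 * typical 64-bit build with 64-byte cache lines. */
#define CMAP_K 5

/* Decompose a flat entry index into its bucket number, mirroring the
 * division in cmap_find_by_index() above. */
static inline uint32_t cmap_index_bucket(uint16_t index)
{
    return index / CMAP_K;
}

/* Decompose a flat entry index into its slot within the bucket,
 * mirroring the modulo in cmap_find_by_index() above. */
static inline uint32_t cmap_index_entry(uint16_t index)
{
    return index % CMAP_K;
}
```

With CMAP_K == 5, index 7 gives bucket 1 and slot 2, i.e. the second bucket, third item; with CMAP_K == 4 it would be the fourth item instead.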

> + * Notice that it is not protected by the optimistic lock (versioning) because
> + * of performance reasons. Currently it is only used by the datapath DFC cache.
> + *
> + * Return node for the entry of index or NULL if the index beyond boundary */
> +const struct cmap_node *
> +cmap_find_by_index(const struct cmap *cmap, uint16_t index)
> +{
> +const struct cmap_impl *impl = cmap_get_impl(cmap);
> +
> +uint32_t b = index / CMAP_K;
> +uint32_t e = index % CMAP_K;
> +
> +if (b > impl->mask) {
> +return NULL;
> +}
> +
> +const struct cmap_bucket *bucket = &impl->buckets[b];
> +
> +return cmap_node_next(&bucket->nodes[e]);
> +}
> +
> +/* Find the index of certain hash value. Curre
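The signature-cache idea from the commit message can be sketched independently of the patch. The code below is a toy illustration only, not the OVS implementation: each entry keeps just a 16-bit signature plus a 16-bit flow index (4 bytes total, versus a full flow key per EMC entry), which is why roughly 1M SMC entries fit in the space of a few thousand EMC entries.

```c
#include <stdint.h>

/* Toy direct-mapped signature cache (illustrative, not the OVS code).
 * A real implementation also needs a validity marker; omitted here. */
#define SMC_ENTRIES (1u << 20)              /* ~1M entries in ~4 MB */

struct smc_entry {
    uint16_t sig;                           /* upper bits of packet hash */
    uint16_t flow_idx;                      /* index into the flow table */
};

static struct smc_entry smc[SMC_ENTRIES];

static void smc_insert(uint32_t hash, uint16_t flow_idx)
{
    struct smc_entry *e = &smc[hash & (SMC_ENTRIES - 1)];
    e->sig = (uint16_t)(hash >> 16);
    e->flow_idx = flow_idx;
}

/* Returns the flow index on a signature hit, -1 on a miss.  A hit must
 * still be validated against the full flow, since signatures collide. */
static int smc_lookup(uint32_t hash)
{
    const struct smc_entry *e = &smc[hash & (SMC_ENTRIES - 1)];
    return e->sig == (uint16_t)(hash >> 16) ? e->flow_idx : -1;
}
```

Because only a signature is stored, a lookup hit is a candidate, not a guarantee; the caller verifies the indexed flow actually matches the packet.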



Re: [ovs-dev] [PATCH v1 05/14] dp-packet: Fix data_len handling multi-seg mbufs.

2018-07-04 Thread Lam, Tiago
On 03/07/2018 11:22, Eelco Chaudron wrote:
> 
> 
> On 28 Jun 2018, at 17:41, Tiago Lam wrote:
> 
>> When a dp_packet is from a DPDK source, and it contains multi-segment
>> mbufs, the data_len is not equal to the packet size, pkt_len. Instead,
>> the data_len of each mbuf in the chain should be considered while
>> distributing the new (provided) size.
>>
>> To account for the above dp_packet_set_size() has been changed so 
>> that,
>> in the multi-segment mbufs case, only the data_len on the last mbuf of
>> the chain and the total size of the packet, pkt_len, are changed. The
>> data_len on the intermediate mbufs preceding the last mbuf is not
>> changed by dp_packet_set_size(). Furthermore, in some cases
>> dp_packet_set_size() may be used to set a smaller size than the 
>> current
>> packet size, thus effectively trimming the end of the packet. In the
>> multi-segment mbufs case this may lead to lingering mbufs that may 
>> need
>> freeing.
>>
>> __dp_packet_set_data() now also updates an mbuf's data_len after 
>> setting
>> the data offset. This is so that both fields are always in sync for 
>> each
>> mbuf in a chain.
>>
>> Co-authored-by: Michael Qiu 
>> Co-authored-by: Mark Kavanagh 
>> Co-authored-by: Przemyslaw Lal 
>> Co-authored-by: Marcin Ksiadz 
>> Co-authored-by: Yuanhan Liu 
>>
>> Signed-off-by: Michael Qiu 
>> Signed-off-by: Mark Kavanagh 
>> Signed-off-by: Przemyslaw Lal 
>> Signed-off-by: Marcin Ksiadz 
>> Signed-off-by: Yuanhan Liu 
>> Signed-off-by: Tiago Lam 
>> ---
>>  lib/dp-packet.h | 67 ++---
>>  1 file changed, 55 insertions(+), 12 deletions(-)
>>
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index c612ad4..d8cd35c 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -429,17 +429,44 @@ dp_packet_size(const struct dp_packet *b)
>>  static inline void
>>  dp_packet_set_size(struct dp_packet *b, uint32_t v)
>>  {
>> -/* netdev-dpdk does not currently support segmentation; 
>> consequently, for
>> - * all intents and purposes, 'data_len' (16 bit) and 'pkt_len' 
>> (32 bit) may
>> - * be used interchangably.
>> - *
>> - * On the datapath, it is expected that the size of packets
>> - * (and thus 'v') will always be <= UINT16_MAX; this means that 
>> there is no
>> - * loss of accuracy in assigning 'v' to 'data_len'.
>> - */
>> -b->mbuf.data_len = (uint16_t)v;  /* Current seg length. */
>> -b->mbuf.pkt_len = v; /* Total length of all segments 
>> linked to
>> -  * this segment. */
>> +if (b->source == DPBUF_DPDK) {
>> +struct rte_mbuf *mbuf = &b->mbuf;
>> +struct rte_mbuf *fmbuf = mbuf;
>> +uint16_t new_len = v;
>> +uint16_t data_len;
>> +uint16_t nb_segs = 0;
>> +uint16_t pkt_len = 0;
>> +
>> +/* Trim 'v' length bytes from the end of the chained buffers, freeing
>> +   any buffers that may be left floating */
>> +while (mbuf) {
>> +data_len = MIN(new_len, mbuf->data_len);
>> +mbuf->data_len = data_len;
>> +
>> +if (new_len - data_len <= 0) {
>> +/* Free the rest of chained mbufs */
>> +free_dpdk_buf(CONTAINER_OF(mbuf->next, struct dp_packet,
>> +   mbuf));
>> +mbuf->next = NULL;
>> +} else if (!mbuf->next) {
>> +/* Don't assign more than what we have available */
>> +mbuf->data_len = MIN(new_len,
>> + mbuf->buf_len - mbuf->data_off);
>> +}
>> +
>> +new_len -= data_len;
>> +nb_segs += 1;
>> +pkt_len += mbuf->data_len;
>> +mbuf = mbuf->next;
>> +}
>> +
>> +fmbuf->nb_segs = nb_segs;
>> +fmbuf->pkt_len = pkt_len;
> 
> Guess we do not need the extra variable, we can just do b->mbuf.pkt_len 
> etc as below. Looks like a left over from the previous revision?
> 

Good point, I got rid of it for v2.

> In the increase packet size case we can end up with a b->mbuf.pkt_len <
> v, maybe we should assert on this?
> 

Yeah, it can happen if 'v' is bigger than the available tailroom. An
assertion sounds good in this case since dp_packet_set_size() relies on
the caller making sure there is enough space for performing the
operation (like allocating space themselves if not enough space is
available).
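The trimming behaviour being reviewed here can be modelled with a small stand-alone sketch (simplified stand-in types, not the DPDK structures): earlier segments keep their length while bytes remain, the segment where the new size runs out is shortened, trailing segments are dropped, and afterwards the segment lengths sum to the packet length.

```c
#include <stddef.h>

enum { SEG_MAX = 4 };

/* Toy segment chain: per-segment data_len, plus chain-wide pkt_len,
 * standing in for a chain of rte_mbufs. */
struct chain {
    size_t data_len[SEG_MAX];
    size_t nsegs;
    size_t pkt_len;
};

/* Trim the chain to 'v' bytes.  In the real code the dropped trailing
 * segments would also be freed back to the mempool. */
static void chain_set_size(struct chain *c, size_t v)
{
    size_t remaining = v;
    size_t pkt_len = 0;
    size_t nsegs = 0;

    for (size_t i = 0; i < c->nsegs && remaining > 0; i++) {
        size_t take = remaining < c->data_len[i] ? remaining
                                                 : c->data_len[i];
        c->data_len[i] = take;      /* shorten the boundary segment */
        remaining -= take;
        pkt_len += take;
        nsegs++;
    }
    c->nsegs = nsegs;
    c->pkt_len = pkt_len;           /* invariant: sum of data_len */
}
```

Trimming a 3 x 100-byte chain to 150 bytes leaves two segments of 100 and 50 bytes, which is the pkt_len == sum(data_len) invariant the proposed assertion would enforce.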

>> +} else {
>> +b->mbuf.data_len = v;
>> +/* Total length of all segments linked to this segment. */
>> +b->mbuf.pkt_len = v;
>> +}
>>  }
>>
>>  static inline uint16_t
>> @@ -451,7 +478,23 @@ __packet_data(const struct dp_packet *b)
>>  static inline void
>>  __packet_set_data(struct dp_packet *b, uint16_t v)
>>  {
>> -b->mbuf.data_off = v;
>> +if (b->source == DPBUF_DPDK) {
>> +uint16_t prev_ofs = b->mbuf.data_off;
>> +b->mbu

Re: [ovs-dev] [PATCH v1 04/14] netdev-dpdk: Serialise non-pmds mbufs' alloc/free.

2018-07-04 Thread Lam, Tiago
On 03/07/2018 10:49, Eelco Chaudron wrote:
> 
> 
> On 28 Jun 2018, at 17:41, Tiago Lam wrote:
> 
>> A new mutex, 'nonpmd_mp_mutex', has been introduced to serialise
>> allocation and free operations by non-pmd threads on a given mempool.
>>
>> free_dpdk_buf() has been modified to make use of the introduced mutex.
>>
>> Signed-off-by: Tiago Lam 
>> ---
>>  lib/netdev-dpdk.c | 30 +++---
>>  1 file changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index c5c6aab..5c62d85 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -294,6 +294,16 @@ static struct ovs_mutex dpdk_mp_mutex 
>> OVS_ACQ_AFTER(dpdk_mutex)
>>  static struct ovs_list dpdk_mp_free_list 
>> OVS_GUARDED_BY(dpdk_mp_mutex)
>>  = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>>
>> +/* This mutex must be used by non pmd threads when allocating or 
>> freeing
>> + * mbufs through mempools, when outside of the `non_pmd_mutex` mutex, 
>> in struct
>> + * dp_netdev.
>> + * The reason, as pointed out in the "Known Issues" section in DPDK's 
>> EAL docs,
>> + * is that the implementation on which mempool is based off is 
>> non-preemptable.
>> + * Since non-pmds may end up not being pinned this could lead to the 
>> preemption
>> + * between non-pmds performing operations on the same mempool, which 
>> could lead
>> + * to memory corruption. */
>> +static struct ovs_mutex nonpmd_mp_mutex = OVS_MUTEX_INITIALIZER;
>> +
>>  /* Wrapper for a mempool released but not yet freed. */
>>  struct dpdk_mp {
>>   struct rte_mempool *mp;
>> @@ -462,6 +472,8 @@ struct netdev_rxq_dpdk {
>>  dpdk_port_t port_id;
>>  };
>>
>> +static bool dpdk_thread_is_pmd(void);
>> +
>>  static void netdev_dpdk_destruct(struct netdev *netdev);
>>  static void netdev_dpdk_vhost_destruct(struct netdev *netdev);
>>
>> @@ -495,6 +507,12 @@ dpdk_buf_size(int mtu)
>>   NETDEV_DPDK_MBUF_ALIGN);
>>  }
>>
>> +static bool
>> +dpdk_thread_is_pmd(void)
>> +{
>> + return rte_lcore_id() != NON_PMD_CORE_ID;
>> +}
>> +
>>  /* Allocates an area of 'sz' bytes from DPDK.  The memory is zero'ed.
>>   *
>>   * Unlike xmalloc(), this function can return NULL on failure. */
>> @@ -505,11 +523,17 @@ dpdk_rte_mzalloc(size_t sz)
>>  }
>>
>>  void
>> -free_dpdk_buf(struct dp_packet *p)
>> +free_dpdk_buf(struct dp_packet *packet)
>>  {
>> -struct rte_mbuf *pkt = (struct rte_mbuf *) p;
>> +if (!dpdk_thread_is_pmd()) {
>> +ovs_mutex_lock(&nonpmd_mp_mutex);
>> +}
>>
>> -rte_pktmbuf_free(pkt);
>> +rte_pktmbuf_free((struct rte_mbuf *) packet);
> 
> Rather than casting it, it should be &packet->mbuf, as someone might
> change the order of fields in the dp_packet structure.
> 

Missed this one. I'll do this for v2, thanks.

Tiago.
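Eelco's point can be demonstrated in a few lines (simplified stand-in types, not the real structs): the cast is only correct while the member happens to sit at offset zero, whereas taking the member's address stays correct regardless of layout.

```c
#include <stddef.h>

struct mbuf { int refcnt; };

/* Hypothetical layout where 'mbuf' is deliberately NOT the first
 * member, to show how the cast would break. */
struct pkt {
    int source;
    struct mbuf mbuf;
};

/* &p->mbuf is right however the struct is laid out; the cast
 * (struct mbuf *) p silently assumes offsetof(struct pkt, mbuf) == 0,
 * which fails here. */
static int cast_would_be_wrong(void)
{
    return offsetof(struct pkt, mbuf) != 0;
}
```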
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

2018-07-04 Thread Vishal Deep Ajmera
OVS reads packets in batches from a given port and packets in the
batch are subjected to potentially 3 levels of lookups to identify
the datapath megaflow entry (or flow) associated with the packet.
Each megaflow entry has a dedicated buffer in which packets that match
the flow classification criteria are collected. This buffer helps OVS
perform batch processing for all packets associated with a given flow.

Each packet in the received batch is first subjected to lookup in the
Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
EMC lookup is successful, the packet is moved from the rx batch to the
per-flow buffer.

Packets that did not match any EMC entry are moved to the beginning of
the rx batch and are then subjected to a lookup in the megaflow cache.
Packets that match a megaflow cache entry are *appended* to the per-flow
buffer.

Packets that do not match any megaflow entry are subjected to slow-path
processing through the upcall mechanism. This cannot change the order of
packets as by definition upcall processing is only done for packets
without matching megaflow entry.

The EMC entry match fields encompass all potentially significant header
fields, typically more than specified in the associated flow's match
criteria. Hence, multiple EMC entries can point to the same flow. Given
that per-flow batching happens at each lookup stage, packets belonging
to the same megaflow can get re-ordered because some packets match EMC
entries while others do not.

The following example illustrates the issue. Consider the following
batch of packets (labelled P1 to P8) associated with a single TCP
connection and a single flow. Let us assume that packets with just the
ACK bit set in TCP flags have also been received in a prior batch, and
that a corresponding EMC entry exists.

1. P1 (TCP Flag: ACK)
2. P2 (TCP Flag: ACK)
3. P3 (TCP Flag: ACK)
4. P4 (TCP Flag: ACK, PSH)
5. P5 (TCP Flag: ACK)
6. P6 (TCP Flag: ACK)
7. P7 (TCP Flag: ACK)
8. P8 (TCP Flag: ACK)

The megaflow classification criteria do not include TCP flags while
the EMC match criteria do. Thus, all packets other than P4 match
the existing EMC entry and are moved to the per-flow packet batch.
Subsequently, packet P4 is moved to the same per-flow packet batch as
a result of the megaflow lookup. Though the packets have all been
correctly classified as being associated with the same flow, the
packet order has not been preserved because of the per-flow batching
performed during the EMC lookup stage. This packet re-ordering has
performance implications for TCP applications.

This patch preserves the packet ordering by performing the per-flow
batching after both the EMC and megaflow lookups are complete. As an
optimization, packets are still flow-batched during EMC processing
until any packet in the batch misses the EMC.

A new flow map is maintained to keep the original order of packets
along with their flow information. After fastpath processing, packets
from the flow map are *appended* to the per-flow buffers.
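The effect of the flow-map approach can be sketched as follows. This is an illustrative model, not the patch code: lookups record (packet, flow) pairs in arrival order, and batching then happens in one ordered pass, so a packet can never overtake an earlier packet of the same flow.

```c
#include <stddef.h>

enum { MAX_PKTS = 32, MAX_FLOWS = 8 };

/* One (packet, flow) pair recorded in arrival order -- the essence of
 * struct dp_packet_flow_map in the patch below, with plain ints
 * standing in for dp_packet and dp_netdev_flow pointers. */
struct flow_map_entry {
    int packet;
    int flow;
};

struct flow_batch {
    int packets[MAX_PKTS];
    size_t count;
};

/* Second pass: walk the flow map in arrival order and append each
 * packet to its flow's batch, preserving intra-flow packet order. */
static void batch_from_flow_map(const struct flow_map_entry *map, size_t n,
                                struct flow_batch batches[MAX_FLOWS])
{
    for (size_t i = 0; i < n; i++) {
        struct flow_batch *b = &batches[map[i].flow];
        b->packets[b->count++] = map[i].packet;
    }
}
```

Because batching is deferred until after all lookup stages, it no longer matters whether a given packet was resolved by the EMC or by the megaflow cache; arrival order alone decides batch order.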

Signed-off-by: Vishal Deep Ajmera 
Co-authored-by: Venkatesan Pradeep 
Signed-off-by: Venkatesan Pradeep 
---
 lib/dpif-netdev.c | 80 ++-
 1 file changed, 67 insertions(+), 13 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 9390fff..b0e3d76 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -207,6 +207,13 @@ struct dpcls_rule {
 /* 'flow' must be the last field, additional space is allocated here. */
 };
 
+/* data structure to keep packet order till fastpath processing */
+struct dp_packet_flow_map {
+struct dp_packet *packet;
+struct dp_netdev_flow *flow;
+uint16_t tcp_flags;
+};
+
 static void dpcls_init(struct dpcls *);
 static void dpcls_destroy(struct dpcls *);
 static void dpcls_sort_subtable_vector(struct dpcls *);
@@ -5081,10 +5088,10 @@ struct packet_batch_per_flow {
 static inline void
 packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
  struct dp_packet *packet,
- const struct miniflow *mf)
+ uint16_t tcp_flags)
 {
 batch->byte_count += dp_packet_size(packet);
-batch->tcp_flags |= miniflow_get_tcp_flags(mf);
+batch->tcp_flags |= tcp_flags;
 batch->array.packets[batch->array.count++] = packet;
 }
 
@@ -5118,7 +5125,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
 
 static inline void
 dp_netdev_queue_batches(struct dp_packet *pkt,
-struct dp_netdev_flow *flow, const struct miniflow *mf,
+struct dp_netdev_flow *flow, uint16_t tcp_flags,
 struct packet_batch_per_flow *batches,
 size_t *n_batches)
 {
@@ -5129,7 +5136,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
 packet_batch_per_flow_init(batch, flow);
 }
 
-packet_batch_per_flow_update(batch, pkt, mf);
+packet_batch_per_flow_upd

Re: [ovs-dev] [PATCH] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

2018-07-04 Thread Vishal Deep Ajmera
> 
> Hi Vishal, thanks for the patch, I've some minor comments below. I'm
> still testing but so far it seems good.
> 
> Unless there are objections (or any new issues come to light) I will put
> this as part of the pull request this week and back port to the
> appropriate branches.
> 
> Thanks
> Ian

Thanks Ian for reviewing and testing this patch.
I will address your comments in the V2 patch.

Warm Regards,
Vishal Ajmera


Re: [ovs-dev] [PATCH v3] ovn.at: Add stateful test for ACL on port groups.

2018-07-04 Thread Daniel Alvarez Sanchez
Thanks a lot Han!


On Tue, Jun 26, 2018 at 9:41 AM Jakub Sitnicki  wrote:

> On Mon, 25 Jun 2018 10:03:02 -0700
> Han Zhou  wrote:
>
> > A bug was reported on the feature of applying ACLs on port groups [1].
> > This bug was not detected by the original test case, because it didn't
> > test the return traffic and so didn't ensure the stateful feature was
> > working. The fix [2] causes the original test case to fail, because
> > once conntrack is enabled the test packets are dropped: the checksums
> > in those packets are invalid, so the packets are marked with the
> > "invalid" state by conntrack. To avoid the test case failure, the fix
> > [2] changed it to test stateless ACLs only, which leaves the scenario
> > untested, although it is fixed. This patch adds back the stateful ACL
> > in the test, and replaces dummy/receive with inject-pkt to send the
> > test packets, so that checksums can be properly filled in. It also
> > adds tests for the return traffic, which ensures the stateful
> > behaviour is working.
> >
> > [1]
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-June/046927.html
> >
> > [2] https://patchwork.ozlabs.org/patch/931913/
> >
> > Signed-off-by: Han Zhou 
> > ---
> > Note: this patch depends on Daniel's patch [2] which is not merged yet.
> > v1->v2:
> > - Addressed Jakub's comments - simplified packet expr and removed
> >   debug information.
> > - Renamed test_ip to test_icmp.
> > v2->v3:
> > - Updated comments.
> >
> >  tests/ovn.at | 69
> ++--
> >  1 file changed, 48 insertions(+), 21 deletions(-)
>
> Acked-by: Jakub Sitnicki 
>
Acked-by: Daniel Alvarez 



[ovs-dev] [PATCH] db-ctl-base: Fix compilation warnings.

2018-07-04 Thread Ian Stokes
This commit fixes uninitialized variable warnings in functions
cmd_create() and cmd_get() when compiling with gcc 6.3.1 and -Werror
by initializing variables 'symbol' and 'new' to NULL.
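The warning pattern being fixed can be reduced to a few lines. This is a hypothetical example, not the db-ctl-base code: a variable that is assigned only through an out-parameter on the success path, which gcc cannot always prove is initialized at the use site unless it starts with a known value.

```c
#include <stddef.h>

/* Fills '*out' and returns NULL on success; returns an error string and
 * leaves '*out' untouched on failure -- the shape of create_symbol(). */
static const char *fill(int ok, int **out)
{
    static int value = 42;
    if (ok) {
        *out = &value;
        return NULL;
    }
    return "error";
}

static int use_it(int ok)
{
    int *p = NULL;      /* initializing silences -Wmaybe-uninitialized */
    if (fill(ok, &p) != NULL) {
        return -1;      /* error path: 'p' is never read */
    }
    return *p;          /* without the init, gcc may warn 'p' could be
                         * used uninitialized here */
}
```

Initializing at declaration, as this patch does, is the least invasive fix: it changes no control flow and costs one store.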

Cc: Alex Wang 
Fixes: 07ff77ccb82a ("db-ctl-base: Make common database command code
  into library.")
Signed-off-by: Ian Stokes 
---
 lib/db-ctl-base.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/db-ctl-base.c b/lib/db-ctl-base.c
index a59ac30..1768b45 100644
--- a/lib/db-ctl-base.c
+++ b/lib/db-ctl-base.c
@@ -919,8 +919,8 @@ cmd_get(struct ctl_context *ctx)
 }
 
 if (id) {
-struct ovsdb_symbol *symbol;
-bool new;
+struct ovsdb_symbol *symbol = NULL;
+bool new = NULL;
 
 ctx->error = create_symbol(ctx->symtab, id, &symbol, &new);
 if (ctx->error) {
@@ -1698,7 +1698,7 @@ cmd_create(struct ctl_context *ctx)
 const char *table_name = ctx->argv[1];
 const struct ovsdb_idl_table_class *table;
 const struct ovsdb_idl_row *row;
-const struct uuid *uuid;
+const struct uuid *uuid = NULL;
 int i;
 
 ctx->error = get_table(table_name, &table);
@@ -1706,7 +1706,7 @@ cmd_create(struct ctl_context *ctx)
 return;
 }
 if (id) {
-struct ovsdb_symbol *symbol;
+struct ovsdb_symbol *symbol = NULL;
 
 ctx->error = create_symbol(ctx->symtab, id, &symbol, NULL);
 if (ctx->error) {
@@ -1719,8 +1719,6 @@ cmd_create(struct ctl_context *ctx)
 symbol->strong_ref = true;
 }
 uuid = &symbol->uuid;
-} else {
-uuid = NULL;
 }
 
 row = ovsdb_idl_txn_insert(ctx->txn, table, uuid);
-- 
2.7.5



[ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-07-04 Thread Matteo Croce
From: Stefano Brivio 

Open vSwitch sends to userspace all received packets that have
no associated flow (thus doing an "upcall"). Then the userspace
program creates a new flow and determines the actions to apply
based on its configuration.

When a single port generates a high rate of upcalls, it can
prevent other ports from dispatching their own upcalls. vswitchd
overcomes this problem by creating many netlink sockets for each
port, but it quickly exceeds any reasonable maximum number of
open files when dealing with huge amounts of ports.

This patch queues all the upcalls into a list, ordering them in
a per-port round-robin fashion, and schedules a deferred work to
queue them to userspace.

The algorithm to queue upcalls in a round-robin fashion,
provided by Stefano, is based on these two rules:
 - upcalls for a given port must be inserted after all the other
   occurrences of upcalls for the same port already in the queue,
   in order to avoid out-of-order upcalls for a given port
 - insertion happens once the highest upcall count for any given
   port (excluding the one currently at hand) is greater than the
   count for the port we're queuing to -- if this condition is
   never true, upcall is queued at the tail. This results in a
   per-port round-robin order.
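As a rough illustration of the resulting order (not the kernel insertion algorithm itself, and with simplified tie-breaking between different ports), the two rules above converge on a fair interleaving that still preserves per-port FIFO order:

```c
#include <stddef.h>

enum { RR_MAX_PORTS = 8, RR_MAX_UPCALLS = 64 };

/* Given the arrival order of upcalls (one port number per upcall),
 * emit the arrival indices in a per-port round-robin dispatch order.
 * Per-port ordering is preserved; ports are interleaved fairly. */
static size_t rr_dispatch_order(const int *port_of, size_t n, size_t *out)
{
    size_t fifo[RR_MAX_PORTS][RR_MAX_UPCALLS]; /* per-port FIFO of indices */
    size_t len[RR_MAX_PORTS] = {0};
    size_t head[RR_MAX_PORTS] = {0};
    size_t emitted = 0;

    /* First pass: bucket arrival indices by port, keeping order. */
    for (size_t i = 0; i < n; i++) {
        int p = port_of[i];
        fifo[p][len[p]++] = i;
    }
    /* Second pass: take one upcall per port per round. */
    while (emitted < n) {
        for (int p = 0; p < RR_MAX_PORTS; p++) {
            if (head[p] < len[p]) {
                out[emitted++] = fifo[p][head[p]++];
            }
        }
    }
    return emitted;
}
```

With arrivals P0, P0, P0, P1, P2, the dispatch order becomes P0, P1, P2, P0, P0: the busy port no longer starves the quiet ones, while its own upcalls stay in order.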

In order to implement a fair round-robin behaviour, a variable
queueing delay is introduced. This will be zero if the upcalls
rate is below a given threshold, and grows linearly with the
queue utilisation (i.e. upcalls rate) otherwise.

This ensures fairness among ports under load and with few
netlink sockets.

Signed-off-by: Matteo Croce 
Co-authored-by: Stefano Brivio 
---
 net/openvswitch/datapath.c | 143 ++---
 net/openvswitch/datapath.h |  27 ++-
 2 files changed, 161 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0f5ce77460d4..2cfd504562d8 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -59,6 +59,10 @@
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+#define UPCALL_QUEUE_TIMEOUT   msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_DELAY msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_LEN   200
+
 unsigned int ovs_net_id __read_mostly;
 
 static struct genl_family dp_packet_genl_family;
@@ -225,6 +229,116 @@ void ovs_dp_detach_port(struct vport *p)
ovs_vport_del(p);
 }
 
+static void ovs_dp_upcall_dequeue(struct work_struct *work)
+{
+   struct datapath *dp = container_of(work, struct datapath,
+  upcalls.work.work);
+   struct dp_upcall_info *u, *n;
+
+   spin_lock_bh(&dp->upcalls.lock);
+   list_for_each_entry_safe(u, n, &dp->upcalls.list, list) {
+   if (unlikely(ovs_dp_upcall(dp, u->skb, &u->key, u, 0)))
+   kfree_skb(u->skb);
+   else
+   consume_skb(u->skb);
+   kfree(u);
+   }
+   dp->upcalls.len = 0;
+   INIT_LIST_HEAD(&dp->upcalls.list);
+   spin_unlock_bh(&dp->upcalls.lock);
+}
+
+/* Calculate the delay of the deferred work which sends the upcalls. If it ran
+ * more than UPCALL_QUEUE_TIMEOUT ago, schedule the work immediately. Otherwise
+ * return a time between 0 and UPCALL_QUEUE_MAX_DELAY, depending linearly on the
+ * queue utilisation.
+ */
+static unsigned long ovs_dp_upcall_delay(int queue_len, unsigned long last_run)
+{
+   if (jiffies - last_run >= UPCALL_QUEUE_TIMEOUT)
+   return 0;
+
+   return UPCALL_QUEUE_MAX_DELAY -
+  UPCALL_QUEUE_MAX_DELAY * queue_len / UPCALL_QUEUE_MAX_LEN;
+}
+
+static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp,
+ struct dp_upcall_info *upcall)
+{
+   struct list_head *head = &dp->upcalls.list;
+   struct dp_upcall_info *here = NULL, *pos;
+   bool find_next = true;
+   unsigned long delay;
+   int err = 0;
+   u8 count;
+
+   spin_lock_bh(&dp->upcalls.lock);
+   if (dp->upcalls.len > UPCALL_QUEUE_MAX_LEN) {
+   err = -ENOSPC;
+   goto out;
+   }
+
+   /* Insert upcalls in the list in a per-port round-robin fashion, look
+* for insertion point:
+* - to avoid out-of-order per-port upcalls, we can insert only after
+*   the last occurrence of upcalls for the same port
+* - insert upcall only after we reach a count of occurrences for a
+*   given port greater than the one we're inserting this upcall for
+*/
+   list_for_each_entry(pos, head, list) {
+   /* Count per-port upcalls. */
+   if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) {
+   err = -ENOSPC;
+   goto out_clear;
+   }
+   dp->upcalls.count[pos->port_no]++;
+
+   if (pos->port_no == upcall->port_no) {
+   /* Another upcall for the same 

Re: [ovs-dev] [PATCH] ofproto-macros: Ignore "Dropped # log messages" in check_logs.

2018-07-04 Thread Timothy Redaelli
On Tue,  3 Jul 2018 11:32:18 -0700
Ben Pfaff  wrote:

> check_logs ignores some log messages, but it wasn't smart enough to
> ignore the messages that said that the ignored messages had been
> rate-limited. This fixes the problem.
> 
> It's OK to ignore all rate-limiting messages because they only appear
> if at least one message was not rate-limited, which check_logs will
> catch anyway.
> 
> Reported-by: Timothy Redaelli 
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-July/046978.html
> Signed-off-by: Ben Pfaff 
> ---
>  tests/ofproto-macros.at | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/tests/ofproto-macros.at b/tests/ofproto-macros.at
> index 7388a20a2236..2a56ae6e2f3e 100644
> --- a/tests/ofproto-macros.at
> +++ b/tests/ofproto-macros.at
> @@ -400,6 +400,11 @@ check_logs () {
>  # all "connection reset" warning logs for similar reasons
> (either EPIPE or # ECONNRESET can be returned on a send depending on
> whether the peer had # unconsumed data when it closed the socket).
> +#
> +# We also ignore "Dropped # log messages..." messages. Otherwise, even if
> +# we ignore the messages that were rate-limited, we can end up failing just
> +# because of the announcement that rate-limiting happened (and in a racy,
> +# timing-dependent way, too).
>  sed -n "$1
>  /reset by peer/d
>  /Broken pipe/d
> @@ -408,6 +413,7 @@ check_logs () {
>  /timeval.*disk: [[0-9]]* reads, [[0-9]]* writes/d
>  /timeval.*context switches: [[0-9]]* voluntary, [[0-9]]* involuntary/d
>  /ovs_rcu.*blocked [[0-9]]* ms waiting for .* to quiesce/d
> +/Dropped [[0-9]]* log messages/d
>  /|WARN|/p
>  /|ERR|/p
>  /|EMER|/p" ${logs}

Hi,
tested on Fedora Rawhide.
I still see "2018-07-04T13:54:27.012Z|00038|jsonrpc|WARN|Dropped 14
log messages in last 8 seconds (most recently, 2 seconds ago) due to
excessive rate", but the test doesn't fail anymore.

Tested-By: Timothy Redaelli 


[ovs-dev] Query regarding connection tracking info sent to controller as part of PACKET_IN

2018-07-04 Thread Keshav Gupta
Hi
   For the pipeline below I see different behavior depending on whether the
bridge is of type netdev or system. In OVS 2.6.2 (and OVS 2.8.2), when a
netdev bridge is used the conntrack fields are sent to the controller, while
in the case of a kernel bridge they are not. So I want to know which is the
correct behavior: under which conditions is conntrack info included in the
match field of the PACKET_IN message sent to the controller?

cookie=0x800, duration=1166.892s, table=0, n_packets=12, n_bytes=1603, 
priority=4,in_port=LOCAL,vlan_tci=0x/0x1fff 
actions=write_metadata:0x200/0xff01,goto_table:17
cookie=0x690, duration=1144.308s, table=17, n_packets=12, n_bytes=1603, 
priority=10,metadata=0x200/0xff00 
actions=write_metadata:0x9200/0xfffe,goto_table:210
cookie=0x690, duration=1216.748s, table=210, n_packets=5, n_bytes=245, 
priority=61010,ip,metadata=0x200/0xf00,dl_src=fa:16:3e:b6:f6:f2,nw_src=30.0.0.5
 actions=goto_table:211
cookie=0x690, duration=1279.576s, table=211, n_packets=5, n_bytes=245, 
priority=100,udp actions=write_metadata:0/0x2,goto_table:212
cookie=0x690, duration=1255.979s, table=212, n_packets=10, n_bytes=486, 
priority=100,ip,metadata=0x200/0xf00 
actions=ct(table=213,zone=5003)
cookie=0x690, duration=1343.665s, table=213, n_packets=13, n_bytes=673, 
priority=0 actions=goto_table:214
cookie=0x690, duration=1311.077s, table=214, n_packets=14, n_bytes=681, 
priority=1002,ip,metadata=0x200/0xf00 actions=goto_table:217
cookie=0x690, duration=1320.567s, table=217, n_packets=14, n_bytes=681, 
priority=100,ip,metadata=0x200/0xf02 
actions=ct(commit,zone=5003,exec(set_field:0x1->ct_mark)),resubmit(,17)
cookie=0x801, duration=1335.836s, table=17, n_packets=28, n_bytes=2271, 
priority=10,metadata=0x9200/0xff00 
actions=load:0x186a0->NXM_NX_REG3[0..24],write_metadata:0xa2030d40/0xfffe,goto_table:19
cookie=0x809, duration=1374.873s, table=19, n_packets=16, n_bytes=777, 
priority=20,metadata=0x30d40/0xfe,dl_dst=fa:16:3e:64:7e:bd 
actions=goto_table:21
cookie=0x804, duration=1379.940s, table=21, n_packets=20, n_bytes=964, 
priority=10,ip,metadata=0x30d40/0xfe actions=goto_table:26
cookie=0x806, duration=1604.891s, table=26, n_packets=3, n_bytes=148, 
priority=5,ip,metadata=0x30d40/0xfe actions=goto_table:46
cookie=0x81286a9, duration=1629.410s, table=46, n_packets=1, n_bytes=50, 
priority=5,udp,metadata=0x30d40/0xfe actions=CONTROLLER:6653

Thanks
Keshav


[ovs-dev] Monitor Real-Time locations for your Employees

2018-07-04 Thread Ricky
 

 

Hi,

I hope you are doing well.

I am delighted to share with you "THE SECRET BEHIND SUCCESSFUL SALES TEAM
MANAGEMENT". I found this beneficial to you & your organization.

Check: 5 Advantages of Smart Work-flow:

1.  Automate work-force.
2.  Increases transparency.
3.  Instant communication.
4.  Improves business efficiency.
5.  Saves time & cost.

I think it's something that your organization might see immediate value in.

Looking forward to your response.

Regards,
Ricky

Travelize
Address - #No.5, First Floor, Race Course Road, Bengaluru-560009,
Karnataka, India
Mobile: 7406878981

If you are not interested please reply in subject line as UN-SUBSCRIBE



[ovs-dev] [PATCH v2] ofproto-dpif-xlate: Fix packet_in reason for Table-miss rule

2018-07-04 Thread Keshav Gupta
Currently in OvS, if we hit a "Table-miss" rule (one associated with a
Controller action), we send a PACKET_IN message to the controller with the
reason OFPR_NO_MATCH.

A "Table-miss" rule is a catch-all rule with priority 0.

But if we hit the same "Table-miss" rule after executing a group entry, we
instead send the reason OFPR_ACTION (for OF1.3 and below) or OFPR_GROUP
(for OF1.4 and above).

This is because executing a group entry sets ctx->in_group; when we later
hit the "Table-miss" rule, ctx->in_group is still set, so we send the
reason OFPR_ACTION (for OF1.3) or OFPR_GROUP (for OF1.4 and above).

For example, with the following pipeline we send the reason OFPR_ACTION
even though we hit the "Table-miss" rule:

cookie=0x800, duration=761.189s, table=0, n_packets=1401, n_bytes=67954,
  priority=4,in_port=9,vlan_tci=0x/0x1fff
  actions=write_metadata:0x678700/0xff01,goto_table:17

cookie=0x681, duration=768.848s, table=17, n_packets=1418, n_bytes=68776,
  priority=10,metadata=0x678700/0xff00
  actions=write_metadata:0xe0678700/0xfffe,goto_table:60

cookie=0x680, duration=24944.312s, table=60, n_packets=58244,
  n_bytes=2519520, priority=0 actions=resubmit(,17)

cookie=0x804, duration=785.733s, table=17, n_packets=1450, n_bytes=69724,
  priority=10,metadata=0xe0678700/0xff00
  actions=write_metadata:0x67871d4d00/0xfffe,goto_table:43

cookie=0x822002d, duration=24960.795s, table=43, n_packets=53097,
  n_bytes=2230074, priority=100,arp,arp_op=1 actions=group:6000

group_id=6000,type=all,bucket=actions=CONTROLLER:65535,
  bucket=actions=resubmit(,48), bucket=actions=resubmit(,81)

cookie=0x850, duration=24977.323s, table=48, n_packets=58309, n_bytes=2522634,
  priority=0 actions=resubmit(,49),resubmit(,50)

cookie=0x805, duration=24984.679s, table=50, n_packets=6, n_bytes=264,
  priority=0 actions=CONTROLLER:65535

Currently we send table_id 50 with the packet_in reason OFPR_ACTION,
instead of the correct reason, OFPR_NO_MATCH.

Signed-off-by: Keshav Gupta 
Co-authored-by: Rohith Basavaraja 
Signed-off-by: Rohith Basavaraja 
---
 ofproto/ofproto-dpif-xlate.c | 82 +---
 tests/ofproto-dpif.at| 38 
 2 files changed, 84 insertions(+), 36 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index c02a032..0215793 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -249,7 +249,6 @@ struct xlate_ctx {
  */
 int depth;  /* Current resubmit nesting depth. */
 int resubmits;  /* Total number of resubmits. */
-bool in_group;  /* Currently translating ofgroup, if true. */
 bool in_action_set; /* Currently translating action_set, if true. */
 bool in_packet_out; /* Currently translating a packet_out msg, if
  * true. */
@@ -528,12 +527,12 @@ static OVSRCU_TYPE(struct xlate_cfg *) xcfgp = OVSRCU_INITIALIZER(NULL);
 static struct xlate_cfg *new_xcfg = NULL;
 
 typedef void xlate_actions_handler(const struct ofpact *, size_t ofpacts_len,
-   struct xlate_ctx *, bool);
+   struct xlate_ctx *, bool, bool);
 static bool may_receive(const struct xport *, struct xlate_ctx *);
 static void do_xlate_actions(const struct ofpact *, size_t ofpacts_len,
- struct xlate_ctx *, bool);
+ struct xlate_ctx *, bool, bool);
 static void clone_xlate_actions(const struct ofpact *, size_t ofpacts_len,
-struct xlate_ctx *, bool);
+struct xlate_ctx *, bool, bool);
 static void xlate_normal(struct xlate_ctx *);
 static void xlate_table_action(struct xlate_ctx *, ofp_port_t in_port,
uint8_t table_id, bool may_packet_in,
@@ -4104,7 +4103,7 @@ xlate_recursively(struct xlate_ctx *ctx, struct rule_dpif *rule,
 ctx->rule_cookie = rule->up.flow_cookie;
 actions = rule_get_actions(&rule->up);
 actions_xlator(actions->ofpacts, actions->ofpacts_len, ctx,
-   is_last_action);
+   is_last_action, false);
 ctx->rule_cookie = old_cookie;
 ctx->rule = old_rule;
 ctx->depth -= deepens;
@@ -4279,7 +4278,8 @@ xlate_group_bucket(struct xlate_ctx *ctx, struct ofputil_bucket *bucket,
 
 ofpacts_execute_action_set(&action_list, &action_set);
 ctx->depth++;
-do_xlate_actions(action_list.data, action_list.size, ctx, is_last_action);
+do_xlate_actions(action_list.data, action_list.size, ctx, is_last_action,
+ true);
 ctx->depth--;
 
 ofpbuf_uninit(&action_list);
@@ -4459,9 +4459,6 @@ static void
 xlate_group_action__(struct xlate_ctx *ctx, struct group_dpif *group,