[dpdk-dev] [PATCH 06/11] kvargs: simpler parsing and allow duplicated keys

2014-01-29 Thread Olivier MATZ
Hi Bruce,

>> Also replace rte_strsplit() by more a standard function strtok_r() that is
>> easier to understand for people already knowing the libc. It also avoids
>> useless calls to strnlen(). The delimiters macros become strings instead of
>> chars due to the strtok_r() API.
>>
> 
> As a general rule, we try to use only string functions which
> track the buffer length they are working with, which is why the
> function rte_strsplit() is used. While strtok_r() is indeed a
> standard C function, why not use the code as originally written?

Thank you for your comment.

Another reason for not using rte_strsplit() is that the buffer length
argument that was given was not relevant. Indeed, the string we want to
parse initially comes from the command line arguments, which is
duplicated with strdup(). It is not stored in a fixed size buffer.

In my opinion, giving strnlen(pairs[i], MAX_ARG_STRLEN) as the size
to rte_strsplit() would not prevent the program to a segfault or to
read a corrupted string if it is not properly null-terminated.

Does it makes sense ?

Regards
Olivier



[dpdk-dev] multiple VLAN IDs for SR-IOV ports

2014-01-29 Thread James Yu
Any one know what commands to use on the KVM host to add multiple VLAN IDs
to a SR-IOV port ?

I could only add one VLAN ID using

ip link set ethN vf NUM vlan VLANID

Also do I have to call any routine on the DPDK side to add those VIDs ?

James


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread François-Frédéric Ozog
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring to
> recompile all DPDK based applications to distribute a bugfix or to add
> support for a new PMD is not ideal.

> 
> So ideally OVS would have the possibility to link against the shared
> library long term.

I agree that distribution of DPDK apps is not covered properly at present.
Identifying the proper scheme requires a specific analysis based on the
constraints of the Telecom/Cloud/Networking markets.

In the telecom world, if you fix the underlying framework of an app, you
will still have to validate the solution, ie app/framework. In addition, the
idea of shared libraries introduces the implied requirement to validate apps
against diverse versions of DPDK shared libraries. This translates into
development and support costs.

I also expect many DPDK applications to tackle core networking features,
with sub micro second packet handling delays  and even lower than 200ns
(NAT64...). The lazy binding based on ELF PLT represent quite a cost, not
mentioning that optimization stops are shared libraries boundaries (gcc
whole program optimization can be very effective...). Microsoft DLL linkage
are an order of magnitude faster. If Linux was to provide that, I would
probably revise my judgment. (I haven't checked Linux dynamic linking
implementation for some time so my understanding of Linux dynamic linking
may be outdated).


> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straight forward at first so it seems logical to
> focus on that first.

I don't think it is so straight forward. Many recent cards such as Chelsio
and Myricom have a very different "packet memory layout" that does not fit
so easily into actual DPDK architecture.

1) "traditional" architecture: the driver reserves X buffers and provide the
card with descriptors of those buffers. Each packet is DMA'ed into exactly
one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
one buffer

2) "alternative" new architecture: the driver reserves a memory zone, say
4MB, without any structure, and provide a a single zone description and a
ring buffer to the card. (there no individual buffer descriptors any more).
The card fills the memory zone with packets, one next to the other and
specifies where the packets are by updating the supplied ring. Out of the
many issues fitting this scheme into DPDK, you cannot free a single mbuf:
you have to maintain a ref count to the memory zone so that, when all mbufs
have been "released", the memory zone can be freed.
That's quite a stretch from actual paradigm.

Apart from this aspect, managing RSS is two tied to Intel's flow director
concepts and cannot accommodate directly smarter or dumber RSS mechanisms.

That said, I fully agree PMD API should be revisited.


Cordially,

Fran?ois-Fr?d?ric



[dpdk-dev] How to debug packet sends to virtual functions

2014-01-29 Thread Mats Liljegren
I'm trying to get a modified version of the l2fwd example running, and
have problems with packets being silently thrown away. I can receive
packets, and my printf's indicates that the packets are being sent to
the correct port, using correct MAC address as source address. And
still, the packets are lost.

Since the port is a virtual function, it seems like I cannot use
tcpdump on it to see the network traffic. There is nothing coming out
of the cable (activity light not flashing, the receiving end running
tcpdump does not see any traffic).

I'm using two X540 with two virtual functions each. The application
runs in a KVM/Qemu environmen.

Any suggestions how to debug this?

Regards,
Mats


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> Thomas,
>
> First and easy answer: it is open source, so anyone can recompile. So,
> what's the issue?

I'm talking from a pure distribution perspective here: Requiring to
recompile all DPDK based applications to distribute a bugfix or to
add support for a new PMD is not ideal.

So ideally OVS would have the possibility to link against the shared
library long term.

> I get lost: do you mean ABI + API toward the PMDs or towards the
> applications using the librte ?

Towards the PMDs is more straight forward at first so it seems logical
to focus on that first.

A stable API and ABI for librte seems required as well long term as
DPDK does offer shared libraries but I realize that this is a stretch
goal in the initial phase.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Vincent JARDIN
Thomas,

First and easy answer: it is open source, so anyone can recompile. So, 
what's the issue?

> Without a concept of stable interfaces, it will be difficult to
> package and distribute RTE libraries, PMD, and DPDK applications. Right
> now, the obvious path would include packaging the PMD bits together
> with each DPDK application depending on the version of DPDK the binary
> was compiled against. This is clearly not ideal.

>
>> I agree that some areas could be improved since they are not into the
>> critical datapath of packets, but still other areas remain very CPU
>> constraints. For instance:
>> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>>
>> is bad:
>> struct eth_dev_ops
>> is churned, no comment, and a #ifdef that changes the structure
>> according to compilation!
>
> This is a very good example as it outlines the difference between
> control structures and the fast path. We have this same exact trade off
> in the kernel a lot where we have highly optimized internal APIs
> towards modules and drivers but want to provide binary compatibility to
> a certain extend.

As long as we agree on this limited scope, we'll think about it and 
provide a proposal on dev at dpdk.org mailing list.

> As for the specific example you mention, it is relatively trivial to
> make eth_dev_ops backwards compatible by appending appropriate padding
> to the struct before a new major release and ensure that new members
> are added by replacing the padding accordingly. Obviously no ifdefs
> would be allowed anymore.

Of course, it is basic C!

>> Should an application use the librte libraries of the DPDK:
>>- you can use RTE_VERSION and RTE_VERSION_NUM :
>> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
>
> Right. This would be more or less identical to requiring a specific
> DPDK version in OVS_CHEC_DPDK. It's not ideal to require application to
> clutter their code with #ifdefs all over for every new minor release
> though.
>
>>- you can write your own wrapper (with CPU overhead) in order to have
>> a stable ABI, that wrapper should be tight to the versions of the librte
>> => the overhead is part of your application instead of the DPDK,
>>- *otherwise recompile your software, it is opensource, what's the
>> issue?*
>>
>> We are opened to any suggestion to have stable ABI, but it should never
>> remove the options to have fast/efficient/compilation/CPU execution
>> processing.
>
> Absolutely agreed. We also don't want to add tons of abstraction and
> overcomplicate everything. Still, I strongly believe that the definition
> of stable interfaces towards applications and especially PMD is
> essential.
>
> I'm not proposing to standardize all the APIs towards applications on
> the level of POSIX. DPDK is in early stages and disruptive changes will
> come along. What I would propose on an abstract level is:
>
> 1. Extend but not break API between minor releases. Postpone API
> breakages to the next major release. High cadence of major
> releases initially, lower cadence as DPDK matures.
>
> 2. Define ABI stability towards PMD for minor releases to allow
> isolated packaging of PMD by padding control structures and keeping
> functions ABI stable.

I get lost: do you mean ABI + API toward the PMDs or towards the 
applications using the librte ?

Best regards,
   Vincent



[dpdk-dev] [PATCH 06/11] kvargs: simpler parsing and allow duplicated keys

2014-01-29 Thread Richardson, Bruce
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, January 28, 2014 4:07 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH 06/11] kvargs: simpler parsing and allow
> duplicated keys
> 
> Remove the rte_kvargs_add_pair() function whose only role was to check if
> a key is duplicated. Having duplicated keys is now allowed by kvargs API.
> 
> Also replace rte_strsplit() by more a standard function strtok_r() that is
> easier to understand for people already knowing the libc. It also avoids
> useless calls to strnlen(). The delimiters macros become strings instead of
> chars due to the strtok_r() API.
> 

As a general rule, we try to use only string functions which track the buffer 
length they are working with, which is why the function rte_strsplit() is used. 
While strtok_r() is indeed a standard C function, why not use the code as 
originally written?


[dpdk-dev] [PATCH 04/11] kvargs: remove useless size field

2014-01-29 Thread Richardson, Bruce
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, January 28, 2014 4:07 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH 04/11] kvargs: remove useless size field
> 
> This value was not very useful as the size of the table is fixed (equals
> RTE_KVARGS_MAX).
> 
> By the way, the memset in the initialization function was wrong (size too
> short). Even if it was not really an issue since we rely on the "count" 
> field, it
> is now fixed by this patch.
> 
> Signed-off-by: Olivier Matz 
> ---
Acked-by: Bruce Richardson 



[dpdk-dev] [PATCH 03/11] kvargs: remove driver name in arguments

2014-01-29 Thread Richardson, Bruce
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, January 28, 2014 4:07 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH 03/11] kvargs: remove driver name in
> arguments
> 
> Now that rte_kvargs is a generic library, there is no need to have an
> argument for the driver name in rte_kvargs_tokenize() and
> rte_kvargs_parse() prototypes. This argument was only used to log the
> driver name in case of error. Instead, we can add a log in init function of
> pmd_pcap and pmd_ring.
> 
> Signed-off-by: Olivier Matz 
> ---
>  lib/librte_kvargs/rte_kvargs.c | 13 ++---
>  lib/librte_kvargs/rte_kvargs.h |  4 +---
>  lib/librte_pmd_pcap/rte_eth_pcap.c |  4 +++-
> lib/librte_pmd_ring/rte_eth_ring.c |  2 ++
>  4 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/librte_kvargs/rte_kvargs.c b/lib/librte_kvargs/rte_kvargs.c
> index 7a950d6..935698c 100644
> --- a/lib/librte_kvargs/rte_kvargs.c
> +++ b/lib/librte_kvargs/rte_kvargs.c
> @@ -91,8 +91,7 @@ rte_kvargs_add_pair(struct rte_kvargs *kvlist, char
> *key, char *val)
>   * strtok() is used so the params string will be copied to be modified.
>   */
>  static int
> -rte_kvargs_tokenize(struct rte_kvargs *kvlist, const char *name,
> - const char *params)
> +rte_kvargs_tokenize(struct rte_kvargs *kvlist, const char *params)
>  {
>   unsigned i, count;
>   char *args;
> @@ -101,7 +100,7 @@ rte_kvargs_tokenize(struct rte_kvargs *kvlist,
> const char *name,
> 
>   /* If params are empty, nothing to do */
>   if (params == NULL || params[0] == 0) {
> - RTE_LOG(ERR, PMD, "Couldn't parse %s device, empty
> arguments\n", name);
> + RTE_LOG(ERR, PMD, "Cannot parse empty arguments\n");
>   return -1;
>   }
> 
> @@ -110,7 +109,7 @@ rte_kvargs_tokenize(struct rte_kvargs *kvlist,
> const char *name,
>*/
>   args = strdup(params);
>   if(args == NULL){
> - RTE_LOG(ERR, PMD, "Couldn't parse %s device \n", name);
> + RTE_LOG(ERR, PMD, "Cannot parse arguments: not enough
> memory\n");
>   return -1;
>   }
> 
> @@ -127,7 +126,8 @@ rte_kvargs_tokenize(struct rte_kvargs *kvlist,
> const char *name,
>   if (pair[0] == NULL || pair[1] == NULL || pair[0][0] == 0
>   || pair[1][0] == 0) {
>   RTE_LOG(ERR, PMD,
> - "Couldn't parse %s device, wrong key
> or value \n", name);
> + "Cannot parse arguments: wrong key or
> value\n"
> + "params=<%s>\n", params);
>   goto error;
>   }
> 
> @@ -230,14 +230,13 @@ rte_kvargs_process(const struct rte_kvargs
> *kvlist,
>   */
>  int
>  rte_kvargs_parse(struct rte_kvargs *kvlist,
> - const char *name,
>   const char *args,
>   const char *valid_keys[])
>  {
> 
>   int ret;
> 
> - ret = rte_kvargs_tokenize(kvlist, name, args);
> + ret = rte_kvargs_tokenize(kvlist, args);
>   if (ret < 0)
>   return ret;
> 
> diff --git a/lib/librte_kvargs/rte_kvargs.h b/lib/librte_kvargs/rte_kvargs.h
> index 19485b1..804ea1d 100644
> --- a/lib/librte_kvargs/rte_kvargs.h
> +++ b/lib/librte_kvargs/rte_kvargs.h
> @@ -100,8 +100,6 @@ int rte_kvargs_init(struct rte_kvargs *kvlist);
>   *
>   * @param kvlist
>   *   The rte_kvargs structure
> - * @param name
> - *   The name of the driver
>   * @param args
>   *   The input string containing the key/value associations
>   * @param valid_keys
> @@ -112,7 +110,7 @@ int rte_kvargs_init(struct rte_kvargs *kvlist);
>   *   - 0 on success
>   *   - Negative on error
>   */
> -int rte_kvargs_parse(struct rte_kvargs *kvlist, const char *name,
> +int rte_kvargs_parse(struct rte_kvargs *kvlist,
>   const char *args, const char *valid_keys[]);
> 
>  /**
> diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c
> b/lib/librte_pmd_pcap/rte_eth_pcap.c
> index e47afcb..2006b35 100644
> --- a/lib/librte_pmd_pcap/rte_eth_pcap.c
> +++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
> @@ -706,6 +706,8 @@ rte_pmd_pcap_init(const char *name, const char
> *params)
>   struct rx_pcaps pcaps;
>   struct tx_pcaps dumpers;
> 
> + RTE_LOG(INFO, PMD, "Initializing pmd_pcap for %s\n", name);
> +
>   rte_kvargs_init();
> 
>   numa_node = rte_socket_id();
> @@ -714,7 +716,7 @@ rte_pmd_pcap_init(const char *name, const char
> *params)
>   start_cycles = rte_get_timer_cycles();
>   hz = rte_get_timer_hz();
> 
> - if (rte_kvargs_parse(, name, params, valid_arguments) < 0)
> + if (rte_kvargs_parse(, params, valid_arguments) < 0)
>   return -1;
> 
>   /*
> diff --git a/lib/librte_pmd_ring/rte_eth_ring.c
> b/lib/librte_pmd_ring/rte_eth_ring.c
> index fa3ff72..abef2e8 100644
> --- a/lib/librte_pmd_ring/rte_eth_ring.c
> +++ b/lib/librte_pmd_ring/rte_eth_ring.c
> @@ -384,6 

[dpdk-dev] [PATCH 02/11] kvargs: use the new library in pmd_pcap

2014-01-29 Thread Richardson, Bruce
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, January 28, 2014 4:07 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH 02/11] kvargs: use the new library in pmd_pcap
> 
> The rte_kvargs library is a reworked copy of rte_eth_pcap_arg_parser, so it
> provides the same service. Therefore we can use it and remove the code of
> rte_eth_pcap_arg_parser.
> 
> Signed-off-by: Olivier Matz 
> ---
>  lib/librte_pmd_pcap/Makefile  |   8 +-
>  lib/librte_pmd_pcap/rte_eth_pcap.c|  29 +--
>  lib/librte_pmd_pcap/rte_eth_pcap_arg_parser.c | 255 
> --
> lib/librte_pmd_pcap/rte_eth_pcap_arg_parser.h |  71 ---
>  4 files changed, 19 insertions(+), 344 deletions(-)  delete mode 100644
> lib/librte_pmd_pcap/rte_eth_pcap_arg_parser.c
>  delete mode 100644 lib/librte_pmd_pcap/rte_eth_pcap_arg_parser.h
> 
> diff --git a/lib/librte_pmd_pcap/Makefile b/lib/librte_pmd_pcap/Makefile
> index 33174c0..741dafc 100644
> --- a/lib/librte_pmd_pcap/Makefile
> +++ b/lib/librte_pmd_pcap/Makefile
> @@ -1,6 +1,7 @@
>  #   BSD LICENSE
>  #
>  #   Copyright(c) 2010-2013 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
>  #   All rights reserved.
>  #
>  #   Redistribution and use in source and binary forms, with or without
> @@ -43,16 +44,15 @@ CFLAGS += $(WERROR_FLAGS)  # all source are
> stored in SRCS-y  #
>  SRCS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += rte_eth_pcap.c
> -SRCS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += rte_eth_pcap_arg_parser.c
> -
> 
>  #
>  # Export include files
>  #
>  SYMLINK-y-include += rte_eth_pcap.h
> -SYMLINK-y-include += rte_eth_pcap_arg_parser.h
> 
>  # this lib depends upon:
> -DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += lib/librte_mbuf
> lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += lib/librte_ether
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_PCAP) += lib/librte_kvargs
> 
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c
> b/lib/librte_pmd_pcap/rte_eth_pcap.c
> index 208e316..e47afcb 100644
> --- a/lib/librte_pmd_pcap/rte_eth_pcap.c
> +++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
> @@ -2,6 +2,7 @@
>   *   BSD LICENSE
>   *
>   *   Copyright(c) 2010-2013 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2014 6WIND S.A.
>   *   All rights reserved.
>   *
>   *   Redistribution and use in source and binary forms, with or without
> @@ -39,9 +40,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "rte_eth_pcap.h"
> -#include "rte_eth_pcap_arg_parser.h"
> 
>  #define RTE_ETH_PCAP_SNAPSHOT_LEN 65535  #define
> RTE_ETH_PCAP_SNAPLEN 4096 @@ -701,11 +702,11 @@
> rte_pmd_pcap_init(const char *name, const char *params)  {
>   unsigned numa_node, using_dumpers = 0;
>   int ret;
> - struct args_dict dict;
> + struct rte_kvargs kvlist;
>   struct rx_pcaps pcaps;
>   struct tx_pcaps dumpers;
> 
> - rte_eth_pcap_init_args_dict();
> + rte_kvargs_init();
> 
>   numa_node = rte_socket_id();
> 
> @@ -713,16 +714,16 @@ rte_pmd_pcap_init(const char *name, const
> char *params)
>   start_cycles = rte_get_timer_cycles();
>   hz = rte_get_timer_hz();
> 
> - if (rte_eth_pcap_parse_args(, name, params,
> valid_arguments) < 0)
> + if (rte_kvargs_parse(, name, params, valid_arguments) < 0)
>   return -1;
> 
>   /*
>* If iface argument is passed we open the NICs and use them for
>* reading / writing
>*/
> - if (rte_eth_pcap_num_of_args(, ETH_PCAP_IFACE_ARG) == 1)
> {
> + if (rte_kvargs_count(, ETH_PCAP_IFACE_ARG) == 1) {
> 
> - ret = rte_eth_pcap_post_process_arguments(,
> ETH_PCAP_IFACE_ARG,
> + ret = rte_kvargs_process(, ETH_PCAP_IFACE_ARG,
>   _rx_tx_iface, [0]);
>   if (ret < 0)
>   return -1;
> @@ -734,13 +735,13 @@ rte_pmd_pcap_init(const char *name, const
> char *params)
>* We check whether we want to open a RX stream from a real NIC
> or a
>* pcap file
>*/
> - if ((pcaps.num_of_rx = rte_eth_pcap_num_of_args(,
> ETH_PCAP_RX_PCAP_ARG))) {
> - ret = rte_eth_pcap_post_process_arguments(,
> ETH_PCAP_RX_PCAP_ARG,
> + if ((pcaps.num_of_rx = rte_kvargs_count(,
> ETH_PCAP_RX_PCAP_ARG))) {
> + ret = rte_kvargs_process(,
> ETH_PCAP_RX_PCAP_ARG,
>   _rx_pcap, );
>   } else {
> - pcaps.num_of_rx = rte_eth_pcap_num_of_args(,
> + pcaps.num_of_rx = rte_kvargs_count(,
>   ETH_PCAP_RX_IFACE_ARG);
> - ret = rte_eth_pcap_post_process_arguments(,
> ETH_PCAP_RX_IFACE_ARG,
> + ret = rte_kvargs_process(,
> ETH_PCAP_RX_IFACE_ARG,
>   _rx_iface, );
>   }
> 
> @@ -751,15 +752,15 @@ rte_pmd_pcap_init(const char *name, const
> char 

[dpdk-dev] [PATCH 01/11] kvargs: add a new library to parse key/value arguments

2014-01-29 Thread Richardson, Bruce


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, January 28, 2014 4:07 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH 01/11] kvargs: add a new library to parse
> key/value arguments
> 
> Copy the code from rte_eth_pcap_arg_parser.[ch], without any functional
> modifications, only:
> - rename functions and structure
> - restyle (indentation)
> - add comments (doxygen style)
> - add "const" or "static" attributes, remove unneeded "inline"
> 
> Signed-off-by: Olivier Matz 
> ---
>  config/defconfig_i686-default-linuxapp-gcc   |   5 +
>  config/defconfig_i686-default-linuxapp-icc   |   5 +
>  config/defconfig_x86_64-default-linuxapp-gcc |   5 +
>  config/defconfig_x86_64-default-linuxapp-icc |   5 +
>  lib/Makefile |   1 +
>  lib/librte_kvargs/Makefile   |  50 ++
>  lib/librte_kvargs/rte_kvargs.c   | 248
> +++
>  lib/librte_kvargs/rte_kvargs.h   | 159 +
>  mk/rte.app.mk|   4 +
>  9 files changed, 482 insertions(+)
>  create mode 100644 lib/librte_kvargs/Makefile  create mode 100644
> lib/librte_kvargs/rte_kvargs.c  create mode 100644
> lib/librte_kvargs/rte_kvargs.h
> 
> diff --git a/config/defconfig_i686-default-linuxapp-gcc
> b/config/defconfig_i686-default-linuxapp-gcc
> index 4f57242..349ca24 100644
> --- a/config/defconfig_i686-default-linuxapp-gcc
> +++ b/config/defconfig_i686-default-linuxapp-gcc
> @@ -141,6 +141,11 @@ CONFIG_RTE_LIBRTE_EAL_BAREMETAL=n
> CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
> 
>  #
> +# Compile the argument parser library
> +#
> +CONFIG_RTE_LIBRTE_KVARGS=y
> +
> +#
>  # Compile generic ethernet library
>  #
>  CONFIG_RTE_LIBRTE_ETHER=y
> diff --git a/config/defconfig_i686-default-linuxapp-icc
> b/config/defconfig_i686-default-linuxapp-icc
> index 90521e1..4e7338f 100644
> --- a/config/defconfig_i686-default-linuxapp-icc
> +++ b/config/defconfig_i686-default-linuxapp-icc
> @@ -141,6 +141,11 @@ CONFIG_RTE_LIBRTE_EAL_BAREMETAL=n
> CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
> 
>  #
> +# Compile the argument parser library
> +#
> +CONFIG_RTE_LIBRTE_KVARGS=y
> +
> +#
>  # Compile generic ethernet library
>  #
>  CONFIG_RTE_LIBRTE_ETHER=y
> diff --git a/config/defconfig_x86_64-default-linuxapp-gcc
> b/config/defconfig_x86_64-default-linuxapp-gcc
> index 4c05ffc..ca85b1a 100644
> --- a/config/defconfig_x86_64-default-linuxapp-gcc
> +++ b/config/defconfig_x86_64-default-linuxapp-gcc
> @@ -141,6 +141,11 @@ CONFIG_RTE_LIBRTE_EAL_BAREMETAL=n
> CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
> 
>  #
> +# Compile the argument parser library
> +#
> +CONFIG_RTE_LIBRTE_KVARGS=y
> +
> +#
>  # Compile generic ethernet library
>  #
>  CONFIG_RTE_LIBRTE_ETHER=y
> diff --git a/config/defconfig_x86_64-default-linuxapp-icc
> b/config/defconfig_x86_64-default-linuxapp-icc
> index c4d5c73..37dbdd0 100644
> --- a/config/defconfig_x86_64-default-linuxapp-icc
> +++ b/config/defconfig_x86_64-default-linuxapp-icc
> @@ -141,6 +141,11 @@ CONFIG_RTE_LIBRTE_EAL_BAREMETAL=n
> CONFIG_RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT=y
> 
>  #
> +# Compile the argument parser library
> +#
> +CONFIG_RTE_LIBRTE_KVARGS=y
> +
> +#
>  # Compile generic ethernet library
>  #
>  CONFIG_RTE_LIBRTE_ETHER=y
> diff --git a/lib/Makefile b/lib/Makefile index b655d4f..31644f2 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -52,6 +52,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_POWER) += librte_power
>  DIRS-$(CONFIG_RTE_LIBRTE_METER) += librte_meter
>  DIRS-$(CONFIG_RTE_LIBRTE_SCHED) += librte_sched
>  DIRS-$(CONFIG_RTE_LIBRTE_ACL) += librte_acl
> +DIRS-$(CONFIG_RTE_LIBRTE_KVARGS) += librte_kvargs
> 
>  ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
>  DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni diff --git
> a/lib/librte_kvargs/Makefile b/lib/librte_kvargs/Makefile new file mode
> 100644 index 000..c4c2855
> --- /dev/null
> +++ b/lib/librte_kvargs/Makefile
> @@ -0,0 +1,50 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +# * Redistributions of source code must retain the above copyright
> +#   notice, this list of conditions and the following disclaimer.
> +# * Redistributions in binary form must reproduce the above copyright
> +#   notice, this list of conditions and the following disclaimer in
> +#   the documentation and/or other materials provided with the
> +#   distribution.
> +# * Neither the name of Intel Corporation nor the names of its
> +#   contributors may be used to endorse or promote products derived
> +#   from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> +#   "AS 

[dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Prashant Upadhyaya
Hi Pravin,

I think your stuff is on the brink of a creating a mini revolution :)

Some questions inline below --
+ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
What do you mean by portid here, do you mean the physical interface id like 
eth0 which I have bound to igb_uio now ?
If I have multiple interfaces I have assigned igb_uio to, eg. eth0, eth1, eth2 
etc., what is the id mapping for those ?

If I have VM's running, then typically how to interface those VM's to this OVS 
in user space now, do I use the same classical 'tap' interface and add it to 
the OVS above.
What is the actual path the data takes from the VM now all the way to the 
switch, wouldn't it be hypervisor to kernel to OVS switch in user space to 
other VM/Network ?
I think if we can solve the VM to OVS port connectivity remaining in userspace 
only, then we have a great thing at our hand. Kindly comment on this.

Regards
-Prashant



-Original Message-
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of pshe...@nicira.com
Sent: Tuesday, January 28, 2014 7:19 AM
To: dev at openvswitch.org; dev at dpdk.org; dpdk-ovs at lists.01.org
Cc: Gerald Rogers
Subject: [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

From: Pravin B Shelar 

Following patch adds DPDK netdev-class to userspace datapath.
Approach taken in this patch differs from Intel? DPDK vSwitch
where DPDK datapath switching is done in saparate process.  This
patch adds support for DPDK type port and uses OVS userspace
datapath for switching.  Therefore all DPDK processing and flow
miss handling is done in single process.  This also avoids code
duplication by reusing OVS userspace datapath switching and
therefore it supports all flow matching and actions that
user-space datapath supports.  Refer to INSTALL.DPDK doc for
further info.

With this patch I got similar performance for netperf TCP_STREAM
tests compared to kernel datapath.

This is based a patch from Gerald Rogers.

Signed-off-by: Pravin B Shelar 
CC: "Gerald Rogers" 
---
This patch is tested on latest OVS master (commit 9d0581fdf22bec79).
---
 INSTALL |1 +
 INSTALL.DPDK|   85 
 Makefile.am |1 +
 acinclude.m4|   40 ++
 configure.ac|1 +
 lib/automake.mk |6 +
 lib/dpif-netdev.c   |  393 +++-
 lib/netdev-dpdk.c   | 1152 +++
 lib/netdev-dpdk.h   |7 +
 lib/netdev-dummy.c  |   38 +-
 lib/netdev-linux.c  |   33 +-
 lib/netdev-provider.h   |   13 +-
 lib/netdev-vport.c  |1 +
 lib/netdev.c|   52 ++-
 lib/netdev.h|   15 +-
 lib/ofpbuf.c|7 +-
 lib/ofpbuf.h|   13 +-
 lib/packets.c   |9 +
 lib/packets.h   |1 +
 vswitchd/ovs-vswitchd.c |   14 +-
 20 files changed, 1702 insertions(+), 180 deletions(-)
 create mode 100644 INSTALL.DPDK
 create mode 100644 lib/netdev-dpdk.c
 create mode 100644 lib/netdev-dpdk.h

diff --git a/INSTALL b/INSTALL
index 001d3cb..74cd278 100644
--- a/INSTALL
+++ b/INSTALL
@@ -10,6 +10,7 @@ on a specific platform, please see one of these files:
 - INSTALL.RHEL
 - INSTALL.XenServer
 - INSTALL.NetBSD
+- INSTALL.DPDK

 Build Requirements
 --
diff --git a/INSTALL.DPDK b/INSTALL.DPDK
new file mode 100644
index 000..1c95104
--- /dev/null
+++ b/INSTALL.DPDK
@@ -0,0 +1,85 @@
+   Using Open vSwitch with DPDK
+   
+
+Open vSwitch can use Intel(R) DPDK lib to operate entirely in
+userspace. This file explains how to install and use Open vSwitch in
+such a mode.
+
+The DPDK support of Open vSwitch is considered experimental.
+It has not been thoroughly tested.
+
+This version of Open vSwitch should be built manually with "configure"
+and "make".
+
+Building and Installing:
+
+
+DPDK:
+cd DPDK
+make install T=x86_64-default-linuxapp-gcc
+Refer to http://dpdk.org/ requirements of details.
+
+Linux kernel:
+Refer to intel-dpdk-getting-started-guide.pdf for understanding
+DPDK kernel requirement.
+
+OVS:
+cd $(OVS_DIR)/openvswitch
+./boot.sh
+./configure --with-dpdk=$(DPDK_BUILD)
+make
+
+Refer to INSTALL.userspace for general requirements of building
+userspace OVS.
+
+Using the DPDK with ovs-vswitchd:
+-
+
+Fist setup DPDK devices:
+  - insert igb_uio.ko
+e.g. insmod DPDK/x86_64-default-linuxapp-gcc/kmod/igb_uio.ko
+  - mount hugefs
+e.g. mount -t hugetlbfs -o pagesize=1G none /mnt/huge/
+  - Bind network device to ibg_uio.
+e.g. DPDK/tools/pci_unbind.py --bind=igb_uio eth1
+
+Ref to http://www.dpdk.org/doc/quick-start for verifying DPDK setup.
+
+Start vswitchd:
+DPDK configuration arguments can be passed to vswitchd via `--dpdk`
+argument.
+   e.g.
+   ./vswitchd/ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK  --pidfile 
--detach
+
+To 

[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Pravin Shelar
On Wed, Jan 29, 2014 at 2:01 AM, Thomas Graf  wrote:
> On 01/28/2014 02:48 AM, pshelar at nicira.com wrote:
>>
>> From: Pravin B Shelar 
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel? DPDK vSwitch
>> where DPDK datapath switching is done in saparate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>>
>> This is based a patch from Gerald Rogers.
>>
>> Signed-off-by: Pravin B Shelar 
>> CC: "Gerald Rogers" 
>
>
> Pravin,
>
> Some initial comments below. I will provide more after deeper
> digging.
>

> Do you have any ideas on how to implement the TX batching yet?
>
We can batch packets for some interval, then to do tx-burst. But I did
not see any performance improvements as we have other bottleneck in
userspace datapath. Ben's RCU patch should help there.

>
>> +
>> +static int
>> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
>> +{
>> +struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
>> +int pending;
>> +int i;
>> +
>> +pending = rx->ofpbuf_cnt;
>> +if (pending) {
>
>
> This conditional seems unneeded.
>
Right.

>
>> +for (i = 0; i < pending; i++) {
>> + build_ofpbuf(rx, >ofpbuf[i], NULL);
>> +}
>> +rx->ofpbuf_cnt = 0;
>> +return 0;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +/* Tx function. Transmit packets indefinitely */
>> +static int
>> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct rte_mbuf *pkt;
>> +uint32_t nb_tx = 0;
>> +
>> +pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> +if (!pkt) {
>> +return 0;
>
>
> Silent drop? ;-) Shouldn't these drops be accounted for somehow?
>
ahh, I will keep it in netdev-dpdk.

>
>> +}
>> +
>> +/* We have to do a copy for now */
>> +memcpy(pkt->pkt.data, buf, size);
>> +
>> +rte_pktmbuf_data_len(pkt) = size;
>> +rte_pktmbuf_pkt_len(pkt) = size;
>> +
>> +rte_spinlock_lock(>tx_lock);
>
>
> What is the purpose of tx_lock here? Multiple threads writing to
> the same Q? The lock is not acquired for the zerocopy path below.
>
There are PMD threads which have their own queue. So tx in these
threads is lockless. But vswitchd can send packet from other thread
all other thread send packets from single queue which is locked.

>
>> +nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, , 1);
>> +rte_spinlock_unlock(>tx_lock);
>> +
>> +if (nb_tx != 1) {
>> +/* free buffers if we couldn't transmit packets */
>> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **), 1);
>> +}
>> +return nb_tx;
>> +}
>> +
>> +static int
>> +netdev_dpdk_send(struct netdev *netdev,
>> + struct ofpbuf *ofpbuf, bool may_steal)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +if (ofpbuf->size > dev->max_packet_len) {
>> +VLOG_ERR("2big size %d max_packet_len %d",
>> +  (int)ofpbuf->size , dev->max_packet_len);
>
>
> Should probably use VLOG_RATE_LIMIT_INIT
>
ok.

>
>> +return E2BIG;
>> +}
>> +
>> +rte_prefetch0(>private_p);
>> +if (!may_steal ||
>> +!ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
>> +dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
>> +} else {
>> +struct rte_mbuf *pkt;
>> +uint32_t nb_tx;
>> +int qid;
>> +
>> +pkt = ofpbuf->private_p;
>> +ofpbuf->private_p = NULL;
>> +rte_pktmbuf_data_len(pkt) = ofpbuf->size;
>> +rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
>> +
>> +/* TODO: TX batching. */
>> +qid = rte_lcore_id() % NR_QUEUE;
>> +nb_tx = rte_eth_tx_burst(dev->port_id, qid, , 1);
>> +if (nb_tx != 1) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **), 1);
>> +VLOG_ERR("TX error, zero packets sent");
>
>
> Same here
>
ok

>
>> +   }
>> +}
>> +return 0;
>> +}
>
>
>> +static int
>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +int old_mtu, err;
>> +struct dpdk_mp *old_mp;
>> +struct dpdk_mp *mp;
>> +
>> +ovs_mutex_lock(_mutex);
>> +ovs_mutex_lock(>mutex);
>> +if (dev->mtu == mtu) {
>> +err = 0;
>> +goto out;
>> +}
>> +
>> +mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> +   

[dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Pravin Shelar
On Wed, Jan 29, 2014 at 12:56 AM, Prashant Upadhyaya
 wrote:
> Hi Pravin,
>
> I think your stuff is on the brink of a creating a mini revolution :)
>
> Some questions inline below --
> +ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> What do you mean by portid here, do you mean the physical interface id like 
> eth0 which I have bound to igb_uio now ?
> If I have multiple interfaces I have assigned igb_uio to, eg. eth0, eth1, 
> eth2 etc., what is the id mapping for those ?
>
Port id is id assigned by DPDK. DPDK interface takes this port id as
argument. Currently you need to look at pci id to figure out the
device mapping to port id. I know it is clean and I am exploring
better interface so that we can specify device names to ovs-vsctl.

> If I have VM's running, then typically how to interface those VM's to this 
> OVS in user space now, do I use the same classical 'tap' interface and add it 
> to the OVS above.

tap device will work, but you would not get performance primarily due
to scheduling delay and memcopy.
DPDK has multiple drivers to create interface with KVM guests OS.
those should perform better. I have no tried it yet.

> What is the actual path the data takes from the VM now all the way to the 
> switch, wouldn't it be hypervisor to kernel to OVS switch in user space to 
> other VM/Network ?

Depends on method you use. e.g. Memnic bypass hypervisor and host
kernel entirely.

> I think if we can solve the VM to OVS port connectivity remaining in 
> userspace only, then we have a great thing at our hand. Kindly comment on 
> this.
>
right, performance looks pretty good. Still DPDK needs constant
polling which consumes more power. RFC ovs-dkdp patch has simple
polling which need tweaking for better power usage.

Thanks,
Pravin.



> Regards
> -Prashant
>
>


[dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Pravin Shelar
On Tue, Jan 28, 2014 at 4:15 PM, Vincent JARDIN
 wrote:
> Hi Pravin,
>
>
>>> Few feature questions:
>>>
>>>- what's about the vNIC supports (toward the guests)?
>>>- what's about IPsec support (VxLAN over IPsec for instance)?
>>> I do not understand how your patch will solve those 2 cases.
>>>
>> At this point I wanted to get basic DPDK support in OVS, once that is
>> done we can add support for vNIC.
>
>
> For vNIC, did you notice:
>   http://dpdk.org/browse/memnic/
>
> ?

AFAIU it is introducing backend driver for vNIC which shld work with this patch.

>
>> IPsec and vxlan or any L3 tunneling requires IP stack in userspace and
>> needs more design work.
>
>
> OK, understood.
>
>
>> This is based a patch from Gerald Rogers.
>>>
>>>
>>> Please which patch? I cannot find it into the archives.
>>>
>> It was directly sent to Jesse at Nicira. If you want I can send it out for
>> ref.
>
>
> Yes, please.

Attached is the patch based upon HEAD-309d9da.

Thanks,
Pravin.

>
> Thank you,
>   Vincent
>


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Vincent JARDIN
Hi Thomas,

On 29/01/2014 09:15, Thomas Graf wrote:

 > The obvious and usual best practise would be for DPDK to guarantee
 > ABI stability between minor releases.
 >
 > Since dpdk-dev is copied as well, any comments?

DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no 
standard. Currently, there is no such plan to have a stable ABI since we 
need to keep freedom to chase CPU cycles over having a stable ABI. For 
instance, some applications on top of the DPDK process the packets in 
less than 150 CPU cycles (have a look at testpmd:
   http://dpdk.org/browse/dpdk/tree/app/test-pmd )

I agree that some areas could be improved since they are not into the 
critical datapath of packets, but still other areas remain very CPU 
constraints. For instance:
http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
is bad:
struct eth_dev_ops
is churned, no comment, and a #ifdef that changes the structure 
according to compilation!

Should an application use the librte libraries of the DPDK:
   - you can use RTE_VERSION and RTE_VERSION_NUM :
http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
   - you can write your own wrapper (with CPU overhead) in order to have 
a stable ABI, that wrapper should be tight to the versions of the librte 
=> the overhead is part of your application instead of the DPDK,
   - *otherwise recompile your software, it is opensource, what's the 
issue?*

We are opened to any suggestion to have stable ABI, but it should never 
remove the options to have fast/efficient/compilation/CPU execution 
processing.

Best regards,
   Vincent


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/28/2014 02:48 AM, pshelar at nicira.com wrote:
> From: Pravin B Shelar 
>
> Following patch adds DPDK netdev-class to userspace datapath.
> Approach taken in this patch differs from Intel? DPDK vSwitch
> where DPDK datapath switching is done in saparate process.  This
> patch adds support for DPDK type port and uses OVS userspace
> datapath for switching.  Therefore all DPDK processing and flow
> miss handling is done in single process.  This also avoids code
> duplication by reusing OVS userspace datapath switching and
> therefore it supports all flow matching and actions that
> user-space datapath supports.  Refer to INSTALL.DPDK doc for
> further info.
>
> With this patch I got similar performance for netperf TCP_STREAM
> tests compared to kernel datapath.
>
> This is based a patch from Gerald Rogers.
>
> Signed-off-by: Pravin B Shelar 
> CC: "Gerald Rogers" 

Pravin,

Some initial comments below. I will provide more after deeper
digging.

Do you have any ideas on how to implement the TX batching yet?

> +
> +static int
> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
> +{
> +struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
> +int pending;
> +int i;
> +
> +pending = rx->ofpbuf_cnt;
> +if (pending) {

This conditional seems unneeded.

> +for (i = 0; i < pending; i++) {
> + build_ofpbuf(rx, >ofpbuf[i], NULL);
> +}
> +rx->ofpbuf_cnt = 0;
> +return 0;
> +}
> +
> +return 0;
> +}
> +
> +/* Tx function. Transmit packets indefinitely */
> +static int
> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +struct rte_mbuf *pkt;
> +uint32_t nb_tx = 0;
> +
> +pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +if (!pkt) {
> +return 0;

Silent drop? ;-) Shouldn't these drops be accounted for somehow?

> +}
> +
> +/* We have to do a copy for now */
> +memcpy(pkt->pkt.data, buf, size);
> +
> +rte_pktmbuf_data_len(pkt) = size;
> +rte_pktmbuf_pkt_len(pkt) = size;
> +
> +rte_spinlock_lock(>tx_lock);

What is the purpose of tx_lock here? Multiple threads writing to
the same Q? The lock is not acquired for the zerocopy path below.

> +nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, , 1);
> +rte_spinlock_unlock(>tx_lock);
> +
> +if (nb_tx != 1) {
> +/* free buffers if we couldn't transmit packets */
> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **), 1);
> +}
> +return nb_tx;
> +}
> +
> +static int
> +netdev_dpdk_send(struct netdev *netdev,
> + struct ofpbuf *ofpbuf, bool may_steal)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +if (ofpbuf->size > dev->max_packet_len) {
> +VLOG_ERR("2big size %d max_packet_len %d",
> +  (int)ofpbuf->size , dev->max_packet_len);

Should probably use VLOG_RATE_LIMIT_INIT

> +return E2BIG;
> +}
> +
> +rte_prefetch0(>private_p);
> +if (!may_steal ||
> +!ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
> +dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
> +} else {
> +struct rte_mbuf *pkt;
> +uint32_t nb_tx;
> +int qid;
> +
> +pkt = ofpbuf->private_p;
> +ofpbuf->private_p = NULL;
> +rte_pktmbuf_data_len(pkt) = ofpbuf->size;
> +rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
> +
> +/* TODO: TX batching. */
> +qid = rte_lcore_id() % NR_QUEUE;
> +nb_tx = rte_eth_tx_burst(dev->port_id, qid, , 1);
> +if (nb_tx != 1) {
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **), 1);
> +VLOG_ERR("TX error, zero packets sent");

Same here

> +   }
> +}
> +return 0;
> +}

> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +int old_mtu, err;
> +struct dpdk_mp *old_mp;
> +struct dpdk_mp *mp;
> +
> +ovs_mutex_lock(_mutex);
> +ovs_mutex_lock(>mutex);
> +if (dev->mtu == mtu) {
> +err = 0;
> +goto out;
> +}
> +
> +mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> +if (!mp) {
> +err = ENOMEM;
> +goto out;
> +}
> +
> +rte_eth_dev_stop(dev->port_id);
> +
> +old_mtu = dev->mtu;
> +old_mp = dev->dpdk_mp;
> +dev->dpdk_mp = mp;
> +dev->mtu = mtu;
> +dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +
> +err = dpdk_eth_dev_init(dev);
> +if (err) {
> +
> +dpdk_mp_put(mp);
> +dev->mtu = old_mtu;
> +dev->dpdk_mp = old_mp;
> +dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +dpdk_eth_dev_init(dev);

Would be nice if we don't need these constructs and DPDK would
provide an all or nothing init method.

> +goto out;
> +}
> +
> +dpdk_mp_put(old_mp);
> 

[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Stephen Hemminger
On Wed, 29 Jan 2014 18:14:01 +0100
Thomas Graf  wrote:

> On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> > Thomas,
> >
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring to
> recompile all DPDK based applications to distribute a bugfix or to
> add support for a new PMD is not ideal.
> 
> So ideally OVS would have the possibility to link against the shared
> library long term.
> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straight forward at first so it seems logical
> to focus on that first.
> 
> A stable API and ABI for librte seems required as well long term as
> DPDK does offer shared libraries but I realize that this is a stretch
> goal in the initial phase.

I would hate to see the API/ABI nailed down. We have lots of bug fixes
and new drivers that are ready to contribute, but most of them have some
change to existing ABI.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/28/2014 07:17 PM, Pravin Shelar wrote:
> Right, version mismatch will not work. API provided by DPDK are not
> stable, So OVS has to be built for different releases for now.
>
> I do not see how we can fix it from OVS side. DPDK needs to
> standardize API, Actually OVS also needs more API, like DPDK
> initialization, mempool destroy, etc.

Agreed. It's not fixable from the OVS side. I also don't want to
object to including this. I'm just raising awareness of the issue
as this will become essential for dstribution.

The obvious and usual best practise would be for DPDK to guarantee
ABI stability between minor releases.

Since dpdk-dev is copied as well, any comments?


[dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Vincent JARDIN
Hi Pravin,

 >> Few feature questions:
>>- what's about the vNIC supports (toward the guests)?
>>- what's about IPsec support (VxLAN over IPsec for instance)?
>> I do not understand how your patch will solve those 2 cases.
>>
> At this point I wanted to get basic DPDK support in OVS, once that is
> done we can add support for vNIC.

For vNIC, did you notice:
   http://dpdk.org/browse/memnic/
?

> IPsec and vxlan or any L3 tunneling requires IP stack in userspace and
> needs more design work.

OK, understood.

> This is based a patch from Gerald Rogers.
>>
>> Please which patch? I cannot find it into the archives.
>>
> It was directly sent to Jesse at Nicira. If you want I can send it out for 
> ref.

Yes, please.

Thank you,
   Vincent