[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Yuanhan Liu
On Thu, Nov 24, 2016 at 09:30:49AM +0000, Kevin Traynor wrote:
> On 11/24/2016 06:31 AM, Yuanhan Liu wrote:
> > On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
>  You keep assuming that you have the VM started first and
>  figure out things afterwards, but this does not work.
> 
>  Think about a cluster of machines. You want to start a VM in
>  a way that will ensure compatibility with all hosts
>  in a cluster.
> >>>
> >>> I see. I was considering more the case where the dst
> >>> host (including the qemu and dpdk combo) is given, and
> >>> we then determine whether the migration will be
> >>> successful or not.
> >>>
> >>> And you are asking that we need to know which host could
> >>> be a good candidate before starting the migration. In such
> >>> a case, we indeed need some input from both qemu and the
> >>> vhost-user backend.
> >>>
> >>> For DPDK, I think it could be simple: just as you said, it
> >>> could be either a tiny script, or even a macro defined in
> >>> a source code file (one we extend every time we add a
> >>> new feature) to let libvirt read it. Or something
> >>> else.
> >>
> >> There's the issue of APIs that tweak features as Maxime
> >> suggested.
> > 
> > Yes, it's a good point.
> > 
> >> Maybe the only thing to do is to deprecate it,
> > 
> > Looks like so.
> > 
> >> but I feel some way for the application to pass info into
> >> the guest might be beneficial.
> > 
> > The two APIs are just for tweaking the feature bits DPDK supports before
> > any device gets connected. It's another way to disable some features
> > (the other obvious way is through QEMU command lines).
> > 
> > IMO, it's a bit handy only in a case like this: we have a bunch of VMs.
> > Instead of disabling something through qemu one by one, we could
> > disable it once in DPDK.
> > 
> > But I doubt its usefulness. It's only used in DPDK's vhost example
> > after all; it is used neither in the vhost PMD nor in OVS.
> 
> rte_vhost_feature_disable() is currently used in OVS, lib/netdev-dpdk.c

Hmmm. I must have checked very old code ...
> 
> netdev_dpdk_vhost_class_init(void)
> {
>     static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> 
>     /* This function can be called for different classes.  The
>      * initialization needs to be done only once. */
>     if (ovsthread_once_start(&once)) {
>         rte_vhost_driver_callback_register(&virtio_net_device_ops);
>         rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
>                                   | 1ULL << VIRTIO_NET_F_HOST_TSO6
>                                   | 1ULL << VIRTIO_NET_F_CSUM);

I saw the commit that introduced this change, but it gives no reason why
it was added.

commit 362ca39639ae871806be5ae97d55e1cbb14afd92
Author: mweglicx 
Date:   Thu Apr 14 17:40:06 2016 +0100

Update relevant artifacts to add support for DPDK 16.04.

Following changes are applied:
 - INSTALL.DPDK.md: CONFIG_RTE_BUILD_COMBINE_LIBS step has been
   removed because it is no longer present in DPDK configuration
   (combined library is created by default),
 - INSTALL.DPDK.md: VHost Cuse configuration is updated,
 - netdev-dpdk.c: Link speed definition is changed in DPDK and
   netdev_dpdk_get_features is updated accordingly,
 - netdev-dpdk.c: TSO and checksum offload has been disabled for
   vhostuser device.
 - .travis/linux-build.sh: DPDK version is updated and legacy
   flags have been removed in configuration.

Signed-off-by: Michal Weglicki 
Signed-off-by: Panu Matilainen 
Acked-by: Daniele Di Proietto 

--yliu
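
As a rough illustration of the "macro defined in a source code file" idea
discussed above: a single exported aggregate of the feature bits the vhost
backend supports would be enough for libvirt or a tiny script to read. The
sketch below is illustrative only (the name and the exact bit list are
assumptions, not the actual DPDK API):

    /* Hypothetical: one place listing every virtio feature this vhost
     * backend can negotiate, extended whenever a new feature is added. */
    #define RTE_VHOST_SUPPORTED_FEATURES \
            ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
             (1ULL << VIRTIO_NET_F_CTRL_VQ)   | \
             (1ULL << VIRTIO_NET_F_CTRL_RX)   | \
             (1ULL << VHOST_F_LOG_ALL))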


[dpdk-dev] [PATCH 00/56] Solarflare libefx-based PMD

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 06:29 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> The patch series adds a Solarflare libefx-based network PMD.
>>
>> This version of the driver supports Solarflare SFN7xxx and SFN8xxx
>> families of 10/40 Gbps adapters.
>>
>> libefx is a platform-independent library used to implement drivers for
>> Solarflare network adapters. It provides a unified, adapter-family
>> independent interface (where possible). The FreeBSD [1] and illumos [2]
>> drivers are built on top of this library.
>>
>> The patch series could be logically structured into 5 sub-series:
>>   1. (1) add the driver skeleton including documentation
>>   2. (2-30) import libefx and include it in the build with the last patch
>>   3. (31-43) implement minimal device level operations in steps
>>   4. (44-51) implement Rx subsystem
>>   5. (52-56) implement Tx subsystem
>>
>> A functional driver with multi-queue support, capable of sending and
>> receiving traffic, appears with the last patch in the series.
>>
>> The following design decisions were made during development:
>>
>>   1. Since libefx uses positive errno return codes, positive errno
>>  return codes are used inside the driver, and conversion to negative
>>  is done on return from eth_dev_ops callbacks. We think this is the
>>  least error-prone way (see the first sketch at the end of this message).
>>
>>   2. Another Solarflare PMD with an in-kernel part (for control operations)
>>  is under consideration and could be added in the future. Code for the
>>  data path should be shared by these two drivers. The libefx-based PMD
>>  is put into an 'efx' subdirectory to leave space for the other PMD
>>  and the shared code.
>>
>>   3. A dedicated event queue (a way to deliver events from HW to the host
>>  CPU) is used for house-keeping (e.g. link status notifications), plus
>>  one for each Tx and each Rx queue. No locks are required on the
>>  datapath in this case.
>>
>>   4. An alarm is used to periodically poll the house-keeping event queue.
>>  That event queue delivers link status change notifications, Rx/Tx
>>  queue flush events and SRAM events; it is not used on the datapath.
>>  Its polling is protected by a spin-lock since concurrent access from
>>  different contexts is possible (e.g. device stop while the polling
>>  alarm is running). See the second sketch at the end of this message.
>>
>> [1] https://svnweb.freebsd.org/base/head/sys/dev/sfxge/common/
>> [2] 
>> https://github.com/illumos/illumos-gate/tree/master/usr/src/uts/common/io/sfxge/common/
>>
>> ---
> I would like to note that this is a very well organized patchset. Thank
> you for your effort.

Thanks a lot, it is very pleasant to read.

Please see my questions in the thread for patches 01/56 (about compiler
versions to test on) and 30/56 (about libefx configuration documentation).

Also I'd like to ask a question about further patch submission.
We have about 40 patches which add various features (RSS, stats,
flow control and many others). What is the preferred way to submit them?
A. Separately, since they are not so tightly related (though in fact they
cannot be applied in random order, since some touch the same lines of code).
B. As a single series, to process everything in one go.

Thanks,
Andrew.
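
Two minimal sketches of design decisions 1 and 4 above. The function and
field names used here (sfc_configure, sfc_ev_qpoll, sa->mgmt_evq_lock,
SFC_MGMT_EVQ_POLL_US) are illustrative assumptions, not taken from the
submitted driver:

    /* Decision 1: positive errno inside the driver, negated only at the
     * eth_dev_ops boundary. */
    static int
    sfc_dev_configure(struct rte_eth_dev *dev)
    {
            int rc;                 /* positive errno convention */

            rc = sfc_configure(dev->data->dev_private);

            return -rc;             /* ethdev callbacks return negative errno */
    }

    /* Decision 4: a periodic alarm polls the house-keeping event queue;
     * a spin-lock guards it against concurrent contexts (e.g. device stop
     * while the alarm callback is running). */
    static void
    sfc_mgmt_evq_poll(void *arg)
    {
            struct sfc_adapter *sa = arg;

            rte_spinlock_lock(&sa->mgmt_evq_lock);
            sfc_ev_qpoll(sa->mgmt_evq); /* link status, flush, SRAM events */
            rte_spinlock_unlock(&sa->mgmt_evq_lock);

            /* rte_eal_alarm_set() is one-shot, so re-arm it each time */
            rte_eal_alarm_set(SFC_MGMT_EVQ_POLL_US, sfc_mgmt_evq_poll, sa);
    }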


[dpdk-dev] [PATCH 01/56] net/sfc: libefx-based PMD stub sufficient to build and init

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 06:26 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> The PMD is put into the sfc/efx subdirectory to have a place for
>> the second PMD and a library shared by both.
>>
>> Enable the PMD by default on supported configurations.
>>
>> Reviewed-by: Andy Moreton 
>> Signed-off-by: Andrew Rybchenko 
>> ---
>>   MAINTAINERS |   6 ++
>>   config/common_base  |   6 ++
>>   config/defconfig_arm-armv7a-linuxapp-gcc|   1 +
>>   config/defconfig_arm64-armv8a-linuxapp-gcc  |   1 +
>>   config/defconfig_i686-native-linuxapp-gcc   |   5 +
>>   config/defconfig_i686-native-linuxapp-icc   |   5 +
>>   config/defconfig_ppc_64-power8-linuxapp-gcc |   1 +
>>   config/defconfig_tile-tilegx-linuxapp-gcc   |   1 +
>>   config/defconfig_x86_64-native-linuxapp-icc |   5 +
>>   config/defconfig_x86_x32-native-linuxapp-gcc|   5 +
>>   doc/guides/nics/features/sfc_efx.ini|  10 ++
>>   doc/guides/nics/index.rst   |   1 +
>>   doc/guides/nics/sfc_efx.rst | 109 +
> Can you also update the release notes please, to announce the new driver?

Thanks, will do in v2.

> <...>
>
>> diff --git a/drivers/net/sfc/efx/Makefile b/drivers/net/sfc/efx/Makefile
>> new file mode 100644
>> index 000..71f07ca
>> --- /dev/null
>> +++ b/drivers/net/sfc/efx/Makefile
>> @@ -0,0 +1,81 @@
> <...>
>> +
>> +include $(RTE_SDK)/mk/rte.vars.mk
>> +
>> +#
>> +# library name
>> +#
>> +LIB = librte_pmd_sfc_efx.a
>> +
>> +CFLAGS += -O3
>> +
>> +# Enable basic warnings but disable some which are accepted
>> +CFLAGS += -Wall
> It is possible to use $(WERROR_FLAGS), which is set automatically based on
> the selected compiler. See mk/toolchain/* .

Thanks, will do in v2.

> And you can add extra options here, but please keep in mind that there are
> three compilers supported right now: gcc, clang and icc. You may need
> to add compiler and version checks.

I've tried to disable the driver build on ICC since we've never tested it.
I've failed to find a list of compiler versions which must/should be checked.
I've tested the versions which come with RHEL 7.2, Debian Jessie and Sid.
(In v1 I lost my fixes for clang, which produces warnings because of an
unsupported -W option.)

>> +CFLAGS += -Wno-strict-aliasing
>> +
>> +# Enable extra warnings but disable some which are accepted
>> +CFLAGS += -Wextra
>> +CFLAGS += -Wno-empty-body
>> +CFLAGS += -Wno-sign-compare
>> +CFLAGS += -Wno-type-limits
>> +CFLAGS += -Wno-unused-parameter
> Is there a way to not disable these warnings but fix them in the source code?
> Or move them to CFLAGS_BASE_DRIVER, if the reason is the base driver?

Will do in v2.

>> +
>> +# More warnings not enabled by above aggregators
>> +CFLAGS += -Waggregate-return
>> +CFLAGS += -Wbad-function-cast
>> +CFLAGS += -Wcast-qual
>> +CFLAGS += -Wdisabled-optimization
>> +CFLAGS += -Wmissing-declarations
>> +CFLAGS += -Wmissing-prototypes
>> +CFLAGS += -Wnested-externs
>> +CFLAGS += -Wold-style-definition
>> +CFLAGS += -Wpointer-arith
>> +CFLAGS += -Wstrict-prototypes
>> +CFLAGS += -Wundef
>> +CFLAGS += -Wwrite-strings
> If you believe some can be useful for everybody, please feel free to add
> to mk/toolchain/* .

I'll definitely remove the duplicates which are already included in
$(WERROR_FLAGS).
I'd prefer to keep the rest just here for now. I think that adding them
world-wide requires testing on really many compiler versions etc.

>> +
>> +EXPORT_MAP := rte_pmd_sfc_efx_version.map
>> +
>> +LIBABIVER := 1
>> +
>> +#
>> +# all source are stored in SRCS-y
>> +#
>> +SRCS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc_ethdev.c
>> +SRCS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc_kvargs.c
>> +
>> +
>> +# this lib depends upon:
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += lib/librte_eal
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += lib/librte_kvargs
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += lib/librte_ether
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += lib/librte_mempool
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += lib/librte_mbuf
>> +
>> +include $(RTE_SDK)/mk/rte.lib.mk
>> diff --git a/drivers/net/sfc/efx/rte_pmd_sfc_efx_version.map 
>> b/drivers/net/sfc/efx/rte_pmd_sfc_efx_version.map
>> new file mode 100644
>> index 000..1901bcb
>> --- /dev/null
>> +++ b/drivers/net/sfc/efx/rte_pmd_sfc_efx_version.map
>> @@ -0,0 +1,4 @@
>> +DPDK_16.07 {
> Now this becomes 17.02

Thanks, will fix in v2.

>> +
>> +local: *;
>> +};
>> diff --git a/drivers/net/sfc/efx/sfc.h b/drivers/net/sfc/efx/sfc.h
>> new file mode 100644
>> index 000..16fd2bb
>> --- /dev/null
>> +++ b/drivers/net/sfc/efx/sfc.h
>> @@ -0,0 +1,53 @@
> <..>
>> +
>> +#ifndef _SFC_H
>> +#define _SFC_H
> s/^I/ /
> This also exists in other locations and files.

Will fix in v2.
I thought that DPDK prefers a TAB after #define as FreeBSD does, but
counting shows that a space is really preferred.
I think that such things should be cau

[dpdk-dev] [PATCH 30/56] net/sfc: include libefx in build

2016-11-24 Thread Andrew Rybchenko
See one question below.

On 11/23/2016 06:26 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> From: Artem Andreev 
>>
>> Implement efsys.h for the PMD.
>>
>> Reviewed-by: Andy Moreton 
>> Signed-off-by: Artem Andreev 
>> Signed-off-by: Andrew Rybchenko 
>> ---
>>   drivers/net/sfc/efx/Makefile |  54 +++
>>   drivers/net/sfc/efx/efsys.h  | 767 +++
>>   2 files changed, 821 insertions(+)
>>   create mode 100644 drivers/net/sfc/efx/efsys.h
>>
>> diff --git a/drivers/net/sfc/efx/Makefile b/drivers/net/sfc/efx/Makefile
>> index 71f07ca..de95ea8 100644
>> --- a/drivers/net/sfc/efx/Makefile
>> +++ b/drivers/net/sfc/efx/Makefile
>> @@ -33,6 +33,8 @@ include $(RTE_SDK)/mk/rte.vars.mk
>>   #
>>   LIB = librte_pmd_sfc_efx.a
>>   
>> +CFLAGS += -I$(SRCDIR)/base/
>> +CFLAGS += -I$(SRCDIR)
>>   CFLAGS += -O3
>>   
>>   # Enable basic warnings but disable some which are accepted
>> @@ -60,6 +62,17 @@ CFLAGS += -Wstrict-prototypes
>>   CFLAGS += -Wundef
>>   CFLAGS += -Wwrite-strings
>>   
>> +# Extra CFLAGS for base driver files
>> +CFLAGS_BASE_DRIVER += -Wno-unused-variable
>> +CFLAGS_BASE_DRIVER += -Wno-unused-but-set-variable
> clang complains about this one:
> warning: unknown warning option '-Wno-unused-but-set-variable'; did you
> mean '-Wno-unused-const-variable'? [-Wunknown-warning-option]

Will fix in v2

>> +
>> +#
>> +# List of base driver object files for which
>> +# special CFLAGS above should be applied
>> +#
>> +BASE_DRIVER_OBJS=$(patsubst %.c,%.o,$(notdir $(wildcard $(SRCDIR)/base/*.c)))
>> +$(foreach obj, $(BASE_DRIVER_OBJS), $(eval CFLAGS+=$(CFLAGS_BASE_DRIVER)))
> This causes multiple "-Wno-unused-variable -Wno-unused-but-set-variable"
> params in the final command; I guess the intention is:
>
> $(foreach obj, $(BASE_DRIVER_OBJS), $(eval CFLAGS_$(obj)+=$(CFLAGS_BASE_DRIVER)))
>
> Fixing this may generate a few compiler warnings.

Many thanks, will fix in v2.

> <...>
>
>> diff --git a/drivers/net/sfc/efx/efsys.h b/drivers/net/sfc/efx/efsys.h
>> new file mode 100644
>> index 000..2eef996
>> --- /dev/null
>> +++ b/drivers/net/sfc/efx/efsys.h
>> @@ -0,0 +1,767 @@
> <...>
>
> I guess below is the hardcoded compile-time configuration for libefx; do you
> think it makes sense to document this default configuration?

Yes, it is the libefx configuration, and more options will be enabled when
the corresponding functionality is supported in the PMD.
I'm sorry, but I don't understand what you would like to see in the
documentation.
Could you clarify, please?

>> +
>> +#define EFSYS_OPT_NAMES 0
>> +
>> +#define EFSYS_OPT_SIENA 0
>> +#define EFSYS_OPT_HUNTINGTON 1
>> +#define EFSYS_OPT_MEDFORD 1
>> +#ifdef RTE_LIBRTE_SFC_EFX_DEBUG
>> +#define EFSYS_OPT_CHECK_REG 1
>> +#else
>> +#define EFSYS_OPT_CHECK_REG 0
>> +#endif
>> +
>> +#define EFSYS_OPT_MCDI 1
>> +#define EFSYS_OPT_MCDI_LOGGING 0
>> +#define EFSYS_OPT_MCDI_PROXY_AUTH 0
>> +
>> +#define EFSYS_OPT_MAC_STATS 0
>> +
>> +#define EFSYS_OPT_LOOPBACK 0
>> +
>> +#define EFSYS_OPT_MON_MCDI 0
>> +#define EFSYS_OPT_MON_STATS 0
>> +
>> +#define EFSYS_OPT_PHY_STATS 0
>> +#define EFSYS_OPT_BIST 0
>> +#define EFSYS_OPT_PHY_LED_CONTROL 0
>> +#define EFSYS_OPT_PHY_FLAGS 0
>> +
>> +#define EFSYS_OPT_VPD 0
>> +#define EFSYS_OPT_NVRAM 0
>> +#define EFSYS_OPT_BOOTCFG 0
>> +
>> +#define EFSYS_OPT_DIAG 0
>> +#define EFSYS_OPT_RX_SCALE 0
>> +#define EFSYS_OPT_QSTATS 0
>> +#define EFSYS_OPT_FILTER 1
>> +#define EFSYS_OPT_RX_SCATTER 0
>> +
>> +#define EFSYS_OPT_EV_PREFETCH 0
>> +
>> +#define EFSYS_OPT_DECODE_INTR_FATAL 0
>> +
>> +#define EFSYS_OPT_LICENSING 0
>> +
>> +#define EFSYS_OPT_ALLOW_UNCONFIGURED_NIC 0
>> +
>> +#define EFSYS_OPT_RX_PACKED_STREAM 0
> <...>




[dpdk-dev] [PATCH 1/1] net/i40e: enable auto link update for XXV710

2016-11-24 Thread Qi Zhang
This patch removes the limitation that the XXV710 device does
not support auto link update.

Signed-off-by: Qi Zhang 
---
 drivers/net/i40e/i40e_ethdev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 67778ba..b7a916d 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -1628,6 +1628,8 @@ i40e_phy_conf_link(struct i40e_hw *hw,

/* use get_phy_abilities_resp value for the rest */
phy_conf.phy_type = phy_ab.phy_type;
+   phy_conf.phy_type_ext = phy_ab.phy_type_ext;
+   phy_conf.fec_config = phy_ab.mod_type_ext;
phy_conf.eee_capability = phy_ab.eee_capability;
phy_conf.eeer = phy_ab.eeer_val;
phy_conf.low_power_ctrl = phy_ab.d3_lpan;
@@ -1653,8 +1655,7 @@ i40e_apply_link_speed(struct rte_eth_dev *dev)
struct rte_eth_conf *conf = &dev->data->dev_conf;

speed = i40e_parse_link_speeds(conf->link_speeds);
-   if (!I40E_PHY_TYPE_SUPPORT_25G(hw->phy.phy_types))
-   abilities |= I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
+   abilities |= I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
if (!(conf->link_speeds & ETH_LINK_SPEED_FIXED))
abilities |= I40E_AQ_PHY_AN_ENABLED;
abilities |= I40E_AQ_PHY_LINK_ENABLED;
@@ -1990,8 +1991,7 @@ i40e_dev_set_link_down(struct rte_eth_dev *dev)
uint8_t abilities = 0;
struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);

-   if (!I40E_PHY_TYPE_SUPPORT_25G(hw->phy.phy_types))
-   abilities = I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
+   abilities = I40E_AQ_PHY_ENABLE_ATOMIC_LINK;
return i40e_phy_conf_link(hw, abilities, speed);
 }

-- 
2.7.4



[dpdk-dev] [PATCH] test: adding AES cipher-only tests on QAT PMD

2016-11-24 Thread Fiona Trahe
Extended functional AES-CBC and AES-CTR cipher-only
tests to run on QAT PMD.
Added AES_CBC cipher-only performance tests on QAT PMD.
No driver changes, but as now tested, QAT documentation
is updated to remove constraint.

Signed-off-by: Fiona Trahe 
---
 app/test/test_cryptodev.c  | 18 ++
 app/test/test_cryptodev_aes_test_vectors.h | 36 +++
 app/test/test_cryptodev_perf.c | 96 +++---
 doc/guides/cryptodevs/qat.rst  |  1 -
 4 files changed, 102 insertions(+), 49 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 872f8b4..00dced5 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -1530,6 +1530,22 @@ test_AES_chain_qat_all(void)
 }

 static int
+test_AES_cipheronly_qat_all(void)
+{
+   struct crypto_testsuite_params *ts_params = &testsuite_params;
+   int status;
+
+   status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+   ts_params->op_mpool, ts_params->valid_devs[0],
+   RTE_CRYPTODEV_QAT_SYM_PMD,
+   BLKCIPHER_AES_CIPHERONLY_TYPE);
+
+   TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+   return TEST_SUCCESS;
+}
+
+static int
 test_authonly_openssl_all(void)
 {
struct crypto_testsuite_params *ts_params = &testsuite_params;
@@ -6032,6 +6048,8 @@ static struct unit_test_suite cryptodev_qat_testsuite  = {
test_multi_session),

TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_qat_all),
+   TEST_CASE_ST(ut_setup, ut_teardown,
+   test_AES_cipheronly_qat_all),
TEST_CASE_ST(ut_setup, ut_teardown, test_3DES_chain_qat_all),
TEST_CASE_ST(ut_setup, ut_teardown,
test_3DES_cipheronly_qat_all),
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index 1c68f93..efbe7da 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -1024,73 +1024,85 @@ static const struct blockcipher_test_case aes_cipheronly_test_cases[] = {
.test_descr = "AES-128-CBC Encryption",
.test_data = &aes_test_data_4,
.op_mask = BLOCKCIPHER_TEST_OP_ENCRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-128-CBC Decryption",
.test_data = &aes_test_data_4,
.op_mask = BLOCKCIPHER_TEST_OP_DECRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-192-CBC Encryption",
.test_data = &aes_test_data_10,
.op_mask = BLOCKCIPHER_TEST_OP_ENCRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-192-CBC Decryption",
.test_data = &aes_test_data_10,
.op_mask = BLOCKCIPHER_TEST_OP_DECRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-256-CBC Encryption",
.test_data = &aes_test_data_11,
.op_mask = BLOCKCIPHER_TEST_OP_ENCRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-256-CBC Decryption",
.test_data = &aes_test_data_11,
.op_mask = BLOCKCIPHER_TEST_OP_DECRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-128-CTR Encryption",
.test_data = &aes_test_data_1,
.op_mask = BLOCKCIPHER_TEST_OP_ENCRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+   .pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
+   BLOCKCIPHER_TEST_TARGET_PMD_QAT
},
{
.test_descr = "AES-128-CTR Decryption",
.test_data = &aes_test_data_1,
.op_mask = BLOCKCIPHER_TEST_OP_DECRYPT,
-   .pmd_mask = BLOCKCIPHER_TEST_TA

[dpdk-dev] [PATCH 31/56] net/sfc: implement dummy callback to get device information

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 06:26 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> Just a stub to be filled in when corresponding functionality is
>> implemented.
> What about merging this stub with the real implementation?
> Or perhaps replacing it with code that adds a dummy .dev_configure?

Thanks, I like the second idea. Will do in v2.

>> Reviewed-by: Andy Moreton 
>> Signed-off-by: Andrew Rybchenko 
>> ---
>>   drivers/net/sfc/efx/sfc_ethdev.c | 11 +--
>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/sfc/efx/sfc_ethdev.c b/drivers/net/sfc/efx/sfc_ethdev.c
>> index ff20a13..0deff07 100644
>> --- a/drivers/net/sfc/efx/sfc_ethdev.c
>> +++ b/drivers/net/sfc/efx/sfc_ethdev.c
>> @@ -37,9 +37,16 @@
>>   #include "sfc_kvargs.h"
>>   
>>   
>> +static void
>> +sfc_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>> +{
>> +struct sfc_adapter *sa = dev->data->dev_private;
>> +
>> +sfc_log_init(sa, "entry");
>> +}
>> +
>>   static const struct eth_dev_ops sfc_eth_dev_ops = {
>> -/* Just dummy init to avoid build-time warning */
>> -.dev_configure  = NULL,
>> +.dev_infos_get  = sfc_dev_infos_get,
>>   };
>>   
>>   static int
>>



[dpdk-dev] [PATCH v2] ethdev: check number of queues less than RTE_ETHDEV_QUEUE_STAT_CNTRS

2016-11-24 Thread Olivier Matz
Hi,

On Mon, 2016-11-21 at 09:59 +0000, Alejandro Lucero wrote:
> From: Bert van Leeuwen 
> 
> Arrays inside rte_eth_stats have size=RTE_ETHDEV_QUEUE_STAT_CNTRS.
> Some devices report more queues than that, and this code blindly uses
> the number of queues reported by the device to fill those arrays up.
> This patch fixes the problem by using the MIN of the reported number of
> queues and RTE_ETHDEV_QUEUE_STAT_CNTRS.
> 
> Signed-off-by: Alejandro Lucero 
> 

Reviewed-by: Olivier Matz 


As a next step, I'm wondering if it would be possible to remove
this limitation. We could replace the tables in struct rte_eth_stats
with a pointer to an array allocated dynamically at PMD setup.

It would break the API, so it should be announced first. I'm thinking
of something like:

struct rte_eth_generic_stats {
	uint64_t ipackets;
	uint64_t opackets;
	uint64_t ibytes;
	uint64_t obytes;
	uint64_t imissed;
	uint64_t ierrors;
	uint64_t oerrors;
	uint64_t rx_nombuf;
};

struct rte_eth_stats {
	struct rte_eth_generic_stats port_stats;
	struct rte_eth_generic_stats *queue_stats;
};

The queue_stats array would always be indexed by queue_id.
The xstats would continue to report the generic stats per-port and
per-queue.
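
A sketch of that direction, assuming the proposed struct above (the
function shape, zone name and error handling are illustrative):

    /* At PMD setup, once the real queue count is known, size the
     * per-queue stats array dynamically instead of using the fixed
     * RTE_ETHDEV_QUEUE_STAT_CNTRS-sized tables. */
    static int
    ethdev_stats_init(struct rte_eth_stats *stats, uint16_t nb_queues)
    {
            stats->queue_stats = rte_zmalloc("eth_queue_stats",
                            nb_queues * sizeof(*stats->queue_stats), 0);
            if (stats->queue_stats == NULL)
                    return -ENOMEM;
            return 0;
    }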

About the mapping API, either we keep it as-is, or it could
become a driver-specific API.


Thomas, what do you think?

Regards,
Olivier



[dpdk-dev] [PATCH 32/56] net/sfc: implement driver operation to init device on attach

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 06:26 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> The setup and configuration of the PMD is not performance sensitive,
>> but it is not thread safe either. It is possible that the multiple
>> reads/writes during PMD setup and configuration could be corrupted
>> in a multi-threaded environment.
> Right, this is not thread-safe, but that is true for all PMDs; it is
> not expected to have initialization in a multi-threaded environment. That
> said, synchronization also won't hurt; as you said this is not the fast
> path, it just may not be necessary.

In fact further patches really need the lock and it should be introduced and
maintained from the very beginning. I'll add comments in v2 to explain it.
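
For reference, a minimal shape such lock helpers could take, assuming an
rte_spinlock_t embedded in struct sfc_adapter (the field name is an
assumption; sfc_adapter_lock()/sfc_adapter_lock_fini() appear in the
patch 33/56 hunks quoted below):

    static inline void
    sfc_adapter_lock_init(struct sfc_adapter *sa)
    {
            rte_spinlock_init(&sa->lock);
    }

    static inline void
    sfc_adapter_lock(struct sfc_adapter *sa)
    {
            rte_spinlock_lock(&sa->lock);
    }

    static inline void
    sfc_adapter_unlock(struct sfc_adapter *sa)
    {
            rte_spinlock_unlock(&sa->lock);
    }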

>> Since this is not performance
>> sensitive, the developer can choose to add their own layer to provide
>> thread-safe setup and configuration. It is expected that, in most
>> applications, the initial configuration of the network ports would be
>> done by a single thread at startup.
>>
>> Reviewed-by: Andy Moreton 
>> Signed-off-by: Andrew Rybchenko 
> <...>

Thanks,
Andrew.


[dpdk-dev] [PATCH 33/56] net/sfc: add device configure and close stubs

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 06:26 PM, Ferruh Yigit wrote:
> On 11/21/2016 3:00 PM, Andrew Rybchenko wrote:
>> Reviewed-by: Andy Moreton 
>> Signed-off-by: Andrew Rybchenko 
>> ---
> <...>
>
>> diff --git a/drivers/net/sfc/efx/sfc.h b/drivers/net/sfc/efx/sfc.h
>> index 01d652d..d040f98 100644
>> --- a/drivers/net/sfc/efx/sfc.h
>> +++ b/drivers/net/sfc/efx/sfc.h
> <...>
>> @@ -131,7 +147,7 @@ sfc_adapter_unlock(struct sfc_adapter *sa)
>>   }
>>   
>>   static inline void
>> -sfc_adapter_lock_destroy(struct sfc_adapter *sa)
>> +sfc_adapter_lock_fini(struct sfc_adapter *sa)
> Why not do the proper naming in 32/56 in the first place?

Thanks, will be fixed in v2.

> <...>




[dpdk-dev] [PATCH] doc: introduce PVP reference benchmark

2016-11-24 Thread Mcnamara, John
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Wednesday, November 23, 2016 9:00 PM
> To: yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com;
> Mcnamara, John; Yang, Zhiyong; dev at dpdk.org
> Cc: fbaudin at redhat.com; Maxime Coquelin 
> Subject: [PATCH] doc: introduce PVP reference benchmark
> 
> Having reference benchmarks is important in order to obtain reproducible
> performance figures.
> 
> This patch describes required steps to configure a PVP setup using testpmd
> in both host and guest.
> 
> Not relying on an external vSwitch eases integration in a CI loop, by not
> being impacted by DPDK API changes.

Hi Maxime,

Thanks for the detailed doc and this initiative. Some minor documentation
comments below.



> +
> +Setup overview
> +..

This level header should be -, even if it looks like dots in the
contribution guide:

http://dpdk.org/doc/guides/contributing/documentation.html#section-headers


> +
> +.. figure:: img/pvp_2nics.svg
> +
> +  PVP setup using 2 NICs
> +

The figure needs a target so it can be used with :numref:, like this:

.. _figure_pvp_2nics:

.. figure:: img/pvp_2nics.*

   PVP setup using 2 NICs


> +DPDK build
> +~~
> +

Put a one-line description at the start of each section, even if it is just: Build DPDK.



> +Testpmd launch
> +~~
> +
> +#. Assign NICs to DPDK:
> +
> +   .. code-block:: console
> +
> +modprobe vfio-pci
> +$RTE_SDK/install/sbin/dpdk-devbind -b vfio-pci 0000:11:00.0 0000:11:00.1
> +
> +*Note: Sandy Bridge family seems to have some limitations wrt its
> +IOMMU, giving poor performance results. To achieve good performance on
> +these machines, consider using UIO instead.*

This would be better as an RST note:

#. Assign NICs to DPDK:

   .. code-block:: console

  modprobe vfio-pci
  $RTE_SDK/install/sbin/dpdk-devbind -b vfio-pci 0000:11:00.0 0000:11:00.1

   .. Note::

  The Sandy Bridge family seems to have some IOMMU limitations giving poor
  performance results. To achieve good performance on these machines
  consider using UIO instead.



> +First, SELinux policy needs to be set to permissiven, as testpmd is run
> +as root (reboot required):

s/permissiven/permissive/


There are also a couple of trailing whitespace errors at build time.


John






[dpdk-dev] [PATCH 7/7] net/mlx5: remove inefficient prefetching

2016-11-24 Thread Nelio Laranjeiro
Prefetching completion queue entries is inefficient because too few CPU
cycles are spent before their use, which results in cache misses anyway.

Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_rxtx.c | 21 -
 1 file changed, 21 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 4b8c197..9f74fd4 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -348,23 +348,6 @@ mlx5_tx_dbrec(struct txq *txq)
 }

 /**
- * Prefetch a CQE.
- *
- * @param txq
- *   Pointer to TX queue structure.
- * @param cqe_ci
- *   CQE consumer index.
- */
-static inline void
-tx_prefetch_cqe(struct txq *txq, uint16_t ci)
-{
-   volatile struct mlx5_cqe *cqe;
-
-   cqe = &(*txq->cqes)[ci & ((1 << txq->cqe_n) - 1)];
-   rte_prefetch0(cqe);
-}
-
-/**
  * DPDK callback for TX.
  *
  * @param dpdk_txq
@@ -395,8 +378,6 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
if (unlikely(!pkts_n))
return 0;
/* Prefetch first packet cacheline. */
-   tx_prefetch_cqe(txq, txq->cq_ci);
-   tx_prefetch_cqe(txq, txq->cq_ci + 1);
rte_prefetch0(*pkts);
/* Start processing. */
txq_complete(txq);
@@ -733,7 +714,6 @@ mlx5_tx_burst_mpw(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
if (unlikely(!pkts_n))
return 0;
/* Prefetch first packet cacheline. */
-   tx_prefetch_cqe(txq, txq->cq_ci);
rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci));
rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci + 1));
/* Start processing. */
@@ -938,7 +918,6 @@ mlx5_tx_burst_mpw_inline(void *dpdk_txq, struct rte_mbuf **pkts,
if (unlikely(!pkts_n))
return 0;
/* Prefetch first packet cacheline. */
-   tx_prefetch_cqe(txq, txq->cq_ci);
rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci));
rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci + 1));
/* Start processing. */
-- 
2.1.4



[dpdk-dev] [PATCH 6/7] net/mlx5: optimize copy of Ethernet header

2016-11-24 Thread Nelio Laranjeiro
Use fewer instructions to copy the first two bytes of Ethernet headers to
work queue elements.

Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_rxtx.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 0d0b807..4b8c197 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -409,7 +409,7 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
unsigned int ds = 0;
uintptr_t addr;
uint16_t pkt_inline_sz = MLX5_WQE_DWORD_SIZE;
-   uint8_t ehdr[2];
+   uint16_t ehdr;
uint8_t cs_flags = 0;
 #ifdef MLX5_PMD_SOFT_COUNTERS
uint32_t total_length = 0;
@@ -436,8 +436,8 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
rte_prefetch0(*pkts);
addr = rte_pktmbuf_mtod(buf, uintptr_t);
length = DATA_LEN(buf);
-   ehdr[0] = ((uint8_t *)addr)[0];
-   ehdr[1] = ((uint8_t *)addr)[1];
+   ehdr = (((uint8_t *)addr)[1] << 8) |
+  ((uint8_t *)addr)[0];
 #ifdef MLX5_PMD_SOFT_COUNTERS
total_length = length;
 #endif
@@ -600,8 +600,7 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
0,
cs_flags,
0,
-   (ehdr[1] << 24) | (ehdr[0] << 16) |
-   htons(pkt_inline_sz),
+   (ehdr << 16) | htons(pkt_inline_sz),
};
txq->wqe_ci += (ds + 3) / 4;
 #ifdef MLX5_PMD_SOFT_COUNTERS
-- 
2.1.4



[dpdk-dev] [PATCH 5/7] net/mlx5: move static prototype

2016-11-24 Thread Nelio Laranjeiro
Gather function prototypes at the beginning of the file.

Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_rxtx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 52733da..0d0b807 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -76,6 +76,9 @@ check_cqe(volatile struct mlx5_cqe *cqe,
  unsigned int cqes_n, const uint16_t ci)
  __attribute__((always_inline));

+static inline void
+txq_complete(struct txq *txq) __attribute__((always_inline));
+
 static inline uint32_t
 txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
__attribute__((always_inline));
@@ -192,9 +195,6 @@ tx_mlx5_wqe(struct txq *txq, uint16_t ci)
return (uintptr_t *)((uintptr_t)txq->wqes + ci * MLX5_WQE_SIZE);
 }

-static inline void
-txq_complete(struct txq *txq) __attribute__((always_inline));
-
 /**
  * Manage TX completions.
  *
-- 
2.1.4



[dpdk-dev] [PATCH 4/7] net/mlx5: fix missing inline attributes

2016-11-24 Thread Nelio Laranjeiro
These functions must be forced inline for better performance.

Fixes: 99c12dcca65d ("net/mlx5: handle Rx CQE compression")
Fixes: 1d88ba171942 ("net/mlx5: refactor Tx data path")
Fixes: 67fa62bc672d ("mlx5: support checksum offload")

CC: stable at dpdk.org
Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_rxtx.c | 34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index e161cd9..52733da 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -71,6 +71,31 @@
 #include "mlx5_defs.h"
 #include "mlx5_prm.h"

+static inline int
+check_cqe(volatile struct mlx5_cqe *cqe,
+ unsigned int cqes_n, const uint16_t ci)
+ __attribute__((always_inline));
+
+static inline uint32_t
+txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
+   __attribute__((always_inline));
+
+static inline void
+mlx5_tx_dbrec(struct txq *txq) __attribute__((always_inline));
+
+static inline uint32_t
+rxq_cq_to_pkt_type(volatile struct mlx5_cqe *cqe)
+   __attribute__((always_inline));
+
+static inline int
+mlx5_rx_poll_len(struct rxq *rxq, volatile struct mlx5_cqe *cqe,
+uint16_t cqe_cnt, uint32_t *rss_hash)
+__attribute__((always_inline));
+
+static inline uint32_t
+rxq_cq_to_ol_flags(struct rxq *rxq, volatile struct mlx5_cqe *cqe)
+  __attribute__((always_inline));
+
 #ifndef NDEBUG

 /**
@@ -100,11 +125,6 @@ check_cqe_seen(volatile struct mlx5_cqe *cqe)

 #endif /* NDEBUG */

-static inline int
-check_cqe(volatile struct mlx5_cqe *cqe,
- unsigned int cqes_n, const uint16_t ci)
- __attribute__((always_inline));
-
 /**
  * Check whether CQE is valid.
  *
@@ -266,10 +286,6 @@ txq_mb2mp(struct rte_mbuf *buf)
return buf->pool;
 }

-static inline uint32_t
-txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
-   __attribute__((always_inline));
-
 /**
  * Get Memory Region (MR) <-> Memory Pool (MP) association from txq->mp2mr[].
  * Add MP to txq->mp2mr[] if it's not registered yet. If mp2mr[] is full,
-- 
2.1.4



[dpdk-dev] [PATCH 3/7] net/mlx5: use vector types to speed up processing

2016-11-24 Thread Nelio Laranjeiro
Let the compiler automatically use the vector capabilities of the target
machine to optimize instructions.

Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
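
As a sketch of the technique (names taken from the diffs below; treat it
as illustrative, not the exact driver code): a 16-byte WQE control segment
can be filled with a single 128-bit vector store instead of four scalar
stores, letting the compiler pick e.g. SSE on x86 or VSX on POWER8:

    /* one vector store fills the whole control segment */
    static inline void
    wqe_write_ctrl(volatile struct mlx5_wqe_v *wqe,
                   struct txq *txq, unsigned int ds)
    {
            wqe->ctrl = (rte_v128u32_t){
                    htonl((txq->wqe_ci << 8) | MLX5_OPCODE_SEND),
                    htonl(txq->qp_num_8s | ds),
                    0,
                    0,
            };
    }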
---
 drivers/net/mlx5/mlx5_prm.h  |  7 +
 drivers/net/mlx5/mlx5_rxtx.c | 74 +++-
 2 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
index 3dd4cbe..9cd9fdf 100644
--- a/drivers/net/mlx5/mlx5_prm.h
+++ b/drivers/net/mlx5/mlx5_prm.h
@@ -44,6 +44,7 @@
 #pragma GCC diagnostic error "-Wpedantic"
 #endif

+#include 
 #include "mlx5_autoconf.h"

 /* Get CQE owner bit. */
@@ -134,6 +135,12 @@ struct mlx5_wqe {
struct mlx5_wqe_eth_seg_small eseg;
 };

+/* Vectorize WQE header. */
+struct mlx5_wqe_v {
+   rte_v128u32_t ctrl;
+   rte_v128u32_t eseg;
+};
+
 /* WQE. */
 struct mlx5_wqe64 {
struct mlx5_wqe hdr;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index ada8e74..e161cd9 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -371,7 +371,7 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
unsigned int j = 0;
unsigned int max;
unsigned int comp;
-   volatile struct mlx5_wqe *wqe = NULL;
+   volatile struct mlx5_wqe_v *wqe = NULL;
unsigned int segs_n = 0;
struct rte_mbuf *buf = NULL;
uint8_t *raw;
@@ -388,12 +388,13 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
if (max > elts_n)
max -= elts_n;
do {
-   volatile struct mlx5_wqe_data_seg *dseg = NULL;
+   volatile rte_v128u32_t *dseg = NULL;
uint32_t length;
unsigned int ds = 0;
uintptr_t addr;
uint16_t pkt_inline_sz = MLX5_WQE_DWORD_SIZE;
uint8_t ehdr[2];
+   uint8_t cs_flags = 0;
 #ifdef MLX5_PMD_SOFT_COUNTERS
uint32_t total_length = 0;
 #endif
@@ -412,7 +413,7 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
--segs_n;
if (!segs_n)
--pkts_n;
-   wqe = (volatile struct mlx5_wqe *)
+   wqe = (volatile struct mlx5_wqe_v *)
tx_mlx5_wqe(txq, txq->wqe_ci);
rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci + 1));
if (pkts_n > 1)
@@ -438,11 +439,7 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
/* Should we enable HW CKSUM offload */
if (buf->ol_flags &
(PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM)) {
-   wqe->eseg.cs_flags =
-   MLX5_ETH_WQE_L3_CSUM |
-   MLX5_ETH_WQE_L4_CSUM;
-   } else {
-   wqe->eseg.cs_flags = 0;
+   cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
}
raw = ((uint8_t *)(uintptr_t)wqe) + 2 * MLX5_WQE_DWORD_SIZE;
/*
@@ -498,12 +495,11 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 */
ds = 2 + MLX5_WQE_DS(pkt_inline_sz - 2);
if (length > 0) {
-   dseg = (volatile struct mlx5_wqe_data_seg *)
+   dseg = (volatile rte_v128u32_t *)
((uintptr_t)wqe +
 (ds * MLX5_WQE_DWORD_SIZE));
if ((uintptr_t)dseg >= end)
-   dseg = (volatile struct
-   mlx5_wqe_data_seg *)
+   dseg = (volatile rte_v128u32_t *)
   txq->wqes;
goto use_dseg;
} else if (!segs_n) {
@@ -516,16 +512,17 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 * No inline has been done in the packet, only the
 * Ethernet Header as been stored.
 */
-   wqe->eseg.inline_hdr_sz = htons(MLX5_WQE_DWORD_SIZE);
-   dseg = (volatile struct mlx5_wqe_data_seg *)
+   dseg = (volatile rte_v128u32_t *)
((uintptr_t)wqe + (3 * MLX5_WQE_DWORD_SIZE));
ds = 3;
 use_dseg:
/* Add the remaining packet as a simple ds. */
-   *dseg = (volatile struct mlx5_wqe_data_seg) {
-   .addr = htonll(addr),
-   .byte_count = htonl(length),
-   .lkey = txq_mp2mr(txq, txq_mb2mp(buf)),
+   

[dpdk-dev] [PATCH 2/7] net/mlx5: use work queue buffer as a raw buffer

2016-11-24 Thread Nelio Laranjeiro
Define a single work queue element type that encompasses them all.  It
includes control, Ethernet segment and raw data all grouped in a single
place.

Signed-off-by: Nelio Laranjeiro 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_prm.h  |  13 --
 drivers/net/mlx5/mlx5_rxtx.c | 103 ++-
 drivers/net/mlx5/mlx5_rxtx.h |   2 +-
 drivers/net/mlx5/mlx5_txq.c  |   8 ++--
 4 files changed, 68 insertions(+), 58 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
index 7f31a2f..3dd4cbe 100644
--- a/drivers/net/mlx5/mlx5_prm.h
+++ b/drivers/net/mlx5/mlx5_prm.h
@@ -114,12 +114,19 @@ struct mlx5_wqe_eth_seg_small {
uint32_t rsvd2;
uint16_t inline_hdr_sz;
uint8_t inline_hdr[2];
-};
+} __rte_aligned(MLX5_WQE_DWORD_SIZE);

 struct mlx5_wqe_inl_small {
uint32_t byte_cnt;
uint8_t raw;
-};
+} __rte_aligned(MLX5_WQE_DWORD_SIZE);
+
+struct mlx5_wqe_ctrl {
+   uint32_t ctrl0;
+   uint32_t ctrl1;
+   uint32_t ctrl2;
+   uint32_t ctrl3;
+} __rte_aligned(MLX5_WQE_DWORD_SIZE);

 /* Small common part of the WQE. */
 struct mlx5_wqe {
@@ -131,7 +138,7 @@ struct mlx5_wqe {
 struct mlx5_wqe64 {
struct mlx5_wqe hdr;
uint8_t raw[32];
-} __rte_aligned(64);
+} __rte_aligned(MLX5_WQE_SIZE);

 /* MPW session status. */
 enum mlx5_mpw_state {
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 5dacd93..ada8e74 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -154,6 +154,24 @@ check_cqe(volatile struct mlx5_cqe *cqe,
return 0;
 }

+/**
+ * Return the address of the WQE.
+ *
+ * @param txq
+ *   Pointer to TX queue structure.
+ * @param  wqe_ci
+ *   WQE consumer index.
+ *
+ * @return
+ *   WQE address.
+ */
+static inline uintptr_t *
+tx_mlx5_wqe(struct txq *txq, uint16_t ci)
+{
+   ci &= ((1 << txq->wqe_n) - 1);
+   return (uintptr_t *)((uintptr_t)txq->wqes + ci * MLX5_WQE_SIZE);
+}
+
 static inline void
 txq_complete(struct txq *txq) __attribute__((always_inline));

@@ -175,7 +193,7 @@ txq_complete(struct txq *txq)
uint16_t elts_tail;
uint16_t cq_ci = txq->cq_ci;
volatile struct mlx5_cqe *cqe = NULL;
-   volatile struct mlx5_wqe *wqe;
+   volatile struct mlx5_wqe_ctrl *ctrl;

do {
volatile struct mlx5_cqe *tmp;
@@ -201,9 +219,9 @@ txq_complete(struct txq *txq)
} while (1);
if (unlikely(cqe == NULL))
return;
-   wqe = &(*txq->wqes)[ntohs(cqe->wqe_counter) &
-   ((1 << txq->wqe_n) - 1)].hdr;
-   elts_tail = wqe->ctrl[3];
+   ctrl = (volatile struct mlx5_wqe_ctrl *)
+   tx_mlx5_wqe(txq, ntohs(cqe->wqe_counter));
+   elts_tail = ctrl->ctrl3;
assert(elts_tail < (1 << txq->wqe_n));
/* Free buffers. */
while (elts_free != elts_tail) {
@@ -331,23 +349,6 @@ tx_prefetch_cqe(struct txq *txq, uint16_t ci)
 }

 /**
- * Prefetch a WQE.
- *
- * @param txq
- *   Pointer to TX queue structure.
- * @param  wqe_ci
- *   WQE consumer index.
- */
-static inline void
-tx_prefetch_wqe(struct txq *txq, uint16_t ci)
-{
-   volatile struct mlx5_wqe64 *wqe;
-
-   wqe = &(*txq->wqes)[ci & ((1 << txq->wqe_n) - 1)];
-   rte_prefetch0(wqe);
-}
-
-/**
  * DPDK callback for TX.
  *
  * @param dpdk_txq
@@ -411,9 +412,9 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
--segs_n;
if (!segs_n)
--pkts_n;
-   wqe = &(*txq->wqes)[txq->wqe_ci &
-   ((1 << txq->wqe_n) - 1)].hdr;
-   tx_prefetch_wqe(txq, txq->wqe_ci + 1);
+   wqe = (volatile struct mlx5_wqe *)
+   tx_mlx5_wqe(txq, txq->wqe_ci);
+   rte_prefetch0(tx_mlx5_wqe(txq, txq->wqe_ci + 1));
if (pkts_n > 1)
rte_prefetch0(*pkts);
addr = rte_pktmbuf_mtod(buf, uintptr_t);
@@ -464,8 +465,9 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
}
/* Inline if enough room. */
if (txq->max_inline != 0) {
-   uintptr_t end =
-   (uintptr_t)&(*txq->wqes)[1 << txq->wqe_n];
+   uintptr_t end = (uintptr_t)
+   (((uintptr_t)txq->wqes) +
+(1 << txq->wqe_n) * MLX5_WQE_SIZE);
uint16_t max_inline =
txq->max_inline * RTE_CACHE_LINE_SIZE;
uint16_t room;
@@ -496,12 +498,13 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 */
ds = 2 + MLX5_WQE_DS(pkt_inline_sz - 2);
if (length > 0) {
-   dseg = (struct mlx5_wqe_data_seg *

[dpdk-dev] [PATCH 1/7] net/mlx5: prepare Tx vectorization

2016-11-24 Thread Nelio Laranjeiro
Prepare the code to write the Work Queue Element with vectorized
instructions.

Signed-off-by: Nelio Laranjeiro 
Signed-off-by: Elad Persiko 
Acked-by: Adrien Mazarguil 
---
 drivers/net/mlx5/mlx5_rxtx.c | 44 
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index ffd09ac..5dacd93 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -391,6 +391,8 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
uint32_t length;
unsigned int ds = 0;
uintptr_t addr;
+   uint16_t pkt_inline_sz = MLX5_WQE_DWORD_SIZE;
+   uint8_t ehdr[2];
 #ifdef MLX5_PMD_SOFT_COUNTERS
uint32_t total_length = 0;
 #endif
@@ -416,6 +418,8 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
rte_prefetch0(*pkts);
addr = rte_pktmbuf_mtod(buf, uintptr_t);
length = DATA_LEN(buf);
+   ehdr[0] = ((uint8_t *)addr)[0];
+   ehdr[1] = ((uint8_t *)addr)[1];
 #ifdef MLX5_PMD_SOFT_COUNTERS
total_length = length;
 #endif
@@ -439,24 +443,20 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
} else {
wqe->eseg.cs_flags = 0;
}
-   raw  = (uint8_t *)(uintptr_t)&wqe->eseg.inline_hdr[0];
-   /* Start the know and common part of the WQE structure. */
-   wqe->ctrl[0] = htonl((txq->wqe_ci << 8) | MLX5_OPCODE_SEND);
-   wqe->ctrl[2] = 0;
-   wqe->ctrl[3] = 0;
-   wqe->eseg.rsvd0 = 0;
-   wqe->eseg.rsvd1 = 0;
-   wqe->eseg.mss = 0;
-   wqe->eseg.rsvd2 = 0;
-   /* Start by copying the Ethernet Header. */
-   memcpy((uint8_t *)raw, ((uint8_t *)addr), 16);
+   raw = ((uint8_t *)(uintptr_t)wqe) + 2 * MLX5_WQE_DWORD_SIZE;
+   /*
+* Start by copying the Ethernet header minus the first two
+* bytes which will be appended at the end of the Ethernet
+* segment.
+*/
+   memcpy((uint8_t *)raw, ((uint8_t *)addr) + 2, 16);
length -= MLX5_WQE_DWORD_SIZE;
addr += MLX5_WQE_DWORD_SIZE;
/* Replace the Ethernet type by the VLAN if necessary. */
if (buf->ol_flags & PKT_TX_VLAN_PKT) {
uint32_t vlan = htonl(0x8100 | buf->vlan_tci);

-   memcpy((uint8_t *)(raw + MLX5_WQE_DWORD_SIZE -
+   memcpy((uint8_t *)(raw + MLX5_WQE_DWORD_SIZE - 2 -
   sizeof(vlan)),
   &vlan, sizeof(vlan));
addr -= sizeof(vlan);
@@ -468,10 +468,13 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
(uintptr_t)&(*txq->wqes)[1 << txq->wqe_n];
uint16_t max_inline =
txq->max_inline * RTE_CACHE_LINE_SIZE;
-   uint16_t pkt_inline_sz = MLX5_WQE_DWORD_SIZE;
uint16_t room;

-   raw += MLX5_WQE_DWORD_SIZE;
+   /*
+* raw starts two bytes before the boundary to
+* continue the above copy of packet data.
+*/
+   raw += MLX5_WQE_DWORD_SIZE - 2;
room = end - (uintptr_t)raw;
if (room > max_inline) {
uintptr_t addr_end = (addr + max_inline) &
@@ -487,8 +490,6 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
/* Sanity check. */
assert(addr <= addr_end);
}
-   /* Store the inlined packet size in the WQE. */
-   wqe->eseg.inline_hdr_sz = htons(pkt_inline_sz);
/*
 * 2 DWORDs consumed by the WQE header + 1 DSEG +
 * the size of the inline part of the packet.
@@ -570,7 +571,18 @@ mlx5_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
--pkts_n;
 next_pkt:
++i;
+   /* Initialize known and common part of the WQE structure. */
+   wqe->ctrl[0] = htonl((txq->wqe_ci << 8) | MLX5_OPCODE_SEND);
wqe->ctrl[1] = htonl(txq->qp_num_8s | ds);
+   wqe->ctrl[2] = 0;
+   wqe->ctrl[3] = 0;
+   wqe->eseg.rsvd0 = 0;
+   wqe->eseg.rsvd1 = 0;
+   wqe->eseg.mss = 0;
+   wqe->eseg.rsvd2 = 0;
+   wqe->eseg.inli

[dpdk-dev] [PATCH 0/7] net/mlx5: improve single core performance

2016-11-24 Thread Nelio Laranjeiro
This series applies on top of
"[PATCH] eal: define generic vector types" [1][2]

Using built-in vector types forces compilers to consider SIMD instructions in
specific places, in order to improve performance on both IBM POWER8 and Intel
architectures.

For example, testpmd single-thread I/O forwarding performance (in packets per
second) is improved by 6% on Intel platforms.

 [1] http://dpdk.org/ml/archives/dev/2016-November/050261.html
 [2] http://dpdk.org/dev/patchwork/patch/17024/

Nelio Laranjeiro (7):
  net/mlx5: prepare Tx vectorization
  net/mlx5: use work queue buffer as a raw buffer
  net/mlx5: use vector types to speed up processing
  net/mlx5: fix missing inline attributes
  net/mlx5: move static prototype
  net/mlx5: optimize copy of Ethernet header
  net/mlx5: remove inefficient prefetching

 drivers/net/mlx5/mlx5_prm.h  |  20 +++-
 drivers/net/mlx5/mlx5_rxtx.c | 243 +++
 drivers/net/mlx5/mlx5_rxtx.h |   2 +-
 drivers/net/mlx5/mlx5_txq.c  |   8 +-
 4 files changed, 150 insertions(+), 123 deletions(-)

-- 
2.1.4



[dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model

2016-11-24 Thread Thomas Monjalon
2016-11-24 07:29, Jerin Jacob:
> On Wed, Nov 23, 2016 at 07:39:09PM +0100, Thomas Monjalon wrote:
> > 2016-11-18 11:14, Jerin Jacob:
> > > +Eventdev API - EXPERIMENTAL
> > > +M: Jerin Jacob 
> > > +F: lib/librte_eventdev/
> > 
> > OK to mark it experimental.
> > What is the plan to remove the experimental word?
> 
> IMO, the EXPERIMENTAL status can be changed when
> - At least two event drivers are available (Intel and Cavium are working on
>   SW and HW event drivers)
> - Functional test applications are fine with at least two drivers
> - A portable example application exists to showcase the features of the library
> - eventdev is integrated with another dpdk subsystem such as ethdev
> 
> Thoughts? I am not sure of the criteria used in the cryptodev case.

Sounds good.
We will be more confident when drivers and tests will be implemented.

I think the roadmap for the SW driver targets the release 17.05.
Do you still plan 17.02 for this API and the Cavium driver?

> > > +#define EVENTDEV_NAME_SKELETON_PMD event_skeleton
> > > +/**< Skeleton event device PMD name */
> > 
> > I do not understand this #define.
> 
> Applications can explicitly request a specific driver through the driver
> name. This will go as an argument to rte_event_dev_get_dev_id(const char *name).
> The reason for keeping this #define in rte_eventdev.h is that the
> application needs to include only rte_eventdev.h, not rte_eventdev_pmd.h.

So each driver must register its name in the API?
Is it really needed?

> > > +struct rte_event_dev_config {
> > > + uint32_t dequeue_wait_ns;
> > > + /**< rte_event_dequeue() wait for *dequeue_wait_ns* ns on this device.
> > 
> > Please explain exactly when the wait occurs and why.
> 
> Here is the explanation from rte_event_dequeue() API definition,
> -
> @param wait
> 0 - no-wait, returns immediately if there is no event.
> >0 - wait for the event. If the device is configured with
> RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT then this function will wait until
> an event is available or until the *wait* time has elapsed.
> If the device is not configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT
> then this function will wait until an event is available or for *dequeue_wait_ns*
>   ^^
> ns, which was previously supplied to rte_event_dev_configure().
> -
> This provides the application with control over how long the
> implementation should wait if an event is not available.
> 
> Let me know what exact changes are required if the details in the
> rte_event_dequeue() API definition are not enough.

Maybe 'timeout' would be a better name.
It waits only if there is nothing in the queue.
It would be worth highlighting in this comment that this parameter
makes the dequeue function a blocking call.
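
A sketch of the two modes, assuming the RFC's shape of rte_event_dequeue()
as quoted above (the exact signature may differ in the final API, and
process(), dev_id, port_id and wait_ns are placeholders):

    struct rte_event ev;

    /* Non-blocking: returns immediately if no event is available. */
    if (rte_event_dequeue(dev_id, port_id, &ev, 0))
            process(&ev);

    /* Blocking: waits until an event arrives, bounded by the configured
     * dequeue_wait_ns (or by the per-call value when the device was
     * configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT). */
    if (rte_event_dequeue(dev_id, port_id, &ev, wait_ns))
            process(&ev);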

> > > +/** Event port configuration structure */
> > > +struct rte_event_port_conf {
> > > + int32_t new_event_threshold;
> > > + /**< A backpressure threshold for new event enqueues on this port.
> > > +  * Use for *closed system* event dev where event capacity is limited,
> > > +  * and cannot exceed the capacity of the event dev.
> > > +  * Configuring ports with different thresholds can make higher priority
> > > +  * traffic less likely to  be backpressured.
> > > +  * For example, a port used to inject NIC Rx packets into the event dev
> > > +  * can have a lower threshold so as not to overwhelm the device,
> > > +  * while ports used for worker pools can have a higher threshold.
> > > +  * This value cannot exceed the *nb_events_limit*
> > > +  * which previously supplied to rte_event_dev_configure()
> > > +  */
> > > + uint8_t dequeue_depth;
> > > + /**< Configure number of bulk dequeues for this event port.
> > > +  * This value cannot exceed the *nb_event_port_dequeue_depth*
> > > +  * which previously supplied to rte_event_dev_configure()
> > > +  */
> > > + uint8_t enqueue_depth;
> > > + /**< Configure number of bulk enqueues for this event port.
> > > +  * This value cannot exceed the *nb_event_port_enqueue_depth*
> > > +  * which previously supplied to rte_event_dev_configure()
> > > +  */
> > > +};
> > 
> > The depth configuration is not clear to me.
> 
> Basically it is the maximum number of events that can be enqueued/dequeued
> at a time from a given event port. A depth of one == non-burst mode.

OK so depth is the queue size. Please could you reword?
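
In other words, something along these lines (field names from the struct
quoted above; the values are illustrative):

    /* a worker port doing bursts of up to 16 events at a time */
    struct rte_event_port_conf conf = {
            .new_event_threshold = 4096, /* back-pressure new events here */
            .dequeue_depth = 16,         /* bulk dequeue size; 1 == non-burst */
            .enqueue_depth = 16,         /* bulk enqueue size */
    };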

> > > +/* Event types to classify the event source */
> > 
> > Why is this classification needed?
> 
> This is for application pipelining, and for cases where the application
> wants to know which subsystem generated the event.
> 
> example packet forwarding loop on the worker cores:
> while (1) {
>         ev = dequeue()
>         if (ev.event_type == RTE_EVENT_TYPE_ETHDEV) {
>                 /* event from the ethdev subsystem */
>                 - swap the mac address
>                 - push to atomic queue for ingress flow order maintenance
>                   by CORE
>         } else if (ev.event_type == RTE_EVENT_TYPE_CORE) {
>                 /* events from cores */
>         }
>         enqueue(ev);

[dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model

2016-11-24 Thread Bruce Richardson
On Fri, Nov 18, 2016 at 11:14:59AM +0530, Jerin Jacob wrote:
> In a polling model, lcores poll ethdev ports and associated
> rx queues directly to look for packets. In an event driven model,
> by contrast, lcores call a scheduler that selects packets for
> them based on programmer-specified criteria. The eventdev library
> adds support for the event driven programming model, which offers
> applications automatic multicore scaling, dynamic load balancing,
> pipelining, packet ingress order maintenance and
> synchronization services to simplify application packet processing.
> 
> By introducing the event driven programming model, DPDK can support
> both polling and event driven programming models for packet processing,
> and applications are free to choose whichever model
> (or combination of the two) best suits their needs.
> 
> Signed-off-by: Jerin Jacob 
> ---

Hi Jerin,

Thanks for the patchset. A few minor comments in general on the API that
we found from working with it (thus far - more may follow :-) ).

1. Priorities: priorities are used in a number of places in the API, but
   all are uint8_t types and have their own MAX/NORMAL/MIN values. I think
   it would be simpler for the user just to have one priority type in the
   library, and use that everywhere. I suggest using RTE_EVENT_PRIORITY_*
   and dropping the separate defines for SERVICE_PRIORITY, QUEUE_PRIORITY
   etc. Ideally, I'd see things like this converted to enums too, rather
   than defines, but I'm not sure it's possible in this case.

2. Functions for config and setup can have their structure parameter
   types as const as they don't/shouldn't change the values internally.
   So add "const" to parameters to:
 rte_event_dev_configure()
 rte_event_queue_setup()
 rte_event_port_setup()
 rte_event_port_link()

3. in the event schedule() function, the dev->schedule() function needs the
   dev instance pointer passed in as a parameter.

4. The event op values and the event type values would be better as
   enums rather than as a set of #defines.
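
For point 4, a sketch of the suggested conversion, reusing the two event
type names that appear in this thread (the full set of enumerators is an
assumption):

    /* event types as an enum instead of a set of #defines */
    enum rte_event_type {
            RTE_EVENT_TYPE_ETHDEV,  /* event originated in the ethdev subsystem */
            RTE_EVENT_TYPE_CORE,    /* event injected by an lcore */
    };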

Regards,
/Bruce


[dpdk-dev] [virtio] virtio-net PMD cannot be used on Ubuntu 16.10

2016-11-24 Thread Hobywan Kenoby
Hello,

While I (almost) never had a problem with testpmd, Ubuntu 16.10 makes it fail:

$ sudo ./testpmd -c 0x3 -n1 --no-huge -- --disable-hw-vlan --disable-rss -i --rxq=1 --txq=1 --rxd=256 --txd=256
EAL: Detected 4 lcore(s)
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:08.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 net_virtio
EAL: Error - exiting with code: 1
  Cause: Requested device 0000:00:08.0 cannot be used


This boils down to virtio_pci.c

if (!check_vq_phys_addr_ok(vq))
return -1;

which fails because the vq address, 0x77172700, is greater than 16TB.
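
For context, the check in question looks roughly like this (paraphrased
from the virtio_pci.c of that era; treat it as a sketch). The legacy
virtio-pci VIRTIO_PCI_QUEUE_PFN register is 32 bits wide and takes a 4 KiB
page frame number, so ring addresses must fit below 2^(32+12) bytes = 16 TB:

    static int
    check_vq_phys_addr_ok(struct virtqueue *vq)
    {
            /* The PFN register holds addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT
             * (12) in 32 bits, hence the 16 TB ceiling. */
            if ((vq->vq_ring_mem + vq->vq_ring_size - 1) >>
                            (VIRTIO_PCI_QUEUE_ADDR_SHIFT + 32)) {
                    PMD_INIT_LOG(ERR, "vring address shouldn't be above 16TB!");
                    return 0;
            }

            return 1;
    }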


Is there an issue, or am I configuring something badly?
I used the same method to test on 14.04, 15.10 and 16.04 without problems.
Huge pages are configured and are used by DPDK.

-HK







[dpdk-dev] [RFC PATCH] i40e: fix setting of default MAC address

2016-11-24 Thread Igor Ryzhov
While testing X710 cards in our lab I found that setting the default MAC address
doesn't work correctly for the i40e driver. I compared the DPDK driver
implementation with the Linux driver implementation and found that a lot of
code is missing in DPDK. I tried to make the DPDK implementation similar to the
Linux implementation and it worked for me - now everything is working. But I'm
not sure that my changes are correct, so please, maintainers, check the patch
very carefully.

Signed-off-by: Igor Ryzhov 
---
 drivers/net/i40e/i40e_ethdev.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 67778ba..b73f9c8 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -9694,6 +9694,7 @@ static int i40e_get_eeprom(struct rte_eth_dev *dev,
 static void i40e_set_default_mac_addr(struct rte_eth_dev *dev,
  struct ether_addr *mac_addr)
 {
+   struct i40e_vsi *vsi = I40E_DEV_PRIVATE_TO_MAIN_VSI(dev->data->dev_private);
struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);

if (!is_valid_assigned_ether_addr(mac_addr)) {
@@ -9701,8 +9702,33 @@ static void i40e_set_default_mac_addr(struct rte_eth_dev 
*dev,
return;
}

-   /* Flags: 0x3 updates port address */
-   i40e_aq_mac_address_write(hw, 0x3, mac_addr->addr_bytes, NULL);
+   i40e_aq_mac_address_write(hw, I40E_AQC_WRITE_TYPE_LAA_WOL, 
mac_addr->addr_bytes, NULL);
+
+   if (!memcmp(&dev->data->mac_addrs[0].addr_bytes, hw->mac.addr, 
ETH_ADDR_LEN)) {
+   struct i40e_aqc_remove_macvlan_element_data element;
+
+   memset(&element, 0, sizeof(element));
+   memcpy(element.mac_addr, &dev->data->mac_addrs[0].addr_bytes, 
ETH_ADDR_LEN);
+   element.flags = I40E_AQC_MACVLAN_DEL_PERFECT_MATCH;
+   i40e_aq_remove_macvlan(hw, vsi->seid, &element, 1, NULL);
+   } else {
+   i40e_vsi_delete_mac(vsi, &dev->data->mac_addrs[0]);
+   }
+
+   if (!memcmp(mac_addr->addr_bytes, hw->mac.addr, ETH_ADDR_LEN)) {
+   struct i40e_aqc_add_macvlan_element_data element;
+
+   memset(&element, 0, sizeof(element));
+   memcpy(element.mac_addr, hw->mac.addr, ETH_ADDR_LEN);
+   element.flags = CPU_TO_LE16(I40E_AQC_MACVLAN_ADD_PERFECT_MATCH);
+   i40e_aq_add_macvlan(hw, vsi->seid, &element, 1, NULL);
+   } else {
+   struct i40e_mac_filter_info filter;
+
+   memcpy(&filter.mac_addr, mac_addr, ETH_ADDR_LEN);
+   filter.filter_type = RTE_MAC_PERFECT_MATCH;
+   i40e_vsi_add_mac(vsi, &filter);
+   }
 }

 static int
-- 
2.6.4



[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Kavanagh, Mark B
>
>On 11/24/2016 12:47 PM, Maxime Coquelin wrote:
>>
>>
>> On 11/24/2016 01:33 PM, Yuanhan Liu wrote:
>>> On Thu, Nov 24, 2016 at 09:30:49AM +, Kevin Traynor wrote:
 > On 11/24/2016 06:31 AM, Yuanhan Liu wrote:
> > > On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
 >  You keep assuming that you have the VM started first and
 >  figure out things afterwards, but this does not work.
 > 
 >  Think about a cluster of machines. You want to start a VM in
 >  a way that will ensure compatibility with all hosts
 >  in a cluster.
>>> > >>>
>>> > >>> I see. I was more considering about the case when the dst
>>> > >>> host (including the qemu and dpdk combo) is given, and
>>> > >>> then determine whether it will be a successfull migration
>>> > >>> or not.
>>> > >>>
>>> > >>> And you are asking that we need to know which host could
>>> > >>> be a good candidate before starting the migration. In such
>>> > >>> case, we indeed need some inputs from both the qemu and
>>> > >>> vhost-user backend.
>>> > >>>
>>> > >>> For DPDK, I think it could be simple, just as you said, it
>>> > >>> could be either a tiny script, or even a macro defined in
>>> > >>> the source code file (we extend it every time we add a
>>> > >>> new feature) to let the libvirt to read it. Or something
>>> > >>> else.
>> > >>
>> > >> There's the issue of APIs that tweak features as Maxime
>> > >> suggested.
> > >
> > > Yes, it's a good point.
> > >
>> > >> Maybe the only thing to do is to deprecate it,
> > >
> > > Looks like so.
> > >
>> > >> but I feel some way for application to pass info into
>> > >> guest might be benefitial.
> > >
> > > The two APIs are just for tweaking feature bits DPDK supports
> before
> > > any device got connected. It's another way to disable some features
> > > (the another obvious way is to through QEMU command lines).
> > >
> > > IMO, it's bit handy only in a case like: we have bunch of VMs.
> Instead
> > > of disabling something though qemu one by one, we could disable it
> > > once in DPDK.
> > >
> > > But I doubt the useful of it. It's only used in DPDK's vhost
> example
> > > after all. Nor is it used in vhost pmd, neither is it used in OVS.
 >
 > rte_vhost_feature_disable() is currently used in OVS,
 lib/netdev-dpdk.c
>>> Hmmm. I must have checked very old code ...
 >
 > netdev_dpdk_vhost_class_init(void)
 > {
 > static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
 >
 > /* This function can be called for different classes.  The
 > initialization
 >  * needs to be done only once */
 > if (ovsthread_once_start(&once)) {
 > rte_vhost_driver_callback_register(&virtio_net_device_ops);
 > rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
 >   | 1ULL << VIRTIO_NET_F_HOST_TSO6
 >   | 1ULL << VIRTIO_NET_F_CSUM);
>>> I saw the commit introduced such change, but it tells no reason why
>>> it was added.
>>
>> I'm also interested to know the reason.
>
>I can't remember off hand, added Mark K or Michal W who should be able
>to shed some light on it.

DPDK v16.04 added support for vHost User TSO; as such, by default, TSO is 
advertised to guest devices as an available feature during feature negotiation 
with QEMU.
However, while the vHost user backend sets up the majority of the mbuf fields 
that are required for TSO, there is still a reliance on the associated DPDK 
application (i.e. in this case OvS-DPDK) to set the remaining flags and/or 
offsets. Since OvS-DPDK doesn't currently provide that functionality, it is 
necessary to explicitly disable TSO; otherwise, undefined behaviour will ensue.
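
To illustrate, a minimal sketch (assuming IPv4/TCP with no header
options and the 16.04-era mbuf API; prepare_tso_mbuf is a hypothetical
helper, not OvS code) of the per-packet metadata such an application
would have to provide:

#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>

static void
prepare_tso_mbuf(struct rte_mbuf *m, uint16_t mss)
{
	/* Header lengths let the backend locate the L4 payload. */
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->l4_len = sizeof(struct tcp_hdr);
	/* Maximum segment size used when the packet is segmented. */
	m->tso_segsz = mss;
	/* Request segmentation plus IP/TCP checksum offload. */
	m->ol_flags |= PKT_TX_TCP_SEG | PKT_TX_IPV4 | PKT_TX_IP_CKSUM |
		       PKT_TX_TCP_CKSUM;
}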

>
>> In any case, I think this is something that can/should be managed by
>> the management tool, which  should disable it in cmd parameters.
>>
>> Kevin, do you agree?
>
>I think best to find out the reason first. Because if no reason to
>disable in the code, then no need to debate!
>
>>
>> Cheers,
>> Maxime



[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Kevin Traynor
On 11/24/2016 12:47 PM, Maxime Coquelin wrote:
> 
> 
> On 11/24/2016 01:33 PM, Yuanhan Liu wrote:
>> On Thu, Nov 24, 2016 at 09:30:49AM +, Kevin Traynor wrote:
>>> > On 11/24/2016 06:31 AM, Yuanhan Liu wrote:
 > > On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
>>> >  You keep assuming that you have the VM started first and
>>> >  figure out things afterwards, but this does not work.
>>> > 
>>> >  Think about a cluster of machines. You want to start a VM in
>>> >  a way that will ensure compatibility with all hosts
>>> >  in a cluster.
>> > >>>
>> > >>> I see. I was more considering about the case when the dst
>> > >>> host (including the qemu and dpdk combo) is given, and
>> > >>> then determine whether it will be a successfull migration
>> > >>> or not.
>> > >>>
>> > >>> And you are asking that we need to know which host could
>> > >>> be a good candidate before starting the migration. In such
>> > >>> case, we indeed need some inputs from both the qemu and
>> > >>> vhost-user backend.
>> > >>>
>> > >>> For DPDK, I think it could be simple, just as you said, it
>> > >>> could be either a tiny script, or even a macro defined in
>> > >>> the source code file (we extend it every time we add a
>> > >>> new feature) to let the libvirt to read it. Or something
>> > >>> else.
> > >>
> > >> There's the issue of APIs that tweak features as Maxime
> > >> suggested.
 > >
 > > Yes, it's a good point.
 > >
> > >> Maybe the only thing to do is to deprecate it,
 > >
 > > Looks like so.
 > >
> > >> but I feel some way for application to pass info into
> > >> guest might be benefitial.
 > >
 > > The two APIs are just for tweaking feature bits DPDK supports
 before
 > > any device got connected. It's another way to disable some features
 > > (the another obvious way is to through QEMU command lines).
 > >
 > > IMO, it's bit handy only in a case like: we have bunch of VMs.
 Instead
 > > of disabling something though qemu one by one, we could disable it
 > > once in DPDK.
 > >
 > > But I doubt the useful of it. It's only used in DPDK's vhost
 example
 > > after all. Nor is it used in vhost pmd, neither is it used in OVS.
>>> >
>>> > rte_vhost_feature_disable() is currently used in OVS,
>>> lib/netdev-dpdk.c
>> Hmmm. I must have checked very old code ...
>>> >
>>> > netdev_dpdk_vhost_class_init(void)
>>> > {
>>> > static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>>> >
>>> > /* This function can be called for different classes.  The
>>> > initialization
>>> >  * needs to be done only once */
>>> > if (ovsthread_once_start(&once)) {
>>> > rte_vhost_driver_callback_register(&virtio_net_device_ops);
>>> > rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
>>> >   | 1ULL << VIRTIO_NET_F_HOST_TSO6
>>> >   | 1ULL << VIRTIO_NET_F_CSUM);
>> I saw the commit introduced such change, but it tells no reason why
>> it was added.
> 
> I'm also interested to know the reason.

I can't remember off hand, added Mark K or Michal W who should be able
to shed some light on it.

> In any case, I think this is something that can/should be managed by
> the management tool, which  should disable it in cmd parameters.
> 
> Kevin, do you agree?

I think best to find out the reason first. Because if no reason to
disable in the code, then no need to debate!

> 
> Cheers,
> Maxime



[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Yuanhan Liu
On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
> > > You keep assuming that you have the VM started first and
> > > figure out things afterwards, but this does not work.
> > > 
> > > Think about a cluster of machines. You want to start a VM in
> > > a way that will ensure compatibility with all hosts
> > > in a cluster.
> > 
> > I see. I was more considering about the case when the dst
> > host (including the qemu and dpdk combo) is given, and
> > then determine whether it will be a successfull migration
> > or not.
> > 
> > And you are asking that we need to know which host could
> > be a good candidate before starting the migration. In such
> > case, we indeed need some inputs from both the qemu and
> > vhost-user backend.
> > 
> > For DPDK, I think it could be simple, just as you said, it
> > could be either a tiny script, or even a macro defined in
> > the source code file (we extend it every time we add a
> > new feature) to let the libvirt to read it. Or something
> > else.
> 
> There's the issue of APIs that tweak features as Maxime
> suggested.

Yes, it's a good point.

> Maybe the only thing to do is to deprecate it,

Looks like so.

> but I feel some way for application to pass info into
> guest might be benefitial.

The two APIs are just for tweaking the feature bits DPDK supports before
any device gets connected. It's another way to disable some features
(the other obvious way is through QEMU command lines).

IMO, it's a bit handy only in a case like this: we have a bunch of VMs.
Instead of disabling something through qemu one by one, we could disable
it once in DPDK.

But I doubt the usefulness of it. It's only used in DPDK's vhost example
after all. It is used neither in the vhost pmd nor in OVS.

> > > If you don't, guest visible interface will change
> > > and you won't be able to migrate.
> > > 
> > > It does not make sense to discuss feature bits specifically
> > > since that is not the only part of interface.
> > > For example, max ring size supported might change.
> > 
> > I don't quite understand why we have to consider the max ring
> > size here? Isn't it a virtio device attribute, that QEMU could
> > provide such compatibility information?
> >
> > I mean, DPDK is supposed to support vary vring size, it's QEMU
> > to give a specifc value.
> 
> If backend supports s/g of any size up to 2^16, there's no issue.

I don't know about others, but I see no issues in DPDK.

> ATM some backends might be assuming up to 1K s/g since
> QEMU never supported bigger ones. We might classify this
> as a bug, or not and add a feature flag.
> 
> But it's just an example. There might be more values at issue
> in the future.

Yeah, maybe. But we could analyze them one by one.

> > > Let me describe how it works in qemu/libvirt.
> > > When you install a VM, you can specify compatibility
> > > level (aka "machine type"), and you can query the supported compatibility
> > > levels. Management uses that to find the supported compatibility
> > > and stores the compatibility in XML that is migrated with the VM.
> > > There's also a way to find the latest level which is the
> > > default unless overridden by user, again this level
> > > is recorded and then
> > > - management can make sure migration destination is compatible
> > > - management can avoid migration to hosts without that support
> > 
> > Thanks for the info, it helps.
> > 
> > ...
> > > > > >>As version here is an opaque string for libvirt and qemu,
> > > > > >>anything can be used - but I suggest either a list
> > > > > >>of values defining the interface, e.g.
> > > > > >>any_layout=on,max_ring=256
> > > > > >>or a version including the name and vendor of the backend,
> > > > > >>e.g. "org.dpdk.v4.5.6".
> > 
> > The version scheme may not be ideal here. Assume a QEMU is supposed
> > to work with a specific DPDK version; however, the user may disable some
> > newer features through the qemu command line, so that it could also work
> > with an older DPDK version. Using the version scheme will not allow us
> > to do such a migration to an older DPDK version. MTU is a live example
> > here: when the MTU feature is provided by QEMU but is actually disabled
> > by the user, QEMU could also work with an older DPDK without MTU support.
> > 
> > --yliu
> 
> OK, so does a list of values look better to you then?

Yes, if there is no better way.

And I think it may be better not to list all those features literally,
but instead to use a number, say, features=0xdeadbeef.

Listing the feature names means we would have to come to an agreement
among all the components involved here (QEMU, libvirt, DPDK, VPP, and
maybe more backends) to use the exact same feature names. Though it may
not be a big deal, it lacks some flexibility.

A feature bitmask will not have this issue.
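
To illustrate, the compatibility test a management tool could apply to
such a bitmask boils down to a subset check (a sketch with a
hypothetical helper name):

#include <stdbool.h>
#include <stdint.h>

/* Migration is safe when every feature bit negotiated on the source
 * is also supported by the destination backend. */
static bool
migration_compatible(uint64_t src_negotiated, uint64_t dst_supported)
{
	return (src_negotiated & dst_supported) == src_negotiated;
}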

--yliu

> 
> 
> > > > > >>
> > > > > >>Note that typically the list of supported versions can only be
> > > > > >>extended, not shrunk. Also, if the host/guest interf

[dpdk-dev] [RFC 2/9] ethdev: move queue id check in generic layer

2016-11-24 Thread Olivier Matz
Hi Ferruh,

On Thu, 2016-11-24 at 10:59 +, Ferruh Yigit wrote:
> On 11/24/2016 9:54 AM, Olivier Matz wrote:
> > The check of queue_id is done in all drivers implementing
> > rte_eth_rx_queue_count(). Factorize this check in the generic
> > function.
> > 
> > Note that the nfp driver was doing the check differently, which
> > could
> > induce crashes if the queue index was too big.
> > 
> > By the way, also move the is_supported test before the port valid
> > and
> > queue valid test.
> > 
> > PR=52423
> > Signed-off-by: Olivier Matz 
> > Acked-by: Ivan Boule 
> > ---
> 
> <...>
> 
> > diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> > index c3edc23..9551cfd 100644
> > --- a/lib/librte_ether/rte_ethdev.h
> > +++ b/lib/librte_ether/rte_ethdev.h
> > @@ -2693,7 +2693,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
> >   *  The queue id on the specific port.
> >   * @return
> >   *  The number of used descriptors in the specific queue, or:
> > - * (-EINVAL) if *port_id* is invalid
> > + * (-EINVAL) if *port_id* or *queue_id* is invalid
> >   * (-ENOTSUP) if the device does not support this function
> >   */
> >  static inline int
> > @@ -2701,8 +2701,10 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
> >  {
> >  	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > 
> > -	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> >  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> 
> Doing port validity check before accessing dev->dev_ops-
> >rx_queue_count
> can be good idea.
> 
> What about validating port_id even before accessing
> rte_eth_devices[port_id]?
> 

oops right, we should not move this line, it's stupid...

Thanks for the feedback,
Olivier



[dpdk-dev] [PATCH 00/56] Solarflare libefx-based PMD

2016-11-24 Thread Andrew Rybchenko
On 11/23/2016 10:21 PM, Stephen Hemminger wrote:
> On Wed, 23 Nov 2016 10:49:33 +0300
> Andrew Rybchenko  wrote:
>
>> I've tried to explain it above in item (2):
>>
>>   >>>
>>
>>2. Another Solarflare PMD with in-kernel part (for control operations)
>>   is considered and could be added in the future. Code for data path
>>   should be shared by these two drivers. libefx-based PMD is put into
>>   'efx' subdirectory to have a space for another PMD and shared code.
>>
>> <<<
>>
>> So, the main reason is to have a location for the code shared by the two
>> Solarflare network PMDs. Maybe it is better to relocate when we really have
>> it. I'm open to other ideas/suggestions.
> So is this driver dependent on another non-upstream kernel driver to work?

No, this driver does not have any external dependencies.

> Is this documented in docs directory.  If it does depend on other non-upstream
> components, then the default config should disable the driver.


Andrew.



[dpdk-dev] Proposal for a new Committer model

2016-11-24 Thread Yuanhan Liu
On Wed, Nov 23, 2016 at 03:19:19PM -0500, Neil Horman wrote:
> On Wed, Nov 23, 2016 at 11:41:20PM +0800, Yuanhan Liu wrote:
> > On Wed, Nov 23, 2016 at 09:11:54AM -0500, Neil Horman wrote:
> > > > Could we define some of the potential subtrees now and look to 
> > > > introduce them in the this release cycle? EAL and the Core libs, as 
> > > > suggested by Thomas, seem like 2 obvious ones.
> > > > 
> > > Sure, I'd suggest the following:
> > 
> > I would pull the git history to see which components are in
> > active status in last release (or even, in last few release).
> > And try to make a sub-tree if corresponding component is hot.
> > 
> > # the 2nd column shows how many patches are prefixed with a related component
> > [yliu at yliu-dev ~/dpdk]$ git log --oneline v16.07..v16.11 | awk '{print $2}' | \
> > sort | uniq -c  | sort -nr | head -30 | nl
> >  1   52 doc:
> >  2   40 net/ixgbe/base:
> >  3   38 app/test:
> >  4   37 kni:
> >  5   27 vhost:
> >  6   27 net/virtio:
> >  7   27 net/mlx5:
> >  8   26 app/testpmd:
> >  9   25 net/i40e:
> > 10   23 net/pcap:
> > 11   22 net/bnxt:
> > 12   20 net/enic:
> > 13   18 net/qede:
> > 14   17 net/thunderx:
> > 15   16 net/qede/base:
> > 16   16 eal:
> > 17   15 net/ixgbe:
> > 18   14 net:
> > 19   14 crypto/qat:
> > 20   13 scripts:
> > 21   13 net/bnx2x:
> > 22   12 net/i40e/base:
> > 23   12 examples/ipsec-secgw:
> > 24   11 mbuf:
> > 25   11 hash:
> > 26   10 lib:
> > 27   10 examples/ip_pipeline:
> > 28   10 ethdev:
> > 29    9 pci:
> > 30    7 net/vmxnet3:
> > ...
> > 46    3 pdump:
> > 47    3 net/virtio_user:
> > 48    3 net/ring:
> > 49    3 net/nfp:
> > 50    3 net/mlx:
> > 51    3 net/ena:
> > 52    3 net/e1000:
> > 53    3 net/bonding:
> > ...
> > 56    2 sched:
> > 57    2 port:
> > ...
> > 65    1 timer:
> > 66    1 remove
> > 67    1 pmdinfogen:
> > 68    1 net/igb:
> > 69    1 net/enic/base:
> > 70    1 meter:
> > ...
> > 84    1 cfgfile:
> > 85    1 app/procinfo:
> > 86    1 app/proc_info:
> > 87    1 acl:
> > 
> > Something obvious is that:
> > 
> > - "doc" deserves a sub-tree, and John is a perfect committer for that
> >   if he's willing to.
> > 
> > - generally, I'd agree with Neil that most (if not all) pmds may need
> >   a sub-tree. While, some others may not, for example, net/ring, net/pcap.
> > 
> No, thats the opposite of what I think.  I think all net pmds should flow
> through a single subtree, all crypto pmds through another, etc.

I misunderstood it. I thought you were suggesting creating a sub-tree
for most (or all) pmds. Some of my comments didn't apply then.

But yes, we have already done that: we have next-net and next-crypto.

> >   For those non-active pmds, I think it's okay to let the generic
> >   pmd committer to cover them.
> > 
> Not sure what you're getting at here.  Low volume pmds (or any library) can 
> still
> go through a subtree.  The goal is to fragmet the commit work so one person
> doesn't have to do it all.
> 
> > - it's not that wise to me to list all the components we have so far
> >   and make a sub-tree for each of them.
> > 
> I think you misunderstood the organization of my last note.  I agree with you
> here.  When I listed the core and listed several libraries under it, my intent
> was to create a core subtree that accepted patches for all of those libraries.
> 
> >   For example, some components like librte_{port, pdump, cfgfile, acl,
> >   and etc} just have few (or even, just one) commits in last release.
> >   It makes no sense to me to introduce a tree for each of them.
> > 
> Yes, this is what I was saying in my last note.
> 
> > Another thought is we could also create sub-trees based on category
> > but not on components like Neil suggested, especially that EAL looks
> > way too big to be maintained in one tree. Instead, it could be something
> > like:
> > 
> > - a tree for BSD
> > 
> This gets tricky, because then several libraries will be covered by multiple
> trees, and that leads to merge conflicts.

If we go that way, I meant a sub-sub-tree under the EAL sub-tree. And
conflicts are almost impossible to avoid when we have multiple trees.

> > - a tree for ARM (and some other trees for other platforms)
> > 
> > - a tree for mem related (mempool, mbuf, hugepage, etc)
> > 
> > - a tree for BUS
> > 
> > - ...
> > 
> > 
> > Last but not least, I think it's generally good to have more and
> > more trees in the end. But I don't think it's a good idea to go
> > radical and create all those trees at once (say, in one release).
> > 
> > Something I would like to suggest i

[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Maxime Coquelin


On 11/24/2016 01:33 PM, Yuanhan Liu wrote:
> On Thu, Nov 24, 2016 at 09:30:49AM +, Kevin Traynor wrote:
>> > On 11/24/2016 06:31 AM, Yuanhan Liu wrote:
>>> > > On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
>> >  You keep assuming that you have the VM started first and
>> >  figure out things afterwards, but this does not work.
>> > 
>> >  Think about a cluster of machines. You want to start a VM in
>> >  a way that will ensure compatibility with all hosts
>> >  in a cluster.
> > >>>
> > >>> I see. I was more considering about the case when the dst
> > >>> host (including the qemu and dpdk combo) is given, and
> > >>> then determine whether it will be a successfull migration
> > >>> or not.
> > >>>
> > >>> And you are asking that we need to know which host could
> > >>> be a good candidate before starting the migration. In such
> > >>> case, we indeed need some inputs from both the qemu and
> > >>> vhost-user backend.
> > >>>
> > >>> For DPDK, I think it could be simple, just as you said, it
> > >>> could be either a tiny script, or even a macro defined in
> > >>> the source code file (we extend it every time we add a
> > >>> new feature) to let the libvirt to read it. Or something
> > >>> else.
 > >>
 > >> There's the issue of APIs that tweak features as Maxime
 > >> suggested.
>>> > >
>>> > > Yes, it's a good point.
>>> > >
 > >> Maybe the only thing to do is to deprecate it,
>>> > >
>>> > > Looks like so.
>>> > >
 > >> but I feel some way for application to pass info into
 > >> guest might be benefitial.
>>> > >
>>> > > The two APIs are just for tweaking feature bits DPDK supports before
>>> > > any device got connected. It's another way to disable some features
>>> > > (the another obvious way is to through QEMU command lines).
>>> > >
>>> > > IMO, it's bit handy only in a case like: we have bunch of VMs. Instead
>>> > > of disabling something though qemu one by one, we could disable it
>>> > > once in DPDK.
>>> > >
>>> > > But I doubt the useful of it. It's only used in DPDK's vhost example
>>> > > after all. Nor is it used in vhost pmd, neither is it used in OVS.
>> >
>> > rte_vhost_feature_disable() is currently used in OVS, lib/netdev-dpdk.c
> Hmmm. I must have checked very old code ...
>> >
>> > netdev_dpdk_vhost_class_init(void)
>> > {
>> > static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> >
>> > /* This function can be called for different classes.  The
>> > initialization
>> >  * needs to be done only once */
>> > if (ovsthread_once_start(&once)) {
>> > rte_vhost_driver_callback_register(&virtio_net_device_ops);
>> > rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
>> >   | 1ULL << VIRTIO_NET_F_HOST_TSO6
>> >   | 1ULL << VIRTIO_NET_F_CSUM);
> I saw the commit introduced such change, but it tells no reason why
> it was added.

I'm also interested to know the reason.
In any case, I think this is something that can/should be managed by
the management tool, which  should disable it in cmd parameters.

Kevin, do you agree?

Cheers,
Maxime


[dpdk-dev] [PATCH] doc: introduce PVP reference benchmark

2016-11-24 Thread Maxime Coquelin


On 11/24/2016 12:58 PM, Kevin Traynor wrote:
> On 11/23/2016 09:00 PM, Maxime Coquelin wrote:
>> Having reference benchmarks is important in order to obtain
>> reproducible performance figures.
>>
>> This patch describes required steps to configure a PVP setup
>> using testpmd in both host and guest.
>>
>> Not relying on an external vSwitch eases integration in a CI loop by
>> not being impacted by DPDK API changes.
>>
>> Signed-off-by: Maxime Coquelin 
>
> A short template/hint of the main things to report after running could
> be useful to help ML discussions about results e.g.
>
> Traffic Generator: IXIA
> Acceptable Loss: 100% (i.e. raw throughput test)
> DPDK version/commit: v16.11
> QEMU version/commit: v2.7.0
> Patches applied: 
> CPU: E5-2680 v3, 2.8GHz
> Result: x mpps
> NIC: ixgbe 82599

Good idea, I'll add a section in the end providing this template.

>
>> ---
>>  doc/guides/howto/img/pvp_2nics.svg   | 556 +++
>>  doc/guides/howto/index.rst   |   1 +
>>  doc/guides/howto/pvp_reference_benchmark.rst | 389 +++
>>  3 files changed, 946 insertions(+)
>>  create mode 100644 doc/guides/howto/img/pvp_2nics.svg
>>  create mode 100644 doc/guides/howto/pvp_reference_benchmark.rst
>>
>
> 
>
>> +Host tuning
>> +~~~
>
> I would add turbo boost =disabled on BIOS.
>
+1, will be in next revision.

>> +
>> +#. Append these options to Kernel command line:
>> +
>> +   .. code-block:: console
>> +
>> +intel_pstate=disable mce=ignore_ce default_hugepagesz=1G hugepagesz=1G hugepages=6 isolcpus=2-7 rcu_nocbs=2-7 nohz_full=2-7 iommu=pt intel_iommu=on
>> +
>> +#. Disable hyper-threads at runtime if necessary and BIOS not accessible:
>> +
>> +   .. code-block:: console
>> +
>> +cat /sys/devices/system/cpu/cpu*[0-9]/topology/thread_siblings_list \
>> +| sort | uniq \
>> +| awk -F, '{system("echo 0 > /sys/devices/system/cpu/cpu"$2"/online")}'
>> +
>> +#. Disable NMIs:
>> +
>> +   .. code-block:: console
>> +
>> +echo 0 > /proc/sys/kernel/nmi_watchdog
>> +
>> +#. Exclude isolated CPUs from the writeback cpumask:
>> +
>> +   .. code-block:: console
>> +
>> +echo ff03 > /sys/bus/workqueue/devices/writeback/cpumask
>> +
>> +#. Isolate CPUs from IRQs:
>> +
>> +   .. code-block:: console
>> +
>> +clear_mask=0xfc #Isolate CPU2 to CPU7 from IRQs
>> +for i in /proc/irq/*/smp_affinity
>> +do
>> + echo "obase=16;$(( 0x$(cat $i) & ~$clear_mask ))" | bc > $i
>> +done
>> +
>> +Qemu build
>> +~~
>> +
>> +   .. code-block:: console
>> +
>> +git clone git://dpdk.org/dpdk
>> +cd dpdk
>> +export RTE_SDK=$PWD
>> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
>> +
>> +DPDK build
>> +~~
>> +
>> +   .. code-block:: console
>> +
>> +git clone git://dpdk.org/dpdk
>> +cd dpdk
>> +export RTE_SDK=$PWD
>> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
>> +
>> +Testpmd launch
>> +~~
>> +
>> +#. Assign NICs to DPDK:
>> +
>> +   .. code-block:: console
>> +
>> +modprobe vfio-pci
>> +$RTE_SDK/install/sbin/dpdk-devbind -b vfio-pci :11:00.0 :11:00.1
>> +
>> +*Note: Sandy Bridge family seems to have some limitations wrt its IOMMU,
>> +giving poor performance results. To achieve good performance on these 
>> machines,
>> +consider using UIO instead.*
>> +
>> +#. Launch testpmd application:
>> +
>> +   .. code-block:: console
>> +
>> +$RTE_SDK/install/bin/testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
>> +--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
>> +--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
>> +--portmask=f --disable-hw-vlan -i --rxq=1 --txq=1
>> +--nb-cores=4 --forward-mode=io
>> +
>> +#. In testpmd interactive mode, set the portlist to obtain the right 
>> chaining:
>> +
>> +   .. code-block:: console
>> +
>> +set portlist 0,2,1,3
>> +start
>> +
>> +VM launch
>> +~
>> +
>> +The VM may be launched ezither by calling directly QEMU, or by using 
>> libvirt.
>
> s/ezither/either
>
>> +
>> +#. Qemu way:
>> +
>> +Launch QEMU with two Virtio-net devices paired to the vhost-user sockets 
>> created by testpmd:
>> +
>> +   .. code-block:: console
>> +
>> +/bin/x86_64-softmmu/qemu-system-x86_64 \
>> +-enable-kvm -cpu host -m 3072 -smp 3 \
>> +-chardev socket,id=char0,path=/tmp/vhost-user1 \
>> +-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>> +-device virtio-net-pci,netdev=mynet1,mac=52:54:00:02:d9:01,addr=0x10 \
>> +-chardev socket,id=char1,path=/tmp/vhost-user2 \
>> +-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \
>> +-device virtio-net-pci,netdev=mynet2,mac=52:54:00:02:d9:02,addr=0x11 \
>> +-object memory-backend-file,id=mem,size=3072M,mem-path=/dev/hugepages,share=on \
>> +-numa node,memdev=mem -mem-prealloc \
>> +-net user,hostfwd=tcp::1002$1-:22 -

[dpdk-dev] [PATCH] doc: introduce PVP reference benchmark

2016-11-24 Thread Yuanhan Liu
First of all, thanks for the doc! It's a great one.

On Wed, Nov 23, 2016 at 10:00:06PM +0100, Maxime Coquelin wrote:
> +Qemu build
> +~~
> +
> +   .. code-block:: console
> +
> +git clone git://dpdk.org/dpdk
> +cd dpdk
> +export RTE_SDK=$PWD
> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install

It's actually DPDK build.

I will take a closer look at it and also render it to see how it looks
like when I get back to office next week.

--yliu
> +
> +DPDK build
> +~~
> +
> +   .. code-block:: console
> +
> +git clone git://dpdk.org/dpdk
> +cd dpdk
> +export RTE_SDK=$PWD
> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
> +


[dpdk-dev] [PATCH v2 5/5] Revert "bonding: use existing enslaved device queues"

2016-11-24 Thread Jan Blunck
From: Ilya Maximets 

This reverts commit 5b7bb2bda5519b7800f814df64d4e015282140e5.

It is necessary to reconfigure all queues every time because the
configuration can change.

For example, if we're reconfiguring the bonding device with a new memory
pool, already configured queues will still use the old one. And if the
old mempool is freed, the application will likely panic when it tries to
use the freed mempool.
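
A sketch of the hazard (hypothetical helper and pool names, for
illustration only):

#include <rte_ethdev.h>
#include <rte_mempool.h>

static void
mtu_reconfigure(uint8_t port_id, unsigned int socket_id,
		struct rte_mempool *old_pool, struct rte_mempool *new_pool)
{
	rte_eth_dev_stop(port_id);
	/* With the reverted commit, an already set-up queue was reused,
	 * so this call did not actually switch the queue to new_pool. */
	rte_eth_rx_queue_setup(port_id, 0, 256, socket_id, NULL, new_pool);
	rte_eth_dev_start(port_id);
	/* The queue still allocates from old_pool, so freeing it leads
	 * to the rte_mempool_get_ops() assert below. */
	rte_mempool_free(old_pool);
}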

This happens when we use the bonding device with OVS 2.6 during MTU
reconfiguration:

PANIC in rte_mempool_get_ops():
assert "(ops_index >= 0) && (ops_index < RTE_MEMPOOL_MAX_OPS_IDX)" failed

Cc: 
Signed-off-by: Ilya Maximets 
Acked-by: Declan Doherty 
Acked-by: Jan Blunck 
---
 drivers/net/bonding/rte_eth_bond_pmd.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index e61afc9..b604642 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -1317,8 +1317,6 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
	struct bond_rx_queue *bd_rx_q;
	struct bond_tx_queue *bd_tx_q;

-	uint16_t old_nb_tx_queues = slave_eth_dev->data->nb_tx_queues;
-	uint16_t old_nb_rx_queues = slave_eth_dev->data->nb_rx_queues;
	int errval;
	uint16_t q_id;

@@ -1362,9 +1360,7 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
	}

	/* Setup Rx Queues */
-	/* Use existing queues, if any */
-	for (q_id = old_nb_rx_queues;
-	     q_id < bonded_eth_dev->data->nb_rx_queues; q_id++) {
+	for (q_id = 0; q_id < bonded_eth_dev->data->nb_rx_queues; q_id++) {
		bd_rx_q = (struct bond_rx_queue *)bonded_eth_dev->data->rx_queues[q_id];

		errval = rte_eth_rx_queue_setup(slave_eth_dev->data->port_id, q_id,
@@ -1380,9 +1376,7 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
	}

	/* Setup Tx Queues */
-	/* Use existing queues, if any */
-	for (q_id = old_nb_tx_queues;
-	     q_id < bonded_eth_dev->data->nb_tx_queues; q_id++) {
+	for (q_id = 0; q_id < bonded_eth_dev->data->nb_tx_queues; q_id++) {
		bd_tx_q = (struct bond_tx_queue *)bonded_eth_dev->data->tx_queues[q_id];

		errval = rte_eth_tx_queue_setup(slave_eth_dev->data->port_id, q_id,
-- 
2.7.4



[dpdk-dev] [PATCH v2 4/5] net/bonding: Force reconfiguration of removed slave interfaces

2016-11-24 Thread Jan Blunck
After a slave interface is removed from a bond group it still has the
configuration of the bond interface. Lets enforce that the slave interface
is reconfigured after removal by resetting it.

Signed-off-by: Jan Blunck 
---
 drivers/net/bonding/rte_eth_bond_pmd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index a80b6fa..e61afc9 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -1454,6 +1454,9 @@ slave_remove(struct bond_dev_private *internals,
(internals->slave_count - i - 1));

internals->slave_count--;
+
+   /* force reconfiguration of slave interfaces */
+   _rte_eth_dev_reset(slave_eth_dev);
 }

 static void
-- 
2.7.4



[dpdk-dev] [PATCH v2 3/5] ethdev: Add DPDK internal _rte_eth_dev_reset()

2016-11-24 Thread Jan Blunck
This is a helper for DPDK internal users to force a reconfiguration of a
device.

Signed-off-by: Jan Blunck 
---
 lib/librte_ether/rte_ethdev.c  | 16 
 lib/librte_ether/rte_ethdev.h  | 13 +
 lib/librte_ether/rte_ether_version.map |  6 ++
 3 files changed, 35 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index a3986ad..9e69ee5 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -858,6 +858,22 @@ rte_eth_dev_configure(uint8_t port_id, uint16_t nb_rx_q, uint16_t nb_tx_q,
return 0;
 }

+void
+_rte_eth_dev_reset(struct rte_eth_dev *dev)
+{
+   if (dev->data->dev_started) {
+   RTE_PMD_DEBUG_TRACE(
+   "port %d must be stopped to allow reset\n",
+   dev->data->port_id);
+   return;
+   }
+
+   rte_eth_dev_rx_queue_config(dev, 0);
+   rte_eth_dev_tx_queue_config(dev, 0);
+
+   memset(&dev->data->dev_conf, 0, sizeof(dev->data->dev_conf));
+}
+
 static void
 rte_eth_dev_config_restore(uint8_t port_id)
 {
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9678179..e0740db 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1914,6 +1914,19 @@ int rte_eth_dev_configure(uint8_t port_id, uint16_t nb_rx_queue,
uint16_t nb_tx_queue, const struct rte_eth_conf *eth_conf);

 /**
+ * @internal Release a devices rx/tx queues and clear its configuration to
+ * force the user application to reconfigure it. It is for DPDK internal user
+ * only.
+ *
+ * @param dev
+ *  Pointer to struct rte_eth_dev.
+ *
+ * @return
+ *  void
+ */
+void _rte_eth_dev_reset(struct rte_eth_dev *dev);
+
+/**
  * Allocate and set up a receive queue for an Ethernet device.
  *
  * The function allocates a contiguous block of memory for *nb_rx_desc*
diff --git a/lib/librte_ether/rte_ether_version.map b/lib/librte_ether/rte_ether_version.map
index 72be66d..0c31c5d 100644
--- a/lib/librte_ether/rte_ether_version.map
+++ b/lib/librte_ether/rte_ether_version.map
@@ -147,3 +147,9 @@ DPDK_16.11 {
rte_eth_dev_pci_remove;

 } DPDK_16.07;
+
+DPDK_17.02 {
+   global:
+
+   _rte_eth_dev_reset;
+} DPDK_16.11;
-- 
2.7.4



[dpdk-dev] [PATCH v2 2/5] ethdev: Free rx/tx_queues after releasing all queues

2016-11-24 Thread Jan Blunck
If all queues are released lets also free up the dev->data->rx/tx_queues
to be able to properly reinitialize.

Signed-off-by: Jan Blunck 
---
 lib/librte_ether/rte_ethdev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index 8c4b6cd..a3986ad 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -531,6 +531,9 @@ rte_eth_dev_rx_queue_config(struct rte_eth_dev *dev, uint16_t nb_queues)

for (i = nb_queues; i < old_nb_queues; i++)
(*dev->dev_ops->rx_queue_release)(rxq[i]);
+
+   rte_free(dev->data->rx_queues);
+   dev->data->rx_queues = NULL;
}
dev->data->nb_rx_queues = nb_queues;
return 0;
@@ -682,6 +685,9 @@ rte_eth_dev_tx_queue_config(struct rte_eth_dev *dev, uint16_t nb_queues)

for (i = nb_queues; i < old_nb_queues; i++)
(*dev->dev_ops->tx_queue_release)(txq[i]);
+
+   rte_free(dev->data->tx_queues);
+   dev->data->tx_queues = NULL;
}
dev->data->nb_tx_queues = nb_queues;
return 0;
-- 
2.7.4



[dpdk-dev] [PATCH v2 1/5] ethdev: Call rx/tx_queue_release before rx/tx_queue_setup

2016-11-24 Thread Jan Blunck
If a queue has been set up before, release it before setting it up again.
Otherwise we might leak resources.

Signed-off-by: Jan Blunck 
---
 lib/librte_ether/rte_ethdev.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index fde8112..8c4b6cd 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -1010,6 +1010,7 @@ rte_eth_rx_queue_setup(uint8_t port_id, uint16_t rx_queue_id,
uint32_t mbp_buf_size;
struct rte_eth_dev *dev;
struct rte_eth_dev_info dev_info;
+   void **rxq;

RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);

@@ -1068,6 +1069,14 @@ rte_eth_rx_queue_setup(uint8_t port_id, uint16_t rx_queue_id,
return -EINVAL;
}

+   rxq = dev->data->rx_queues;
+   if (rxq[rx_queue_id]) {
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_release,
+   -ENOTSUP);
+   (*dev->dev_ops->rx_queue_release)(rxq[rx_queue_id]);
+   rxq[rx_queue_id] = NULL;
+   }
+
if (rx_conf == NULL)
rx_conf = &dev_info.default_rxconf;

@@ -1089,6 +1098,7 @@ rte_eth_tx_queue_setup(uint8_t port_id, uint16_t tx_queue_id,
 {
struct rte_eth_dev *dev;
struct rte_eth_dev_info dev_info;
+   void **txq;

RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);

@@ -1121,6 +1131,14 @@ rte_eth_tx_queue_setup(uint8_t port_id, uint16_t tx_queue_id,
return -EINVAL;
}

+   txq = dev->data->tx_queues;
+   if (txq[tx_queue_id]) {
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_queue_release,
+   -ENOTSUP);
+   (*dev->dev_ops->tx_queue_release)(txq[tx_queue_id]);
+   txq[tx_queue_id] = NULL;
+   }
+
if (tx_conf == NULL)
tx_conf = &dev_info.default_txconf;

-- 
2.7.4



[dpdk-dev] [PATCH v2 0/5] bonding: setup all queues of slave devices

2016-11-24 Thread Jan Blunck
Prior to 16.11 some drivers (e.g. virtio) still had problems if their
queues were set up repeatedly. The bonding driver was working around the
problem by reusing already set-up queues. This series of patches changes
the way queue setup is done, giving the driver control to properly
release already initialized queues before they are set up again. The
resulting driver call sequence is as if the number of queues were
temporarily reduced before the queues are set up again.

Ilya Maximets (1):
  Revert "bonding: use existing enslaved device queues"

Jan Blunck (4):
  ethdev: Call rx/tx_queue_release before rx/tx_queue_setup
  ethdev: Free rx/tx_queues after releasing all queues
  ethdev: Add DPDK internal _rte_eth_dev_reset()
  net/bonding: Force reconfiguration of removed slave interfaces

 drivers/net/bonding/rte_eth_bond_pmd.c | 13 +--
 lib/librte_ether/rte_ethdev.c  | 40 ++
 lib/librte_ether/rte_ethdev.h  | 13 +++
 lib/librte_ether/rte_ether_version.map |  6 +
 4 files changed, 64 insertions(+), 8 deletions(-)

-- 
2.7.4



[dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model

2016-11-24 Thread Bruce Richardson
On Thu, Nov 24, 2016 at 07:29:13AM +0530, Jerin Jacob wrote:
> On Wed, Nov 23, 2016 at 07:39:09PM +0100, Thomas Monjalon wrote:

Just some comments of mine triggered by Thomas' comments.


> > + */
> > > +static inline int
> > > +rte_event_enqueue(uint8_t dev_id, uint8_t port_id, struct rte_event *ev)
> > 
> > Is it really needed to have non-burst variant of enqueue/dequeue?
> 
> Yes. certain HW can work only with non burst variants.

In those cases, is it not acceptable just to have the dequeue_burst
function return 1 all the time? It would allow apps to be more portable
between burst and non-burst variants, would it not?
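
For instance (hypothetical PMD names; struct rte_event is from the
proposed API), a single-event device could still expose the burst
variant:

#include <stdint.h>

struct rte_event;

/* Placeholder for the PMD's single-event hardware dequeue;
 * returns 0 or 1 events. */
static uint16_t my_pmd_dequeue_one(void *port, struct rte_event *ev,
				   uint64_t wait);

static uint16_t
my_pmd_dequeue_burst(void *port, struct rte_event ev[],
		     uint16_t nb_events, uint64_t wait)
{
	if (nb_events == 0)
		return 0;
	/* Never returns more than one event, which is valid for the
	 * burst API and keeps applications portable. */
	return my_pmd_dequeue_one(port, &ev[0], wait);
}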

> > 
> > > +/**
> > > + * Converts nanoseconds to *wait* value for rte_event_dequeue()
> > > + *
> > > + * If the device is configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT 
> > > flag then
> > > + * application can use this function to convert wait value in 
> > > nanoseconds to
> > > + * implementations specific wait value supplied in rte_event_dequeue()
> > 
> > Why is it implementation-specific?
> > Why this conversion is not internal in the driver?
> 
> This is for performance optimization, otherwise in drivers
> need to convert ns to ticks in "fast path"
> 
> > 
Is that really likely to be a performance bottleneck? I would expect
modern cores to fly through basic arithmetic in a negligible number of
cycles.

/Bruce


[dpdk-dev] [RFC 1/9] ethdev: clarify api comments of rx queue count

2016-11-24 Thread Olivier Matz
On Thu, 2016-11-24 at 10:52 +, Ferruh Yigit wrote:
> On 11/24/2016 9:54 AM, Olivier Matz wrote:
> > The API comments are not consistent between each other.
> > 
> > The function rte_eth_rx_queue_count() returns the number of used
> > descriptors on a receive queue.
> > 
> > PR=52423
> 
> What is this marker?
> 

Sorry, this is a mistake, it's an internal marker...
I hoped nobody would notice it ;)


> > Signed-off-by: Olivier Matz 
> > Acked-by: Ivan Boule 
> 
> Acked-by: Ferruh Yigit 
> 

Thanks for reviewing!

Regards,
Olivier



[dpdk-dev] [PATCH] doc: introduce PVP reference benchmark

2016-11-24 Thread Kevin Traynor
On 11/23/2016 09:00 PM, Maxime Coquelin wrote:
> Having reference benchmarks is important in order to obtain
> reproducible performance figures.
> 
> This patch describes required steps to configure a PVP setup
> using testpmd in both host and guest.
> 
> Not relying on an external vSwitch eases integration in a CI loop by
> not being impacted by DPDK API changes.
> 
> Signed-off-by: Maxime Coquelin 

A short template/hint of the main things to report after running could
be useful to help ML discussions about results e.g.

Traffic Generator: IXIA
Acceptable Loss: 100% (i.e. raw throughput test)
DPDK version/commit: v16.11
QEMU version/commit: v2.7.0
Patches applied: 
CPU: E5-2680 v3, 2.8GHz
Result: x mpps
NIC: ixgbe 82599

> ---
>  doc/guides/howto/img/pvp_2nics.svg   | 556 +++
>  doc/guides/howto/index.rst   |   1 +
>  doc/guides/howto/pvp_reference_benchmark.rst | 389 +++
>  3 files changed, 946 insertions(+)
>  create mode 100644 doc/guides/howto/img/pvp_2nics.svg
>  create mode 100644 doc/guides/howto/pvp_reference_benchmark.rst
> 



> +Host tuning
> +~~~

I would add turbo boost =disabled on BIOS.

> +
> +#. Append these options to Kernel command line:
> +
> +   .. code-block:: console
> +
> +intel_pstate=disable mce=ignore_ce default_hugepagesz=1G hugepagesz=1G hugepages=6 isolcpus=2-7 rcu_nocbs=2-7 nohz_full=2-7 iommu=pt intel_iommu=on
> +
> +#. Disable hyper-threads at runtime if necessary and BIOS not accessible:
> +
> +   .. code-block:: console
> +
> +cat /sys/devices/system/cpu/cpu*[0-9]/topology/thread_siblings_list \
> +| sort | uniq \
> +| awk -F, '{system("echo 0 > /sys/devices/system/cpu/cpu"$2"/online")}'
> +
> +#. Disable NMIs:
> +
> +   .. code-block:: console
> +
> +echo 0 > /proc/sys/kernel/nmi_watchdog
> +
> +#. Exclude isolated CPUs from the writeback cpumask:
> +
> +   .. code-block:: console
> +
> +echo ff03 > /sys/bus/workqueue/devices/writeback/cpumask
> +
> +#. Isolate CPUs from IRQs:
> +
> +   .. code-block:: console
> +
> +clear_mask=0xfc #Isolate CPU2 to CPU7 from IRQs
> +for i in /proc/irq/*/smp_affinity
> +do
> + echo "obase=16;$(( 0x$(cat $i) & ~$clear_mask ))" | bc > $i
> +done
> +
> +Qemu build
> +~~
> +
> +   .. code-block:: console
> +
> +git clone git://dpdk.org/dpdk
> +cd dpdk
> +export RTE_SDK=$PWD
> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
> +
> +DPDK build
> +~~
> +
> +   .. code-block:: console
> +
> +git clone git://dpdk.org/dpdk
> +cd dpdk
> +export RTE_SDK=$PWD
> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
> +
> +Testpmd launch
> +~~
> +
> +#. Assign NICs to DPDK:
> +
> +   .. code-block:: console
> +
> +modprobe vfio-pci
> +$RTE_SDK/install/sbin/dpdk-devbind -b vfio-pci :11:00.0 :11:00.1
> +
> +*Note: Sandy Bridge family seems to have some limitations wrt its IOMMU,
> +giving poor performance results. To achieve good performance on these 
> machines,
> +consider using UIO instead.*
> +
> +#. Launch testpmd application:
> +
> +   .. code-block:: console
> +
> +$RTE_SDK/install/bin/testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
> +--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
> +--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
> +--portmask=f --disable-hw-vlan -i --rxq=1 --txq=1
> +--nb-cores=4 --forward-mode=io
> +
> +#. In testpmd interactive mode, set the portlist to obtain the right chaining:
> +
> +   .. code-block:: console
> +
> +set portlist 0,2,1,3
> +start
> +
> +VM launch
> +~
> +
> +The VM may be launched ezither by calling directly QEMU, or by using libvirt.

s/ezither/either

> +
> +#. Qemu way:
> +
> +Launch QEMU with two Virtio-net devices paired to the vhost-user sockets 
> created by testpmd:
> +
> +   .. code-block:: console
> +
> +/bin/x86_64-softmmu/qemu-system-x86_64 \
> +-enable-kvm -cpu host -m 3072 -smp 3 \
> +-chardev socket,id=char0,path=/tmp/vhost-user1 \
> +-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
> +-device virtio-net-pci,netdev=mynet1,mac=52:54:00:02:d9:01,addr=0x10 \
> +-chardev socket,id=char1,path=/tmp/vhost-user2 \
> +-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \
> +-device virtio-net-pci,netdev=mynet2,mac=52:54:00:02:d9:02,addr=0x11 \
> +-object memory-backend-file,id=mem,size=3072M,mem-path=/dev/hugepages,share=on \
> +-numa node,memdev=mem -mem-prealloc \
> +-net user,hostfwd=tcp::1002$1-:22 -net nic \
> +-qmp unix:/tmp/qmp.socket,server,nowait \
> +-monitor stdio .qcow2

Probably mergeable rx data path =off would want to be tested also when
evaluating any performance improvements/regressions.

> +
> +You can use this qmp-vcpu-pin script to pin vCPUs:
> +
> +   .. code-block:: python
> +
> +#

[dpdk-dev] [PATCH v2 1/2] net: remove dead driver names

2016-11-24 Thread Ferruh Yigit
On 11/21/2016 6:06 PM, David Marchand wrote:
> Since commit b1fb53a39d88 ("ethdev: remove some PCI specific handling"),
> rte_eth_dev_info_get() relies on dev->data->drv_name to report the driver
> name to caller.
> 
> Having the pmds set driver_info->driver_name is useless,
> since ethdev overwrites it right after.
> The only thing the pmd must do is:
> - for pci drivers, call rte_eth_copy_pci_info() which then sets
>   data->drv_name
> - for vdev drivers, manually set data->drv_name
> 
> At this stage, virtio-user does not properly report a driver name (fixed in
> next commit).
> 
> Signed-off-by: David Marchand 
> Reviewed-by: Ferruh Yigit 
> Reviewed-by: Jan Blunck 

Series applied to dpdk-next-net/master, thanks.



[dpdk-dev] [PATCH] crypto/qat: fix to avoid buffer overwrite in OOP case

2016-11-24 Thread Fiona Trahe
In out-of-place operation, data is DMAed from source mbuf
to destination mbuf. To avoid header data in dest mbuf being
overwritten, the minimal data-set should be DMAed.

Fixes: 39e0bee48e81 ("crypto/qat: rework request builder for performance")

Signed-off-by: Fiona Trahe 
---
This patch depends on the following patch:
  crypto: remove unused digest-appended feature
  http://dpdk.org/dev/patchwork/patch/17079/

 drivers/crypto/qat/qat_crypto.c | 66 -
 drivers/crypto/qat/qat_crypto.h |  1 +
 2 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/drivers/crypto/qat/qat_crypto.c b/drivers/crypto/qat/qat_crypto.c
index 6a6bd2e..afce4ac 100644
--- a/drivers/crypto/qat/qat_crypto.c
+++ b/drivers/crypto/qat/qat_crypto.c
@@ -955,7 +955,7 @@ qat_write_hw_desc_entry(struct rte_crypto_op *op, uint8_t *out_msg)
uint32_t cipher_len = 0, cipher_ofs = 0;
uint32_t auth_len = 0, auth_ofs = 0;
uint32_t min_ofs = 0;
-   uint64_t buf_start = 0;
+   uint64_t src_buf_start = 0, dst_buf_start = 0;


 #ifdef RTE_LIBRTE_PMD_QAT_DEBUG_TX
@@ -1077,27 +1077,40 @@ qat_write_hw_desc_entry(struct rte_crypto_op *op, uint8_t *out_msg)
if (do_cipher && do_auth)
min_ofs = cipher_ofs < auth_ofs ? cipher_ofs : auth_ofs;

-
-   /* Start DMA at nearest aligned address below min_ofs */
-   #define QAT_64_BTYE_ALIGN_MASK (~0x3f)
-   buf_start = rte_pktmbuf_mtophys_offset(op->sym->m_src, min_ofs) &
-   QAT_64_BTYE_ALIGN_MASK;
-
-   if (unlikely((rte_pktmbuf_mtophys(op->sym->m_src)
-   - rte_pktmbuf_headroom(op->sym->m_src)) > buf_start)) {
-   /* alignment has pushed addr ahead of start of mbuf
-* so revert and take the performance hit
+   if (unlikely(op->sym->m_dst != NULL)) {
+   /* Out-of-place operation (OOP)
+* Don't align DMA start. DMA the minimum data-set
+* so as not to overwrite data in dest buffer
 */
-   buf_start = rte_pktmbuf_mtophys(op->sym->m_src);
+   src_buf_start =
+   rte_pktmbuf_mtophys_offset(op->sym->m_src, min_ofs);
+   dst_buf_start =
+   rte_pktmbuf_mtophys_offset(op->sym->m_dst, min_ofs);
+   } else {
+   /* In-place operation
+* Start DMA at nearest aligned address below min_ofs
+*/
+   src_buf_start =
+   rte_pktmbuf_mtophys_offset(op->sym->m_src, min_ofs)
+   & QAT_64_BTYE_ALIGN_MASK;
+
+   if (unlikely((rte_pktmbuf_mtophys(op->sym->m_src) -
+   rte_pktmbuf_headroom(op->sym->m_src))
+   > src_buf_start)) {
+   /* alignment has pushed addr ahead of start of mbuf
+* so revert and take the performance hit
+*/
+   src_buf_start =
+   rte_pktmbuf_mtophys_offset(op->sym->m_src,
+   min_ofs);
+   }
+   dst_buf_start = src_buf_start;
}

-   qat_req->comn_mid.dest_data_addr =
-   qat_req->comn_mid.src_data_addr = buf_start;
-
if (do_cipher) {
cipher_param->cipher_offset =
-   (uint32_t)rte_pktmbuf_mtophys_offset(
-   op->sym->m_src, cipher_ofs) - buf_start;
+   (uint32_t)rte_pktmbuf_mtophys_offset(
+   op->sym->m_src, cipher_ofs) - src_buf_start;
cipher_param->cipher_length = cipher_len;
} else {
cipher_param->cipher_offset = 0;
@@ -1105,7 +1118,7 @@ qat_write_hw_desc_entry(struct rte_crypto_op *op, uint8_t *out_msg)
}
if (do_auth) {
auth_param->auth_off = (uint32_t)rte_pktmbuf_mtophys_offset(
-   op->sym->m_src, auth_ofs) - buf_start;
+   op->sym->m_src, auth_ofs) - src_buf_start;
auth_param->auth_len = auth_len;
} else {
auth_param->auth_off = 0;
@@ -1118,21 +1131,8 @@ qat_write_hw_desc_entry(struct rte_crypto_op *op, uint8_t *out_msg)
(cipher_param->cipher_offset + cipher_param->cipher_length)
: (auth_param->auth_off + auth_param->auth_len);

-
-   /* out-of-place operation (OOP) */
-   if (unlikely(op->sym->m_dst != NULL)) {
-
-   if (do_auth)
-   qat_req->comn_mid.dest_data_addr =
-   rte_pktmbuf_mtophys_offset(op->sym->m_dst,
-   auth_ofs)
-

[dpdk-dev] [RFC 2/9] ethdev: move queue id check in generic layer

2016-11-24 Thread Ferruh Yigit
On 11/24/2016 9:54 AM, Olivier Matz wrote:
> The check of queue_id is done in all drivers implementing
> rte_eth_rx_queue_count(). Factorize this check in the generic function.
> 
> Note that the nfp driver was doing the check differently, which could
> induce crashes if the queue index was too big.
> 
> By the way, also move the is_supported test before the port valid and
> queue valid test.
> 
> PR=52423
> Signed-off-by: Olivier Matz 
> Acked-by: Ivan Boule 
> ---

<...>

> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index c3edc23..9551cfd 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -2693,7 +2693,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>   *  The queue id on the specific port.
>   * @return
>   *  The number of used descriptors in the specific queue, or:
> - * (-EINVAL) if *port_id* is invalid
> + * (-EINVAL) if *port_id* or *queue_id* is invalid
>   * (-ENOTSUP) if the device does not support this function
>   */
>  static inline int
> @@ -2701,8 +2701,10 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t 
> queue_id)
>  {
>   struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>  
> - RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);

Doing port validity check before accessing dev->dev_ops->rx_queue_count
can be good idea.

What about validating port_id even before accessing
rte_eth_devices[port_id]?

> + if (queue_id >= dev->data->nb_rx_queues)
> + return -EINVAL;
>  
>   return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
>  }
> 



[dpdk-dev] [RFC 9/9] net/e1000: add handler for tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/e1000_ethdev.h |  5 +++-
 drivers/net/e1000/em_ethdev.c|  1 +
 drivers/net/e1000/em_rxtx.c  | 51 
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000/e1000_ethdev.h b/drivers/net/e1000/e1000_ethdev.h
index ad9ddaf..8945916 100644
--- a/drivers/net/e1000/e1000_ethdev.h
+++ b/drivers/net/e1000/e1000_ethdev.h
@@ -364,7 +364,10 @@ int eth_em_rx_queue_setup(struct rte_eth_dev *dev, uint16_t rx_queue_id,
struct rte_mempool *mb_pool);

 uint32_t eth_em_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+  uint16_t rx_queue_id);
+
+uint32_t eth_em_tx_queue_count(struct rte_eth_dev *dev,
+  uint16_t tx_queue_id);

 int eth_em_rx_descriptor_done(void *rx_queue, uint16_t offset);

diff --git a/drivers/net/e1000/em_ethdev.c b/drivers/net/e1000/em_ethdev.c
index 866a5cf..7fe5e3b 100644
--- a/drivers/net/e1000/em_ethdev.c
+++ b/drivers/net/e1000/em_ethdev.c
@@ -190,6 +190,7 @@ static const struct eth_dev_ops eth_em_ops = {
.rx_queue_setup   = eth_em_rx_queue_setup,
.rx_queue_release = eth_em_rx_queue_release,
.rx_queue_count   = eth_em_rx_queue_count,
+   .tx_queue_count   = eth_em_tx_queue_count,
.rx_descriptor_done   = eth_em_rx_descriptor_done,
.tx_queue_setup   = eth_em_tx_queue_setup,
.tx_queue_release = eth_em_tx_queue_release,
diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index a469fd7..8afcfda 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1432,6 +1432,57 @@ eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+eth_em_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   volatile uint8_t *status;
+   struct em_tx_queue *txq;
+   int32_t offset, interval, idx, resolution;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* check if ring empty */
+   idx = txq->tx_tail - 1;
+   if (idx < 0)
+   idx += txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (*status & E1000_TXD_STAT_DD)
+   return 0;
+
+   /* check if ring full */
+   idx = txq->tx_tail + 1;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (!(*status & E1000_TXD_STAT_DD))
+   return txq->nb_tx_desc;
+
+   /* decrease the precision if ring is large */
+   if (txq->nb_tx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* use a binary search */
+   offset = txq->nb_tx_desc >> 1;
+   interval = offset;
+
+   do {
+   idx = txq->tx_tail + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   interval >>= 1;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (*status & E1000_TXD_STAT_DD)
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return txq->nb_tx_desc - offset;
+}
+
 int
 eth_em_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
-- 
2.8.1



[dpdk-dev] [RFC 8/9] net/e1000: optimize rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/em_rxtx.c | 55 +
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index c1c724b..a469fd7 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1385,24 +1385,51 @@ eth_em_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define EM_RXQ_SCAN_INTERVAL 4
-   volatile struct e1000_rx_desc *rxdp;
+   volatile uint8_t *status;
struct em_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->status & E1000_RXD_STAT_DD)) {
-   desc += EM_RXQ_SCAN_INTERVAL;
-   rxdp += EM_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].status;
+   if (!(*status & E1000_RXD_STAT_DD))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+   idx = rxq->rx_tail - rxq->nb_rx_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].status;
+   if (*status & E1000_RXD_STAT_DD)
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - rxq->nb_rx_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].status;
+   if (*status & E1000_RXD_STAT_DD)
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 7/9] net/igb: add handler for tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/e1000_ethdev.h |  5 +++-
 drivers/net/e1000/igb_ethdev.c   |  1 +
 drivers/net/e1000/igb_rxtx.c | 51 
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000/e1000_ethdev.h b/drivers/net/e1000/e1000_ethdev.h
index 6c25c8d..ad9ddaf 100644
--- a/drivers/net/e1000/e1000_ethdev.h
+++ b/drivers/net/e1000/e1000_ethdev.h
@@ -300,7 +300,10 @@ int eth_igb_rx_queue_setup(struct rte_eth_dev *dev, uint16_t rx_queue_id,
struct rte_mempool *mb_pool);

 uint32_t eth_igb_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+   uint16_t rx_queue_id);
+
+uint32_t eth_igb_tx_queue_count(struct rte_eth_dev *dev,
+   uint16_t tx_queue_id);

 int eth_igb_rx_descriptor_done(void *rx_queue, uint16_t offset);

diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index 08f2a68..a54d374 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -399,6 +399,7 @@ static const struct eth_dev_ops eth_igb_ops = {
.rx_queue_intr_disable = eth_igb_rx_queue_intr_disable,
.rx_queue_release = eth_igb_rx_queue_release,
.rx_queue_count   = eth_igb_rx_queue_count,
+   .tx_queue_count   = eth_igb_tx_queue_count,
.rx_descriptor_done   = eth_igb_rx_descriptor_done,
.tx_queue_setup   = eth_igb_tx_queue_setup,
.tx_queue_release = eth_igb_tx_queue_release,
diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index 6b0111f..2ff2417 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1554,6 +1554,57 @@ eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+eth_igb_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   volatile uint32_t *status;
+   struct igb_tx_queue *txq;
+   int32_t offset, interval, idx, resolution;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* check if ring empty */
+   idx = txq->tx_tail - 1;
+   if (idx < 0)
+   idx += txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].wb.status;
+   if (*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD))
+   return 0;
+
+   /* check if ring full */
+   idx = txq->tx_tail + 1;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].wb.status;
+   if (!(*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD)))
+   return txq->nb_tx_desc;
+
+   /* decrease the precision if ring is large */
+   if (txq->nb_tx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* use a binary search */
+   interval = txq->nb_tx_desc >> 1;
+   offset = interval;
+
+   do {
+   interval >>= 1;
+   idx = txq->tx_tail + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   status = &txq->tx_ring[idx].wb.status;
+   if (*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return txq->nb_tx_desc - offset;
+}
+
 int
 eth_igb_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
-- 
2.8.1



[dpdk-dev] [RFC 6/9] net/igb: optimize rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/igb_rxtx.c | 55 +---
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index e9aa356..6b0111f 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1507,24 +1507,51 @@ eth_igb_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define IGB_RXQ_SCAN_INTERVAL 4
-   volatile union e1000_adv_rx_desc *rxdp;
+   volatile uint32_t *status;
struct igb_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->wb.upper.status_error & E1000_RXD_STAT_DD)) {
-   desc += IGB_RXQ_SCAN_INTERVAL;
-   rxdp += IGB_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (!(*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD)))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+   idx = rxq->rx_tail - rxq->nb_rx_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD))
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - rxq->nb_rx_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 5/9] net/ixgbe: add handler for Tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |  4 ++-
 drivers/net/ixgbe/ixgbe_rxtx.c   | 57 
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index baffc71..0ba098a 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -553,6 +553,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.rx_queue_intr_disable = ixgbe_dev_rx_queue_intr_disable,
.rx_queue_release = ixgbe_dev_rx_queue_release,
.rx_queue_count   = ixgbe_dev_rx_queue_count,
+   .tx_queue_count   = ixgbe_dev_tx_queue_count,
.rx_descriptor_done   = ixgbe_dev_rx_descriptor_done,
.tx_queue_setup   = ixgbe_dev_tx_queue_setup,
.tx_queue_release = ixgbe_dev_tx_queue_release,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 4ff6338..e060c3d 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -348,7 +348,9 @@ int  ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev, uint16_t tx_queue_id,
const struct rte_eth_txconf *tx_conf);

 uint32_t ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+ uint16_t rx_queue_id);
+uint32_t ixgbe_dev_tx_queue_count(struct rte_eth_dev *dev,
+ uint16_t tx_queue_id);

 int ixgbe_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
 int ixgbevf_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 07509b4..5bf6b1a 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2437,6 +2437,7 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,

txq->nb_tx_desc = nb_desc;
txq->tx_rs_thresh = tx_rs_thresh;
+   txq->tx_rs_thresh_div = nb_desc / tx_rs_thresh;
txq->tx_free_thresh = tx_free_thresh;
txq->pthresh = tx_conf->tx_thresh.pthresh;
txq->hthresh = tx_conf->tx_thresh.hthresh;
@@ -2906,6 +2907,62 @@ ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+ixgbe_dev_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   struct ixgbe_tx_queue *txq;
+   uint32_t status;
+   int32_t offset, interval, idx = 0;
+   int32_t max_offset, used_desc;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* if DD on next threshold desc is not set, assume used packets
+* are pending.
+*/
+   status = txq->tx_ring[txq->tx_next_dd].wb.status;
+   if (!(status & rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD)))
+   return txq->nb_tx_desc - txq->nb_tx_free - 1;
+
+   /* browse DD bits between tail starting from tx_next_dd: we have
+* to be careful since DD bits are only set every tx_rs_thresh
+* descriptor.
+*/
+   interval = txq->tx_rs_thresh_div >> 1;
+   offset = interval * txq->tx_rs_thresh;
+
+   /* don't go beyond tail */
+   max_offset = txq->tx_tail - txq->tx_next_dd;
+   if (max_offset < 0)
+   max_offset += txq->nb_tx_desc;
+
+   do {
+   interval >>= 1;
+
+   if (offset >= max_offset) {
+   offset -= (interval * txq->tx_rs_thresh);
+   continue;
+   }
+
+   idx = txq->tx_next_dd + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   status = txq->tx_ring[idx].wb.status;
+   if (status & rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD))
+   offset += (interval * txq->tx_rs_thresh);
+   else
+   offset -= (interval * txq->tx_rs_thresh);
+   } while (interval > 0);
+
+   /* idx is now the index of the head */
+   used_desc = txq->tx_tail - idx;
+   if (used_desc < 0)
+   used_desc += txq->nb_tx_desc;
+
+   return used_desc;
+}
+
 int
 ixgbe_dev_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 2608b36..f69b5de 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -221,6 +221,8 @@ struct ixgbe_tx_queue {
	uint16_t            tx_free_thresh;
	/** Number of TX descriptors to use before RS bit is set. */
	uint16_t            tx_rs_thresh;
+	/** Number of TX descriptors divided by tx_rs_thresh. */
+	uint16_t            tx_rs_thresh_div;
	/** Number of TX descriptors used since RS bit was set. */

[dpdk-dev] [RFC 4/9] net/ixgbe: optimize Rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 63 --
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 1a8ea5f..07509b4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2852,25 +2852,58 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define IXGBE_RXQ_SCAN_INTERVAL 4
-   volatile union ixgbe_adv_rx_desc *rxdp;
+   volatile uint32_t *status;
struct ixgbe_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, nb_hold, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->wb.upper.status_error &
-   rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))) {
-   desc += IXGBE_RXQ_SCAN_INTERVAL;
-   rxdp += IXGBE_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (!(*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+#ifdef RTE_IXGBE_INC_VECTOR
+   if (rxq->rx_using_sse)
+   nb_hold = rxq->rxrearm_nb;
+   else
+#endif
+   nb_hold = rxq->nb_rx_hold;
+
+   idx = rxq->rx_tail - nb_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - nb_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 3/9] ethdev: add handler for Tx queue descriptor count

2016-11-24 Thread Olivier Matz
Implement the Tx counterpart of rte_eth_rx_queue_count() in ethdev API,
which returns the number of used descriptors in a Tx queue.

It can help an application to detect that a link is too slow and cannot
send at the desired rate. In this case, the application can decide to
decrease the rate, or drop the packets with the lowest priority.
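
As an illustration, an application could build such a policy on top of
the new function like this (sketch only: TX_RING_SIZE and
drop_low_priority() are hypothetical application-side definitions, not
part of this patch):

#include <rte_ethdev.h>

#define TX_RING_SIZE 512

/* hypothetical helper: drops the lowest-priority packets and returns
 * the new packet count */
extern uint16_t drop_low_priority(struct rte_mbuf **pkts, uint16_t n);

static uint16_t
send_adaptive(uint8_t port, uint16_t queue,
	      struct rte_mbuf **pkts, uint16_t n)
{
	int used = rte_eth_tx_queue_count(port, queue);

	/* ring more than 3/4 full: the link cannot keep up, shed the
	 * lowest-priority packets before transmitting */
	if (used >= 0 && used > (3 * TX_RING_SIZE) / 4)
		n = drop_low_priority(pkts, n);

	return rte_eth_tx_burst(port, queue, pkts, n);
}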

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 lib/librte_ether/rte_ethdev.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9551cfd..8244807 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1147,6 +1147,10 @@ typedef uint32_t (*eth_rx_queue_count_t)(struct rte_eth_dev *dev,
 uint16_t rx_queue_id);
 /**< @internal Get number of used descriptors on a receive queue. */

+typedef uint32_t (*eth_tx_queue_count_t)(struct rte_eth_dev *dev,
+uint16_t tx_queue_id);
+/**< @internal Get number of used descriptors on a transmit queue. */
+
 typedef int (*eth_rx_descriptor_done_t)(void *rxq, uint16_t offset);
 /**< @internal Check DD bit of specific RX descriptor */

@@ -1461,6 +1465,8 @@ struct eth_dev_ops {
	eth_queue_release_t        rx_queue_release;/**< Release RX queue.*/
eth_rx_queue_count_t   rx_queue_count;
/**< Get the number of used RX descriptors. */
+   eth_tx_queue_count_t   tx_queue_count;
+   /**< Get the number of used TX descriptors. */
eth_rx_descriptor_done_t   rx_descriptor_done;  /**< Check rxd DD bit */
/**< Enable Rx queue interrupt. */
eth_rx_enable_intr_t   rx_queue_intr_enable;
@@ -2710,6 +2716,31 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 }

 /**
+ * Get the number of used descriptors of a tx queue
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The queue id on the specific port.
+ * @return
+ *  - number of used descriptors if positive or zero
+ *  - (-EINVAL) if *port_id* or *queue_id* is invalid.
+ *  - (-ENOTSUP) if the device does not support this function
+ */
+static inline int
+rte_eth_tx_queue_count(uint8_t port_id, uint16_t queue_id)
+{
+   struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_queue_count, -ENOTSUP);
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+   if (queue_id >= dev->data->nb_tx_queues)
+   return -EINVAL;
+
+   return (*dev->dev_ops->tx_queue_count)(dev, queue_id);
+}
+
+/**
  * Check if the DD bit of the specific RX descriptor in the queue has been set
  *
  * @param port_id
-- 
2.8.1



[dpdk-dev] [RFC 2/9] ethdev: move queue id check in generic layer

2016-11-24 Thread Olivier Matz
The check of queue_id is done in all drivers implementing
rte_eth_rx_queue_count(). Factorize this check in the generic function.

Note that the nfp driver was doing the check differently, which could
induce crashes if the queue index was too big.

By the way, also move the is_supported test before the port valid and
queue valid test.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/em_rxtx.c| 5 -
 drivers/net/e1000/igb_rxtx.c   | 5 -
 drivers/net/i40e/i40e_rxtx.c   | 5 -
 drivers/net/ixgbe/ixgbe_rxtx.c | 5 -
 drivers/net/nfp/nfp_net.c  | 6 --
 lib/librte_ether/rte_ethdev.h  | 6 --
 6 files changed, 4 insertions(+), 28 deletions(-)

diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index 41f51c0..c1c724b 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1390,11 +1390,6 @@ eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct em_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(DEBUG, "Invalid RX queue_id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index dbd37ac..e9aa356 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1512,11 +1512,6 @@ eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct igb_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(ERR, "Invalid RX queue id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 7ae7d9f..79a72f0 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1793,11 +1793,6 @@ i40e_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct i40e_rx_queue *rxq;
uint16_t desc = 0;

-   if (unlikely(rx_queue_id >= dev->data->nb_rx_queues)) {
-   PMD_DRV_LOG(ERR, "Invalid RX queue id %u", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);
while ((desc < rxq->nb_rx_desc) &&
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b2d9f45..1a8ea5f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2857,11 +2857,6 @@ ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct ixgbe_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(ERR, "Invalid RX queue id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/nfp/nfp_net.c b/drivers/net/nfp/nfp_net.c
index e315dd8..f1d00fb 100644
--- a/drivers/net/nfp/nfp_net.c
+++ b/drivers/net/nfp/nfp_net.c
@@ -1084,12 +1084,6 @@ nfp_net_rx_queue_count(struct rte_eth_dev *dev, uint16_t queue_idx)
uint32_t count;

rxq = (struct nfp_net_rxq *)dev->data->rx_queues[queue_idx];
-
-   if (rxq == NULL) {
-   PMD_INIT_LOG(ERR, "Bad queue: %u\n", queue_idx);
-   return 0;
-   }
-
idx = rxq->rd_p % rxq->rx_count;
rxds = &rxq->rxds[idx];

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index c3edc23..9551cfd 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2693,7 +2693,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
  *  The queue id on the specific port.
  * @return
  *  The number of used descriptors in the specific queue, or:
- * (-EINVAL) if *port_id* is invalid
+ * (-EINVAL) if *port_id* or *queue_id* is invalid
  * (-ENOTSUP) if the device does not support this function
  */
 static inline int
@@ -2701,8 +2701,10 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 {
struct rte_eth_dev *dev = &rte_eth_devices[port_id];

-   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+   if (queue_id >= dev->data->nb_rx_queues)
+   return -EINVAL;

return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
 }
-- 
2.8.1



[dpdk-dev] [RFC 1/9] ethdev: clarify api comments of rx queue count

2016-11-24 Thread Olivier Matz
The API comments are not consistent between each other.

The function rte_eth_rx_queue_count() returns the number of used
descriptors on a receive queue.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 lib/librte_ether/rte_ethdev.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9678179..c3edc23 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1145,7 +1145,7 @@ typedef void (*eth_queue_release_t)(void *queue);

 typedef uint32_t (*eth_rx_queue_count_t)(struct rte_eth_dev *dev,
 uint16_t rx_queue_id);
-/**< @internal Get number of available descriptors on a receive queue of an Ethernet device. */
+/**< @internal Get number of used descriptors on a receive queue. */

 typedef int (*eth_rx_descriptor_done_t)(void *rxq, uint16_t offset);
 /**< @internal Check DD bit of specific RX descriptor */
@@ -1459,7 +1459,8 @@ struct eth_dev_ops {
eth_queue_stop_t   tx_queue_stop;/**< Stop TX for a queue.*/
eth_rx_queue_setup_t   rx_queue_setup;/**< Set up device RX queue.*/
	eth_queue_release_t        rx_queue_release;/**< Release RX queue.*/
-   eth_rx_queue_count_t   rx_queue_count; /**< Get Rx queue count. */
+   eth_rx_queue_count_t   rx_queue_count;
+   /**< Get the number of used RX descriptors. */
eth_rx_descriptor_done_t   rx_descriptor_done;  /**< Check rxd DD bit */
/**< Enable Rx queue interrupt. */
eth_rx_enable_intr_t   rx_queue_intr_enable;
@@ -2684,7 +2685,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 }

 /**
- * Get the number of used descriptors in a specific queue
+ * Get the number of used descriptors of a rx queue
  *
  * @param port_id
  *  The port identifier of the Ethernet device.
@@ -2699,9 +2700,11 @@ static inline int
 rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 {
struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
-return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
+
+   return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
 }

 /**
-- 
2.8.1



[dpdk-dev] [RFC 0/9] get Rx and Tx used descriptors

2016-11-24 Thread Olivier Matz
This RFC patchset introduces a new ethdev API function
rte_eth_tx_queue_count() which is the tx counterpart of
rte_eth_rx_queue_count(). It implements this API on some
Intel drivers for reference, and it also optimizes the
implementation of rte_eth_rx_queue_count().

The usage of these functions can be:
- on Rx, anticipate that the cpu is not fast enough to process
  all incoming packets, and take dispositions to solve the
  problem (add more cpus, drop specific packets, ...)
- on Tx, detect that the link is overloaded, and take dispositions
  to solve the problem (notify flow control, drop specific
  packets)

The tests I've done (instrumenting testpmd) show that browsing
the descriptors linearly is slow when the ring size increases.
Accessing the head/tail registers through pci is also slow
whatever the size of the ring. A binary search is a good compromise
that gives quite good performance whatever the size of the ring.

Remaining question are about:
- should we keep this name? I'd say "queue_count" is quite confusing,
  and I would expect it returns the number of queues, not the
  number of used descriptors
- how shall we count the used descriptors, knowing that the driver
  can hold some to free them by bulk, which reduces the effective
  size of the ring

I would be happy to have some feedback about this RFC before
I send it as a patch.

Here are some helpers to understand the code more easily (I sometimes
take shortcuts, e.g. treating 1 pkt == 1 desc).

RX side
===

- sw advances the tail pointer
- hw advances the head pointer
- the software populates the ring with descs to buffers that are filled
  when the hw receives packets
- head == tail means there is no available buffer for hw to receive a packet
- head points to the next descriptor to be filled
- hw owns all descriptors between [head...tail]
- when a packet is written in a descriptor, the DD (descriptor done)
  bit is set, and the head is advanced
- the driver never reads the head (needs a pci transaction), instead it
  monitors the DD bit of next descriptor
- when a filled packet is retrieved by the software, the descriptor has
  to be populated with a new empty buffer. This is not done for each
  packet: the driver holds them and waits until it has many descriptors
  to populate, and do it by bulk.
  (by the way, it means that the effective size of a queue of size=N is
  lower than N since these descriptors cannot be used by the hw)

rxq->rx_tail: current value of the sw tail (the idx of the next packet to
  be received). The real tail (hw) can be different since the driver can
  hold descriptors.
rxq->nb_rx_hold: number of held descriptors
rxq->rxrearm_nb: same, but for vector driver
rxq->rx_free_thresh: when the number of held descriptors reaches this threshold,
  descriptors are populated with buffers to be filled, and sw advances the tail

Example with a ring size of 64:

|                                                                |
|                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx          |
|                    x buffers filled with data by hw x          |
|                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx          |
|                                                                |
                     ^hw_tail=20
                                    ^sw_tail=35
                                                       ^hw_head=54
                     <-- nb_hold -->
                                    <-pkts in hw queue->

The descriptors marked with 'x' have their DD bit set; the others
  (' ') reference empty buffers.
The next packet to be received by software is at index 35.
The software holds 15 descriptors that will be rearmed later.
There are 19 packets waiting in the hw queue.
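
(As a cross-check with the figures above: nb_hold = sw_tail - hw_tail =
35 - 20 = 15 held descriptors, and pkts_in_hw_queue = hw_head - sw_tail =
54 - 35 = 19 packets, all arithmetic being modulo the ring size of 64.)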

We want the function rx_queue_count() to return the number of
"used" descriptors. The first question is: what does that mean
exactly? Should it be pkts_in_hw_queue or pkts_in_hw_queue + nb_hold?
The current implementation returns pkts_in_hw_queue, but including
nb_hold could be useful to know how many descriptors are really
free (= size - used).

The current implementation checks the DD bit starting from sw_tail,
every 4 packets. It can be quite slow for large rings. An alternative
is to read the head register, but it's also slow.

This patchset optimizes rx_queue_count() by doing a binary
search (checking for DD) between sw_tail and hw_tail, instead of a
linear search.
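
For reference, the search can be sketched in device-independent C as
below; desc_done() is a placeholder for "DD bit set in descriptor idx",
and the ring-full check done in the actual patches is omitted:

#include <stdint.h>

/* placeholder: returns non-zero if descriptor idx has its DD bit set */
extern int desc_done(uint32_t idx);

/* Number of used descriptors, accurate to "resolution" entries
 * (4 or 16 in the patches), found by binary search from the sw tail. */
static uint32_t
used_desc_count(uint32_t tail, uint32_t nb_desc, uint32_t resolution)
{
	uint32_t interval, offset, idx;

	if (!desc_done(tail))	/* ring empty */
		return 0;

	interval = nb_desc >> 1;
	offset = interval;

	do {
		idx = (tail + offset) % nb_desc;
		interval >>= 1;
		if (desc_done(idx))
			offset += interval;
		else
			offset -= interval;
	} while (interval >= resolution);

	return offset;
}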

TX side
===

- sw advances the tail pointer
- hw advances the head pointer
- the software populates the ring with full buffers to be sent by
  the hw
- head points to the in-progress descriptor.
- sw writes new descriptors at tail
- head == tail means that the transmit queue is empty
- when the hw has processed a descriptor, it sets the DD bit if
  the descriptor has the RS (report status) bit.
- the driver never reads the head (needs a pci transaction), instead it
  monitors the DD bit of a descriptor that has the RS bit

txq->tx_tail: sw value for tail register
tx

[dpdk-dev] [RFC 1/9] ethdev: clarify api comments of rx queue count

2016-11-24 Thread Ferruh Yigit
On 11/24/2016 9:54 AM, Olivier Matz wrote:
> The API comments are not consistent between each other.
> 
> The function rte_eth_rx_queue_count() returns the number of used
> descriptors on a receive queue.
> 
> PR=52423

What is this marker?

> Signed-off-by: Olivier Matz 
> Acked-by: Ivan Boule 

Acked-by: Ferruh Yigit 



[dpdk-dev] Proposal for a new Committer model

2016-11-24 Thread Thomas Monjalon
2016-11-23 15:13, Neil Horman:
> Can either you or thomas provide some detail as to how you are doing patch
> management between trees (details of the commands you use are what I would be
> interested in). It sounds to me like there may be some optimization to be made
> here before we even make changes to the subtrees.

Until now, we preferred avoiding merge commits.
That's why I rebase sub-trees to integrate them in the mainline.
As Ferruh explained, it does not require more work because sub-trees are
regularly rebased anyway.
And I use a script to keep original committer name and date.

Hope it's clear now


[dpdk-dev] [PATCH 5/5] net/virtio: fix Tso when mbuf is shared

2016-11-24 Thread Olivier Matz
With virtio, doing TSO requires modifying the network
packet data:
- the DPDK API requires the L4 checksum to be set to an
  Intel-NIC-like pseudo header checksum that does
  not include the IP length
- the virtio peer expects the L4 checksum to be
  a standard pseudo header checksum.

This is a problem with shared packets, because they
should not be modified.

This patch fixes this issue by copying the headers into
a linear buffer in that case. This buffer is located in
the virtio_tx_region, at the same place where the
virtio header is stored.

The size of this buffer is set to 256, which should
be enough in all cases:
  sizeof(ethernet) + sizeof(vlan) * 2 + sizeof(ip6) +
    sizeof(ip6-ext) + sizeof(tcp) + sizeof(tcp-opts)
  = 14 + 8 + 40 + sizeof(ip6-ext) + 40 + sizeof(tcp-opts)
  = 102 + sizeof(ip6-ext) + sizeof(tcp-opts)
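
The adjustment itself is plain one's-complement arithmetic; as a
standalone sketch of what the patch does to the TCP checksum field:

#include <stdint.h>

/* Add the (big endian) IP payload length into a pseudo-header checksum
 * that was computed without it, folding the carry (sketch of the fixup
 * done by virtio_tso_fix_cksum() below). */
static uint16_t
phdr_cksum_add_paylen(uint16_t cksum, uint16_t ip_paylen)
{
	uint32_t tmp = cksum;

	tmp += ip_paylen;
	tmp = (tmp & 0xffff) + (tmp >> 16);
	return (uint16_t)tmp;
}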

Fixes: 696573046e9e ("net/virtio: support TSO")

Signed-off-by: Olivier Matz 
---
 drivers/net/virtio/virtio_rxtx.c | 119 +++
 drivers/net/virtio/virtqueue.h   |   2 +
 2 files changed, 85 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 22d97a4..577c775 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -211,43 +211,73 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)

 /* When doing TSO, the IP length is not included in the pseudo header
  * checksum of the packet given to the PMD, but for virtio it is
- * expected.
+ * expected. Fix the mbuf or a copy if the mbuf is shared.
  */
-static void
-virtio_tso_fix_cksum(struct rte_mbuf *m)
+static unsigned int
+virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
 {
-   /* common case: header is not fragmented */
-   if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
-   m->l4_len)) {
-   struct ipv4_hdr *iph;
-   struct ipv6_hdr *ip6h;
-   struct tcp_hdr *th;
-   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
-   uint32_t tmp;
-
-   iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
-   th = RTE_PTR_ADD(iph, m->l3_len);
-   if ((iph->version_ihl >> 4) == 4) {
-   iph->hdr_checksum = 0;
-   iph->hdr_checksum = rte_ipv4_cksum(iph);
-   ip_len = iph->total_length;
-   ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
-   m->l3_len);
-   } else {
-   ip6h = (struct ipv6_hdr *)iph;
-   ip_paylen = ip6h->payload_len;
+   struct ipv4_hdr *iph, iph_copy;
+   struct ipv6_hdr *ip6h = NULL, ip6h_copy;
+   struct tcp_hdr *th, th_copy;
+   size_t hdrlen = m->l2_len + m->l3_len + m->l4_len;
+   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
+   uint32_t tmp;
+   int shared = 0;
+
+   /* mbuf is read-only, we need to copy the headers in a linear buffer */
+   if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+   shared = 1;
+
+   /* network headers are too big, there's nothing we can do */
+   if (hdrlen > hdr_sz)
+   return 0;
+
+   rte_pktmbuf_read_copy(m, 0, hdrlen, hdr);
+   iph = (struct ipv4_hdr *)(hdr + m->l2_len);
+   ip6h = (struct ipv6_hdr *)(hdr + m->l2_len);
+   th = (struct tcp_hdr *)(hdr + m->l2_len + m->l3_len);
+   } else {
+   iph = rte_pktmbuf_read(m, m->l2_len, sizeof(*iph), &iph_copy);
+   th = rte_pktmbuf_read(m, m->l2_len + m->l3_len, sizeof(*th),
+   &th_copy);
+   }
+
+   if ((iph->version_ihl >> 4) == 4) {
+   iph->hdr_checksum = 0;
+   iph->hdr_checksum = rte_ipv4_cksum(iph);
+   ip_len = iph->total_length;
+   ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+   m->l3_len);
+   } else {
+   if (!shared) {
+   ip6h = rte_pktmbuf_read(m, m->l2_len, sizeof(*ip6h),
+   &ip6h_copy);
}
+   ip_paylen = ip6h->payload_len;
+   }

-   /* calculate the new phdr checksum not including ip_paylen */
-   prev_cksum = th->cksum;
-   tmp = prev_cksum;
-   tmp += ip_paylen;
-   tmp = (tmp & 0xffff) + (tmp >> 16);
-   new_cksum = tmp;
+   /* calculate the new phdr checksum not including ip_paylen */
+   prev_cksum = th->cksum;
+   tmp = prev_cksum;
+   tmp += ip_paylen;
+   tmp = (tmp & 0xffff) + (tmp >> 16);
+   new_cksum = tmp;

-   /* replace it in the packet */
-   th->cksum = new_cksum;
-   }
+   /* replace it in the header */
+   th->cksum = new_cksum;

[dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf

2016-11-24 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c   |  7 +++
 lib/librte_mbuf/rte_mbuf.h | 32 +++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 5f1bc5d..73fd7df 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -451,6 +451,13 @@ testclone_testupdate_testdetach(void)
GOTO_FAIL("invalid data");
if (data != check_data)
GOTO_FAIL("data should have been copied");
+   if (rte_pktmbuf_read_copy(m2, 0, sizeof(uint32_t), check_data) < 0)
+   GOTO_FAIL("cannot copy data");
+   if (check_data[0] != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data != check_data)
+   GOTO_FAIL("data should have been copied");
+
/* free mbuf */
rte_pktmbuf_free(m);
m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index e898d25..edae89f 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1643,7 +1643,7 @@ static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
 }

 /**
- * @internal used by rte_pktmbuf_read().
+ * @internal used by rte_pktmbuf_read() and rte_pktmbuf_read_copy().
  */
 void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf);
@@ -1728,6 +1728,36 @@ static inline int rte_pktmbuf_write(const struct rte_mbuf *m,
 }

 /**
+ * Copy data from a mbuf into a linear buffer
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset of the data in the mbuf.
+ * @param len
+ *   The amount of bytes to copy.
+ * @param buf
+ *   The buffer where data is copied, it should be at least
+ *   as large as len.
+ * @return
+ *   - (0) on success
+ *   - (-1) on error: mbuf is too small
+ */
+static inline int rte_pktmbuf_read_copy(const struct rte_mbuf *m,
+   uint32_t off, uint32_t len, void *buf)
+{
+   if (likely(off + len <= rte_pktmbuf_data_len(m))) {
+   rte_memcpy(buf, rte_pktmbuf_mtod_offset(m, char *, off), len);
+   return 0;
+   }
+
+   if (__rte_pktmbuf_read(m, off, len, buf) == NULL)
+   return -1;
+
+   return 0;
+}
+
+/**
  * Chain an mbuf to another, thereby creating a segmented packet.
  *
  * Note: The implementation will do a linear walk over the segments to find
-- 
2.8.1



[dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain

2016-11-24 Thread Olivier Matz
Introduce a new helper to write data in a chain of mbufs,
spreading it in the segments.
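
For illustration, a typical use of this helper together with the shared
check from the previous patch could look like the following sketch (not
part of this patch):

#include <rte_mbuf.h>

/* Overwrite the first len bytes of a possibly segmented packet from a
 * linear buffer, refusing to touch shared (read-only) data. */
static int
patch_headers(struct rte_mbuf *m, const void *hdr, uint32_t len)
{
	if (rte_pktmbuf_data_is_shared(m, 0, len))
		return -1;	/* read-only: the caller must copy instead */

	return rte_pktmbuf_write(m, 0, len, hdr);
}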

Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c | 21 +++
 lib/librte_mbuf/rte_mbuf.c   | 44 +++
 lib/librte_mbuf/rte_mbuf.h   | 50 
 lib/librte_mbuf/rte_mbuf_version.map |  6 +
 4 files changed, 121 insertions(+)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 7656a4d..5f1bc5d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -335,6 +335,10 @@ testclone_testupdate_testdetach(void)
struct rte_mbuf *clone2 = NULL;
struct rte_mbuf *m2 = NULL;
unaligned_uint32_t *data;
+   uint32_t magic = MAGIC_DATA;
+   uint32_t check_data[2];
+
+   memset(check_data, 0, sizeof(check_data));

/* alloc a mbuf */
m = rte_pktmbuf_alloc(pktmbuf_pool);
@@ -421,6 +425,8 @@ testclone_testupdate_testdetach(void)
if (m2 == NULL)
GOTO_FAIL("cannot allocate m2");
rte_pktmbuf_append(m2, sizeof(uint32_t));
+   if (rte_pktmbuf_write(m2, 0, sizeof(uint32_t), &magic) < 0)
+   GOTO_FAIL("cannot write data in m2");
rte_pktmbuf_chain(m2, clone);
clone = NULL;

@@ -430,6 +436,21 @@ testclone_testupdate_testdetach(void)
rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
GOTO_FAIL("m2 data should be marked as shared");

+   /* check data content */
+   data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t), check_data);
+   if (data == NULL)
+   GOTO_FAIL("cannot read data");
+   if (*data != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data == check_data)
+   GOTO_FAIL("data should not have been copied");
+   data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t) * 2, check_data);
+   if (data == NULL)
+   GOTO_FAIL("cannot read data");
+   if (data[0] != MAGIC_DATA || data[1] != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data != check_data)
+   GOTO_FAIL("data should have been copied");
/* free mbuf */
rte_pktmbuf_free(m);
m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index b31958e..ed56193 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -298,6 +298,50 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
return buf;
 }

+/* write len data bytes in a mbuf at specified offset (internal) */
+int
+__rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+   uint32_t len, const void *buf)
+{
+   const struct rte_mbuf *seg = m;
+   uint32_t buf_off = 0, copy_len;
+   char *dst;
+
+   if (off + len > rte_pktmbuf_pkt_len(m))
+   return -1;
+
+   while (off >= rte_pktmbuf_data_len(seg)) {
+   off -= rte_pktmbuf_data_len(seg);
+   seg = seg->next;
+   }
+
+   dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+   if (buf == dst)
+   return 0;
+
+   if (off + len <= rte_pktmbuf_data_len(seg)) {
+   RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+   rte_memcpy(dst, buf, len);
+   return 0;
+   }
+
+   /* copy data in several segments */
+   while (len > 0) {
+   RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+   copy_len = rte_pktmbuf_data_len(seg) - off;
+   if (copy_len > len)
+   copy_len = len;
+   dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+   rte_memcpy(dst, (const char *)buf + buf_off, copy_len);
+   off = 0;
+   buf_off += copy_len;
+   len -= copy_len;
+   seg = seg->next;
+   }
+
+   return 0;
+}
+
 /*
  * Get the name of a RX offload flag. Must be kept synchronized with flag
  * definitions in rte_mbuf.h.
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index cd77a56..e898d25 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1678,6 +1678,56 @@ static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
 }

 /**
+ * @internal used by rte_pktmbuf_write().
+ */
+int __rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+   uint32_t len, const void *buf);
+
+/**
+ * Write len data bytes in a mbuf at specified offset.
+ *
+ * If the mbuf is contiguous between off and off+len, rte_memcpy() is
+ * called. Otherwise, the data is split across the segments.
+ *
+ * The caller must ensure that all destination segments are writable
+ * (not shared).
+ *
+ * If the destination pointer in the mbuf is the same as the source
+ * buffer, the function does nothing and returns success.
+ *
+ * If the mbuf is too small, the function fails.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset of the data in the mbuf.
+ * @par

[dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared

2016-11-24 Thread Olivier Matz
Introduce 2 new helpers rte_pktmbuf_seg_is_shared() and
rte_pktmbuf_data_is_shared() to check if the packet data inside
a mbuf is shared (and shall not be modified).

To avoid a "discards const qualifier" error, add a const to the argument
of rte_mbuf_from_indirect().
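
As an example of the intended usage, a transmit path can decide between
patching a header in place and working on a private copy (illustrative
sketch only, assuming the header fits in the first segment):

#include <string.h>
#include <rte_mbuf.h>

/* Return a writable pointer to the first len bytes: either directly
 * into the mbuf when the data is private, or into the caller-provided
 * copy buffer when it is shared. */
static void *
get_writable_hdr(struct rte_mbuf *m, uint32_t len, void *copy)
{
	if (rte_pktmbuf_data_len(m) < len)
		return NULL;	/* header not contiguous: not handled here */

	if (rte_pktmbuf_data_is_shared(m, 0, len)) {
		/* shared: hand out a private copy to modify */
		memcpy(copy, rte_pktmbuf_mtod(m, void *), len);
		return copy;
	}

	return rte_pktmbuf_mtod(m, void *);	/* private: patch in place */
}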

Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c   | 34 +++---
 lib/librte_mbuf/rte_mbuf.h | 71 +-
 2 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index c0823ea..7656a4d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -333,6 +333,7 @@ testclone_testupdate_testdetach(void)
struct rte_mbuf *m = NULL;
struct rte_mbuf *clone = NULL;
struct rte_mbuf *clone2 = NULL;
+   struct rte_mbuf *m2 = NULL;
unaligned_uint32_t *data;

/* alloc a mbuf */
@@ -384,6 +385,11 @@ testclone_testupdate_testdetach(void)
if (*data != MAGIC_DATA)
GOTO_FAIL("invalid data in clone->next\n");

+   if (rte_pktmbuf_seg_is_shared(m) == 0)
+   GOTO_FAIL("m should be marked as shared\n");
+   if (rte_pktmbuf_seg_is_shared(clone) == 0)
+   GOTO_FAIL("clone should be marked as shared\n");
+
if (rte_mbuf_refcnt_read(m) != 2)
GOTO_FAIL("invalid refcnt in m\n");

@@ -410,14 +416,32 @@ testclone_testupdate_testdetach(void)
if (rte_mbuf_refcnt_read(m->next) != 3)
GOTO_FAIL("invalid refcnt in m->next\n");

+   /* prepend data to one of the clone */
+   m2 = rte_pktmbuf_alloc(pktmbuf_pool);
+   if (m2 == NULL)
+   GOTO_FAIL("cannot allocate m2");
+   rte_pktmbuf_append(m2, sizeof(uint32_t));
+   rte_pktmbuf_chain(m2, clone);
+   clone = NULL;
+
+   if (rte_pktmbuf_data_is_shared(m2, 0, sizeof(uint32_t)))
+   GOTO_FAIL("m2 headers should not be marked as shared");
+   if (rte_pktmbuf_data_is_shared(m2, sizeof(uint32_t),
+   rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
+   GOTO_FAIL("m2 data should be marked as shared");
+
/* free mbuf */
rte_pktmbuf_free(m);
-   rte_pktmbuf_free(clone);
-   rte_pktmbuf_free(clone2);
-
m = NULL;
-   clone = NULL;
+   rte_pktmbuf_free(m2);
+   m2 = NULL;
+
+   if (rte_pktmbuf_seg_is_shared(clone2))
+   GOTO_FAIL("clone2 should not be marked as shared\n");
+
+   rte_pktmbuf_free(clone2);
clone2 = NULL;
+
printf("%s ok\n", __func__);
return 0;

@@ -428,6 +452,8 @@ testclone_testupdate_testdetach(void)
rte_pktmbuf_free(clone);
if (clone2)
rte_pktmbuf_free(clone2);
+   if (m2)
+   rte_pktmbuf_free(m2);
return -1;
 }

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 14956f6..cd77a56 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -576,7 +576,7 @@ rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
  *   The address of the direct mbuf corresponding to buffer_addr.
  */
 static inline struct rte_mbuf *
-rte_mbuf_from_indirect(struct rte_mbuf *mi)
+rte_mbuf_from_indirect(const struct rte_mbuf *mi)
 {
return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + 
mi->priv_size);
 }
@@ -1574,6 +1574,75 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 }

 /**
+ * Test if a mbuf segment is shared
+ *
+ * Return true if the data embedded in this segment is shared by several
+ * mbufs. In this case, the mbuf data should be considered as read-only.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf segment is shared (read-only)
+ *   - (0), the mbuf segment is not shared (writable)
+ */
+static inline int rte_pktmbuf_seg_is_shared(const struct rte_mbuf *m)
+{
+   if (rte_mbuf_refcnt_read(m) > 1)
+   return 1;
+
+   if (RTE_MBUF_INDIRECT(m) &&
+   rte_mbuf_refcnt_read(rte_mbuf_from_indirect(m)) > 1)
+   return 1;
+
+   return 0;
+}
+
+/**
+ * Test if some data in an mbuf chain is shared
+ *
+ * Return true if the specified data area in the mbuf chain is shared by
+ * several mbufs. In this case, this data should be considered as
+ * read-only.
+ *
+ * If the area described by off and len exceeds the bounds of the mbuf
+ * chain (off + len > rte_pktmbuf_pkt_len(m)), the exceeding part of the
+ * area is ignored.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf data is shared (read-only)
+ *   - (0), the mbuf data is not shared (writable)
+ */
+static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
+   uint32_t off, uint32_t len)
+{
+   const struct rte_mbuf *seg = m;
+
+   if (likely(off + len <= rte_pktmbuf_data_len(seg)))
+   return rte_pktmbuf_seg_is_shared(seg);
+
+   while (seg->next && off >= rte_pktmbuf_data_len(seg)) {

[dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function

2016-11-24 Thread Olivier Matz
There is no good reason to have this const attribute: rte_pktmbuf_read()
returns a pointer which is either in a private buffer, or in the mbuf.

In the first case, it is clearly not const. In the second case, it is up
to the user to check that the mbuf is not shared and that data can be
modified.
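
To make the intent concrete, a caller can now do something like the
sketch below (rte_pktmbuf_data_is_shared() comes from the next patch of
this series):

#include <rte_mbuf.h>

/* rte_pktmbuf_read() returns a pointer either into the mbuf
 * (contiguous case) or into buf (copy case). With the const removed,
 * the first case may be written to, provided the data is not shared. */
static void
clear_first_word(struct rte_mbuf *m)
{
	uint32_t buf;
	uint32_t *p = rte_pktmbuf_read(m, 0, sizeof(buf), &buf);

	if (p != NULL && p != &buf &&
	    !rte_pktmbuf_data_is_shared(m, 0, sizeof(buf)))
		*p = 0;	/* points into the mbuf and the data is private */
}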

Signed-off-by: Olivier Matz 
---
 lib/librte_mbuf/rte_mbuf.c | 2 +-
 lib/librte_mbuf/rte_mbuf.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 63f43c8..b31958e 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -265,7 +265,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 }

 /* read len data bytes in a mbuf at specified offset (internal) */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf)
 {
const struct rte_mbuf *seg = m;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ead7c6e..14956f6 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1576,7 +1576,7 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 /**
  * @internal used by rte_pktmbuf_read().
  */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf);

 /**
@@ -1599,7 +1599,7 @@ const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
  *   The pointer to the data, either in the mbuf if it is contiguous,
  *   or in the user buffer. If mbuf is too small, NULL is returned.
  */
-static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
+static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
uint32_t off, uint32_t len, void *buf)
 {
if (likely(off + len <= rte_pktmbuf_data_len(m)))
-- 
2.8.1



[dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs

2016-11-24 Thread Olivier Matz
This patchset fixes the transmission of cloned mbufs when using
virtio + TSO. The problem is we need to fix the L4 checksum in the
packet, but it should be considered as read-only, as pointed-out
by Stephen here:
http://dpdk.org/ml/archives/dev/2016-October/048873.html

Unfortunately the patchset is quite big, but I did not manage to
find a shorter solution. The first patches add some mbuf helpers
that are used in virtio in the last patch.

This last patch adds a zone for each tx ring entry where headers
can be copied, patched, and referenced by virtio descriptors in
case the mbuf is read-only. If it's not the case, the mbuf is
modified as before.

I tested with the same test plan as the one described in
http://dpdk.org/ml/archives/dev/2016-October/048092.html
(only the TSO test case).

I also replayed the test with the following patches to validate
the code path for:

- segmented packets (it forces a local copy in virtio_tso_fix_cksum)

--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -279,7 +279,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
seg = seg->next;
}

-   if (off + len <= rte_pktmbuf_data_len(seg))
+   if (0 && off + len <= rte_pktmbuf_data_len(seg))
return rte_pktmbuf_mtod_offset(seg, char *, off);

/* rare case: header is split among several segments */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 9dc6f10..5a4312a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1671,7 +1671,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
 static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
uint32_t off, uint32_t len, void *buf)
 {
-   if (likely(off + len <= rte_pktmbuf_data_len(m)))
+   if (likely(0 && off + len <= rte_pktmbuf_data_len(m)))
return rte_pktmbuf_mtod_offset(m, char *, off);
else
return __rte_pktmbuf_read(m, off, len, buf);

- and for shared mbufs (forces the use of the buffer in the virtio tx ring)

--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -225,7 +225,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
int shared = 0;

	/* mbuf is read-only, we need to copy the headers in a linear buffer */
-   if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+   if (unlikely(1 || rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
shared = 1;

/* network headers are too big, there's nothing we can do */


Olivier Matz (5):
  mbuf: remove const attribute in mbuf read function
  mbuf: new helper to check if a mbuf is shared
  mbuf: new helper to write data in a mbuf chain
  mbuf: new helper to copy data from a mbuf
  net/virtio: fix Tso when mbuf is shared

 app/test/test_mbuf.c |  62 +-
 drivers/net/virtio/virtio_rxtx.c | 119 ++
 drivers/net/virtio/virtqueue.h   |   2 +
 lib/librte_mbuf/rte_mbuf.c   |  46 +-
 lib/librte_mbuf/rte_mbuf.h   | 157 ++-
 lib/librte_mbuf/rte_mbuf_version.map |   6 ++
 6 files changed, 347 insertions(+), 45 deletions(-)

-- 
2.8.1



[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-24 Thread Kevin Traynor
On 11/24/2016 06:31 AM, Yuanhan Liu wrote:
> On Tue, Nov 22, 2016 at 04:53:05PM +0200, Michael S. Tsirkin wrote:
 You keep assuming that you have the VM started first and
 figure out things afterwards, but this does not work.

 Think about a cluster of machines. You want to start a VM in
 a way that will ensure compatibility with all hosts
 in a cluster.
>>>
>>> I see. I was more considering about the case when the dst
>>> host (including the qemu and dpdk combo) is given, and
>>> then determine whether it will be a successfull migration
>>> or not.
>>>
>>> And you are asking that we need to know which host could
>>> be a good candidate before starting the migration. In such
>>> case, we indeed need some inputs from both the qemu and
>>> vhost-user backend.
>>>
>>> For DPDK, I think it could be simple, just as you said, it
>>> could be either a tiny script, or even a macro defined in
>>> the source code file (we extend it every time we add a
>>> new feature) to let the libvirt to read it. Or something
>>> else.
>>
>> There's the issue of APIs that tweak features as Maxime
>> suggested.
> 
> Yes, it's a good point.
> 
>> Maybe the only thing to do is to deprecate it,
> 
> Looks like so.
> 
>> but I feel some way for application to pass info into
>> guest might be benefitial.
> 
> The two APIs are just for tweaking feature bits DPDK supports before
> any device got connected. It's another way to disable some features
> (the another obvious way is to through QEMU command lines).
> 
> IMO, it's bit handy only in a case like: we have bunch of VMs. Instead
> of disabling something though qemu one by one, we could disable it
> once in DPDK.
> 
> But I doubt the useful of it. It's only used in DPDK's vhost example
> after all. Nor is it used in vhost pmd, neither is it used in OVS.

rte_vhost_feature_disable() is currently used in OVS, lib/netdev-dpdk.c

netdev_dpdk_vhost_class_init(void)
{
static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;

    /* This function can be called for different classes.  The initialization
     * needs to be done only once */
if (ovsthread_once_start(&once)) {
rte_vhost_driver_callback_register(&virtio_net_device_ops);
rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4
  | 1ULL << VIRTIO_NET_F_HOST_TSO6
  | 1ULL << VIRTIO_NET_F_CSUM);

> 
 If you don't, guest visible interface will change
 and you won't be able to migrate.

 It does not make sense to discuss feature bits specifically
 since that is not the only part of interface.
 For example, max ring size supported might change.
>>>
>>> I don't quite understand why we have to consider the max ring
>>> size here? Isn't it a virtio device attribute, that QEMU could
>>> provide such compatibility information?
>>>
>>> I mean, DPDK is supposed to support vary vring size, it's QEMU
>>> to give a specifc value.
>>
>> If backend supports s/g of any size up to 2^16, there's no issue.
> 
> I don't know others, but I see no issues in DPDK.
> 
>> ATM some backends might be assuming up to 1K s/g since
>> QEMU never supported bigger ones. We might classify this
>> as a bug, or not and add a feature flag.
>>
>> But it's just an example. There might be more values at issue
>> in the future.
> 
> Yeah, maybe. But we could analysis it one by one.
> 
 Let me describe how it works in qemu/libvirt.
 When you install a VM, you can specify compatibility
 level (aka "machine type"), and you can query the supported compatibility
 levels. Management uses that to find the supported compatibility
 and stores the compatibility in XML that is migrated with the VM.
 There's also a way to find the latest level which is the
 default unless overridden by user, again this level
 is recorded and then
 - management can make sure migration destination is compatible
 - management can avoid migration to hosts without that support
>>>
>>> Thanks for the info, it helps.
>>>
>>> ...
 As version here is an opaque string for libvirt and qemu,
 anything can be used - but I suggest either a list
 of values defining the interface, e.g.
 any_layout=on,max_ring=256
 or a version including the name and vendor of the backend,
 e.g. "org.dpdk.v4.5.6".
>>>
>>> The version scheme may not be ideal here. Assume a QEMU is supposed
>>> to work with a specific DPDK version, however, user may disable some
>>> newer features through qemu command line, that it also could work with
>>> an elder DPDK version. Using the version scheme will not allow us doing
>>> such migration to an elder DPDK version. The MTU is a lively example
>>> here? (when MTU feature is provided by QEMU but is actually disabled
>>> by user, that it could also work with an elder DPDK without MTU support).
>>>
>>> --yliu
>>
>> OK, so does a list of values look better to you then?
> 
>

[dpdk-dev] [PATCH] doc: introduce PVP reference benchmark

2016-11-24 Thread Maxime Coquelin


On 11/24/2016 06:07 AM, Yuanhan Liu wrote:
> First of all, thanks for the doc! It's a great one.
Thanks.
I would be interested to know if you have other tuning I don't mention
in this doc.

>
> On Wed, Nov 23, 2016 at 10:00:06PM +0100, Maxime Coquelin wrote:
>> +Qemu build
>> +~~
>> +
>> +   .. code-block:: console
>> +
>> +git clone git://dpdk.org/dpdk
>> +cd dpdk
>> +export RTE_SDK=$PWD
>> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
>
> It's actually DPDK build.
>
Oh right! Copy/paste mistake...
This is the Qemu build block:

Qemu build
~~

.. code-block:: console

 git clone git://git.qemu.org/qemu.git
 cd qemu
 mkdir bin
 cd bin
 ../configure --target-list=x86_64-softmmu
 make

> I will take a closer look at it and also render it to see how it looks
> like when I get back to office next week.
>
>   --yliu
>> +
>> +DPDK build
>> +~~
>> +
>> +   .. code-block:: console
>> +
>> +git clone git://dpdk.org/dpdk
>> +cd dpdk
>> +export RTE_SDK=$PWD
>> +make install T=x86_64-native-linuxapp-gcc DESTDIR=install
>> +


[dpdk-dev] Recall: [PATCH 1/1] net/i40e: enable auto link update for XXV710

2016-11-24 Thread Zhang, Qi Z
Zhang, Qi Z would like to recall the message, "[PATCH 1/1] net/i40e: enable 
auto link update for XXV710".


[dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model

2016-11-24 Thread Jerin Jacob
On Wed, Nov 23, 2016 at 07:39:09PM +0100, Thomas Monjalon wrote:
> Hi Jerin,

Hi Thomas,

> 
> Thanks for bringing a big new piece in DPDK.
> 
> I made some comments below.

Thanks for the review.

> 
> 2016-11-18 11:14, Jerin Jacob:
> > +Eventdev API - EXPERIMENTAL
> > +M: Jerin Jacob 
> > +F: lib/librte_eventdev/
> 
> OK to mark it experimental.
> What is the plan to remove the experimental word?

IMO, the EXPERIMENTAL status can be changed when:
- At least two event drivers are available (Intel and Cavium are working
  on SW and HW event drivers)
- Functional test applications work fine with at least two drivers
- A portable example application showcases the features of the library
- eventdev is integrated with another DPDK subsystem such as ethdev

Thoughts? I am not sure what criteria were used in the cryptodev case.


> 
> > + * RTE event device drivers do not use interrupts for enqueue or dequeue
> > + * operation. Instead, Event drivers export Poll-Mode enqueue and dequeue
> > + * functions to applications.
> 
> To the question "what makes DPDK different" it could be answered
> that DPDK event drivers implement polling functions :)

Mostly taken from ethdev API header file :-)

> 
> > +#include 
> > +
> > +#include 
> > +#include 
> > +#include 
> 
> Is it possible to remove some of these includes from the API?

OK. I will scan through all the header files and remove the ones that
are not required.

> 
> > +
> > +#define EVENTDEV_NAME_SKELETON_PMD event_skeleton
> > +/**< Skeleton event device PMD name */
> 
> I do not understand this #define.

Applications can explicitly request a specific driver through its driver
name. The name is passed as the argument to
rte_event_dev_get_dev_id(const char *name).
The reason for keeping this #define in rte_eventdev.h is that the
application then needs to include only rte_eventdev.h, not
rte_eventdev_pmd.h.

I will remove the definition from this patch and add it in the
skeleton driver patch (patch 03/04).
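
To make the intent concrete, a minimal sketch of the lookup (the macro
value and the rte_event_dev_get_dev_id() signature follow this patch
series and may change, so treat them as assumptions):

    #include <rte_eventdev.h>

    /* Hypothetical name, assuming the RTE_ prefix agreed below. */
    #define RTE_EVENTDEV_NAME_SKELETON_PMD "event_skeleton"

    int
    lookup_skeleton_dev(void)
    {
        /* Resolve the device id from the driver name; a negative
         * return value means no such device was probed. */
        return rte_event_dev_get_dev_id(RTE_EVENTDEV_NAME_SKELETON_PMD);
    }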

> And it is not properly prefixed.

OK. I will prefix with RTE_ in v2.

> 
> > +struct rte_event_dev_info {
> > +   const char *driver_name;/**< Event driver name */
> > +   struct rte_pci_device *pci_dev; /**< PCI information */
> 
> There is some work in progress to remove PCI information from ethdev.
> Please do not add any PCI related structure in eventdev.
> The generic structure is rte_device.

OK. Makes sense. A grep for "rte_device" shows that none of the
subsystems implement it yet; that work is in progress. I will change to
rte_device once it is mainline. The skeleton eventdev driver, being
based on the PCI bus, needs this for the moment.


> 
> > +struct rte_event_dev_config {
> > +   uint32_t dequeue_wait_ns;
> > +   /**< rte_event_dequeue() wait for *dequeue_wait_ns* ns on this device.
> 
> Please explain exactly when the wait occurs and why.

Here is the explanation from the rte_event_dequeue() API definition:
-
@param wait
0 - no-wait, returns immediately if there is no event.
>0 - wait for the event, if the device is configured with
RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT then this function will wait until
the event available or *wait* time.
if the device is not configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT
then this function will wait until the event available or *dequeue_wait_ns*
  ^^
ns which was previously supplied to rte_event_dev_configure()
-
This gives the application control over how long the implementation
should wait when no event is available.

Let me know what exact changes are required if the details in the
rte_event_dequeue() API definition are not enough.
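
For reference, a minimal sketch of the device-wide wait mode (function
names and signatures are taken from this patch series and are
assumptions, as the API is still experimental):

    #include <rte_eventdev.h>

    int
    poll_with_device_wait(uint8_t dev_id, uint8_t port_id)
    {
        /* Device-wide wait: dequeues on this device may block up to
         * dequeue_wait_ns (within the min/max range reported by
         * rte_event_dev_info_get()) when no event is available. */
        struct rte_event_dev_config cfg = {
            .dequeue_wait_ns = 100 * 1000, /* 100 us */
        };
        struct rte_event ev;

        if (rte_event_dev_configure(dev_id, &cfg) < 0)
            return -1;

        /* RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT is not set, so the
         * per-call wait argument is ignored and dequeue_wait_ns
         * applies instead. */
        return rte_event_dequeue(dev_id, port_id, &ev, 0);
    }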

> 
> > +* This value should be in the range of *min_dequeue_wait_ns* and
> > +* *max_dequeue_wait_ns* which previously provided in
> > +* rte_event_dev_info_get()
> > +* \see RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT
> 
> I think the @see syntax would be more consistent than \see.

OK. I will change to @see

> 
> > +   uint8_t nb_event_port_dequeue_depth;
> > +   /**< Number of dequeue queue depth for any event port on this device.
> 
> I think it deserves more explanations.

see below

> 
> > +   uint32_t event_dev_cfg;
> > +   /**< Event device config flags(RTE_EVENT_DEV_CFG_)*/
> 
> How this field differs from others in the struct?
> Should it be named flags?

OK. I will change to flags

> 
> > +   uint32_t event_queue_cfg; /**< Queue config flags(EVENT_QUEUE_CFG_) */
> 
> Same comment about the naming of this field for the event_queue config struct.

OK. I will change to flags

> 
> > +/** Event port configuration structure */
> > +struct rte_event_port_conf {
> > +   int32_t new_event_threshold;
> > +   /**< A backpressure threshold for new event enqueues on this port.
> > +* Use for *closed system* event dev where event capacity is limited,
> > +* and cannot exceed the capacity of the event dev.
> > +* Configuring ports with different thresholds can make higher priority
> > +* traffic less likely to be backpressured.
> > +* For