[dpdk-dev] [PATCH 00/29] i40e base driver update

2016-01-17 Thread Zhang, Helin


> -Original Message-
> From: Richardson, Bruce
> Sent: Friday, January 15, 2016 6:48 PM
> To: Zhang, Helin
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 00/29] i40e base driver update
> 
> On Fri, Jan 15, 2016 at 10:40:24AM +0800, Helin Zhang wrote:
> > i40e base driver is updated, to support new X722 device IDs, and use
> > rx control AQ commands to read/write rx control registers.
> > Of cause, fixes and enhancements are added as listed as below.
> >
> > Helin Zhang (29):
> >   i40e/base: use explicit cast from u16 to u8
> >   i40e/base: Acquire NVM, before issuing an AQ read nvm command
> >   i40e/base: add hw flag for doing the SRCTL access using AQ for X722
> >   i40e/base: add changes in nvm read to support X722
> >   i40e/base: Limit DCB FW version checks to XL710/X710 devices
> >   i40e/base: check for stopped admin queue
> >   i40e/base: set aq count after memory allocation
> >   i40e/base: clean event descriptor before use
> >   i40e/base: add new device IDs and delete deprecated one
> >   i40e/base: fix up recent proxy and wol bits for X722_SUPPORT
> >   i40e/base: define function capabilities in only one place
> >   i40e/base: Fix for PHY NVM interaction problem
> >   i40e/base: set shared bit for multicast filters
> >   i40e/base: add APIs to Add/remove port mirroring rules
> >   i40e/base: add VEB stat control and remove L2 cloud filter
> >   i40e/base: implement the API function for aq_set_switch_config
> >   i40e/base: Add functions to blink led on Coppervale PHY
> >   i40e/base: When in promisc mode apply promisc mode to Tx Traffic as
> > well
> >   i40e/base: Increase timeout when checking GLGEN_RSTAT_DEVSTATE bit
> >   i40e/base: Save off VSI resource count when updating VSI
> >   i40e/base: coding style fixes
> >   i40e/base: use FW to read/write rx control registers
> >   i40e/base: expose some registers to program parser, FD and RSS logic
> >   i40e/base: Add a Virtchnl offload for RSS PCTYPE V2
> >   i40e/base: add AQ thermal sensor control struct
> >   i40e/base: add/update structure and macro definitions
> >   i40e: add base driver release info
> >   i40e: add/remove new device IDs
> >   i40e: use rx control function for rx control registers
> 
> Couple of minor nits looking through the subject list above.
> * the promiscuous mode fix has too long a title, so please shorten (maybe
> drop the "when in promisc mode" bit)
> * some messages start with a capital letter, others not. They should be
> consistent and the standard is to not capitalize.
OK, I will have it reworked in the second version. Thanks a lot for the 
guidance!

Regards,
Helin

> 
> /Bruce


[dpdk-dev] [PATCH v3 8/8] virtio: move VIRTIO_READ/WRITE_REG_X into virtio_pci.c

2016-01-17 Thread Santosh Shukla
On Sat, Jan 16, 2016 at 3:38 PM, Santosh Shukla  wrote:
> On Thu, Jan 14, 2016 at 1:12 PM, Yuanhan Liu
>  wrote:
>> virtio_pci.c become the only file references those macros; move them there.
>>
>
> My patch VFIO series need virtio_rd/wr So keeping these api in
> virtio_pci.h make more sense to me.

Ignore my comment, I just finished reading your patch series and wont
see any potential issue in moving my vfio rd/wr api to vfio_pci.c.


[dpdk-dev] ethdev: fix link status race condition

2016-01-17 Thread Yaacov Hazan
Hi,

I looked in your patch ethdev: fix link status race condition (d5790b03), and 
have a question.
According to your change when dev_start is called and the device supports lsc 
we doesn't ask the device status.
But, if the device is already up, when you start dpdk application, the 
application can't know that the status of the port is up since dev_start won't 
ask for the device status and no interrupt is trigger since no changed has been 
made (the device is already up).

Can you please explain the idea of this patch?

Thanks,
Yaacov.



[dpdk-dev] [PATCH v2 1/1] vhost: fix leak of fds and mmaps

2016-01-17 Thread Rich Lane
The common vhost code only supported a single mmap per device. vhost-user
worked around this by saving the address/length/fd of each mmap after the end
of the rte_virtio_memory struct. This only works if the vhost-user code frees
dev->mem, since the common code is unaware of the extra info. The
VHOST_USER_RESET_OWNER message is one situation where the common code frees
dev->mem and leaks the fds and mappings. This happens every time I shut down a
VM.

The new code calls back into the implementation (vhost-user or vhost-cuse) to
clean up these resources.

The vhost-cuse changes are only compile tested.

Signed-off-by: Rich Lane 
---
v1->v2:
- Call into vhost-user/vhost-cuse to free mmaps.

 lib/librte_vhost/vhost-net.h  |  6 ++
 lib/librte_vhost/vhost_cuse/virtio-net-cdev.c | 12 
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  1 -
 lib/librte_vhost/vhost_user/virtio-net-user.c | 25 ++---
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 -
 lib/librte_vhost/virtio-net.c |  8 +---
 6 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/lib/librte_vhost/vhost-net.h b/lib/librte_vhost/vhost-net.h
index c69b60b..e8d7477 100644
--- a/lib/librte_vhost/vhost-net.h
+++ b/lib/librte_vhost/vhost-net.h
@@ -115,4 +115,10 @@ struct vhost_net_device_ops {


 struct vhost_net_device_ops const *get_virtio_net_callbacks(void);
+
+/*
+ * Implementation-specific cleanup. Defined by vhost-cuse and vhost-user.
+ */
+void vhost_impl_cleanup(struct virtio_net *dev);
+
 #endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_cuse/virtio-net-cdev.c 
b/lib/librte_vhost/vhost_cuse/virtio-net-cdev.c
index ae2c3fa..06bfd5f 100644
--- a/lib/librte_vhost/vhost_cuse/virtio-net-cdev.c
+++ b/lib/librte_vhost/vhost_cuse/virtio-net-cdev.c
@@ -421,3 +421,15 @@ int cuse_set_backend(struct vhost_device_ctx ctx, struct 
vhost_vring_file *file)

return ops->set_backend(ctx, file);
 }
+
+void
+vhost_impl_cleanup(struct virtio_net *dev)
+{
+   /* Unmap QEMU memory file if mapped. */
+   if (dev->mem) {
+   munmap((void *)(uintptr_t)dev->mem->mapped_address,
+   (size_t)dev->mem->mapped_size);
+   free(dev->mem);
+   dev->mem = NULL;
+   }
+}
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..336efba 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -347,7 +347,6 @@ vserver_message_handler(int connfd, void *dat, int *remove)
close(connfd);
*remove = 1;
free(cfd_ctx);
-   user_destroy_device(ctx);
ops->destroy_device(ctx);

return;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c 
b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..190679f 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -339,21 +339,6 @@ user_set_vring_enable(struct vhost_device_ctx ctx,
 }

 void
-user_destroy_device(struct vhost_device_ctx ctx)
-{
-   struct virtio_net *dev = get_device(ctx);
-
-   if (dev && (dev->flags & VIRTIO_DEV_RUNNING))
-   notify_ops->destroy_device(dev);
-
-   if (dev && dev->mem) {
-   free_mem_region(dev);
-   free(dev->mem);
-   dev->mem = NULL;
-   }
-}
-
-void
 user_set_protocol_features(struct vhost_device_ctx ctx,
   uint64_t protocol_features)
 {
@@ -365,3 +350,13 @@ user_set_protocol_features(struct vhost_device_ctx ctx,

dev->protocol_features = protocol_features;
 }
+
+void
+vhost_impl_cleanup(struct virtio_net *dev)
+{
+   if (dev->mem) {
+   free_mem_region(dev);
+   free(dev->mem);
+   dev->mem = NULL;
+   }
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h 
b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..1140ee1 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -55,5 +55,4 @@ int user_get_vring_base(struct vhost_device_ctx, struct 
vhost_vring_state *);
 int user_set_vring_enable(struct vhost_device_ctx ctx,
  struct vhost_vring_state *state);

-void user_destroy_device(struct vhost_device_ctx);
 #endif
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..50fc68c 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -199,13 +199,7 @@ cleanup_device(struct virtio_net *dev, int destroy)
 {
uint32_t i;

-   /* Unmap QEMU memory file if mapped. */
-   if (dev->mem) {
-   munmap((void *)(uintptr_t)dev->mem->mapped_address,
-   (size_t)dev->mem->mapped_size);
-   free(dev->mem);
-   dev->m

[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2016-01-17 Thread Zhihong Wang
This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
utilization of hardware resources and deliver high performance.

In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits.

The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Code changes are:

  1. Read CPUID to check if AVX512 is supported by CPU

  2. Predefine AVX512 macro if AVX512 is enabled by compiler

  3. Implement AVX512 memcpy and choose the right implementation based on
 predefined macros

  4. Decide alignment unit for memcpy perf test based on predefined macros

--
Changes in v2:

  1. Tune performance for prior platforms

Zhihong Wang (5):
  lib/librte_eal: Identify AVX512 CPU flag
  mk: Predefine AVX512 macro for compiler
  lib/librte_eal: Optimize memcpy for AVX512 platforms
  app/test: Adjust alignment unit for memcpy perf test
  lib/librte_eal: Tune memcpy for prior platforms

 app/test/test_memcpy_perf.c|   6 +
 .../common/include/arch/x86/rte_cpuflags.h |   2 +
 .../common/include/arch/x86/rte_memcpy.h   | 269 -
 mk/rte.cpuflags.mk |   4 +
 4 files changed, 268 insertions(+), 13 deletions(-)

-- 
2.5.0



[dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag

2016-01-17 Thread Zhihong Wang
Read CPUID to check if AVX512 is supported by CPU.

Signed-off-by: Zhihong Wang 
---
 lib/librte_eal/common/include/arch/x86/rte_cpuflags.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h 
b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index dd56553..89c0d9d 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -131,6 +131,7 @@ enum rte_cpu_flag_t {
RTE_CPUFLAG_ERMS,   /**< ERMS */
RTE_CPUFLAG_INVPCID,/**< INVPCID */
RTE_CPUFLAG_RTM,/**< Transactional memory */
+   RTE_CPUFLAG_AVX512F,/**< AVX512F */

/* (EAX 8001h) ECX features */
RTE_CPUFLAG_LAHF_SAHF,  /**< LAHF_SAHF */
@@ -238,6 +239,7 @@ static const struct feature_entry cpu_feature_table[] = {
FEAT_DEF(ERMS, 0x0007, 0, RTE_REG_EBX,  8)
FEAT_DEF(INVPCID, 0x0007, 0, RTE_REG_EBX, 10)
FEAT_DEF(RTM, 0x0007, 0, RTE_REG_EBX, 11)
+   FEAT_DEF(AVX512F, 0x0007, 0, RTE_REG_EBX, 16)

FEAT_DEF(LAHF_SAHF, 0x8001, 0, RTE_REG_ECX,  0)
FEAT_DEF(LZCNT, 0x8001, 0, RTE_REG_ECX,  4)
-- 
2.5.0



[dpdk-dev] [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler

2016-01-17 Thread Zhihong Wang
Predefine AVX512 macro if AVX512 is enabled by compiler.

Signed-off-by: Zhihong Wang 
---
 mk/rte.cpuflags.mk | 4 
 1 file changed, 4 insertions(+)

diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index 28f203b..19a3e7e 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -89,6 +89,10 @@ ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
 CPUFLAGS += AVX2
 endif

+ifneq ($(filter $(AUTO_CPUFLAGS),__AVX512F__),)
+CPUFLAGS += AVX512F
+endif
+
 # IBM Power CPU flags
 ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
 CPUFLAGS += PPC64
-- 
2.5.0



[dpdk-dev] [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms

2016-01-17 Thread Zhihong Wang
Implement AVX512 memcpy and choose the right implementation based on
predefined macros, to make full utilization of hardware resources and
deliver high performance.

In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits for AVX512 platforms.

The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Signed-off-by: Zhihong Wang 
---
 .../common/include/arch/x86/rte_memcpy.h   | 247 -
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h 
b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 6a57426..fee954a 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -37,7 +37,7 @@
 /**
  * @file
  *
- * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
  */

 #include 
@@ -67,7 +67,246 @@ extern "C" {
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n) 
__attribute__((always_inline));

-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+   __m128i xmm0;
+
+   xmm0 = _mm_loadu_si128((const __m128i *)src);
+   _mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+   __m256i ymm0;
+
+   ymm0 = _mm256_loadu_si256((const __m256i *)src);
+   _mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+   __m512i zmm0;
+
+   zmm0 = _mm512_loadu_si512((const void *)src);
+   _mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+   rte_mov64(dst + 0 * 64, src + 0 * 64);
+   rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+   rte_mov64(dst + 0 * 64, src + 0 * 64);
+   rte_mov64(dst + 1 * 64, src + 1 * 64);
+   rte_mov64(dst + 2 * 64, src + 2 * 64);
+   rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+   __m512i zmm0, zmm1;
+
+   while (n >= 128) {
+   zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+   n -= 128;
+   zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+   src = src + 128;
+   _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+   _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+   dst = dst + 128;
+   }
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+   __m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+   while (n >= 512) {
+   zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+   n -= 512;
+   zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+   zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+   zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+   zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+   zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+   zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+   zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+   src = src + 512;
+   _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+   _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+   _mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+   _mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+   _mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+   _mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+   _mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+   _mm512_store

[dpdk-dev] [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test

2016-01-17 Thread Zhihong Wang
Decide alignment unit for memcpy perf test based on predefined macros.

Signed-off-by: Zhihong Wang 
---
 app/test/test_memcpy_perf.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 754828e..73babec 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -79,7 +79,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE 100

 /* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT  64
+#elif RTE_MACHINE_CPUFLAG_AVX2
 #define ALIGNMENT_UNIT  32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT  16
+#endif /* RTE_MACHINE_CPUFLAG */

 /*
  * Pointers used in performance tests. The two large buffers are for uncached
-- 
2.5.0



[dpdk-dev] [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms

2016-01-17 Thread Zhihong Wang
For prior platforms, add condition for unalignment handling, to keep this
operation from interrupting the batch copy loop for aligned cases.

Signed-off-by: Zhihong Wang 
---
 .../common/include/arch/x86/rte_memcpy.h   | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h 
b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fee954a..d965957 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -513,10 +513,12 @@ COPY_BLOCK_64_BACK31:
 * Make store aligned when copy size exceeds 512 bytes
 */
dstofss = 32 - ((uintptr_t)dst & 0x1F);
-   n -= dstofss;
-   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-   src = (const uint8_t *)src + dstofss;
-   dst = (uint8_t *)dst + dstofss;
+   if (dstofss > 0) {
+   n -= dstofss;
+   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+   src = (const uint8_t *)src + dstofss;
+   dst = (uint8_t *)dst + dstofss;
+   }

/**
 * Copy 256-byte blocks.
@@ -833,11 +835,13 @@ COPY_BLOCK_64_BACK15:
 * backwards access.
 */
dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
-   n -= dstofss;
-   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-   src = (const uint8_t *)src + dstofss;
-   dst = (uint8_t *)dst + dstofss;
-   srcofs = ((uintptr_t)src & 0x0F);
+   if (dstofss > 0) {
+   n -= dstofss;
+   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+   src = (const uint8_t *)src + dstofss;
+   dst = (uint8_t *)dst + dstofss;
+   srcofs = ((uintptr_t)src & 0x0F);
+   }

/**
 * For aligned copy
-- 
2.5.0