date:20151221

[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-21 Thread Don Provan

>From: Xie, Huawei [mailto:huawei.xie at intel.com] 
>Sent: Monday, December 21, 2015 7:22 AM
>Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API
>
>The loop unwinding could give performance gain. The only problem is the
>switch/loop combination makes people feel weird at the first glance but
>soon they will grasp this style. Since this is inherited from old famous
>duff's device, i prefer to keep this style which saves lines of code.

You don't really mean "lines of code", of course, since it increases the lines
of code. It reduces the number of branches.

Is Duff's Device used in other "bulk" routines? If not, what justifies making
this a special case?

-don provan
dprovan at bivio.net

[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-21 Thread Thomas Monjalon

2015-12-21 17:20, Wiles, Keith:
> On 12/21/15, 9:21 AM, "Xie, Huawei"  wrote:
> >On 12/19/2015 3:27 AM, Wiles, Keith wrote:
> >> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces at dpdk.org on behalf of stephen at networkplumber.org> wrote:
> >>> On Fri, 18 Dec 2015 10:44:02 +
> >>> "Ananyev, Konstantin"  wrote:
>  From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
> > On Mon, 14 Dec 2015 09:14:41 +0800
> > Huawei Xie  wrote:
> >> +  switch (count % 4) {
> >> +  while (idx != count) {
> >> +      case 0:
> >> +          RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >> +          rte_mbuf_refcnt_set(mbufs[idx], 1);
> >> +          rte_pktmbuf_reset(mbufs[idx]);
> >> +          idx++;
> >> +      case 3:
> >> +          RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >> +          rte_mbuf_refcnt_set(mbufs[idx], 1);
> >> +          rte_pktmbuf_reset(mbufs[idx]);
> >> +          idx++;
> >> +      case 2:
> >> +          RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >> +          rte_mbuf_refcnt_set(mbufs[idx], 1);
> >> +          rte_pktmbuf_reset(mbufs[idx]);
> >> +          idx++;
> >> +      case 1:
> >> +          RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> >> +          rte_mbuf_refcnt_set(mbufs[idx], 1);
> >> +          rte_pktmbuf_reset(mbufs[idx]);
> >> +          idx++;
> >> +  }
> >> +  }
> >> +  return 0;
> >> +}
> > This is weird. Why not just use Duff's device in a more normal manner.
>  But it is a sort of Duff's method.
>  Not sure what looks weird to you here?
>  while () {} instead of do {} while();?
>  Konstantin
> 
> 
> 
> >>> It is unusual to have cases not associated with block of the switch.
> >>> Unusual to me means, "not used commonly in most code".
> >>>
> >>> Since you are jumping into the loop, might make more sense as a do { } while()
> >> I find this a very odd coding practice and I would suggest we not do this,
> >> unless it gives us some great performance gain.
> >>
> >> Keith
> >The loop unwinding could give performance gain. The only problem is the
> >switch/loop combination makes people feel weird at the first glance but
> >soon they will grasp this style. Since this is inherited from old famous
> >duff's device, i prefer to keep this style which saves lines of code.
> 
> Please add a comment to the code to reflect where this style came from and why
> you are using it; that would be very handy here.

+1
At least the words "loop" and "unwinding" may be helpful to some readers.
Thanks
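
For readers who have not met the switch/loop combination before, here is a
minimal standalone sketch of the loop-unrolling idea discussed in this thread.
It is not the code from the patch: init_one() and init_bulk() are hypothetical
names, and a plain int array stands in for the mbuf pointers. Unlike the quoted
patch, the sketch places "case 0:" in front of the loop head, so the loop
condition is evaluated before the first pass whenever count is a multiple of
four (including zero).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mbuf work in the patch
 * (refcnt set plus reset); it only marks the slot as initialised. */
static void init_one(int *slot)
{
	*slot = 1;
}

/* Initialise `count` slots with the loop body unrolled four times.
 * The switch jumps into the middle of the loop body so that the
 * count % 4 leftover elements are handled on the first, partial pass;
 * every later pass handles four elements per loop-condition check. */
static void init_bulk(int *slots, size_t count)
{
	size_t idx = 0;

	assert(slots != NULL || count == 0);

	switch (count % 4) {
	case 0:
		while (idx != count) {
			init_one(&slots[idx]);
			idx++;
			/* fall through */
	case 3:
			init_one(&slots[idx]);
			idx++;
			/* fall through */
	case 2:
			init_one(&slots[idx]);
			idx++;
			/* fall through */
	case 1:
			init_one(&slots[idx]);
			idx++;
		}
	}
}
```

With count = 7 the switch enters at "case 3", handles three elements, and the
loop then runs one full pass of four. The loop condition is checked twice
instead of eight times, which is the reduction in branches mentioned earlier
in the thread.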

[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-21 Thread Rich Lane

On Mon, Dec 21, 2015 at 7:41 PM, Yuanhan Liu wrote:

> On Fri, Dec 18, 2015 at 10:01:25AM -0800, Rich Lane wrote:
> > I'm using the vhost callbacks and struct virtio_net with the vhost PMD in a
> > few ways:
>
> Rich, thanks for the info!
>
> >
> > 1. new_device/destroy_device: Link state change (will be covered by the
> > link status interrupt).
> > 2. new_device: Add first queue to datapath.
>
> I'm wondering why vring_state_changed() is not used, as it will also be
> triggered at the beginning, when the default queue (the first queue) is
> enabled.
>

Turns out I'd misread the code and it's already using the vring_state_changed
callback for the first queue. Not sure if this is intentional but
vring_state_changed is called for the first queue before new_device.

> > 3. vring_state_changed: Add/remove queue to datapath.
> > 4. destroy_device: Remove all queues (vring_state_changed is not called
> > when qemu is killed).
>
> I had a plan to invoke vring_state_changed() to disable all vrings
> when destroy_device() is called.
>

That would be good.

> > 5. new_device and struct virtio_net: Determine NUMA node of the VM.
>
> You can get the 'struct virtio_net' dev from all above callbacks.

> 1. Link status interrupt.
>
> To the vhost PMD, new_device()/destroy_device() is equivalent to the link
> status interrupt, where new_device() is link up and destroy_device() is
> link down.
>
>
> > 2. New queue_state_changed callback. Unlike vring_state_changed this
> > should cover the first queue at new_device and removal of all queues at
> > destroy_device.
>
> As stated above, vring_state_changed() should be able to do that, except
> the one on destroy_device(), which is not done yet.
>
> > 3. Per-queue or per-device NUMA node info.
>
> You can query the NUMA node info implicitly by get_mempolicy(); check
> numa_realloc() at lib/librte_vhost/virtio-net.c for reference.
>

Your suggestions are exactly how my application is already working. I was
commenting on the proposed changes to the vhost PMD API. I would prefer to use
RTE_ETH_EVENT_INTR_LSC and rte_eth_dev_socket_id for consistency with other NIC
drivers, instead of these vhost-specific hacks. The queue state change callback
is the one new API that needs to be added because normal NICs don't have this
behavior.

You could add another rte_eth_event_type for the queue state change callback,
and pass the queue ID, RX/TX direction, and enable bit through cb_arg. The
application would never need to touch struct virtio_net.
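
As a rough illustration of that last suggestion, the sketch below shows what
such an event payload and an application callback might look like. Everything
in it is hypothetical: struct queue_state_event, struct datapath, MAX_QUEUES
and on_queue_state_change() are names made up for this example and are not
part of DPDK or of the proposed patch, and the callback signature is
simplified rather than the real ethdev callback prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_QUEUES 8	/* hypothetical limit for this sketch */

/* Hypothetical event payload: queue ID, RX/TX direction, enable bit. */
struct queue_state_event {
	uint16_t queue_id;
	bool is_rx;
	bool enable;
};

/* Hypothetical application state: which queues are in the datapath. */
struct datapath {
	bool rx_active[MAX_QUEUES];
	bool tx_active[MAX_QUEUES];
};

/* Hypothetical callback: the application adds or removes a queue using
 * only the event payload, without touching any vhost-internal structure. */
static void on_queue_state_change(struct datapath *dp, const void *cb_arg)
{
	const struct queue_state_event *ev = cb_arg;

	assert(ev->queue_id < MAX_QUEUES);
	if (ev->is_rx)
		dp->rx_active[ev->queue_id] = ev->enable;
	else
		dp->tx_active[ev->queue_id] = ev->enable;
}
```

The point of the design is that the payload carries everything the application
needs, so the driver's private data never has to be exposed.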

[dpdk-dev] [PATCH] vfio: Support for no-IOMMU mode

2015-12-21 Thread Anatoly Burakov

This commit is adding a generic mechanism to support multiple IOMMU
types. For now, it's only type 1 (x86 IOMMU) and no-IOMMU (a special
VFIO mode that doesn't use IOMMU at all), but it's easily extended
by adding necessary definitions into eal_pci_init.h and a DMA
mapping function to eal_pci_vfio_dma.c.

Since type 1 IOMMU module is no longer necessary to have VFIO,
we fix the module check to check for vfio-pci instead. It's not
ideal and triggers VFIO checks more often (and thus produces more
error output, which was the reason behind the module check in the
first place), so we compensate for that by providing more verbose
logging, indicating whether VFIO initialization has succeeded or
failed.

Signed-off-by: Anatoly Burakov 
---
 lib/librte_eal/linuxapp/eal/Makefile   |   1 +
 lib/librte_eal/linuxapp/eal/eal_pci_init.h |  22 
 lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 142 -
 lib/librte_eal/linuxapp/eal/eal_pci_vfio_dma.c |  84 +++
 lib/librte_eal/linuxapp/eal/eal_vfio.h |   5 +
 5 files changed, 201 insertions(+), 53 deletions(-)
 create mode 100644 lib/librte_eal/linuxapp/eal/eal_pci_vfio_dma.c

diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 26eced5..5c9e9d9 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -59,6 +59,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_log.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_pci.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_pci_uio.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_pci_vfio.c
+SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_pci_vfio_dma.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_pci_vfio_mp_sync.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_debug.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_LINUXAPP) += eal_lcore.c
diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_init.h b/lib/librte_eal/linuxapp/eal/eal_pci_init.h
index a17c708..da1c431 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci_init.h
+++ b/lib/librte_eal/linuxapp/eal/eal_pci_init.h
@@ -106,6 +106,28 @@ struct vfio_config {
struct vfio_group vfio_groups[VFIO_MAX_GROUPS];
 };

+/* function pointer typedef for DMA mapping functions */
+typedef  int (*vfio_dma_func_t)(int);
+
+/* Structure to hold supported IOMMU types */
+struct vfio_iommu_type {
+   int type_id;
+   const char *name;
+   vfio_dma_func_t dma_map_func;
+};
+
+/* function prototypes for different IOMMU types */
+int vfio_iommu_type1_dma_map(int container_fd);
+int vfio_iommu_noiommu_dma_map(int container_fd);
+
+/* IOMMU types we support */
+static const struct vfio_iommu_type iommu_types[] = {
+   /* x86 IOMMU, otherwise known as type 1 */
+   { VFIO_TYPE1_IOMMU, "Type 1", &vfio_iommu_type1_dma_map},
+   /* IOMMU-less mode */
+   { VFIO_NOIOMMU_IOMMU, "No-IOMMU", &vfio_iommu_noiommu_dma_map},
+};
+
 #endif

 #endif /* EAL_PCI_INIT_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
index 74f91ba..71eeea8 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
@@ -72,6 +72,7 @@ EAL_REGISTER_TAILQ(rte_vfio_tailq)
 #define VFIO_DIR "/dev/vfio"
 #define VFIO_CONTAINER_PATH "/dev/vfio/vfio"
 #define VFIO_GROUP_FMT "/dev/vfio/%u"
+#define VFIO_NOIOMMU_GROUP_FMT "/dev/vfio/noiommu-%u"
 #define VFIO_GET_REGION_ADDR(x) ((uint64_t) x << 40ULL)

 /* per-process VFIO config */
@@ -208,42 +209,57 @@ pci_vfio_set_bus_master(int dev_fd)
return 0;
 }

-/* set up DMA mappings */
-static int
-pci_vfio_setup_dma_maps(int vfio_container_fd)
-{
-   const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-   int i, ret;
-
-   ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU,
-   VFIO_TYPE1_IOMMU);
-   if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot set IOMMU type, "
-   "error %i (%s)\n", errno, strerror(errno));
-   return -1;
+/* pick IOMMU type. returns a pointer to vfio_iommu_type or NULL for error */
+static const struct vfio_iommu_type *
+pci_vfio_set_iommu_type(int vfio_container_fd) {
+   for (unsigned idx = 0; idx < RTE_DIM(iommu_types); idx++) {
+   const struct vfio_iommu_type *t = &iommu_types[idx];
+
+   int ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU,
+   t->type_id);
+   if (!ret) {
+   RTE_LOG(NOTICE, EAL, "  using IOMMU type %d (%s)\n",
+   t->type_id, t->name);
+   return t;
+   }
+   /* not an error, there may be more supported IOMMU types */
+   RTE_LOG(DEBUG, EAL, "  set IOMMU type %d (%s) failed, "
+   "error %i (%s)\n", t->type_id, t->name, errno,
+   strerror(errno));
}
+
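
The table-driven selection that this patch introduces can be sketched in
isolation as follows. This is only an illustration under stated assumptions:
fake_set_iommu() is a made-up stand-in for the VFIO_SET_IOMMU ioctl, the type
IDs are arbitrary values chosen for the sketch rather than the kernel's
constants, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Arbitrary IDs for this sketch only; not the kernel's VFIO constants. */
enum { SKETCH_TYPE1 = 1, SKETCH_NOIOMMU = 2 };

typedef int (*dma_map_fn)(int container_fd);

/* One table entry per supported IOMMU type. */
struct iommu_type {
	int type_id;
	const char *name;
	dma_map_fn dma_map;
};

/* Hypothetical DMA mapping hooks; they do nothing here. */
static int map_type1(int fd) { (void)fd; return 0; }
static int map_noiommu(int fd) { (void)fd; return 0; }

static const struct iommu_type types[] = {
	{ SKETCH_TYPE1, "Type 1", map_type1 },
	{ SKETCH_NOIOMMU, "No-IOMMU", map_noiommu },
};

/* Made-up stand-in for the ioctl: succeeds only for one supported type. */
static int fake_set_iommu(int supported, int type_id)
{
	return type_id == supported ? 0 : -1;
}

/* Try each table entry in order and return the first one accepted,
 * or NULL if none is. A failed attempt is not an error by itself. */
static const struct iommu_type *pick_type(int supported)
{
	for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
		assert(types[i].dma_map != NULL);
		if (fake_set_iommu(supported, types[i].type_id) == 0)
			return &types[i];
	}
	return NULL;
}
```

Supporting another IOMMU type then amounts to adding one table entry and one
mapping function, which is the extensibility the commit message describes.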

[dpdk-dev] [PATCH 1/2] testpmd: optimize tx_vlan_set and tx_qinq_set function

2015-12-21 Thread Wang Xiao W

Currently, cmd_tx_vlan_set_parsed checks the vlan_offload capability
first, so for an invalid port we get a prompt saying
"Error, as QinQ has been enabled.". We should always make sure
that we have a valid port_id before we check other information.
cmd_tx_vlan_set_qinq_parsed has the same problem.

Meanwhile, the tx_vlan reset operation is simple enough to be put directly
into the tx_vlan_set and tx_qinq_set functions.

Signed-off-by: Wang Xiao W 
---
 app/test-pmd/cmdline.c | 12 
 app/test-pmd/config.c  | 21 +++--
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 73298c9..2adf6ca 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -2952,12 +2952,6 @@ cmd_tx_vlan_set_parsed(void *parsed_result,
   __attribute__((unused)) void *data)
 {
struct cmd_tx_vlan_set_result *res = parsed_result;
-   int vlan_offload = rte_eth_dev_get_vlan_offload(res->port_id);
-
-   if (vlan_offload & ETH_VLAN_EXTEND_OFFLOAD) {
-   printf("Error, as QinQ has been enabled.\n");
-   return;
-   }

tx_vlan_set(res->port_id, res->vlan_id);
 }
@@ -3004,12 +2998,6 @@ cmd_tx_vlan_set_qinq_parsed(void *parsed_result,
__attribute__((unused)) void *data)
 {
struct cmd_tx_vlan_set_qinq_result *res = parsed_result;
-   int vlan_offload = rte_eth_dev_get_vlan_offload(res->port_id);
-
-   if (!(vlan_offload & ETH_VLAN_EXTEND_OFFLOAD)) {
-   printf("Error, as QinQ hasn't been enabled.\n");
-   return;
-   }

tx_qinq_set(res->port_id, res->vlan_id, res->vlan_id_outer);
 }
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 7088f6f..7572b3e 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1839,25 +1839,42 @@ vlan_tpid_set(portid_t port_id, uint16_t tp_id)
 void
 tx_vlan_set(portid_t port_id, uint16_t vlan_id)
 {
+   int vlan_offload;
if (port_id_is_invalid(port_id, ENABLED_WARN))
return;
if (vlan_id_is_invalid(vlan_id))
return;
-   tx_vlan_reset(port_id);
+
+   vlan_offload = rte_eth_dev_get_vlan_offload(port_id);
+   if (vlan_offload & ETH_VLAN_EXTEND_OFFLOAD) {
+   printf("Error, as QinQ has been enabled.\n");
+   return;
+   }
+
+   ports[port_id].tx_ol_flags &= ~TESTPMD_TX_OFFLOAD_INSERT_QINQ;
ports[port_id].tx_ol_flags |= TESTPMD_TX_OFFLOAD_INSERT_VLAN;
ports[port_id].tx_vlan_id = vlan_id;
+   ports[port_id].tx_vlan_id_outer = 0;
 }

 void
 tx_qinq_set(portid_t port_id, uint16_t vlan_id, uint16_t vlan_id_outer)
 {
+   int vlan_offload;
if (port_id_is_invalid(port_id, ENABLED_WARN))
return;
if (vlan_id_is_invalid(vlan_id))
return;
if (vlan_id_is_invalid(vlan_id_outer))
return;
-   tx_vlan_reset(port_id);
+
+   vlan_offload = rte_eth_dev_get_vlan_offload(port_id);
+   if (!(vlan_offload & ETH_VLAN_EXTEND_OFFLOAD)) {
+   printf("Error, as QinQ hasn't been enabled.\n");
+   return;
+   }
+
+   ports[port_id].tx_ol_flags &= ~TESTPMD_TX_OFFLOAD_INSERT_VLAN;
ports[port_id].tx_ol_flags |= TESTPMD_TX_OFFLOAD_INSERT_QINQ;
ports[port_id].tx_vlan_id = vlan_id;
ports[port_id].tx_vlan_id_outer = vlan_id_outer;
-- 
1.9.3
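
The ordering this patch establishes (validate the port ID first, query the
offload state second, modify the port last) can be shown with a small
standalone sketch. It does not use testpmd's real data structures: NB_PORTS,
QINQ_ENABLED, struct port and sketch_tx_vlan_set() are hypothetical names
invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_PORTS 2		/* hypothetical number of ports */
#define QINQ_ENABLED 0x1	/* hypothetical "QinQ enabled" offload bit */

enum set_result { SET_OK, SET_BAD_PORT, SET_QINQ_ENABLED };

struct port {
	int vlan_offload;
	bool insert_vlan;
	bool insert_qinq;
	uint16_t vlan_id;
	uint16_t vlan_id_outer;
};

static struct port ports[NB_PORTS];

static enum set_result
sketch_tx_vlan_set(unsigned int port_id, uint16_t vlan_id)
{
	/* 1. Reject an invalid port before reading anything from it. */
	if (port_id >= NB_PORTS)
		return SET_BAD_PORT;

	/* 2. Only now is it safe to look at the port's offload state. */
	if (ports[port_id].vlan_offload & QINQ_ENABLED)
		return SET_QINQ_ENABLED;

	/* 3. Inline reset: clear the QinQ state, then set the VLAN state. */
	ports[port_id].insert_qinq = false;
	ports[port_id].insert_vlan = true;
	ports[port_id].vlan_id = vlan_id;
	ports[port_id].vlan_id_outer = 0;

	/* The two insertion modes are mutually exclusive. */
	assert(!(ports[port_id].insert_vlan && ports[port_id].insert_qinq));
	return SET_OK;
}
```

With this order an invalid port is reported as an invalid port, instead of
producing a misleading QinQ message.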

[dpdk-dev] [PATCH v4 6/6] l3fwd-power: fix a memory leak for non-ip packet

2015-12-21 Thread Shaopeng He

Previously, l3fwd-power only processed IP and IPv6 packets; the mbufs of
other packets were not released, which caused a memory leak.
This patch fixes this issue.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 doc/guides/rel_notes/release_2_3.rst | 6 ++
 examples/l3fwd-power/main.c  | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 2cb5ebd..fc871ab 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -25,6 +25,12 @@ Libraries
 Examples
 

+* **l3fwd-power: Fixed memory leak for non-ip packet.**
+
+  Fixed an issue in l3fwd-power where, on receiving packets other than
+  IP and IPv6, the mbuf was not released, causing
+  a memory leak.
+

 Other
 ~
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 828c18a..d9cd848 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -714,7 +714,8 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid,
/* We don't currently handle IPv6 packets in LPM mode. */
rte_pktmbuf_free(m);
 #endif
-   }
+   } else
+   rte_pktmbuf_free(m);

 }

-- 
1.9.3
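
The leak pattern fixed above is easy to reproduce outside DPDK. The sketch
below is a simplified model, not l3fwd-power code: struct pkt, pkt_alloc(),
pkt_free(), pkt_send() and simple_forward() are hypothetical, and a counter
tracks how many buffers are still allocated.

```c
#include <assert.h>
#include <stdlib.h>

enum pkt_type { PKT_IPV4, PKT_IPV6, PKT_OTHER };

struct pkt {
	enum pkt_type type;
};

static int live_pkts;	/* buffers currently allocated */
static int sent_pkts;	/* buffers handed to the (fake) TX path */

static struct pkt *pkt_alloc(enum pkt_type type)
{
	struct pkt *p = malloc(sizeof(*p));

	assert(p != NULL);
	p->type = type;
	live_pkts++;
	return p;
}

static void pkt_free(struct pkt *p)
{
	free(p);
	live_pkts--;
}

/* The fake TX path takes ownership of the buffer and releases it. */
static void pkt_send(struct pkt *p)
{
	sent_pkts++;
	pkt_free(p);
}

/* Every branch must either pass the buffer on or free it. */
static void simple_forward(struct pkt *p)
{
	if (p->type == PKT_IPV4 || p->type == PKT_IPV6)
		pkt_send(p);
	else
		pkt_free(p);	/* without this branch the buffer leaks */
}
```

Forwarding one packet of each type leaves live_pkts at zero; dropping the
final else branch would leave one buffer allocated for every non-IP packet.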

[dpdk-dev] [PATCH v4 5/6] fm10k: make sure default VID available in dev_init

2015-12-21 Thread Shaopeng He

When the PF establishes a connection with the Switch Manager, it receives
a logic port range from the SM and registers certain logic ports from
that range; a default VID is then sent back from the SM. This whole
transaction needs to be finished in dev_init. Otherwise, in dev_start
the interrupt setting will be changed according to the RX queue number,
which will probably cause this transaction to fail.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 drivers/net/fm10k/fm10k_ethdev.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index 83e2f65..08d4ea9 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -2815,6 +2815,21 @@ eth_fm10k_dev_init(struct rte_eth_dev *dev)

fm10k_mbx_unlock(hw);

+   /* Make sure default VID is ready before going forward. */
+   if (hw->mac.type == fm10k_mac_pf) {
+   for (i = 0; i < MAX_QUERY_SWITCH_STATE_TIMES; i++) {
+   if (hw->mac.default_vid)
+   break;
+   /* Delay some time to acquire async port VLAN info. */
+   rte_delay_us(WAIT_SWITCH_MSG_US);
+   }
+
+   if (!hw->mac.default_vid) {
+   PMD_INIT_LOG(ERR, "default VID is not ready");
+   return -1;
+   }
+   }
+
/* Add default mac address */
fm10k_MAC_filter_set(dev, hw->mac.addr, true,
MAIN_VSI_POOL_NUMBER);
-- 
1.9.3
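
The wait loop added by this patch is a bounded poll: check a value that is
filled in asynchronously, delay, and give up after a fixed number of attempts.
A minimal standalone sketch of that pattern follows. The device is faked:
struct fake_hw, fake_delay(), MAX_QUERY_TIMES and wait_default_vid() are
hypothetical names for this example, and the "delay" simply counts down until
the value becomes available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_QUERY_TIMES 10	/* hypothetical retry budget */

/* Fake device: default_vid becomes non-zero after some number of polls. */
struct fake_hw {
	uint16_t default_vid;
	int polls_until_ready;
};

/* Stand-in for a real delay; it also advances the fake async reply. */
static void fake_delay(struct fake_hw *hw)
{
	if (hw->polls_until_ready > 0 && --hw->polls_until_ready == 0)
		hw->default_vid = 1;
}

/* Poll at most MAX_QUERY_TIMES times. Returns 0 once the default VID
 * is set, or -1 if it is still missing after the last attempt. */
static int wait_default_vid(struct fake_hw *hw)
{
	assert(hw != NULL);

	for (int i = 0; i < MAX_QUERY_TIMES; i++) {
		if (hw->default_vid)
			break;
		fake_delay(hw);
	}
	return hw->default_vid ? 0 : -1;
}
```

Bounding the loop keeps initialisation from hanging forever when the reply
never arrives, at the cost of failing init if the reply is merely slow.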

[dpdk-dev] [PATCH v4 4/6] fm10k: add rx queue interrupt en/dis functions

2015-12-21 Thread Shaopeng He

The interrupt mode framework has enable/disable functions for individual
rx queues; this patch implements these two functions.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 drivers/net/fm10k/fm10k_ethdev.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index b5b809c..83e2f65 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -2205,6 +2205,37 @@ fm10k_dev_disable_intr_vf(struct rte_eth_dev *dev)
 }

 static int
+fm10k_dev_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+   struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+
+   /* Enable ITR */
+   if (hw->mac.type == fm10k_mac_pf)
+   FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, queue_id)),
+   FM10K_ITR_AUTOMASK | FM10K_ITR_MASK_CLEAR);
+   else
+   FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, queue_id)),
+   FM10K_ITR_AUTOMASK | FM10K_ITR_MASK_CLEAR);
+   rte_intr_enable(&dev->pci_dev->intr_handle);
+   return 0;
+}
+
+static int
+fm10k_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+   struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+
+   /* Disable ITR */
+   if (hw->mac.type == fm10k_mac_pf)
+   FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, queue_id)),
+   FM10K_ITR_MASK_SET);
+   else
+   FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, queue_id)),
+   FM10K_ITR_MASK_SET);
+   return 0;
+}
+
+static int
 fm10k_dev_rxq_interrupt_setup(struct rte_eth_dev *dev)
 {
struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -2537,6 +2568,8 @@ static const struct eth_dev_ops fm10k_eth_dev_ops = {
.tx_queue_setup = fm10k_tx_queue_setup,
.tx_queue_release   = fm10k_tx_queue_release,
.rx_descriptor_done = fm10k_dev_rx_descriptor_done,
+   .rx_queue_intr_enable   = fm10k_dev_rx_queue_intr_enable,
+   .rx_queue_intr_disable  = fm10k_dev_rx_queue_intr_disable,
.reta_update= fm10k_reta_update,
.reta_query = fm10k_reta_query,
.rss_hash_update= fm10k_rss_hash_update,
-- 
1.9.3

[dpdk-dev] [PATCH v4 3/6] fm10k: remove rx queue interrupts when dev stops

2015-12-21 Thread Shaopeng He

Previously, the dev_stop function only stopped the rx/tx queues. This patch
adds logic to disable rx queue interrupts and to clean up the datapath event
and queue/vec map.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 drivers/net/fm10k/fm10k_ethdev.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index a34c5e2..b5b809c 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -1125,6 +1125,8 @@ fm10k_dev_start(struct rte_eth_dev *dev)
 static void
 fm10k_dev_stop(struct rte_eth_dev *dev)
 {
+   struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+   struct rte_intr_handle *intr_handle = &dev->pci_dev->intr_handle;
int i;

PMD_INIT_FUNC_TRACE();
@@ -1136,6 +1138,26 @@ fm10k_dev_stop(struct rte_eth_dev *dev)
if (dev->data->rx_queues)
for (i = 0; i < dev->data->nb_rx_queues; i++)
fm10k_dev_rx_queue_stop(dev, i);
+
+   /* Disable datapath event */
+   if (rte_intr_dp_is_en(intr_handle)) {
+   for (i = 0; i < dev->data->nb_rx_queues; i++) {
+   FM10K_WRITE_REG(hw, FM10K_RXINT(i),
+   3 << FM10K_RXINT_TIMER_SHIFT);
+   if (hw->mac.type == fm10k_mac_pf)
+   FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, i)),
+   FM10K_ITR_MASK_SET);
+   else
+   FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, i)),
+   FM10K_ITR_MASK_SET);
+   }
+   }
+   /* Clean datapath event and queue/vec mapping */
+   rte_intr_efd_disable(intr_handle);
+   if (intr_handle->intr_vec != NULL) {
+   rte_free(intr_handle->intr_vec);
+   intr_handle->intr_vec = NULL;
+   }
 }

 static void
-- 
1.9.3

[dpdk-dev] [PATCH v4 2/6] fm10k: setup rx queue interrupts for PF and VF

2015-12-21 Thread Shaopeng He

In interrupt mode, each rx queue can have one interrupt to notify the
upper-layer application when packets are available in that queue. Some queues
can also share one interrupt.
Currently, fm10k needs one separate interrupt for mailbox. So, only those
drivers which support multiple interrupt vectors, e.g. vfio-pci, can work
in fm10k interrupt mode.
This patch uses the RXINT/INT_MAP registers to map interrupt causes
(rx queue and other events) to vectors, and enables these interrupts
through kernel drivers like vfio-pci.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 doc/guides/rel_notes/release_2_3.rst |   2 +
 drivers/net/fm10k/fm10k.h|   3 ++
 drivers/net/fm10k/fm10k_ethdev.c | 101 +++
 3 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 99de186..2cb5ebd 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,8 @@ DPDK Release 2.3
 New Features
 

+* **Added fm10k Rx interrupt support.**
+

 Resolved Issues
 ---
diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
index e2f677a..770d6ba 100644
--- a/drivers/net/fm10k/fm10k.h
+++ b/drivers/net/fm10k/fm10k.h
@@ -129,6 +129,9 @@
 #define RTE_FM10K_TX_MAX_FREE_BUF_SZ64
 #define RTE_FM10K_DESCS_PER_LOOP4

+#define FM10K_MISC_VEC_ID   RTE_INTR_VEC_ZERO_OFFSET
+#define FM10K_RX_VEC_START  RTE_INTR_VEC_RXTX_OFFSET
+
 #define FM10K_SIMPLE_TX_FLAG ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
ETH_TXQ_FLAGS_NOOFFLOADS)

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index d39c33b..a34c5e2 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -54,6 +54,8 @@
 /* Number of chars per uint32 type */
 #define CHARS_PER_UINT32 (sizeof(uint32_t))
 #define BIT_MASK_PER_UINT32 ((1 << CHARS_PER_UINT32) - 1)
+/* default 1:1 map from queue ID to interrupt vector ID */
+#define Q2V(dev, queue_id) (dev->pci_dev->intr_handle.intr_vec[queue_id])

 static void fm10k_close_mbx_service(struct fm10k_hw *hw);
 static void fm10k_dev_promiscuous_enable(struct rte_eth_dev *dev);
@@ -109,6 +111,8 @@ struct fm10k_xstats_name_off fm10k_hw_stats_tx_q_strings[] = {

 #define FM10K_NB_XSTATS (FM10K_NB_HW_XSTATS + FM10K_MAX_QUEUES_PF * \
(FM10K_NB_RX_Q_XSTATS + FM10K_NB_TX_Q_XSTATS))
+static int
+fm10k_dev_rxq_interrupt_setup(struct rte_eth_dev *dev);

 static void
 fm10k_mbx_initlock(struct fm10k_hw *hw)
@@ -687,6 +691,7 @@ static int
 fm10k_dev_rx_init(struct rte_eth_dev *dev)
 {
struct fm10k_hw *hw = FM10K_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+   struct rte_intr_handle *intr_handle = &dev->pci_dev->intr_handle;
int i, ret;
struct fm10k_rx_queue *rxq;
uint64_t base_addr;
@@ -694,10 +699,23 @@ fm10k_dev_rx_init(struct rte_eth_dev *dev)
uint32_t rxdctl = FM10K_RXDCTL_WRITE_BACK_MIN_DELAY;
uint16_t buf_size;

-   /* Disable RXINT to avoid possible interrupt */
-   for (i = 0; i < hw->mac.max_queues; i++)
+   /* enable RXINT for interrupt mode */
+   i = 0;
+   if (rte_intr_dp_is_en(intr_handle)) {
+   for (; i < dev->data->nb_rx_queues; i++) {
+   FM10K_WRITE_REG(hw, FM10K_RXINT(i), Q2V(dev, i));
+   if (hw->mac.type == fm10k_mac_pf)
+   FM10K_WRITE_REG(hw, FM10K_ITR(Q2V(dev, i)),
+   FM10K_ITR_AUTOMASK | FM10K_ITR_MASK_CLEAR);
+   else
+   FM10K_WRITE_REG(hw, FM10K_VFITR(Q2V(dev, i)),
+   FM10K_ITR_AUTOMASK | FM10K_ITR_MASK_CLEAR);
+   }
+   }
+   /* Disable other RXINT to avoid possible interrupt */
+   for (; i < hw->mac.max_queues; i++)
FM10K_WRITE_REG(hw, FM10K_RXINT(i),
-   3 << FM10K_RXINT_TIMER_SHIFT);
+   3 << FM10K_RXINT_TIMER_SHIFT);

/* Setup RX queues */
for (i = 0; i < dev->data->nb_rx_queues; ++i) {
@@ -1053,6 +1071,9 @@ fm10k_dev_start(struct rte_eth_dev *dev)
return diag;
}

+   if (fm10k_dev_rxq_interrupt_setup(dev))
+   return -EIO;
+
diag = fm10k_dev_rx_init(dev);
if (diag) {
PMD_INIT_LOG(ERR, "RX init failed: %d", diag);
@@ -2072,7 +2093,7 @@ fm10k_dev_enable_intr_pf(struct rte_eth_dev *dev)
uint32_t int_map = FM10K_INT_MAP_IMMEDIATE;

/* Bind all local non-queue interrupt to vector 0 */
-   int_map |= 0;
+   int_map |= FM10K_MISC_VEC_ID;

FM10K_WRITE_REG(hw, FM10K_INT_MAP(fm10k_int_Mailbox), int_map);
FM10K_WRITE_REG(hw, FM10K_INT_MAP(fm10k_int_PCIeFault), int_map);
@@ -2103,7 +2124,7 @@ fm10k_dev_disable_intr_pf(struct

[dpdk-dev] [PATCH v4 1/6] fm10k: implement rx_descriptor_done function

2015-12-21 Thread Shaopeng He

rx_descriptor_done is used by the interrupt mode example application
(l3fwd-power) to check the rxd DD bit and decide the RX trend;
l3fwd-power then adjusts the cpu frequency according to
the result.

Signed-off-by: Shaopeng He 
Acked-by: Jing Chen 
---
 drivers/net/fm10k/fm10k.h|  3 +++
 drivers/net/fm10k/fm10k_ethdev.c |  1 +
 drivers/net/fm10k/fm10k_rxtx.c   | 25 +
 3 files changed, 29 insertions(+)

diff --git a/drivers/net/fm10k/fm10k.h b/drivers/net/fm10k/fm10k.h
index cd38af2..e2f677a 100644
--- a/drivers/net/fm10k/fm10k.h
+++ b/drivers/net/fm10k/fm10k.h
@@ -345,6 +345,9 @@ uint16_t fm10k_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t fm10k_recv_scattered_pkts(void *rx_queue,
struct rte_mbuf **rx_pkts, uint16_t nb_pkts);

+int
+fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
+
 uint16_t fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

diff --git a/drivers/net/fm10k/fm10k_ethdev.c b/drivers/net/fm10k/fm10k_ethdev.c
index e4aed94..d39c33b 100644
--- a/drivers/net/fm10k/fm10k_ethdev.c
+++ b/drivers/net/fm10k/fm10k_ethdev.c
@@ -2435,6 +2435,7 @@ static const struct eth_dev_ops fm10k_eth_dev_ops = {
.rx_queue_release   = fm10k_rx_queue_release,
.tx_queue_setup = fm10k_tx_queue_setup,
.tx_queue_release   = fm10k_tx_queue_release,
+   .rx_descriptor_done = fm10k_dev_rx_descriptor_done,
.reta_update= fm10k_reta_update,
.reta_query = fm10k_reta_query,
.rss_hash_update= fm10k_rss_hash_update,
diff --git a/drivers/net/fm10k/fm10k_rxtx.c b/drivers/net/fm10k/fm10k_rxtx.c
index e958865..36d3002 100644
--- a/drivers/net/fm10k/fm10k_rxtx.c
+++ b/drivers/net/fm10k/fm10k_rxtx.c
@@ -369,6 +369,31 @@ fm10k_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_rcv;
 }

+int
+fm10k_dev_rx_descriptor_done(void *rx_queue, uint16_t offset)
+{
+   volatile union fm10k_rx_desc *rxdp;
+   struct fm10k_rx_queue *rxq = rx_queue;
+   uint16_t desc;
+   int ret;
+
+   if (unlikely(offset >= rxq->nb_desc)) {
+   PMD_DRV_LOG(ERR, "Invalid RX queue id %u", offset);
+   return 0;
+   }
+
+   desc = rxq->next_dd + offset;
+   if (desc >= rxq->nb_desc)
+   desc -= rxq->nb_desc;
+
+   rxdp = &rxq->hw_ring[desc];
+
+   ret = !!(rxdp->w.status &
+   rte_cpu_to_le_16(FM10K_RXD_STATUS_DD));
+
+   return ret;
+}
+
 static inline void tx_free_descriptors(struct fm10k_tx_queue *q)
 {
uint16_t next_rs, count = 0;
-- 
1.9.3
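
The core of the function above is a ring index with wrap-around followed by a
status bit test. The standalone sketch below models just that part. It is an
illustration only: struct rx_ring, RING_SIZE and STATUS_DD are hypothetical,
the ring is a plain array of status words, and the little-endian conversion
done by the real driver is left out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8	/* hypothetical ring length */
#define STATUS_DD 0x1	/* hypothetical "descriptor done" bit */

struct rx_ring {
	uint16_t status[RING_SIZE];	/* one status word per descriptor */
	uint16_t next_dd;		/* next descriptor expected to complete */
};

/* Return 1 if the descriptor `offset` entries past next_dd has its
 * DD bit set, 0 otherwise (including for an out-of-range offset). */
static int descriptor_done(const struct rx_ring *r, uint16_t offset)
{
	uint16_t desc;

	assert(r->next_dd < RING_SIZE);

	if (offset >= RING_SIZE)
		return 0;

	/* Both terms are below RING_SIZE, so one subtraction wraps. */
	desc = r->next_dd + offset;
	if (desc >= RING_SIZE)
		desc -= RING_SIZE;

	return !!(r->status[desc] & STATUS_DD);
}
```

For example, with next_dd = 6 and offset = 3 in a ring of 8 entries, the
descriptor inspected is index 1.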

[dpdk-dev] [PATCH v4 0/6] interrupt mode for fm10k

2015-12-21 Thread Shaopeng He

This patch series adds interrupt mode support for fm10k.
It contains four major parts:

1. implement rx_descriptor_done function in fm10k
2. add rx interrupt support in fm10k PF and VF
3. make sure default VID available in dev_init in fm10k
4. fix a memory leak for non-ip packet in l3fwd-power,
   which happens mostly when testing fm10k interrupt mode.

Changes in v4:
- Rebase to latest code
- Update release 2.3 note in each patch

Changes in v3:
- Rebase to latest code

Changes in v2:
- Reword some comments and commit messages
- Split one big patch into three smaller ones

Shaopeng He (6):
  fm10k: implement rx_descriptor_done function
  fm10k: setup rx queue interrupts for PF and VF
  fm10k: remove rx queue interrupts when dev stops
  fm10k: add rx queue interrupt en/dis functions
  fm10k: make sure default VID available in dev_init
  l3fwd-power: fix a memory leak for non-ip packet

 doc/guides/rel_notes/release_2_3.rst |   8 ++
 drivers/net/fm10k/fm10k.h|   6 ++
 drivers/net/fm10k/fm10k_ethdev.c | 172 ---
 drivers/net/fm10k/fm10k_rxtx.c   |  25 +
 examples/l3fwd-power/main.c  |   3 +-
 5 files changed, 202 insertions(+), 12 deletions(-)

-- 
1.9.3

[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-21 Thread Wiles, Keith

On 12/21/15, 9:21 AM, "Xie, Huawei"  wrote:

>On 12/19/2015 3:27 AM, Wiles, Keith wrote:
>> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger" <dev-bounces at dpdk.org on behalf of stephen at networkplumber.org> wrote:
>>
>>> On Fri, 18 Dec 2015 10:44:02 +
>>> "Ananyev, Konstantin"  wrote:
>>>

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Friday, December 18, 2015 5:01 AM
> To: Xie, Huawei
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide 
> rte_pktmbuf_alloc_bulk API
>
> On Mon, 14 Dec 2015 09:14:41 +0800
> Huawei Xie  wrote:
>
>> v2 changes:
>>  unroll the loop a bit to help the performance
>>
>> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>>
>> There is related thread about this bulk API.
>> http://dpdk.org/dev/patchwork/patch/4718/
>> Thanks to Konstantin's loop unrolling.
>>
>> Signed-off-by: Gerald Rogers 
>> Signed-off-by: Huawei Xie 
>> Acked-by: Konstantin Ananyev 
>> ---
>>  lib/librte_mbuf/rte_mbuf.h | 50 
>> ++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>> index f234ac9..4e209e0 100644
>> --- a/lib/librte_mbuf/rte_mbuf.h
>> +++ b/lib/librte_mbuf/rte_mbuf.h
>> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>>  }
>>
>>  /**
>> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
>> + * values.
>> + *
>> + *  @param pool
>> + *The mempool from which mbufs are allocated.
>> + *  @param mbufs
>> + *Array of pointers to mbufs
>> + *  @param count
>> + *Array size
>> + *  @return
>> + *   - 0: Success
>> + */
>> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>> + struct rte_mbuf **mbufs, unsigned count)
>> +{
>> +unsigned idx = 0;
>> +int rc;
>> +
>> +rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
>> +if (unlikely(rc))
>> +return rc;
>> +
>> +    switch (count % 4) {
>> +    while (idx != count) {
>> +        case 0:
>> +            RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +            rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +            rte_pktmbuf_reset(mbufs[idx]);
>> +            idx++;
>> +        case 3:
>> +            RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +            rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +            rte_pktmbuf_reset(mbufs[idx]);
>> +            idx++;
>> +        case 2:
>> +            RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +            rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +            rte_pktmbuf_reset(mbufs[idx]);
>> +            idx++;
>> +        case 1:
>> +            RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
>> +            rte_mbuf_refcnt_set(mbufs[idx], 1);
>> +            rte_pktmbuf_reset(mbufs[idx]);
>> +            idx++;
>> +    }
>> +    }
>> +    return 0;
>> +}
> This is weird. Why not just use Duff's device in a more normal manner.
 But it is a sort of Duff's method.
 Not sure what looks weird to you here?
 while () {} instead of do {} while();?
 Konstantin



>>> It is unusual to have cases not associated with block of the switch.
>>> Unusual to me means, "not used commonly in most code".
>>>
>>> Since you are jumping into the loop, might make more sense as a do { } while()
>> I find this a very odd coding practice and I would suggest we not do this,
>> unless it gives us some great performance gain.
>>
>> Keith
>The loop unwinding could give performance gain. The only problem is the
>switch/loop combination makes people feel weird at the first glance but
>soon they will grasp this style. Since this is inherited from old famous
>duff's device, i prefer to keep this style which saves lines of code.

Please add a comment to the code to reflect where this style came from and why
you are using it; that would be very handy here.

>>>
>>
>> Regards,
>> Keith

Regards,
Keith
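
For readers who have not met the construct under discussion, here is a minimal
standalone sketch of the switch-into-loop pattern on a plain integer array. It
is not the patch itself: fill_ones() and its parameters are made up for
illustration. One assumption differs from the posted hunk: "case 0:" is placed
before the while, so the loop condition is tested before the first pass and a
count of 0 touches nothing. With the label inside the loop body, a zero count
would jump past that test.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Set every element of v[0..count-1] to 1, four elements per loop pass.
 * The switch jumps into the middle of the unrolled body so that the
 * first, partial pass handles the count % 4 leftover elements.
 */
static void fill_ones(int *v, size_t count)
{
	size_t idx = 0;

	switch (count % 4) {
	case 0:
		/* multiple of 4 (possibly 0): test the condition first */
		while (idx != count) {
			v[idx++] = 1;
			/* fall through */
	case 3:
			v[idx++] = 1;
			/* fall through */
	case 2:
			v[idx++] = 1;
			/* fall through */
	case 1:
			v[idx++] = 1;
		}
	}
	/* every path above advances idx to exactly count */
	assert(idx == count);
}
```

The comments on the case labels are the kind of note requested above: they say
why control enters the loop body at that point.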

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Morten Brørup

Bruce,

Please reconsider your interpretation of the word "debuggability". Debugging is 
not only something that R&D staff does in a lab. Debuggability can also be 
interpreted as a network engineer's ability to debug what is happening in a 
production network.

Referring to the link you kindly provided (to the discussion on the OVS mailing 
list), in my eyes the context of the itemized requirements is a production 
environment, not a development environment. Daniele Di Proietto wrote:

>I think we can agree that there are a few rough spots that prevent it from 
>being easily deployed and used.

>I was hoping to get some feedback from the community about those rough spots, 
>i.e. areas where OVS+DPDK can/needs to improve to become more "production 
>ready" and user-friendly.

Med venlig hilsen / kind regards
- Morten Brørup

-----Original Message-----
From: Bruce Richardson [mailto:bruce.richard...@intel.com] 
Sent: 21. december 2015 16:40
To: Matthew Hall
Cc: Morten Brørup; Kyle Larose; dev at dpdk.org
Subject: Re: [dpdk-dev] tcpdump support in DPDK 2.3

On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic 
> > debug support. Even if it doesn't work at 40G small packet rates, 
> > you can get a lot of benefit from a scheme that provides functional 
> > debugging for an app.
> 
> I think my issue is that I don't think I buy into this particular set 
> of assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real 
> use cases of the apps actually buys us much. If all we care about is 
> quickly dumping some frames to a pcap for occasional debugging, I 
> already have some C code for that I can donate which is a lot less 
> complicated than the trouble being proposed for "basic debug support". 
> Or we could use libpcap's equivalent... but it's quite a lot more complicated 
> than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot 
> of time and money. So I'd prefer to get them focused on something that 
> will always work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different 
problems. Our current focus is the generic usability of all DPDK applications, 
as discussed at the DPDK Userspace Summit. Our plan is to provide some way to 
allow standard packet capture apps, such as tcpdump, to be used easily with 
DPDK. This is something also being looked for by folks such as those working on 
OVS e.g. called out at 
http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
kernel datapath.  Can something similar be done for the userspace
datapath?

  - Consistency of the tools: some commands are slightly different for the
userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but 
different problem, that we'll maybe look to investigate in the future - 
assuming that nobody else solves it first.

/Bruce

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Gray, Mark D

> Bruce,
> 
> Please reconsider your interpretation of the word "debuggability".
> Debugging is not only something that R&D staff does in a lab. Debuggability
> can also be interpreted as a network engineer's ability to debug what is
> happening in a production network.

Is tcpdump used in large production cloud environments? I would have
thought other less intrusive (and less manual) tools would be used. Isn't
that one of the benefits of SDN?

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Gray, Mark D

> This is something also being looked for by folks such as those
> working on OVS e.g. called out at
> http://openvswitch.org/pipermail/dev/2015-August/058814.html
> 
>   "- Insight into the system and debuggability: nothing beats tcpdump for the
> kernel datapath.  Can something similar be done for the userspace
> datapath?
> 
>   - Consistency of the tools: some commands are slightly different for the
> userspace/kernel datapath.  Ideally there shouldn't be any difference."
> 

I had a painful experience with OVS-DPDK recently which may be representative
of a typical usability issue encountered. 

I was trying to connect two Openstack compute nodes together.  I had done
the configuration without DPDK first. It was easy to debug as I could use
tcpdump to look at the eth ports and see what type of traffic
was entering the compute node. I also needed to check if the traffic
was actually VxLAN traffic and what the VNI was in order to be able to
follow the traffic around the bridges in OVS. This all went quite well and
I was able to bring up my set up quite easily. 

Then I tried to set up the same thing with DPDK. I couldn't get traffic between
the compute nodes but I had no easy way to just dump the traffic coming into
(or out of) the compute node. Of course, there were some things I could do but,
for me, DPDK would be far more usable if I could just use tcpdump. As I know
DPDK to some extent, I can usually get around these problems but I suspect
that a new user to DPDK  would get very discouraged and frustrated by an 
experience like that. 

I'm not sure how often tcpdump is used in production environments but it is
very useful when debugging a live system without having to modify code. It
would be good if it could work at high rates and be really flexible but it
probably makes sense to focus on the basics first.

[dpdk-dev] building LIBRTE_PMD_XENVIRT in 32bit triggers some errors

2015-12-21 Thread Martinx - ジェームズ

On 10 December 2015 at 02:45, Xie, Huawei  wrote:
> On 12/10/2015 6:49 AM, Martinx - ジェームズ wrote:
>> On 9 December 2015 at 18:05, Thomas Monjalon wrote:
>>> 2015-12-09 15:54, Martinx - ジェームズ:
>>>>  Sorry to insist on this subject but, the time for releasing DPDK 2.2
>>>> is near and DPDK build with Xen 32-bit is broken.
>>>>
>>>>  If DPDK doesn't fix this, there will be no way to enable XenVirt
>>>> support for next Ubuntu LTS 16.04, which is a shame...
>>>>
>>>>  I'm planning to use DPDK on Xen domUs (PVM, HVM, XenServer and on
>>>> Amazon EC2) powered exclusively by a supported version of Ubuntu but,
>>>> it is broken now...
>>>>
>>>>  So, please, can someone take a look into this?:-P
>>>>
>>>>  Thanks in advance!
>>> Sorry, this area has no maintainer:
>>> http://dpdk.org/browse/dpdk/tree/MAINTAINERS#n169
>>>
>>> In such a case, it may be logical to remove the dead code.
>>> If someone wants to make it alive, he's welcome!
>> Hi Thomas,
>>
>>  Listen, if DPDK on Xen has no maintainer, where can I find the
>> current state of DPDK on Xen?
>>
>>  I mean, I'm planning to use DPDK with Xen on the following environments:
>>
>>  * Amazon EC2 - HVM Enhanced Networking - *priority*
>>  * XenServer
>>  * Open Source Xen on Debian / Ubuntu (both PVM / HVM)
>>
>>  But, if Xen support on DPDK has no maintainer, how are you guys
>> running DPDK on top of Xen (like for example, within Amazon EC2)?
>>
>>  If I google for "DPDK Xen", I can find lots of good information but,
>> I can't find recommended setup / drivers...
>>
>>  Do you have any recommendation?
> Thiago:
> This xen PMD is based on the grant table mechanism and the virtio interface.
> It is worth noting that it needs a customized backend, which now resides in
> examples/vhost_xen.
> Another approach is a netfront based PMD, which has the kernel netback backend
> in place, but I guess it couldn't achieve the best performance as we need to
> map each grant page in the backend. Stephen submitted the patch for the
> netfront PMD http://dpdk.org/dev/patchwork/patch/3330/. Thomas, do you know its
> status?
> Anyway I will try to create the XEN environment and check the issues.
>
>>
>>  Thank you!
>>
>> Best,
>> Thiago
>>
>

Hello Xie,

Thank you for your help, I really appreciated it!

Basically, what I would like to understand is:

- What is the BEST way of running DPDK inside a Xen domU guest?

I'm seeing that there are too many options and not enough
documentation about each, for example...

* Does the DPDK XENVIRT option depend on the XENDOM0 option? However, you
said that it isn't fast / can't achieve best performance...

* Apparently, Xen supports VirtIO (if I'm not wrong), but, I honestly
don't know for sure, where/when it is available (XenServer? Amazon
high-perf Net Instance? HVM? PVM?)

* If Xen supports VirtIO (especially on Amazon / XenServer), isn't
this the BEST way of running DPDK Apps on top of this kind of
hypervisor (i.e., by not using XENVIRT at all)?

Thanks again!
Thiago

[dpdk-dev] [PATCH] ixgbe: fix link down issue on x550em_x

2015-12-21 Thread Wenzhuo Lu

Normally auto-negotiation is supported by FW. But on
X550EM_X_10G_T it is not supported by FW. As the port of
X550EM_X_10G_T is 10G, if we connect the port to a peer
which is 1G, the link is always down.
We have to support auto-neg in SW to avoid this link down
issue.

Signed-off-by: root 
---
 doc/guides/rel_notes/release_2_3.rst |  6 ++
 drivers/net/ixgbe/ixgbe_ethdev.c | 38 
 drivers/net/ixgbe/ixgbe_ethdev.h |  1 +
 3 files changed, 45 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst 
b/doc/guides/rel_notes/release_2_3.rst
index 99de186..a8d34d1 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -15,6 +15,12 @@ EAL
 Drivers
 ~~~

+* **ixgbe: fix link down issue on X550EM_X.**
+  Normally the auto-negotiation is supported by FW. SW need not care about
+  that. But on x550em_x, FW doesn't support auto-neg. As the ports of x550em_x
+  are 10G, if we connect the port with a peer which is 1G, the link will always
+  be down on x550em_x.
+  We will support auto-neg by SW to avoid this link down issue.

 Libraries
 ~
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 4c4c6df..a71c49f 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -1961,6 +1961,25 @@ ixgbe_dev_configure(struct rte_eth_dev *dev)
return 0;
 }

+static void
+ixgbe_dev_phy_intr_setup(struct rte_eth_dev *dev)
+{
+   struct ixgbe_hw *hw =
+   IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+   struct ixgbe_interrupt *intr =
+   IXGBE_DEV_PRIVATE_TO_INTR(dev->data->dev_private);
+   uint32_t gpie;
+
+   /* only set it up on X550EM_X */
+   if (hw->mac.type == ixgbe_mac_X550EM_x) {
+   gpie = IXGBE_READ_REG(hw, IXGBE_GPIE);
+   gpie |= IXGBE_SDP0_GPIEN_X550EM_x;
+   IXGBE_WRITE_REG(hw, IXGBE_GPIE, gpie);
+   if (hw->phy.type == ixgbe_phy_x550em_ext_t)
+   intr->mask |= IXGBE_EICR_GPI_SDP0_X550EM_x;
+   }
+}
+
 /*
  * Configure device link speed and setup link.
  * It returns 0 on success.
@@ -2009,6 +2028,8 @@ ixgbe_dev_start(struct rte_eth_dev *dev)
/* configure PF module if SRIOV enabled */
ixgbe_pf_host_configure(dev);

+   ixgbe_dev_phy_intr_setup(dev);
+
/* check and configure queue intr-vector mapping */
if ((rte_intr_cap_multiple(intr_handle) ||
 !RTE_ETH_DEV_SRIOV(dev).active) &&
@@ -3082,6 +3103,11 @@ ixgbe_dev_interrupt_get_status(struct rte_eth_dev *dev)
if (eicr & IXGBE_EICR_MAILBOX)
intr->flags |= IXGBE_FLAG_MAILBOX;

+   if (hw->mac.type ==  ixgbe_mac_X550EM_x &&
+   hw->phy.type == ixgbe_phy_x550em_ext_t &&
+   (eicr & IXGBE_EICR_GPI_SDP0_X550EM_x))
+   intr->flags |= IXGBE_FLAG_PHY_INTERRUPT;
+
return 0;
 }

@@ -3137,6 +3163,8 @@ ixgbe_dev_interrupt_action(struct rte_eth_dev *dev)
int64_t timeout;
struct rte_eth_link link;
int intr_enable_delay = false;
+   struct ixgbe_hw *hw =
+   IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

PMD_DRV_LOG(DEBUG, "intr action type %d", intr->flags);

@@ -3145,6 +3173,11 @@ ixgbe_dev_interrupt_action(struct rte_eth_dev *dev)
intr->flags &= ~IXGBE_FLAG_MAILBOX;
}

+   if (intr->flags & IXGBE_FLAG_PHY_INTERRUPT) {
+   ixgbe_handle_lasi(hw);
+   intr->flags &= ~IXGBE_FLAG_PHY_INTERRUPT;
+   }
+
if (intr->flags & IXGBE_FLAG_NEED_LINK_UPDATE) {
/* get the link status before link update, for predicting later */
memset(&link, 0, sizeof(link));
@@ -3208,6 +3241,11 @@ ixgbe_dev_interrupt_delayed_handler(void *param)
if (eicr & IXGBE_EICR_MAILBOX)
ixgbe_pf_mbx_process(dev);

+   if (intr->flags & IXGBE_FLAG_PHY_INTERRUPT) {
+   ixgbe_handle_lasi(hw);
+   intr->flags &= ~IXGBE_FLAG_PHY_INTERRUPT;
+   }
+
if (intr->flags & IXGBE_FLAG_NEED_LINK_UPDATE) {
ixgbe_dev_link_update(dev, 0);
intr->flags &= ~IXGBE_FLAG_NEED_LINK_UPDATE;
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index d26771a..5c3aa16 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -42,6 +42,7 @@
 /* need update link, bit flag */
 #define IXGBE_FLAG_NEED_LINK_UPDATE (uint32_t)(1 << 0)
 #define IXGBE_FLAG_MAILBOX  (uint32_t)(1 << 1)
+#define IXGBE_FLAG_PHY_INTERRUPT(uint32_t)(1 << 2)

 /*
  * Defines that were not part of ixgbe_type.h as they are not used by the
-- 
1.9.3
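
The flag handling in this patch follows a set-then-consume pattern: the status
routine turns a raw interrupt cause into a pending flag, and the action routine
handles the flag once and clears it. A minimal standalone sketch of that
pattern is given here. All names and bit positions are hypothetical stand-ins
(the real driver uses the IXGBE_* definitions and calls ixgbe_handle_lasi()),
so treat it as an illustration rather than driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending-work flag, modelled on IXGBE_FLAG_PHY_INTERRUPT. */
#define FLAG_PHY_INTERRUPT ((uint32_t)1 << 2)

/* Hypothetical cause bit standing in for IXGBE_EICR_GPI_SDP0_X550EM_x. */
#define CAUSE_PHY ((uint32_t)1 << 5)

struct intr_state {
	uint32_t flags;       /* pending work recorded by get_status() */
	unsigned phy_handled; /* how many times the PHY handler ran */
};

/* Step 1: translate raw cause bits into pending flags. */
static void get_status(struct intr_state *s, uint32_t cause)
{
	assert(s != NULL);
	if (cause & CAUSE_PHY)
		s->flags |= FLAG_PHY_INTERRUPT;
}

/* Step 2: consume each pending flag once, then clear it. */
static void run_actions(struct intr_state *s)
{
	assert(s != NULL);
	if (s->flags & FLAG_PHY_INTERRUPT) {
		s->phy_handled++; /* the patch calls its PHY handler here */
		s->flags &= ~FLAG_PHY_INTERRUPT;
	}
}
```

Clearing the flag right after handling it is what keeps a second call to
run_actions() from repeating the PHY work for the same interrupt.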

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Bruce Richardson

On Wed, Dec 16, 2015 at 01:15:57PM -0500, Matthew Hall wrote:
> On Wed, Dec 16, 2015 at 11:56:11AM +, Bruce Richardson wrote:
> > Having this work with any application is one of our primary targets here. 
> > The app author should not have to worry too much about getting basic debug 
> > support. Even if it doesn't work at 40G small packet rates, you can get a 
> > lot of benefit from a scheme that provides functional debugging for an app. 
> 
> I think my issue is that I don't think I buy into this particular set of 
> assumptions above.
> 
> I don't think a capture mechanism that doesn't work right in the real use 
> cases of the apps actually buys us much. If all we care about is quickly 
> dumping some frames to a pcap for occasional debugging, I already have some C 
> code for that I can donate which is a lot less complicated than the trouble 
> being proposed for "basic debug support". Or we could use libpcap's 
> equivalent... but it's quite a lot more complicated than the code I have.
> 
> If we're going to assign engineers to this it's costing somebody a lot of 
> time 
> and money. So I'd prefer to get them focused on something that will always 
> work even with high loads, such as real bpfjit support.
> 
> Matthew.

Hi,

I think it basically boils down to the fact that we are trying to solve different
problems. Our current focus is the generic usability of all DPDK applications,
as discussed at the DPDK Userspace Summit. Our plan is to provide some way to
allow standard packet capture apps, such as tcpdump, to be used easily with
DPDK. This is something also being looked for by folks such as those working
on OVS e.g. called out at 
http://openvswitch.org/pipermail/dev/2015-August/058814.html

  "- Insight into the system and debuggability: nothing beats tcpdump for the
kernel datapath.  Can something similar be done for the userspace
datapath?

  - Consistency of the tools: some commands are slightly different for the
userspace/kernel datapath.  Ideally there shouldn't be any difference."

Providing libraries for packet capture at high packet rates is a related, but
different problem, that we'll maybe look to investigate in the future - assuming
that nobody else solves it first.

/Bruce
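
To make the "quickly dumping some frames to a pcap" option from the quoted
message concrete, here is a minimal sketch of a classic pcap file writer. It is
not Matthew's code, which was not posted; dump_open() and dump_frame() are
hypothetical names, and the sketch assumes Ethernet frames and
microsecond timestamps supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Classic pcap global header: 24 bytes, written once at file start. */
struct pcap_file_hdr {
	uint32_t magic;         /* 0xa1b2c3d4: microsecond timestamps */
	uint16_t version_major; /* 2 */
	uint16_t version_minor; /* 4 */
	int32_t  thiszone;      /* GMT offset, normally 0 */
	uint32_t sigfigs;       /* timestamp accuracy, normally 0 */
	uint32_t snaplen;       /* max bytes stored per packet */
	uint32_t linktype;      /* 1 = Ethernet */
};

/* Per-packet record header: 16 bytes, followed by the frame bytes. */
struct pcap_rec_hdr {
	uint32_t ts_sec;
	uint32_t ts_usec;
	uint32_t incl_len; /* bytes actually stored */
	uint32_t orig_len; /* bytes on the wire */
};

/* Write the global header in host byte order; readers detect it via magic. */
static int dump_open(FILE *f, uint32_t snaplen)
{
	struct pcap_file_hdr h = { 0xa1b2c3d4u, 2, 4, 0, 0, snaplen, 1 };

	assert(sizeof(h) == 24); /* the layout must not contain padding */
	return fwrite(&h, sizeof(h), 1, f) == 1 ? 0 : -1;
}

/* Append one frame, truncated to snaplen bytes. */
static int dump_frame(FILE *f, const void *frame, uint32_t len,
		      uint32_t sec, uint32_t usec, uint32_t snaplen)
{
	struct pcap_rec_hdr r;

	r.ts_sec = sec;
	r.ts_usec = usec;
	r.incl_len = len < snaplen ? len : snaplen;
	r.orig_len = len;
	if (fwrite(&r, sizeof(r), 1, f) != 1)
		return -1;
	if (r.incl_len == 0)
		return 0;
	return fwrite(frame, r.incl_len, 1, f) == 1 ? 0 : -1;
}
```

A file written this way can be opened with tcpdump -r, but it does nothing for
the live-capture use case that the tcpdump integration is aiming at.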

[dpdk-dev] [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request

2015-12-21 Thread Xie, Huawei

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> vhost-user) where we should log dirty pages, and how big the log
> buffer is.
>
> This request introduces a new payload:
>
> typedef struct VhostUserLog {
> uint64_t mmap_size;
> uint64_t mmap_offset;
> } VhostUserLog;
>
> Also, a fd is delivered from QEMU by ancillary data.
>
> With those info given, an area of memory is mmaped, assigned
> to dev->log_base, for logging dirty pages.
>
> Signed-off-by: Yuanhan Liu 
> Signed-off-by: Victor Kaplansky
> ---
>
> v2: workaround mmap issue when offset is not zero
> ---
>  lib/librte_vhost/rte_virtio_net.h |  4 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>  5 files changed, 63 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h 
> b/lib/librte_vhost/rte_virtio_net.h
> index 10dcb90..8acee02 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -129,7 +129,9 @@ struct virtio_net {
>   char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>   uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>   void			*priv;		/**< private context */
> - uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
> + uint64_t		log_size;	/**< Size of log area */
> + uint64_t		log_base;	/**< Where dirty pages are logged */
> + uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
>   struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>  
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
> b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index 8b7a448..32ad6f6 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>   break;
>  
>   case VHOST_USER_SET_LOG_BASE:
> - RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> - break;
> + user_set_log_base(ctx, &msg);
>  
> + /* it needs a reply */
> + msg.size = sizeof(msg.payload.u64);
> + send_vhost_message(connfd, &msg);
> + break;
>   case VHOST_USER_SET_LOG_FD:
>   close(msg.fds[0]);
>   RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h 
> b/lib/librte_vhost/vhost_user/vhost-net-user.h
> index 38637cc..6d252a3 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.h
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
> @@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
>   VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
>  } VhostUserMemory;
>  
> +typedef struct VhostUserLog {
> + uint64_t mmap_size;
> + uint64_t mmap_offset;
> +} VhostUserLog;
> +
>  typedef struct VhostUserMsg {
>   VhostUserRequest request;
>  
> @@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
>   struct vhost_vring_state state;
>   struct vhost_vring_addr addr;
>   VhostUserMemory memory;
> + VhostUserLog log;
>   } payload;
>   int fds[VHOST_MEMORY_MAX_NREGIONS];
>  } __attribute((packed)) VhostUserMsg;
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c 
> b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 2934d1c..b77c9b3 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
>  
>   dev->protocol_features = protocol_features;
>  }
> +
> +int
> +user_set_log_base(struct vhost_device_ctx ctx,
> +  struct VhostUserMsg *msg)
> +{
> + struct virtio_net *dev;
> + int fd = msg->fds[0];
> + uint64_t size, off;
> + void *addr;
> +
> + dev = get_device(ctx);
> + if (!dev)
> + return -1;
> +
> + if (fd < 0) {
> + RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
> + return -1;
> + }
> +
> + if (msg->size != sizeof(VhostUserLog)) {
> + RTE_LOG(ERR, VHOST_CONFIG,
> + "invalid log base msg size: %"PRId32" != %d\n",
> + msg->size, (int)sizeof(VhostUserLog));
> + return -1;
> + }
> +
> + size = msg->payload.log.mmap_size;
> + off  = msg->payload.log.mmap_offset;
> +
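
The quoted hunk stops before the mmap call, so the following is only a guess at
what the "workaround mmap issue when offset is not zero" note refers to. mmap()
rejects file offsets that are not page-aligned, and one way around that is to
map from offset 0 with length size + off and then add off to the returned
address. The sketch uses hypothetical names (struct log_area, map_log_area(),
open_scratch_fd()) and a scratch file in place of the fd delivered by QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

struct log_area {
	void    *map;     /* address returned by mmap() */
	size_t   map_len; /* length passed to mmap(), needed for munmap() */
	uint8_t *base;    /* start of the log proper: map + off */
	uint64_t size;    /* usable log size in bytes */
};

/* Map size bytes of fd starting at byte offset off (any alignment). */
static int map_log_area(int fd, uint64_t size, uint64_t off,
			struct log_area *out)
{
	void *addr;

	assert(out != NULL);
	if (size == 0 || size + off < size) /* empty or overflowing request */
		return -1;

	/* map from file offset 0 so that off need not be page-aligned */
	addr = mmap(NULL, (size_t)(size + off), PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	out->map = addr;
	out->map_len = (size_t)(size + off);
	out->base = (uint8_t *)addr + off;
	out->size = size;
	return 0;
}

static void unmap_log_area(struct log_area *a)
{
	munmap(a->map, a->map_len);
	a->map = NULL;
	a->base = NULL;
}

/* Scratch file standing in for the fd that arrives as ancillary data. */
static int open_scratch_fd(size_t len)
{
	char path[] = "/tmp/logbase-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The cost of this approach is that the first off bytes of the file are mapped
as well, even though the log never uses them.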

[dpdk-dev] [PATCH] vfio: add no-iommu support

2015-12-21 Thread Burakov, Anatoly

Hi Ferruh,

> On Mon, Dec 21, 2015 at 03:15:46PM +, Burakov, Anatoly wrote:
> > > This is based on patch from Alex Williamson:
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=033291eccbdb
> > > plus
> > > http://dpdk.org/dev/patchwork/patch/9598/
> > >
> > > This patch is intended to test above patches on DPDK rather than
> > > official patch to DPDK.
> > >
> > > Test result is DPDK successfully run on no-iommu environment.
> > >
> >
> > This is one approach :) I was thinking of another, building some kind of
> > more generic support for multiple VFIO drivers. It's a bit more code and
> > probably overkill as a solution to this particular problem, but hopefully
> > it'll make it easier to add new VFIO drivers down the line (with each driver
> > having their own DMA mapping function), should we choose to do so. I'm still
> > working on the patch, but if everyone is OK with this approach instead of a
> > more general one, that's fine with me.
> >
> Hi Anatoly,
> 
> This patch was sent just to show what changes were made to test the VFIO
> no-iommu support I mentioned, and to justify the kernel patch; it was not sent
> as a final solution for DPDK. Sorry for interrupting your work.
> 
> Thanks,
> Ferruh

Ah OK, I misread the part where it said that it is not to be applied as-is. 
Thanks!

Thanks,
Anatoly

[dpdk-dev] [PATCH] vfio: add no-iommu support

2015-12-21 Thread Yigit, Ferruh

On Mon, Dec 21, 2015 at 03:15:46PM +, Burakov, Anatoly wrote:
> > This is based on patch from Alex Williamson:
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=033291eccbdb
> > plus
> > http://dpdk.org/dev/patchwork/patch/9598/
> > 
> > This patch is intended to test above patches on DPDK rather than official
> > patch to DPDK.
> > 
> > Test result is DPDK successfully run on no-iommu environment.
> > 
> 
> This is one approach :) I was thinking of another, building some kind of more 
> generic support for multiple VFIO drivers. It's a bit more code and probably 
> overkill as a solution to this particular problem, but hopefully it'll make 
> it easier to add new VFIO drivers down the line (with each driver having 
> their own DMA mapping function), should we choose to do so. I'm still working 
> on the patch, but if everyone is OK with this approach instead of a more 
> general one, that's fine with me.
> 
Hi Anatoly,

This patch was sent just to show what changes were made to test the VFIO
no-iommu support I mentioned, and to justify the kernel patch; it was not sent
as a final solution for DPDK. Sorry for interrupting your work.

Thanks,
ferruh

[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-21 Thread Xie, Huawei

On 12/19/2015 3:27 AM, Wiles, Keith wrote:
> On 12/18/15, 11:32 AM, "dev on behalf of Stephen Hemminger"  dpdk.org on behalf of stephen at networkplumber.org> wrote:
>
>> On Fri, 18 Dec 2015 10:44:02 +
>> "Ananyev, Konstantin"  wrote:
>>
>>>
 -Original Message-
 From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
 Sent: Friday, December 18, 2015 5:01 AM
 To: Xie, Huawei
 Cc: dev at dpdk.org
 Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide 
 rte_pktmbuf_alloc_bulk API

 On Mon, 14 Dec 2015 09:14:41 +0800
 Huawei Xie  wrote:

> v2 changes:
>  unroll the loop a bit to help the performance
>
> rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.
>
> There is related thread about this bulk API.
> http://dpdk.org/dev/patchwork/patch/4718/
> Thanks to Konstantin's loop unrolling.
>
> Signed-off-by: Gerald Rogers 
> Signed-off-by: Huawei Xie 
> Acked-by: Konstantin Ananyev 
> ---
>  lib/librte_mbuf/rte_mbuf.h | 50 
> ++
>  1 file changed, 50 insertions(+)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f234ac9..4e209e0 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
>  }
>
>  /**
> + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to default
> + * values.
> + *
> + *  @param pool
> + *The mempool from which mbufs are allocated.
> + *  @param mbufs
> + *Array of pointers to mbufs
> + *  @param count
> + *Array size
> + *  @return
> + *   - 0: Success
> + */
> +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> +  struct rte_mbuf **mbufs, unsigned count)
> +{
> + unsigned idx = 0;
> + int rc;
> +
> + rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
> + if (unlikely(rc))
> + return rc;
> +
> + switch (count % 4) {
> + while (idx != count) {
> + case 0:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 3:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 2:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + case 1:
> + RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
> + rte_mbuf_refcnt_set(mbufs[idx], 1);
> + rte_pktmbuf_reset(mbufs[idx]);
> + idx++;
> + }
> + }
> + return 0;
> +}
 This is weird. Why not just use Duff's device in a more normal manner.
>>> But it is a sort of Duff's method.
>>> Not sure what looks weird to you here?
>>> while () {} instead of do {} while();?
>>> Konstantin
>>>
>>>
>>>
>> It is unusual to have cases not associated with block of the switch.
>> Unusual to me means, "not used commonly in most code".
>>
>> Since you are jumping into the loop, might make more sense as a do { } 
>> while()
> I find this a very odd coding practice and I would suggest we not do this, 
> unless it gives us some great performance gain.
>
> Keith
Loop unrolling could give a performance gain. The only problem is that the
switch/loop combination looks weird at first glance, but readers will soon
grasp this style. Since it is inherited from the famous old Duff's device,
I prefer to keep this style, which saves lines of code.
>>
>
> Regards,
> Keith
>
>
>
>
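
For comparison with the switch/loop form, here is a sketch of the more
conventional way to unroll by four: a main loop over full groups followed by a
remainder loop. The names init_one() and init_all() are hypothetical, and
init_one() merely stands in for the per-mbuf refcnt and reset work in the
patch. Whether either form is measurably faster would have to be benchmarked;
the sketch only shows that the plain form covers the same elements.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-element work done on each mbuf in the patch. */
static void init_one(int *slot)
{
	*slot = 1;
}

/* Initialise v[0..count-1], four elements per pass of the main loop. */
static void init_all(int *v, size_t count)
{
	size_t idx = 0;

	/* main loop: runs only while a full group of four remains */
	for (; idx + 4 <= count; idx += 4) {
		init_one(&v[idx]);
		init_one(&v[idx + 1]);
		init_one(&v[idx + 2]);
		init_one(&v[idx + 3]);
	}

	/* remainder loop: the last count % 4 elements */
	for (; idx < count; idx++)
		init_one(&v[idx]);

	assert(idx == count);
}
```

This version needs a few more lines than the switch/loop form, but every case
label and loop condition sits where a reader would expect it.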

[dpdk-dev] [PATCH] vfio: add no-iommu support

2015-12-21 Thread Burakov, Anatoly

Hi Ferruh,

> This is based on patch from Alex Williamson:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=033291eccbdb
> plus
> http://dpdk.org/dev/patchwork/patch/9598/
> 
> This patch is intended to test above patches on DPDK rather than official
> patch to DPDK.
> 
> Test result is DPDK successfully run on no-iommu environment.
> 
> Signed-off-by: Ferruh Yigit 
> ---
>  lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
> b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
> index 74f91ba..90bba4a 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
> @@ -61,6 +61,18 @@
> 
>  #ifdef VFIO_PRESENT
> 
> +/*#define VFIO_NOIOMMU*/
> +
> +#ifndef VFIO_NOIOMMU_IOMMU
> +#define VFIO_NOIOMMU_IOMMU 8
> +#endif
> +
> +#ifdef VFIO_NOIOMMU
> +#define VFIO_IOMMU_TYPE VFIO_NOIOMMU_IOMMU
> +#else
> +#define VFIO_IOMMU_TYPE VFIO_TYPE1_IOMMU
> +#endif
> +
>  #define PAGE_SIZE   (sysconf(_SC_PAGESIZE))
>  #define PAGE_MASK   (~(PAGE_SIZE - 1))
> 
> @@ -71,7 +83,11 @@ EAL_REGISTER_TAILQ(rte_vfio_tailq)
> 
>  #define VFIO_DIR "/dev/vfio"
>  #define VFIO_CONTAINER_PATH "/dev/vfio/vfio"
> +#ifdef VFIO_NOIOMMU
> +#define VFIO_GROUP_FMT "/dev/vfio/noiommu-%u"
> +#else
>  #define VFIO_GROUP_FMT "/dev/vfio/%u"
> +#endif
>  #define VFIO_GET_REGION_ADDR(x) ((uint64_t) x << 40ULL)
> 
>  /* per-process VFIO config */
> @@ -212,17 +228,21 @@ pci_vfio_set_bus_master(int dev_fd)
>  static int
>  pci_vfio_setup_dma_maps(int vfio_container_fd)
>  {
> +#ifndef VFIO_NOIOMMU
>   const struct rte_memseg *ms = rte_eal_get_physmem_layout();
> - int i, ret;
> + int i;
> +#endif
> + int ret;
> 
>   ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU,
> - VFIO_TYPE1_IOMMU);
> + VFIO_IOMMU_TYPE);
>   if (ret) {
>   RTE_LOG(ERR, EAL, "  cannot set IOMMU type, "
>   "error %i (%s)\n", errno, strerror(errno));
>   return -1;
>   }
> 
> +#ifndef VFIO_NOIOMMU
>   /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
>   for (i = 0; i < RTE_MAX_MEMSEG; i++) {
>   struct vfio_iommu_type1_dma_map dma_map;
> @@ -245,6 +265,7 @@ pci_vfio_setup_dma_maps(int vfio_container_fd)
>   return -1;
>   }
>   }
> +#endif
> 
>   return 0;
>  }
> @@ -373,7 +394,8 @@ pci_vfio_get_container_fd(void)
>   }
> 
>   /* check if we support IOMMU type 1 */
> - ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);
> + ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION,
> + VFIO_IOMMU_TYPE);
>   if (ret != 1) {
>   if (ret < 0)
>   RTE_LOG(ERR, EAL, "  could not get IOMMU type, "
> --
> 2.5.0

This is one approach :) I was thinking of another, building some kind of more 
generic support for multiple VFIO drivers. It's a bit more code and probably 
overkill as a solution to this particular problem, but hopefully it'll make it 
easier to add new VFIO drivers down the line (with each driver having their own 
DMA mapping function), should we choose to do so. I'm still working on the 
patch, but if everyone is OK with this approach instead of a more general one, 
that's fine with me.

Thanks,
Anatoly
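
The more general approach described above, with each VFIO driver carrying its
own DMA mapping function, could be sketched as a small table of function
pointers. Everything here is hypothetical: the struct, the function names and
the probe callback are illustrative only, and no real ioctl is issued. The
value 8 mirrors the VFIO_NOIOMMU_IOMMU define in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative type ids; 8 matches the define in the quoted patch. */
#define SKETCH_TYPE1_IOMMU   1
#define SKETCH_NOIOMMU_IOMMU 8

struct iommu_type {
	int type_id;
	const char *name;
	int (*dma_map)(int container_fd); /* per-driver DMA mapping hook */
};

static int type1_dma_map(int container_fd)
{
	(void)container_fd;
	/* a real version would map every memory segment for DMA here */
	return 0;
}

static int noiommu_dma_map(int container_fd)
{
	(void)container_fd;
	/* no IOMMU, so there is nothing to map */
	return 0;
}

static const struct iommu_type iommu_types[] = {
	{ SKETCH_TYPE1_IOMMU,   "Type 1",   type1_dma_map },
	{ SKETCH_NOIOMMU_IOMMU, "No-IOMMU", noiommu_dma_map },
};

/*
 * Return the first table entry accepted by supported(), a caller-supplied
 * probe standing in for the VFIO_CHECK_EXTENSION ioctl, or NULL if none is.
 */
static const struct iommu_type *
pick_iommu_type(int (*supported)(int type_id))
{
	size_t i;

	assert(supported != NULL);
	for (i = 0; i < sizeof(iommu_types) / sizeof(iommu_types[0]); i++)
		if (supported(iommu_types[i].type_id))
			return &iommu_types[i];
	return NULL;
}

/* Example probe: pretend the kernel only offers the no-IOMMU mode. */
static int probe_noiommu_only(int type_id)
{
	return type_id == SKETCH_NOIOMMU_IOMMU;
}
```

With a table like this, adding another IOMMU type means adding one entry and
one mapping function, instead of another #ifdef around the setup code.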

[dpdk-dev] [PATCH v1] Modify and modularize l3fwd code

2015-12-21 Thread Ravi Kerur

v1:
> Rebase to latest code base for DPDK team review.

Intel team's (Konstantin, Bruce and Declan) review comments

v4<-v3:
> Fix code review comments from Konstantin
> Move buffer optimization code into l3fwd_lpm_sse.h
  and l3fwd_em_sse.h for LPM and EM respectively
> Add compile time __SSE4_1__ for header file inclusion
> Tested with CONFIG_RTE_MACHINE=default for non
  __SSE4_1__ compilation and build
> Compiled for GCC 4.8.4 and 5.1 on Ubuntu 14.04

v3<-v2:
> Fix code review comments from Bruce
> Fix multiple static definitions
> Move local #defines to C files, common #defines
to H file.
> Rename ipv4_l3fwd_route to ipv4_l3fwd_lpm and ipv4_l3fwd_em
> Rename ipv6_l3fwd_route to ipv6_l3fwd_lpm and ipv6_l3fwd_em
> Pass additional parameter to send_single_packet
> Compiled for GCC 4.8.4 and 5.1 on Ubuntu 14.04

v2<-v1:
> Fix errors in GCC 5.1
> Restore "static inline" functions, rearrange
functions to take "static inline" into account
> Duplicate main_loop for LPM and EM

v1:
> Split main.c into following 3 files
> main.c, (parsing, buffer alloc, and other utilities)
> l3fwd_lpm.c, (Longest Prefix Match functions)
> l3fwd_em.c, (Exact Match f.e. Hash functions)
> l3fwd.h, (Common defines and prototypes)

> Select LPM or EM based on run time selection f.e.
> l3fwd -c 0x1 -n 1 -- -p 0x1 -E ... (Exact Match)
> l3fwd -c 0x1 -n 1 -- -p 0x1 -L ... (LPM)

> Options "E" and "L" are mutually exclusive.

> Use function pointers during initialization of relevant
data structures.

> Remove unwanted #ifdefs in the code with exception to
> DO_RFC_1812_CHECKS
> RTE_MACHINE_CPUFLAG_SSE4_2

> Compiled for
> i686-native-linuxapp-gcc
> x86_64-native-linuxapp-gcc
> x86_x32-native-linuxapp-gcc
> x86_64-native-bsdapp-gcc

> Tested on
> Ubuntu 14.04 (GCC 4.8.4)
> FreeBSD 10.0 (GCC 4.8)
> I217 and I218 respectively.

Signed-off-by: Ravi Kerur 
---
 examples/l3fwd/Makefile|9 +-
 examples/l3fwd/l3fwd.h |  209 
 examples/l3fwd/l3fwd_em.c  |  773 ++
 examples/l3fwd/l3fwd_em_sse.h  |  479 +
 examples/l3fwd/l3fwd_lpm.c |  414 
 examples/l3fwd/l3fwd_lpm_sse.h |  610 +++
 examples/l3fwd/main.c  | 2202 
 7 files changed, 2694 insertions(+), 2002 deletions(-)
 create mode 100644 examples/l3fwd/l3fwd.h
 create mode 100644 examples/l3fwd/l3fwd_em.c
 create mode 100644 examples/l3fwd/l3fwd_em_sse.h
 create mode 100644 examples/l3fwd/l3fwd_lpm.c
 create mode 100644 examples/l3fwd/l3fwd_lpm_sse.h

diff --git a/examples/l3fwd/Makefile b/examples/l3fwd/Makefile
index 68de8fc..94a2282 100644
--- a/examples/l3fwd/Makefile
+++ b/examples/l3fwd/Makefile
@@ -42,15 +42,10 @@ include $(RTE_SDK)/mk/rte.vars.mk
 APP = l3fwd

 # all source are stored in SRCS-y
-SRCS-y := main.c
+SRCS-y := main.c l3fwd_lpm.c l3fwd_em.c

+CFLAGS += -I$(SRCDIR)
 CFLAGS += -O3 $(USER_FLAGS)
 CFLAGS += $(WERROR_FLAGS)

-# workaround for a gcc bug with noreturn attribute
-# http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12603
-ifeq ($(CONFIG_RTE_TOOLCHAIN_GCC),y)
-CFLAGS_main.o += -Wno-return-type
-endif
-
 include $(RTE_SDK)/mk/rte.extapp.mk
diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
new file mode 100644
index 000..50e40fe
--- /dev/null
+++ b/examples/l3fwd/l3fwd.h
@@ -0,0 +1,209 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in
+ *   the documentation and/or other materials provided with the
+ *   distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ *   contributors may be used to endorse or promote products derived
+ *   from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *

[dpdk-dev] [PATCH v1] Modify and modularize l3fwd code

2015-12-21 Thread Ravi Kerur

Many thanks to the Intel team (Konstantin, Bruce and Declan) for the proposal
below to change the l3fwd code, for their valuable input during internal
review, and for their help with performance tests.

The main problem with l3fwd is that it is too monolithic: everything is
in one file, and the various options are all controlled by compile-time flags.
This makes it hard to read and understand, and any change requires a lot of
work to ensure that all code paths are covered, since a single compile of the
app will not touch large parts of the l3fwd codebase.

The following changes were made to address these issues:

> Split out the various LPM and hash specific functionality into separate
  files, so that l3fwd has one file for common code (e.g. argument
  processing, mempool creation) and individual files for the various
  forwarding approaches.

  The new file list is:

  main.c (Common code for args processing, mempool creation, etc)
  l3fwd_em.c (Hash/Exact match aka 'EM' functionality)
  l3fwd_em_sse.h (SSE4_1 buffer optimized 'EM' code)
  l3fwd_lpm.c (Longest Prefix Match aka 'LPM' functionality)
  l3fwd_lpm_sse.h (SSE4_1 buffer optimized 'LPM' code)
  l3fwd.h (Common include for 'EM' and 'LPM')


> The lpm/hash path should be chosen at runtime, not at
  compile time, via a command-line argument. This ensures that
  both code paths get compiled in a single go.

  The following examples show the runtime options provided.

  Select 'LPM' or 'EM' at run time, e.g.
> l3fwd -c 0x1 -n 1 -- -p 0x1 -E ... (EM)
> l3fwd -c 0x1 -n 1 -- -p 0x1 -L ... (LPM)

  Options "E" and "L" are mutually exclusive.

  If neither is selected, "L" is the default.
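
  As an illustration only, runtime selection through function pointers
  could look roughly like the sketch below. All names here (fwd_ops,
  select_fwd_ops, the stub setup/main_loop functions) are hypothetical
  and are not the identifiers used in the patch; the real code parses its
  options alongside the other l3fwd arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not the identifiers used in the patch. */
enum fwd_mode { FWD_MODE_UNSET, FWD_MODE_LPM, FWD_MODE_EM };

struct fwd_ops {
	const char *name;
	int (*setup)(void);     /* build the lookup table */
	int (*main_loop)(void); /* per-lcore forwarding loop */
};

/* Stubs standing in for the real LPM and EM implementations. */
static int lpm_setup(void) { return 0; }
static int lpm_main_loop(void) { return 0; }
static int em_setup(void) { return 0; }
static int em_main_loop(void) { return 0; }

static const struct fwd_ops lpm_ops = { "LPM", lpm_setup, lpm_main_loop };
static const struct fwd_ops em_ops = { "EM", em_setup, em_main_loop };

/*
 * Scan the application arguments for -E or -L.
 * Returns the selected ops table, or NULL if both flags were given.
 * With neither flag present, LPM is the default.
 */
static const struct fwd_ops *
select_fwd_ops(int argc, char *const argv[])
{
	enum fwd_mode mode = FWD_MODE_UNSET;
	int i;

	assert(argv != NULL);
	for (i = 1; i < argc; i++) {
		enum fwd_mode want;

		if (strcmp(argv[i], "-E") == 0)
			want = FWD_MODE_EM;
		else if (strcmp(argv[i], "-L") == 0)
			want = FWD_MODE_LPM;
		else
			continue;

		if (mode != FWD_MODE_UNSET && mode != want)
			return NULL; /* -E and -L are mutually exclusive */
		mode = want;
	}
	return mode == FWD_MODE_EM ? &em_ops : &lpm_ops;
}
```

  Because both tables are always compiled and linked, a single build
  exercises both code paths, and the caller only has to invoke
  ops->setup() and ops->main_loop() on whichever table was returned.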

Ravi Kerur (1):
  Modify and modularize l3fwd code

 examples/l3fwd/Makefile|9 +-
 examples/l3fwd/l3fwd.h |  209 
 examples/l3fwd/l3fwd_em.c  |  773 ++
 examples/l3fwd/l3fwd_em_sse.h  |  479 +
 examples/l3fwd/l3fwd_lpm.c |  414 
 examples/l3fwd/l3fwd_lpm_sse.h |  610 +++
 examples/l3fwd/main.c  | 2202 
 7 files changed, 2694 insertions(+), 2002 deletions(-)
 create mode 100644 examples/l3fwd/l3fwd.h
 create mode 100644 examples/l3fwd/l3fwd_em.c
 create mode 100644 examples/l3fwd/l3fwd_em_sse.h
 create mode 100644 examples/l3fwd/l3fwd_lpm.c
 create mode 100644 examples/l3fwd/l3fwd_lpm_sse.h

-- 
1.9.1

[dpdk-dev] [PATCH v2 2/6] vhost: introduce vhost_log_write

2015-12-21 Thread Xie, Huawei

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
> page is represented by 1 bit.
>
> Therefore, vhost_log_write() simply finds the right bit for the related
> page we are going to change, and sets it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
>
> Signed-off-by: Yuanhan Liu 
> Signed-off-by: Victor Kaplansky
> ---
>  lib/librte_vhost/rte_virtio_net.h | 29 +
>  1 file changed, 29 insertions(+)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h 
> b/lib/librte_vhost/rte_virtio_net.h
> index 8acee02..5726683 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -40,6 +40,7 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -59,6 +60,8 @@ struct rte_mbuf;
>  /* Backend value set by guest. */
>  #define VIRTIO_DEV_STOPPED -1
>  
> +#define VHOST_LOG_PAGE   4096
> +
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>   return vhost_va;
>  }
>  
> +static inline void __attribute__((always_inline))
> +vhost_log_page(uint8_t *log_base, uint64_t page)
> +{
> + log_base[page / 8] |= 1 << (page % 8);
> +}
> +
These logging functions are not supposed to be part of the API. Could we move
them into an internal header file?
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> + uint64_t page;
> +
Before we log, we need a memory barrier to make sure the updates are in place.
> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +!dev->log_base || !len))
> + return;
> +
> + if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> + return;
> +
> + page = addr / VHOST_LOG_PAGE;
> + while (page * VHOST_LOG_PAGE < addr + len) {
Could we add a page_end variable to make the code simpler?
> + vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> + page += VHOST_LOG_PAGE;
page += 1?
> + }
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */
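
Taken together, the review comments above (a barrier before logging, a
page_end variable, and advancing page by 1) could be sketched as follows.
This is only an illustration under stated assumptions: log_page() and
log_write() are hypothetical stand-alone helpers that take the bitmap
directly instead of a struct virtio_net, the feature-bit check is left out,
and the C11 fence stands in for whatever barrier the final patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096u /* same value as VHOST_LOG_PAGE in the patch */

/* Set the dirty bit for one page: one bit per page, eight pages per byte. */
static inline void
log_page(uint8_t *log_base, uint64_t page)
{
	assert(log_base != NULL);
	log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapping [addr, addr + len) as dirty.
 * log_size is the bitmap size in bytes. Hypothetical signature:
 * the patch takes a struct virtio_net * instead.
 */
static inline void
log_write(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, page_end;

	if (log_base == NULL || len == 0)
		return;

	page = addr / LOG_PAGE_SIZE;
	page_end = (addr + len - 1) / LOG_PAGE_SIZE; /* last page touched */
	if (page_end / 8 >= log_size)
		return; /* range falls outside the bitmap */

	/* Make earlier writes to guest memory visible before logging them. */
	atomic_thread_fence(memory_order_seq_cst);

	for (; page <= page_end; page++) /* advance by one page index */
		log_page(log_base, page);
}
```

With page treated as an index, a write of 8192 bytes at address 4096 sets
the bits for pages 1 and 2 only, which is the behaviour the "page += 1?"
comment asks for.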

[dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev functions.

2015-12-21 Thread Iremonger, Bernard

Hi Konstantin,

> -Original Message-
> From: Ananyev, Konstantin
> Sent: Monday, December 21, 2015 12:02 PM
> To: Iremonger, Bernard ; Qiu, Michael
> ; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev
> functions.
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Iremonger,
> > Bernard
> > Sent: Monday, December 21, 2015 11:40 AM
> > To: Qiu, Michael; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev
> functions.
> >
> > Hi Michael,
> >
> > > -Original Message-
> > > From: Qiu, Michael
> > > Sent: Monday, December 21, 2015 9:03 AM
> > > To: Iremonger, Bernard ; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] librte_ether: fix crashes in
> > > rte_ethdev functions.
> > >
> > > On 2015/12/18 1:24, Bernard Iremonger wrote:
> > > > The nb_rx_queues and nb_tx_queues are initialised before the
> > > > tx_queue and rx_queue arrays are allocated. The arrays are
> > > > allocated when the ethdev port is started.
> > > >
> > > > If any of the following functions are called before the ethdev
> > > > port is started there is a segmentation fault:
> > > >
> > > > rte_eth_stats_get
> > > > rte_eth_stats_reset
> > > > rte_eth_xstats_get
> > > > rte_eth_xstats_reset
> > > >
> > > > Fixes: af75078fece3 ("first public release")
> > > > Fixes: ce757f5c9a4d ("ethdev: new method to retrieve extended
> > > > statistics")
> > > > Fixes: d4fef8b0d5e5 ("ethdev: expose generic and driver specific
> > > > stats in xstats")
> > > > Signed-off-by: Bernard Iremonger 
> > > > ---
> > > >  lib/librte_ether/rte_ethdev.c | 16 
> > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ether/rte_ethdev.c
> > > > b/lib/librte_ether/rte_ethdev.c index ed971b4..a0ee84d 100644
> > > > --- a/lib/librte_ether/rte_ethdev.c
> > > > +++ b/lib/librte_ether/rte_ethdev.c
> > > > @@ -1441,7 +1441,10 @@ rte_eth_stats_get(uint8_t port_id, struct
> > > rte_eth_stats *stats)
> > > > memset(stats, 0, sizeof(*stats));
> > > >
> > > > RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->stats_get, -
> > > ENOTSUP);
> > > > -   (*dev->dev_ops->stats_get)(dev, stats);
> > > > +
> > > > +   if (dev->data->dev_started)
> > > > +   (*dev->dev_ops->stats_get)(dev, stats);
> > > > +
> 
> So why would it no longer be possible to get statistics on a stopped device?
> Konstantin



I did not consider this scenario.
I need to rethink this patch.

Self NAK

Regards,

Bernard.

[dpdk-dev] [PATCH 3/3] doc: rename release 2.3 to 16.04

2015-12-21 Thread Bruce Richardson

Update documentation to reflect new numbering scheme

Signed-off-by: Bruce Richardson 
---
 doc/guides/rel_notes/index.rst |  2 +-
 doc/guides/rel_notes/release_16_04.rst | 83 ++
 doc/guides/rel_notes/release_2_3.rst   | 76 ---
 3 files changed, 84 insertions(+), 77 deletions(-)
 create mode 100644 doc/guides/rel_notes/release_16_04.rst
 delete mode 100644 doc/guides/rel_notes/release_2_3.rst

diff --git a/doc/guides/rel_notes/index.rst b/doc/guides/rel_notes/index.rst
index 29013cf..84317b8 100644
--- a/doc/guides/rel_notes/index.rst
+++ b/doc/guides/rel_notes/index.rst
@@ -36,7 +36,7 @@ Release Notes
 :numbered:

 rel_description
-release_2_3
+release_16_04
 release_2_2
 release_2_1
 release_2_0
diff --git a/doc/guides/rel_notes/release_16_04.rst b/doc/guides/rel_notes/release_16_04.rst
new file mode 100644
index 000..2c487c5
--- /dev/null
+++ b/doc/guides/rel_notes/release_16_04.rst
@@ -0,0 +1,83 @@
+DPDK Release 16.04
+==================
+
+.. note::
+
+  Following on from the DPDK Release 2.2, the numbering scheme for this
+  project has changed. Releases are now being numbered based off the year
+  and month of release. What would have been DPDK Release 2.3 is now called
+  Release 16.04, as its release date is April 2016.
+
+New Features
+------------
+
+
+Resolved Issues
+---------------
+
+EAL
+~~~
+
+
+Drivers
+~~~~~~~
+
+
+Libraries
+~~~~~~~~~
+
+
+Examples
+~~~~~~~~
+
+
+Other
+~~~~~
+
+
+Known Issues
+------------
+
+
+API Changes
+-----------
+
+
+ABI Changes
+-----------
+
+
+Shared Library Versions
+-----------------------
+
+The libraries prepended with a plus sign were incremented in this version.
+
+.. code-block:: diff
+
+ libethdev.so.2
+ librte_acl.so.2
+ librte_cfgfile.so.2
+ librte_cmdline.so.1
+ librte_distributor.so.1
+ librte_eal.so.2
+ librte_hash.so.2
+ librte_ip_frag.so.1
+ librte_ivshmem.so.1
+ librte_jobstats.so.1
+ librte_kni.so.2
+ librte_kvargs.so.1
+ librte_lpm.so.2
+ librte_mbuf.so.2
+ librte_mempool.so.1
+ librte_meter.so.1
+ librte_pipeline.so.2
+ librte_pmd_bond.so.1
+ librte_pmd_ring.so.2
+ librte_port.so.2
+ librte_power.so.1
+ librte_reorder.so.1
+ librte_ring.so.1
+ librte_sched.so.1
+ librte_table.so.2
+ librte_timer.so.1
+ librte_vhost.so.2
diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
deleted file mode 100644
index 99de186..000
--- a/doc/guides/rel_notes/release_2_3.rst
+++ /dev/null
@@ -1,76 +0,0 @@
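-DPDK Release 2.3
-================
-
-New Features
-------------
-
-
-Resolved Issues
----------------
-
-EAL
-~~~
-
-
-Drivers
-~~~~~~~
-
-
-Libraries
-~~~~~~~~~
-
-
-Examples
-~~~~~~~~
-
-
-Other
-~~~~~
-
-
-Known Issues
-------------
-
-
-API Changes
------------
-
-
-ABI Changes
------------
-
-
-Shared Library Versions
------------------------
-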
-The libraries prepended with a plus sign were incremented in this version.
-
-.. code-block:: diff
-
- libethdev.so.2
- librte_acl.so.2
- librte_cfgfile.so.2
- librte_cmdline.so.1
- librte_distributor.so.1
- librte_eal.so.2
- librte_hash.so.2
- librte_ip_frag.so.1
- librte_ivshmem.so.1
- librte_jobstats.so.1
- librte_kni.so.2
- librte_kvargs.so.1
- librte_lpm.so.2
- librte_mbuf.so.2
- librte_mempool.so.1
- librte_meter.so.1
- librte_pipeline.so.2
- librte_pmd_bond.so.1
- librte_pmd_ring.so.2
- librte_port.so.2
- librte_power.so.1
- librte_reorder.so.1
- librte_ring.so.1
- librte_sched.so.1
- librte_table.so.2
- librte_timer.so.1
- librte_vhost.so.2
-- 
2.5.0

[dpdk-dev] [PATCH 2/3] version: adjust printing for new version scheme

2015-12-21 Thread Bruce Richardson

Since we are now using a year-month numbering scheme, adjust
the printing of the version to always use 2-digits for YY.MM
format.
Also omit the patch version unless there is a patch version present,
since patches for releases are rare on DPDK. This means that the
final release of 16.04 will report as 16.04, rather than 16.04.0.
Release candidates for it will similarly report as 16.04-rcX.

Signed-off-by: Bruce Richardson 
---
 lib/librte_eal/common/include/rte_version.h | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_version.h b/lib/librte_eal/common/include/rte_version.h
index f1c7b98..7feea73 100644
--- a/lib/librte_eal/common/include/rte_version.h
+++ b/lib/librte_eal/common/include/rte_version.h
@@ -55,12 +55,12 @@ extern "C" {
 /**
  * Major version number i.e. the x in x.y.z
  */
-#define RTE_VER_MAJOR 16
+#define RTE_REL_YEAR 16

 /**
  * Minor version number i.e. the y in x.y.z
  */
-#define RTE_VER_MINOR 4
+#define RTE_REL_MONTH 4

 /**
  * Patch level number i.e. the z in x.y.z
@@ -88,8 +88,8 @@ extern "C" {
  * All version numbers in one to compare with RTE_VERSION_NUM()
  */
 #define RTE_VERSION RTE_VERSION_NUM( \
-   RTE_VER_MAJOR, \
-   RTE_VER_MINOR, \
+   RTE_REL_YEAR, \
+   RTE_REL_MONTH, \
RTE_VER_PATCH_LEVEL, \
RTE_VER_PATCH_RELEASE)

@@ -102,20 +102,19 @@ static inline const char *
 rte_version(void)
 {
static char version[32];
+   int pos;
if (version[0] != 0)
return version;
-   if (strlen(RTE_VER_SUFFIX) == 0)
-   snprintf(version, sizeof(version), "%s %d.%d.%d",
+
+   pos = snprintf(version, sizeof(version), "%s %02d.%02d",
RTE_VER_PREFIX,
-   RTE_VER_MAJOR,
-   RTE_VER_MINOR,
+   RTE_REL_YEAR,
+   RTE_REL_MONTH);
+   if (RTE_VER_PATCH_LEVEL > 0)
+   pos += snprintf(version + pos, sizeof(version) - pos, ".%d",
RTE_VER_PATCH_LEVEL);
-   else
-   snprintf(version, sizeof(version), "%s %d.%d.%d%s%d",
-   RTE_VER_PREFIX,
-   RTE_VER_MAJOR,
-   RTE_VER_MINOR,
-   RTE_VER_PATCH_LEVEL,
+   if (strlen(RTE_VER_SUFFIX) > 0)
+   pos += snprintf(version + pos, sizeof(version) - pos, "%s%d",
RTE_VER_SUFFIX,
RTE_VER_PATCH_RELEASE < 16 ?
RTE_VER_PATCH_RELEASE :
-- 
2.5.0
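
For readers following the thread, the formatting rule described in this
patch (two-digit YY.MM, patch level only when non-zero, suffix only when
present) can be sketched in isolation as below. format_version() is a
hypothetical helper that takes its inputs as parameters; the patch itself
builds the string from the RTE_* macros inside rte_version(), and the
"RTE" prefix used in the usage note is just an example value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a YY.MM version string into buf and return buf.
 * patch:      appended as ".N" only when greater than zero.
 * suffix, rc: appended as e.g. "-rc1" only when suffix is non-empty.
 * Hypothetical helper, not part of the patch.
 */
static const char *
format_version(char *buf, size_t len, const char *prefix,
	       int year, int month, int patch, const char *suffix, int rc)
{
	int pos;

	assert(buf != NULL && len > 0);
	pos = snprintf(buf, len, "%s %02d.%02d", prefix, year, month);
	/* Only keep appending while the previous write was not truncated. */
	if (pos > 0 && (size_t)pos < len && patch > 0)
		pos += snprintf(buf + pos, len - pos, ".%d", patch);
	if (pos > 0 && (size_t)pos < len && strlen(suffix) > 0)
		snprintf(buf + pos, len - pos, "%s%d", suffix, rc);
	return buf;
}
```

Called with year 16, month 4, patch 0 and an empty suffix this yields
"RTE 16.04"; with suffix "-rc" and rc 1 it yields "RTE 16.04-rc1"; with
patch 1 and no suffix it yields "RTE 16.04.1".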

[dpdk-dev] [PATCH 1/3] version: switch to year/month version numbers

2015-12-21 Thread Bruce Richardson

As discussed on list, switch numbering scheme to be based on year/month.
Release 2.3 then becomes 16.04.

Ref: http://dpdk.org/ml/archives/dev/2015-December/030336.html

Signed-off-by: Bruce Richardson 
---
 lib/librte_eal/common/include/rte_version.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_version.h b/lib/librte_eal/common/include/rte_version.h
index 6b1890e..f1c7b98 100644
--- a/lib/librte_eal/common/include/rte_version.h
+++ b/lib/librte_eal/common/include/rte_version.h
@@ -55,12 +55,12 @@ extern "C" {
 /**
  * Major version number i.e. the x in x.y.z
  */
-#define RTE_VER_MAJOR 2
+#define RTE_VER_MAJOR 16

 /**
  * Minor version number i.e. the y in x.y.z
  */
-#define RTE_VER_MINOR 3
+#define RTE_VER_MINOR 4

 /**
  * Patch level number i.e. the z in x.y.z
-- 
2.5.0

[dpdk-dev] [PATCH 0/3] switch to using YY.MM version numbers

2015-12-21 Thread Bruce Richardson

As discussed on the list, e.g. on threads:
 http://dpdk.org/ml/archives/dev/2015-December/030336.html
 http://dpdk.org/ml/archives/dev/2015-December/030551.html

switch the release number from 2.3 to 16.04 to have a year/month
based numbering scheme.


Bruce Richardson (3):
  version: switch to year/month version numbers
  version: adjust printing for new version scheme
  doc: rename release 2.3 to 16.04

 doc/guides/rel_notes/index.rst  |  2 +-
 doc/guides/rel_notes/release_16_04.rst  | 83 +
 doc/guides/rel_notes/release_2_3.rst| 76 --
 lib/librte_eal/common/include/rte_version.h | 27 +-
 4 files changed, 97 insertions(+), 91 deletions(-)
 create mode 100644 doc/guides/rel_notes/release_16_04.rst
 delete mode 100644 doc/guides/rel_notes/release_2_3.rst

-- 
2.5.0

[dpdk-dev] [PATCH v6 4/4] example/vhost: add virtio offload test in vhost sample

2015-12-21 Thread Jijiang Liu

Change the vhost sample code to test the virtio offload feature.

These changes include:

1. Add two test options: tx-csum and tso.

2. Add the virtio_tx_offload() function to test the vhost TX offload feature
for the VM to NIC case.

For the VM to VM case this function does not need to be called; the reason
is explained in patch 2.

Signed-off-by: Jijiang Liu 
---
 examples/vhost/main.c |  105 +++-
 1 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 044c680..210e631 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -51,6 +51,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 

 #include "main.h"

@@ -198,6 +201,13 @@ typedef enum {
 static uint32_t enable_stats = 0;
 /* Enable retries on RX. */
 static uint32_t enable_retry = 1;
+
+/* Enable TX checksum offload (off by default) */
+static uint32_t enable_tx_csum;
+
+/* Enable TSO offload (off by default) */
+static uint32_t enable_tso;
+
 /* Specify timeout (in useconds) between retries on RX. */
 static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
 /* Specify the number of retries on RX. */
@@ -428,6 +438,14 @@ port_init(uint8_t port)

if (port >= rte_eth_dev_count()) return -1;

+   if (enable_tx_csum == 0)
+   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_CSUM);
+
+   if (enable_tso == 0) {
+   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO4);
+   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_HOST_TSO6);
+   }
+
rx_rings = (uint16_t)dev_info.max_rx_queues;
/* Configure ethernet device. */
retval = rte_eth_dev_configure(port, rx_rings, tx_rings, &port_conf);
@@ -563,7 +581,9 @@ us_vhost_usage(const char *prgname)
"   --rx-desc-num [0-N]: the number of descriptors on rx, "
"used only when zero copy is enabled.\n"
"   --tx-desc-num [0-N]: the number of descriptors on tx, "
-   "used only when zero copy is enabled.\n",
+   "used only when zero copy is enabled.\n"
+   "   --tx-csum [0|1] disable/enable TX checksum offload.\n"
+   "   --tso [0|1] disable/enable TCP segment offload.\n",
   prgname);
 }

@@ -589,6 +609,8 @@ us_vhost_parse_args(int argc, char **argv)
{"zero-copy", required_argument, NULL, 0},
{"rx-desc-num", required_argument, NULL, 0},
{"tx-desc-num", required_argument, NULL, 0},
+   {"tx-csum", required_argument, NULL, 0},
+   {"tso", required_argument, NULL, 0},
{NULL, 0, 0, 0},
};

@@ -643,6 +665,28 @@ us_vhost_parse_args(int argc, char **argv)
}
}

+   /* Enable/disable TX checksum offload. */
+   if (!strncmp(long_option[option_index].name, "tx-csum", MAX_LONG_OPT_SZ)) {
+   ret = parse_num_opt(optarg, 1);
+   if (ret == -1) {
+   RTE_LOG(INFO, VHOST_CONFIG, "Invalid argument for tx-csum [0|1]\n");
+   us_vhost_usage(prgname);
+   return -1;
+   } else
+   enable_tx_csum = ret;
+   }
+
+   /* Enable/disable TSO offload. */
+   if (!strncmp(long_option[option_index].name, "tso", MAX_LONG_OPT_SZ)) {
+   ret = parse_num_opt(optarg, 1);
+   if (ret == -1) {
+   RTE_LOG(INFO, VHOST_CONFIG, "Invalid argument for tso [0|1]\n");
+   us_vhost_usage(prgname);
+   return -1;
+   } else
+   enable_tso = ret;
+   }
+
/* Specify the retries delay time (in useconds) on RX. */
if (!strncmp(long_option[option_index].name, "rx-retry-delay", MAX_LONG_OPT_SZ)) {
ret = parse_num_opt(optarg, INT32_MAX);
@@ -1101,6 +1145,58 @@ find_local_dest(struct virtio_net *dev, struct rte_mbuf *m,
return 0;
 }

+static uint16_t
+get_psd_sum(void *l3_hdr, uint64_t ol_flags)
+{
+   if (ol_flags & PKT_TX_IPV4)
+   return rte_ipv4_phdr_cksum(l3_hdr, ol_flags);
+   else /* assume ethertype == ETHER_TYPE_IPv6 */
+   return rte_ipv6_phdr_cksum(l3_hdr, ol_flags);
+}
+
+static void virtio_tx_offload(struct rte_mbuf *m)
+{
+   void *l3_hdr;
+   struct ipv4_hdr *ipv4_hdr = NULL;
+   struct tcp_hdr *tcp_hdr = NULL;
+   struct udp_hdr *udp_hdr = NULL;
+   struct sctp_hdr *sctp_hdr = NULL;

[dpdk-dev] [PATCH v6 3/4] sample/vhost: remove the ipv4_hdr structure defination

2015-12-21 Thread Jijiang Liu

Remove the ipv4_hdr structure definition from the vhost sample.

The same structure is already defined in the rte_ip.h file, so remove the
definition from the sample and include that header file instead.

Signed-off-by: Jijiang Liu 
---
 examples/vhost/main.c |   15 +--
 1 files changed, 1 insertions(+), 14 deletions(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index c081b18..044c680 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "main.h"

@@ -292,20 +293,6 @@ struct vlan_ethhdr {
__be16  h_vlan_encapsulated_proto;
 };

-/* IPv4 Header */
-struct ipv4_hdr {
-   uint8_t  version_ihl;   /**< version and header length */
-   uint8_t  type_of_service;   /**< type of service */
-   uint16_t total_length;  /**< length of packet */
-   uint16_t packet_id; /**< packet ID */
-   uint16_t fragment_offset;   /**< fragmentation offset */
-   uint8_t  time_to_live;  /**< time to live */
-   uint8_t  next_proto_id; /**< protocol ID */
-   uint16_t hdr_checksum;  /**< header checksum */
-   uint32_t src_addr;  /**< source address */
-   uint32_t dst_addr;  /**< destination address */
-} __attribute__((__packed__));
-
 /* Header lengths. */
 #define VLAN_HLEN   4
 #define VLAN_ETH_HLEN   18
-- 
1.7.7.6

[dpdk-dev] [PATCH v6 2/4] vhost/lib: add guest offload setting

2015-12-21 Thread Jijiang Liu

Add guest offload setting in vhost lib.

Refer to the feature bits description in the Virtual I/O Device (VIRTIO)
Version 1.0 specification:

1. VIRTIO_NET_F_GUEST_CSUM (1) Driver handles packets with partial checksum.

2. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
VIRTIO_NET_HDR_F_NEEDS_CSUM bit in flags MAY be set: if so, the checksum on
the packet is incomplete and csum_start and csum_offset indicate how to
calculate it (see Packet Transmission point 1).

3. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were negotiated, then
gso_type MAY be something other than VIRTIO_NET_HDR_GSO_NONE, and gso_size
field indicates the desired MSS (see Packet Transmission point 2).
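
To make point 2 concrete, here is a rough sketch of how a receiver of such
a header could complete a partial checksum using csum_start and
csum_offset. It is not part of this patch, which only fills in the header
fields; csum_fold() and complete_csum() are hypothetical helpers, and the
sketch assumes the checksum field already holds the pseudo-header sum, as
the specification's Packet Transmission section describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of len bytes, folded to 16 bits (RFC 1071 style). */
static uint16_t
csum_fold(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8; /* pad odd byte with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Complete a partial checksum: sum everything from csum_start to the end
 * of the packet (the checksum field is expected to hold the pseudo-header
 * sum already) and store the complement at csum_start + csum_offset.
 * Hypothetical helper for illustration only.
 */
static void
complete_csum(uint8_t *pkt, size_t pkt_len,
	      uint16_t csum_start, uint16_t csum_offset)
{
	size_t field = (size_t)csum_start + csum_offset;
	uint16_t csum;

	assert(pkt != NULL && field + 2 <= pkt_len);
	csum = (uint16_t)~csum_fold(pkt + csum_start, pkt_len - csum_start);
	pkt[field] = (uint8_t)(csum >> 8);
	pkt[field + 1] = (uint8_t)(csum & 0xff);
}
```

For a TCP packet, csum_start would be l2_len + l3_len and csum_offset would
be offsetof(struct tcp_hdr, cksum), which are the values the enqueue path
in this patch writes into virtio_net_hdr.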

To support these features, the following changes are made:

1. Extend the 'VHOST_SUPPORTED_FEATURES' macro to add negotiation of the
offload features.

2. Enqueue these offloads: convert the relevant fields in the mbuf to the
corresponding fields in virtio_net_hdr.

Some further explanation of the implementation follows.

For the VM2VM case there is no need to compute the checksum, because we
consider the data reliable enough. Setting VIRTIO_NET_HDR_F_NEEDS_CSUM
at the RX side lets the TCP layer bypass checksum validation,
so that the RX side can still receive the packet.

In terms of us-vhost, at the vhost RX side the offload information is
inherited from the mbuf, which in turn inherited it from the TX side. If that
information is still available at the RX side, the packet came from another VM
on the same host, so it is safe to set VIRTIO_NET_HDR_F_NEEDS_CSUM and skip
checksum validation.

Signed-off-by: Jijiang Liu 
---
 lib/librte_vhost/vhost_rxtx.c |   47 +++-
 lib/librte_vhost/virtio-net.c |5 +++-
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 47d5f85..9d97e19 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -54,6 +54,44 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
 }

+static void
+virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr)
+{
+   memset(net_hdr, 0, sizeof(struct virtio_net_hdr));
+
+   if (m_buf->ol_flags & PKT_TX_L4_MASK) {
+   net_hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+   net_hdr->csum_start = m_buf->l2_len + m_buf->l3_len;
+
+   switch (m_buf->ol_flags & PKT_TX_L4_MASK) {
+   case PKT_TX_TCP_CKSUM:
+   net_hdr->csum_offset = (offsetof(struct tcp_hdr,
+   cksum));
+   break;
+   case PKT_TX_UDP_CKSUM:
+   net_hdr->csum_offset = (offsetof(struct udp_hdr,
+   dgram_cksum));
+   break;
+   case PKT_TX_SCTP_CKSUM:
+   net_hdr->csum_offset = (offsetof(struct sctp_hdr,
+   cksum));
+   break;
+   }
+   }
+
+   if (m_buf->ol_flags & PKT_TX_TCP_SEG) {
+   if (m_buf->ol_flags & PKT_TX_IPV4)
+   net_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+   else
+   net_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+   net_hdr->gso_size = m_buf->tso_segsz;
+   net_hdr->hdr_len = m_buf->l2_len + m_buf->l3_len
+   + m_buf->l4_len;
+   }
+
+   return;
+}
+
 /**
  * This function adds buffers to the virtio devices RX virtqueue. Buffers can
  * be received from the physical port or from another virtio device. A packet
@@ -67,7 +105,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 {
struct vhost_virtqueue *vq;
struct vring_desc *desc;
-   struct rte_mbuf *buff;
+   struct rte_mbuf *buff, *first_buff;
/* The virtio_hdr is initialised to 0. */
struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
uint64_t buff_addr = 0;
@@ -139,6 +177,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
desc = &vq->desc[head[packet_success]];

buff = pkts[packet_success];
+   first_buff = buff;

/* Convert from gpa to vva (guest physical addr -> vhost virtual addr) */
buff_addr = gpa_to_vva(dev, desc->addr);
@@ -221,7 +260,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

if (unlikely(uncompleted_pkt == 1))
continue;
-
+
+   virtio_enqueue_offload(first_buff, &virtio_hdr.hdr);
+
rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
(const void *)&virtio_hdr, vq->vhost_hlen);

@@ -295,6 +336,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,

[dpdk-dev] [PATCH v6 1/4] vhost/lib: add vhost TX offload capabilities in vhost lib

2015-12-21 Thread Jijiang Liu

Add vhost TX offload (CSUM and TSO) support capabilities in vhost lib.

Refer to the feature bits in the Virtual I/O Device (VIRTIO) Version 1.0
specification:

VIRTIO_NET_F_CSUM (0) Device handles packets with partial checksum. This
"checksum offload" is a common feature on modern network cards.
VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.
VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.

To support these features, the following changes are made:

1. Extend the 'VHOST_SUPPORTED_FEATURES' macro to add negotiation of the
offload features.

2. Dequeue TX offload: convert the fields in virtio_net_hdr to the related
fields in the mbuf.


Signed-off-by: Jijiang Liu 
---
 lib/librte_vhost/vhost_rxtx.c |  103 +
 lib/librte_vhost/virtio-net.c |6 ++-
 2 files changed, 108 insertions(+), 1 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9322ce6..47d5f85 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -37,7 +37,12 @@

 #include 
 #include 
+#include 
+#include 
 #include 
+#include 
+#include 
+#include 

 #include "vhost-net.h"

@@ -568,6 +573,97 @@ rte_vhost_enqueue_burst(struct virtio_net *dev, uint16_t queue_id,
return virtio_dev_rx(dev, queue_id, pkts, count);
 }

+static void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+   struct ipv4_hdr *ipv4_hdr;
+   struct ipv6_hdr *ipv6_hdr;
+   void *l3_hdr = NULL;
+   struct ether_hdr *eth_hdr;
+   uint16_t ethertype;
+
+   eth_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
+
+   m->l2_len = sizeof(struct ether_hdr);
+   ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+   if (ethertype == ETHER_TYPE_VLAN) {
+   struct vlan_hdr *vlan_hdr = (struct vlan_hdr *)(eth_hdr + 1);
+
+   m->l2_len += sizeof(struct vlan_hdr);
+   ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+   }
+
+   l3_hdr = (char *)eth_hdr + m->l2_len;
+
+   switch (ethertype) {
+   case ETHER_TYPE_IPv4:
+   ipv4_hdr = (struct ipv4_hdr *)l3_hdr;
+   *l4_proto = ipv4_hdr->next_proto_id;
+   m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+   *l4_hdr = (char *)l3_hdr + m->l3_len;
+   m->ol_flags |= PKT_TX_IPV4;
+   break;
+   case ETHER_TYPE_IPv6:
+   ipv6_hdr = (struct ipv6_hdr *)l3_hdr;
+   *l4_proto = ipv6_hdr->proto;
+   m->l3_len = sizeof(struct ipv6_hdr);
+   *l4_hdr = (char *)l3_hdr + m->l3_len;
+   m->ol_flags |= PKT_TX_IPV6;
+   break;
+   default:
+   m->l3_len = 0;
+   *l4_proto = 0;
+   break;
+   }
+}
+
+static inline void __attribute__((always_inline))
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+   uint16_t l4_proto = 0;
+   void *l4_hdr = NULL;
+   struct tcp_hdr *tcp_hdr = NULL;
+
+   parse_ethernet(m, &l4_proto, &l4_hdr);
+   if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+   switch (hdr->csum_offset) {
+   case (offsetof(struct tcp_hdr, cksum)):
+   if (l4_proto == IPPROTO_TCP)
+   m->ol_flags |= PKT_TX_TCP_CKSUM;
+   break;
+   case (offsetof(struct udp_hdr, dgram_cksum)):
+   if (l4_proto == IPPROTO_UDP)
+   m->ol_flags |= PKT_TX_UDP_CKSUM;
+   break;
+   case (offsetof(struct sctp_hdr, cksum)):
+   if (l4_proto == IPPROTO_SCTP)
+   m->ol_flags |= PKT_TX_SCTP_CKSUM;
+   break;
+   default:
+   break;
+   }
+   }
+   }
+
+   if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+   switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+   case VIRTIO_NET_HDR_GSO_TCPV4:
+   case VIRTIO_NET_HDR_GSO_TCPV6:
+   tcp_hdr = (struct tcp_hdr *)l4_hdr;
+   m->ol_flags |= PKT_TX_TCP_SEG;
+   m->tso_segsz = hdr->gso_size;
+   m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+   break;
+   default:
+   RTE_LOG(WARNING, VHOST_DATA,
+   "unsupported gso type %u.\n", hdr->gso_type);
+   break;
+   }
+   }
+}
+
 uint16_t
 rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -576,11 +672,13 @@

[dpdk-dev] [PATCH v6 0/4] add virtio offload support in us-vhost

2015-12-21 Thread Jijiang Liu

Add virtio offload support in us-vhost.

The patch set adds feature negotiation of checksum and TSO between us-vhost
and a vanilla Linux virtio guest, adds support for these offload features in
the vhost lib, and changes the vhost sample to test them.

In short, this patch set supports the following:

 1. DPDK vhost CSUM & TSO for VM2NIC case

 2. CSUM and TSO support between legacy virtio-net and DPDK vhost for VM2VM and
NIC2VM cases.

v6 change:
  Rebase on the latest code.

v5 changes:
  Add clearer descriptions to explain these changes.
  Reset the 'virtio_net_hdr' value in the virtio_enqueue_offload() function.
  Reorganize patches.


v4 change:
  remove virtio-net change, only keep vhost changes.
  add guest TX offload capabilities to support VM to VM case.
  split the cleanup code as a separate patch.

v3 change:
  rebase on the latest code.

v2 change:
  fill virtio device information for TX offloads.

Jijiang Liu (4):
  add vhost offload capabilities
  remove ipv4_hdr structure from vhost sample.
  add guest offload setting in the vhost lib.
  change vhost application to test checksum and TSO for VM to NIC case

 examples/vhost/main.c |  120 -
 lib/librte_vhost/vhost_rxtx.c |  150 -
 lib/librte_vhost/virtio-net.c |9 ++-
 3 files changed, 259 insertions(+), 20 deletions(-)

-- 
1.7.7.6

[dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2015-12-21 Thread Xie, Huawei

On 12/19/2015 1:32 AM, Stephen Hemminger wrote:
> On Fri, 18 Dec 2015 10:44:02 +
> "Ananyev, Konstantin"  wrote:
>
>>
>>> -Original Message-
>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
>>> Sent: Friday, December 18, 2015 5:01 AM
>>> To: Xie, Huawei
>>> Cc: dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] mbuf: provide rte_pktmbuf_alloc_bulk 
>>> API
>>>
>>> On Mon, 14 Dec 2015 09:14:41 +0800
>>> Huawei Xie  wrote:
>>>
 v2 changes:
  unroll the loop a bit to help the performance

 rte_pktmbuf_alloc_bulk allocates a bulk of packet mbufs.

 There is related thread about this bulk API.
 http://dpdk.org/dev/patchwork/patch/4718/
 Thanks to Konstantin's loop unrolling.

 Signed-off-by: Gerald Rogers 
 Signed-off-by: Huawei Xie 
 Acked-by: Konstantin Ananyev 
 ---
  lib/librte_mbuf/rte_mbuf.h | 50 
 ++
  1 file changed, 50 insertions(+)

 diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
 index f234ac9..4e209e0 100644
 --- a/lib/librte_mbuf/rte_mbuf.h
 +++ b/lib/librte_mbuf/rte_mbuf.h
 @@ -1336,6 +1336,56 @@ static inline struct rte_mbuf 
 *rte_pktmbuf_alloc(struct rte_mempool *mp)
  }

  /**
 + * Allocate a bulk of mbufs, initialize refcnt and reset the fields to 
 default
 + * values.
 + *
 + *  @param pool
 + *The mempool from which mbufs are allocated.
 + *  @param mbufs
 + *Array of pointers to mbufs
 + *  @param count
 + *Array size
 + *  @return
 + *   - 0: Success
 + */
 +static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 +   struct rte_mbuf **mbufs, unsigned count)
 +{
 +  unsigned idx = 0;
 +  int rc;
 +
 +  rc = rte_mempool_get_bulk(pool, (void **)mbufs, count);
 +  if (unlikely(rc))
 +  return rc;
 +
 +  switch (count % 4) {
 +  while (idx != count) {
 +  case 0:
 +  RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 3:
 +  RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 2:
 +  RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  case 1:
 +  RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(mbufs[idx]) == 0);
 +  rte_mbuf_refcnt_set(mbufs[idx], 1);
 +  rte_pktmbuf_reset(mbufs[idx]);
 +  idx++;
 +  }
 +  }
 +  return 0;
 +}
>>> This is weird. Why not just use Duff's device in a more normal manner.
>> But it is a sort of Duff's method.
>> Not sure what looks weird to you here?
>> while () {} instead of do {} while();?
>> Konstantin
>>
>>
>>
> It is unusual to have cases not associated with block of the switch.
> Unusual to me means, "not used commonly in most code".
>
> Since you are jumping into the loop, might make more sense as a do { } while()
>
Stephen:
How about we move the while a bit:

switch (count % 4) {
case 0: while (idx != count) {
        ... reset ...
case 3:
        ... reset ...
case 2:
        ... reset ...
case 1:
        ... reset ...
        }
}

With do {} while, we would probably need one extra check for whether count is
zero; Duff's original implementation assumes that count isn't zero. With the
while loop, we save that line of code.
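
To make the count == 0 point concrete, here is a minimal standalone sketch of
both forms. It is not the patch code: reset_one() is a hypothetical stand-in
for the per-mbuf refcnt/reset work, and the function names are made up for
illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mbuf reset work done in the patch. */
static inline void reset_one(int *slot)
{
	assert(slot != NULL);
	*slot = 1;
}

/*
 * while form: the switch jumps into the loop body to handle count % 4
 * leftover elements first, then the loop runs four resets per pass.
 * For count == 0 the switch lands on "case 0", the loop condition is
 * false at once, and nothing is touched.
 */
static unsigned reset_bulk_while(int *slots, unsigned count)
{
	unsigned idx = 0;

	switch (count % 4) {
	case 0:
		while (idx != count) {
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 3:
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 2:
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 1:
			reset_one(&slots[idx++]);
		}
	}
	return idx;
}

/*
 * do/while form: the body always runs at least once, so count == 0
 * needs the explicit early return below.
 */
static unsigned reset_bulk_do(int *slots, unsigned count)
{
	unsigned idx = 0;

	if (count == 0)
		return 0;

	switch (count % 4) {
	case 0:
		do {
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 3:
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 2:
			reset_one(&slots[idx++]);
			/* fallthrough */
	case 1:
			reset_one(&slots[idx++]);
		} while (idx != count);
	}
	return idx;
}
```

Both functions return the number of elements they reset, so a caller can
compare the result against count. The only behavioural difference is the
explicit zero check that the do/while form requires.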

[dpdk-dev] tcpdump support in DPDK 2.3

2015-12-21 Thread Matthew Hall

On Mon, Dec 21, 2015 at 04:17:26PM +, Gray, Mark D wrote:
> Is tcpdump used in large production cloud environments? I would have 
> thought other less intrusive (and less manual) tools would be used? Isn't
> that one of the benefits of SDN.

tcpdump, tshark, wireshark, libpcap, etc. have been used every single place I 
ever worked, including in production under heavy load.

This is because nobody wants to redo the library of many tens of thousands of 
hours of protocol dissectors.

This is also why I am trying to point out what is required to get a solution 
that I am confident will really work when people are counting on it, which I 
am concerned the current proposals do not cover.

Matthew.

[dpdk-dev] VFIO no-iommu

2015-12-21 Thread Alex Williamson

On Mon, 2015-12-21 at 11:46 +, Yigit, Ferruh wrote:
> On Fri, Dec 18, 2015 at 02:50:17PM -0700, Alex Williamson wrote:
> > On Fri, 2015-12-18 at 07:38 -0700, Alex Williamson wrote:
> > > On Fri, 2015-12-18 at 10:43 +, Yigit, Ferruh wrote:
> > > > On Thu, Dec 17, 2015 at 09:43:59AM -0700, Alex Williamson
> > > > wrote:
> > > > <...>
> > > > > > > > > > 
> > > > > > > > > > Also I need to disable VFIO_CHECK_EXTENSION ioctl,
> > > > > > > > > > because in
> > > > > > > > > > vfio
> > > > > > > > > > module,
> > > > > > > > > > container->noiommu is not set before doing a
> > > > > > > > > > vfio_group_set_container()
> > > > > > > > > > and vfio_for_each_iommu_driver selects wrong
> > > > > > > > > > driver.
> > > > > > > > > 
> > > > > > > > > Running CHECK_EXTENSION on a container without the
> > > > > > > > > group
> > > > > > > > > attached is
> > > > > > > > > only going to tell you what extensions vfio is
> > > > > > > > > capable
> > > > > > > > > of,
> > > > > > > > > not
> > > > > > > > > necessarily what extensions are available to you with
> > > > > > > > > that
> > > > > > > > > group.
> > > > > > > > > Is this just a general dpdk- vfio ordering bug?
> > > > > > > > 
> > > > > > > > Yes, that is how VFIO was implemented in DPDK. I was
> > > > > > > > under
> > > > > > > > the
> > > > > > > > impression that checking extension before assigning
> > > > > > > > devices
> > > > > > > > was
> > > > > > > > the
> > > > > > > > correct way to do things, so as to not to try anything
> > > > > > > > we
> > > > > > > > know
> > > > > > > > would
> > > > > > > > fail anyway. Does this imply that CHECK_EXTENSION needs
> > > > > > > > to
> > > > > > > > be
> > > > > > > > called
> > > > > > > > on both container and groups (or just on groups)?
> > > > > > > 
> > > > > > > Hmm, in Documentation/vfio.txt we do give the following
> > > > > > > algorithm:
> > > > > > > 
> > > > > > > if (ioctl(container, VFIO_GET_API_VERSION) !=
> > > > > > > VFIO_API_VERSION)
> > > > > > > /* Unknown API version */
> > > > > > > 
> > > > > > > if (!ioctl(container, VFIO_CHECK_EXTENSION,
> > > > > > > VFIO_TYPE1_IOMMU))
> > > > > > > /* Doesn't support the IOMMU driver we
> > > > > > > want.
> > > > > > > */
> > > > > > > ...
> > > > > > > 
> > > > > > > That's just going to query each iommu driver and we can't
> > > > > > > yet
> > > > > > > say
> > > > > > > whether
> > > > > > > the group the user attaches to the container later will
> > > > > > > actually
> > > > > > > support that
> > > > > > > extension until we try to do it, that would come at
> > > > > > > VFIO_SET_IOMMU.
> > > > > > > > So is
> > > > > > > it perhaps a vfio bug that we're not advertising no-iommu
> > > > > > > until
> > > > > > > the
> > > > > > > group is
> > > > > > > > attached? After all, we are capable of it with just an
> > > > > > > empty
> > > > > > > container, just
> > > > > > > like we are with type1, but we're going to fail SET_IOMMU
> > > > > > > for
> > > > > > > the
> > > > > > > wrong
> > > > > > > combination.
> > > > > > > > This is exactly the sort of thing that makes me glad we
> > > > > > > reverted
> > > > > > > it without
> > > > > > > > feedback from a working user driver. Thanks,
> > > > > > 
> > > > > > Whether it should be considered a "bug" in VFIO or "by
> > > > > > design"
> > > > > > is
> > > > > > up
> > > > > > to you, of course, but at least according to the VFIO
> > > > > > documentation,
> > > > > > we are meant to check for type 1 extension and then attach
> > > > > > devices,
> > > > > > so it would be expected to get VFIO_NOIOMMU_IOMMU marked as
> > > > > > supported
> > > > > > even without any devices attached to the container (just
> > > > > > like
> > > > > > we
> > > > > > get
> > > > > > type 1 as supported without any devices attached). Having
> > > > > > said
> > > > > > that,
> > > > > > if it was meant to attach devices first and then check the
> > > > > > extensions, then perhaps the documentation should also
> > > > > > point
> > > > > > out
> > > > > > that
> > > > > > fact (or perhaps I missed that detail in my readings of the
> > > > > > docs,
> > > > > > in
> > > > > > which case my apologies).
> > > > > 
> > > > > Hi Anatoly,
> > > > > 
> > > > > > Does the below patch make it behave more like you'd expect.
> > > > > > This
> > > > > > applies to v4.4-rc4, I'd fold this into the base patch if we
> > > > > > reincorporate it to a future kernel. Thanks,
> > > > > 
> > > > > Alex
> > > > > 
> > > > > commit 88d4dcb6b77624965f0b45b5cd305a2b4a105c94
> > > > > Author: Alex Williamson 
> > > > > > Date:   Wed Dec 16 19:02:01 2015 -0700
> > > > > 
> > > > > vfio: Fix no-iommu CHECK_EXTENSION
> > > > > 
> > > > > Previously the no-iommu iommu driver was only visible
> > > > > when
> > > > > the
> > > > > > container had an attached no-iommu group.  This means
> > > > > > that
> > > > > > CHECK_EXTENSION on an empty container couldn't report
>

[dpdk-dev] [PATCH] vfio: add no-iommu support

2015-12-21 Thread Ferruh Yigit

This is based on the patch from Alex Williamson:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=033291eccbdb
plus
http://dpdk.org/dev/patchwork/patch/9598/

This patch is intended for testing the above patches with DPDK; it is not
meant as an official DPDK patch.

Test result: DPDK runs successfully in a no-iommu environment.

Signed-off-by: Ferruh Yigit 
---
 lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c 
b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
index 74f91ba..90bba4a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c
@@ -61,6 +61,18 @@

 #ifdef VFIO_PRESENT

+/*#define VFIO_NOIOMMU*/
+
+#ifndef VFIO_NOIOMMU_IOMMU
+#define VFIO_NOIOMMU_IOMMU 8
+#endif
+
+#ifdef VFIO_NOIOMMU
+#define VFIO_IOMMU_TYPE VFIO_NOIOMMU_IOMMU
+#else
+#define VFIO_IOMMU_TYPE VFIO_TYPE1_IOMMU
+#endif
+
 #define PAGE_SIZE   (sysconf(_SC_PAGESIZE))
 #define PAGE_MASK   (~(PAGE_SIZE - 1))

@@ -71,7 +83,11 @@ EAL_REGISTER_TAILQ(rte_vfio_tailq)

 #define VFIO_DIR "/dev/vfio"
 #define VFIO_CONTAINER_PATH "/dev/vfio/vfio"
+#ifdef VFIO_NOIOMMU
+#define VFIO_GROUP_FMT "/dev/vfio/noiommu-%u"
+#else
 #define VFIO_GROUP_FMT "/dev/vfio/%u"
+#endif
 #define VFIO_GET_REGION_ADDR(x) ((uint64_t) x << 40ULL)

 /* per-process VFIO config */
@@ -212,17 +228,21 @@ pci_vfio_set_bus_master(int dev_fd)
 static int
 pci_vfio_setup_dma_maps(int vfio_container_fd)
 {
+#ifndef VFIO_NOIOMMU
const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-   int i, ret;
+   int i;
+#endif
+   int ret;

ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU,
-   VFIO_TYPE1_IOMMU);
+   VFIO_IOMMU_TYPE);
if (ret) {
RTE_LOG(ERR, EAL, "  cannot set IOMMU type, "
"error %i (%s)\n", errno, strerror(errno));
return -1;
}

+#ifndef VFIO_NOIOMMU
/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
for (i = 0; i < RTE_MAX_MEMSEG; i++) {
struct vfio_iommu_type1_dma_map dma_map;
@@ -245,6 +265,7 @@ pci_vfio_setup_dma_maps(int vfio_container_fd)
return -1;
}
}
+#endif

return 0;
 }
@@ -373,7 +394,8 @@ pci_vfio_get_container_fd(void)
}

/* check if we support IOMMU type 1 */
-   ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, 
VFIO_TYPE1_IOMMU);
+   ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION,
+   VFIO_IOMMU_TYPE);
if (ret != 1) {
if (ret < 0)
RTE_LOG(ERR, EAL, "  could not get IOMMU type, "
-- 
2.5.0
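
For readers following the CHECK_EXTENSION discussion, the container probe that
the patch above parameterizes can be sketched on its own. This is only an
illustration under assumptions: vfio_check_extension() is a hypothetical helper
name, the fallback value 8 for VFIO_NOIOMMU_IOMMU is taken from the patch above,
and the sketch needs the Linux <linux/vfio.h> header to build.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Same fallback as in the patch, for headers that predate no-iommu. */
#ifndef VFIO_NOIOMMU_IOMMU
#define VFIO_NOIOMMU_IOMMU 8
#endif

/*
 * Hypothetical helper: open a VFIO container node, verify the API
 * version, and ask whether the given IOMMU type is supported.
 * Returns 1 if supported, 0 if not, -1 on any error.
 */
static int vfio_check_extension(const char *container_path, unsigned long ext)
{
	int fd, ret;

	assert(container_path != NULL);

	fd = open(container_path, O_RDWR);
	if (fd < 0)
		return -1;

	/* Unknown API version: do not trust any further answers. */
	if (ioctl(fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
		close(fd);
		return -1;
	}

	/* On an empty container this only reports what vfio is capable of. */
	ret = ioctl(fd, VFIO_CHECK_EXTENSION, ext);
	close(fd);

	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

A caller would typically pass "/dev/vfio/vfio" together with VFIO_TYPE1_IOMMU
or VFIO_NOIOMMU_IOMMU. As noted in the thread, a positive answer on an empty
container does not guarantee that VFIO_SET_IOMMU will succeed once a group is
attached.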

[dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev functions.

2015-12-21 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Iremonger, Bernard
> Sent: Monday, December 21, 2015 11:40 AM
> To: Qiu, Michael; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev 
> functions.
> 
> Hi Michael,
> 
> > -Original Message-
> > From: Qiu, Michael
> > Sent: Monday, December 21, 2015 9:03 AM
> > To: Iremonger, Bernard ; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev
> > functions.
> >
> > On 2015/12/18 1:24, Bernard Iremonger wrote:
> > > The nb_rx_queues and nb_tx_queues are initialised before the tx_queue
> > > and rx_queue arrays are allocated. The arrays are allocated when the
> > > ethdev port is started.
> > >
> > > If any of the following functions are called before the ethdev port is
> > > started there is a segmentation fault:
> > >
> > > rte_eth_stats_get
> > > rte_eth_stats_reset
> > > rte_eth_xstats_get
> > > rte_eth_xstats_reset
> > >
> > > Fixes: af75078fece3 ("first public release")
> > > Fixes: ce757f5c9a4d ("ethdev: new method to retrieve extended
> > > statistics")
> > > Fixes: d4fef8b0d5e5 ("ethdev: expose generic and driver specific stats
> > > in xstats")
> > > Signed-off-by: Bernard Iremonger 
> > > ---
> > >  lib/librte_ether/rte_ethdev.c | 16 
> > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/lib/librte_ether/rte_ethdev.c
> > > b/lib/librte_ether/rte_ethdev.c index ed971b4..a0ee84d 100644
> > > --- a/lib/librte_ether/rte_ethdev.c
> > > +++ b/lib/librte_ether/rte_ethdev.c
> > > @@ -1441,7 +1441,10 @@ rte_eth_stats_get(uint8_t port_id, struct
> > rte_eth_stats *stats)
> > >   memset(stats, 0, sizeof(*stats));
> > >
> > >   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->stats_get, -
> > ENOTSUP);
> > > - (*dev->dev_ops->stats_get)(dev, stats);
> > > +
> > > + if (dev->data->dev_started)
> > > + (*dev->dev_ops->stats_get)(dev, stats);
> > > +

So why would it no longer be possible to get statistics on a stopped device?
Konstantin

> >
> > My question is should we mark an error or a warning here and return an
> > error so that the caller knows what happens?
> >
> > Thanks,
> > Michael
> 
> 
> 
> In other cases in rte_ethdev.c where there is a check on 
> "dev->data->dev_started" there is a RTE_PMD_DEBUG_TRACE()  line. I will
> add a RTE_PMD_DEBUG_TRACE()  line.
> The rte_eth_stats_reset() and rte_eth_xstats_reset() functions return void.
> Not sure if an error is required for the rte_eth_stats_get() and 
> rte_eth_xstats_get() functions as the stats information returned is all
> zeros at present.
> 
> Regards,
> 
> Bernard.

[dpdk-dev] VFIO no-iommu

2015-12-21 Thread Yigit, Ferruh

On Fri, Dec 18, 2015 at 02:50:17PM -0700, Alex Williamson wrote:
> On Fri, 2015-12-18 at 07:38 -0700, Alex Williamson wrote:
> > On Fri, 2015-12-18 at 10:43 +, Yigit, Ferruh wrote:
> > > On Thu, Dec 17, 2015 at 09:43:59AM -0700, Alex Williamson wrote:
> > > <...>
> > > > > > > > > 
> > > > > > > > > Also I need to disable VFIO_CHECK_EXTENSION ioctl,
> > > > > > > > > because in
> > > > > > > > > vfio
> > > > > > > > > module,
> > > > > > > > > container->noiommu is not set before doing a
> > > > > > > > > vfio_group_set_container()
> > > > > > > > > and vfio_for_each_iommu_driver selects wrong driver.
> > > > > > > > 
> > > > > > > > Running CHECK_EXTENSION on a container without the group
> > > > > > > > attached is
> > > > > > > > only going to tell you what extensions vfio is capable
> > > > > > > > of,
> > > > > > > > not
> > > > > > > > necessarily what extensions are available to you with
> > > > > > > > that
> > > > > > > > group.
> > > > > > > > Is this just a general dpdk- vfio ordering bug?
> > > > > > > 
> > > > > > > Yes, that is how VFIO was implemented in DPDK. I was under
> > > > > > > the
> > > > > > > impression that checking extension before assigning devices
> > > > > > > was
> > > > > > > the
> > > > > > > correct way to do things, so as to not to try anything we
> > > > > > > know
> > > > > > > would
> > > > > > > fail anyway. Does this imply that CHECK_EXTENSION needs to
> > > > > > > be
> > > > > > > called
> > > > > > > on both container and groups (or just on groups)?
> > > > > > 
> > > > > > Hmm, in Documentation/vfio.txt we do give the following
> > > > > > algorithm:
> > > > > > 
> > > > > > if (ioctl(container, VFIO_GET_API_VERSION) !=
> > > > > > VFIO_API_VERSION)
> > > > > > /* Unknown API version */
> > > > > > 
> > > > > > if (!ioctl(container, VFIO_CHECK_EXTENSION,
> > > > > > VFIO_TYPE1_IOMMU))
> > > > > > /* Doesn't support the IOMMU driver we want.
> > > > > > */
> > > > > > ...
> > > > > > 
> > > > > > That's just going to query each iommu driver and we can't yet
> > > > > > say
> > > > > > whether
> > > > > > the group the user attaches to the container later will
> > > > > > actually
> > > > > > support that
> > > > > > extension until we try to do it, that would come at
> > > > > > VFIO_SET_IOMMU.
> > > > > > > So is
> > > > > > it perhaps a vfio bug that we're not advertising no-iommu
> > > > > > until
> > > > > > the
> > > > > > group is
> > > > > > > attached? After all, we are capable of it with just an empty
> > > > > > container, just
> > > > > > like we are with type1, but we're going to fail SET_IOMMU for
> > > > > > the
> > > > > > wrong
> > > > > > combination.
> > > > > > > This is exactly the sort of thing that makes me glad we
> > > > > > reverted
> > > > > > it without
> > > > > > > feedback from a working user driver. Thanks,
> > > > > 
> > > > > Whether it should be considered a "bug" in VFIO or "by design"
> > > > > is
> > > > > up
> > > > > to you, of course, but at least according to the VFIO
> > > > > documentation,
> > > > > we are meant to check for type 1 extension and then attach
> > > > > devices,
> > > > > so it would be expected to get VFIO_NOIOMMU_IOMMU marked as
> > > > > supported
> > > > > even without any devices attached to the container (just like
> > > > > we
> > > > > get
> > > > > type 1 as supported without any devices attached). Having said
> > > > > that,
> > > > > if it was meant to attach devices first and then check the
> > > > > extensions, then perhaps the documentation should also point
> > > > > out
> > > > > that
> > > > > fact (or perhaps I missed that detail in my readings of the
> > > > > docs,
> > > > > in
> > > > > which case my apologies).
> > > > 
> > > > Hi Anatoly,
> > > > 
> > > > > Does the below patch make it behave more like you'd expect. This
> > > > > applies to v4.4-rc4, I'd fold this into the base patch if we
> > > > > reincorporate it to a future kernel. Thanks,
> > > > 
> > > > Alex
> > > > 
> > > > commit 88d4dcb6b77624965f0b45b5cd305a2b4a105c94
> > > > Author: Alex Williamson 
> > > > > Date:   Wed Dec 16 19:02:01 2015 -0700
> > > > 
> > > > vfio: Fix no-iommu CHECK_EXTENSION
> > > > 
> > > > Previously the no-iommu iommu driver was only visible when
> > > > the
> > > > > container had an attached no-iommu group.  This means that
> > > > > CHECK_EXTENSION on an empty container couldn't report the
> > > > > possibility
> > > > > of using VFIO_NOIOMMU_IOMMU.  We report TYPE1 whether or not
> > > > > the user
> > > > > can make use of it with the group, so this is
> > > > > inconsistent.  Add the
> > > > > no-iommu iommu to the list of iommu drivers when enabled via
> > > > > module
> > > > > option, but skip all the others if the container is attached
> > > > > to
> > > > > a
> > > > > no-iommu group.  Note that tainting is now done with the
> > > > > "unsafe"
> > > > > module callback rather than explicitly within vfio.
> > > > 
>

[dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev functions.

2015-12-21 Thread Iremonger, Bernard

Hi Michael,

> -Original Message-
> From: Qiu, Michael
> Sent: Monday, December 21, 2015 9:03 AM
> To: Iremonger, Bernard ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev
> functions.
> 
> On 2015/12/18 1:24, Bernard Iremonger wrote:
> > The nb_rx_queues and nb_tx_queues are initialised before the tx_queue
> > and rx_queue arrays are allocated. The arrays are allocated when the
> > ethdev port is started.
> >
> > If any of the following functions are called before the ethdev port is
> > started there is a segmentation fault:
> >
> > rte_eth_stats_get
> > rte_eth_stats_reset
> > rte_eth_xstats_get
> > rte_eth_xstats_reset
> >
> > Fixes: af75078fece3 ("first public release")
> > Fixes: ce757f5c9a4d ("ethdev: new method to retrieve extended
> > statistics")
> > Fixes: d4fef8b0d5e5 ("ethdev: expose generic and driver specific stats
> > in xstats")
> > Signed-off-by: Bernard Iremonger 
> > ---
> >  lib/librte_ether/rte_ethdev.c | 16 
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ether/rte_ethdev.c
> > b/lib/librte_ether/rte_ethdev.c index ed971b4..a0ee84d 100644
> > --- a/lib/librte_ether/rte_ethdev.c
> > +++ b/lib/librte_ether/rte_ethdev.c
> > @@ -1441,7 +1441,10 @@ rte_eth_stats_get(uint8_t port_id, struct
> rte_eth_stats *stats)
> > memset(stats, 0, sizeof(*stats));
> >
> > RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->stats_get, -
> ENOTSUP);
> > -   (*dev->dev_ops->stats_get)(dev, stats);
> > +
> > +   if (dev->data->dev_started)
> > +   (*dev->dev_ops->stats_get)(dev, stats);
> > +
> 
> My question is should we mark an error or a warning here and return an
> error so that the caller knows what happens?
> 
> Thanks,
> Michael



In other cases in rte_ethdev.c where there is a check on 
"dev->data->dev_started" there is a RTE_PMD_DEBUG_TRACE()  line. I will add a 
RTE_PMD_DEBUG_TRACE()  line.
The rte_eth_stats_reset() and rte_eth_xstats_reset() functions return void.
Not sure if an error is required for the rte_eth_stats_get() and 
rte_eth_xstats_get() functions as the stats information returned is all zeros 
at present.

Regards,

Bernard.

[dpdk-dev] [PATCH v2 0/6] vhost-user live migration support

2015-12-21 Thread Pavel Fedin

 Works fine.

 Tested-by: Pavel Fedin 

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, December 17, 2015 6:12 AM
> To: dev at dpdk.org
> Cc: huawei.xie at intel.com; Michael S. Tsirkin; Victor Kaplansky; Iremonger 
> Bernard; Pavel
> Fedin; Peter Xu; Yuanhan Liu; Chen Zhihui; Yang Maggie
> Subject: [PATCH v2 0/6] vhost-user live migration support
> 
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration, including used vring and desc buffer. So, this
> patch set is basically about adding vhost log support, and
> using it.
> 
> Patchset
> 
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
> 
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 6 add some feature bits related to live migration.
> 
> 
> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for
> how to setup OVS + DPDK):
> 
> [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>$ $QEMU -enable-kvm -m 1024 -smp 4 \
>-chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>-object 
> memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>-numa node,memdev=mem -mem-prealloc \
>-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>-hda fc-19-i386.img \
>-monitor telnet::,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost1, and mark it as the target
>of live migration (by adding the -incoming tcp:0: option)
> 
>$ $QEMU -enable-kvm -m 1024 -smp 4 \
>-chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>-object 
> memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>-numa node,memdev=mem -mem-prealloc \
>-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>-hda fc-19-i386.img \
>-monitor telnet::3334,server,nowait -curses \
>-incoming tcp:0:
> 
> 6. connect to VM1 monitor, and start migration:
> 
>> migrate tcp:0:
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>and the "ping" command continues running, perfectly.
> 
> 
> Cc: Chen Zhihui 
> Cc: Yang Maggie 
> ---
> Yuanhan Liu (6):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log used vring changes
>   vhost: log vring desc buffer changes
>   vhost: claim that we support GUEST_ANNOUNCE feature
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h | 36 ++-
>  lib/librte_vhost/vhost_rxtx.c | 88 
> +++
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c |  5 ++
>  7 files changed, 165 insertions(+), 30 deletions(-)
> 
> --
> 1.9.0

[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-21 Thread Tetsuya Mukawa

On 2015/12/19 3:01, Rich Lane wrote:
> I'm using the vhost callbacks and struct virtio_net with the vhost PMD in a
> few ways:
>
> 1. new_device/destroy_device: Link state change (will be covered by the
> link status interrupt).
> 2. new_device: Add first queue to datapath.
> 3. vring_state_changed: Add/remove queue to datapath.
> 4. destroy_device: Remove all queues (vring_state_changed is not called
> when qemu is killed).
> 5. new_device and struct virtio_net: Determine NUMA node of the VM.
>
> The vring_state_changed callback is necessary because the VM might not be
> using the maximum number of RX queues. If I boot Linux in the VM it will
> start out using one RX queue, which can be changed with ethtool. The DPDK
> app in the host needs to be notified that it can start sending traffic to
> the new queue.
>
> The vring_state_changed callback is also useful for guest TX queues to
> avoid reading from an inactive queue.
>
> API I'd like to have:
>
> 1. Link status interrupt.
> 2. New queue_state_changed callback. Unlike vring_state_changed this should
> cover the first queue at new_device and removal of all queues at
> destroy_device.
> 3. Per-queue or per-device NUMA node info.

Hi Rich and Yuanhan,

As Rich described, some users need more information when the interrupt
comes.
And the virtio_net structure contains that information.

I guess it's very similar to interrupt handling of normal hardware.
First, an interrupt comes, then an interrupt handler checks the status
register of the device to find out what actually happened.
In the vhost PMD case, reading the status register equals reading the
virtio_net structure.

So how about the below specification?

1. The link status interrupt of the vhost PMD will occur when new_device,
destroy_device and vring_state_changed events happen.
2. The vhost PMD provides a function to let users get the virtio_net
structure of the interrupted port.
   (Probably almost the same as "rte_eth_vhost_portid2vdev" that I described
in "[PATCH v5 3/3] vhost: Add helper function to convert port id to
virtio device pointer")

I guess what kind of information users need will depend on their
environments.
So just providing the virtio_net structure may be good.
What do you think?

Tetsuya,

>
> On Thu, Dec 17, 2015 at 8:28 PM, Tetsuya Mukawa  wrote:
>
>> On 2015/12/18 13:15, Yuanhan Liu wrote:
>>> On Fri, Dec 18, 2015 at 12:15:42PM +0900, Tetsuya Mukawa wrote:
 On 2015/12/17 20:42, Yuanhan Liu wrote:
> On Tue, Nov 24, 2015 at 06:00:01PM +0900, Tetsuya Mukawa wrote:
>> The vhost PMD will be a wrapper of vhost library, but some of vhost
>> library APIs cannot be mapped to ethdev library APIs.
>> Because of this, in some cases, we still need to use vhost library
>> APIs
>> for a port created by the vhost PMD.
>>
>> Currently, when virtio device is created and destroyed, vhost library
>> will call one of callback handlers. The vhost PMD need to use this
>> pair of callback handlers to know which virtio devices are connected
>> actually.
>> Because we can register only one pair of callbacks to vhost library,
>> if
>> the PMD use it, DPDK applications cannot have a way to know the
>> events.
>> This may break legacy DPDK applications that uses vhost library. To
>> prevent
>> it, this patch adds one more pair of callbacks to vhost library
>> especially
>> for the vhost PMD.
>> With the patch, legacy applications can use the vhost PMD even if
>> they need
>> additional specific handling for virtio device creation and
>> destruction.
>> For example, legacy application can call
>> rte_vhost_enable_guest_notification() in callbacks to change setting.
> TBH, I never liked it since the beginning. Introducing two callbacks
> for one event is a bit messy, and therefore error prone.
 I agree with you.

> I have been thinking this occasionally last few weeks, and have came
> up something that we may introduce another layer callback based on
> the vhost pmd itself, by a new API:
>
> rte_eth_vhost_register_callback().
>
> And we then call those new callback inside the vhost pmd new_device()
> and vhost pmd destroy_device() implementations.
>
> And we could have same callbacks like vhost have, but I'm thinking
> that new_device() and destroy_device() doesn't sound like a good name
> to a PMD driver. Maybe a name like "link_state_changed" is better?
>
> What do you think of that?
 Yes,  "link_state_changed" will be good.

 BTW, I thought it was ok that an DPDK app that used vhost PMD called
 vhost library APIs directly.
 But probably you may feel strangeness about it. Is this correct?
>>> Unluckily, that's true :)
>>>
 If so, how about implementing legacy status interrupt mechanism to vhost
 PMD?
 For example, an DPDK app can register callback handler like
 "examples/link_status_interrupt".

 Also, if the app doesn't call vhost

[dpdk-dev] [PATCH v5 1/3] vhost: Add callback and private data for vhost PMD

2015-12-21 Thread Tetsuya Mukawa

On 2015/12/18 19:03, Xie, Huawei wrote:
> On 12/18/2015 12:15 PM, Yuanhan Liu wrote:
>> On Fri, Dec 18, 2015 at 12:15:42PM +0900, Tetsuya Mukawa wrote:
>>> On 2015/12/17 20:42, Yuanhan Liu wrote:
 On Tue, Nov 24, 2015 at 06:00:01PM +0900, Tetsuya Mukawa wrote:
> The vhost PMD will be a wrapper of vhost library, but some of vhost
> library APIs cannot be mapped to ethdev library APIs.
> Because of this, in some cases, we still need to use vhost library APIs
> for a port created by the vhost PMD.
>
> Currently, when virtio device is created and destroyed, vhost library
> will call one of callback handlers. The vhost PMD need to use this
> pair of callback handlers to know which virtio devices are connected
> actually.
> Because we can register only one pair of callbacks to vhost library, if
> the PMD use it, DPDK applications cannot have a way to know the events.
>
> This may break legacy DPDK applications that uses vhost library. To 
> prevent
> it, this patch adds one more pair of callbacks to vhost library especially
> for the vhost PMD.
> With the patch, legacy applications can use the vhost PMD even if they 
> need
> additional specific handling for virtio device creation and destruction.
>
> For example, legacy application can call
> rte_vhost_enable_guest_notification() in callbacks to change setting.
 TBH, I never liked it since the beginning. Introducing two callbacks
 for one event is a bit messy, and therefore error prone.
>>> I agree with you.
>>>
 I have been thinking this occasionally last few weeks, and have came
 up something that we may introduce another layer callback based on
 the vhost pmd itself, by a new API:

rte_eth_vhost_register_callback().

 And we then call those new callback inside the vhost pmd new_device()
 and vhost pmd destroy_device() implementations.

 And we could have same callbacks like vhost have, but I'm thinking
 that new_device() and destroy_device() doesn't sound like a good name
 to a PMD driver. Maybe a name like "link_state_changed" is better?

 What do you think of that?
>>> Yes,  "link_state_changed" will be good.
>>>
>>> BTW, I thought it was ok that an DPDK app that used vhost PMD called
>>> vhost library APIs directly.
>>> But probably you may feel strangeness about it. Is this correct?
>> Unluckily, that's true :)
>>
>>> If so, how about implementing legacy status interrupt mechanism to vhost
>>> PMD?
>>> For example, an DPDK app can register callback handler like
>>> "examples/link_status_interrupt".
>>>
>>> Also, if the app doesn't call vhost library APIs directly,
>>> rte_eth_vhost_portid2vdev() will be needless, because the app doesn't
>>> need to handle virtio device structure anymore.
>>>
 On the other hand, I'm still thinking is that really necessary to let
 the application be able to call vhost functions like 
 rte_vhost_enable_guest_notification()
 with the vhost PMD driver?
>>> Basic concept of my patch is that vhost PMD will provides the features
>>> that vhost library provides.
>> I don't think that's necessary. Let's just treat it as a normal pmd
>> driver, having nothing to do with vhost library.
>>
>>> How about removing rte_vhost_enable_guest_notification() from "vhost
>>> library"?
>>> (I also not sure what are use cases)
>>> If we can do this, vhost PMD also doesn't need to take care of it.
>>> Or if rte_vhost_enable_guest_notification() will be removed in the
>>> future, vhost PMD is able to ignore it.
>> You could either call it in vhost-pmd (which you already have done that),
>> or ignore it in vhost-pmd, but don't remove it from vhost library.
>>
>>> Please let me correct up my thinking about your questions.
>>>  - Change concept of patch not to call vhost library APIs directly.
>>> These should be wrapped by ethdev APIs.
>>>  - Remove rte_eth_vhost_portid2vdev(), because of above concept changing.
>>>  - Implement legacy status changed interrupt to vhost PMD instead of
>>> using own callback mechanism.
>>>  - Check if we can remove rte_vhost_enable_guest_notification() from
>>> vhost library.
>> So, how about making it __fairly__ simple as the first step, to get merged
>> easily, that we don't assume the applications will call any vhost library
>> functions any more, so that we don't need the callback, and we don't need
>> the rte_eth_vhost_portid2vdev(), either. Again, just let it be a fairly
>> normal (nothing special) pmd driver.  (UNLESS, there is a real must, which
>> I don't see so far).
>>
>> Tetsuya, what do you think of that then?
>>
>>> Hi Xie,
>>>
>>> Do you know the use cases of rte_vhost_enable_guest_notification()?
> If vhost runs in loop mode, it doesn't need to be notified. You have
> wrapped vhost as the PMD, which is nice for OVS integration. If we
> require that all PMDs could be polled by select/poll, then we could use
> this API for vhost PMD,

[dpdk-dev] [PATCH 3/3] igb_uio: remove sys files for setting pci config space

2015-12-21 Thread Stephen Hemminger

On Mon, 21 Dec 2015 10:38:06 +0800
Helin Zhang  wrote:

> Sys files of 'extended_tag' and 'max_read_request_size' are
> useless, as nobody will use them for setting pci config space.
> 
> Signed-off-by: Helin Zhang 
> ---
>  doc/guides/linux_gsg/enable_func.rst  |  22 --
>  doc/guides/rel_notes/deprecation.rst  |   3 +
>  doc/guides/rel_notes/release_2_3.rst  |   6 ++
>  lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 108 
> --
>  4 files changed, 9 insertions(+), 130 deletions(-)
> 
> diff --git a/doc/guides/linux_gsg/enable_func.rst 
> b/doc/guides/linux_gsg/enable_func.rst
> index c3fa6d3..ec0e04d 100644
> --- a/doc/guides/linux_gsg/enable_func.rst
> +++ b/doc/guides/linux_gsg/enable_func.rst
> @@ -186,28 +186,6 @@ Check with the local Intel's Network Division 
> application engineers for firmware
>  The base driver to support firmware version of FVL3E will be integrated in 
> the next
>  DPDK release, so currently the validated firmware version is 4.2.6.
>  
> -Enabling Extended Tag and Setting Max Read Request Size
> -~~~
> -
> -PCI configurations of ``extended_tag`` and max _read_requ st_size have big 
> impacts on performance of small packets on 40G NIC.
> -Enabling extended_tag and setting ``max_read_request_size`` to small size 
> such as 128 bytes provide great helps to high performance of small packets.
> -
> -*   These can be done in some BIOS implementations.
> -
> -*   For other BIOS implementations, PCI configurations can be changed by 
> using command of ``setpci``, or special configurations in DPDK config file of 
> ``common_linux``.
> -
> -*   Bits 7:5 at address of 0xA8 of each PCI device is used for setting 
> the max_read_request_size,
> -and bit 8 of 0xA8 of each PCI device is used for enabling/disabling 
> the extended_tag.
> -lspci and setpci can be used to read the values of 0xA8 and then 
> write it back after being changed.
> -
> -*   In config file of common_linux, below three configurations can be 
> changed for the same purpose.
> -
> -``CONFIG_RTE_PCI_CONFIG``
> -
> -``CONFIG_RTE_PCI_EXTENDED_TAG``
> -
> -``CONFIG_RTE_PCI_MAX_READ_REQUEST_SIZE``
> -
>  Use 16 Bytes RX Descriptor Size
>  ~~~
>  
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index e94d4a2..7438f80 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -49,3 +49,6 @@ Deprecation Notices
>commands (such as RETA update in testpmd).  This should impact
>CMDLINE_PARSE_RESULT_BUFSIZE, STR_TOKEN_SIZE and RDLINE_BUF_SIZE.
>It should be integrated in release 2.3.
> +
> +* The eal function of pci_config_space_set is deprecated in release 2.3, and
> +  will be removed from 2.4.
> diff --git a/doc/guides/rel_notes/release_2_3.rst 
> b/doc/guides/rel_notes/release_2_3.rst
> index efd258b..ed10d94 100644
> --- a/doc/guides/rel_notes/release_2_3.rst
> +++ b/doc/guides/rel_notes/release_2_3.rst
> @@ -16,6 +16,12 @@ Resolved Issues
>  EAL
>  ~~~
>  
> +* **eal/linux: removed sys files for pci config space.**
> +
> +  Removed sys files of 'extended_tag' and 'max_read_request_size' and
> +  their relavant operations, as they shouldn't be done in eal for all
> +  possible devices.
> +
>  
>  Drivers
>  ~~~
> diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c 
> b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
> index f5617d2..054d053 100644
> --- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
> +++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
> @@ -40,15 +40,6 @@
>  
>  #include "compat.h"
>  
> -#ifdef RTE_PCI_CONFIG
> -#define PCI_SYS_FILE_BUF_SIZE  10
> -#define PCI_DEV_CAP_REG            0xA4
> -#define PCI_DEV_CTRL_REG   0xA8
> -#define PCI_DEV_CAP_EXT_TAG_MASK   0x20
> -#define PCI_DEV_CTRL_EXT_TAG_SHIFT 8
> -#define PCI_DEV_CTRL_EXT_TAG_MASK  (1 << PCI_DEV_CTRL_EXT_TAG_SHIFT)
> -#endif
> -
>  /**
>   * A structure describing the private information for a uio device.
>   */
> @@ -90,109 +81,10 @@ store_max_vfs(struct device *dev, struct 
> device_attribute *attr,
>   return err ? err : count;
>  }
>  
> -#ifdef RTE_PCI_CONFIG
> -static ssize_t
> -show_extended_tag(struct device *dev, struct device_attribute *attr, char 
> *buf)
> -{
> - struct pci_dev *pci_dev = to_pci_dev(dev);
> - uint32_t val = 0;
> -
> - pci_read_config_dword(pci_dev, PCI_DEV_CAP_REG, &val);
> - if (!(val & PCI_DEV_CAP_EXT_TAG_MASK)) /* Not supported */
> - return snprintf(buf, PCI_SYS_FILE_BUF_SIZE, "%s\n", "invalid");
> -
> - val = 0;
> - pci_bus_read_config_dword(pci_dev->bus, pci_dev->devfn,
> - PCI_DEV_CTRL_REG, &val);
> -
> - return snprintf(buf, PCI_SYS_FILE_BUF_SIZE, "%s\n",
> - (val & PCI_DEV_CTRL_EXT_TAG_MASK) ? "on" : "off");
> -}
> -
> -static ssize_t
>

[dpdk-dev] [PATCH 3/3] igb_uio: remove sys files for setting pci config space

2015-12-21 Thread Helin Zhang

Sys files of 'extended_tag' and 'max_read_request_size' are
useless, as nobody will use them for setting pci config space.

Signed-off-by: Helin Zhang 
---
 doc/guides/linux_gsg/enable_func.rst  |  22 --
 doc/guides/rel_notes/deprecation.rst  |   3 +
 doc/guides/rel_notes/release_2_3.rst  |   6 ++
 lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 108 --
 4 files changed, 9 insertions(+), 130 deletions(-)

diff --git a/doc/guides/linux_gsg/enable_func.rst 
b/doc/guides/linux_gsg/enable_func.rst
index c3fa6d3..ec0e04d 100644
--- a/doc/guides/linux_gsg/enable_func.rst
+++ b/doc/guides/linux_gsg/enable_func.rst
@@ -186,28 +186,6 @@ Check with the local Intel's Network Division application 
engineers for firmware
 The base driver to support firmware version of FVL3E will be integrated in the 
next
 DPDK release, so currently the validated firmware version is 4.2.6.

-Enabling Extended Tag and Setting Max Read Request Size
-~~~
-
-PCI configurations of ``extended_tag`` and max _read_requ st_size have big 
impacts on performance of small packets on 40G NIC.
-Enabling extended_tag and setting ``max_read_request_size`` to small size such 
as 128 bytes provide great helps to high performance of small packets.
-
-*   These can be done in some BIOS implementations.
-
-*   For other BIOS implementations, PCI configurations can be changed by using 
command of ``setpci``, or special configurations in DPDK config file of 
``common_linux``.
-
-*   Bits 7:5 at address of 0xA8 of each PCI device is used for setting the 
max_read_request_size,
-and bit 8 of 0xA8 of each PCI device is used for enabling/disabling 
the extended_tag.
-lspci and setpci can be used to read the values of 0xA8 and then write 
it back after being changed.
-
-*   In config file of common_linux, below three configurations can be 
changed for the same purpose.
-
-``CONFIG_RTE_PCI_CONFIG``
-
-``CONFIG_RTE_PCI_EXTENDED_TAG``
-
-``CONFIG_RTE_PCI_MAX_READ_REQUEST_SIZE``
-
 Use 16 Bytes RX Descriptor Size
 ~~~

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index e94d4a2..7438f80 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -49,3 +49,6 @@ Deprecation Notices
   commands (such as RETA update in testpmd).  This should impact
   CMDLINE_PARSE_RESULT_BUFSIZE, STR_TOKEN_SIZE and RDLINE_BUF_SIZE.
   It should be integrated in release 2.3.
+
+* The eal function of pci_config_space_set is deprecated in release 2.3, and
+  will be removed from 2.4.
diff --git a/doc/guides/rel_notes/release_2_3.rst 
b/doc/guides/rel_notes/release_2_3.rst
index efd258b..ed10d94 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -16,6 +16,12 @@ Resolved Issues
 EAL
 ~~~

+* **eal/linux: removed sys files for pci config space.**
+
+  Removed sys files of 'extended_tag' and 'max_read_request_size' and
+  their relavant operations, as they shouldn't be done in eal for all
+  possible devices.
+

 Drivers
 ~~~
diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c 
b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
index f5617d2..054d053 100644
--- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
+++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
@@ -40,15 +40,6 @@

 #include "compat.h"

-#ifdef RTE_PCI_CONFIG
-#define PCI_SYS_FILE_BUF_SIZE  10
-#define PCI_DEV_CAP_REG            0xA4
-#define PCI_DEV_CTRL_REG   0xA8
-#define PCI_DEV_CAP_EXT_TAG_MASK   0x20
-#define PCI_DEV_CTRL_EXT_TAG_SHIFT 8
-#define PCI_DEV_CTRL_EXT_TAG_MASK  (1 << PCI_DEV_CTRL_EXT_TAG_SHIFT)
-#endif
-
 /**
  * A structure describing the private information for a uio device.
  */
@@ -90,109 +81,10 @@ store_max_vfs(struct device *dev, struct device_attribute 
*attr,
return err ? err : count;
 }

-#ifdef RTE_PCI_CONFIG
-static ssize_t
-show_extended_tag(struct device *dev, struct device_attribute *attr, char *buf)
-{
-   struct pci_dev *pci_dev = to_pci_dev(dev);
-   uint32_t val = 0;
-
-   pci_read_config_dword(pci_dev, PCI_DEV_CAP_REG, &val);
-   if (!(val & PCI_DEV_CAP_EXT_TAG_MASK)) /* Not supported */
-   return snprintf(buf, PCI_SYS_FILE_BUF_SIZE, "%s\n", "invalid");
-
-   val = 0;
-   pci_bus_read_config_dword(pci_dev->bus, pci_dev->devfn,
-   PCI_DEV_CTRL_REG, &val);
-
-   return snprintf(buf, PCI_SYS_FILE_BUF_SIZE, "%s\n",
-   (val & PCI_DEV_CTRL_EXT_TAG_MASK) ? "on" : "off");
-}
-
-static ssize_t
-store_extended_tag(struct device *dev,
-  struct device_attribute *attr,
-  const char *buf,
-  size_t count)
-{
-   struct pci_dev *pci_dev = to_pci_dev(dev);
-   uint32_t val = 0, enable;
-
-   if (strncmp(buf, "on", 2) == 0)
-   enable = 1;
-   else if

[dpdk-dev] [PATCH 2/3] eal: remove pci config of extended tag

2015-12-21 Thread Helin Zhang

Remove pci configuration of 'extended tag' and 'max read request
size', as they are not required by all devices; each PMD can
configure them if necessary.
In addition, 'pci_config_space_set()' is deprecated.

Signed-off-by: Helin Zhang 
---
 config/common_linuxapp  |  7 ---
 lib/librte_eal/common/eal_common_pci.c  |  7 ---
 lib/librte_eal/common/include/rte_pci.h |  4 +-
 lib/librte_eal/linuxapp/eal/eal_pci.c   | 90 +++--
 4 files changed, 9 insertions(+), 99 deletions(-)

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74bc515..f52baf9 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -115,13 +115,6 @@ CONFIG_RTE_MALLOC_DEBUG=n
 CONFIG_RTE_EAL_PMD_PATH=""

 #
-# Special configurations in PCI Config Space for high performance
-#
-CONFIG_RTE_PCI_CONFIG=n
-CONFIG_RTE_PCI_EXTENDED_TAG=""
-CONFIG_RTE_PCI_MAX_READ_REQUEST_SIZE=0
-
-#
 # Compile Environment Abstraction Layer for linux
 #
 CONFIG_RTE_LIBRTE_EAL_LINUXAPP=y
diff --git a/lib/librte_eal/common/eal_common_pci.c 
b/lib/librte_eal/common/eal_common_pci.c
index dcfe947..63d0829 100644
--- a/lib/librte_eal/common/eal_common_pci.c
+++ b/lib/librte_eal/common/eal_common_pci.c
@@ -180,13 +180,6 @@ rte_eal_pci_probe_one_driver(struct rte_pci_driver *dr, 
struct rte_pci_device *d
}

if (dr->drv_flags & RTE_PCI_DRV_NEED_MAPPING) {
-#ifdef RTE_PCI_CONFIG
-   /*
-* Set PCIe config space for high performance.
-* Return value can be ignored.
-*/
-   pci_config_space_set(dev);
-#endif
/* map resources for devices that use igb_uio */
ret = pci_map_device(dev);
if (ret != 0)
diff --git a/lib/librte_eal/common/include/rte_pci.h 
b/lib/librte_eal/common/include/rte_pci.h
index 334c12e..8201fe8 100644
--- a/lib/librte_eal/common/include/rte_pci.h
+++ b/lib/librte_eal/common/include/rte_pci.h
@@ -489,12 +489,14 @@ int rte_eal_pci_write_config(const struct rte_pci_device 
*device,
 #ifdef RTE_PCI_CONFIG
 /**
  * Set special config space registers for performance purpose.
+ * It is deprecated, as all configurations have been moved into
+ * each PMDs respectively.
  *
  * @param dev
  *   A pointer to a rte_pci_device structure describing the device
  *   to use
  */
-void pci_config_space_set(struct rte_pci_device *dev);
+void pci_config_space_set(struct rte_pci_device *dev) __rte_deprecated;
 #endif /* RTE_PCI_CONFIG */

 #ifdef __cplusplus
diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
b/lib/librte_eal/linuxapp/eal/eal_pci.c
index bc5b5be..11de652 100644
--- a/lib/librte_eal/linuxapp/eal/eal_pci.c
+++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
@@ -482,92 +482,14 @@ error:
 }

 #ifdef RTE_PCI_CONFIG
-static int
-pci_config_extended_tag(struct rte_pci_device *dev)
-{
-   struct rte_pci_addr *loc = &dev->addr;
-   char filename[PATH_MAX];
-   char buf[BUFSIZ];
-   FILE *f;
-
-   /* not configured, let it as is */
-   if (strncmp(RTE_PCI_EXTENDED_TAG, "on", 2) != 0 &&
-   strncmp(RTE_PCI_EXTENDED_TAG, "off", 3) != 0)
-   return 0;
-
-   snprintf(filename, sizeof(filename),
-   SYSFS_PCI_DEVICES "/" PCI_PRI_FMT "/" "extended_tag",
-   loc->domain, loc->bus, loc->devid, loc->function);
-   f = fopen(filename, "rw+");
-   if (!f)
-   return -1;
-
-   fgets(buf, sizeof(buf), f);
-   if (strncmp(RTE_PCI_EXTENDED_TAG, "on", 2) == 0) {
-   /* enable Extended Tag*/
-   if (strncmp(buf, "on", 2) != 0) {
-   fseek(f, 0, SEEK_SET);
-   fputs("on", f);
-   }
-   } else {
-   /* disable Extended Tag */
-   if (strncmp(buf, "off", 3) != 0) {
-   fseek(f, 0, SEEK_SET);
-   fputs("off", f);
-   }
-   }
-   fclose(f);
-
-   return 0;
-}
-
-static int
-pci_config_max_read_request_size(struct rte_pci_device *dev)
-{
-   struct rte_pci_addr *loc = &dev->addr;
-   char filename[PATH_MAX];
-   char buf[BUFSIZ], param[BUFSIZ];
-   FILE *f;
-   /* size can be 128, 256, 512, 1024, 2048, 4096 */
-   uint32_t max_size = RTE_PCI_MAX_READ_REQUEST_SIZE;
-
-   /* not configured, let it as is */
-   if (!max_size)
-   return 0;
-
-   snprintf(filename, sizeof(filename),
-   SYSFS_PCI_DEVICES "/" PCI_PRI_FMT "/" "max_read_request_size",
-   loc->domain, loc->bus, loc->devid, loc->function);
-   f = fopen(filename, "rw+");
-   if (!f)
-   return -1;
-
-   fgets(buf, sizeof(buf), f);
-   snprintf(param, sizeof(param), "%d", max_size);
-
-   /* check if the size to be set is the same as current */
-   if (strcmp(buf, param) == 0) {
-

[dpdk-dev] [PATCH 1/3] i40e: enable extended tag

2015-12-21 Thread Helin Zhang

The PCIe feature 'Extended Tag' is important for 40G performance.
This patch enables it during each port initialization, to ensure
high performance.

Signed-off-by: Helin Zhang 
---
 doc/guides/rel_notes/release_2_3.rst |  5 +++
 drivers/net/i40e/i40e_ethdev.c   | 67 ++--
 2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/doc/guides/rel_notes/release_2_3.rst 
b/doc/guides/rel_notes/release_2_3.rst
index 99de186..efd258b 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,11 @@ DPDK Release 2.3
 New Features
 

+* **i40e: Enabled extended tag.**
+
+  It enabled extended tag by checking and writing corresponding PCI config
+  space bytes, to boost the performance.
+

 Resolved Issues
 ---
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index bf6220d..973aca8 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -273,6 +273,17 @@
 #define I40E_INSET_IPV6_TC_MASK   0x0009F00FUL
 #define I40E_INSET_IPV6_NEXT_HDR_MASK 0x000C00FFUL

+/* PCI offset for querying capability */
+#define PCI_DEV_CAP_REG            0xA4
+/* PCI offset for enabling/disabling Extended Tag */
+#define PCI_DEV_CTRL_REG   0xA8
+/* Bit mask of Extended Tag capability */
+#define PCI_DEV_CAP_EXT_TAG_MASK   0x20
+/* Bit shift of Extended Tag enable/disable */
+#define PCI_DEV_CTRL_EXT_TAG_SHIFT 8
+/* Bit mask of Extended Tag enable/disable */
+#define PCI_DEV_CTRL_EXT_TAG_MASK  (1 << PCI_DEV_CTRL_EXT_TAG_SHIFT)
+
 static int eth_i40e_dev_init(struct rte_eth_dev *eth_dev);
 static int eth_i40e_dev_uninit(struct rte_eth_dev *eth_dev);
 static int i40e_dev_configure(struct rte_eth_dev *dev);
@@ -386,7 +397,7 @@ static int i40e_dev_filter_ctrl(struct rte_eth_dev *dev,
 static int i40e_dev_get_dcb_info(struct rte_eth_dev *dev,
  struct rte_eth_dcb_info *dcb_info);
 static void i40e_configure_registers(struct i40e_hw *hw);
-static void i40e_hw_init(struct i40e_hw *hw);
+static void i40e_hw_init(struct rte_eth_dev *dev);
 static int i40e_config_qinq(struct i40e_hw *hw, struct i40e_vsi *vsi);
 static int i40e_mirror_rule_set(struct rte_eth_dev *dev,
struct rte_eth_mirror_conf *mirror_conf,
@@ -765,7 +776,7 @@ eth_i40e_dev_init(struct rte_eth_dev *dev)
i40e_clear_hw(hw);

/* Initialize the hardware */
-   i40e_hw_init(hw);
+   i40e_hw_init(dev);

/* Reset here to make sure all is clean for each PF */
ret = i40e_pf_reset(hw);
@@ -7262,13 +7273,63 @@ i40e_dev_filter_ctrl(struct rte_eth_dev *dev,
 }

 /*
+ * Check and enable Extended Tag.
+ * Enabling Extended Tag is important for 40G performance.
+ */
+static int
+i40e_enable_extended_tag(struct rte_eth_dev *dev)
+{
+   uint32_t buf = 0;
+   int ret;
+
+   ret = rte_eal_pci_read_config(dev->pci_dev, &buf, sizeof(buf),
+ PCI_DEV_CAP_REG);
+   if (ret < 0) {
+   PMD_DRV_LOG(ERR, "Failed to read PCI offset 0x%x",
+   PCI_DEV_CAP_REG);
+   return -1;
+   }
+   if (!(buf & PCI_DEV_CAP_EXT_TAG_MASK)) {
+   PMD_DRV_LOG(ERR, "Does not support Extended Tag");
+   return -1;
+   }
+
+   buf = 0;
+   ret = rte_eal_pci_read_config(dev->pci_dev, &buf, sizeof(buf),
+ PCI_DEV_CTRL_REG);
+   if (ret < 0) {
+   PMD_DRV_LOG(ERR, "Failed to read PCI offset 0x%x",
+   PCI_DEV_CTRL_REG);
+   return -1;
+   }
+   if (buf & PCI_DEV_CTRL_EXT_TAG_MASK) {
+   PMD_DRV_LOG(DEBUG, "Extended Tag has already been enabled");
+   return 0;
+   }
+   buf |= PCI_DEV_CTRL_EXT_TAG_MASK;
+   ret = rte_eal_pci_write_config(dev->pci_dev, &buf, sizeof(buf),
+  PCI_DEV_CTRL_REG);
+   if (ret < 0) {
+   PMD_DRV_LOG(ERR, "Failed to write PCI offset 0x%x",
+   PCI_DEV_CTRL_REG);
+   return -1;
+   }
+
+   return 0;
+}
+
+/*
  * As some registers wouldn't be reset unless a global hardware reset,
  * hardware initialization is needed to put those registers into an
  * expected initial state.
  */
 static void
-i40e_hw_init(struct i40e_hw *hw)
+i40e_hw_init(struct rte_eth_dev *dev)
 {
+   struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+
+   i40e_enable_extended_tag(dev);
+
/* clear the PF Queue Filter control register */
I40E_WRITE_REG(hw, I40E_PFQF_CTL_0, 0);

-- 
1.9.3

[dpdk-dev] [PATCH 0/3] i40e: enable extended tag

2015-12-21 Thread Helin Zhang

'extended tag' is important for XL710 performance, but might not be necessary
for other NICs. This series enables 'extended tag' in the i40e PMD specifically,
then removes the sys files of 'extended_tag' and 'max_read_request_size' and all
of their relevant operations, as they are not necessary for all devices.
In addition, the documentation is updated at the same time.

Helin Zhang (3):
  i40e: enable extended tag
  eal: remove pci config of extended tag
  igb_uio: remove sys files for setting pci config space

 config/common_linuxapp|   7 --
 doc/guides/linux_gsg/enable_func.rst  |  22 --
 doc/guides/rel_notes/deprecation.rst  |   3 +
 doc/guides/rel_notes/release_2_3.rst  |  11 +++
 drivers/net/i40e/i40e_ethdev.c|  67 +-
 lib/librte_eal/common/eal_common_pci.c|   7 --
 lib/librte_eal/common/include/rte_pci.h   |   4 +-
 lib/librte_eal/linuxapp/eal/eal_pci.c |  90 ++---
 lib/librte_eal/linuxapp/igb_uio/igb_uio.c | 108 --
 9 files changed, 87 insertions(+), 232 deletions(-)

-- 
1.9.3

[dpdk-dev] [PATCH] librte_ether: fix crashes in rte_ethdev functions.

2015-12-21 Thread Qiu, Michael

On 2015/12/18 1:24, Bernard Iremonger wrote:
> The nb_rx_queues and nb_tx_queues are initialised before
> the tx_queue and rx_queue arrays are allocated. The arrays
> are allocated when the ethdev port is started.
>
> If any of the following functions are called before the ethdev
> port is started there is a segmentation fault:
>
> rte_eth_stats_get
> rte_eth_stats_reset
> rte_eth_xstats_get
> rte_eth_xstats_reset
>
> Fixes: af75078fece3 ("first public release")
> Fixes: ce757f5c9a4d ("ethdev: new method to retrieve extended statistics")
> Fixes: d4fef8b0d5e5 ("ethdev: expose generic and driver specific stats in 
> xstats")
> Signed-off-by: Bernard Iremonger 
> ---
>  lib/librte_ether/rte_ethdev.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
> index ed971b4..a0ee84d 100644
> --- a/lib/librte_ether/rte_ethdev.c
> +++ b/lib/librte_ether/rte_ethdev.c
> @@ -1441,7 +1441,10 @@ rte_eth_stats_get(uint8_t port_id, struct 
> rte_eth_stats *stats)
>   memset(stats, 0, sizeof(*stats));
>  
>   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->stats_get, -ENOTSUP);
> - (*dev->dev_ops->stats_get)(dev, stats);
> +
> + if (dev->data->dev_started)
> + (*dev->dev_ops->stats_get)(dev, stats);
> +

My question is should we mark an error or a warning here and return an
error so that the caller knows what happens?

Thanks,
Michael

>   stats->rx_nombuf = dev->data->rx_mbuf_alloc_failed;
>   return 0;
>  }
> @@ -1455,7 +1458,10 @@ rte_eth_stats_reset(uint8_t port_id)
>   dev = &rte_eth_devices[port_id];
>  
>   RTE_FUNC_PTR_OR_RET(*dev->dev_ops->stats_reset);
> - (*dev->dev_ops->stats_reset)(dev);
> +
> + if (dev->data->dev_started)
> + (*dev->dev_ops->stats_reset)(dev);
> +
>   dev->data->rx_mbuf_alloc_failed = 0;
>  }
>  
> @@ -1479,7 +1485,8 @@ rte_eth_xstats_get(uint8_t port_id, struct 
> rte_eth_xstats *xstats,
>   (dev->data->nb_tx_queues * RTE_NB_TXQ_STATS);
>  
>   /* implemented by the driver */
> - if (dev->dev_ops->xstats_get != NULL) {
> + if ((dev->dev_ops->xstats_get != NULL) &&
> + (dev->data->dev_started)) {
>   /* Retrieve the xstats from the driver at the end of the
>* xstats struct.
>*/
> @@ -1548,7 +1555,8 @@ rte_eth_xstats_reset(uint8_t port_id)
>   dev = &rte_eth_devices[port_id];
>  
>   /* implemented by the driver */
> - if (dev->dev_ops->xstats_reset != NULL) {
> + if ((dev->dev_ops->xstats_reset != NULL) &&
> + (dev->data->dev_started)) {
>   (*dev->dev_ops->xstats_reset)(dev);
>   return;
>   }
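
To make the question above concrete, here is a rough sketch of what
"return an error so that the caller knows" could look like. It is not the
ethdev code: the sketch_* types and functions are hypothetical, and the
choice of -EAGAIN for a port that is not started is purely an assumption
for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the ethdev structures. */
struct sketch_stats {
	uint64_t ipackets;
	uint64_t rx_nombuf;
};

struct sketch_dev {
	int dev_started;
	uint64_t rx_mbuf_alloc_failed;
	void (*stats_get)(struct sketch_dev *dev, struct sketch_stats *stats);
};

/* Example driver callback used only to exercise the sketch. */
static void
sketch_driver_stats(struct sketch_dev *dev, struct sketch_stats *stats)
{
	(void)dev;
	stats->ipackets = 42;
}

/*
 * Variant of the guarded call that reports "not started" to the caller
 * instead of silently returning zeroed stats.
 */
static int
sketch_stats_get(struct sketch_dev *dev, struct sketch_stats *stats)
{
	memset(stats, 0, sizeof(*stats));

	if (dev->stats_get == NULL)
		return -ENOTSUP;
	if (!dev->dev_started)
		return -EAGAIN;	/* assumption: error code is illustrative */

	dev->stats_get(dev, stats);
	stats->rx_nombuf = dev->rx_mbuf_alloc_failed;
	return 0;
}
```

Whether an error or a silent zeroed result is preferable depends on how
existing callers treat the return value, which this sketch does not settle.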

[dpdk-dev] [PATCH v3] mem: calculate space left in a hugetlbfs

2015-12-21 Thread Qiu, Michael

On 2015/11/18 17:42, Jianfeng Tan wrote:
> Currently DPDK does not respect the quota of a hugetlbfs mount.
> It will fail to init the EAL because it tries to map the number of
> free hugepages in the system rather than using the number specified
> in the quota for that mount.
>
> To solve this issue, we take the quota into consideration when
> calculating the number of hugepages to map.  We use either the number
> specified in the quota, or number of available hugepages, whichever
> is lower.
>
> There are possible race conditions when multiple applications
> allocate hugepages in different hugetlbfs mounts of the same size,
> so the suggested system would have a pool with enough hugepages for
> all hugetlbfs mount quotas.
>
> There is, however, still an open issue with
> CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS. When this option is enabled
> (IVSHMEM target does this by default), having hugetlbfs mounts with
> quota will fail to remap hugepages because it relies on having
> mapped all free hugepages in the system.
>
> Signed-off-by: Jianfeng Tan 
>

Acked-by: Michael Qiu
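
The rule described in the commit message (use the quota or the number of
available hugepages, whichever is lower) amounts to a small calculation.
The sketch below is not the patch itself: usable_hugepages() and its
parameters are hypothetical, and treating a quota of 0 as "no quota set"
is an assumption made for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of hugepages to map for one hugetlbfs mount.
 *   free_pages  - free hugepages of this size in the system
 *   quota_bytes - size quota of the mount in bytes (0 = no quota, assumed)
 *   page_size   - hugepage size in bytes
 */
static uint64_t
usable_hugepages(uint64_t free_pages, uint64_t quota_bytes, uint64_t page_size)
{
	uint64_t quota_pages;

	if (quota_bytes == 0 || page_size == 0)
		return free_pages;	/* nothing to clamp against */

	quota_pages = quota_bytes / page_size;
	return quota_pages < free_pages ? quota_pages : free_pages;
}
```

For example, with 1024 free 2 MB pages and a 1 GB quota on the mount, the
result is 512 pages; with only 100 free pages it stays at 100.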

[dpdk-dev] [PATCH] Unlink existing unused sockets at start up

2015-12-21 Thread Wang, Zhihong



> -Original Message-
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Friday, December 18, 2015 2:18 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: p.fedin at samsung.com; yuanhan.liu at linux.intel.com; s.dyasly at 
> samsung.com;
> Xie, Huawei 
> Subject: Re: [PATCH] Unlink existing unused sockets at start up
> 
> On 18.12.2015 05:39, Wang, Zhihong wrote:
> 
> > Yes ideally the underneath lib shouldn't meddle with the recovery logic.
> > But I do think we should at least put a warning in the lib function
> > said the app should make the path available. This is another topic though 
> > :-)
> Like we did in memcpy:
> > /**
> >  * Copy 16 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> >
> 
> Isn't it enough to have an error in the log?

Function comments and function code are different things and are both necessary.
Also why wait till error occurs when a comment can warn the developer?

> 
> lib/librte_vhost/vhost_user/vhost-net-user.c:130:
> RTE_LOG(ERR, VHOST_CONFIG, "fail to bind fd:%d, remove file:%s and try
> again.\n",
> 
> Best regards, Ilya Maximets.
