[dpdk-dev] [PATCH] ixgbe: fix a x550 DCB issue

2015-09-08 Thread Wu, Jingjing


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wenzhuo Lu
> Sent: Wednesday, August 26, 2015 3:11 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] ixgbe: fix a x550 DCB issue
> 
> There's a DCB issue on x550. For 8 TCs, if a packet with user priority 6
> or 7 is injected to the NIC, then the NIC will put 3 packets into the
> queue. There's also a similar issue for 4 TCs.
> The root cause is that RXPBSIZE is not right. The RXPBSIZE of x550 is 384.
> It's different from other 10G NICs. We need to set the RXPBSIZE according
> to the NIC type.
> 
> Signed-off-by: Wenzhuo Lu 
Acked-by: Jingjing Wu 
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 27 +++
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index 91023b9..021229f 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -2915,6 +2915,7 @@ ixgbe_rss_configure(struct rte_eth_dev *dev)
> 
>  #define NUM_VFTA_REGISTERS 128
>  #define NIC_RX_BUFFER_SIZE 0x200
> +#define X550_RX_BUFFER_SIZE 0x180
> 
>  static void
>  ixgbe_vmdq_dcb_configure(struct rte_eth_dev *dev)
> @@ -2943,7 +2944,15 @@ ixgbe_vmdq_dcb_configure(struct rte_eth_dev *dev)
>* RXPBSIZE
>* split rx buffer up into sections, each for 1 traffic class
>*/
> - pbsize = (uint16_t)(NIC_RX_BUFFER_SIZE / nb_tcs);
> + switch (hw->mac.type) {
> + case ixgbe_mac_X550:
> + case ixgbe_mac_X550EM_x:
> + pbsize = (uint16_t)(X550_RX_BUFFER_SIZE / nb_tcs);
> + break;
> + default:
> + pbsize = (uint16_t)(NIC_RX_BUFFER_SIZE / nb_tcs);
> + break;
> + }
>   for (i = 0 ; i < nb_tcs; i++) {
>   uint32_t rxpbsize = IXGBE_READ_REG(hw, IXGBE_RXPBSIZE(i));
>   rxpbsize &= (~(0x3FF << IXGBE_RXPBSIZE_SHIFT));
> @@ -3317,7 +3326,7 @@ ixgbe_dcb_hw_configure(struct rte_eth_dev *dev,
>  {
>   int ret = 0;
>   uint8_t i,pfc_en,nb_tcs;
> - uint16_t pbsize;
> + uint16_t pbsize, rx_buffer_size;
>   uint8_t config_dcb_rx = 0;
>   uint8_t config_dcb_tx = 0;
>   uint8_t tsa[IXGBE_DCB_MAX_TRAFFIC_CLASS] = {0};
> @@ -3408,9 +3417,19 @@ ixgbe_dcb_hw_configure(struct rte_eth_dev *dev,
>   }
>   }
> 
> + switch (hw->mac.type) {
> + case ixgbe_mac_X550:
> + case ixgbe_mac_X550EM_x:
> + rx_buffer_size = X550_RX_BUFFER_SIZE;
> + break;
> + default:
> + rx_buffer_size = NIC_RX_BUFFER_SIZE;
> + break;
> + }
> +
>   if(config_dcb_rx) {
>   /* Set RX buffer size */
> - pbsize = (uint16_t)(NIC_RX_BUFFER_SIZE / nb_tcs);
> + pbsize = (uint16_t)(rx_buffer_size / nb_tcs);
>   uint32_t rxpbsize = pbsize << IXGBE_RXPBSIZE_SHIFT;
>   for (i = 0 ; i < nb_tcs; i++) {
>   IXGBE_WRITE_REG(hw, IXGBE_RXPBSIZE(i), rxpbsize);
> @@ -3466,7 +3485,7 @@ ixgbe_dcb_hw_configure(struct rte_eth_dev *dev,
> 
>   /* Check if the PFC is supported */
>   if(dev->data->dev_conf.dcb_capability_en & ETH_DCB_PFC_SUPPORT) {
> - pbsize = (uint16_t) (NIC_RX_BUFFER_SIZE / nb_tcs);
> + pbsize = (uint16_t) (rx_buffer_size / nb_tcs);
>   for (i = 0; i < nb_tcs; i++) {
>   /*
>   * If the TC count is 8,and the default high_water is 48,
> --
> 1.9.3



[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-08 Thread Tetsuya Mukawa
On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
 Hi Xie and Yanping,


 May I ask you some questions?
 It seems we are also developing an almost identical one.
>>> Good to know that we are tackling the same problem and have the similar
>>> idea.
>>> What is your status now? We had the POC running, and compliant with
>>> dpdkvhost.
>>> Interrupt like notification isn't supported.
>> We implemented the vhost PMD first, so we are just starting to implement it.
>>
 On 2015/08/20 19:14, Xie, Huawei wrote:
> Added dev at dpdk.org
>
> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>> Yanping:
>> I read your mail; it seems what we did is quite similar. Here I wrote a
>> quick mail to describe our design. Let me know if it is the same thing.
>>
>> Problem Statement:
>> We don't have a high performance networking interface in container for
>> NFV. The current veth pair based interface couldn't be easily accelerated.
>>
>> The key components involved:
>> 1.DPDK based virtio PMD driver in container.
>> 2.device simulation framework in container.
>> 3.dpdk(or kernel) vhost running in host.
>>
>> How is virtio created?
>> A: There is no "real" virtio-pci device in the container environment.
>> 1). Host maintains pools of memory, and shares memory with the container.
>> This could be accomplished by the host sharing a hugepage file with the
>> container.
>> 2). Container creates virtio rings based on the shared memory.
>> 3). Container creates mbuf memory pools on the shared memory.
>> 4) Container sends the memory and vring information to vhost through
>> vhost messages. This could be done either through an ioctl call or a
>> vhost-user message (a sketch of steps 1-3 follows below).
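
A minimal sketch of steps 1)-3), for illustration only; the file path,
region size and ring geometry are assumptions, not part of the design:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <linux/virtio_ring.h>

    /* Container-side setup over a hugepage file shared by the host. */
    static struct vring setup_shared_vring(void)
    {
        struct vring vr;
        /* 1) map the hugepage file the host shares with the container */
        int fd = open("/dev/hugepages/container_shm", O_RDWR);
        void *base = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (fd < 0 || base == MAP_FAILED)
            abort();
        /* 2) lay a 256-entry virtio ring out in the shared memory */
        vring_init(&vr, 256, base, 4096);
        /* 3) mbuf pools would be carved from the same region, so the
         * vhost backend can translate buffer addresses */
        return vr;
    }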
>>
>> How is the vhost message sent?
>> A: There are two alternative ways to do this.
>> 1) The customized virtio PMD is responsible for all the vring creation,
>> and vhost message sending.
 Above is our approach so far.
 It seems Yanping also takes this kind of approach.
 We are using vhost-user functionality instead of using the vhost-net
 kernel module.
 Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user
>>> space" vhost through the cuse interface. It could also be done through a
>>> vhost-user socket. This isn't the key point.
>>> Here vhost-user is kind of confusing, maybe user space vhost is more
>>> accurate, either cuse or unix domain socket. :).
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>>> they are also trying to connect to "user space" vhost. Correct me if wrong.
>>> Yes, there is some difference between these two. Vhost-net kernel module
>>> could directly access another process's memory, while using
>>> vhost-user (cuse/user), we need to do the memory mapping.
 BTW, we are going to submit a vhost PMD for DPDK-2.2.
 This PMD is implemented on librte_vhost.
 It allows a DPDK application to handle a vhost-user (cuse) backend as a
 normal NIC port.
 This PMD should work with both Xie and Yanping approach.
 (In the case of Yanping approach, we may need vhost-cuse)

>> 2) We could do this through a lightweight device simulation framework.
>> The device simulation creates a simple PCI bus. On the PCI bus,
>> virtio-net PCI devices are created. The device simulation provides
>> an IOAPI for MMIO/IO access.
 Does it mean you implemented a kernel module?
 If so, do you still need vhost-cuse functionality to handle vhost
 messages in userspace?
>>> The device simulation is a library running in user space in the container.
>>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>>> devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through the IOAPI provided by the device simulation rather than IO
>>> instructions as in KVM.
>>> Why do we use device simulation?
>>> We could create other virtio devices in the container, and provide a common
>>> way to talk to vhost-xx modules.
>> Thanks for explanation.
>> At first reading, I thought the difference between approach1 and
>> approach2 is whether we need to implement a new kernel module, or not.
>> But now I understand how you implemented it.
>>
>> Please let me explain our design more.
>> We might use a kind of similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (Anyway, we haven't finished implementing yet, this overview might have
>> some technical problems)
>>
>> Step1. Separate virtio-net and vhost-user socket related code from QEMU,
>> then implement it as a separate program.
>> The program also has below features.
>>  - Create a directory that contains almost the same files as
>> /sys/bus/pci/device//*
>>(To scan these files located outside sysfs,

[dpdk-dev] Random packet drops with ip_pipeline on R730.

2015-09-08 Thread husainee
Hi

I am using a DELL R730 with dual sockets. The processor in each socket is
an Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz - 6 cores.
The CPU layout has socket 0 with 0,2,4,6,8,10 cores and socket 1 with
1,3,5,7,9,11 cores.
The NIC card is i350.

Cores 2-11 are isolated using the isolcpus kernel parameter. We are
running the ip_pipeline application with only Master, RX and TX threads
(Flow and Route have been removed from the cfg file). The threads are run
as follows:

- Master on CPU core 2
- RX on CPU core 4
- TX on CPU core 6

64-byte packets are sent from ixia at different speeds, but we are
seeing random packet drops. The same exercise is done on cores 3, 5, 7 and
the results are the same.

We tried the l2fwd app and it works fine with no packet drops.

Hugepages: 1024 x 2M per socket.


Can anyone suggest what could be the reason for these random packet drops?

regards
husainee


[dpdk-dev] [PATCH v2 0/3] fix build issues with librte_sched, test_red on non x86 platform

2015-09-08 Thread Thomas Monjalon
2015-08-30 14:25, Jerin Jacob:
> This patch set enables librte_sched, app/test/test_sched and app/test/test_red
> to build on non-x86 platforms
> 
> v1..v2: use the memory barrier version of rte_rdtsc() for multi-arch support,
> as suggested by Thomas Monjalon
> 
> Jerin Jacob (3):
>   sched: remove unused inclusion of tmmintrin.h
>   app/test: test_sched: fix needless build dependency on
> CONFIG_RTE_ARCH_X86_64
>   app/test: use memory barrier version of rte_rdtsc() eal api for multi
> arch support

Applied, thanks


[dpdk-dev] [PATCH] eal/linux: fix rte_epoll_wait

2015-09-08 Thread Thomas Monjalon
> > Function rte_epoll_wait should return when underlying call
> > to epoll_wait times out.
> > 
> > Signed-off-by: Robert Sanford 
> 
> Acked-by: Cunming Liang 

Applied, thanks



[dpdk-dev] virtio optimization idea

2015-09-08 Thread Tetsuya Mukawa
On 2015/09/05 1:50, Xie, Huawei wrote:
> There is some format issue with the ascii chart of the tx ring. Update
> that chart.
> Sorry for the trouble.

Hi Xie,

Thanks for sharing a way to optimize virtio.
I have a few questions.

>
> On 9/4/2015 4:25 PM, Xie, Huawei wrote:
>> Hi:
>>
>> Recently I have done one virtio optimization proof of concept. The
>> optimization includes two parts:
>> 1) avail ring set with fixed descriptors
>> 2) RX vectorization
>> With the optimizations, we could have several times of performance boost
>> for purely vhost-virtio throughput.

When you check performance, have you optimized only the virtio-net driver?
If so, can we also optimize the vhost backend (librte_vhost) using your
optimization approach?

>>
>> Here i will only cover the first part, which is the prerequisite for the
>> second part.
>> Let us first take RX for example. Currently when we fill the avail ring
>> with guest mbuf, we need
>> a) allocate one descriptor(for non sg mbuf) from free descriptors
>> b) set the idx of the desc into the entry of avail ring
>> c) set the addr/len field of the descriptor to point to guest blank mbuf
>> data area
>>
>> Those operations take time, and especially step b results in modified (M)
>> state of the cache line for the avail ring in the virtio processing
>> core. When vhost processes the avail ring, the cache line transfer from
>> virtio processing core to vhost processing core takes pretty much CPU
>> cycles.
>> To solve this problem, this is the arrangement of RX ring for DPDK
>> pmd(for non-mergable case).
>>
>> avail
>> idx
>> +
>> |
>> +----+----+---+-----+-------+------+
>> | 0  | 1  | 2 | ... |  254  | 255  |  avail ring
>> +-+--+-+--+-+-+-----+---+---+--+---+
>>   |    |    |          |       |
>>   |    |    |          |       |
>>   v    v    v          v       v
>> +-+--+-+--+-+-+-----+---+---+--+---+
>> | 0  | 1  | 2 | ... |  254  | 255  |  desc ring
>> +----+----+---+-----+-------+------+
>> |
>> |
>> +----+----+---+-----+-------+------+
>> | 0  | 1  | 2 | ... |  254  | 255  |  used ring
>> +----+----+---+-----+-------+------+
>> |
>> +
>> Avail ring is initialized with fixed descriptors and is never changed,
>> i.e., the index value of the nth avail ring entry is always n, which
>> means the virtio PMD is actually refilling the desc ring only, without
>> having to change the avail ring.

For example, avail ring is like below.
struct vring_avail {
uint16_t flags;
uint16_t idx;
uint16_t ring[QUEUE_SIZE];
};

My understanding is that the virtio-net driver still needs to change
avail_ring.idx, but doesn't need to change avail_ring.ring[].
Is this correct?

Tetsuya

>> When vhost fetches avail ring, if not evicted, it is always in its first
>> level cache.
>>
>> When RX receives packets from used ring, we use the used->idx as the
>> desc idx. This requires that vhost processes and returns descs from
>> avail ring to used ring in order, which is true for both current dpdk
>> vhost and kernel vhost implementation. In my understanding, there is no
>> necessity for vhost net to process descriptors OOO. One case could be
>> zero copy, for example, if one descriptor doesn't meet zero copy
>> requirement, we could directly return it to used ring, earlier than the
>> descriptors in front of it.
>> To enforce this, I want to use a reserved bit to indicate in-order
>> processing of descriptors.
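
For illustration, the RX dequeue this in-order assumption enables could
look roughly like the sketch below; field names follow the DPDK virtio
PMD of the time, and desc_to_mbuf() is a hypothetical helper:

    uint16_t used_idx = vq->vq_ring.used->idx;     /* written by vhost */
    while (vq->vq_used_cons_idx != used_idx) {
        uint16_t i = vq->vq_used_cons_idx & (vq->vq_nentries - 1);
        uint32_t len = vq->vq_ring.used->ring[i].len;
        /* in-order completion means slot i was filled from desc i */
        rx_pkts[nb_rx++] = desc_to_mbuf(vq, i, len);
        vq->vq_used_cons_idx++;
    }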
>>
>> For tx ring, the arrangement is like below. Each transmitted mbuf needs
>> a desc for virtio_net_hdr, so actually we have only 128 free slots.
>>
>>            ++
>>            ||
>>            ||
>>     +-----+-----+-----+------++------+------+------+------+
>>     |  0  |  1  | ... |  127 || 128  | 129  | ...  | 255  |   avail ring
>>     +--+--+--+--+-----+--+---++--+---+--+---+------+--+---+
>>        |     |          |        |      |             |
>>        v     v          v        v      v             v
>>     +--+--+--+--+-----+--+---++--+---+--+---+------+--+---+
>>     | 127 | 1

[dpdk-dev] virtio-net: bind systematically on all non blacklisted virtio-net devices

2015-09-08 Thread Franck Baudin
Hi,

The virtio-net driver binds to all virtio-net devices, even if the devices
are used by the kernel (leading to kernel soft-lockup/panic). One way
around this is to blacklist the ports in use by Linux. This has been the
case since v2.0.0, in fact since commit da978dfdc43b59e290a46d7ece5fd19ce79a1162
and the removal of the RTE_PCI_DRV_NEED_MAPPING driver flag.
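
For reference, the blacklist workaround can be expressed through the EAL
-b (--pci-blacklist) option; the PCI address below is only an example:

    char *eal_args[] = { "app", "-c", "0x3", "-n", "4",
                         "-b", "0000:00:03.0" };  /* port left to the kernel */
    rte_eal_init(sizeof(eal_args) / sizeof(eal_args[0]), eal_args);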

Questions:
 1/ Is it the expected behaviour?
 2/ Why is it different from the vmxnet3 PMD? In other words, shouldn't
we re-add RTE_PCI_DRV_NEED_MAPPING to the virtio PMD or remove it from
the vmxnet3 PMD?
 3/ If this is the expected behaviour, shouldn't we update the
dpdk_nic_bind.py tool (binding status is irrelevant for virtio) and the
documentation (which mentions igb_uio, misleading and useless here)?

Thanks!

Best Regards,
Franck


[dpdk-dev] virtio optimization idea

2015-09-08 Thread Xie, Huawei
On 9/8/2015 4:21 PM, Tetsuya Mukawa wrote:
> On 2015/09/05 1:50, Xie, Huawei wrote:
>> There is some format issue with the ascii chart of the tx ring. Update
>> that chart.
>> Sorry for the trouble.
> Hi Xie,
>
> Thanks for sharing a way to optimize virtio.
> I have a few questions.
>
>> On 9/4/2015 4:25 PM, Xie, Huawei wrote:
>>> Hi:
>>>
>>> Recently I have done one virtio optimization proof of concept. The
>>> optimization includes two parts:
>>> 1) avail ring set with fixed descriptors
>>> 2) RX vectorization
>>> With the optimizations, we could have several times of performance boost
>>> for purely vhost-virtio throughput.
> When you check performance, have you optimized only the virtio-net driver?
> If so, can we also optimize the vhost backend (librte_vhost) using your
> optimization approach?

We could do some optimization to vhost based on the same vring layout,
but as vhost needs to support legacy virtio as well, it couldn't make
this assumption.
>>> Here i will only cover the first part, which is the prerequisite for the
>>> second part.
>>> Let us first take RX for example. Currently when we fill the avail ring
>>> with guest mbuf, we need
>>> a) allocate one descriptor(for non sg mbuf) from free descriptors
>>> b) set the idx of the desc into the entry of avail ring
>>> c) set the addr/len field of the descriptor to point to guest blank mbuf
>>> data area
>>>
>>> Those operations take time, and especially step b results in modified (M)
>>> state of the cache line for the avail ring in the virtio processing
>>> core. When vhost processes the avail ring, the cache line transfer from
>>> virtio processing core to vhost processing core takes pretty much CPU
>>> cycles.
>>> To solve this problem, this is the arrangement of RX ring for DPDK
>>> pmd(for non-mergable case).
>>>
>>> avail
>>> idx
>>> +
>>> |
>>> +----+----+---+-----+-------+------+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  avail ring
>>> +-+--+-+--+-+-+-----+---+---+--+---+
>>>   |    |    |          |       |
>>>   |    |    |          |       |
>>>   v    v    v          v       v
>>> +-+--+-+--+-+-+-----+---+---+--+---+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  desc ring
>>> +----+----+---+-----+-------+------+
>>> |
>>> |
>>> +----+----+---+-----+-------+------+
>>> | 0  | 1  | 2 | ... |  254  | 255  |  used ring
>>> +----+----+---+-----+-------+------+
>>> |
>>> +
>>> Avail ring is initialized with fixed descriptors and is never changed,
>>> i.e., the index value of the nth avail ring entry is always n, which
>>> means the virtio PMD is actually refilling the desc ring only, without
>>> having to change the avail ring.
> For example, avail ring is like below.
> struct vring_avail {
> uint16_t flags;
> uint16_t idx;
> uint16_t ring[QUEUE_SIZE];
> };
>
> My understanding is that the virtio-net driver still needs to change
> avail_ring.idx, but doesn't need to change avail_ring.ring[].
> Is this correct?

Yes, the avail ring is initialized once and never gets updated. It is as if
the virtio frontend were only using the descriptor ring.
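
A sketch of that one-time initialization, assuming a 256-entry queue
(field names follow the DPDK virtio PMD and are illustrative):

    uint16_t i;
    for (i = 0; i < 256; i++)
        vq->vq_ring.avail->ring[i] = i;  /* entry n always maps to desc n */
    /* From here on, RX refill only rewrites desc[i].addr/len and advances
     * avail->idx; the avail ring[] contents are never touched again. */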
>
> Tetsuya
>
>>> When vhost fetches avail ring, if not evicted, it is always in its first
>>> level cache.
>>>
>>> When RX receives packets from used ring, we use the used->idx as the
>>> desc idx. This requires that vhost processes and returns descs from
>>> avail ring to used ring in order, which is true for both current dpdk
>>> vhost and kernel vhost implementation. In my understanding, there is no
>>> necessity for vhost net to process descriptors OOO. One case could be
>>> zero copy, for example, if one descriptor doesn't meet zero copy
>>> requirement, we could directly return it to used ring, earlier than the
>>> descriptors in front of it.
>>> To enforce this, I want to use a reserved bit to indicate in-order
>>> processing of descriptors.
>>>
>>> For tx ring, the arrangement is like below. Each transmitted mbuf needs
>>> a desc for virtio_net_hdr, so actually we have only 128 free slots.
>>>
>>>            ++
>>>            ||
>>>            ||
>>>     +-----+-----+-----+------++------+------+------+------+
>>>     |  0  |  1  | ... |  127 || 128  | 129  | ...  | 255  |   ava

[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-08 Thread Nélio Laranjeiro
On Mon, Sep 07, 2015 at 10:52:53PM +0200, Marc Sune wrote:
> 2015-08-29 2:16 GMT+02:00 Marc Sune :
> 
> > The current rte_eth_dev_info abstraction does not provide any mechanism to
> > get the supported speed(s) of an ethdev.
> >
> > For some drivers (e.g. ixgbe), an educated guess can be done based on the
> > driver's name (driver_name in rte_eth_dev_info), see:
> >
> > http://dpdk.org/ml/archives/dev/2013-August/000412.html
> >
> > However, i) doing string comparisons is annoying, and can silently
> > break existing applications if PMDs change their names ii) it does not
> > provide all the supported capabilities of the ethdev iii) for some drivers
> > it
> > is impossible determine correctly the (max) speed by the application
> > (e.g. in i40, distinguish between XL710 and X710).
> >
> > This small patch adds speed_capa bitmap in rte_eth_dev_info, which is
> > filled
> > by the PMDs according to the physical device capabilities.
> >
> > v2: rebase, converted speed_capa into 32 bits bitmap, fixed alignment
> > (checkpatch).
> >
> > v3: rebase to v2.1. unified ETH_LINK_SPEED and ETH_SPEED_CAP into
> > ETH_SPEED.
> > Converted field speed in struct rte_eth_conf to speeds, to allow a
> > bitmap
> > for defining the announced speeds, as suggested by M. Brorup. Fixed
> > spelling issues.
> >
> > v4: fixed errata in the documentation of field speeds of rte_eth_conf, and
> > commit 1/2 message. rebased to v2.1.0. v3 was incorrectly based on
> > ~2.1.0-rc1.
> >
> 
> Thomas,
> 
> Since mostly you were commenting for v1 and v2; any opinion on this one?
> 
> Regards
> marc

Hi Marc,

I have read your patches, and there are a few mistakes, for instance mlx4
(ConnectX-3 devices) does not support 100Gbps.

In addition, it seems your new bitmap does not support all kinds of
speeds; take a look at the ethtool header in the Linux kernel
(include/uapi/linux/ethtool.h), which already consumes 30 bits without even
managing speeds above 56Gbps.

It would be nice to keep the field representing the real speed of the
link, in case it is not represented by the bitmap; it could also be
useful for aggregated links (bonding, for instance). The current API
already works this way; it just needs to be extended from 16 to 32 bits
to manage speeds above 64Gbps.
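
A sketch of that combination; the struct and field names here are
illustrative, not the final ethdev API:

    struct link_info {
        uint32_t speed_capa; /* bitmap of supported speeds */
        uint32_t link_speed; /* actual speed in Mbps, also covering values
                              * the bitmap cannot express, e.g. bonding */
    };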

>[...]

Nélio
-- 
Nélio Laranjeiro
6WIND


[dpdk-dev] [PATCH 0/4] librte_table: add name parameter to lpm table

2015-09-08 Thread Jasvinder Singh
This patchset implements the ABI change announced for librte_table. For the
LPM table, a name parameter has been included in the LPM table parameters
structure. It will eventually allow applications to create more than one
instance of the LPM table, if required.
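
As an illustration, with the new field an application could create two
independent LPM tables; the names and sizes below are examples only:

    struct rte_table_lpm_params p0 = {
        .name = "ROUTES_0",
        .n_rules = 1 << 16,
        .entry_unique_size = 8,
        .offset = 0,
    };
    struct rte_table_lpm_params p1 = p0;
    p1.name = "ROUTES_1"; /* distinct name, so the underlying
                           * rte_lpm_create() calls no longer collide */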


Jasvinder Singh (4):
  librte_table: add name parameter to LPM table
  app/test: modify table and pipeline test
  ip_pipeline: modify lpm table for routing pipeline
  librte_table: modify release notes and deprecation notice

 app/test-pipeline/pipeline_lpm.c   |   1 +
 app/test-pipeline/pipeline_lpm_ipv6.c  |   1 +
 app/test/test_table_combined.c |   2 +
 app/test/test_table_tables.c   | 102 -
 doc/guides/rel_notes/deprecation.rst   |   3 -
 doc/guides/rel_notes/release_2_2.rst   |   4 +-
 .../ip_pipeline/pipeline/pipeline_routing_be.c |   1 +
 lib/librte_table/Makefile  |   2 +-
 lib/librte_table/rte_table_lpm.c   |   8 +-
 lib/librte_table/rte_table_lpm.h   |   3 +
 lib/librte_table/rte_table_lpm_ipv6.c  |   8 +-
 lib/librte_table/rte_table_lpm_ipv6.h  |   3 +
 12 files changed, 86 insertions(+), 52 deletions(-)

-- 
2.1.0



[dpdk-dev] [PATCH 1/4] librte_table: modify LPM table parameter structure

2015-09-08 Thread Jasvinder Singh
This patch relates to ABI change proposed for librte_table
(lpm table). A new parameter to hold the table name has
been added to the LPM table parameter structures
rte_table_lpm_params and rte_table_lpm_ipv6_params.

Signed-off-by: Jasvinder Singh 
---
 lib/librte_table/rte_table_lpm.c  | 8 ++--
 lib/librte_table/rte_table_lpm.h  | 3 +++
 lib/librte_table/rte_table_lpm_ipv6.c | 8 ++--
 lib/librte_table/rte_table_lpm_ipv6.h | 3 +++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/lib/librte_table/rte_table_lpm.c b/lib/librte_table/rte_table_lpm.c
index b218d64..849d899 100644
--- a/lib/librte_table/rte_table_lpm.c
+++ b/lib/librte_table/rte_table_lpm.c
@@ -103,7 +103,11 @@ rte_table_lpm_create(void *params, int socket_id, uint32_t entry_size)
__func__);
return NULL;
}
-
+   if (p->name == NULL) {
+   RTE_LOG(ERR, TABLE, "%s: Table name is NULL\n",
+   __func__);
+   return NULL;
+   }
entry_size = RTE_ALIGN(entry_size, sizeof(uint64_t));

/* Memory allocation */
@@ -119,7 +123,7 @@ rte_table_lpm_create(void *params, int socket_id, uint32_t entry_size)
}

/* LPM low-level table creation */
-   lpm->lpm = rte_lpm_create("LPM", socket_id, p->n_rules, 0);
+   lpm->lpm = rte_lpm_create(p->name, socket_id, p->n_rules, 0);
if (lpm->lpm == NULL) {
rte_free(lpm);
RTE_LOG(ERR, TABLE, "Unable to create low-level LPM table\n");
diff --git a/lib/librte_table/rte_table_lpm.h b/lib/librte_table/rte_table_lpm.h
index c08c958..06e8410 100644
--- a/lib/librte_table/rte_table_lpm.h
+++ b/lib/librte_table/rte_table_lpm.h
@@ -77,6 +77,9 @@ extern "C" {

 /** LPM table parameters */
 struct rte_table_lpm_params {
+   /** Table name */
+   const char *name;
+
/** Maximum number of LPM rules (i.e. IP routes) */
uint32_t n_rules;

diff --git a/lib/librte_table/rte_table_lpm_ipv6.c b/lib/librte_table/rte_table_lpm_ipv6.c
index ff4a9c2..ce91db2 100644
--- a/lib/librte_table/rte_table_lpm_ipv6.c
+++ b/lib/librte_table/rte_table_lpm_ipv6.c
@@ -109,13 +109,17 @@ rte_table_lpm_ipv6_create(void *params, int socket_id, uint32_t entry_size)
__func__);
return NULL;
}
-
+   if (p->name == NULL) {
+   RTE_LOG(ERR, TABLE, "%s: Table name is NULL\n",
+   __func__);
+   return NULL;
+   }
entry_size = RTE_ALIGN(entry_size, sizeof(uint64_t));

/* Memory allocation */
nht_size = RTE_TABLE_LPM_MAX_NEXT_HOPS * entry_size;
total_size = sizeof(struct rte_table_lpm_ipv6) + nht_size;
-   lpm = rte_zmalloc_socket("TABLE", total_size, RTE_CACHE_LINE_SIZE,
+   lpm = rte_zmalloc_socket(p->name, total_size, RTE_CACHE_LINE_SIZE,
socket_id);
if (lpm == NULL) {
RTE_LOG(ERR, TABLE,
diff --git a/lib/librte_table/rte_table_lpm_ipv6.h b/lib/librte_table/rte_table_lpm_ipv6.h
index 91fb0d8..43aea39 100644
--- a/lib/librte_table/rte_table_lpm_ipv6.h
+++ b/lib/librte_table/rte_table_lpm_ipv6.h
@@ -79,6 +79,9 @@ extern "C" {

 /** LPM table parameters */
 struct rte_table_lpm_ipv6_params {
+   /** Table name */
+   const char *name;
+
/** Maximum number of LPM rules (i.e. IP routes) */
uint32_t n_rules;

-- 
2.1.0



[dpdk-dev] [PATCH 3/4] ip_pipeline: modify lpm table for routing pipeline

2015-09-08 Thread Jasvinder Singh
The name parameter has been defined in the LPM table of
the routing pipeline.

Signed-off-by: Jasvinder Singh 
---
 examples/ip_pipeline/pipeline/pipeline_routing_be.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/examples/ip_pipeline/pipeline/pipeline_routing_be.c b/examples/ip_pipeline/pipeline/pipeline_routing_be.c
index 1e817dd..06d3e65 100644
--- a/examples/ip_pipeline/pipeline/pipeline_routing_be.c
+++ b/examples/ip_pipeline/pipeline/pipeline_routing_be.c
@@ -484,6 +484,7 @@ pipeline_routing_init(struct pipeline_params *params,
p->n_tables = 1;
{
struct rte_table_lpm_params table_lpm_params = {
+   .name = p->name,
.n_rules = p_rt->n_routes,
.entry_unique_size = sizeof(struct routing_table_entry),
.offset = p_rt->ip_da_offset,
-- 
2.1.0



[dpdk-dev] [PATCH 2/4] app/test: modify table and pipeline test

2015-09-08 Thread Jasvinder Singh
The LPM table tests and test-pipeline have been modified to include
the name parameter of the LPM table.

Signed-off-by: Jasvinder Singh 
---
 app/test-pipeline/pipeline_lpm.c  |   1 +
 app/test-pipeline/pipeline_lpm_ipv6.c |   1 +
 app/test/test_table_combined.c|   2 +
 app/test/test_table_tables.c  | 102 --
 4 files changed, 63 insertions(+), 43 deletions(-)

diff --git a/app/test-pipeline/pipeline_lpm.c b/app/test-pipeline/pipeline_lpm.c
index b1a2c13..c03799c 100644
--- a/app/test-pipeline/pipeline_lpm.c
+++ b/app/test-pipeline/pipeline_lpm.c
@@ -112,6 +112,7 @@ app_main_loop_worker_pipeline_lpm(void) {
/* Table configuration */
{
struct rte_table_lpm_params table_lpm_params = {
+   .name = "LPM",
.n_rules = 1 << 24,
.entry_unique_size =
sizeof(struct rte_pipeline_table_entry),
diff --git a/app/test-pipeline/pipeline_lpm_ipv6.c b/app/test-pipeline/pipeline_lpm_ipv6.c
index 3f24a2d..02b7a9c 100644
--- a/app/test-pipeline/pipeline_lpm_ipv6.c
+++ b/app/test-pipeline/pipeline_lpm_ipv6.c
@@ -113,6 +113,7 @@ app_main_loop_worker_pipeline_lpm_ipv6(void) {
/* Table configuration */
{
struct rte_table_lpm_ipv6_params table_lpm_ipv6_params = {
+   .name = "LPM",
.n_rules = 1 << 24,
.number_tbl8s = 1 << 21,
.entry_unique_size =
diff --git a/app/test/test_table_combined.c b/app/test/test_table_combined.c
index dd09da5..f5c7c9b 100644
--- a/app/test/test_table_combined.c
+++ b/app/test/test_table_combined.c
@@ -293,6 +293,7 @@ test_table_lpm_combined(void)

/* Traffic flow */
struct rte_table_lpm_params lpm_params = {
+   .name = "LPM",
.n_rules = 1 << 16,
.entry_unique_size = 8,
.offset = 0,
@@ -352,6 +353,7 @@ test_table_lpm_ipv6_combined(void)

/* Traffic flow */
struct rte_table_lpm_ipv6_params lpm_ipv6_params = {
+   .name = "LPM",
.n_rules = 1 << 16,
.number_tbl8s = 1 << 13,
.entry_unique_size = 8,
diff --git a/app/test/test_table_tables.c b/app/test/test_table_tables.c
index 566964b..9d75fbf 100644
--- a/app/test/test_table_tables.c
+++ b/app/test/test_table_tables.c
@@ -322,6 +322,7 @@ test_table_lpm(void)

/* Initialize params and create tables */
struct rte_table_lpm_params lpm_params = {
+   .name = "LPM",
.n_rules = 1 << 24,
.entry_unique_size = entry_size,
.offset = 1
@@ -331,40 +332,47 @@ test_table_lpm(void)
if (table != NULL)
return -1;

-   lpm_params.n_rules = 0;
+   lpm_params.name = NULL;

table = rte_table_lpm_ops.f_create(&lpm_params, 0, entry_size);
if (table != NULL)
return -2;

+   lpm_params.name = "LPM";
+   lpm_params.n_rules = 0;
+
+   table = rte_table_lpm_ops.f_create(&lpm_params, 0, entry_size);
+   if (table != NULL)
+   return -3;
+
lpm_params.n_rules = 1 << 24;
lpm_params.offset = 32;
lpm_params.entry_unique_size = 0;

table = rte_table_lpm_ops.f_create(&lpm_params, 0, entry_size);
if (table != NULL)
-   return -3;
+   return -4;

lpm_params.entry_unique_size = entry_size + 1;

table = rte_table_lpm_ops.f_create(&lpm_params, 0, entry_size);
if (table != NULL)
-   return -4;
+   return -5;

lpm_params.entry_unique_size = entry_size;

table = rte_table_lpm_ops.f_create(&lpm_params, 0, entry_size);
if (table == NULL)
-   return -5;
+   return -6;

/* Free */
status = rte_table_lpm_ops.f_free(table);
if (status < 0)
-   return -6;
+   return -7;

status = rte_table_lpm_ops.f_free(NULL);
if (status == 0)
-   return -7;
+   return -8;

/* Add */
struct rte_table_lpm_key lpm_key;
@@ -372,75 +380,75 @@ test_table_lpm(void)

table = rte_table_lpm_ops.f_create(&lpm_params, 0, 1);
if (table == NULL)
-   return -8;
+   return -9;

status = rte_table_lpm_ops.f_add(NULL, &lpm_key, &entry, &key_found,
&entry_ptr);
if (status == 0)
-   return -9;
+   return -10;

status = rte_table_lpm_ops.f_add(table, NULL, &entry, &key_found,
&entry_ptr);
if (status == 0)
-   return -10;
+   return -11;

status = rte_table_lpm_ops.f_add(table, &lpm_key, NULL, &key_found,
&entry_ptr);
if (status == 0)
-   return -11;
+   return -12;

lpm_k

[dpdk-dev] [PATCH 4/4] librte_table: modify release notes and deprecation notice

2015-09-08 Thread Jasvinder Singh
The LIBABIVER number is incremented. The release notes
are updated and the deprecation announcement is removed.

Signed-off-by: Jasvinder Singh 
---
 doc/guides/rel_notes/deprecation.rst | 3 ---
 doc/guides/rel_notes/release_2_2.rst | 4 +++-
 lib/librte_table/Makefile| 2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 5f6079b..ce6147e 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -62,9 +62,6 @@ Deprecation Notices
   as currently they are able to access any packet buffer location except the
   packet mbuf structure.

-* librte_table LPM: A new parameter to hold the table name will be added to
-  the LPM table parameter structure.
-
 * librte_table: New functions for table entry bulk add/delete will be added
   to the table operations structure.

diff --git a/doc/guides/rel_notes/release_2_2.rst b/doc/guides/rel_notes/release_2_2.rst
index abe57b4..75fc1ab 100644
--- a/doc/guides/rel_notes/release_2_2.rst
+++ b/doc/guides/rel_notes/release_2_2.rst
@@ -44,6 +44,8 @@ ABI Changes

 * The LPM structure is changed. The deprecated field mem_location is removed.

+* librte_table LPM: A new parameter to hold the table name will be added to
+  the LPM table parameter structure.

 Shared Library Versions
 ---
@@ -76,6 +78,6 @@ The libraries prepended with a plus sign were incremented in 
this version.
  librte_reorder.so.1
  librte_ring.so.1
  librte_sched.so.1
- librte_table.so.1
+   + librte_table.so.2
  librte_timer.so.1
  librte_vhost.so.1
diff --git a/lib/librte_table/Makefile b/lib/librte_table/Makefile
index c5b3eaf..7f02af3 100644
--- a/lib/librte_table/Makefile
+++ b/lib/librte_table/Makefile
@@ -41,7 +41,7 @@ CFLAGS += $(WERROR_FLAGS)

 EXPORT_MAP := rte_table_version.map

-LIBABIVER := 1
+LIBABIVER := 2

 #
 # all source are stored in SRCS-y
-- 
2.1.0



[dpdk-dev] [PATCH v2] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-08 Thread Simon Kågström
On 2015-09-08 01:21, Ananyev, Konstantin wrote:
>>
>> Thanks. I got it wrong anyway, what I wanted was to be able to handle
>> the day when nb_segs changes to a 16-bit number, but then it should
>> really be
>>
>>   ... >= 1 << (sizeof(head->nb_segs) * 8)
>>
>> anyway. I'll fix that and also add a warning that the implementation
>> will do a linear search to find the tail entry.
> 
> Probably just me, but I can't foresee the situation when we would need to
> increase nb_segs to 16 bits.
> Looks like overkill to me.

I don't think it will happen either, but with this solution, this
particular piece of code will work regardless. The value is known at
compile-time anyway, so it should not be a performance issue.
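
For illustration, the guard being discussed would look roughly like this,
independent of the width of nb_segs (names follow the thread, not the
final committed code):

    if (head->nb_segs + tail->nb_segs >=
            1 << (sizeof(head->nb_segs) * 8))
        return -EOVERFLOW;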

// Simon


[dpdk-dev] [PATCH v1] change hugepage sorting to avoid overlapping memcpy

2015-09-08 Thread Gonzalez Monroy, Sergio
Hi Ralf,

Just a few comments/suggestions:

Add 'eal/linux:' to the commit title, i.e.:
   "eal/linux: change hugepage sorting to avoid overlapping memcpy"

On 04/09/2015 11:14, Ralf Hoffmann wrote:
> With only one hugepage or already sorted hugepage addresses, the sort
> function called memcpy with the same src and dst pointer. Debugging with
> valgrind will issue a warning about the overlapping area. This patch changes
> the bubble sort to avoid this behavior. Also, the function can no longer
> fail.
>
> Signed-off-by: Ralf Hoffmann 
> ---
>   lib/librte_eal/linuxapp/eal/eal_memory.c | 27 +--
>   1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index ac2745e..6d01f61 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -699,25 +699,25 @@ error:
>* higher address first on powerpc). We use a slow algorithm, but we won't
>* have millions of pages, and this is only done at init time.
>*/
> -static int
> +static void
>   sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
>   {
>   unsigned i, j;
> - int compare_idx;
> + unsigned compare_idx;
>   uint64_t compare_addr;
>   struct hugepage_file tmp;
>   
>   for (i = 0; i < hpi->num_pages[0]; i++) {
> - compare_addr = 0;
> - compare_idx = -1;
> + compare_addr = hugepg_tbl[i].physaddr;
> + compare_idx = i;
>   
>   /*
> -  * browse all entries starting at 'i', and find the
> +  * browse all entries starting at 'i+1', and find the
>* entry with the smallest addr
>*/
> - for (j=i; j< hpi->num_pages[0]; j++) {
> + for (j=i + 1; j < hpi->num_pages[0]; j++) {
Although there are many style/checkpatch issues in current code, we try 
to fix them
in new patches.
In that regard, checkpatch complains about above line with:
ERROR:SPACING: spaces required around that '='

>   
> - if (compare_addr == 0 ||
> + if (
>   #ifdef RTE_ARCH_PPC_64
>   hugepg_tbl[j].physaddr > compare_addr) {
>   #else
> @@ -728,10 +728,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
>   }
>   }
>   
> - /* should not happen */
> - if (compare_idx == -1) {
> - RTE_LOG(ERR, EAL, "%s(): error in physaddr sorting\n", __func__);
> - return -1;
> + if (compare_idx == i) {
> + /* no smaller page found */
> + continue;
>   }
>   
>   /* swap the 2 entries in the table */
> @@ -741,7 +740,8 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi)
>   sizeof(struct hugepage_file));
>   memcpy(&hugepg_tbl[i], &tmp, sizeof(struct hugepage_file));
>   }
> - return 0;
> +
> + return;
>   }
I reckon checkpatch is not picking this one because the end-of-function 
is not part of the patch,
but it is a warning:
WARNING:RETURN_VOID: void function return statements are not generally 
useful

>   
>   /*
> @@ -1164,8 +1164,7 @@ rte_eal_hugepage_init(void)
>   goto fail;
>   }
>   
> - if (sort_by_physaddr(&tmp_hp[hp_offset], hpi) < 0)
> - goto fail;
> + sort_by_physaddr(&tmp_hp[hp_offset], hpi);
>   
>   #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
>   /* remap all hugepages into single file segments */
>
>

Thanks,
Sergio


[dpdk-dev] [PATCH 0/4] librte_table: add name parameter to lpm table

2015-09-08 Thread Dumitrescu, Cristian


> -Original Message-
> From: Singh, Jasvinder
> Sent: Tuesday, September 8, 2015 1:11 PM
> To: dev at dpdk.org
> Cc: Dumitrescu, Cristian
> Subject: [PATCH 0/4] librte_table: add name parameter to lpm table
> 
> This patchset links to ABI change announced for librte_table. For lpm table,
> name parameter has been included in LPM table parameters structure.
> It will eventually allow applications to create more than one instances
> of lpm table, if required.
> 

Acked-by: Cristian Dumitrescu 



[dpdk-dev] Random packet drops with ip_pipeline on R730.

2015-09-08 Thread Dumitrescu, Cristian
Hi Husainee,

Can you please explain what you mean by random packet drops? What percentage
of the input packets gets dropped, does it take place on every run, does the
number of dropped packets vary on every run, etc.?

Are you also able to reproduce this issue with other NICs, e.g. 10GbE NIC?

Can you share your config file?

Can you please double-check the low-level NIC settings between the two
applications, i.e. the settings in the structures link_params_default,
default_hwq_in_params, default_hwq_out_params from the ip_pipeline file
config_parse.c vs. their equivalents from l2fwd? The only thing I can think of
right now is that maybe one of the low-level threshold values for the Ethernet
link is not tuned for your 1GbE NIC.

Regards,
Cristian

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of husainee
> Sent: Tuesday, September 8, 2015 7:56 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] Random packet drops with ip_pipeline on R730.
> 
> Hi
> 
> I am using a DELL730 with Dual socket. Processor in each socket is
> Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz- 6Cores.
> The CPU layout has socket 0 with 0,2,4,6,8,10 cores and socket 1 with
> 1,3,5,7,9,11 cores.
> The NIC card is i350.
> 
> The Cores 2-11 are isolated using isolcpus kernel parameter. We are
> running the ip_pipeline application with only Master, RX and TX threads
> (Flow and Route have been removed from cfg file). The threads are run as
> follows
> 
> - Master on CPU core 2
> - RX on CPU core 4
> - TX on CPU core 6
> 
> 64 byte packets are sent from ixia at different speeds, but we are
> seeing random packet drops. The same exercise is done on cores 3, 5, 7 and
> the results are the same.
> 
> We tried the l2fwd app and it works fine with no packet drops.
> 
> Hugepages: 1024 x 2M per socket.
> 
> 
> Can anyone suggest what could be the reason for these random packet
> drops.
> 
> regards
> husainee
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 



[dpdk-dev] [RFC PATCH] lpm: increase number of next hops for lpm (ipv4)

2015-09-08 Thread Michal Kobylinski
From: mkobylix 

The current DPDK implementation of LPM for IPv4 and IPv6 limits the number
of next hops to 256, as the next hop ID is an 8-bit field.
The proposed extension increases the number of next hops for IPv4 to 2^24
and also allows 32-bit read/write operations.
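
An illustrative 32-bit tbl24 entry layout with a 24-bit next hop; the
struct and field names are assumptions, not the proposed API:

    struct lpm_tbl24_entry {
        uint32_t next_hop    :24; /* was 8 bits */
        uint32_t valid       :1;
        uint32_t valid_group :1;
        uint32_t depth       :6;
    };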

Signed-off-by: Michal Kobylinski 
---
 app/test/test_lpm.c  |   75 +
 lib/librte_lpm/rte_lpm.c |   51 ++--
 lib/librte_lpm/rte_lpm.h |  168 --
 lib/librte_table/rte_table_lpm.c |6 +-
 4 files changed, 160 insertions(+), 140 deletions(-)

diff --git a/app/test/test_lpm.c b/app/test/test_lpm.c
index 8b4ded9..e7af796 100644
--- a/app/test/test_lpm.c
+++ b/app/test/test_lpm.c
@@ -278,7 +278,8 @@ test6(void)
 {
struct rte_lpm *lpm = NULL;
uint32_t ip = IPv4(0, 0, 0, 0);
-   uint8_t depth = 24, next_hop_add = 100, next_hop_return = 0;
+   uint8_t depth = 24;
+   uint32_t next_hop_add = 100, next_hop_return = 0;
int32_t status = 0;

lpm = rte_lpm_create(__func__, SOCKET_ID_ANY, MAX_RULES, 0);
@@ -309,10 +310,11 @@ int32_t
 test7(void)
 {
__m128i ipx4;
-   uint16_t hop[4];
+   uint32_t hop[4];
struct rte_lpm *lpm = NULL;
uint32_t ip = IPv4(0, 0, 0, 0);
-   uint8_t depth = 32, next_hop_add = 100, next_hop_return = 0;
+   uint8_t depth = 32;
+   uint32_t next_hop_add = 100, next_hop_return = 0;
int32_t status = 0;

lpm = rte_lpm_create(__func__, SOCKET_ID_ANY, MAX_RULES, 0);
@@ -325,10 +327,10 @@ test7(void)
TEST_LPM_ASSERT((status == 0) && (next_hop_return == next_hop_add));

ipx4 = _mm_set_epi32(ip, ip + 0x100, ip - 0x100, ip);
-   rte_lpm_lookupx4(lpm, ipx4, hop, UINT16_MAX);
+   rte_lpm_lookupx4(lpm, ipx4, hop, UINT32_MAX);
TEST_LPM_ASSERT(hop[0] == next_hop_add);
-   TEST_LPM_ASSERT(hop[1] == UINT16_MAX);
-   TEST_LPM_ASSERT(hop[2] == UINT16_MAX);
+   TEST_LPM_ASSERT(hop[1] == UINT32_MAX);
+   TEST_LPM_ASSERT(hop[2] == UINT32_MAX);
TEST_LPM_ASSERT(hop[3] == next_hop_add);

status = rte_lpm_delete(lpm, ip, depth);
@@ -355,10 +357,11 @@ int32_t
 test8(void)
 {
__m128i ipx4;
-   uint16_t hop[4];
+   uint32_t hop[4];
struct rte_lpm *lpm = NULL;
uint32_t ip1 = IPv4(127, 255, 255, 255), ip2 = IPv4(128, 0, 0, 0);
-   uint8_t depth, next_hop_add, next_hop_return;
+   uint8_t depth;
+   uint32_t next_hop_add, next_hop_return;
int32_t status = 0;

lpm = rte_lpm_create(__func__, SOCKET_ID_ANY, MAX_RULES, 0);
@@ -381,10 +384,10 @@ test8(void)
(next_hop_return == next_hop_add));

ipx4 = _mm_set_epi32(ip2, ip1, ip2, ip1);
-   rte_lpm_lookupx4(lpm, ipx4, hop, UINT16_MAX);
-   TEST_LPM_ASSERT(hop[0] == UINT16_MAX);
+   rte_lpm_lookupx4(lpm, ipx4, hop, UINT32_MAX);
+   TEST_LPM_ASSERT(hop[0] == UINT32_MAX);
TEST_LPM_ASSERT(hop[1] == next_hop_add);
-   TEST_LPM_ASSERT(hop[2] == UINT16_MAX);
+   TEST_LPM_ASSERT(hop[2] == UINT32_MAX);
TEST_LPM_ASSERT(hop[3] == next_hop_add);
}

@@ -409,16 +412,16 @@ test8(void)
TEST_LPM_ASSERT(status == -ENOENT);

ipx4 = _mm_set_epi32(ip1, ip1, ip2, ip2);
-   rte_lpm_lookupx4(lpm, ipx4, hop, UINT16_MAX);
+   rte_lpm_lookupx4(lpm, ipx4, hop, UINT32_MAX);
if (depth != 1) {
TEST_LPM_ASSERT(hop[0] == next_hop_add);
TEST_LPM_ASSERT(hop[1] == next_hop_add);
} else {
-   TEST_LPM_ASSERT(hop[0] == UINT16_MAX);
-   TEST_LPM_ASSERT(hop[1] == UINT16_MAX);
+   TEST_LPM_ASSERT(hop[0] == UINT32_MAX);
+   TEST_LPM_ASSERT(hop[1] == UINT32_MAX);
}
-   TEST_LPM_ASSERT(hop[2] == UINT16_MAX);
-   TEST_LPM_ASSERT(hop[3] == UINT16_MAX);
+   TEST_LPM_ASSERT(hop[2] == UINT32_MAX);
+   TEST_LPM_ASSERT(hop[3] == UINT32_MAX);
}

rte_lpm_free(lpm);
@@ -438,7 +441,8 @@ test9(void)
 {
struct rte_lpm *lpm = NULL;
uint32_t ip, ip_1, ip_2;
-   uint8_t depth, depth_1, depth_2, next_hop_add, next_hop_add_1,
+   uint8_t depth, depth_1, depth_2;
+   uint32_t next_hop_add, next_hop_add_1,
next_hop_add_2, next_hop_return;
int32_t status = 0;

@@ -602,7 +606,8 @@ test10(void)

struct rte_lpm *lpm = NULL;
uint32_t ip;
-   uint8_t depth, next_hop_add, next_hop_return;
+   uint8_t depth;
+   uint32_t next_hop_add, next_hop_return;
int32_t status = 0;

/* Add rule that covers a TBL24 range previously invalid & lookup
@@ -788,7 +793,8 @@ test11(void)

struct rte_lpm *lpm = NULL;
uint32_t ip;
-   uint8_t dept

[dpdk-dev] [PATCH v1] change hugepage sorting to avoid overlapping memcpy

2015-09-08 Thread Jay Rolette
Most of the code in sort_by_physaddr() should be replaced by a call to
qsort() instead. Less code and gets rid of an O(n^2) sort. It's only init
code, but given how long EAL init takes, every bit helps.
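
Something along these lines (a sketch only, not the code from that patch):

    static int
    cmp_physaddr(const void *a, const void *b)
    {
        const struct hugepage_file *p1 = a;
        const struct hugepage_file *p2 = b;
    #ifdef RTE_ARCH_PPC_64
        /* PowerPC maps hugepages starting from the higher addresses */
        return (p1->physaddr < p2->physaddr) ? 1 :
               (p1->physaddr > p2->physaddr) ? -1 : 0;
    #else
        return (p1->physaddr < p2->physaddr) ? -1 :
               (p1->physaddr > p2->physaddr) ? 1 : 0;
    #endif
    }

    /* qsort(hugepg_tbl, hpi->num_pages[0], sizeof(*hugepg_tbl),
     *       cmp_physaddr); */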

I submitted a patch for this close to a year ago:
http://dpdk.org/dev/patchwork/patch/2061/

Jay

On Tue, Sep 8, 2015 at 7:45 AM, Gonzalez Monroy, Sergio <sergio.gonzalez.monroy at intel.com> wrote:

> Hi Ralf,
>
> Just a few comments/suggestions:
>
> Add 'eal/linux:'  to the commit title, ie:
>   "eal/linux: change hugepage sorting to avoid overlapping memcpy"
>
> On 04/09/2015 11:14, Ralf Hoffmann wrote:
>
>> with only one hugepage or already sorted hugepage addresses, the sort
>> function called memcpy with same src and dst pointer. Debugging with
>> valgrind will issue a warning about overlapping area. This patch changes
>> the bubble sort to avoid this behavior. Also, the function cannot fail
>> any longer.
>>
>> Signed-off-by: Ralf Hoffmann 
>> ---
>>   lib/librte_eal/linuxapp/eal/eal_memory.c | 27
>> +--
>>   1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index ac2745e..6d01f61 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -699,25 +699,25 @@ error:
>>* higher address first on powerpc). We use a slow algorithm, but we
>> won't
>>* have millions of pages, and this is only done at init time.
>>*/
>> -static int
>> +static void
>>   sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct hugepage_info
>> *hpi)
>>   {
>> unsigned i, j;
>> -   int compare_idx;
>> +   unsigned compare_idx;
>> uint64_t compare_addr;
>> struct hugepage_file tmp;
>> for (i = 0; i < hpi->num_pages[0]; i++) {
>> -   compare_addr = 0;
>> -   compare_idx = -1;
>> +   compare_addr = hugepg_tbl[i].physaddr;
>> +   compare_idx = i;
>> /*
>> -* browse all entries starting at 'i', and find the
>> +* browse all entries starting at 'i+1', and find the
>>  * entry with the smallest addr
>>  */
>> -   for (j=i; j< hpi->num_pages[0]; j++) {
>> +   for (j=i + 1; j < hpi->num_pages[0]; j++) {
>>
> Although there are many style/checkpatch issues in current code, we try to
> fix them
> in new patches.
> In that regard, checkpatch complains about above line with:
> ERROR:SPACING: spaces required around that '='
>
>   - if (compare_addr == 0 ||
>> +   if (
>>   #ifdef RTE_ARCH_PPC_64
>> hugepg_tbl[j].physaddr > compare_addr) {
>>   #else
>> @@ -728,10 +728,9 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl,
>> struct hugepage_info *hpi)
>> }
>> }
>>   - /* should not happen */
>> -   if (compare_idx == -1) {
>> -   RTE_LOG(ERR, EAL, "%s(): error in physaddr
>> sorting\n", __func__);
>> -   return -1;
>> +   if (compare_idx == i) {
>> +   /* no smaller page found */
>> +   continue;
>> }
>> /* swap the 2 entries in the table */
>> @@ -741,7 +740,8 @@ sort_by_physaddr(struct hugepage_file *hugepg_tbl,
>> struct hugepage_info *hpi)
>> sizeof(struct hugepage_file));
>> memcpy(&hugepg_tbl[i], &tmp, sizeof(struct
>> hugepage_file));
>> }
>> -   return 0;
>> +
>> +   return;
>>   }
>>
> I reckon checkpatch is not picking this one because the end-of-function is
> not part of the patch,
> but it is a warning:
> WARNING:RETURN_VOID: void function return statements are not generally
> useful
>
> /*
>> @@ -1164,8 +1164,7 @@ rte_eal_hugepage_init(void)
>> goto fail;
>> }
>>   - if (sort_by_physaddr(&tmp_hp[hp_offset], hpi) < 0)
>> -   goto fail;
>> +   sort_by_physaddr(&tmp_hp[hp_offset], hpi);
>> #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
>> /* remap all hugepages into single file segments */
>>
>>
>>
> Thanks,
> Sergio
>


[dpdk-dev] [PATCH v2 1/1] ip_frag: fix creating ipv6 fragment extension header

2015-09-08 Thread Piotr Azarewicz
The previous implementation won't work in every environment. The order of
allocation of bit-fields within a unit (high-order to low-order or
low-order to high-order) is implementation-defined.
Solution: use bytes instead of bit fields.
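
For reference, the 16-bit frag_data field then packs, per RFC 2460, the
fragment offset in bits 15..3, two reserved bits, and the M
(more-fragments) flag in bit 0. A sketch of the packing (the helper name
is illustrative; the masks match the diff below):

    #define IPV6_HDR_MF_MASK 0x0001 /* more-fragments flag */
    #define IPV6_HDR_FO_MASK 0x0007 /* bits below the offset field */

    static inline uint16_t
    ipv6_frag_data(uint16_t fofs, uint16_t mf)
    {
        /* fofs already carries the offset shifted into bits 15..3 */
        return (uint16_t)((fofs & ~IPV6_HDR_FO_MASK) |
                          (mf & IPV6_HDR_MF_MASK));
    }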

v2 changes:
- remove useless union
- fix process_ipv6 function (due to remove the union above)

Signed-off-by: Piotr Azarewicz 
---
 lib/librte_ip_frag/rte_ip_frag.h|   13 ++---
 lib/librte_ip_frag/rte_ipv6_fragmentation.c |6 ++
 lib/librte_port/rte_port_ras.c  |   10 +++---
 3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/lib/librte_ip_frag/rte_ip_frag.h b/lib/librte_ip_frag/rte_ip_frag.h
index 52f44c9..f3ca566 100644
--- a/lib/librte_ip_frag/rte_ip_frag.h
+++ b/lib/librte_ip_frag/rte_ip_frag.h
@@ -130,17 +130,8 @@ struct rte_ip_frag_tbl {
 /** IPv6 fragment extension header */
 struct ipv6_extension_fragment {
uint8_t next_header;/**< Next header type */
-   uint8_t reserved1;  /**< Reserved */
-   union {
-   struct {
-   uint16_t frag_offset:13; /**< Offset from the start of the packet */
-   uint16_t reserved2:2; /**< Reserved */
-   uint16_t more_frags:1;
-   /**< 1 if more fragments left, 0 if last fragment */
-   };
-   uint16_t frag_data;
-   /**< union of all fragmentation data */
-   };
+   uint8_t reserved;   /**< Reserved */
+   uint16_t frag_data; /**< All fragmentation data */
uint32_t id;/**< Packet ID */
 } __attribute__((__packed__));

diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
index 0e32aa8..ab62efd 100644
--- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
+++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
@@ -65,10 +65,8 @@ __fill_ipv6hdr_frag(struct ipv6_hdr *dst,

fh = (struct ipv6_extension_fragment *) ++dst;
fh->next_header = src->proto;
-   fh->reserved1   = 0;
-   fh->frag_offset = rte_cpu_to_be_16(fofs);
-   fh->reserved2   = 0;
-   fh->more_frags  = rte_cpu_to_be_16(mf);
+   fh->reserved = 0;
+   fh->frag_data = rte_cpu_to_be_16((fofs & ~IPV6_HDR_FO_MASK) | mf);
fh->id = 0;
 }

diff --git a/lib/librte_port/rte_port_ras.c b/lib/librte_port/rte_port_ras.c
index 6bd0f8c..3dbd5be 100644
--- a/lib/librte_port/rte_port_ras.c
+++ b/lib/librte_port/rte_port_ras.c
@@ -205,6 +205,9 @@ process_ipv4(struct rte_port_ring_writer_ras *p, struct rte_mbuf *pkt)
}
 }

+#define MORE_FRAGS(x) ((x) & 0x0001)
+#define FRAG_OFFSET(x) ((x) >> 3)
+
 static void
 process_ipv6(struct rte_port_ring_writer_ras *p, struct rte_mbuf *pkt)
 {
@@ -212,12 +215,13 @@ process_ipv6(struct rte_port_ring_writer_ras *p, struct rte_mbuf *pkt)
struct ipv6_hdr *pkt_hdr = rte_pktmbuf_mtod(pkt, struct ipv6_hdr *);

struct ipv6_extension_fragment *frag_hdr;
+   uint16_t frag_data = 0;
frag_hdr = rte_ipv6_frag_get_ipv6_fragment_header(pkt_hdr);
-   uint16_t frag_offset = frag_hdr->frag_offset;
-   uint16_t frag_flag = frag_hdr->more_frags;
+   if (frag_hdr != NULL)
+   frag_data = rte_be_to_cpu_16(frag_hdr->frag_data);

/* If it is a fragmented packet, then try to reassemble */
-   if ((frag_flag == 0) && (frag_offset == 0))
+   if ((MORE_FRAGS(frag_data) == 0) && (FRAG_OFFSET(frag_data) == 0))
p->tx_buf[p->tx_buf_count++] = pkt;
else {
struct rte_mbuf *mo;
-- 
1.7.9.5



[dpdk-dev] [PATCH v1] change hugepage sorting to avoid overlapping memcpy

2015-09-08 Thread Gonzalez Monroy, Sergio
On 08/09/2015 14:29, Jay Rolette wrote:
> Most of the code in sort_by_physaddr() should be replaced by a call to 
> qsort() instead. Less code and gets rid of an O(n^2) sort. It's only 
> init code, but given how long EAL init takes, every bit helps.
>
Fair enough.
Actually, we already use qsort in
lib/librte_eal/linuxapp/eal/eal_hugepage_info.c

> I submitted a patch for this close to a year ago: 
> http://dpdk.org/dev/patchwork/patch/2061/
>
I just had a quick look at it and it seems to be archived with 'Changes
Requested' status.

I will comment on it.

Sergio
> Jay
>
> On Tue, Sep 8, 2015 at 7:45 AM, Gonzalez Monroy, Sergio wrote:
>
> Hi Ralf,
>
> Just a few comments/suggestions:
>
> Add 'eal/linux:'  to the commit title, ie:
>   "eal/linux: change hugepage sorting to avoid overlapping memcpy"
>
> On 04/09/2015 11:14, Ralf Hoffmann wrote:
>
> with only one hugepage or already sorted hugepage addresses,
> the sort
> function called memcpy with same src and dst pointer.
> Debugging with
> valgrind will issue a warning about overlapping area. This
> patch changes
> the bubble sort to avoid this behavior. Also, the function
> cannot fail
> any longer.
>
> Signed-off-by: Ralf Hoffmann
>  >
> ---
>   lib/librte_eal/linuxapp/eal/eal_memory.c | 27
> +--
>   1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c
> b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index ac2745e..6d01f61 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -699,25 +699,25 @@ error:
>* higher address first on powerpc). We use a slow
> algorithm, but we won't
>* have millions of pages, and this is only done at init time.
>*/
> -static int
> +static void
>   sort_by_physaddr(struct hugepage_file *hugepg_tbl, struct
> hugepage_info *hpi)
>   {
> unsigned i, j;
> -   int compare_idx;
> +   unsigned compare_idx;
> uint64_t compare_addr;
> struct hugepage_file tmp;
> for (i = 0; i < hpi->num_pages[0]; i++) {
> -   compare_addr = 0;
> -   compare_idx = -1;
> +   compare_addr = hugepg_tbl[i].physaddr;
> +   compare_idx = i;
> /*
> -* browse all entries starting at 'i', and
> find the
> +* browse all entries starting at 'i+1', and
> find the
>  * entry with the smallest addr
>  */
> -   for (j=i; j< hpi->num_pages[0]; j++) {
> +   for (j=i + 1; j < hpi->num_pages[0]; j++) {
>
> Although there are many style/checkpatch issues in current code,
> we try to fix them
> in new patches.
> In that regard, checkpatch complains about above line with:
> ERROR:SPACING: spaces required around that '='
>
>   - if (compare_addr == 0 ||
> +   if (
>   #ifdef RTE_ARCH_PPC_64
> hugepg_tbl[j].physaddr >
> compare_addr) {
>   #else
> @@ -728,10 +728,9 @@ sort_by_physaddr(struct hugepage_file
> *hugepg_tbl, struct hugepage_info *hpi)
> }
> }
>   - /* should not happen */
> -   if (compare_idx == -1) {
> -   RTE_LOG(ERR, EAL, "%s(): error in
> physaddr sorting\n", __func__);
> -   return -1;
> +   if (compare_idx == i) {
> +   /* no smaller page found */
> +   continue;
> }
> /* swap the 2 entries in the table */
> @@ -741,7 +740,8 @@ sort_by_physaddr(struct hugepage_file
> *hugepg_tbl, struct hugepage_info *hpi)
> sizeof(struct hugepage_file));
> memcpy(&hugepg_tbl[i], &tmp, sizeof(struct
> hugepage_file));
> }
> -   return 0;
> +
> +   return;
>   }
>
> I reckon checkpatch is not picking this one because the
> end-of-function is not part of the patch,
> but it is a warning:
> WARNING:RETURN_VOID: void function return statements are not
> generally useful
>
> /*
> @@ -1164,8 +1164,7 @@ rte_eal_hugepage_init(void)
>   

[dpdk-dev] virtio optimization idea

2015-09-08 Thread Stephen Hemminger
On Fri, 4 Sep 2015 08:25:05 +
"Xie, Huawei"  wrote:

> Hi:
> 
> Recently I have done one virtio optimization proof of concept. The
> optimization includes two parts:
> 1) avail ring set with fixed descriptors
> 2) RX vectorization
> With the optimizations, we could have several times of performance boost
> for purely vhost-virtio throughput.
> 
> Here i will only cover the first part, which is the prerequisite for the
> second part.
> Let us first take RX for example. Currently when we fill the avail ring
> with guest mbuf, we need
> a) allocate one descriptor(for non sg mbuf) from free descriptors
> b) set the idx of the desc into the entry of avail ring
> c) set the addr/len field of the descriptor to point to guest blank mbuf
> data area
> 
> Those operations take time, and especially step b results in modified (M)
> state of the cache line for the avail ring in the virtio processing
> core. When vhost processes the avail ring, the cache line transfer from
> virtio processing core to vhost processing core takes pretty much CPU
> cycles.
> To solve this problem, this is the arrangement of RX ring for DPDK
> pmd(for non-mergable case).
>
> avail
> idx
> +
> |
> +----+----+---+-----+-------+------+
> | 0  | 1  | 2 | ... |  254  | 255  |  avail ring
> +-+--+-+--+-+-+-----+---+---+--+---+
>   |    |    |          |       |
>   |    |    |          |       |
>   v    v    v          v       v
> +-+--+-+--+-+-+-----+---+---+--+---+
> | 0  | 1  | 2 | ... |  254  | 255  |  desc ring
> +----+----+---+-----+-------+------+
> |
> |
> +----+----+---+-----+-------+------+
> | 0  | 1  | 2 | ... |  254  | 255  |  used ring
> +----+----+---+-----+-------+------+
> |
> +
> Avail ring is initialized with fixed descriptors and is never changed,
> i.e., the index value of the nth avail ring entry is always n, which
> means the virtio PMD is actually refilling the desc ring only, without
> having to change the avail ring.
> When vhost fetches avail ring, if not evicted, it is always in its first
> level cache.
> 
> When RX receives packets from used ring, we use the used->idx as the
> desc idx. This requires that vhost processes and returns descs from
> avail ring to used ring in order, which is true for both current dpdk
> vhost and kernel vhost implementation. In my understanding, there is no
> necessity for vhost net to process descriptors OOO. One case could be
> zero copy, for example, if one descriptor doesn't meet zero copy
> requirement, we could directly return it to used ring, earlier than the
> descriptors in front of it.
> To enforce this, I want to use a reserved bit to indicate in-order
> processing of descriptors.
> 
> For tx ring, the arrangement is like below. Each transmitted mbuf needs
> a desc for virtio_net_hdr, so actually we have only 128 free slots.
> 
>   +-----+-----+-----+------++------+------+------+------+
>   |  0  |  1  | ... | 127  || 128  | 129  | ...  | 255  |   avail ring with fixed descriptors
>   +-----+-----+-----+------++------+------+------+------+
>      |     |           |        |      |             |
>      v     v           v        v      v             v
>   +-----+-----+-----+------++------+------+------+------+
>   | 127 | 128 | ... | 255  || 127  | 128  | ...  | 255  |   desc ring for virtio_net_hdr
>   +-----+-----+-----+------++------+------+------+------+
>      |     |           |        |      |             |
>      v     v           v        v      v             v
>   +-----+-----+-----+------++------+------+------+------+
>   |  0  |  1  | ... | 127  ||  0   |  1   | ...  | 127  |   desc ring for tx data
>   +-----+-----+-----+------++------+------+------+------+
>  
> 

Does this still work with a Linux (or BSD) guest/host?
If you are assuming both virtio/vhost are DPDK this is never g
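
For reference, here is a minimal sketch of the fixed-avail-ring refill
described above. The struct layouts follow the virtio spec, but the helper
names are hypothetical and this is not taken from the actual proof of
concept:

#include <stdint.h>

#define RING_SIZE 256
#define VRING_DESC_F_WRITE 2	/* device (vhost) writes into the buffer */

struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[RING_SIZE]; };

/* One-time setup: entry n of the avail ring always holds desc index n, so
 * the avail ring's cache lines are written once and never dirtied again by
 * the refill path. */
static void avail_ring_init_fixed(struct vring_avail *avail)
{
	uint16_t n;

	for (n = 0; n < RING_SIZE; n++)
		avail->ring[n] = n;
}

/* Per-refill work then touches only the desc ring: point slot `idx` at a
 * blank mbuf data area. Step b) from the mail disappears entirely, and RX
 * consumption can rely on in-order completion, deriving the desc index
 * from used->idx without reading the used element's id field. */
static void rx_refill_slot(struct vring_desc *desc, uint16_t idx,
			   uint64_t mbuf_data_addr, uint32_t buf_len)
{
	desc[idx].addr  = mbuf_data_addr;
	desc[idx].len   = buf_len;
	desc[idx].flags = VRING_DESC_F_WRITE;
	desc[idx].next  = 0;	/* unused for single-descriptor buffers */
}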

[dpdk-dev] virtio optimization idea

2015-09-08 Thread Xie, Huawei
On 9/8/2015 11:39 PM, Stephen Hemminger wrote:
> On Fri, 4 Sep 2015 08:25:05 +
> "Xie, Huawei"  wrote:
>
>> Hi:
>>
>> [...]

[dpdk-dev] rte_eal_init() alternative?

2015-09-08 Thread Don Provan
From: Wiles, Keith:
>That stated, I am not a big fan of huge structures being passed into
>an init routine, as such a structure would need to be versioned and will
>grow/change. Plus, he did not really want to deal in strings, so the
>structure would be binary values and strings as required.

A typical library has an init routine which establishes defaults, and
then the application adjusts parameters through targeted set routines
before starting to use the library operationally. In the argc/argv
wrapper, the parsing code would call one of those individual routines
when it parses the corresponding command line flag.

The idea that there has to be one massive init routine which is
passed every possible configuration parameter is more of the same
monolithic thinking that DPDK needs to shake.
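
To illustrate, the shape of such an API could look roughly like the sketch
below (every one of these names is hypothetical -- no such setters exist in
DPDK today):

/* Hypothetical, illustrative API shape only. */
int rte_eal_defaults(void);                  /* establish sane defaults */
int rte_eal_set_lcore_mask(uint64_t mask);   /* targeted set routine */
int rte_eal_set_memory_mb(unsigned int mb);  /* targeted set routine */
int rte_eal_start(void);                     /* begin operational use */

/* An argc/argv wrapper then reduces to parsing each flag and calling the
 * matching setter, e.g. "-c <mask>" -> rte_eal_set_lcore_mask(mask). */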

-don provan
dprovan at bivio.net


[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-08 Thread Marc Sune
Nélio,


2015-09-08 12:03 GMT+02:00 Nélio Laranjeiro :

> On Mon, Sep 07, 2015 at 10:52:53PM +0200, Marc Sune wrote:
> > 2015-08-29 2:16 GMT+02:00 Marc Sune :
> >
> > > The current rte_eth_dev_info abstraction does not provide any
> > > mechanism to get the supported speed(s) of an ethdev.
> > >
> > > For some drivers (e.g. ixgbe), an educated guess can be made based on
> > > the driver's name (driver_name in rte_eth_dev_info), see:
> > >
> > > http://dpdk.org/ml/archives/dev/2013-August/000412.html
> > >
> > > However, i) doing string comparisons is annoying, and can silently
> > > break existing applications if PMDs change their names; ii) it does
> > > not provide all the supported capabilities of the ethdev; iii) for
> > > some drivers it is impossible for the application to determine the
> > > (max) speed correctly (e.g. for i40e, distinguishing between XL710
> > > and X710).
> > >
> > > This small patch adds a speed_capa bitmap in rte_eth_dev_info, which
> > > is filled in by the PMDs according to the physical device capabilities.
> > >
> > > v2: rebase; converted speed_capa into a 32-bit bitmap; fixed alignment
> > > (checkpatch).
> > >
> > > v3: rebase to v2.1. Unified ETH_LINK_SPEED and ETH_SPEED_CAP into
> > > ETH_SPEED. Converted the field speed in struct rte_eth_conf to speeds,
> > > to allow a bitmap for defining the announced speeds, as suggested by
> > > M. Brorup. Fixed spelling issues.
> > >
> > > v4: fixed errata in the documentation of the speeds field of
> > > rte_eth_conf and in the commit 1/2 message. Rebased to v2.1.0 (v3 was
> > > incorrectly based on ~2.1.0-rc1).
> > >
> >
> > Thomas,
> >
> > Since it was mostly you commenting on v1 and v2: any opinion on this one?
> >
> > Regards
> > marc
>
> Hi Marc,
>
> I have read your patches, and there are a few mistakes, for instance mlx4
> (ConnectX-3 devices) does not support 100Gbps.
>

When I circulated v1 and v2 I kindly asked the maintainers and reviewers
of the drivers to fix any mistakes in the SPEED capabilities, since I was
taking the speeds from online websites and catalogues. Some were fixed,
but apparently some errors remain. I will remove 100 Gbps; please report
any other error you have spotted.



>
> In addition, it seems your new bitmap does not support all kinds of
> speeds; take a look at the ethtool header in the Linux kernel
> (include/uapi/linux/ethtool.h), which already consumes 30 bits without
> even managing speeds above 56 Gbps.
>

The bitmaps you are referring to are SUPPORTED_ and ADVERTISED_. These
bitmaps contain not only the speeds but also PHY properties (e.g. the
BASE designations for Ethernet).

The intention of this patch was to expose speed capabilities, similar to
the SPEED_ definitions in include/uapi/linux/ethtool.h, which as you can
see map closely to the ETH_SPEED_ constants proposed in this patch.
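
For illustration, application-side usage could then look roughly like this
(speed_capa is the field added by this patch; the ETH_SPEED_10G constant
name is a placeholder for whatever the final version defines):

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	if (dev_info.speed_capa & ETH_SPEED_10G)
		printf("port %u supports 10G\n", port_id);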

I think the encoding of other things, like the exact model of the
interface and its PHY details, should go somewhere else. But I might be
wrong here, so I am open to hearing opinions.


>
> It would be nice to keep the field representing the real speed of the
> link, in case it is not represented by the bitmap; it could also be
> useful for aggregated links (bonding, for instance). The current API
> already works this way; it just needs to be extended from 16 to 32 bits
> to manage speeds above 64 Gbps.
>

This patch does not remove the rte_eth_link_get() API. It just changes the
encoding of speed in struct rte_eth_link, to have a homogeneous set of
constants shared with the speed capabilities bitmap, as discussed
previously in the thread (see Thomas' comments). IOW, it now returns a
single SPEED_ value in struct rte_eth_link's link_speed field.
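
So existing link-status code keeps working; only the speed constants
change. Roughly (again, ETH_SPEED_10G is a placeholder name):

	struct rte_eth_link link;

	rte_eth_link_get(port_id, &link);
	if (link.link_status && link.link_speed == ETH_SPEED_10G)
		printf("port %u: link up at 10G\n", port_id);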

Marc


>
> >[...]
>
> Nélio
> --
> Nélio Laranjeiro
> 6WIND
>


[dpdk-dev] Recommended method of using DPDK inside a Vmware ESXi guest ?

2015-09-08 Thread Ale Mansoor
Hi, 
I am trying to use DPDK inside a VMware ESXi 6.0 guest and see two possible
approaches to achieving this:
1. Use vmxnet3-usermap with the PMD
2. Use igb_uio with SR-IOV passthrough from the ESXi host to the guest
My guest systems are either Fedora Linux FC20 or Ubuntu 15.04 instances
with 8 vCPUs and 8 GB of memory.
When running the l2fwd example using igb_uio, the performance numbers I got
were very low (<100 PPS), and when I tried using vmxnet3-usermap from
dpdk.org, it does not even seem to compile under a Linux 3.x kernel.
My system is an HP DL380 G8 platform with dual Xeon 2670 CPUs, 64 GB of
physical memory, and an Intel 82599 NIC, running ESXi 6.0.
Do I need to enable any additional kernel build or boot options? Based on
some posts here, it appears ESXi does not provide an emulated IOMMU.
Thanks in advance for your help.
-Ale Mansoor



[dpdk-dev] Recommended method of using DPDK inside a Vmware ESXi guest ?

2015-09-08 Thread Matthew Hall
On Tue, Sep 08, 2015 at 11:43:38PM +, Ale Mansoor wrote:
> When running the l2fwd example using igb_uio, the performance numbers I
> got were very low (<100 PPS), and when I tried using vmxnet3-usermap from
> dpdk.org, it does not even seem to compile under a Linux 3.x kernel.

Not everybody will have this environment available to do tests.

Can you send the log output, etc. so we could look for issues as well?

Thanks,
Matthew.