[dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx

2016-06-03 Thread Xie, Huawei
On 6/1/2016 2:53 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote:
>>> /* Retrieve all of the head indexes first to avoid caching issues. */
>>> for (i = 0; i < count; i++) {
>>> -   desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
>>> -   (vq->size - 1)];
>>> +   used_idx = (vq->last_used_idx + i) & (vq->size - 1);
>>> +   desc_indexes[i] = vq->avail->ring[used_idx];
>>> +
>>> +   vq->used->ring[used_idx].id  = desc_indexes[i];
>>> +   vq->used->ring[used_idx].len = 0;
>>> +   vhost_log_used_vring(dev, vq,
>>> +   offsetof(struct vring_used, ring[used_idx]),
>>> +   sizeof(vq->used->ring[used_idx]));
>>> }
>>>  
>>> /* Prefetch descriptor index. */
>>> rte_prefetch0(&vq->desc[desc_indexes[0]]);
>>> -   rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
>>> -
>>> for (i = 0; i < count; i++) {
>>> int err;
>>>  
>>> -   if (likely(i + 1 < count)) {
>>> +   if (likely(i + 1 < count))
>>> rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
>>> -   rte_prefetch0(&vq->used->ring[(used_idx + 1) &
>>> - (vq->size - 1)]);
>>> -   }
>>>  
>>> pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>>> if (unlikely(pkts[i] == NULL)) {
>>> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>> rte_pktmbuf_free(pkts[i]);
>>> break;
>>> }
>>> -
>>> -   used_idx = vq->last_used_idx++ & (vq->size - 1);
>>> -   vq->used->ring[used_idx].id  = desc_indexes[i];
>>> -   vq->used->ring[used_idx].len = 0;
>>> -   vhost_log_used_vring(dev, vq,
>>> -   offsetof(struct vring_used, ring[used_idx]),
>>> -   sizeof(vq->used->ring[used_idx]));
>>> }
>> Had tried post-updating the used ring in batch, but forgot the perf change.
> I would assume pre-updating gives better performance gain, as we are
> fiddling with avail and used ring together, which would be more cache
> friendly.

The distance between the entries for the avail ring and the used ring is at
least 8 cache lines.
The benefit comes from batch updates, if applicable.

>
>> One optimization would be on vhost_log_used_ring.
>> I have two ideas,
>> a) In QEMU side, we always assume use ring will be changed. so that we
>> don't need to log used ring in VHOST.
>>
>> Michael: feasible in QEMU? comments on this?
>>
>> b) We could always mark the total used ring modified rather than entry
>> by entry.
> I doubt it's worthwhile. One fact is that vhost_log_used_ring is
> a non operation in most time: it will take action only in the short
> gap of during live migration.
>
> And FYI, I even tried with all vhost_log_xxx being removed, it showed
> no performance boost at all. Therefore, it's not a factor that will
> impact performance.

I knew this.

>   --yliu
>



[dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets

2016-06-03 Thread Xie, Huawei
On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Both current kernel virtio driver and DPDK virtio driver use at least
> 2 desc buffer for Tx: the first for storing the header, and the others
> for storing the data.
>
> Therefore, we could fetch the first data desc buf before the main loop,
> and do the copy first before the check of "are we done yet?". This
> could save one check for small packets, that just have one data desc
> buffer and need one mbuf to store it.
>
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 






[dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets

2016-06-03 Thread Xie, Huawei
On 6/1/2016 2:41 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 06:24:18AM +0000, Xie, Huawei wrote:
>> On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
>>> Both current kernel virtio driver and DPDK virtio driver use at least
>>> 2 desc buffer for Tx: the first for storing the header, and the others
>>> for storing the data.
>> Tx could prepend some space for virtio net header whenever possible, so
>> that it could use only one descriptor.
> In such case, it will work as well: it will goto the "else" code path
> then.

It works, but the problem is the statement in the commit message.

>> Another thing is this doesn't reduce the check because you also add a check.
> Actually, yes, it does save check. Before this patch, we have:
>
>   while (not done yet) {
>  if (desc_is_drain)
>  ...;
>
>  if (mbuf_is_full)
>  ...;
>
>  COPY();
>   }
>
> Note that the "while" check will be done twice, therefore, it's 4
> checks in total. After this patch, it would be:
>
>   if (virtio_net_hdr_takes_one_desc)
>   ...
>
>   while (1) {
>   COPY();
>
>   if (desc_is_drain) {
>   break if done;
>   ...;
>   }
>
>   if (mbuf_is_full {
>   /* small packets will bypass this check */
>   ;
>   }
>   }
>
> So, for small packets, it takes 2 checks only, which actually saves
> 2 checks.
>
>   --yliu
>

For small packets, we have three checks in total. It is worth using space to
save checks, so I would ack this patch, but note that there is one assumption:
the rte_memcpy API can handle zero-length copies.





[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-06-02 Thread Xie, Huawei
On 6/2/2016 4:52 PM, Yuanhan Liu wrote:
> On Thu, Jun 02, 2016 at 08:39:36AM +0000, Xie, Huawei wrote:
>> On 6/1/2016 2:03 PM, Yuanhan Liu wrote:
>>>> On Wed, Jun 01, 2016 at 05:40:08AM +0000, Xie, Huawei wrote:
>>>> On 5/30/2016 4:20 PM, Yuanhan Liu wrote:
>>>>> On Wed, May 25, 2016 at 12:16:41AM +0800, Huawei Xie wrote:
>>>>>> There is no external function call or any barrier in the loop,
>>>>>> the used->idx would only be retrieved once.
>>>>>>
>>>>>> Signed-off-by: Huawei Xie 
>>>>>> ---
>>>>>>  drivers/net/virtio/virtio_ethdev.c | 3 ++-
>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>>>>>> b/drivers/net/virtio/virtio_ethdev.c
>>>>>> index c3fb628..f6d6305 100644
>>>>>> --- a/drivers/net/virtio/virtio_ethdev.c
>>>>>> +++ b/drivers/net/virtio/virtio_ethdev.c
>>>>>> @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
>>>>>> virtio_pmd_ctrl *ctrl,
>>>>>>  usleep(100);
>>>>>>  }
>>>>>>  
>>>>>> -while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
>>>>>> +while (vq->vq_used_cons_idx !=
>>>>>> +   *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
>>>>> I'm wondering maybe we could fix VIRTQUEUE_NUSED (which has no such
>>>>> qualifier) and use this macro here?
>>>>>
>>>>> If you check the reference of that macro, you might find similar
>>>>> issues, say, it is also used inside the while-loop of
>>>>> virtio_recv_mergeable_pkts().
>>>>>
>>>>>   --yliu
>>>>>  
>>>>>
>>>> Yes, seems it has same issue though haven't confirmed with asm code.
>>> So, move the "volatile" qualifier to VIRTQUEUE_NUSED?
>>>
>>> --yliu
>>>
>> Yes, anyway this is just intermediate fix. In next patch, will declare
>> the idx as volatile, and remove the qualifier in the macro.
> Hmm.., why we need an intermediate fix then, if we can come up with an
> ultimate fix very quickly?
>
>   --yliu
>
... Either is OK. I have no preference.


[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-06-02 Thread Xie, Huawei
On 6/1/2016 2:03 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 05:40:08AM +0000, Xie, Huawei wrote:
>> On 5/30/2016 4:20 PM, Yuanhan Liu wrote:
>>> On Wed, May 25, 2016 at 12:16:41AM +0800, Huawei Xie wrote:
>>>> There is no external function call or any barrier in the loop,
>>>> the used->idx would only be retrieved once.
>>>>
>>>> Signed-off-by: Huawei Xie 
>>>> ---
>>>>  drivers/net/virtio/virtio_ethdev.c | 3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>>>> b/drivers/net/virtio/virtio_ethdev.c
>>>> index c3fb628..f6d6305 100644
>>>> --- a/drivers/net/virtio/virtio_ethdev.c
>>>> +++ b/drivers/net/virtio/virtio_ethdev.c
>>>> @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
>>>> virtio_pmd_ctrl *ctrl,
>>>>usleep(100);
>>>>}
>>>>  
>>>> -  while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
>>>> +  while (vq->vq_used_cons_idx !=
>>>> + *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
>>> I'm wondering maybe we could fix VIRTQUEUE_NUSED (which has no such
>>> qualifier) and use this macro here?
>>>
>>> If you check the reference of that macro, you might find similar
>>> issues, say, it is also used inside the while-loop of
>>> virtio_recv_mergeable_pkts().
>>>
>>> --yliu
>>>  
>>>
>> Yes, seems it has same issue though haven't confirmed with asm code.
> So, move the "volatile" qualifier to VIRTQUEUE_NUSED?
>
>   --yliu
>

Yes, anyway this is just an intermediate fix. In the next patch, I will declare
the idx as volatile and remove the qualifier in the macro.


[dpdk-dev] [PATCH v3] virtio: split virtio rx/tx queue

2016-06-02 Thread Xie, Huawei
On 6/2/2016 4:07 PM, Xie, Huawei wrote:
> We keep a common vq structure, containing only vq related fields,
> and then split others into RX, TX and control queue respectively.
>
> Signed-off-by: Huawei Xie 
Sorry, this is v4.


[dpdk-dev] [PATCH v3] virtio: split virtio rx/tx queue

2016-06-02 Thread Xie, Huawei
On 6/1/2016 3:13 PM, Yuanhan Liu wrote:
> On Mon, May 30, 2016 at 05:06:20PM +0800, Huawei Xie wrote:
>> We keep a common vq structure, containing only vq related fields,
>> and then split others into RX, TX and control queue respectively.
>>
>> Signed-off-by: Huawei Xie 
>> ---
>> v2:
>> - don't split virtio_dev_rx/tx_queue_setup
>> v3:
>> - fix some 80 char warnings
>> - fix other newer version checkpatch warnings
>> - remove '\n' in PMD_RX_LOG
> Weird, I still saw them.

Maybe I missed this one.


>
> Besides that, I found a crash issue with this patch applied. You could
> easily reproduce it by:
>
> testpmd> start tx_first
> testpmd> quit
>
> [82774.055297] testpmd[9661]: segfault at 94 ip 004f7ef0 sp 7ffcb1fa66c0 error 4 in testpmd[40+25b000]
> ./t.pmd: line 11:  9661 Segmentation fault   (core dumped) $RTE_SDK/$RTE_TARGET/app/testpmd -c 0x1f -n 4 -- --nb-cores=4 -i --disable-hw-vlan --txqflags 0xf00 --rxq=$nr_queues --txq=$nr_queues --rxd=256 --txd=256
>
>   --yliu
>

Couldn't reproduce it. It seems like a resource-free issue. Did you test with
multiple queues?


[dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing

2016-06-01 Thread Xie, Huawei
On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> the ifname[] field takes so much space, that it seperate some frequently
> used fields into different caches, say, features and broadcast_rarp.
>
> This patch move all those fields that will be accessed frequently in Rx/Tx
> together (before the ifname[] field) to let them share one cache line.
>
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 



[dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx

2016-06-01 Thread Xie, Huawei
On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Pre update and update used ring in batch for Tx and Rx at the stage
> while fetching all avail desc idx. This would reduce some cache misses
> and hence, increase the performance a bit.
>
> Pre update would be feasible as guest driver will not start processing
> those entries as far as we don't update "used->idx". (I'm not 100%
> certain I don't miss anything, though).
>
> Cc: Michael S. Tsirkin 
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/librte_vhost/vhost_rxtx.c | 58 
> +--
>  1 file changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index c9cd1c5..2c3b810 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t 
> desc_addr,
>  
>  static inline int __attribute__((always_inline))
>  copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -   struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
> +   struct rte_mbuf *m, uint16_t desc_idx)
>  {
>   uint32_t desc_avail, desc_offset;
>   uint32_t mbuf_avail, mbuf_offset;
> @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct 
> vhost_virtqueue *vq,
>   desc_offset = dev->vhost_hlen;
>   desc_avail  = desc->len - dev->vhost_hlen;
>  
> - *copied = rte_pktmbuf_pkt_len(m);
>   mbuf_avail  = rte_pktmbuf_data_len(m);
>   mbuf_offset = 0;
>   while (mbuf_avail != 0 || m->next != NULL) {
> @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   struct vhost_virtqueue *vq;
>   uint16_t res_start_idx, res_end_idx;
>   uint16_t desc_indexes[MAX_PKT_BURST];
> + uint16_t used_idx;
>   uint32_t i;
>  
>   LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   /* Retrieve all of the desc indexes first to avoid caching issues. */
>   rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
>   for (i = 0; i < count; i++) {
> - desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
> -   (vq->size - 1)];
> + used_idx = (res_start_idx + i) & (vq->size - 1);
> + desc_indexes[i] = vq->avail->ring[used_idx];
> + vq->used->ring[used_idx].id = desc_indexes[i];
> + vq->used->ring[used_idx].len = pkts[i]->pkt_len +
> +dev->vhost_hlen;
> + vhost_log_used_vring(dev, vq,
> + offsetof(struct vring_used, ring[used_idx]),
> + sizeof(vq->used->ring[used_idx]));
>   }
>  
>   rte_prefetch0(&vq->desc[desc_indexes[0]]);
>   for (i = 0; i < count; i++) {
>   uint16_t desc_idx = desc_indexes[i];
> - uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
> - uint32_t copied;
>   int err;
>  
> - err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
> -
> - vq->used->ring[used_idx].id = desc_idx;
> - if (unlikely(err))
> + err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
> + if (unlikely(err)) {
> + used_idx = (res_start_idx + i) & (vq->size - 1);
>   vq->used->ring[used_idx].len = dev->vhost_hlen;
> - else
> - vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
> - vhost_log_used_vring(dev, vq,
> - offsetof(struct vring_used, ring[used_idx]),
> - sizeof(vq->used->ring[used_idx]));
> + vhost_log_used_vring(dev, vq,
> + offsetof(struct vring_used, ring[used_idx]),
> + sizeof(vq->used->ring[used_idx]));
> + }
>  
>   if (i + 1 < count)
>   rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
> @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>   /* Prefetch available ring to retrieve head indexes. */
>   used_idx = vq->last_used_idx & (vq->size - 1);
>   rte_prefetch0(&vq->avail->ring[used_idx]);
> + rte_prefetch0(&vq->used->ring[used_idx]);
>  
>   count = RTE_MIN(count, MAX_PKT_BURST);
>   count = RTE_MIN(count, free_entries);
> @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  
>   /* Retrieve all of the head indexes first to avoid caching issues. */
>   for (i = 0; i < count; i++) {
> - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> - (vq->size - 1)];
> + used_idx = (vq->last_used_idx + i) & (vq->size - 1);
> + desc_indexes[i] = vq->avail->ring[used_idx];
> +
> + 

[dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets

2016-06-01 Thread Xie, Huawei
On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Both current kernel virtio driver and DPDK virtio driver use at least
> 2 desc buffer for Tx: the first for storing the header, and the others
> for storing the data.

Tx could prepend some space for the virtio net header whenever possible, so
that it could use only one descriptor.

Another thing is that this doesn't reduce the checks, because you also add a check.


>
> Therefore, we could fetch the first data desc buf before the main loop,
> and do the copy first before the check of "are we done yet?". This
> could save one check for small packets, that just have one data desc
> buffer and need one mbuf to store it.
>
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/librte_vhost/vhost_rxtx.c | 52 
> ++-
>  1 file changed, 36 insertions(+), 16 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 2c3b810..34d6ed1 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -753,18 +753,48 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct 
> vhost_virtqueue *vq,
>   return -1;
>  
>   desc_addr = gpa_to_vva(dev, desc->addr);
> - rte_prefetch0((void *)(uintptr_t)desc_addr);
> -
> - /* Retrieve virtio net header */
>   hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
> - desc_avail  = desc->len - dev->vhost_hlen;
> - desc_offset = dev->vhost_hlen;
> + rte_prefetch0(hdr);
> +
> + /*
> +  * Both current kernel virio driver and DPDK virtio driver
> +  * use at least 2 desc bufferr for Tx: the first for storing
> +  * the header, and others for storing the data.
> +  */
> + if (likely(desc->len == dev->vhost_hlen)) {
> + desc = &vq->desc[desc->next];
> +
> + desc_addr = gpa_to_vva(dev, desc->addr);
> + rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> + desc_offset = 0;
> + desc_avail  = desc->len;
> + nr_desc+= 1;
> +
> + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> + } else {
> + desc_avail  = desc->len - dev->vhost_hlen;
> + desc_offset = dev->vhost_hlen;
> + }
>  
>   mbuf_offset = 0;
>   mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
> - while (desc_avail != 0 || (desc->flags & VRING_DESC_F_NEXT) != 0) {
> + while (1) {
> + cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> + rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> + (void *)((uintptr_t)(desc_addr + desc_offset)),
> + cpy_len);
> +
> + mbuf_avail  -= cpy_len;
> + mbuf_offset += cpy_len;
> + desc_avail  -= cpy_len;
> + desc_offset += cpy_len;
> +
>   /* This desc reaches to its end, get the next one */
>   if (desc_avail == 0) {
> + if ((desc->flags & VRING_DESC_F_NEXT) == 0)
> + break;
> +
>   if (unlikely(desc->next >= vq->size ||
>++nr_desc >= vq->size))
>   return -1;
> @@ -800,16 +830,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct 
> vhost_virtqueue *vq,
>   mbuf_offset = 0;
>   mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
>   }
> -
> - cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> - rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> - (void *)((uintptr_t)(desc_addr + desc_offset)),
> - cpy_len);
> -
> - mbuf_avail  -= cpy_len;
> - mbuf_offset += cpy_len;
> - desc_avail  -= cpy_len;
> - desc_offset += cpy_len;
>   }
>  
>   prev->data_len = mbuf_offset;



[dpdk-dev] [PATCH] virtio: check if devargs is NULL before checking its value

2016-06-01 Thread Xie, Huawei
On 5/27/2016 10:08 AM, Yuanhan Liu wrote:
> On Wed, May 25, 2016 at 12:47:30PM +0200, Thomas Monjalon wrote:
>>> -   dev->devargs->type != RTE_DEVTYPE_WHITELISTED_PCI) {
>>> +   (!dev->devargs ||
>>> +dev->devargs->type != RTE_DEVTYPE_WHITELISTED_PCI)) {
>> Should the title be something like "fix crash ..."?
>>
>> I would also add
>> Reported-by: Vincent Li 
> Huawei, the two are good comments (Thomas, thanks for the review, BTW :).

np.

>
> So, mind to send v2? BTW, I think this patch deserves some explanation,
> say, why dev->devargs could be NULL.
>
>
>
>   --yliu
>



[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-06-01 Thread Xie, Huawei
On 5/30/2016 4:20 PM, Yuanhan Liu wrote:
> On Wed, May 25, 2016 at 12:16:41AM +0800, Huawei Xie wrote:
>> There is no external function call or any barrier in the loop,
>> the used->idx would only be retrieved once.
>>
>> Signed-off-by: Huawei Xie 
>> ---
>>  drivers/net/virtio/virtio_ethdev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>> b/drivers/net/virtio/virtio_ethdev.c
>> index c3fb628..f6d6305 100644
>> --- a/drivers/net/virtio/virtio_ethdev.c
>> +++ b/drivers/net/virtio/virtio_ethdev.c
>> @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
>> virtio_pmd_ctrl *ctrl,
>>  usleep(100);
>>  }
>>  
>> -while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
>> +while (vq->vq_used_cons_idx !=
>> +   *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
> I'm wondering maybe we could fix VIRTQUEUE_NUSED (which has no such
> qualifier) and use this macro here?
>
> If you check the reference of that macro, you might find similar
> issues, say, it is also used inside the while-loop of
> virtio_recv_mergeable_pkts().
>
>   --yliu
>  
>

Yes, it seems it has the same issue, though I haven't confirmed with the asm code.


[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-30 Thread Xie, Huawei
On 5/30/2016 11:00 AM, Yuanhan Liu wrote:
> On Mon, May 30, 2016 at 02:40:00AM +0000, Xie, Huawei wrote:
>> On 5/27/2016 5:06 PM, Yuanhan Liu wrote:
>>> On Tue, May 24, 2016 at 09:38:32PM +0800, Huawei Xie wrote:
>>>>vq->vq_ring_mem = mz->phys_addr;
>>>>vq->vq_ring_virt_mem = mz->addr;
>>>> -  PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64, 
>>>> (uint64_t)mz->phys_addr);
>>>> -  PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, 
>>>> (uint64_t)(uintptr_t)mz->addr);
>>>> -  vq->virtio_net_hdr_mz  = NULL;
>>>> -  vq->virtio_net_hdr_mem = 0;
>>>> +  PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64,
>>>> +   (uint64_t)mz->phys_addr);
>>>> +  PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
>>>> +   (uint64_t)(uintptr_t)mz->addr);
>>>> +
>>>> +  hdr_mz = rte_memzone_reserve_aligned(vq_hdr_name, sz_hdr_mz, socket_id,
>>>> +   0, RTE_CACHE_LINE_SIZE);
>>> We don't need allocate hdr_mz for Rx queue, and in such case, sz_hdr_mz
>>> is 0. I'm wondering what hdr_mz would be then, NULL?
>>>
>>> Anyway, you should skip the hdr_mz allocation for Rx queue, and I also
>>> would suggest you to move the vq_hdr_name setup here.
>> will check sz_hdr_mz before the zone allocation.
>>
>>
>>>> +  if (hdr_mz == NULL) {
>>>> +  if (rte_errno == EEXIST)
>>>> +  hdr_mz = rte_memzone_lookup(vq_hdr_name);
>>>> +  if (hdr_mz == NULL) {
>>>> +  ret = -ENOMEM;
>>>> +  goto fail_q_alloc;
>>>> +  }
>>>> +  }
>>>>  
>>> ...
>>>>  
>>>>PMD_INIT_FUNC_TRACE();
>>>>ret = virtio_dev_queue_setup(dev, VTNET_CQ, VTNET_SQ_CQ_QUEUE_IDX,
>>>> -  vtpci_queue_idx, 0, socket_id, &vq);
>>>> +  vtpci_queue_idx, 0, socket_id, (void **)&cvq);
>>> Unnecessary cast. Note that there are few others like that in this
>>> patch.
>> This cast is needed.
> Oh, right, indeed. Sorry.
>
>>>> -  PMD_RX_LOG(DEBUG, "dequeue:%d", num);
>>>> -  PMD_RX_LOG(DEBUG, "packet len:%d", len[0]);
>>>> +  PMD_RX_LOG(DEBUG, "dequeue:%d\n", num);
>>>> +  PMD_RX_LOG(DEBUG, "packet len:%d\n", len[0]);
>>> We should not append "\n" for PMD_RX_LOG; this macro alreadys does it.
>> Weird. Will remove it. Thanks.
>>
>>> Another note is that you might want to run checkpatch; I saw quite many
>>> warnings.
>> Had checked. The warnings are all due to 80 char limitation of
>> virtio_rxq_stats_strings. Just 4 or 5 chars cross 80 line limit. I
>> prefer to keep the fields aligned.
> Agreed. However, I was talking about others warnings.

OK, I see you are using the checkpatch from a newer Linux kernel.


>
>   --yliu
>
> ---
> CHECK:BRACES: braces {} should be used on all arms of this statement
> #198: FILE: drivers/net/virtio/virtio_ethdev.c:343:

would fix this.

> +   if (queue_type == VTNET_RQ)
> [...]
> +   else if (queue_type == VTNET_TQ) {
> [...]
> +   } else if (queue_type == VTNET_CQ) {
> [...]
>
> CHECK:CAMELCASE: Avoid CamelCase: 
> #280: FILE: drivers/net/virtio/virtio_ethdev.c:404:
> +   PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64,
>
> CHECK:CONCATENATED_STRING: Concatenated strings should use spaces
> between elements
> #280: FILE: drivers/net/virtio/virtio_ethdev.c:404:
> +   PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64,
>
> CHECK:CONCATENATED_STRING: Concatenated strings should use spaces
> between elements
> #282: FILE: drivers/net/virtio/virtio_ethdev.c:406:
> +   PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
>
> WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
> #482: FILE: drivers/net/virtio/virtio_ethdev.c:753:
> +   unsigned nstats = dev->data->nb_tx_queues * VIRTIO_NB_TXQ_XSTATS
> +
>
> WARNING:LONG_LINE: line over 80 characters
> #551: FILE: drivers/net/virtio/virtio_ethdev.c:819:
> +   memset(txvq->stats.size_bins, 0,
> sizeof(txvq->stats.size_bins[0]) * 8);
>
> WARNING:LONG_LINE: line over 80 characters
> #571: FILE: drivers/net/virtio/virtio_ethdev.c:832:
> +   memset(rxvq->stats.size_bins, 0,
> sizeof(rxvq->stats.size_bins[0]) * 8);

would fix this.

>
> CHECK:MULTIPLE_ASSIGNMENTS: multiple assignments should be avoided
> #1529: FILE: drivers/net/virtio/virtio_rxtx_simple.c:362:
> +   nb_commit = nb_pkts = RTE_MIN((vq->vq_free_cnt >> 1), nb_pkts);
>
> CHECK:SPACING: No space is necessary after a cast
> #1530: FILE: drivers/net/virtio/virtio_rxtx_simple.c:363:
> +   desc_idx = (uint16_t) (vq->vq_avail_idx & desc_idx_max);

All other warnings are due to the newer checkpatch script and are not
introduced by this patch, so I wouldn't fix them in this patch.
But I think some reflect better programming practice, like 'Concatenated
strings should use spaces between elements', so I would post a patch to
fix them in the virtio driver (and, if possible, all other drivers).




[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-30 Thread Xie, Huawei
On 5/27/2016 5:06 PM, Yuanhan Liu wrote:
> On Tue, May 24, 2016 at 09:38:32PM +0800, Huawei Xie wrote:
>>  vq->vq_ring_mem = mz->phys_addr;
>>  vq->vq_ring_virt_mem = mz->addr;
>> -PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64, 
>> (uint64_t)mz->phys_addr);
>> -PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64, 
>> (uint64_t)(uintptr_t)mz->addr);
>> -vq->virtio_net_hdr_mz  = NULL;
>> -vq->virtio_net_hdr_mem = 0;
>> +PMD_INIT_LOG(DEBUG, "vq->vq_ring_mem:  0x%"PRIx64,
>> + (uint64_t)mz->phys_addr);
>> +PMD_INIT_LOG(DEBUG, "vq->vq_ring_virt_mem: 0x%"PRIx64,
>> + (uint64_t)(uintptr_t)mz->addr);
>> +
>> +hdr_mz = rte_memzone_reserve_aligned(vq_hdr_name, sz_hdr_mz, socket_id,
>> + 0, RTE_CACHE_LINE_SIZE);
> We don't need allocate hdr_mz for Rx queue, and in such case, sz_hdr_mz
> is 0. I'm wondering what hdr_mz would be then, NULL?
>
> Anyway, you should skip the hdr_mz allocation for Rx queue, and I also
> would suggest you to move the vq_hdr_name setup here.

will check sz_hdr_mz before the zone allocation.


>
>> +if (hdr_mz == NULL) {
>> +if (rte_errno == EEXIST)
>> +hdr_mz = rte_memzone_lookup(vq_hdr_name);
>> +if (hdr_mz == NULL) {
>> +ret = -ENOMEM;
>> +goto fail_q_alloc;
>> +}
>> +}
>>  
> ...
>>  
>>  PMD_INIT_FUNC_TRACE();
>>  ret = virtio_dev_queue_setup(dev, VTNET_CQ, VTNET_SQ_CQ_QUEUE_IDX,
>> -vtpci_queue_idx, 0, socket_id, &vq);
>> +vtpci_queue_idx, 0, socket_id, (void **)&cvq);
> Unnecessary cast. Note that there are few others like that in this
> patch.

This cast is needed.

>
>> -PMD_RX_LOG(DEBUG, "dequeue:%d", num);
>> -PMD_RX_LOG(DEBUG, "packet len:%d", len[0]);
>> +PMD_RX_LOG(DEBUG, "dequeue:%d\n", num);
>> +PMD_RX_LOG(DEBUG, "packet len:%d\n", len[0]);
> We should not append "\n" for PMD_RX_LOG; this macro alreadys does it.

Weird. Will remove it. Thanks.

>
> Another note is that you might want to run checkpatch; I saw quite many
> warnings.

Had checked. The warnings are all due to the 80-char limitation on
virtio_rxq_stats_strings. Just 4 or 5 chars cross the 80-char limit. I
prefer to keep the fields aligned.

>
>   --yliu
>



[dpdk-dev] Crashing OVS+DPDK at the host, from inside of a KVM Guest

2016-05-25 Thread Xie, Huawei
On 5/25/2016 2:06 PM, Christian Ehrhardt wrote:
> Hi,
> ping ...
>
> Later on I want to look at it again once we upgraded to more recent
> releases of the software components involved, but those have to be made
> ready to use first :-/
>
> But the description is good and I wonder if anybody else could reproduce
> this and/or would have a hint on where this might come from or already
> existing related fixes.
>
> I mean in general nothing should be able to crash the host right?

Yes, we are taking care of these issues to prevent a malicious or buggy
guest driver from corrupting vhost.
We have fixed some issues, and will continue to check whether there are
other potential issues.

>
>
> P.S. yeah two list cross posting, but it is yet unclear which it belongs to
> so I'll keep it
>
> Christian Ehrhardt
> Software Engineer, Ubuntu Server
> Canonical Ltd
>
> On Sun, May 15, 2016 at 7:08 AM, Martinx - ?  gmail.com>
> wrote:
>
>> Guys,
>>
>>  If using OVS 2.5 with DPDK 2.2, on Ubuntu Xenial, it is possible to crash
>> the OVS running at the host, from inside of a KVM Guest.
>>
>>  Basically, what I'm trying to do, is to run OVS+DPDK at the host, and
>> also, inside of a KVM Guest, with multi-queue, but it doesn't work and
>> crashes.
>>
>>  Soon as you enable multi-queue at the guest, it crashes the OVS of the
>> host!
>>
>> OVS+DPDK segfault at the host, after running "ovs-vsctl set Open_vSwitch .
>> other_config:n-dpdk-rxqs=4" within a KVM Guest:
>>
>> https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1577088
>>
>> Thanks!
>> Thiago
>>



[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-05-25 Thread Xie, Huawei
On 5/25/2016 6:01 PM, Richardson, Bruce wrote:
> On Wed, May 25, 2016 at 12:50:02PM +0300, Michael S. Tsirkin wrote:
>> On Wed, May 25, 2016 at 10:47:30AM +0100, Bruce Richardson wrote:
>>> On Wed, May 25, 2016 at 11:34:24AM +0300, Michael S. Tsirkin wrote:
>>>>> On Wed, May 25, 2016 at 08:25:20AM +0000, Xie, Huawei wrote:
>>>>> On 5/25/2016 4:12 PM, Xie, Huawei wrote:
>>>>>> There is no external function call or any barrier in the loop,
>>>>>> the used->idx would only be retrieved once.
>>>>>>
>>>>>> Signed-off-by: Huawei Xie 
>>>>>> ---
>>>>>>  drivers/net/virtio/virtio_ethdev.c | 3 ++-
>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/virtio/virtio_ethdev.c 
>>>>>> b/drivers/net/virtio/virtio_ethdev.c
>>>>>> index c3fb628..f6d6305 100644
>>>>>> --- a/drivers/net/virtio/virtio_ethdev.c
>>>>>> +++ b/drivers/net/virtio/virtio_ethdev.c
>>>>>> @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
>>>>>> virtio_pmd_ctrl *ctrl,
>>>>>>  usleep(100);
>>>>>>  }
>>>>>>  
>>>>>> -while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
>>>>>> +while (vq->vq_used_cons_idx !=
>>>>>> +   *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
>>>>>>  uint32_t idx, desc_idx, used_idx;
>>>>>>  struct vring_used_elem *uep;
>>>>>>  
>>>>> Find this issue when do the code rework of RX/TX queue.
>>>>> As in other places, we also have loop retrieving the value of avial->idx
>>>>> or used->idx, i prefer to declare the index in vq structure as volatile
>>>>> to avoid potential issue.
>>> Is there a reason why the value is not always volatile? I would have thought
>>> it would be generally safer to mark the actual value as volatile inside the
>>> structure definition itself? In any cases where we do want to store the 
>>> value
>>> locally and not re-access the structure, a local variable can be used.
>>>
>>> Regards,
>>> /Bruce
>> Linux generally discourages volatile as a general style guidance:
>> https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
>> it doesn't have to apply to dpdk which has a different coding style
>> but IIUC this structure is inherited from linux, deviating
>> will make keeping things up to date harder.
> The prohibition on volatile indeed doesn't apply to DPDK, due to the fact that
> we so seldom use locks, and do a lot of direct register accesses in out PMDs.
> [I also still have the scars from previous issues where we had nice subtle 
> bugs
> in our PMDs - which only occurred with specific subversions of gcc - all due
> to a missing "volatile" on one structure element.]
>
> However, in this case, I take your point about keeping things consistent with
> the kernel. :-)

At least for virtio PMD, we have to support both Linux and FreeBSD, so
DPDK defines its own vring structure instead of including the Linux header file.
There are two solutions for this volatile issue: 1) declare used->idx and
avail->idx as volatile; 2) define a similar
access_once/read_once/write_once macro.
I would take the first one. In the future, we could consider defining
access_once and applying it to all other data structures if we want to use
the kernel style.
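For reference, the second option could be modeled on the Linux kernel's ACCESS_ONCE. A minimal sketch (the macro, struct layout, and helper below are illustrative assumptions, not DPDK's actual definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Kernel-style ACCESS_ONCE: the volatile cast forces the compiler to
 * emit a fresh load/store on every use, without marking the struct
 * field itself volatile. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Illustrative stand-in for the shared used-ring header. */
struct vring_used_hdr {
	uint16_t flags;
	uint16_t idx;	/* written by the other side (host/guest) */
};

/* Without ACCESS_ONCE, a loop such as
 *     while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) ...
 * may have the used->idx load hoisted out of the loop, since there is
 * no function call or barrier inside it. */
static uint16_t load_used_idx(struct vring_used_hdr *h)
{
	return ACCESS_ONCE(h->idx);
}
```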

One thing I am confused about is that other DPDK components include Linux
header files; do they compile on FreeBSD?

>
> /Bruce
>
>>>> It might be a good idea to wrap this in a macro
>>>> similar to ACCESS_ONCE in Linux.
>>>>
>>>>> Stephen:
>>>>> Another question is why we need a loop here?
>>>>>
>>>>> /huawei
>>>> -- 
>>>> MST



[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-25 Thread Xie, Huawei
On 5/25/2016 6:08 PM, Thomas Monjalon wrote:
> Is it a v2? There is neither changelog nor v2 in the title.

Yes, forgot to add v2. It is mainly a rebase against Yuanhan's tree. Besides, to
make the patch simple and focused, it keeps the old queue setup.


[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-05-25 Thread Xie, Huawei
On 5/25/2016 4:12 PM, Xie, Huawei wrote:
> There is no external function call or any barrier in the loop,
> the used->idx would only be retrieved once.
>
> Signed-off-by: Huawei Xie 
> ---
>  drivers/net/virtio/virtio_ethdev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio/virtio_ethdev.c 
> b/drivers/net/virtio/virtio_ethdev.c
> index c3fb628..f6d6305 100644
> --- a/drivers/net/virtio/virtio_ethdev.c
> +++ b/drivers/net/virtio/virtio_ethdev.c
> @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
> virtio_pmd_ctrl *ctrl,
>   usleep(100);
>   }
>  
> - while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
> + while (vq->vq_used_cons_idx !=
> +*((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
>   uint32_t idx, desc_idx, used_idx;
>   struct vring_used_elem *uep;
>  

Found this issue when doing the code rework of the RX/TX queues.
As in other places, we also have loops retrieving the value of avail->idx
or used->idx, so I prefer to declare the index in the vq structure as volatile
to avoid potential issues.

Stephen:
Another question is why we need a loop here?

/huawei


[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Xie, Huawei
On 5/10/2016 4:42 PM, Michael S. Tsirkin wrote:
> On Tue, May 10, 2016 at 08:07:00AM +0000, Xie, Huawei wrote:
>> On 5/10/2016 3:56 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 10, 2016 at 07:24:10AM +, Xie, Huawei wrote:
>>>> On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
>>>>> On Mon, May 09, 2016 at 04:47:02PM +, Xie, Huawei wrote:
>>>>>> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
>>>>>>> +static void *
>>>>>>> +vhost_user_client_reconnect(void *arg)
>>>>>>> +{
>>>>>>> +   struct reconnect_info *reconn = arg;
>>>>>>> +   int ret;
>>>>>>> +
>>>>>>> +   RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
>>>>>>> +   while (1) {
>>>>>>> +   ret = connect(reconn->fd, (struct sockaddr 
>>>>>>> *)&reconn->un,
>>>>>>> +   sizeof(reconn->un));
>>>>>>> +   if (ret == 0)
>>>>>>> +   break;
>>>>>>> +   sleep(1);
>>>>>>> +   }
>>>>>>> +
>>>>>>> +   vhost_user_add_connection(reconn->fd, reconn->vsocket);
>>>>>>> +   free(reconn);
>>>>>>> +
>>>>>>> +   return NULL;
>>>>>>> +}
>>>>>>> +
>>>>>> We could create hundreds of vhost-user ports in OVS. Wihout connections
>>>>>> with QEMU established, those ports are just inactive. This works fine in
>>>>>> server mode.
>>>>>> With client modes, do we need to create hundreds of vhost threads? This
>>>>>> would be too interruptible.
>>>>>> How about we create only one thread and do the reconnections for all the
>>>>>> unconnected socket?
>>>>> Yes, good point and good suggestion. Will do it in v2.
>>>> Hi Michael:
>>>> This reminds me another irrelevant issue.
>>>> In OVS, currently for each vhost port, we create an unix domain socket,
>>>> and QEMU vhost proxy connects to this socket, and we use this to
>>>> identify the connection. This works fine but is our workaround,
>>>> otherwise we have no way to identify the connection.
>>>> Do you think if this is an issue?
>> Let us say vhost creates one unix domain socket, with path as "sockpath",
>> and two virtio ports in two VMS both connect to the same socket with the
>> following command line
>> -chardev socket,id=char0,path=sockpath
>> How could vhost identify the connection?
> getpeername(2)?

getpeername returns host/port? Then it isn't useful.
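A quick sketch of why getpeername(2) likely does not help identify an AF_UNIX client: unless the client explicitly bind(2)s its own path (which the QEMU chardev does not), the peer address is unnamed and sun_path carries no identity. The helper below is illustrative, not part of any patch:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Returns 1 if the connected peer has a named socket path, 0 if the
 * peer is unnamed (the usual case for a client that never bind()s),
 * -1 on error. */
static int peer_has_path(int connected_fd)
{
	struct sockaddr_un peer;
	socklen_t len = sizeof(peer);

	memset(&peer, 0, sizeof(peer));
	if (getpeername(connected_fd, (struct sockaddr *)&peer, &len) < 0)
		return -1;
	return peer.sun_path[0] != '\0';
}
```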

The typical scenario in my mind is:
We create an OVS port with the name "port1", and when we receive a
virtio connection with the ID "port1", we attach that virtio interface to
the OVS port "port1".


>
>
>> Workarounds:
>> vhost creates two unix domain sockets, with path as "sockpath1" and
>> "sockpath2" respectively,
>> and the virtio ports in two VMS respectively connect to "sockpath1" and
>> "sockpath2".
>>
>> If we have some name string from QEMU over vhost, as you mentioned, we
>> could create only one socket with path "sockpath".
>> User ensure that the names are unique, just as how they do with multiple
>> sockets.
>>
> Seems rather fragile.

From the scenario above, it is enough. That is also how it works today
in the DPDK OVS implementation with multiple sockets.
Any other ideas?

>
>>> I'm sorry, I have trouble understanding what you wrote above.
>>> What is the issue you are trying to work around?
>>>
>>>> Do we have plan to support identification in VHOST_USER_MESSAGE? With
>>>> the identification, if vhost as server, we only need to create one
>>>> socket which receives multiple connections, and use the ID in the
>>>> message to identify the connection.
>>>>
>>>> /huawei
>>> Sending e.g. -name string from qemu over vhost might be useful
>>> for debugging, but I'm not sure it's a good idea to
>>> rely on it being unique.
>>>
>>>>> Thanks.
>>>>>
>>>>>   --yliu
>>>>>



[dpdk-dev] [PATCH 4/6] vhost: workaround stale vring base

2016-05-10 Thread Xie, Huawei
On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> When DPDK app crashes (or quits, or gets killed), and when QEMU supports
> reconnecting (patches have been sent, not merged yet), a restart of DPDK
> app would get stale vring base from QEMU. That would break the kernel
> virtio net completely, making it non-work any more, unless a driver reset
> is done.

s/DPDK app/DPDK vhost/

"DPDK app" is confusing

>
> So, instead of getting the stale vring base from QEMU, Huawei suggested
> we could get a proper one from used->idx.
>
> Cc: "Michael S. Tsirkin" 
> Suggested-by: "Xie, Huawei" 
> Signed-off-by: Yuanhan Liu 

Acked-by: Huawei Xie 


[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Xie, Huawei
On 5/10/2016 3:56 PM, Michael S. Tsirkin wrote:
> On Tue, May 10, 2016 at 07:24:10AM +0000, Xie, Huawei wrote:
>> On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
>>> On Mon, May 09, 2016 at 04:47:02PM +, Xie, Huawei wrote:
>>>> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
>>>>> +static void *
>>>>> +vhost_user_client_reconnect(void *arg)
>>>>> +{
>>>>> + struct reconnect_info *reconn = arg;
>>>>> + int ret;
>>>>> +
>>>>> + RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
>>>>> + while (1) {
>>>>> + ret = connect(reconn->fd, (struct sockaddr *)&reconn->un,
>>>>> + sizeof(reconn->un));
>>>>> + if (ret == 0)
>>>>> + break;
>>>>> + sleep(1);
>>>>> + }
>>>>> +
>>>>> + vhost_user_add_connection(reconn->fd, reconn->vsocket);
>>>>> + free(reconn);
>>>>> +
>>>>> + return NULL;
>>>>> +}
>>>>> +
>>>> We could create hundreds of vhost-user ports in OVS. Wihout connections
>>>> with QEMU established, those ports are just inactive. This works fine in
>>>> server mode.
>>>> With client modes, do we need to create hundreds of vhost threads? This
>>>> would be too interruptible.
>>>> How about we create only one thread and do the reconnections for all the
>>>> unconnected socket?
>>> Yes, good point and good suggestion. Will do it in v2.
>> Hi Michael:
>> This reminds me another irrelevant issue.
>> In OVS, currently for each vhost port, we create an unix domain socket,
>> and QEMU vhost proxy connects to this socket, and we use this to
>> identify the connection. This works fine but is our workaround,
>> otherwise we have no way to identify the connection.
>> Do you think if this is an issue?

Let us say vhost creates one unix domain socket, with the path "sockpath",
and the virtio ports in two VMs both connect to that same socket with the
following command line:
-chardev socket,id=char0,path=sockpath
How could vhost identify each connection?


Workarounds:
vhost creates two unix domain sockets, with paths "sockpath1" and
"sockpath2" respectively, and the virtio ports in the two VMs connect to
"sockpath1" and "sockpath2" respectively.

If we have some name string from QEMU over vhost, as you mentioned, we
could create only one socket with the path "sockpath".
Users would ensure that the names are unique, just as they do today with
multiple sockets.


> I'm sorry, I have trouble understanding what you wrote above.
> What is the issue you are trying to work around?
>
>> Do we have plan to support identification in VHOST_USER_MESSAGE? With
>> the identification, if vhost as server, we only need to create one
>> socket which receives multiple connections, and use the ID in the
>> message to identify the connection.
>>
>> /huawei
> Sending e.g. -name string from qemu over vhost might be useful
> for debugging, but I'm not sure it's a good idea to
> rely on it being unique.
>
>>> Thanks.
>>>
>>> --yliu
>>>



[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Xie, Huawei
On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
> On Mon, May 09, 2016 at 04:47:02PM +0000, Xie, Huawei wrote:
>> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
>>> +static void *
>>> +vhost_user_client_reconnect(void *arg)
>>> +{
>>> +   struct reconnect_info *reconn = arg;
>>> +   int ret;
>>> +
>>> +   RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
>>> +   while (1) {
>>> +   ret = connect(reconn->fd, (struct sockaddr *)&reconn->un,
>>> +   sizeof(reconn->un));
>>> +   if (ret == 0)
>>> +   break;
>>> +   sleep(1);
>>> +   }
>>> +
>>> +   vhost_user_add_connection(reconn->fd, reconn->vsocket);
>>> +   free(reconn);
>>> +
>>> +   return NULL;
>>> +}
>>> +
>> We could create hundreds of vhost-user ports in OVS. Wihout connections
>> with QEMU established, those ports are just inactive. This works fine in
>> server mode.
>> With client modes, do we need to create hundreds of vhost threads? This
>> would be too interruptible.
>> How about we create only one thread and do the reconnections for all the
>> unconnected socket?
> Yes, good point and good suggestion. Will do it in v2.

Hi Michael:
This reminds me of another, unrelated issue.
In OVS, currently for each vhost port we create a unix domain socket, and
the QEMU vhost proxy connects to this socket; we use this to identify the
connection. This works fine, but it is a workaround; otherwise we have no
way to identify the connection. Do you think this is an issue?
Do we have a plan to support identification in VHOST_USER_MESSAGE? With
that identification, if vhost acts as the server, we would only need to
create one socket which receives multiple connections, and use the ID in
the message to identify each connection.

/huawei


>
> Thanks.
>
>   --yliu
>



[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-09 Thread Xie, Huawei
On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> +static void *
> +vhost_user_client_reconnect(void *arg)
> +{
> + struct reconnect_info *reconn = arg;
> + int ret;
> +
> + RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
> + while (1) {
> + ret = connect(reconn->fd, (struct sockaddr *)&reconn->un,
> + sizeof(reconn->un));
> + if (ret == 0)
> + break;
> + sleep(1);
> + }
> +
> + vhost_user_add_connection(reconn->fd, reconn->vsocket);
> + free(reconn);
> +
> + return NULL;
> +}
> +

We could create hundreds of vhost-user ports in OVS. Without connections
to QEMU established, those ports are just inactive. This works fine in
server mode.
With client mode, do we need to create hundreds of vhost threads? That
would be too disruptive.
How about we create only one thread and have it do the reconnections for
all the unconnected sockets?
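The single-thread idea can be sketched as a list of pending sockets swept periodically by one reconnect thread. Below, connect_fn stands in for connect(2) so the list logic can be shown without real sockets; the struct and function names are hypothetical, not the actual patch:

```c
#include <assert.h>
#include <stddef.h>

/* One pending-reconnect entry per unconnected vhost-user port. */
struct reconn {
	int fd;
	void *vsocket;		/* opaque per-port state */
	struct reconn *next;
};

/* One sweep over the pending list: entries whose connect attempt
 * succeeds are unlinked; the rest stay for the next sweep (the real
 * thread would sleep(1) between sweeps).  Returns the number of
 * connections established. */
static int reconnect_sweep(struct reconn **head,
			   int (*connect_fn)(int fd, void *vsocket))
{
	int done = 0;
	struct reconn **pp = head;

	while (*pp) {
		struct reconn *r = *pp;
		if (connect_fn(r->fd, r->vsocket) == 0) {
			*pp = r->next;	/* connected: drop from list */
			done++;
		} else {
			pp = &r->next;	/* still down: retry later */
		}
	}
	return done;
}

/* Demo stub: pretend even fds connect successfully. */
static int stub_connect(int fd, void *vsocket)
{
	(void)vsocket;
	return (fd % 2 == 0) ? 0 : -1;
}
```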



[dpdk-dev] [PATCH 4/6] vhost: workaround stale vring base

2016-05-09 Thread Xie, Huawei
On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> However, Michael claims some concerns: he made a good point: a crash
> is happening means some memory is corrupted, and it could be the virtio
> memory being corrupted. In such case, nothing will work without the
> reset.

I don't get this point. What is the scenario?
For a crash of the virtio frontend driver, I remember we discussed this
before; we have no good recipe, only some workarounds. The user-space
frontend driver crashes, and its memory is reallocated to other instances,
but vhost is still writing to that memory. However, this has nothing to do
with vhost reconnect.
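For the non-corruption case, the recovery suggested in this thread amounts to resyncing the consumer index from used->idx in guest memory instead of trusting the base QEMU hands back. A rough sketch under that assumption (names are illustrative, not the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>

struct vq_state {
	uint16_t last_used_idx;
	const volatile uint16_t *used_idx;	/* &vq->used->idx in shared mem */
};

/* On reconnect, prefer the used ring's idx (kept current by the old
 * vhost instance) over a possibly stale base from QEMU.  Returns 1 if
 * the stale base was overridden. */
static int resync_vring_base(struct vq_state *vq, uint16_t qemu_base)
{
	uint16_t cur = *vq->used_idx;

	vq->last_used_idx = cur;
	return qemu_base != cur;
}
```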


[dpdk-dev] [PATCH 4/6] vhost: workaround stale vring base

2016-05-09 Thread Xie, Huawei
On 5/9/2016 6:45 PM, Victor Kaplansky wrote:
>> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
>> > index c88aaa3..df103aa 100644
>> > --- a/lib/librte_vhost/virtio-net.c
>> > +++ b/lib/librte_vhost/virtio-net.c
>> > @@ -560,6 +560,14 @@ vhost_set_vring_addr(int vid, struct vhost_vring_addr
>> > *addr)
>> >return -1;
>> >}
>> >  
>> > +  if (vq->last_used_idx != vq->used->idx) {
>> > +  RTE_LOG(WARNING, VHOST_CONFIG,
>> > +  "last_used_idx (%u) and vq->used->idx (%u) mismatch\n",
> I agree with this approach. I just would add to the log message that 
> last_user_idx
> have overrode by used_idx and some packets may be dropped.

For TX, the packets are resent.
For RX, the packets are dropped.

>
>> > +  vq->last_used_idx, vq->used->idx);
>> > +  vq->last_used_idx = vq->used->idx;
>> > +  vq->last_used_idx_res = vq->used->idx;
>> > +  }
>> > +
>> >vq->log_guest_addr = addr->log_guest_addr;
>> >  
>> >LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address desc: %p\n",
>> > --
>> > 1.9.0
>> > 



[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-09 Thread Xie, Huawei


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, May 09, 2016 1:15 PM
> To: Xie, Huawei
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] virtio: split virtio rx/tx queue
> 
> On Thu, May 05, 2016 at 05:29:27AM +, Xie, Huawei wrote:
> > What I mean is firstly we split the queue, without breaking the common
> > setup; then introduce RX/TX specific setup calling extracted common
> > setup, so we don't have a chance to introduce duplicated code.
> 
> In such way, you have actually introduced duplicated code, haven't you?
> You may argue, "yes, but I will fix it in a later patch."  This is to
> introducing a build error and fixing it later to me.
> 

Yuanhan, I don't see how hard it is to understand this. Duplicated code isn't 
introduced. I will send the patch later.

>   --yliu


[dpdk-dev] virtio still blindly take over virtio device managed by kernel

2016-05-06 Thread Xie, Huawei
On 5/5/2016 12:21 AM, Vincent Li wrote:
> Hi,
>
> I am running the dpdk git repo which already had commit ac5e1d838dc
> (virtio: skip error when probing kernel managed device), but in my
> test, it seems still taking control of the kernel managed virtio
> device and segmentation fault, here is the example:
>
> # ./tools/dpdk_nic_bind.py --status
>
> Network devices using DPDK-compatible driver
> 
> :00:07.0 'Virtio network device' drv=igb_uio unused=
> :00:08.0 'Virtio network device' drv=igb_uio unused=
>
> Network devices using kernel driver
> ===
> :00:03.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio
>
>  #./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4  -- -i
> EAL: Detected 4 lcore(s)
> EAL: Probing VFIO support...
> EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using
> unreliable clock cycles !
> EAL: PCI device :00:03.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: vtpci_init(): trying with legacy virtio pci.
> Segmentation fault (core dumped)

Hi Vincent:
Could you try the patch below?
Weird. I had tested by binding the virtio NIC to different kernel
drivers, including unbinding any driver, when submitting the patches, but I
find that it doesn't work even with that commit.
So dev->devargs could be NULL.

If it works for you, I will submit the fix.

diff --git a/drivers/net/virtio/virtio_pci.c b/drivers/net/virtio/virtio_pci.c
index 45edecc..8aeb44a 100644
--- a/drivers/net/virtio/virtio_pci.c
+++ b/drivers/net/virtio/virtio_pci.c
@@ -650,7 +650,8 @@ vtpci_init(struct rte_pci_device *dev, struct virtio_hw *hw)
 	PMD_INIT_LOG(INFO, "trying with legacy virtio pci.");
 	if (legacy_virtio_resource_init(dev, hw) < 0) {
 		if (dev->kdrv == RTE_KDRV_UNKNOWN &&
-		    dev->devargs->type != RTE_DEVTYPE_WHITELISTED_PCI) {
+		    (!dev->devargs ||
+		     dev->devargs->type != RTE_DEVTYPE_WHITELISTED_PCI)) {
 			PMD_INIT_LOG(INFO,
 				"skip kernel managed virtio device.");
 			return 1;


>
> if I blacklist :00:03.0 from testpmd, testpmd works:
>
> # ./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4  -b :00:03.0 -- -i
> EAL: Detected 4 lcore(s)
> EAL: Probing VFIO support...
> EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using
> unreliable clock cycles !
> EAL: PCI device :00:03.0 on NUMA socket -1
> EAL: PCI device :00:07.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: virtio_read_caps(): no modern virtio pci device found.
> PMD: vtpci_init(): trying with legacy virtio pci.
> EAL: PCI device :00:08.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: virtio_read_caps(): no modern virtio pci device found.
> PMD: vtpci_init(): trying with legacy virtio pci.
> Interactive-mode selected
> Configuring Port 0 (socket 0)
> rte_eth_dev_config_restore: port 0: MAC address array not supported
> Port 0: 52:54:00:EA:6E:3E
> Configuring Port 1 (socket 0)
> rte_eth_dev_config_restore: port 1: MAC address array not supported
> Port 1: 52:54:00:24:06:DB
> Checking link statuses...
> Port 0 Link Up - speed 1 Mbps - full-duplex
> Port 1 Link Up - speed 1 Mbps - full-duplex
> Done
> testpmd>
>



[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-05 Thread Xie, Huawei
On 5/5/2016 11:46 AM, Yuanhan Liu wrote:
> On Thu, May 05, 2016 at 03:29:44AM +0000, Xie, Huawei wrote:
>> On 5/5/2016 11:03 AM, Yuanhan Liu wrote:
>>> On Thu, May 05, 2016 at 01:54:25AM +, Xie, Huawei wrote:
>>>> On 5/5/2016 7:59 AM, Yuanhan Liu wrote:
>>>>> On Wed, May 04, 2016 at 08:50:27AM +0800, Huawei Xie wrote:
>>>>>> -int virtio_dev_queue_setup(struct rte_eth_dev *dev,
>>>>>> -int queue_type,
>>>>>> -uint16_t queue_idx,
>>>>>> +static int
>>>>>> +virtio_dev_cq_queue_setup(struct rte_eth_dev *dev,
>>>>> While it's good to split Rx/Tx specific stuff, but why are you trying to
>>>>> remove a common queue_setup function that does common setups, such as 
>>>>> vring
>>>>> memory allocation.
>>>>>
>>>>> This results to much duplicated code: following diff summary also shows
>>>>> it clearly:
>>>> The motivation to do this is we need separate RX/TX queue setup.
>>> We actually have done that. If you look at current rx/tx/ctrl_queue_setup()
>>> code, we invoked the common function; we also did some queue specific
>>> settings. It has not been done in a very clean way though: there are quite
>>> many "if .. else .." as you stated. And that's what you are going to 
>>> resolve,
>>> but IMO, you went far: you made __same__ code 3 copies, one for rx, tx and
>>> ctrl queue, respectively.
>>>
>>>> The switch/case in the common queue setup looks bad.
>>> Assuming you are talking about the "if .. else .." ...
>>>
>>> While I agree with you on that, introducing so many duplicated code is 
>>> worse.
>>>
>>>> I am aware of the common operations, and i had planned to extract them,
>>>> maybe i could do this in this patchset.
>>> If you meant to do in another patch on top of this patch, then it looks
>>> like the wrong way to go: breaking something first and then fixing it
>>> later does not sound a good practice to me.
>> To your later comment, we could split first, then do the queue setup rework.
> Well, if you insist, I'm Okay. But please don't do it in the way this
> patch does, that introduces quite many duplicated codes.

Yuanhan, I don't insist.

Our targets are: 1) remove the queue-type if/else checking in
virtio_dev_queue_setup; 2) extract the common setup for the vq and call it
from the specific RX/TX/CQ setup functions.
For 2, what is really meaningful to me is the queue size retrieval and the
queue allocation.

What I mean is that first we split the queue, without breaking the common
setup; then we introduce the RX/TX specific setup calling the extracted
common setup, so we never get a chance to introduce duplicated code.


>   --yliu
>



[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-05 Thread Xie, Huawei
On 5/5/2016 11:03 AM, Yuanhan Liu wrote:
> On Thu, May 05, 2016 at 01:54:25AM +0000, Xie, Huawei wrote:
>> On 5/5/2016 7:59 AM, Yuanhan Liu wrote:
>>> On Wed, May 04, 2016 at 08:50:27AM +0800, Huawei Xie wrote:
>>>> -int virtio_dev_queue_setup(struct rte_eth_dev *dev,
>>>> -  int queue_type,
>>>> -  uint16_t queue_idx,
>>>> +static int
>>>> +virtio_dev_cq_queue_setup(struct rte_eth_dev *dev,
>>> While it's good to split Rx/Tx specific stuff, but why are you trying to
>>> remove a common queue_setup function that does common setups, such as vring
>>> memory allocation.
>>>
>>> This results to much duplicated code: following diff summary also shows
>>> it clearly:
>> The motivation to do this is we need separate RX/TX queue setup.
> We actually have done that. If you look at current rx/tx/ctrl_queue_setup()
> code, we invoked the common function; we also did some queue specific
> settings. It has not been done in a very clean way though: there are quite
> many "if .. else .." as you stated. And that's what you are going to resolve,
> but IMO, you went far: you made __same__ code 3 copies, one for rx, tx and
> ctrl queue, respectively.
>
>> The switch/case in the common queue setup looks bad.
> Assuming you are talking about the "if .. else .." ...
>
> While I agree with you on that, introducing so many duplicated code is worse.
>
>> I am aware of the common operations, and i had planned to extract them,
>> maybe i could do this in this patchset.
> If you meant to do in another patch on top of this patch, then it looks
> like the wrong way to go: breaking something first and then fixing it
> later does not sound a good practice to me.

To your later comment, we could split first, then do the queue setup rework.

>
>>> 7 files changed, 655 insertions(+), 422 deletions(-)
>>>
>>> which makes it harder for maintaining.
>>>
>>>> -}
>>>> +  rxvq = (struct virtnet_rx *)RTE_PTR_ADD(vq,
>>>> +  sizeof(*vq) + vq_size * sizeof(struct vq_desc_extra));
>>>> +  rxvq->vq = vq;
>>>> +  vq->sw_ring = sw_ring;
>>> sw_ring is needed for rx queue only, why not moving it to rx queue struct?
>> Actually this is not about sw_ring.
>> I had planned to use sw_ring for both RX/TX and remove the vq_desc_extra.
>> Two issues
>> 1. RX uses both sw_ring and vq_desc_extra
>> 2. ndescs in vq_desc_extra isn't really needed, we could simply
>> calculate this when we walk through the desc chain, and in most cases,
>> it is 1 or 2.
>>
>> As it is not related to this rework, will do this in a separate patch.
> Yes, it's not related to this patch, and this patch does rx/tx split
> only. So, thinking that sw_ring is for rx only, you should move there.
>
> It will not against with your plan; you can make corresponding change
> there. But for this patch, let's do the split only.
>
> BTW, I still would suggest you to build the patch on top of the cleanup
> and memory leak fix patches from Jianfeng. Your patch won't apply on
> top of current dpdk-next-virtio, and one way or another, you need do
> a rebase.
>
> Last, if I were you, I would split this patch in two: one to move
> the queue specific settings to it's queue setup function, another
> to split rx/tx fields. That would make it easier for review.
>
>   --yliu
>



[dpdk-dev] [PATCH] virtio: split virtio rx/tx queue

2016-05-05 Thread Xie, Huawei
On 5/5/2016 7:59 AM, Yuanhan Liu wrote:
> On Wed, May 04, 2016 at 08:50:27AM +0800, Huawei Xie wrote:
>> -int virtio_dev_queue_setup(struct rte_eth_dev *dev,
>> -int queue_type,
>> -uint16_t queue_idx,
>> +static int
>> +virtio_dev_cq_queue_setup(struct rte_eth_dev *dev,
> While it's good to split Rx/Tx specific stuff, but why are you trying to
> remove a common queue_setup function that does common setups, such as vring
> memory allocation.
>
> This results to much duplicated code: following diff summary also shows
> it clearly:

The motivation for this is that we need separate RX/TX queue setup.
The switch/case in the common queue setup looks bad.

I am aware of the common operations, and I had planned to extract them;
maybe I can do that in this patchset.


>
> 7 files changed, 655 insertions(+), 422 deletions(-)
>
> which makes it harder for maintaining.
>
>> -}
>> +rxvq = (struct virtnet_rx *)RTE_PTR_ADD(vq,
>> +sizeof(*vq) + vq_size * sizeof(struct vq_desc_extra));
>> +rxvq->vq = vq;
>> +vq->sw_ring = sw_ring;
> sw_ring is needed for rx queue only, why not moving it to rx queue struct?

Actually this is not about the sw_ring.
I had planned to use the sw_ring for both RX/TX and remove vq_desc_extra.
Two issues:
1. RX uses both sw_ring and vq_desc_extra.
2. ndescs in vq_desc_extra isn't really needed; we could simply
calculate it when we walk through the desc chain, and in most cases
it is 1 or 2.

As this is not related to this rework, I will do it in a separate patch.

>
>>  static void
>> -virtio_update_packet_stats(struct virtqueue *vq, struct rte_mbuf *mbuf)
>> +virtio_update_rxq_stats(struct virtnet_rx *rxvq, struct rte_mbuf *mbuf)
>>  {
>>  uint32_t s = mbuf->pkt_len;
>>  struct ether_addr *ea;
>>  
>>  if (s == 64) {
>> -vq->size_bins[1]++;
>> +rxvq->stats.size_bins[1]++;
>>  } else if (s > 64 && s < 1024) {
>>  uint32_t bin;
>>  
>>  /* count zeros, and offset into correct bin */
>>  bin = (sizeof(s) * 8) - __builtin_clz(s) - 5;
>> -vq->size_bins[bin]++;
>> +rxvq->stats.size_bins[bin]++;
>>  } else {
>>  if (s < 64)
>> -vq->size_bins[0]++;
>> +rxvq->stats.size_bins[0]++;
>>  else if (s < 1519)
>> -vq->size_bins[6]++;
>> +rxvq->stats.size_bins[6]++;
>>  else if (s >= 1519)
>> -vq->size_bins[7]++;
>> +rxvq->stats.size_bins[7]++;
>>  }
>>  
>>  ea = rte_pktmbuf_mtod(mbuf, struct ether_addr *);
>>  if (is_multicast_ether_addr(ea)) {
>>  if (is_broadcast_ether_addr(ea))
>> -vq->broadcast++;
>> +rxvq->stats.broadcast++;
>>  else
>> -vq->multicast++;
>> +rxvq->stats.multicast++;
>> +}
>> +}
>> +
>> +static void
>> +virtio_update_txq_stats(struct virtnet_tx *txvq, struct rte_mbuf *mbuf)
> Why not taking "struct virtnet_stats *stats" as the arg, so that we
> don't have to implment two exactly same functions.

ok to me.

>
>> diff --git a/drivers/net/virtio/virtio_rxtx.h 
>> b/drivers/net/virtio/virtio_rxtx.h
>> index a76c3e5..ced55a3 100644
>> --- a/drivers/net/virtio/virtio_rxtx.h
>> +++ b/drivers/net/virtio/virtio_rxtx.h
>> @@ -34,7 +34,59 @@
>>  #define RTE_PMD_VIRTIO_RX_MAX_BURST 64
>>  
>>  #ifdef RTE_MACHINE_CPUFLAG_SSSE3
>> -int virtio_rxq_vec_setup(struct virtqueue *rxq);
>> +
>> +struct virtnet_stats {
> Another remind again: you should put following codes before the
> "#ifdef".
>
>   --yliu
>



[dpdk-dev] virtio still blindly take over virtio device managed by kernel

2016-05-04 Thread Xie, Huawei


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vincent Li
> Sent: Thursday, May 05, 2016 12:21 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] virtio still blindly take over virtio device managed
> by kernel
> 
> Hi,
> 
> I am running the dpdk git repo which already had commit ac5e1d838dc
> (virtio: skip error when probing kernel managed device), but in my
> test, it seems still taking control of the kernel managed virtio
> device and segmentation fault, here is the example:
> 
> # ./tools/dpdk_nic_bind.py --status
> 
> Network devices using DPDK-compatible driver
> 
> :00:07.0 'Virtio network device' drv=igb_uio unused=
> :00:08.0 'Virtio network device' drv=igb_uio unused=
> 
> Network devices using kernel driver
> ===
> :00:03.0 'Virtio network device' if= drv=virtio-pci unused=igb_uio
> 
>  #./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4  -- -i
> EAL: Detected 4 lcore(s)
> EAL: Probing VFIO support...
> EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using
> unreliable clock cycles !
> EAL: PCI device :00:03.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: vtpci_init(): trying with legacy virtio pci.
> Segmentation fault (core dumped)
> 
> if I blacklist :00:03.0 from testpmd, testpmd works:
> 
> # ./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4  -b :00:03.0 -
> - -i
> EAL: Detected 4 lcore(s)
> EAL: Probing VFIO support...
> EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using
> unreliable clock cycles !
> EAL: PCI device :00:03.0 on NUMA socket -1
> EAL: PCI device :00:07.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: virtio_read_caps(): no modern virtio pci device found.
> PMD: vtpci_init(): trying with legacy virtio pci.
> EAL: PCI device :00:08.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> PMD: virtio_read_caps(): no modern virtio pci device found.
> PMD: vtpci_init(): trying with legacy virtio pci.
> Interactive-mode selected
> Configuring Port 0 (socket 0)
> rte_eth_dev_config_restore: port 0: MAC address array not supported
> Port 0: 52:54:00:EA:6E:3E
> Configuring Port 1 (socket 0)
> rte_eth_dev_config_restore: port 1: MAC address array not supported
> Port 1: 52:54:00:24:06:DB
> Checking link statuses...
> Port 0 Link Up - speed 1 Mbps - full-duplex
> Port 1 Link Up - speed 1 Mbps - full-duplex
> Done
> testpmd>

Hi Vincent:
Thanks for the info. I will check tomorrow to see what breaks this.


[dpdk-dev] [PATCH v3 0/3] xen: netfront poll mode driver

2016-05-03 Thread Xie, Huawei
On 4/20/2016 10:19 PM, Bruce Richardson wrote:
> On Tue, Mar 22, 2016 at 10:55:26AM +0100, Jan Blunck wrote:
>> v3 changes:
>> - removed fake PCI interface
>> - removed struct virt_eth_driver
>> - check for UIO name and version
>> - added basic documentation
>>
>> Jan Blunck (3):
>>   xen: Add UIO kernel driver
>>   xen: Add netfront poll mode driver
>>   xen: Add documentation
>>
> Hi Jan,
>
> any update on this series?
>
> /Bruce
>

Jan and Bruce:
I will find time to review this series starting this week. It takes time.
Please stay tuned.

/huawei


[dpdk-dev] [PATCH] virtio: avoid avail ring entry index update if equal

2016-04-28 Thread Xie, Huawei
On 4/28/2016 4:14 PM, Thomas Monjalon wrote:
> 2016-04-27 23:19, Yuanhan Liu:
>> And applied to dpdk-next-virtio, with few tiny changes:
>>
>> - we normally put suggested/reported-by above the SoB.
> Yes we must keep the chronological order in these lines.
>

Thanks for reminder


[dpdk-dev] [RFC PATCH] avail idx update optimizations

2016-04-28 Thread Xie, Huawei
On 4/24/2016 9:24 PM, Xie, Huawei wrote:
> On 4/24/2016 5:15 PM, Michael S. Tsirkin wrote:
>> On Sun, Apr 24, 2016 at 02:45:22AM +0000, Xie, Huawei wrote:
>>> Forget to cc the mailing list.
>>>
>>> On 4/22/2016 9:53 PM, Xie, Huawei wrote:
>>>> Hi:
>>>>
>>>> This is a series of virtio/vhost idx/ring update optimizations for cache
>>>> to cache transfer. Actually I don't expect many of them as virtio/vhost
>>>> has already done quite right.
>> Hmm - is it a series or a single patch?
> Others in my mind is caching a copy of avail index in vhost. Use the
> cached copy if there are enough slots and then sync with the index in
> the ring.
> Haven't evaluated that idea.

Tried a cached vhost avail index, which gives slightly better performance,
but I will hold this patch, as I guess we couldn't blindly set the cached
avail idx to 0, which might cause issues in live migration.
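The cached-index idea mentioned above can be sketched as follows: vhost keeps a private copy of avail->idx and only re-reads the shared (cache-bouncing) location when the copy shows nothing pending. Names are illustrative, not the held patch:

```c
#include <assert.h>
#include <stdint.h>

struct avail_cache {
	uint16_t cached_avail_idx;	/* last value read from the ring */
	uint16_t last_avail_idx;	/* next entry we will consume */
};

/* Number of avail entries ready to consume.  The shared ring index is
 * touched only when the cached copy is exhausted, saving cache-to-cache
 * transfers while a backlog lasts. */
static uint16_t avail_entries(struct avail_cache *c,
			      const volatile uint16_t *ring_avail_idx)
{
	if (c->cached_avail_idx == c->last_avail_idx)
		c->cached_avail_idx = *ring_avail_idx;
	return (uint16_t)(c->cached_avail_idx - c->last_avail_idx);
}
```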

>
>>>> For this patch, in a VM2VM test, i observed ~6% performance increase.
>> Interesting. In that case, it seems likely that new ring layout
>> would give you an even bigger performance gain.
>> Could you take a look at tools/virtio/ringtest/ring.c
>> in latest Linux and tell me what do you think?
>> In particular, I know you looked at using vectored instructions
>> to do ring updates - would the layout in tools/virtio/ringtest/ring.c
>> interfere with that?
> Thanks. Would check. You know i have ever tried fixing avail ring in the
> virtio driver. In purely vhost->virtio test, it could have 2~3 times
> performance boost, but it isn't that obvious if with the physical nic
> involved, investigating that issue.
> I had planned to sync with you whether it is generic enough that we
> could have a negotiated feature, either for in order descriptor
> processing or like fixed avail ring. Would check your new ring layout.
>
>
>>>> In VM1, run testpmd with txonly mode
>>>> In VM2, run testpmd with rxonly mode
>>>> In host, run testpmd(with two vhostpmds) with io forward
>>>>
>>>> Michael:
>>>> We have talked about this method when i tried the fixed ring.
>>>>
>>>>
>>>> On 4/22/2016 5:12 PM, Xie, Huawei wrote:
>>>>> eliminate unnecessary cache to cache transfer between virtio and vhost
>>>>> core
>> Yes I remember proposing this, but you probably should include the
>> explanation about why this works in he commit log:
>>
>> - pre-format avail ring with expected descriptor index values
>> - as long as entries are consumed in-order, there's no
>>   need to modify the avail ring
>> - as long as avail ring is not modified, it can be
>>   valid in caches of both consumer and producer
> Yes, would add the explanation in the formal patch.
>
>
>>>>> ---
>>>>>  drivers/net/virtio/virtqueue.h | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio/virtqueue.h 
>>>>> b/drivers/net/virtio/virtqueue.h
>>>>> index 4e9239e..8c46a83 100644
>>>>> --- a/drivers/net/virtio/virtqueue.h
>>>>> +++ b/drivers/net/virtio/virtqueue.h
>>>>> @@ -302,7 +302,8 @@ vq_update_avail_ring(struct virtqueue *vq, uint16_t 
>>>>> desc_idx)
>>>>>* descriptor.
>>>>>*/
>>>>>   avail_idx = (uint16_t)(vq->vq_avail_idx & (vq->vq_nentries - 1));
>>>>> - vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>>>>> + if (unlikely(vq->vq_ring.avail->ring[avail_idx] != desc_idx))
>>>>> + vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>>>>>   vq->vq_avail_idx++;
>>>>>  }
>>>>>  
>



[dpdk-dev] [PATCH v2] virtio: fix segfault when transmit pkts

2016-04-25 Thread Xie, Huawei
On 4/25/2016 10:37 AM, Tan, Jianfeng wrote:
> Issue: when using virtio nic to transmit pkts, it causes segment fault.
>
> How to reproduce:
> Basically, we need to construct a case with vm send packets to vhost-user,
> and this issue does not happen when transmitting packets using indirect
> desc. Besides, make sure all descriptors are exhausted before vhost
> dequeues any packets.
>
> a. start testpmd with vhost.
>   $ testpmd -c 0x3 -n 4 --socket-mem 1024,0 --no-pci \
> --vdev 'eth_vhost0,iface=/tmp/sock0,queues=1' -- -i --nb-cores=1
>
> b. start a qemu with a virtio nic connected with the vhost-user port, just
> make sure mrg_rxbuf is enabled.
>
> c. enable testpmd on the host.
>   testpmd> set fwd io
>   testpmd> start (better without start vhost-user)
>
> d. start testpmd in VM.
>   $testpmd -c 0x3 -n 4 -m 1024 -- -i --disable-hw-vlan-filter --txqflags=0xf01
>   testpmd> set fwd txonly
>   testpmd> start
>
> How to fix: this bug is because inside virtqueue_enqueue_xmit(), the flag of
> desc has been updated inside the do {} while (), not necessary to update after
> the loop. (And if we do that after the loop, if all descs could have run out,
> idx is VQ_RING_DESC_CHAIN_END (32768), use this idx to reference the start_dp
> array will lead to segment fault.)
>
> Fixes: dd856dfcb9e ("virtio: use any layout on Tx")
>
> Signed-off-by: Jianfeng Tan 
> ---
>  v2: refine the commit message.
>
>  drivers/net/virtio/virtio_rxtx.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/net/virtio/virtio_rxtx.c 
> b/drivers/net/virtio/virtio_rxtx.c
> index ef21d8e..432aeab 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -271,8 +271,6 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct 
> rte_mbuf *cookie,
>   idx = start_dp[idx].next;
>   } while ((cookie = cookie->next) != NULL);
>  
> - start_dp[idx].flags &= ~VRING_DESC_F_NEXT;
> -
>   if (use_indirect)
>   idx = txvq->vq_ring.desc[head_idx].next;
>  

Ack the code.

Acked-by: Huawei Xie 


[dpdk-dev] [RFC PATCH] avail idx update optimizations

2016-04-24 Thread Xie, Huawei
On 4/24/2016 5:15 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 24, 2016 at 02:45:22AM +0000, Xie, Huawei wrote:
>> Forget to cc the mailing list.
>>
>> On 4/22/2016 9:53 PM, Xie, Huawei wrote:
>>> Hi:
>>>
>>> This is a series of virtio/vhost idx/ring update optimizations for cache
>>> to cache transfer. Actually I don't expect many of them as virtio/vhost
>>> has already done quite right.
> Hmm - is it a series or a single patch?

Another idea in my mind is caching a copy of the avail index in vhost:
use the cached copy if there are enough slots, and then sync with the
index in the ring.
I haven't evaluated that idea yet.

>>> For this patch, in a VM2VM test, i observed ~6% performance increase.
> Interesting. In that case, it seems likely that new ring layout
> would give you an even bigger performance gain.
> Could you take a look at tools/virtio/ringtest/ring.c
> in latest Linux and tell me what do you think?
> In particular, I know you looked at using vectored instructions
> to do ring updates - would the layout in tools/virtio/ringtest/ring.c
> interfere with that?

Thanks. Will check. You know, I previously tried a fixed avail ring in the
virtio driver. In a pure vhost->virtio test, it could give a 2~3x
performance boost, but the gain isn't as obvious with a physical NIC
involved; I am investigating that issue.
I had planned to sync with you on whether it is generic enough that we
could have a negotiated feature, either for in-order descriptor
processing or for something like a fixed avail ring. Will check your new ring layout.


>
>>> In VM1, run testpmd with txonly mode
>>> In VM2, run testpmd with rxonly mode
>>> In host, run testpmd(with two vhostpmds) with io forward
>>>
>>> Michael:
>>> We have talked about this method when i tried the fixed ring.
>>>
>>>
>>> On 4/22/2016 5:12 PM, Xie, Huawei wrote:
>>>> eliminate unnecessary cache to cache transfer between virtio and vhost
>>>> core
> Yes I remember proposing this, but you probably should include the
> explanation about why this works in he commit log:
>
> - pre-format avail ring with expected descriptor index values
> - as long as entries are consumed in-order, there's no
>   need to modify the avail ring
> - as long as avail ring is not modified, it can be
>   valid in caches of both consumer and producer

Yes, would add the explanation in the formal patch.


>
>>>> ---
>>>>  drivers/net/virtio/virtqueue.h | 3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio/virtqueue.h 
>>>> b/drivers/net/virtio/virtqueue.h
>>>> index 4e9239e..8c46a83 100644
>>>> --- a/drivers/net/virtio/virtqueue.h
>>>> +++ b/drivers/net/virtio/virtqueue.h
>>>> @@ -302,7 +302,8 @@ vq_update_avail_ring(struct virtqueue *vq, uint16_t 
>>>> desc_idx)
>>>> * descriptor.
>>>> */
>>>>avail_idx = (uint16_t)(vq->vq_avail_idx & (vq->vq_nentries - 1));
>>>> -  vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>>>> +  if (unlikely(vq->vq_ring.avail->ring[avail_idx] != desc_idx))
>>>> +  vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>>>>vq->vq_avail_idx++;
>>>>  }
>>>>  



[dpdk-dev] [RFC PATCH] avail idx update optimizations

2016-04-24 Thread Xie, Huawei
Forgot to cc the mailing list.

On 4/22/2016 9:53 PM, Xie, Huawei wrote:
> Hi:
>
> This is a series of virtio/vhost idx/ring update optimizations for cache
> to cache transfer. Actually I don't expect many of them as virtio/vhost
> has already done quite right.
>
> For this patch, in a VM2VM test, i observed ~6% performance increase.
> In VM1, run testpmd with txonly mode
> In VM2, run testpmd with rxonly mode
> In host, run testpmd(with two vhostpmds) with io forward
>
> Michael:
> We have talked about this method when i tried the fixed ring.
>
>
> On 4/22/2016 5:12 PM, Xie, Huawei wrote:
>> eliminate unnecessary cache to cache transfer between virtio and vhost
>> core
>>
>> ---
>>  drivers/net/virtio/virtqueue.h | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
>> index 4e9239e..8c46a83 100644
>> --- a/drivers/net/virtio/virtqueue.h
>> +++ b/drivers/net/virtio/virtqueue.h
>> @@ -302,7 +302,8 @@ vq_update_avail_ring(struct virtqueue *vq, uint16_t 
>> desc_idx)
>>   * descriptor.
>>   */
>>  avail_idx = (uint16_t)(vq->vq_avail_idx & (vq->vq_nentries - 1));
>> -vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>> +if (unlikely(vq->vq_ring.avail->ring[avail_idx] != desc_idx))
>> +vq->vq_ring.avail->ring[avail_idx] = desc_idx;
>>  vq->vq_avail_idx++;
>>  }
>>  
>



[dpdk-dev] [PATCH] virtio: fix segfault when transmit pkts

2016-04-22 Thread Xie, Huawei
On 4/22/2016 6:43 AM, Yuanhan Liu wrote:
> On Thu, Apr 21, 2016 at 12:36:10PM +, Jianfeng Tan wrote:
>> Issue: when using virtio nic to transmit pkts, it causes segment fault.
> Jianfeng,
>
> It's good to describe the issues, steps to reproduce it and how to fix
> it. But you don't have to tell it like filling a form. Instead, you
> could try to describe in plain English.
>
>> How to reproduce:
>> a. start testpmd with vhost.
>> $testpmd -c 0x3 -n 4 --socket-mem 1024,0 --no-pci \
>>   --vdev 'eth_vhost0,iface=/tmp/sock0,queues=1' -- -i --nb-cores=1
> I personally would suggest to add some indentations (and some whitespace
> lines), like
>
> a. start testpmd with vhost.
>$ testpmd -c 0x3 -n 4 ... \
>  --vdev '...' ...
>
> b. something else ...
>some more explanations.
>
> And, do not quote a command line like "$testpmd ...", it's like a var
> in base in this way. Instead, add a space after "$ ", like what I did
> above.
>
>> b. start a qemu with a virtio nic connected with the vhost-user port.
> This should be enough. I didn't find any special options below,
> therefore, above words is enough. However, if it's some specific
> option introduces a bug, you could claim it aloud then.
>
>> $qemu -smp cores=2,sockets=1 -cpu host -enable-kvm vm-0.img -vnc :5 -m 4G \
>>   -object memory-backend-file,id=mem,size=4096M,mem-path=,share=on \
>>   -numa node,memdev=mem -mem-prealloc \
>>   -chardev socket,id=char1,path=$sock_vhost \
>>   -netdev type=vhost-user,id=net1,chardev=char1 \
>>   -device virtio-net-pci,netdev=net1,mac=00:01:02:03:04:05
>> c. enable testpmd on the host.
>> testpmd> set fwd io
>> testpmd> start
>> d. start testpmd in VM.
>> $testpmd -c 0x3 -n 4 -m 1024 -- -i --disable-hw-vlan-filter --txqflags=0xf01

With txqflags=0xf01, the virtio PMD will use virtio_xmit_pkts_simple, in
which case enqueue_xmit won't be called; so is this a typo?

>> testpmd> set fwd txonly
>> testpmd> start
>>
>> How to fix: this bug is because inside virtqueue_enqueue_xmit(), the flag of
>> desc has been updated inside the do {} while (); and after the loop, all 
>> descs
>> could have run out, so idx is VQ_RING_DESC_CHAIN_END (32768), use this idx to
>> reference the start_dp array will lead to segment fault.
> You missed a fix line here.

Yes, please include the commit that introduces this bug.

>
>> Signed-off-by: Jianfeng Tan 
>> ---
>>  drivers/net/virtio/virtio_rxtx.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio/virtio_rxtx.c 
>> b/drivers/net/virtio/virtio_rxtx.c
>> index ef21d8e..432aeab 100644
>> --- a/drivers/net/virtio/virtio_rxtx.c
>> +++ b/drivers/net/virtio/virtio_rxtx.c
>> @@ -271,8 +271,6 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct 
>> rte_mbuf *cookie,
>>  idx = start_dp[idx].next;
>>  } while ((cookie = cookie->next) != NULL);
>>  
>> -start_dp[idx].flags &= ~VRING_DESC_F_NEXT;
> This looks a good fix to me. I'm just wondering why we never met it
> before, and on which specific case do we meet it? Talking about that,
> seems like "set fwd txonly" is with high suspicious.

You will not meet this if you have more free descriptors than needed, so
it depends on the relative speed of virtio xmit and vhost dequeue.
Also, if the indirect feature is negotiated, it won't trigger.
However, without indirect, it seems that as long as we start testpmd with
txonly in the guest `first`, we can definitely trigger this. Jianfeng, could
you confirm this and emphasize the order in the commit message?

>
>   --yliu
>



[dpdk-dev] DPDK 2.2 build failing with vhost-kni

2016-04-11 Thread Xie, Huawei
On 4/10/2016 7:26 AM, chintu hetam wrote:
> I am compiling DPDK 2.2 on Fedora 23 and i am seeing following build error
> /home/vcr/devel/dpdk/dpdk-2.2.0/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:
> In function ?kni_sock_poll?:
> /home/vcr/devel/dpdk/dpdk-2.2.0/build/build/lib/librte_eal/linuxapp/kni/kni_vhost.c:254:25:
> error: ?SOCK_ASYNC_NOSPACE? undeclared (first use in this function)
>   (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock->flags) &&
>  ^

Hi, besides this issue: user-space vhost is now the preferred way to
switch packets with the virtual machine, and we have plans to remove kni
vhost support. Do you have any reason to use kni vhost?


[dpdk-dev] [PATCH] vhost: fix mem share between VM and host

2016-04-11 Thread Xie, Huawei
On 4/11/2016 1:29 AM, Zhe Tao wrote:
>  
> +/* Check the share memory in case the QEMU doesn't set the share option
> + * as on for the memory-backend-file object in the QEMU command line.
> + */
> +
> +int
> +vhost_check_mem_shared(struct vhost_device_ctx ctx)
> +{
> + struct virtio_net *dev;
> + struct vhost_virtqueue *vq;
> + int ret = 0;
> +
> + dev = get_device(ctx);
> + if ((dev == NULL) || (dev->mem == NULL))
> + return -1;
> +
> + /* check first virtqueue 0 rxq. */
> + vq = dev->virtqueue[VIRTIO_RXQ];
> + ret = vq->desc[0].next == 0 ? -1 : 0;
> +
> + if (ret)
> + RTE_LOG(ERR, VHOST_CONFIG,
> + "The mem is not shared between VM and host\n");
> +
> + return ret;
> +}
> +

Zhe: it is known to vhost-user users that share=on is a must-have
option, so I don't think this is an issue.
It would be OK to do some early check in vhost; however, we can't
assume the value of vq->desc[0].next.
Please check whether there is another simple and reliable way.



[dpdk-dev] [RFC] vhost user: add error handling for fd > 1023

2016-04-08 Thread Xie, Huawei
On 4/7/2016 10:52 PM, Christian Ehrhardt wrote:
> I totally agree to that there is no deterministic rule what to expect.
> The only rule is that #fd certainly always is > #vhost_user devices.
> In various setup variants I've crossed fd 1024 anywhere between 475
> and 970 vhost_user ports.
>
> Once the discussion continues and we have an updates version of the
> patch with some more agreement I hope I can help to test it.

Thanks. Let us first fix this problem temporarily for robustness, then
consider whether to upgrade to (e)poll.
Will check the patch in detail later. Basically it should work, but we need
to check whether extra fixes are needed elsewhere.


[dpdk-dev] XEN netfront PMD support

2016-04-07 Thread Xie, Huawei
On 4/7/2016 3:52 PM, Thomas Monjalon wrote:
> 2016-04-07 06:20, Xie, Huawei:
>> Hi Stephen:
>> I recall that you ever send a netfront PMD patch. I think that is one
>> missing piece for running PMD in XEN domU. There are still users
>> requiring that feature. Could you resend that patch? I could help review.
> It has been resent by Jan Blunck:
>   http://thread.gmane.org/gmane.comp.networking.dpdk.devel/36170
>

Thanks!


[dpdk-dev] XEN netfront PMD support

2016-04-07 Thread Xie, Huawei
Hi Stephen:
I recall that you once sent a netfront PMD patch. I think that is one
missing piece for running a PMD in a XEN domU. There are still users
requiring that feature. Could you resend that patch? I could help review.



[dpdk-dev] [RFC] vhost user: add error handling for fd > 1023

2016-03-30 Thread Xie, Huawei
On 3/18/2016 5:15 PM, Patrik Andersson wrote:
> Protect against DPDK crash when allocation of listen fd >= 1023.
> For events on fd:s >1023, the current implementation will trigger
> an abort due to access outside of allocated bit mask.
>
> Corrections would include:
>
>   * Match fdset_add() signature in fd_man.c to fd_man.h
>   * Handling of return codes from fdset_add()
>   * Addition of check of fd number in fdset_add_fd()
>
> ---
>
> The rationale behind the suggested code change is that,
> fdset_event_dispatch() could attempt access outside of the FD_SET
> bitmask if there is an event on a file descriptor that in turn
> looks up a virtio file descriptor with a value > 1023.
> Such an attempt will lead to an abort() and a restart of any
> vswitch using DPDK.
>
> A discussion topic exist in the ovs-discuss mailing list that can
> provide a little more background:
>
> http://openvswitch.org/pipermail/discuss/2016-February/020243.html

Thanks for catching this. Could you give more details on how the fds
accumulate?
The buggy code is based on the assumption that FD_SETSIZE limits the number of
file descriptors, which might be true on Windows. However, the
manual clearly says that it limits the value of file descriptors.
The use of select is for portability; I have been wondering if that is
truly so important. Using poll could simplify the code a bit: for
example, with select we need to add a timeout so that another thread can
insert/remove an fd into/from the monitored list.

Any comments on using poll/epoll?

>
> Signed-off-by: Patrik Andersson 
> ---
>  lib/librte_vhost/vhost_user/fd_man.c | 11 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c | 22 --
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_user/fd_man.c 
> b/lib/librte_vhost/vhost_user/fd_man.c
> index 087aaed..c691339 100644
> --- a/lib/librte_vhost/vhost_user/fd_man.c
> +++ b/lib/librte_vhost/vhost_user/fd_man.c
> @@ -71,20 +71,22 @@ fdset_find_free_slot(struct fdset *pfdset)
>   return fdset_find_fd(pfdset, -1);
>  }
>  
> -static void
> +static int
>  fdset_add_fd(struct fdset  *pfdset, int idx, int fd,
>   fd_cb rcb, fd_cb wcb, void *dat)
>  {
>   struct fdentry *pfdentry;
>  
> - if (pfdset == NULL || idx >= MAX_FDS)

It seems we had better change the definitions of MAX_FDS and
MAX_VHOST_SERVER to FD_SETSIZE, or add a build-time check.

> - return;
> + if (pfdset == NULL || idx >= MAX_FDS || fd >= FD_SETSIZE)
> + return -1;
>  
>   pfdentry = &pfdset->fd[idx];
>   pfdentry->fd = fd;
>   pfdentry->rcb = rcb;
>   pfdentry->wcb = wcb;
>   pfdentry->dat = dat;
> +
> + return 0;
>  }
>  
>  /**
> @@ -150,12 +152,11 @@ fdset_add(struct fdset *pfdset, int fd, fd_cb rcb, 
> fd_cb wcb, void *dat)
>  
>   /* Find a free slot in the list. */
>   i = fdset_find_free_slot(pfdset);
> - if (i == -1) {
> + if (i == -1 || fdset_add_fd(pfdset, i, fd, rcb, wcb, dat) < 0) {
>   pthread_mutex_unlock(&pfdset->fd_mutex);
>   return -2;
>   }
>  
> - fdset_add_fd(pfdset, i, fd, rcb, wcb, dat);
>   pfdset->num++;
>  
>   pthread_mutex_unlock(&pfdset->fd_mutex);
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c 
> b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index df2bd64..778851d 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -288,6 +288,7 @@ vserver_new_vq_conn(int fd, void *dat, __rte_unused int 
> *remove)
>   int fh;
>   struct vhost_device_ctx vdev_ctx = { (pid_t)0, 0 };
>   unsigned int size;
> + int ret;
>  
>   conn_fd = accept(fd, NULL, NULL);
>   RTE_LOG(INFO, VHOST_CONFIG,
> @@ -317,8 +318,15 @@ vserver_new_vq_conn(int fd, void *dat, __rte_unused int 
> *remove)
>  
>   ctx->vserver = vserver;
>   ctx->fh = fh;
> - fdset_add(&g_vhost_server.fdset,
> + ret = fdset_add(&g_vhost_server.fdset,
>   conn_fd, vserver_message_handler, NULL, ctx);
> + if (ret < 0) {
> + free(ctx);
> + close(conn_fd);
> + RTE_LOG(ERR, VHOST_CONFIG,
> + "failed to add fd %d into vhost server fdset\n",
> + conn_fd);
> + }
>  }
>  
>  /* callback when there is message on the connfd */
> @@ -453,6 +461,7 @@ int
>  rte_vhost_driver_register(const char *path)
>  {
>   struct vhost_server *vserver;
> + int ret;
>  
>   pthread_mutex_lock(&g_vhost_server.server_mutex);
>  
> @@ -478,8 +487,17 @@ rte_vhost_driver_register(const char *path)
>  
>   vserver->path = strdup(path);
>  
> - fdset_add(&g_vhost_server.fdset, vserver->listenfd,
> + ret = fdset_add(&g_vhost_server.fdset, vserver->listenfd,
>   vserver_new_vq_conn, NULL, vserver);
> + if (ret < 0) {
> + pthread_mutex_unlock(&g_vhost_server.server_mutex);
> +

[dpdk-dev] [PATCH v4] vhost: use SMP barriers instead of compiler ones.

2016-03-23 Thread Xie, Huawei
On 3/18/2016 8:24 PM, Ilya Maximets wrote:
> Since commit 4c02e453cc62 ("eal: introduce SMP memory barriers") virtio
> uses architecture dependent SMP barriers. vHost should use them too.
>
> Fixes: 4c02e453cc62 ("eal: introduce SMP memory barriers")
>
> Signed-off-by: Ilya Maximets 

Acked-by: Huawei Xie 

It fixes the issue for the other archs. Will recheck in the future.


[dpdk-dev] [PATCH v2] ring: check for zero objects mc dequeue / mp enqueue

2016-03-22 Thread Xie, Huawei
On 3/22/2016 6:13 PM, Richardson, Bruce wrote:
> On Mon, Mar 21, 2016 at 05:47:44PM +0000, Xie, Huawei wrote:
>> On 3/18/2016 10:17 PM, Bruce Richardson wrote:
>>> On Fri, Mar 18, 2016 at 01:47:29PM +0100, Mauricio V?squez wrote:
>>>> Hi,
>>>>
>>>>
>>>> On Fri, Mar 18, 2016 at 11:35 AM, Thomas Monjalon >>> 6wind.com
>>>>> wrote:
>>>>> 2016-03-18 11:27, Olivier Matz:
>>>>>> On 03/18/2016 11:18 AM, Bruce Richardson wrote:
>>>>>>>>> +   /* Avoid the unnecessary cmpset operation below, which is
>>>>> also
>>>>>>>>> +* potentially harmful when n equals 0. */
>>>>>>>>> +   if (n == 0)
>>>>>>>>>
>>>>>>>> What about using unlikely here?
>>>>>>>>
>>>>>>> Unless there is a measurable performance increase by adding in
>>>>> likely/unlikely
>>>>>>> I'd suggest avoiding it's use. In general, likely/unlikely should only
>>>>> be used
>>>>>>> for things like catestrophic errors because the penalty for taking the
>>>>> unlikely
>>>>>>> leg of the code can be quite severe. For normal stuff, where the code
>>>>> nearly
>>>>>>> always goes one way in the branch but occasionally goes the other, the
>>>>> hardware
>>>>>>> branch predictors generally do a good enough job.
>>>>>> Do you mean using likely/unlikely could be worst than not using it
>>>>>> in this case?
>>>>>>
>>>>>> To me, using unlikely here is not a bad idea: it shows to the compiler
>>>>>> and to the reader of the code that is case is not the usual case.
>>>>> It would be nice to have a guideline section about likely/unlikely in
>>>>> doc/guides/contributing/design.rst
>>>>>
>>>>> Bruce gave a talk at Dublin about this kind of things.
>>>>> I'm sure he could contribute more design guidelines ;)
>>>>>
>>>> There is a small explanation in the section "Branch Prediction" of
>>>> doc/guides/contributing/coding_style.rst, but I do not know if that is
>>>> enough to understand when to use them.
>>>>
>>>> I've made a fast check and there are many PMDs that use them to check if
>>>> number of packets is zero in the transmission function.
>>> Yeah, and I wonder how many of those are actually necessary too :-)
>>>
>>> It's not a big deal either way, I just think the patch is fine as-is without
>>> the extra macros.
>> IMO we use likely/unlikely in two cases, catastrophic errors and the
>> code nearly always goes one way, i.e, preferred/favored fast path.
>> Likely/unlikely helps not only for branch predication but also for cache
> For branch prediction, anything after the first time through the code path
> the prediction will be based on what happened before rather than any static
> hints in the code.

Yes, maybe I didn't make myself clear; my main concern isn't about
branch prediction...

>> usage. The code generated for the likely path will directly follow the
>> branch instruction. To me, it is reasonable enough to add unlikely for n
>> == 0, which we don't expect to happen.
>> I remember with/without likely, compiler could generate three kind of
>> instructions. Didn't deep dive into it.
>>
>>> /Bruce
>>>



[dpdk-dev] [PATCH v2] ring: check for zero objects mc dequeue / mp enqueue

2016-03-21 Thread Xie, Huawei
On 3/18/2016 10:17 PM, Bruce Richardson wrote:
> On Fri, Mar 18, 2016 at 01:47:29PM +0100, Mauricio V?squez wrote:
>> Hi,
>>
>>
>> On Fri, Mar 18, 2016 at 11:35 AM, Thomas Monjalon > 6wind.com
>>> wrote:
>>> 2016-03-18 11:27, Olivier Matz:
 On 03/18/2016 11:18 AM, Bruce Richardson wrote:
>>> +   /* Avoid the unnecessary cmpset operation below, which is
>>> also
>>> +* potentially harmful when n equals 0. */
>>> +   if (n == 0)
>>>
>> What about using unlikely here?
>>
> Unless there is a measurable performance increase by adding in
>>> likely/unlikely
> I'd suggest avoiding it's use. In general, likely/unlikely should only
>>> be used
> for things like catestrophic errors because the penalty for taking the
>>> unlikely
> leg of the code can be quite severe. For normal stuff, where the code
>>> nearly
> always goes one way in the branch but occasionally goes the other, the
>>> hardware
> branch predictors generally do a good enough job.
 Do you mean using likely/unlikely could be worst than not using it
 in this case?

 To me, using unlikely here is not a bad idea: it shows to the compiler
 and to the reader of the code that is case is not the usual case.
>>> It would be nice to have a guideline section about likely/unlikely in
>>> doc/guides/contributing/design.rst
>>>
>>> Bruce gave a talk at Dublin about this kind of things.
>>> I'm sure he could contribute more design guidelines ;)
>>>
>> There is a small explanation in the section "Branch Prediction" of
>> doc/guides/contributing/coding_style.rst, but I do not know if that is
>> enough to understand when to use them.
>>
>> I've made a fast check and there are many PMDs that use them to check if
>> number of packets is zero in the transmission function.
> Yeah, and I wonder how many of those are actually necessary too :-)
>
> It's not a big deal either way, I just think the patch is fine as-is without
> the extra macros.

IMO we use likely/unlikely in two cases: catastrophic errors, and code
that nearly always goes one way, i.e., a preferred/favored fast path.
Likely/unlikely helps not only with branch prediction but also with cache
usage: the code generated for the likely path directly follows the
branch instruction. To me, it is reasonable enough to add unlikely for n
== 0, which we don't expect to happen.
I remember that with/without likely, the compiler could generate three
kinds of instruction sequences. I didn't deep-dive into it.

>
> /Bruce
>



[dpdk-dev] [PATCH v4] vhost: use SMP barriers instead of compiler ones.

2016-03-21 Thread Xie, Huawei
On 3/21/2016 10:07 PM, Ananyev, Konstantin wrote:
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ilya Maximets
>> Sent: Monday, March 21, 2016 4:50 AM
>> To: Yuanhan Liu
>> Cc: dev at dpdk.org; Xie, Huawei; Dyasly Sergey
>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: use SMP barriers instead of 
>> compiler ones.
>>
>>
>>
>> On 18.03.2016 15:41, Yuanhan Liu wrote:
>>> On Fri, Mar 18, 2016 at 03:23:53PM +0300, Ilya Maximets wrote:
>>>> Since commit 4c02e453cc62 ("eal: introduce SMP memory barriers") virtio
>>>> uses architecture dependent SMP barriers. vHost should use them too.
>>>>
>>>> Fixes: 4c02e453cc62 ("eal: introduce SMP memory barriers")
>>>>
>>>> Signed-off-by: Ilya Maximets 
>>>> ---
>>>>  lib/librte_vhost/vhost_rxtx.c | 7 ---
>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
>>>> index b4da665..859c669 100644
>>>> --- a/lib/librte_vhost/vhost_rxtx.c
>>>> +++ b/lib/librte_vhost/vhost_rxtx.c
>>>> @@ -315,7 +315,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t 
>>>> queue_id,
>>>>rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
>>>>}
>>>>
>>>> -  rte_compiler_barrier();
>>>> +  rte_smp_wmb();
>>>>
>>>>/* Wait until it's our turn to add our buffer to the used ring. */
>>>>while (unlikely(vq->last_used_idx != res_start_idx))
>>>> @@ -565,7 +565,7 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t 
>>>> queue_id,
>>>>
>>>>nr_used = copy_mbuf_to_desc_mergeable(dev, vq, start, end,
>>>>  pkts[pkt_idx]);
>>>> -  rte_compiler_barrier();
>>>> +  rte_smp_wmb();
>>>>
>>>>/*
>>>> * Wait until it's our turn to add our buffer
>>>> @@ -923,7 +923,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
>>>> uint16_t queue_id,
>>>>sizeof(vq->used->ring[used_idx]));
>>>>}
>>>>
>>>> -  rte_compiler_barrier();
>>>> +  rte_smp_wmb();
>>>> +  rte_smp_rmb();
>>> rte_smp_mb?
>> rte_smp_mb() is a real mm_fence() on x86. And we don't need to synchronize 
>> reads with
>> writes here, only reads with reads and writes with writes. It is enough 
>> because next
>> increment uses read and write. Pair of barriers is better because it will 
>> not impact
>> on performance on x86.
> Not arguing about that particular patch, just a question:
> Why do we have:
> #define rte_smp_mb() rte_mb()

Konstantin: actually, smp_mb is defined as mfence, while smp_rmb/smp_wmb are
defined as compiler barriers in kernel_src/arch/x86/include/asm/barrier.h.

> for  x86?
> Why not just:
> #define rte_smp_mb() rte_compiler_barrier()
> here?
> I meant for situations when we do need real mfence, there is an 'rte_mb' to 
> use.
> Konstantin
>
>> Best regards, Ilya Maximets.



[dpdk-dev] [PATCH RFC v3 1/3] vhost: use SMP barriers instead of compiler ones.

2016-03-18 Thread Xie, Huawei
On 3/18/2016 6:39 PM, Ilya Maximets wrote:
>
> On 18.03.2016 13:27, Xie, Huawei wrote:
>> On 3/18/2016 6:23 PM, Ilya Maximets wrote:
>>> On 18.03.2016 13:08, Xie, Huawei wrote:
>>>> On 2/24/2016 7:47 PM, Ilya Maximets wrote:
>>>>>* Wait until it's our turn to add our buffer
>>>>> @@ -979,7 +979,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
>>>>> uint16_t queue_id,
>>>>>   entry_success++;
>>>>>   }
>>>>>  
>>>>> - rte_compiler_barrier();
>>>>> + rte_smp_rmb();
>>>> smp_rmb()?
>>> There is no such function 'smp_rmb' in DPDK.
>>> But:
>>> .../arch/arm/rte_atomic.h:#define rte_smp_rmb() rte_rmb()
>>> .../arch/ppc_64/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
>>> .../arch/tile/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
>>> .../arch/x86/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
>> I mean shoudn't be rte_smp_wmb()?
> No. Here we need to be sure that copying of data from descriptor to
> our local mbuf completed before 'vq->used->idx += entry_success;'.
>
> Read memory barrier will help us with it.
>
> In other places write barriers used because copying performed in
> opposite direction.

What about the update to the used ring?


>
>>>>>   vq->used->idx += entry_success;
>>>>>   vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>>>>>   sizeof(vq->used->idx));
>>>>> -- 2.5.0
>>>>
>>
>>



[dpdk-dev] [PATCH RFC v3 1/3] vhost: use SMP barriers instead of compiler ones.

2016-03-18 Thread Xie, Huawei
On 3/18/2016 6:23 PM, Ilya Maximets wrote:
> On 18.03.2016 13:08, Xie, Huawei wrote:
>> On 2/24/2016 7:47 PM, Ilya Maximets wrote:
>>>  * Wait until it's our turn to add our buffer
>>> @@ -979,7 +979,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
>>> uint16_t queue_id,
>>> entry_success++;
>>> }
>>>  
>>> -   rte_compiler_barrier();
>>> +   rte_smp_rmb();
>> smp_rmb()?
> There is no such function 'smp_rmb' in DPDK.
> But:
> .../arch/arm/rte_atomic.h:#define rte_smp_rmb() rte_rmb()
> .../arch/ppc_64/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
> .../arch/tile/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()
> .../arch/x86/rte_atomic.h:#define rte_smp_rmb() rte_compiler_barrier()

I mean, shouldn't it be rte_smp_wmb()?

>
>>> vq->used->idx += entry_success;
>>> vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>>> sizeof(vq->used->idx));
>>> -- 2.5.0
>>
>>



[dpdk-dev] [PATCH RFC v3 0/3] Thread safe rte_vhost_enqueue_burst().

2016-03-18 Thread Xie, Huawei
On 3/18/2016 5:55 PM, Ilya Maximets wrote:
> Hi all.
> And what about first patch of this series:
> "vhost: use SMP barriers instead of compiler ones." ?
>
> It's not about thread safety in terms of 'lockless'. It is the standalone
> patch that fixes many times discussed issue with barriers on arm.
>
> Best regards, Ilya Maximets.
Right, for ARM. I put a comment there. Btw, could you add --in-reply-to
when you send your next patch?


[dpdk-dev] [PATCH RFC v3 1/3] vhost: use SMP barriers instead of compiler ones.

2016-03-18 Thread Xie, Huawei
On 2/24/2016 7:47 PM, Ilya Maximets wrote:
>* Wait until it's our turn to add our buffer
> @@ -979,7 +979,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t 
> queue_id,
>   entry_success++;
>   }
>  
> - rte_compiler_barrier();
> + rte_smp_rmb();

smp_rmb()?

>   vq->used->idx += entry_success;
>   vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>   sizeof(vq->used->idx));
> -- 2.5.0



[dpdk-dev] vhost: no protection against malformed queue descriptors in rte_vhost_dequeue_burst()

2016-03-17 Thread Xie, Huawei
On 3/16/2016 8:53 PM, Patrik Andersson R wrote:
> Hello,
>
> When taking a snapshot of a running VM instance, using OpenStack
> "nova image-create", I noticed that one OVS pmd-thread eventually
> failed in DPDK rte_vhost_dequeue_burst() with repeating log entries:
>
>compute-0-6 ovs-vswitchd[38172]: VHOST_DATA: Failed to allocate
> memory for mbuf.
>
>
> Debugging (data included further down) this issue lead to the
> observation that there is no protection against malformed vhost
> queue descriptors, thus tenant separation might be violated as a
> single faulty VM might bring down the connectivity of all VMs
> connected to the same virtual switch.
>
> To avoid this, validation would be needed at some points in the
> rte_vhost_dequeue_burst() code:
>
>   1) when the queue descriptor is picked up for processing,
>   desc->flags and desc->len might both be 0
>
>...
>desc = &vq->desc[head[entry_success]];
>...
>/* Discard first buffer as it is the virtio header */
>if (desc->flags & VRING_DESC_F_NEXT) {
> desc = &vq->desc[desc->next];
> vb_offset = 0;
> vb_avail = desc->len;
>} else {
> vb_offset = vq->vhost_hlen;
> vb_avail = desc->len - vb_offset;
>}
> 
>
>   2) at buffer address translation gpa_to_vva(), might fail
>   returning NULL as indication
>
>vb_addr = gpa_to_vva(dev, desc->addr);
>...
>while (cpy_len != 0) {
> rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
> (void *)((uintptr_t)(vb_addr + vb_offset)),
> cpy_len);
>...
>}
>...
>
>
> Wondering if there are any plans of adding any kind of validation in
> DPDK, or if it would be useful to suggest specific implementation of
> such validations in the DPDK code?
>
> Or is there some mechanism that gives us the confidence to trust
> the vhost queue content absolutely?
>
>
>
> Debugging data:
>
> For my scenario the problem occurs in DPDK rte_vhost_dequeue_burst()
> due to use of a vhost queue descriptor that has all fields 0:
>
>   (gdb) print *desc
>{addr = 0, len = 0, flags = 0, next = 0}
>
>
> Subsequent use of desc->len to compute vb_avail = desc->len - vb_offset,
> leads to the problem observed. What happens is that the packet needs to
> be segmented -- on my system it fails roughly at segment 122000 when
> memory available for mbufs run out.
>
> The relevant local variables for rte_vhost_dequeue_burst() when breaking
> on the condition desc->len == 0:
>
>vb_avail = 4294967284  (0xfff4)
>seg_avail = 2608
>vb_offset = 12
>cpy_len = 2608
>seg_num = 1
>desc = 0x2aadb6e5c000
>vb_addr = 46928960159744
>entry_success = 0
>
> Note also that there is no crash despite desc->addr being zero, as
> it is a valid address in the regions mapped to the device. Although the
> 3 regions mapped do not seem to be correct either at this stage.
>
>
> The versions that I'm running are OVS 2.4.0, with corrections from the
> 2.4 branch, and DPDK 2.1.0. QEMU emulator version 2.2.0 and
> libvirt version 1.2.12.
>
>
> Regards,
>
> Patrik

Thanks Patrik. You are right. We had planned to enhance the robustness
of vhost so that neither a malicious nor a buggy guest virtio driver could
corrupt vhost. The 16.04 RC1 has already fixed some of these issues (the
return of gpa_to_vva wasn't checked before).
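
As a sketch of the kind of validation discussed above (the names here are hypothetical, and fake_gpa_to_vva() is a toy stand-in for the real region lookup), the two hazard cases can be rejected before the copy loop runs:

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT 1

/* Simplified stand-in for struct vring_desc (virtio spec layout). */
struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* Toy stand-in for gpa_to_vva(): returns 0 when the guest physical
 * address falls outside the single mapped region [0x1000, 0x2000). */
static uint64_t fake_gpa_to_vva(uint64_t gpa)
{
    return (gpa >= 0x1000 && gpa < 0x2000) ? 0x7f0000000000ULL + gpa : 0;
}

/* Reject the two hazards described above: a lone descriptor too short to
 * hold the virtio header (desc->len - vb_offset would underflow to ~4G,
 * as in the debugging data), and an address that fails translation. */
static int validate_desc(const struct vring_desc *desc, uint32_t vhost_hlen)
{
    if (!(desc->flags & VRING_DESC_F_NEXT) && desc->len < vhost_hlen)
        return -1;                 /* cannot even hold the net header */
    if (fake_gpa_to_vva(desc->addr) == 0)
        return -1;                 /* unmapped guest address */
    return 0;
}
```

With the all-zero descriptor from the gdb dump above, validate_desc() rejects it instead of letting vb_avail underflow.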

>



[dpdk-dev] [PATCH] vhost: remove unnecessary memset for virtio net hdr

2016-03-17 Thread Xie, Huawei
On 3/16/2016 2:44 PM, Yuanhan Liu wrote:
> We have to reset the virtio net hdr at virtio_enqueue_offload()
> before, due to all mbufs share a single virtio_hdr structure:
>
>   struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0};
>
>   foreach (mbuf) {
>   virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);
>
>   copy net hdr and mbuf to desc buf
>   }
>
> However, after the vhost rxtx refactor, the code looks like:
>
>   copy_mbuf_to_desc(mbuf)
>   {
>   struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, }, 0}
>
>   virtio_enqueue_offload(mbuf, &virtio_hdr.hdr);
>
>   copy net hdr and mbuf to desc buf
>   }
>
>   foreach (mbuf) {
>   copy_mbuf_to_desc(mbuf);
>   }
>
> Therefore, the memset at virtio_enqueue_offload() is not necessary
> any more; remove it.
>
> Signed-off-by: Yuanhan Liu 
> ---
>  lib/librte_vhost/vhost_rxtx.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index a6330f8..b4da665 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -94,8 +94,6 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t 
> qp_nb)
>  static void
>  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr 
> *net_hdr)
>  {
> - memset(net_hdr, 0, sizeof(struct virtio_net_hdr));
> -
>   if (m_buf->ol_flags & PKT_TX_L4_MASK) {
>   net_hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>   net_hdr->csum_start = m_buf->l2_len + m_buf->l3_len;

Acked-by: Huawei Xie 
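
For context on why the memset became redundant: after the refactor, each copy_mbuf_to_desc() call constructs a fresh, zero-initialized header on its own stack, so zeroing it again inside virtio_enqueue_offload() duplicated work. A minimal illustration with a stand-in header layout (virtio spec field names, not the exact DPDK struct):

```c
#include <stdint.h>

/* Stand-in for the virtio net header structs (virtio spec layout). */
struct virtio_net_hdr {
    uint8_t  flags, gso_type;
    uint16_t hdr_len, gso_size, csum_start, csum_offset;
};
struct virtio_net_hdr_mrg_rxbuf {
    struct virtio_net_hdr hdr;
    uint16_t num_buffers;
};

/* The initializer used by copy_mbuf_to_desc() already zeroes every
 * member, so the hdr handed to virtio_enqueue_offload() starts clean. */
static int fresh_hdr_is_zeroed(void)
{
    struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};

    return virtio_hdr.hdr.flags == 0 && virtio_hdr.hdr.gso_type == 0 &&
           virtio_hdr.hdr.hdr_len == 0 && virtio_hdr.hdr.gso_size == 0 &&
           virtio_hdr.hdr.csum_start == 0 &&
           virtio_hdr.hdr.csum_offset == 0 && virtio_hdr.num_buffers == 0;
}
```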


[dpdk-dev] [PATCH 0/3 v3] virtio: Tx performance improvements

2016-03-16 Thread Xie, Huawei
On 3/14/2016 6:56 PM, Richardson, Bruce wrote:
> On Fri, Mar 04, 2016 at 10:19:18AM -0800, Stephen Hemminger wrote:
>> This patch series uses virtio negotiated features to allow for
>> more packets to be queued to host even though the default QEMU/KVM
>> virtio queue is very small 256 elements.
>>
>> Stephen Hemminger (3):
>>   virtio: use indirect ring elements
>>   virtio: use any layout on transmit
>>   virtio: optimize transmit enqueue
>>
> These patches require an ack to merge. Virtio maintainers, can you please
> review and ack if ok.
>
> /Bruce
>
Acked in the previous release window.

Acked-by: Huawei Xie 


[dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring

2016-03-16 Thread Xie, Huawei
On 3/15/2016 7:14 AM, Thomas Monjalon wrote:
> 2016-01-05 07:16, Xie, Huawei:
>> On 1/5/2016 2:42 PM, Xie, Huawei wrote:
>>> This patch removes the internal lockless enqueue implementation.
>>> DPDK doesn't support receiving/transmitting packets from/to the same
>>> queue. Vhost PMD wraps a vhost device as a normal DPDK port. DPDK
>>> applications normally have their own lock implementation when they
>>> enqueue packets to the same queue of a port.
>>>
>>> The atomic cmpset is a costly operation. This patch should help
>>> performance a bit.
>>>
>>> Signed-off-by: Huawei Xie 
>> This patch modifies the API's behavior, which is also a trivial ABI
>> change. In my opinion, application shouldn't rely on previous behavior.
>> Anyway, i am checking how to declare the ABI change.
> I guess this patch is now obsolete?

How about we delay this to the next release, after more consideration of
whether we should keep this behavior and what the best way to handle
concurrency in vhost is.




[dpdk-dev] [PATCH v4 4/4] virtio: return 1 to tell the upper layer we don't take over this device

2016-03-08 Thread Xie, Huawei
On 3/1/2016 6:09 PM, Xie, Huawei wrote:
> On 3/1/2016 5:57 PM, Thomas Monjalon wrote:
>> 2016-03-01 08:39, Xie, Huawei:
>>> On 3/1/2016 4:24 PM, Thomas Monjalon wrote:
>>>> 2016-03-01 07:53, Xie, Huawei:
>>>>> On 3/1/2016 3:18 PM, Thomas Monjalon wrote:
>>>>>> 2016-02-26 09:53, Huawei Xie:
>>>>>>> @@ -1037,8 +1039,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>>>>>>>  
>>>>>>> pci_dev = eth_dev->pci_dev;
>>>>>>>  
>>>>>>> -   if (vtpci_init(pci_dev, hw) < 0)
>>>>>>> -   return -1;
>>>>>>> +   ret = vtpci_init(pci_dev, hw);
>>>>>>> +   if (ret) {
>>>>>>> +   rte_free(eth_dev->data->mac_addrs);
>>>>>> The freeing seems not related to this patch.
>>>>> I can send a separate patch, ok within this patchset?
>>>> Yes
>>>>
>>>>>> [...]
>>>>>>> PMD_INIT_LOG(INFO, "trying with legacy virtio pci.");
>>>>>>> -   if (legacy_virtio_resource_init(dev, hw) < 0)
>>>>>>> +   if (legacy_virtio_resource_init(dev, hw) < 0) {
>>>>>>> +   if (dev->kdrv == RTE_KDRV_UNKNOWN) {
>>>>>>> +   PMD_INIT_LOG(INFO,
>>>>>>> +   "skip kernel managed virtio device.");
>>>>>>> +   return 1;
>>>>>>> +   }
>>>>>>> return -1;
>>>>>>> +   }
>>>>>> You cannot skip a device if it was whitelisted.
>>>>>> I think you should check RTE_DEVTYPE_WHITELISTED_PCI and throw an error
>>>>>> in this case.
>>>>> I feel there is a subtle difference in the understanding of the -w arg. To
>>>>> me, without it, probe all devices; with it, only probe whitelisted devices.
>>>>> That is all.
>>>> I don't know if it is clearly documented indeed.
>>>>
>>>>> Do you mean that -w implies that devices whitelisted must be probed
>>>>> successfully otherwise we throw an error? If i get it right, then what
>>>>> about the devices whitelisted but without PMD driver?
>>>> Yes we should probably consider the whitelist as a "forced" init.
>>>> Later, we could introduce some device flags for probing/discovery:
>>>> PROBE_AUTO, PROBE_FORCE, PROBE_IGNORE. It would make white/black list
>>>> more precise.
>>>>
>>>>> I will fix, :).
>>>>> if (dev->kdrv == RTE_KDRV_UNKNOWN && dev->devargs->type !=
>>>>> RTE_DEVTYPE_WHITELISTED_PCI) {
>>>>> 
>>>>> return 1;
>>>>> }
>>>> You should also consider the blacklist case: if there is a blacklist,
>>>> the not blacklisted devices must be initialised or throw an error.
>>>>
>>> Don't we already skip probing the blacklisted device in
>>> rte_eal_pci_probe_one_driver?
>> Yes, I'm talking about the devices which are not blacklisted.
>> Having some blacklisted devices imply that others are implicitly whitelisted.
> For blacklist, it only means the blacklisted device should be excluded
> from being probed. It doesn't mean all other devices should be probed
> either successfully or otherwise throw an error which causes DPDK to exit.
> Even then, the upper layer should explicitly put the non-blacklisted
> devices into the whitelist; I mean here we should only deal with the whitelist.

Thomas, comments?
Btw, I have reworked the patch, but git am always complains that patch
format detection failed. Investigating; will send it later.

>



[dpdk-dev] [PATCH v2 0/5] virtio: Tx performance improvements

2016-03-08 Thread Xie, Huawei
On 3/5/2016 2:17 AM, Stephen Hemminger wrote:
> Resending them now. I don't understand the issue with merge-able header.
> Virtio negotiation is symmetric, if receiver is using merge-able header
> then the transmitter needs to send it also.

Yes, both receiver and transmitter use the same header format, though
mergeable is actually only meaningful in the RX path.


[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 3/7/2016 4:36 PM, Yuanhan Liu wrote:
> On Mon, Mar 07, 2016 at 07:52:22AM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> Current virtio_dev_merge_rx() implementation just looks like the
>>> old rte_vhost_dequeue_burst(), full of twisted logic, that you
>>> can see same code block in quite many different places.
>>>
>>> However, the logic of virtio_dev_merge_rx() is quite similar to
>>> virtio_dev_rx().  The big difference is that the mergeable one
>>> could allocate more than one available entries to hold the data.
>>> Fetching all available entries to vec_buf at once makes the
>> [...]
>>> -   }
>>> +static inline uint32_t __attribute__((always_inline))
>>> +copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue 
>>> *vq,
>>> +   uint16_t res_start_idx, uint16_t res_end_idx,
>>> +   struct rte_mbuf *m)
>>> +{
>>> +   struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
>>> +   uint32_t vec_idx = 0;
>>> +   uint16_t cur_idx = res_start_idx;
>>> +   uint64_t desc_addr;
>>> +   uint32_t mbuf_offset, mbuf_avail;
>>> +   uint32_t desc_offset, desc_avail;
>>> +   uint32_t cpy_len;
>>> +   uint16_t desc_idx, used_idx;
>>> +   uint32_t nr_used = 0;
>>>  
>>> -   cpy_len = RTE_MIN(vb_avail, seg_avail);
>>> +   if (m == NULL)
>>> +   return 0;
>> Is this inherited from old code?
> Yes.
>
>> Let us remove the unnecessary check.
>> Caller ensures it is not NULL.
> ...
>>> +   desc_avail  = vq->buf_vec[vec_idx].buf_len - vq->vhost_hlen;
>>> +   desc_offset = vq->vhost_hlen;
>> As we know we are in merge-able path, use sizeof(virtio_net_hdr) to save
>> one load for the header len.
> Please, it's a refactor patch series. You have mentioned quite many
> trivial issues here and there, which I don't care too much and I don't
> think they would matter somehow. In addition, they are actually from
> the old code.

For normal code, it would be better to use vq->vhost_hlen, for example
for future compatibility. But in DPDK we don't waste cycles whenever we
can avoid it, especially since vhost is the centralized bottleneck.
As for the check of m == NULL, it should be removed: it not only occupies
unnecessary branch-prediction resources but also causes confusion about
the nr_used value returned from copy_mbuf_to_desc_mergeable.
It is OK if you don't want to fix this in this patchset.
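
The header-length point can be made concrete: when mergeable buffers are negotiated, vq->vhost_hlen is always sizeof(struct virtio_net_hdr_mrg_rxbuf), so the mergeable path can use a compile-time constant instead of a per-packet load. A sketch with the spec-layout structs (not the exact DPDK definitions):

```c
#include <stdint.h>

/* Stand-in virtio net header structs (virtio spec layout). */
struct virtio_net_hdr {
    uint8_t  flags, gso_type;
    uint16_t hdr_len, gso_size, csum_start, csum_offset;
};
struct virtio_net_hdr_mrg_rxbuf {
    struct virtio_net_hdr hdr;
    uint16_t num_buffers;
};

/* In the mergeable path the negotiated header is always the mrg_rxbuf
 * variant, so its length is a compile-time constant: */
enum { MRG_HDR_LEN = sizeof(struct virtio_net_hdr_mrg_rxbuf) };

/* desc_avail computed with the constant instead of loading
 * vq->vhost_hlen on every packet: */
static uint32_t mergeable_desc_avail(uint32_t buf_len)
{
    return buf_len - MRG_HDR_LEN;
}
```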

>>> +
>>> +   mbuf_avail  = rte_pktmbuf_data_len(m);
>>> +   mbuf_offset = 0;
>>> +   while (1) {
>>> +   /* done with current desc buf, get the next one */
>>> +
>> [...]
>>> +   if (reserve_avail_buf_mergeable(vq, pkt_len, &start, &end) < 0)
>>> +   break;
>>>  
>>> +   nr_used = copy_mbuf_to_desc_mergeable(dev, vq, start, end,
>>> + pkts[pkt_idx]);
>> In which case couldn't we get nr_used from start and end?
> When pkts[pkt_idx] is NULL, though you suggest to remove it, the check
> is here.
>
>   --yliu
>



[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> Current virtio_dev_merge_rx() implementation just looks like the
> old rte_vhost_dequeue_burst(), full of twisted logic, that you
> can see same code block in quite many different places.
>
> However, the logic of virtio_dev_merge_rx() is quite similar to
> virtio_dev_rx().  The big difference is that the mergeable one
> could allocate more than one available entries to hold the data.
> Fetching all available entries to vec_buf at once makes the
[...]
> - }
> +static inline uint32_t __attribute__((always_inline))
> +copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue 
> *vq,
> + uint16_t res_start_idx, uint16_t res_end_idx,
> + struct rte_mbuf *m)
> +{
> + struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> + uint32_t vec_idx = 0;
> + uint16_t cur_idx = res_start_idx;
> + uint64_t desc_addr;
> + uint32_t mbuf_offset, mbuf_avail;
> + uint32_t desc_offset, desc_avail;
> + uint32_t cpy_len;
> + uint16_t desc_idx, used_idx;
> + uint32_t nr_used = 0;
>  
> - cpy_len = RTE_MIN(vb_avail, seg_avail);
> + if (m == NULL)
> + return 0;

Is this inherited from old code? Let us remove the unnecessary check.
Caller ensures it is not NULL.

>  
> - while (cpy_len > 0) {
> - /* Copy mbuf data to vring buffer */
> - rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
> - rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
> - cpy_len);
> + LOG_DEBUG(VHOST_DATA,
> + "(%"PRIu64") Current Index %d| End Index %d\n",
> + dev->device_fh, cur_idx, res_end_idx);
>  
> - PRINT_PACKET(dev,
> - (uintptr_t)(vb_addr + vb_offset),
> - cpy_len, 0);
> + desc_addr = gpa_to_vva(dev, vq->buf_vec[vec_idx].buf_addr);
> + rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> + virtio_hdr.num_buffers = res_end_idx - res_start_idx;
> + LOG_DEBUG(VHOST_DATA, "(%"PRIu64") RX: Num merge buffers %d\n",
> + dev->device_fh, virtio_hdr.num_buffers);
>  
> - seg_offset += cpy_len;
> - vb_offset += cpy_len;
> - seg_avail -= cpy_len;
> - vb_avail -= cpy_len;
> - entry_len += cpy_len;
> -
> - if (seg_avail != 0) {
> - /*
> -  * The virtio buffer in this vring
> -  * entry reach to its end.
> -  * But the segment doesn't complete.
> -  */
> - if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
> - VRING_DESC_F_NEXT) == 0) {
> + virtio_enqueue_offload(m, &virtio_hdr.hdr);
> + rte_memcpy((void *)(uintptr_t)desc_addr,
> + (const void *)&virtio_hdr, vq->vhost_hlen);
> + PRINT_PACKET(dev, (uintptr_t)desc_addr, vq->vhost_hlen, 0);
> +
> + desc_avail  = vq->buf_vec[vec_idx].buf_len - vq->vhost_hlen;
> + desc_offset = vq->vhost_hlen;

Since we know we are in the mergeable path, use
sizeof(struct virtio_net_hdr_mrg_rxbuf) to save one load for the header
length.
> +
> + mbuf_avail  = rte_pktmbuf_data_len(m);
> + mbuf_offset = 0;
> + while (1) {
> + /* done with current desc buf, get the next one */
> +
[...]
> + if (reserve_avail_buf_mergeable(vq, pkt_len, &start, &end) < 0)
> + break;
>  
> + nr_used = copy_mbuf_to_desc_mergeable(dev, vq, start, end,
> +   pkts[pkt_idx]);

In which case couldn't we get nr_used from start and end?

>   rte_compiler_barrier();
>  
>   /*
>* Wait until it's our turn to add our buffer
>* to the used ring.
>*/
> - while (unlikely(vq->last_used_idx != res_base_idx))
> + while (unlikely(vq->last_used_idx != start))
>   rte_pause();
>  
> - *(volatile uint16_t *)&vq->used->idx += entry_success;
> - vq->last_used_idx = res_cur_idx;
> + *(volatile uint16_t *)&vq->used->idx += nr_used;
> + vq->last_used_idx = end;
>   }
>  
> -merge_rx_exit:
>   if (likely(pkt_idx)) {
>   /* flush used->idx update before we read avail->flags. */
>   rte_mb();



[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 3/7/2016 3:04 PM, Xie, Huawei wrote:
> On 3/7/2016 2:49 PM, Yuanhan Liu wrote:
>> On Mon, Mar 07, 2016 at 06:38:42AM +0000, Xie, Huawei wrote:
>>> On 3/7/2016 2:35 PM, Yuanhan Liu wrote:
>>>> On Mon, Mar 07, 2016 at 06:22:25AM +, Xie, Huawei wrote:
>>>>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>>>>> +uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)];
>>>>>> +uint32_t vec_id = *vec_idx;
>>>>>> +uint32_t len= *allocated;
>>>>>>  
>>>>> There is bug not using volatile to retrieve the avail idx.
>>>> avail_idx? This is actually from "vq->last_used_idx_res". 
>>> uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)]
>>>
>>> the idx retrieved from avail->ring. 
>> Hmm.. I saw quite many similar lines of code retrieving an index from
>> avail->ring, but none of them acutally use "volatile". So, a bug?
> Others are not. This function is inline, and is in one translation unit
> with its caller.

Oh, my fault. For the avail idx, we do need to take care about whether
to use volatile.

>>  --yliu
>>
>



[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 3/7/2016 2:49 PM, Yuanhan Liu wrote:
> On Mon, Mar 07, 2016 at 06:38:42AM +0000, Xie, Huawei wrote:
>> On 3/7/2016 2:35 PM, Yuanhan Liu wrote:
>>> On Mon, Mar 07, 2016 at 06:22:25AM +, Xie, Huawei wrote:
>>>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>>>> + uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)];
>>>>> + uint32_t vec_id = *vec_idx;
>>>>> + uint32_t len= *allocated;
>>>>>  
>>>> There is bug not using volatile to retrieve the avail idx.
>>> avail_idx? This is actually from "vq->last_used_idx_res". 
>> uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)]
>>
>> the idx retrieved from avail->ring. 
> Hmm.. I saw quite many similar lines of code retrieving an index from
> avail->ring, but none of them acutally use "volatile". So, a bug?

The others are not. This function is inline, and it is in the same
translation unit as its caller.

>
>   --yliu
>



[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 3/7/2016 2:35 PM, Yuanhan Liu wrote:
> On Mon, Mar 07, 2016 at 06:22:25AM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> +   uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)];
>>> +   uint32_t vec_id = *vec_idx;
>>> +   uint32_t len= *allocated;
>>>  
>> There is bug not using volatile to retrieve the avail idx.
> avail_idx? This is actually from "vq->last_used_idx_res". 

uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)]

the idx retrieved from avail->ring. 


>
>   --yliu
>



[dpdk-dev] [PATCH v2 3/7] vhost: refactor virtio_dev_merge_rx

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> + uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)];
> + uint32_t vec_id = *vec_idx;
> + uint32_t len= *allocated;
>  
There is a bug here: volatile is not used to retrieve the avail idx.
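
The volatile point can be sketched as follows (vring_avail here is a simplified stand-in, queue size 256). Reading the guest-updated index through a volatile lvalue forces the compiler to reload it from memory each time rather than caching it in a register:

```c
#include <stdint.h>

/* Simplified stand-in for the vring avail ring (queue size 256). */
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[256]; };

/* The guest updates avail->idx concurrently; a volatile read keeps the
 * compiler from hoisting or caching the load across loop iterations. */
static inline uint16_t load_avail_idx(const struct vring_avail *avail)
{
    return *(const volatile uint16_t *)&avail->idx;
}
```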


[dpdk-dev] [PATCH v2 4/7] vhost: do not use rte_memcpy for virtio_hdr copy

2016-03-07 Thread Xie, Huawei
On 3/7/2016 12:20 PM, Stephen Hemminger wrote:
> On Thu, 18 Feb 2016 21:49:09 +0800
> Yuanhan Liu  wrote:
>
>> +static inline void
>> +copy_virtio_net_hdr(struct vhost_virtqueue *vq, uint64_t desc_addr,
>> +struct virtio_net_hdr_mrg_rxbuf hdr)
>> +{
>> +if (vq->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
>> +*(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr = hdr;
>> +} else {
>> +*(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr;
>> +}
>> +}
>> +
> Don't use {} around single statements.
There are other coding-style issues, like
 used_idx = vq->last_used_idx & (vq->size -1);
   ^ space needed
Please run checkpatch against your patch.
> Since you are doing all this casting, why not just use regular old memcpy
> which will be inlined by Gcc  into same instructions anyway.
> And since are always casting the desc_addr, why not pass a type that
> doesn't need the additional cast (like void *)
>



[dpdk-dev] [PATCH v2 2/7] vhost: refactor virtio_dev_rx

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> + while (1) {
> + /* done with current mbuf, fetch next */
> + if (mbuf_avail == 0) {
> + m = m->next;
> + if (m == NULL)
> + break;
> +
> + mbuf_offset = 0;
> + mbuf_avail  = rte_pktmbuf_data_len(m);
> + }
> +

You could use while (mbuf_avail || m->next) to align with the style of
copy_desc_to_mbuf.


[dpdk-dev] [PATCH v2 7/7] vhost: do sanity check for desc->next

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> + if (unlikely(desc->next >= vq->size))
> + goto fail;

Desc chains could be forged into a loop; vhost would then run a dead loop
until it exhausts all mbuf memory.
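
A sketch of a bounded walk (names hypothetical, simplified vring_desc layout): besides rejecting an out-of-range next index as the patch does, capping the hop count at the queue size makes a forged circular chain terminate instead of pinning the PMD thread:

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT 1

/* Simplified stand-in for struct vring_desc (virtio spec layout). */
struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* Bounded chain walk: a well-formed chain never has more descriptors
 * than the queue, so more than vq_size hops proves a forged loop.
 * Returns the chain length, or -1 if the chain is malformed. */
static int walk_desc_chain(const struct vring_desc *table, uint16_t head,
                           uint16_t vq_size)
{
    const struct vring_desc *desc = &table[head];
    int hops = 1;

    while (desc->flags & VRING_DESC_F_NEXT) {
        if (desc->next >= vq_size || hops >= vq_size)
            return -1;             /* out-of-range index or a loop */
        desc = &table[desc->next];
        hops++;
    }
    return hops;
}
```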



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> + mbuf_avail  = 0;
> + mbuf_offset = 0;
One coding-style nit: put the initialization at the definition.


[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 3/4/2016 10:30 AM, Yuanhan Liu wrote:
> On Thu, Mar 03, 2016 at 05:40:14PM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> The current rte_vhost_dequeue_burst() implementation is a bit messy
>> [...]
>>> +
>>>  uint16_t
>>>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>>> struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>>>  {
>>> -   struct rte_mbuf *m, *prev;
>>> struct vhost_virtqueue *vq;
>>> -   struct vring_desc *desc;
>>> -   uint64_t vb_addr = 0;
>>> -   uint64_t vb_net_hdr_addr = 0;
>>> -   uint32_t head[MAX_PKT_BURST];
>>> +   uint32_t desc_indexes[MAX_PKT_BURST];
>> indices
> http://dictionary.reference.com/browse/index
>
> index
> noun, plural indexes, indices 

OK, I see both forms are used.


>>
>>> uint32_t used_idx;
>>> uint32_t i;
>>> -   uint16_t free_entries, entry_success = 0;
>>> +   uint16_t free_entries;
>>> uint16_t avail_idx;
>>> -   struct virtio_net_hdr *hdr = NULL;
>>> +   struct rte_mbuf *m;
>>>  
>>> if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
>>> RTE_LOG(ERR, VHOST_DATA,
>>> @@ -730,197 +813,49 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
>>> uint16_t queue_id,
>>> return 0;
>>>  
>>> -   if (entry_success < (free_entries - 1)) {
>>> -   /* Prefetch descriptor index. */
>>> -   rte_prefetch0(&vq->desc[head[entry_success+1]]);
>>> -   rte_prefetch0(&vq->used->ring[(used_idx + 1) & 
>>> (vq->size - 1)]);
>>> -   }
>> Why is this prefetch silently dropped in the patch?
> Oops, good catching. Will fix it. Thanks.
>
>
>>> break;
>>> +   pkts[i] = m;
>>>  
>>> -   m->nb_segs = seg_num;
>>> -   if ((hdr->flags != 0) || (hdr->gso_type != 
>>> VIRTIO_NET_HDR_GSO_NONE))
>>> -   vhost_dequeue_offload(hdr, m);
>>> -
>>> -   pkts[entry_success] = m;
>>> -   vq->last_used_idx++;
>>> -   entry_success++;
>>> +   used_idx = vq->last_used_idx++ & (vq->size - 1);
>>> +   vq->used->ring[used_idx].id  = desc_indexes[i];
>>> +   vq->used->ring[used_idx].len = 0;
>> What is the correct value for ring[used_idx].len,  the packet length or 0?
> Good question. I didn't notice that before. Sounds buggy to me. However,
> that's from the old code. Will check it.

Yes, I know it is in the old code as well. Thanks.

>   --yliu
>



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 3/7/2016 10:47 AM, Yuanhan Liu wrote:
> On Mon, Mar 07, 2016 at 02:32:46AM +0000, Xie, Huawei wrote:
>> On 3/4/2016 10:15 AM, Yuanhan Liu wrote:
>>> On Thu, Mar 03, 2016 at 04:30:42PM +, Xie, Huawei wrote:
>>>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>>>> + mbuf_avail  = 0;
>>>>> + mbuf_offset = 0;
>>>>> + while (desc_avail || (desc->flags & VRING_DESC_F_NEXT) != 0) {
>>>>> + /* This desc reachs to its end, get the next one */
>>>>> + if (desc_avail == 0) {
>>>>> + desc = &vq->desc[desc->next];
>>>>> +
>>>>> + desc_addr = gpa_to_vva(dev, desc->addr);
>>>>> + rte_prefetch0((void *)(uintptr_t)desc_addr);
>>>>> +
>>>>> + desc_offset = 0;
>>>>> + desc_avail  = desc->len;
>>>>> +
>>>>> + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> +  * This mbuf reachs to its end, get a new one
>>>>> +  * to hold more data.
>>>>> +  */
>>>>> + if (mbuf_avail == 0) {
>>>>> + cur = rte_pktmbuf_alloc(mbuf_pool);
>>>>> + if (unlikely(!cur)) {
>>>>> + RTE_LOG(ERR, VHOST_DATA, "Failed to "
>>>>> + "allocate memory for mbuf.\n");
>>>>> + if (head)
>>>>> + rte_pktmbuf_free(head);
>>>>> + return NULL;
>>>>> + }
>>>> We could always allocate the head mbuf before the loop, then we save the
>>>> following branch and make the code more streamlined.
>>>> It reminds me that this change prevents the possibility of mbuf bulk
>>>> allocation, one solution is we pass the head mbuf from an additional
>>>> parameter.
>>> Yep, that's also something I have thought of.
>>>
>>>> Btw, put unlikely before the check of mbuf_avail and checks elsewhere.
>>> I don't think so. It would benifit for the small packets. What if,
>>> however, when TSO or jumbo frame is enabled that we have big packets?
>> Prefer to favor the path that packet could fit in one mbuf.
> Hmmm, why? While I know that TSO and mergeable buf is disable by default
> in our vhost example vhost-switch, they are enabled by default in real
> life.

Mergeable is only meaningful in the RX path.
If you mean big packets in the TX path, I personally favor the path where
the packet fits in one mbuf.


>> Btw, not specially to this, return "m = copy_desc_to_mbuf(dev, vq,
>> desc_indexes[i], mbuf_pool)", failure case is unlikely to happen, so add
>> unlikely for the check if (m == NULL) there. Please check all branches
>> elsewhere.
> Thanks for the remind, will have a detail check.
>
>   --yliu
>



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 3/4/2016 10:10 AM, Yuanhan Liu wrote:
> On Thu, Mar 03, 2016 at 05:19:42PM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> [...]
>> CCed changchun, the author for the chained handling of desc and mbuf.
>> The change makes the code more readable, but i think the following
>> commit message is simple and enough.
> Hmm.., my commit log tells a full story:
>
> - What is the issue? (messy/logic twisted code)
>
> - What the code does? (And what are the challenges: few tricky places)
>
> - What's the proposed solution to fix it. (the below pseudo code)
>
> And you suggest me to get rid of the first 2 items and leave 3rd item
> (a solution) only?

The following is simple and sufficient, with just one additional statement
about the repeated mbuf allocation, or your pseudo code for the twisted
logic. The other commit-message items are overly duplicated. Just my
personal opinion; up to you.

For this specific case, for example, we could turn both the mbuf and
vring_desc chains into iovecs, then use a commonly used iovec copy
algorithm for both dequeue and enqueue, which would make the code even
simpler and more readable. For such a change, one or two sentences are
clear enough to me.
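
The iovec idea can be sketched as follows: once both chains are flattened into iovec arrays, one generic scatter-gather copy (iov_copy() here is a hypothetical helper, not a DPDK API) serves both the enqueue and dequeue directions:

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/* Generic scatter-gather copy: drain src iovecs into dst iovecs,
 * advancing whichever side is exhausted first. Returns bytes copied. */
static size_t iov_copy(struct iovec *dst, int dst_cnt,
                       const struct iovec *src, int src_cnt)
{
    int di = 0, si = 0;
    size_t d_off = 0, s_off = 0, total = 0;

    while (di < dst_cnt && si < src_cnt) {
        size_t n = dst[di].iov_len - d_off;
        size_t avail = src[si].iov_len - s_off;

        if (avail < n)
            n = avail;
        memcpy((char *)dst[di].iov_base + d_off,
               (const char *)src[si].iov_base + s_off, n);
        d_off += n; s_off += n; total += n;
        if (d_off == dst[di].iov_len) { di++; d_off = 0; }
        if (s_off == src[si].iov_len) { si++; s_off = 0; }
    }
    return total;
}
```

With this shape, the desc-chain walk and the mbuf-chain walk only build iovec arrays, and the copy loop itself is shared.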

>   --yliu
>
>>> while (this_desc_is_not_drained_totally || has_next_desc) {
>>> if (this_desc_has_drained_totally) {
>>> this_desc = next_desc();
>>> }
>>>
>>> if (mbuf_has_no_room) {
>>> mbuf = allocate_a_new_mbuf();
>>> }
>>>
>>> COPY(mbuf, desc);
>>> }
>>>
>>> [...]
>>>
>>> This refactor makes the code much more readable (IMO), yet it reduces
>>> binary code size (nearly 2K).
>> I guess the reduced binary code size comes from reduced inline calls to
>> mbuf allocation.
>>



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 3/4/2016 10:15 AM, Yuanhan Liu wrote:
> On Thu, Mar 03, 2016 at 04:30:42PM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> +   mbuf_avail  = 0;
>>> +   mbuf_offset = 0;
>>> +   while (desc_avail || (desc->flags & VRING_DESC_F_NEXT) != 0) {
>>> +   /* This desc reachs to its end, get the next one */
>>> +   if (desc_avail == 0) {
>>> +   desc = &vq->desc[desc->next];
>>> +
>>> +   desc_addr = gpa_to_vva(dev, desc->addr);
>>> +   rte_prefetch0((void *)(uintptr_t)desc_addr);
>>> +
>>> +   desc_offset = 0;
>>> +   desc_avail  = desc->len;
>>> +
>>> +   PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
>>> +   }
>>> +
>>> +   /*
>>> +* This mbuf reachs to its end, get a new one
>>> +* to hold more data.
>>> +*/
>>> +   if (mbuf_avail == 0) {
>>> +   cur = rte_pktmbuf_alloc(mbuf_pool);
>>> +   if (unlikely(!cur)) {
>>> +   RTE_LOG(ERR, VHOST_DATA, "Failed to "
>>> +   "allocate memory for mbuf.\n");
>>> +   if (head)
>>> +   rte_pktmbuf_free(head);
>>> +   return NULL;
>>> +   }
>> We could always allocate the head mbuf before the loop, then we save the
>> following branch and make the code more streamlined.
>> It reminds me that this change prevents the possibility of mbuf bulk
>> allocation, one solution is we pass the head mbuf from an additional
>> parameter.
> Yep, that's also something I have thought of.
>
>> Btw, put unlikely before the check of mbuf_avail and checks elsewhere.
> I don't think so. It would benifit for the small packets. What if,
> however, when TSO or jumbo frame is enabled that we have big packets?

I prefer to favor the path where the packet fits in one mbuf.

Btw, not specific to this case: for "m = copy_desc_to_mbuf(dev, vq,
desc_indexes[i], mbuf_pool)", the failure case is unlikely to happen, so
add unlikely() to the "if (m == NULL)" check there. Please check all the
other branches as well.
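
For reference, likely()/unlikely() in DPDK wrap __builtin_expect: they never change the result of the condition, only which side the compiler lays out as the hot fall-through path. A toy sketch (alloc_or_null() is a stand-in for rte_pktmbuf_alloc()):

```c
#include <stddef.h>

/* Same definition DPDK uses for its branch hint (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Stand-in for rte_pktmbuf_alloc(): NULL on simulated pool exhaustion. */
static void *alloc_or_null(int fail)
{
    static char buf[64];

    return fail ? NULL : (void *)buf;
}

/* The allocation-failure branch almost never fires, so marking it
 * unlikely moves it out of the hot instruction stream. */
static int try_alloc(int fail)
{
    void *m = alloc_or_null(fail);

    if (unlikely(m == NULL))    /* cold error path */
        return -1;
    return 0;
}
```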

>   --yliu
>



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-07 Thread Xie, Huawei
On 3/4/2016 10:19 AM, Yuanhan Liu wrote:
> On Thu, Mar 03, 2016 at 04:21:19PM +0000, Xie, Huawei wrote:
>> On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
>>> The current rte_vhost_dequeue_burst() implementation is a bit messy
>>> and logic twisted. And you could see repeat code here and there: it
>>> invokes rte_pktmbuf_alloc() three times at three different places!
>>>
>>> However, rte_vhost_dequeue_burst() actually does a simple job: copy
>>> the packet data from vring desc to mbuf. What's tricky here is:
>>>
>>> - desc buff could be chained (by desc->next field), so that you need
>>>   fetch next one if current is wholly drained.
>>>
>>> - One mbuf could not be big enough to hold all desc buff, hence you
>>>   need to chain the mbuf as well, by the mbuf->next field.
>>>
>>> Even though, the logic could be simple. Here is the pseudo code.
>>>
>>> while (this_desc_is_not_drained_totally || has_next_desc) {
>>> if (this_desc_has_drained_totally) {
>>> this_desc = next_desc();
>>> }
>>>
>>> if (mbuf_has_no_room) {
>>> mbuf = allocate_a_new_mbuf();
>>> }
>>>
>>> COPY(mbuf, desc);
>>> }
>>>
>>> And this is how I refactored rte_vhost_dequeue_burst.
>>>
>>> Note that the old patch does a special handling for skipping virtio
>>> header. However, that could be simply done by adjusting desc_avail
>>> and desc_offset var:
>>>
>>> desc_avail  = desc->len - vq->vhost_hlen;
>>> desc_offset = vq->vhost_hlen;
>>>
>>> This refactor makes the code much more readable (IMO), yet it reduces
>>> binary code size (nearly 2K).
>>>
>>> Signed-off-by: Yuanhan Liu 
>>> ---
>>>
>>> v2: - fix potential NULL dereference bug of var "prev" and "head"
>>> ---
>>>  lib/librte_vhost/vhost_rxtx.c | 297 
>>> +-
>>>  1 file changed, 116 insertions(+), 181 deletions(-)
>>>
>>> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
>>> index 5e7e5b1..d5cd0fa 100644
>>> --- a/lib/librte_vhost/vhost_rxtx.c
>>> +++ b/lib/librte_vhost/vhost_rxtx.c
>>> @@ -702,21 +702,104 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, 
>>> struct rte_mbuf *m)
>>> }
>>>  }
>>>  
>>> +static inline struct rte_mbuf *
>>> +copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>> + uint16_t desc_idx, struct rte_mempool *mbuf_pool)
>>> +{
>>> +   struct vring_desc *desc;
>>> +   uint64_t desc_addr;
>>> +   uint32_t desc_avail, desc_offset;
>>> +   uint32_t mbuf_avail, mbuf_offset;
>>> +   uint32_t cpy_len;
>>> +   struct rte_mbuf *head = NULL;
>>> +   struct rte_mbuf *cur = NULL, *prev = NULL;
>>> +   struct virtio_net_hdr *hdr;
>>> +
>>> +   desc = &vq->desc[desc_idx];
>>> +   desc_addr = gpa_to_vva(dev, desc->addr);
>>> +   rte_prefetch0((void *)(uintptr_t)desc_addr);
>>> +
>>> +   /* Retrieve virtio net header */
>>> +   hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
>>> +   desc_avail  = desc->len - vq->vhost_hlen;
>> There is a serious bug here: desc->len - vq->vhost_hlen could wrap
>> around when desc->len < vq->vhost_hlen. A VM could easily create this
>> case. Let us fix it here.
> Nope, this issue has been there since the beginning, and this patch
> is a refactor: we should not bring any functional changes. Therefore,
> we should not fix it here.

No, I don't mean fixing it in this exact patch, but within this series.

Besides, from a refactoring perspective, we could actually make things
even simpler and more readable: both the desc chain and the mbuf chain
could be converted into iovecs, then both dequeue (copy_desc_to_mbuf)
and enqueue (copy_mbuf_to_desc) could share a common iovec copying
algorithm. As the data path is performance oriented, let us stop here.

>
> And actually, it's been fixed in the 6th patch in this series:

Will check that.

>
> [PATCH v2 6/7] vhost: do sanity check for desc->len
>
>   --yliu
>



[dpdk-dev] [PATCH v2 4/7] vhost: do not use rte_memcpy for virtio_hdr copy

2016-03-07 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 


[dpdk-dev] [PATCH v2 0/5] virtio: Tx performance improvements

2016-03-04 Thread Xie, Huawei
On 1/14/2016 9:49 PM, Xie, Huawei wrote:
> On 1/6/2016 8:04 PM, Thomas Monjalon wrote:
>> 2016-01-05 08:10, Xie, Huawei:
>>> On 10/26/2015 10:06 PM, Xie, Huawei wrote:
>>>> On 10/19/2015 1:16 PM, Stephen Hemminger wrote:
>>>>> This is a tested version of the virtio Tx performance improvements
>>>>> that I posted earlier on the list, and described at the DPDK Userspace
>>>>> meeting in Dublin. Together they get a 25% performance improvement for
>>>>> both small packet and large multi-segment packet case when testing
>>>>> from DPDK guest application to Linux KVM host.
>>>>>
>>>>> Stephen Hemminger (5):
>>>>>   virtio: clean up space checks on xmit
>>>>>   virtio: don't use unlikely for normal tx stuff
>>>>>   virtio: use indirect ring elements
>>>>>   virtio: use any layout on transmit
>>>>>   virtio: optimize transmit enqueue
>>>> There is one open why merge-able header is used in tx path. Since old
>>>> implementation is also using the merge-able header in tx path if this
>>>> feature is negotiated, i choose to ack the patch and address this later
>>>> if not now.
>>>>
>>>> Acked-by: Huawei Xie 
>>> Thomas:
>>> This patch isn't in the patchwork. Does Stephen need to send a new one?
>> Yes please, I cannot find them in patchwork.
> ping

Ping.


>



[dpdk-dev] [PATCH] virtio: fix rx ring descriptor starvation

2016-03-04 Thread Xie, Huawei
On 2/23/2016 12:23 AM, Tom Kiely wrote:
> Hi,
> Sorry I missed the last few messages until now. I'm happy with
> just removing the "if". Kyle, when you say you fixed it, do you mean
> that you will push the patch or have already done so ?
>Thanks,
>Tom

Could you please send the patch?



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-03 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> The current rte_vhost_dequeue_burst() implementation is a bit messy
[...]
> +
>  uint16_t
>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>   struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> - struct rte_mbuf *m, *prev;
>   struct vhost_virtqueue *vq;
> - struct vring_desc *desc;
> - uint64_t vb_addr = 0;
> - uint64_t vb_net_hdr_addr = 0;
> - uint32_t head[MAX_PKT_BURST];
> + uint32_t desc_indexes[MAX_PKT_BURST];

indices


>   uint32_t used_idx;
>   uint32_t i;
> - uint16_t free_entries, entry_success = 0;
> + uint16_t free_entries;
>   uint16_t avail_idx;
> - struct virtio_net_hdr *hdr = NULL;
> + struct rte_mbuf *m;
>  
>   if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
>   RTE_LOG(ERR, VHOST_DATA,
> @@ -730,197 +813,49 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
> uint16_t queue_id,
>   return 0;
>  
>   avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
> -
> - /* If there are no available buffers then return. */
> - if (vq->last_used_idx == avail_idx)
> + free_entries = avail_idx - vq->last_used_idx;
> + if (free_entries == 0)
>   return 0;
>  
> - LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
> - dev->device_fh);
> + LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__, dev->device_fh);
>  
> - /* Prefetch available ring to retrieve head indexes. */
> - rte_prefetch0(&vq->avail->ring[vq->last_used_idx & (vq->size - 1)]);
> + used_idx = vq->last_used_idx & (vq->size -1);
>  
> - /*get the number of free entries in the ring*/
> - free_entries = (avail_idx - vq->last_used_idx);
> + /* Prefetch available ring to retrieve head indexes. */
> + rte_prefetch0(&vq->avail->ring[used_idx]);
>  
> - free_entries = RTE_MIN(free_entries, count);
> - /* Limit to MAX_PKT_BURST. */
> - free_entries = RTE_MIN(free_entries, MAX_PKT_BURST);
> + count = RTE_MIN(count, MAX_PKT_BURST);
> + count = RTE_MIN(count, free_entries);
> + LOG_DEBUG(VHOST_DATA, "(%"PRIu64") about to dequeue %u buffers\n",
> + dev->device_fh, count);
>  
> - LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
> - dev->device_fh, free_entries);
>   /* Retrieve all of the head indexes first to avoid caching issues. */
> - for (i = 0; i < free_entries; i++)
> - head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 
> 1)];
> + for (i = 0; i < count; i++) {
> + desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> + (vq->size - 1)];
> + }
>  
>   /* Prefetch descriptor index. */
> - rte_prefetch0(&vq->desc[head[entry_success]]);
> + rte_prefetch0(&vq->desc[desc_indexes[0]]);
>   rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
>  
> - while (entry_success < free_entries) {
> - uint32_t vb_avail, vb_offset;
> - uint32_t seg_avail, seg_offset;
> - uint32_t cpy_len;
> - uint32_t seg_num = 0;
> - struct rte_mbuf *cur;
> - uint8_t alloc_err = 0;
> -
> - desc = &vq->desc[head[entry_success]];
> -
> - vb_net_hdr_addr = gpa_to_vva(dev, desc->addr);
> - hdr = (struct virtio_net_hdr *)((uintptr_t)vb_net_hdr_addr);
> -
> - /* Discard first buffer as it is the virtio header */
> - if (desc->flags & VRING_DESC_F_NEXT) {
> - desc = &vq->desc[desc->next];
> - vb_offset = 0;
> - vb_avail = desc->len;
> - } else {
> - vb_offset = vq->vhost_hlen;
> - vb_avail = desc->len - vb_offset;
> - }
> -
> - /* Buffer address translation. */
> - vb_addr = gpa_to_vva(dev, desc->addr);
> - /* Prefetch buffer address. */
> - rte_prefetch0((void *)(uintptr_t)vb_addr);
> -
> - used_idx = vq->last_used_idx & (vq->size - 1);
> -
> - if (entry_success < (free_entries - 1)) {
> - /* Prefetch descriptor index. */
> - rte_prefetch0(&vq->desc[head[entry_success+1]]);
> - rte_prefetch0(&vq->used->ring[(used_idx + 1) & 
> (vq->size - 1)]);
> - }

Why is this prefetch silently dropped in the patch?
> -
> - /* Update used index buffer information. */
> - vq->used->ring[used_idx].id = head[entry_success];
> - vq->used->ring[used_idx].len = 0;
> -
> - /* Allocate an mbuf and populate the structure. */
> - m = rte_pktmbuf_alloc(mbuf_pool);
> - if (unlikely(m == NULL)) {
> - RTE_LOG(ERR, VHOST_DATA,
> -   

[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-03 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> [...]
CCed Changchun, the author of the chained handling of desc and mbuf.
The change makes the code more readable, but I think the following
commit message is simple enough.
>
>   while (this_desc_is_not_drained_totally || has_next_desc) {
>   if (this_desc_has_drained_totally) {
>   this_desc = next_desc();
>   }
>
>   if (mbuf_has_no_room) {
>   mbuf = allocate_a_new_mbuf();
>   }
>
>   COPY(mbuf, desc);
>   }
>
> [...]
>
> This refactor makes the code much more readable (IMO), yet it reduces
> binary code size (nearly 2K).
I guess the reduced binary code size comes from the reduced number of
inlined calls to mbuf allocation.



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-03 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> + mbuf_avail  = 0;
> + mbuf_offset = 0;
> + while (desc_avail || (desc->flags & VRING_DESC_F_NEXT) != 0) {
> + /* This desc reachs to its end, get the next one */
> + if (desc_avail == 0) {
> + desc = &vq->desc[desc->next];
> +
> + desc_addr = gpa_to_vva(dev, desc->addr);
> + rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> + desc_offset = 0;
> + desc_avail  = desc->len;
> +
> + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> + }
> +
> + /*
> +  * This mbuf reachs to its end, get a new one
> +  * to hold more data.
> +  */
> + if (mbuf_avail == 0) {
> + cur = rte_pktmbuf_alloc(mbuf_pool);
> + if (unlikely(!cur)) {
> + RTE_LOG(ERR, VHOST_DATA, "Failed to "
> + "allocate memory for mbuf.\n");
> + if (head)
> + rte_pktmbuf_free(head);
> + return NULL;
> + }

We could always allocate the head mbuf before the loop; that saves the
following branch and makes the code more streamlined.
It reminds me that this change precludes the possibility of mbuf bulk
allocation; one solution is to pass the head mbuf in via an additional
parameter.
Btw, put unlikely() on the mbuf_avail check and on similar checks elsewhere.

> + if (!head) {
> + head = cur;
> + } else {
> + prev->next = cur;
> + prev->data_len = mbuf_offset;
> + head->nb_segs += 1;
> + }
> + head->pkt_len += mbuf_offset;
> + prev = cur;
> +
> + mbuf_offset = 0;
> + mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
> + }
> +
> + cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> + rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> + (void *)((uintptr_t)(desc_addr + desc_offset)),
> + cpy_len);
> +
> + mbuf_avail  -= cpy_len;
> + mbuf_offset += cpy_len;
> + desc_avail  -= cpy_len;
> + desc_offset += cpy_len;
> + }
> +



[dpdk-dev] [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst

2016-03-03 Thread Xie, Huawei
On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> The current rte_vhost_dequeue_burst() implementation is a bit messy
> and logic twisted. And you could see repeat code here and there: it
> invokes rte_pktmbuf_alloc() three times at three different places!
>
> However, rte_vhost_dequeue_burst() actually does a simple job: copy
> the packet data from vring desc to mbuf. What's tricky here is:
>
> - desc buff could be chained (by desc->next field), so that you need
>   fetch next one if current is wholly drained.
>
> - One mbuf could not be big enough to hold all desc buff, hence you
>   need to chain the mbuf as well, by the mbuf->next field.
>
> Even though, the logic could be simple. Here is the pseudo code.
>
>   while (this_desc_is_not_drained_totally || has_next_desc) {
>   if (this_desc_has_drained_totally) {
>   this_desc = next_desc();
>   }
>
>   if (mbuf_has_no_room) {
>   mbuf = allocate_a_new_mbuf();
>   }
>
>   COPY(mbuf, desc);
>   }
>
> And this is how I refactored rte_vhost_dequeue_burst.
>
> Note that the old patch does a special handling for skipping virtio
> header. However, that could be simply done by adjusting desc_avail
> and desc_offset var:
>
>   desc_avail  = desc->len - vq->vhost_hlen;
>   desc_offset = vq->vhost_hlen;
>
> This refactor makes the code much more readable (IMO), yet it reduces
> binary code size (nearly 2K).
>
> Signed-off-by: Yuanhan Liu 
> ---
>
> v2: - fix potential NULL dereference bug of var "prev" and "head"
> ---
>  lib/librte_vhost/vhost_rxtx.c | 297 
> +-
>  1 file changed, 116 insertions(+), 181 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 5e7e5b1..d5cd0fa 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -702,21 +702,104 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, 
> struct rte_mbuf *m)
>   }
>  }
>  
> +static inline struct rte_mbuf *
> +copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +   uint16_t desc_idx, struct rte_mempool *mbuf_pool)
> +{
> + struct vring_desc *desc;
> + uint64_t desc_addr;
> + uint32_t desc_avail, desc_offset;
> + uint32_t mbuf_avail, mbuf_offset;
> + uint32_t cpy_len;
> + struct rte_mbuf *head = NULL;
> + struct rte_mbuf *cur = NULL, *prev = NULL;
> + struct virtio_net_hdr *hdr;
> +
> + desc = &vq->desc[desc_idx];
> + desc_addr = gpa_to_vva(dev, desc->addr);
> + rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> + /* Retrieve virtio net header */
> + hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
> + desc_avail  = desc->len - vq->vhost_hlen;

There is a serious bug here: desc->len - vq->vhost_hlen could wrap
around when desc->len < vq->vhost_hlen. A VM could easily create this
case. Let us fix it here.

> + desc_offset = vq->vhost_hlen;
> +
> + mbuf_avail  = 0;
> + mbuf_offset = 0;
> + while (desc_avail || (desc->flags & VRING_DESC_F_NEXT) != 0) {
> + /* This desc reachs to its end, get the next one */
> + if (desc_avail == 0) {
> + desc = &vq->desc[desc->next];
> +
> + desc_addr = gpa_to_vva(dev, desc->addr);
> + rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> + desc_offset = 0;
> + desc_avail  = desc->len;
> +
> + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> + }
> +
> + /*
> +  * This mbuf reachs to its end, get a new one
> +  * to hold more data.
> +  */
> + if (mbuf_avail == 0) {
> + cur = rte_pktmbuf_alloc(mbuf_pool);
> + if (unlikely(!cur)) {
> + RTE_LOG(ERR, VHOST_DATA, "Failed to "
> + "allocate memory for mbuf.\n");
> + if (head)
> + rte_pktmbuf_free(head);
> + return NULL;
> + }
> + if (!head) {
> + head = cur;
> + } else {
> + prev->next = cur;
> + prev->data_len = mbuf_offset;
> + head->nb_segs += 1;
> + }
> + head->pkt_len += mbuf_offset;
> + prev = cur;
> +
> + mbuf_offset = 0;
> + mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
> + }
> +
> + cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> + rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> + (void *)((uintptr_t)(desc_addr + desc_offset)),
> + cpy_len);
> +
> +   

[dpdk-dev] [PATCH v4 4/4] virtio: return 1 to tell the upper layer we don't take over this device

2016-03-01 Thread Xie, Huawei
On 3/1/2016 5:57 PM, Thomas Monjalon wrote:
> 2016-03-01 08:39, Xie, Huawei:
>> On 3/1/2016 4:24 PM, Thomas Monjalon wrote:
>>> 2016-03-01 07:53, Xie, Huawei:
>>>> On 3/1/2016 3:18 PM, Thomas Monjalon wrote:
>>>>> 2016-02-26 09:53, Huawei Xie:
>>>>>> @@ -1037,8 +1039,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>>>>>>  
>>>>>>  pci_dev = eth_dev->pci_dev;
>>>>>>  
>>>>>> -if (vtpci_init(pci_dev, hw) < 0)
>>>>>> -return -1;
>>>>>> +ret = vtpci_init(pci_dev, hw);
>>>>>> +if (ret) {
>>>>>> +rte_free(eth_dev->data->mac_addrs);
>>>>> The freeing seems not related to this patch.
>>>> I can send a separate patch, ok within this patchset?
>>> Yes
>>>
>>>>> [...]
>>>>>>  PMD_INIT_LOG(INFO, "trying with legacy virtio pci.");
>>>>>> -if (legacy_virtio_resource_init(dev, hw) < 0)
>>>>>> +if (legacy_virtio_resource_init(dev, hw) < 0) {
>>>>>> +if (dev->kdrv == RTE_KDRV_UNKNOWN) {
>>>>>> +PMD_INIT_LOG(INFO,
>>>>>> +"skip kernel managed virtio device.");
>>>>>> +return 1;
>>>>>> +}
>>>>>>  return -1;
>>>>>> +}
>>>>> You cannot skip a device if it was whitelisted.
>>>>> I think you should check RTE_DEVTYPE_WHITELISTED_PCI and throw an error
>>>>> in this case.
>>>> I feel there is a subtle difference in the understanding of the -w arg. To
>>>> me, without it, probe all devices; with it, only probe whitelisted devices.
>>>> That is all.
>>> I don't know if it is clearly documented indeed.
>>>
>>>> Do you mean that -w implies that devices whitelisted must be probed
>>>> successfully otherwise we throw an error? If i get it right, then what
>>>> about the devices whitelisted but without PMD driver?
>>> Yes we should probably consider the whitelist as a "forced" init.
>>> Later, we could introduce some device flags for probing/discovery:
>>> PROBE_AUTO, PROBE_FORCE, PROBE_IGNORE. It would make white/black list
>>> more precise.
>>>
>>>> I will fix, :).
>>>> if (dev->kdrv == RTE_KDRV_UNKNOWN && dev->devargs->type !=
>>>> RTE_DEVTYPE_WHITELISTED_PCI) {
>>>> 
>>>> return 1;
>>>> }
>>> You should also consider the blacklist case: if there is a blacklist,
>>> the not blacklisted devices must be initialised or throw an error.
>>>
>> Don't we already skip probing the blacklisted device in
>> rte_eal_pci_probe_one_driver?
> Yes, I'm talking about the devices which are not blacklisted.
> Having some blacklisted devices imply that others are implicitly whitelisted.

For the blacklist, it only means the blacklisted device should be
excluded from probing. It doesn't mean all other devices must be probed
either successfully or else throw an error that causes DPDK to exit.
Even for that case, the upper layer should explicitly put the
non-blacklisted devices into the whitelist; I mean we should only deal
with the whitelist here.
>



[dpdk-dev] [PATCH v4 4/4] virtio: return 1 to tell the upper layer we don't take over this device

2016-03-01 Thread Xie, Huawei
On 3/1/2016 4:24 PM, Thomas Monjalon wrote:
> 2016-03-01 07:53, Xie, Huawei:
>> On 3/1/2016 3:18 PM, Thomas Monjalon wrote:
>>> 2016-02-26 09:53, Huawei Xie:
>>>> @@ -1037,8 +1039,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>>>>  
>>>>pci_dev = eth_dev->pci_dev;
>>>>  
>>>> -  if (vtpci_init(pci_dev, hw) < 0)
>>>> -  return -1;
>>>> +  ret = vtpci_init(pci_dev, hw);
>>>> +  if (ret) {
>>>> +  rte_free(eth_dev->data->mac_addrs);
>>> The freeing seems not related to this patch.
>> I can send a separate patch, ok within this patchset?
> Yes
>
>>> [...]
>>>>PMD_INIT_LOG(INFO, "trying with legacy virtio pci.");
>>>> -  if (legacy_virtio_resource_init(dev, hw) < 0)
>>>> +  if (legacy_virtio_resource_init(dev, hw) < 0) {
>>>> +  if (dev->kdrv == RTE_KDRV_UNKNOWN) {
>>>> +  PMD_INIT_LOG(INFO,
>>>> +  "skip kernel managed virtio device.");
>>>> +  return 1;
>>>> +  }
>>>>return -1;
>>>> +  }
>>> You cannot skip a device if it was whitelisted.
>>> I think you should check RTE_DEVTYPE_WHITELISTED_PCI and throw an error
>>> in this case.
>> I feel there is a subtle difference in the understanding of the -w arg. To
>> me, without it, probe all devices; with it, only probe whitelisted devices.
>> That is all.
> I don't know if it is clearly documented indeed.
>
>> Do you mean that -w implies that devices whitelisted must be probed
>> successfully otherwise we throw an error? If i get it right, then what
>> about the devices whitelisted but without PMD driver?
> Yes we should probably consider the whitelist as a "forced" init.
> Later, we could introduce some device flags for probing/discovery:
> PROBE_AUTO, PROBE_FORCE, PROBE_IGNORE. It would make white/black list
> more precise.
>
>> I will fix, :).
>> if (dev->kdrv == RTE_KDRV_UNKNOWN && dev->devargs->type !=
>> RTE_DEVTYPE_WHITELISTED_PCI) {
>> 
>> return 1;
>> }
> You should also consider the blacklist case: if there is a blacklist,
> the not blacklisted devices must be initialised or throw an error.
>
Don't we already skip probing the blacklisted device in
rte_eal_pci_probe_one_driver?



[dpdk-dev] [PATCH v4 4/4] virtio: return 1 to tell the upper layer we don't take over this device

2016-03-01 Thread Xie, Huawei
On 3/1/2016 3:18 PM, Thomas Monjalon wrote:
> Hi Huawei,
>
> 2016-02-26 09:53, Huawei Xie:
>> --- a/drivers/net/virtio/virtio_ethdev.c
>> +++ b/drivers/net/virtio/virtio_ethdev.c
>> @@ -1,4 +1,5 @@
>>  /*-
>> +
> This new line seems useless :)

Sorry, would fix.

>
>>   *   BSD LICENSE
>>   *
> [...]
>> @@ -1037,8 +1039,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
>>  
>>  pci_dev = eth_dev->pci_dev;
>>  
>> -if (vtpci_init(pci_dev, hw) < 0)
>> -return -1;
>> +ret = vtpci_init(pci_dev, hw);
>> +if (ret) {
>> +rte_free(eth_dev->data->mac_addrs);
> The freeing seems not related to this patch.

I can send a separate patch, ok within this patchset?

>
>> +return ret;
>> +}
> [...]
>>  PMD_INIT_LOG(INFO, "trying with legacy virtio pci.");
>> -if (legacy_virtio_resource_init(dev, hw) < 0)
>> +if (legacy_virtio_resource_init(dev, hw) < 0) {
>> +if (dev->kdrv == RTE_KDRV_UNKNOWN) {
>> +PMD_INIT_LOG(INFO,
>> +"skip kernel managed virtio device.");
>> +return 1;
>> +}
>>  return -1;
>> +}
> You cannot skip a device if it was whitelisted.
> I think you should check RTE_DEVTYPE_WHITELISTED_PCI and throw an error
> in this case.

I feel there is a subtle difference in the understanding of the -w arg.
To me: without it, probe all devices; with it, only probe whitelisted
devices. That is all.
Do you mean that -w implies that whitelisted devices must be probed
successfully, otherwise we throw an error? If I get it right, then what
about whitelisted devices that have no PMD driver?

I will fix, :).
if (dev->kdrv == RTE_KDRV_UNKNOWN && dev->devargs->type !=
RTE_DEVTYPE_WHITELISTED_PCI) {

return 1;
}


>
>



[dpdk-dev] [PATCH v4 2/4] eal: set kdrv to RTE_KDRV_NONE if kernel driver isn't managing the device

2016-02-29 Thread Xie, Huawei
On 2/29/2016 4:47 PM, David Marchand wrote:
> On Fri, Feb 26, 2016 at 2:53 AM, Huawei Xie  wrote:
>> v4 changes:
>>  reword the commit message. When we mention kernel driver, emphasizes
>> that it includes UIO/VFIO.
> Annotations should not be part of the commitlog itself.

Do you mean that "rewording the commit message" should not appear in the
commit message itself? Those version changes will not appear in the
commit log when applied, right? I added this note so that reviewers know
I have changed the commit message and don't need to waste their time
reviewing it again. So even if I send a new patch version with only
changes to the commit message, I needn't mention this?
>
>> Use RTE_KDRV_NONE to indicate that kernel driver(including UIO/VFIO)
>> isn't manipulating the device.
> missing space before (

Thomas, could you help change this?

>
>> Signed-off-by: Huawei Xie 
>> Acked-by: Yuanhan Liu 
> Thought I already acked this.
> Anyway,
> Acked-by: David Marchand 
>
>



[dpdk-dev] [PATCH v1] virtio: Use cpuflag for vector api

2016-02-29 Thread Xie, Huawei
On 2/29/2016 12:26 PM, Yuanhan Liu wrote:
> On Fri, Feb 26, 2016 at 02:21:02PM +0530, Santosh Shukla wrote:
>> Check cpuflag macro before using vectored api.
>> -virtio_recv_pkts_vec() uses _sse3__ simd instruction for now so added 
>> cpuflag.
>> - Also wrap other vectored freind api ie..
>> 1) virtqueue_enqueue_recv_refill_simple
>> 2) virtio_rxq_vec_setup
>>
> ...
>> diff --git a/drivers/net/virtio/virtio_rxtx_simple.c 
>> b/drivers/net/virtio/virtio_rxtx_simple.c
>> index 3a1de9d..be51d7c 100644
>> --- a/drivers/net/virtio/virtio_rxtx_simple.c
>> +++ b/drivers/net/virtio/virtio_rxtx_simple.c
> Hmm, why not wrapping the whole file, instead of just few functions?
>
> Or maybe better, do a compile time check at the Makefile, something
> like:
>
> if has_CPUFLAG_xxx
> SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c
> endif
>
>
>   --yliu
>
For the next release, we could consider providing an arch-level
framework for per-architecture optimizations. It is more complicated for
rte_memcpy. For the time being, apart from the small issue, I am OK with
the temporary solution using the CPUFLAG.


[dpdk-dev] [PATCH v4 2/4] eal: set kdrv to RTE_KDRV_NONE if kernel driver isn't managing the device

2016-02-29 Thread Xie, Huawei
On 2/27/2016 1:47 AM, Xie, Huawei wrote:
> Use RTE_KDRV_NONE to indicate that kernel driver(including UIO/VFIO)
> isn't manipulating the device.
Thomas, could you kindly help change manipulating->managing? I have
changed others per Panu's suggestion but missed this.


[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Xie, Huawei
On 2/26/2016 4:56 PM, Olivier MATZ wrote:
>
> On 02/23/2016 06:35 AM, Xie, Huawei wrote:
>>>> Also, it would be nice to have a simple test function in
>>>> app/test/test_mbuf.c. For instance, you could update
>>>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>>>> the mbuf allocation from the function. Then it could be called with
>>>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>>>> all the mbufs of rte_pktmbuf_alloc_bulk().
>> I don't quite get you. Is it that we write two cases, one allocating
>> mbufs through rte_pktmbuf_alloc_bulk and one using rte_pktmbuf_alloc?
>> It is good to have.
> Yes, something like:
>
> test_one_pktmbuf(struct rte_mbuf *m)
> {
>   /* same as before without the allocation/free */
> }
>
> test_pkt_mbuf(void)
> {
>   m = rte_pktmbuf_alloc(pool);
>   test_one_pktmbuf(m);
>   rte_pktmbuf_free(m);
>
>   ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT)
>   for (i = 0; i < BULK_CNT; i++) {
>   m = mtab[i];
>   test_one_pktmbuf(m);
>   rte_pktmbuf_free(m);
>   }
> }

This is to test the functionality.
Let us also have a case like the following, to measure the cycle cost?

cycles_start = rte_get_timer_cycles();
while (rounds--) {
	ret = rte_pktmbuf_alloc_bulk(pool, mtab, BULK_CNT);
	for (i = 0; i < BULK_CNT; i++) {
		m = mtab[i];
		/* some work if needed */
		rte_pktmbuf_free(m);
	}
}
cycles_end = rte_get_timer_cycles();

to compare with

cycles_start = rte_get_timer_cycles();
while (rounds--) {
	for (i = 0; i < BULK_CNT; i++)
		mtab[i] = rte_pktmbuf_alloc(pool);

	for (i = 0; i < BULK_CNT; i++) {
		m = mtab[i];
		/* some work if needed */
		rte_pktmbuf_free(m);
	}
}
cycles_end = rte_get_timer_cycles();


>> I could do this after this patch.
> Yes, please.
>
>
> Thanks,
> Olivier
>



[dpdk-dev] [PATCH v3 0/4] fix the issue that DPDK takes over virtio device blindly

2016-02-26 Thread Xie, Huawei
On 2/26/2016 4:41 PM, David Marchand wrote:
> On Fri, Feb 26, 2016 at 7:09 AM, Xie, Huawei  wrote:
>> On 2/24/2016 8:45 PM, Thomas Monjalon wrote:
>>>> Huawei Xie (4):
>>>>   eal: make the comment more accurate
>>>>   eal: set kdrv to RTE_KDRV_NONE if kernel driver isn't manipulating the 
>>>> device.
>>>>   virtio: return 1 to tell the kernel we don't take over this device
>>>>   virtio: check if kernel driver is manipulating the virtio device
>>> The virtio PCI code has been refactored.
>>> Please Huawei, would it be possible to rebase on master?
>> OK. Since IO port map is moved to EAL layer, it is not straightforward
>> like before for virtio PMD to distinguish the reason why port map fails.
>> We have two choices. Return 1 to the upper layer to say that we don't
>> take over the device for all the map failures or we check the driver
>> type, return -1 for UIO/VFIO driver error, return 1 for kernel driver,
>> which is a bit overelaborate.
> The important thing is to have eal report "none" driver first (your
> 2nd patch) , then in ioport_map, "none" driver will trigger the x86
> special case (see other discussion [1]).
> For "uio" drivers, code (when fixed for uio_pci_generic) already does
> the right stuff.
> "vfio" is handled.
> Anything else should fail once we have the "none" driver correctly reported.
>

in rte_eal_pci_map_device:
case RTE_KDRV_NONE:
#if defined(RTE_ARCH_X86)
ret = pci_ioport_map(dev, bar, p);
#endif
break;
}


in vtpci_init:
if (legacy_virtio_resource_init(dev, hw) < 0) {
if (dev->kdrv == RTE_KDRV_UNKNOWN) {
PMD_INIT_LOG(INFO,
"skip kernel managed virtio device.");
return 1;
}
return -1;
}


in eth_virtio_dev_init:
ret = vtpci_init(pci_dev, hw);
if (ret)
	return ret;
> What did I miss ?
>
>
> [1] http://dpdk.org/ml/archives/dev/2016-February/034035.html
>



[dpdk-dev] virtio PMD is not working with master version

2016-02-26 Thread Xie, Huawei
On 2/26/2016 4:29 PM, David Marchand wrote:
> On Fri, Feb 26, 2016 at 3:23 AM, Yuanhan Liu
>  wrote:
>> Mauricio, thanks for the testing and report.
>>
>> On Thu, Feb 25, 2016 at 02:30:18PM +0100, David Marchand wrote:
>>> >From the logs, I would say I broke uio_pci_generic since we are in
>>> "uio" case, but uio portio sysfs does not exist.
>>> virtio pmd fell back to ioports discovery before my change.
>> Maybe we can do same?
We shouldn't, :). I am now rebasing the patch to fix the issue that the
virtio driver takes over the virtio device blindly.
With the patch:
 if the driver is VFIO/UIO and an error happens, return without falling
back to IO port;
 if no kernel driver is managing the device, try the IO port; otherwise
return 1 to tell the upper layer we don't take over this device.

> I suppose, but see below.
>
>> ---
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_pci.c 
>> b/lib/librte_eal/linuxapp/eal/eal_pci.c
>> index 4346973..579731c 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_pci.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_pci.c
>> @@ -685,12 +685,11 @@ int
>>  rte_eal_pci_ioport_map(struct rte_pci_device *dev, int bar,
>>struct rte_pci_ioport *p)
>>  {
>> -   int ret;
>> +   int ret = -1;
>>
>> switch (dev->kdrv) {
>>  #ifdef VFIO_PRESENT
>> case RTE_KDRV_VFIO:
>> -   ret = -1;
>> if (pci_vfio_is_enabled())
>> ret = pci_vfio_ioport_map(dev, bar, p);
>> break;
>> @@ -700,14 +699,14 @@ rte_eal_pci_ioport_map(struct rte_pci_device *dev, int 
>> bar,
>> ret = pci_uio_ioport_map(dev, bar, p);
>> break;
>> default:
>> +   break;
>> +   }
>> +
>>  #if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_I686)
>> -   /* special case for x86 ... */
>> +   /* special case for x86 ... */
>> +   if (ret)
>> ret = pci_ioport_map(dev, bar, p);
>> -#else
>> -   ret = -1;
>>  #endif
>> -   break;
>> -   }
> What if we are supposed to do vfio here, but for some reason init failed ?
> Next thing, we will call ioport_read in vfio context, but init went
> through the ioports parsing => boom ?
>
> Another issue is that when device is bound to a kernel driver (let's
> say virtio-pci here), then init will succeed and pmd will kick in the
> device registers.
>
> This special case should really be narrowed down to "uio" and "none"
> driver cases.
>
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-26 Thread Xie, Huawei
On 2/24/2016 9:23 PM, Ananyev, Konstantin wrote:
> Hi Panu,
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Panu Matilainen
>> Sent: Wednesday, February 24, 2016 12:12 PM
>> To: Xie, Huawei; Olivier MATZ; dev at dpdk.org
>> Cc: dprovan at bivio.net
>> Subject: Re: [dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk 
>> API
>>
>> On 02/23/2016 07:35 AM, Xie, Huawei wrote:
>>> On 2/22/2016 10:52 PM, Xie, Huawei wrote:
>>>> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>>>>> Hi,
>>>>>
>>>>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>>>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>>>>> the library ABI and should not be listed in the version map.
>>>>>>
>>>>>> I assume its inline for performance reasons, but then you lose the
>>>>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>>>>> it by just updating the library. Since the point of having a bulk API is
>>>>>> to improve performance by reducing the number of calls required, does it
>>>>>> really have to be inline? As in, have you actually measured the
>>>>>> difference between inline and non-inline and decided its worth all the
>>>>>> downsides?
>>>>> Agree with Panu. It would be interesting to compare the performance
>>>>> between inline and non inline to decide whether inlining it or not.
>>>> Will update after i gathered more data. inline could show obvious
>>>> performance difference in some cases.
>>> Panu and Oliver:
>>> I write a simple benchmark. This benchmark run 10M rounds, in each round
>>> 8 mbufs are allocated through bulk API, and then freed.
>>> These are the CPU cycles measured(Intel(R) Xeon(R) CPU E5-2680 0 @
>>> 2.70GHz, CPU isolated, timer interrupt disabled, rcu offloaded).
>>> Btw, i have removed some exceptional data, the frequency of which is
>>> like 1/10. Sometimes observed user usage suddenly disappeared, no clue
>>> what happened.
>>>
>>> With 8 mbufs allocated, there is about 6% performance increase using inline.
>> [...]
>>> With 16 mbufs allocated, we could still observe obvious performance
>>> difference, though only 1%-2%
>>>
>> [...]
>>> With 32/64 mbufs allocated, the deviation of the data itself would hide
>>> the performance difference.
>>> So we prefer using inline for performance.
>> At least I was more after real-world performance in a real-world
>> use-case rather than CPU cycles in a microbenchmark, we know function
>> calls have a cost but the benefits tend to outweight the cons.

It depends on what counts as a real-world case; that could be argued. I
think the case Konstantin mentioned qualifies as a real-world one.
If your preference for real-world use cases over benchmarks is not
specific to this bulk API, then I hold a different opinion. For example,
for kernel virtio optimization, people use the vring bench. We cannot
guarantee that every small optimization brings an obvious performance
gain in some big workload: the gain can be hidden if the bottleneck is
elsewhere, so I also plan to build that kind of virtio bench in DPDK.

Finally, I am open to inline or not, but for now the priority goes to
performance. If we make it a real API now, we cannot easily step back in
the future, whereas the reverse change is easy once we have more
confidence. We could even audit every inline "API" to decide whether it
should stay inline or move into the lib.

>>
>> Inline functions have their place and they're far less evil in project
>> internal use, but in library public API they are BAD and should be ...
>> well, not banned because there are exceptions to every rule, but highly
>> discouraged.
> Why is that?
> As you can see right now we have all mbuf alloc/free routines as static 
> inline.
> And I think we would like to keep it like that.
> So why that particular function should be different?
> After all that function is nothing more than a wrapper 
> around rte_mempool_get_bulk()  unrolled by 4 loop {rte_pktmbuf_reset()}
> So unless mempool get/put API would change, I can hardly see there could be 
> any ABI
> breakages in future. 
> About 'real world' performance gain - it was a 'real world' performance 
> problem,
> that we tried to solve by introducing that function:
> http://dpdk.org/ml/archives/dev/2015-May/017633.html
>
> And according to the user feedback, it does help:  
> http://dpdk.org/ml/archives/dev/2016-February/033203.html
>
> Konstantin
>
>>  - Panu -
>>
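The wrapper structure Konstantin describes above — one bulk get from the pool followed by a per-buffer reset, kept inline so the compiler folds it into the caller — can be sketched with a toy freelist standing in for the real mempool. All names here (`pktmbuf_alloc_bulk_model`, `mbuf_model`) are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the inline bulk-alloc pattern under discussion.
 * The pool below is a fixed freelist, not the real rte_mempool. */
#define POOL_SZ 64
struct mbuf_model { int refcnt; size_t data_len; };

static struct mbuf_model pool_store[POOL_SZ];
static struct mbuf_model *freelist[POOL_SZ];
static size_t free_top;

static void pool_init(void)
{
    for (size_t i = 0; i < POOL_SZ; i++)
        freelist[i] = &pool_store[i];
    free_top = POOL_SZ;
}

/* The bulk call: one grab from the pool, then a per-buffer reset, as
 * rte_pktmbuf_reset() would do. 'static inline' lets the compiler fold
 * the loop into the caller, which is where the measured gain is said
 * to come from. */
static inline int pktmbuf_alloc_bulk_model(struct mbuf_model **m, unsigned n)
{
    if (free_top < n)
        return -1;               /* pool exhausted, nothing allocated */
    for (unsigned i = 0; i < n; i++) {
        m[i] = freelist[--free_top];
        m[i]->refcnt = 1;        /* reset fields for reuse */
        m[i]->data_len = 0;
    }
    return 0;
}
```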



[dpdk-dev] [PATCH v3 0/4] fix the issue that DPDK takes over virtio device blindly

2016-02-26 Thread Xie, Huawei
On 2/24/2016 8:45 PM, Thomas Monjalon wrote:
>> Huawei Xie (4):
>>   eal: make the comment more accurate
>>   eal: set kdrv to RTE_KDRV_NONE if kernel driver isn't manipulating the 
>> device.
>>   virtio: return 1 to tell the kernel we don't take over this device
>>   virtio: check if kernel driver is manipulating the virtio device
> The virtio PCI code has been refactored.
> Please Huawei, would it be possible to rebase on master?

OK. Since the IO port map has been moved to the EAL layer, it is no
longer straightforward for the virtio PMD to distinguish why a port map
fails. We have two choices: return 1 to the upper layer, saying we don't
take over the device, for all map failures; or check the driver type and
return -1 for a UIO/VFIO driver error but 1 for a kernel driver, which
is a bit elaborate.

>



[dpdk-dev] [PATCH RFC 4/4] doc: add note about rte_vhost_enqueue_burst thread safety.

2016-02-25 Thread Xie, Huawei
On 2/24/2016 1:07 PM, Ilya Maximets wrote:
> On 23.02.2016 08:56, Xie, Huawei wrote:
>> On 2/22/2016 6:16 PM, Thomas Monjalon wrote:
>>> 2016-02-22 02:07, Xie, Huawei:
>>>> On 2/19/2016 5:05 PM, Ilya Maximets wrote:
>>>>> On 19.02.2016 11:36, Xie, Huawei wrote:
>>>>>> On 2/19/2016 3:10 PM, Yuanhan Liu wrote:
>>>>>>> On Fri, Feb 19, 2016 at 09:32:43AM +0300, Ilya Maximets wrote:
>>>>>>>> Signed-off-by: Ilya Maximets 
>>>>>>>> ---
>>>>>>>>  doc/guides/prog_guide/thread_safety_dpdk_functions.rst | 1 +
>>>>>>>>  1 file changed, 1 insertion(+)
>>>>>>>>
>>>>>>>> diff --git a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst 
>>>>>>>> b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>>>> index 403e5fc..13a6c89 100644
>>>>>>>> --- a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>>>> +++ b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>>>>  The mempool library is based on the DPDK lockless ring library and 
>>>>>>>> therefore is also multi-thread safe.
>>>>>>>> +rte_vhost_enqueue_burst() is also thread safe because based on 
>>>>>>>> lockless ring-buffer algorithm like the ring library.
>>>>>>> FYI, Huawei meant to make rte_vhost_enqueue_burst() not be thread-safe,
>>>>>>> to align with the usage of rte_eth_tx_burst().
>>>>>>>
>>>>>>> --yliu
>>>>>> I have a patch to remove the lockless enqueue. Unless there is strong
>>>>>> reason, i prefer vhost PMD to behave like other PMDs, with no internal
>>>>>> lockless algorithm. In future, for people who really need it, we could
>>>>>> have dynamic/static switch to enable it.
>>>> Thomas, what is your opinion on this and my patch removing lockless 
>>>> enqueue?
>>> The thread safety behaviour is part of the API specification.
>>> If we want to enable/disable such behaviour, it must be done with an API
>>> function. But it would introduce a conditional statement in the fast path.
>>> That's why the priority must be to keep a simple and consistent behaviour
>>> and try to build around. An API complexity may be considered only if there
>>> is a real (measured) gain.
>> Let us put the gain aside temporarily. I would do the measurement.
>> Vhost is wrapped as a PMD in Tetsuya's patch. And also in DPDK OVS's
>> case, it is wrapped as a vport like all other physical ports. The DPDK
>> app/OVS will treat all ports equally.
> That is not true. Currently vhost in Open vSwitch implemented as a separate
> netdev class. So, to use concurrency of vhost we just need to remove
> 2 lines (rte_spinlock_lock and rte_spinlock_unlock) from function
> __netdev_dpdk_vhost_send(). This will not change behaviour of other types
> of ports.

I checked the OVS implementation. It raises several concerns.
For physical ports, it uses multiple queues to handle concurrent tx.
For vhost ports: a) the thread-safe behavior of vhost isn't used; an
rte_spinlock is taken outside, and yes, it could be removed. b) If a
packet is sent to vhost, it is enqueued to the guest directly without
buffering. We could use a thread-safe ring to queue packets first and
enqueue them to the guest at an appropriate later time; then vhost's
internal lockless path isn't needed.
Besides, IMO the thread-safe implementation adds complexity to the vhost
implementation.
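The buffering idea in (b) can be sketched with a bounded FIFO standing in for a thread-safe rte_ring. This is a serialized, single-threaded model with invented names (`tx_queue`, `txq_push`, `txq_drain`); real code would use the MP/SC ring so producers stay lock-free while a single drainer hands packets to the guest:

```c
#include <assert.h>
#include <stddef.h>

/* Senders push packets into an intermediate FIFO; one drain step later
 * hands them to the guest, so the vhost enqueue path itself needs no
 * internal locking. All names here are hypothetical. */
#define QSZ 8
struct pkt { int id; };

struct tx_queue {
    struct pkt *ring[QSZ];
    size_t head, tail;           /* tail: produce, head: consume */
};

static int txq_push(struct tx_queue *q, struct pkt *p)
{
    if (q->tail - q->head == QSZ)
        return -1;               /* queue full: caller drops or retries */
    q->ring[q->tail++ % QSZ] = p;
    return 0;
}

/* The single consumer drains the FIFO and "enqueues to the guest";
 * returns the number of packets handed over. */
static size_t txq_drain(struct tx_queue *q, struct pkt **out, size_t max)
{
    size_t n = 0;
    while (q->head != q->tail && n < max)
        out[n++] = q->ring[q->head++ % QSZ];
    return n;
}
```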

>> It will add complexity if the app
>> needs to know that some supports concurrency while some not. Since all
>> other PMDs doesn't support thread safety, it doesn't make sense for
>> vhost PMD to support that. I believe the APP will not use that behavior.
>> From the API's point of view, if we previously implemented it wrongly,
>> we need to fix it as early as possible.



[dpdk-dev] [PATCH] virtio: fix rx ring descriptor starvation

2016-02-23 Thread Xie, Huawei
On 2/23/2016 12:23 AM, Tom Kiely wrote:
> Hi,
> Sorry I missed the last few messages until now. I'm happy with
> just removing the "if". Kyle, when you say you fixed it, do you mean
> that you will push the patch or have already done so ?
>Thanks,
>Tom
>
> On 02/18/2016 02:03 PM, Kyle Larose wrote:
>> On Tue, Jan 5, 2016 at 2:13 AM, Xie, Huawei 
>> wrote:
>>> On 12/17/2015 7:18 PM, Tom Kiely wrote:
>>>>
>>>> On 11/25/2015 05:32 PM, Xie, Huawei wrote:
>>>>> On 11/13/2015 5:33 PM, Tom Kiely wrote:
>>>>>> If all rx descriptors are processed while transient
>>>>>> mbuf exhaustion is present, the rx ring ends up with
>>>>>> no available descriptors. Thus no packets are received
>>>>>> on that ring. Since descriptor refill is performed post
>>>>>> rx descriptor processing, in this case no refill is
>>>>>> ever subsequently performed resulting in permanent rx
>>>>>> traffic drop.
>>>>>>
>>>>>> Signed-off-by: Tom Kiely 
>>>>>> ---
>>>>>>drivers/net/virtio/virtio_rxtx.c |6 --
>>>>>>1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/virtio/virtio_rxtx.c
>>>>>> b/drivers/net/virtio/virtio_rxtx.c
>>>>>> index 5770fa2..a95e234 100644
>>>>>> --- a/drivers/net/virtio/virtio_rxtx.c
>>>>>> +++ b/drivers/net/virtio/virtio_rxtx.c
>>>>>> @@ -586,7 +586,8 @@ virtio_recv_pkts(void *rx_queue, struct rte_mbuf
>>>>>> **rx_pkts, uint16_t nb_pkts)
>>>>>>if (likely(num > DESC_PER_CACHELINE))
>>>>>>num = num - ((rxvq->vq_used_cons_idx + num) %
>>>>>> DESC_PER_CACHELINE);
>>>>>>-if (num == 0)
>>>>>> +/* Refill free descriptors even if no pkts recvd */
>>>>>> +if (num == 0 && virtqueue_full(rxvq))
>>>>> Should the return condition be that no used buffers and we have avail
>>>>> descs in avail ring, i.e,
>>>>>   num == 0 && rxvq->vq_free_cnt != rxvq->vq_nentries
>>>>>
>>>>> rather than
>>>>>   num == 0 && rxvq->vq_free_cnt == 0
>>>> Yes we could do that but I don't see a good reason to wait until the
>>>> vq_free_cnt == vq_nentries
>>>> before attempting the refill. The existing code will attempt refill
>>>> even if only 1 packet was received
>>>> and the free count is small. To me it seems safer to extend that to
>>>> try refill even if no packet was received
>>>> but the free count is non-zero.
>>> The existing code attempt to refill only if 1 packet was received.
>>>
>>> If we want to refill even no packet was received, then the strict
>>> condition should be
>>>  num == 0 && rxvq->vq_free_cnt != rxvq->vq_nentries
>>>
>>> The safer condition, what you want to use,  should be
>>>  num == 0 && !virtqueue_full(...)
>>> rather than
>>>  num == 0 && virtqueue_full(...)
>>>
>>> We could simplify things a bit, just remove this check, if the
>>> following
>>> receiving code already takes care of the "num == 0" condition.
>>>
>> FWIW, I fixed this issue myself by just removing the if(num == 0)
>> checks entirely. I didn't see any benefit in short-circuiting a loop
>> which pretty much does nothing anyway when num == 0. Further, we only
>> hit this case when there's no packets to receive, which means there's
>> probably a few cycles to spare. This is even simpler.

Yes, as I said, that is the simplest fix.
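The refill guard debated above can be reduced to a tiny model. Here `vq_free_cnt` counts descriptors the driver holds back from the ring, so `virtqueue_full()` (free count zero) means the ring is fully stocked; the early return on `num == 0` is only safe while the device still owns at least one posted descriptor — the strict condition `vq_free_cnt != vq_nentries`. Names mirror the driver, but the model (`rxq_model`, `may_return_early`) is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the rx refill decision. vq_free_cnt counts
 * descriptors the driver holds and could post back to the avail ring. */
struct rxq_model {
    unsigned vq_nentries;
    unsigned vq_free_cnt;
};

/* In the driver, virtqueue_full() is true when vq_free_cnt == 0,
 * i.e. every descriptor is already posted to the ring. */
static bool virtqueue_full(const struct rxq_model *q)
{
    return q->vq_free_cnt == 0;
}

/* When num == 0, skipping the refill is safe only if the device still
 * owns at least one posted descriptor; if all descriptors have been
 * reclaimed (transient mbuf exhaustion), returning early would leave
 * rx permanently starved. */
static bool may_return_early(const struct rxq_model *q)
{
    return q->vq_nentries - q->vq_free_cnt > 0;
}
```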

>>
>>> I find virtqueue_full is confusing, maybe we could change it to some
>>> other meaningful name.
>>>
>>>> Tom
>>>>
>>>>>>return 0;
>>>>>>  num = virtqueue_dequeue_burst_rx(rxvq, rcv_pkts, len, num);
>>>>>> @@ -683,7 +684,8 @@ virtio_recv_mergeable_pkts(void *rx_queue,
>>>>>>  virtio_rmb();
>>>>>>-if (nb_used == 0)
>>>>>> +/* Refill free descriptors even if no pkts recvd */
>>>>>> +if (nb_used == 0 && virtqueue_full(rxvq))
>>>>>>return 0;
>>>>>>  PMD_RX_LOG(DEBUG, "used:%d\n", nb_used);
>>>>
>
>



[dpdk-dev] [PATCH RFC 4/4] doc: add note about rte_vhost_enqueue_burst thread safety.

2016-02-23 Thread Xie, Huawei
On 2/22/2016 6:16 PM, Thomas Monjalon wrote:
> 2016-02-22 02:07, Xie, Huawei:
>> On 2/19/2016 5:05 PM, Ilya Maximets wrote:
>>> On 19.02.2016 11:36, Xie, Huawei wrote:
>>>> On 2/19/2016 3:10 PM, Yuanhan Liu wrote:
>>>>> On Fri, Feb 19, 2016 at 09:32:43AM +0300, Ilya Maximets wrote:
>>>>>> Signed-off-by: Ilya Maximets 
>>>>>> ---
>>>>>>  doc/guides/prog_guide/thread_safety_dpdk_functions.rst | 1 +
>>>>>>  1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst 
>>>>>> b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>> index 403e5fc..13a6c89 100644
>>>>>> --- a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>> +++ b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>>>>  The mempool library is based on the DPDK lockless ring library and 
>>>>>> therefore is also multi-thread safe.
>>>>>> +rte_vhost_enqueue_burst() is also thread safe because based on lockless 
>>>>>> ring-buffer algorithm like the ring library.
>>>>> FYI, Huawei meant to make rte_vhost_enqueue_burst() not be thread-safe,
>>>>> to align with the usage of rte_eth_tx_burst().
>>>>>
>>>>>   --yliu
>>>> I have a patch to remove the lockless enqueue. Unless there is strong
>>>> reason, i prefer vhost PMD to behave like other PMDs, with no internal
>>>> lockless algorithm. In future, for people who really need it, we could
>>>> have dynamic/static switch to enable it.
>> Thomas, what is your opinion on this and my patch removing lockless enqueue?
> The thread safety behaviour is part of the API specification.
> If we want to enable/disable such behaviour, it must be done with an API
> function. But it would introduce a conditional statement in the fast path.
> That's why the priority must be to keep a simple and consistent behaviour
> and try to build around. An API complexity may be considered only if there
> is a real (measured) gain.

Let us put the gain aside temporarily; I will do the measurement.
Vhost is wrapped as a PMD in Tetsuya's patch, and in DPDK OVS's case it
is wrapped as a vport like all other physical ports. The DPDK app/OVS
will treat all ports equally. It adds complexity if the app needs to
know that some ports support concurrency while others don't. Since no
other PMD supports thread safety, it doesn't make sense for the vhost
PMD to do so; I believe the app will not rely on that behavior.
From the API's point of view, if we previously implemented it wrongly,
we need to fix it as early as possible.

>
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-23 Thread Xie, Huawei
On 2/22/2016 10:52 PM, Xie, Huawei wrote:
> On 2/4/2016 1:24 AM, Olivier MATZ wrote:
>> Hi,
>>
>> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>>> the library ABI and should not be listed in the version map.
>>>
>>> I assume its inline for performance reasons, but then you lose the
>>> benefits of dynamic linking such as ability to fix bugs and/or improve
>>> it by just updating the library. Since the point of having a bulk API is
>>> to improve performance by reducing the number of calls required, does it
>>> really have to be inline? As in, have you actually measured the
>>> difference between inline and non-inline and decided its worth all the
>>> downsides?
>> Agree with Panu. It would be interesting to compare the performance
>> between inline and non inline to decide whether inlining it or not.
> Will update after i gathered more data. inline could show obvious
> performance difference in some cases.

Panu and Oliver:
I wrote a simple benchmark. It runs 10M rounds; in each round 8 mbufs
are allocated through the bulk API and then freed.
These are the CPU cycles measured (Intel(R) Xeon(R) CPU E5-2680 0 @
2.70GHz, CPU isolated, timer interrupt disabled, RCU offloaded).
Btw, I have removed some outlier data, whose frequency was about 1/10;
sometimes the observed user CPU usage suddenly disappeared, no clue
what happened.

With 8 mbufs allocated, there is about 6% performance increase using inline.
inline          non-inline
2780732950309416
2834853696      2951378072
2823015320      2954500888
2825060032      2958939912
2824499804      2898938284
2810859720      2944892796
2852229420      3014273296
2787308500      2956809852
2793337260      2958674900
2834762954346352
2785455184      2925719136
2821528624      2937380416
2822922136      2974978604
2776645920      2947666548
2815952572      2952316900
2801048740      2947366984
2851462672      2946469004

With 16 mbufs allocated, we could still observe obvious performance
difference, though only 1%-2%

inline          non-inline
5519987084      5669902680
5538416096      5737646840
5578934064      5590165532
5548131972      5767926840
5625585696      5831345628
5558282876      5662223764
5445587768      5641003924
5559096320      5775258444
5656437988      5743969272
5440939404      5664882412
5498875968      5785138532
5561652808      5737123940
5515211716      5627775604
5550567140      5630790628
5665964280      5589568164
5591295900      5702697308

With 32/64 mbufs allocated, the deviation of the data itself would hide
the performance difference.

So we prefer using inline for performance.
>> Also, it would be nice to have a simple test function in
>> app/test/test_mbuf.c. For instance, you could update
>> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
>> the mbuf allocation from the function. Then it could be called with
>> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
>> all the mbufs of rte_pktmbuf_alloc_bulk().

I don't quite get you. Do you mean we write two cases, one allocating
mbufs through rte_pktmbuf_alloc_bulk and one using rte_pktmbuf_alloc? It
is good to have; I could do this after this patch.
>>
>> Regards,
>> Olivier
>>
>



[dpdk-dev] [PATCH v6 1/2] mbuf: provide rte_pktmbuf_alloc_bulk API

2016-02-22 Thread Xie, Huawei
On 2/4/2016 1:24 AM, Olivier MATZ wrote:
> Hi,
>
> On 01/27/2016 02:56 PM, Panu Matilainen wrote:
>>
>> Since rte_pktmbuf_alloc_bulk() is an inline function, it is not part of
>> the library ABI and should not be listed in the version map.
>>
>> I assume its inline for performance reasons, but then you lose the
>> benefits of dynamic linking such as ability to fix bugs and/or improve
>> it by just updating the library. Since the point of having a bulk API is
>> to improve performance by reducing the number of calls required, does it
>> really have to be inline? As in, have you actually measured the
>> difference between inline and non-inline and decided its worth all the
>> downsides?
>
> Agree with Panu. It would be interesting to compare the performance
> between inline and non inline to decide whether inlining it or not.

Will update after I have gathered more data; inline can show an obvious
performance difference in some cases.

>
> Also, it would be nice to have a simple test function in
> app/test/test_mbuf.c. For instance, you could update
> test_one_pktmbuf() to take a mbuf pointer as a parameter and remove
> the mbuf allocation from the function. Then it could be called with
> a mbuf allocated with rte_pktmbuf_alloc() (like before) and with
> all the mbufs of rte_pktmbuf_alloc_bulk().
>
> Regards,
> Olivier
>



[dpdk-dev] [PATCH RFC 4/4] doc: add note about rte_vhost_enqueue_burst thread safety.

2016-02-22 Thread Xie, Huawei
On 2/19/2016 5:05 PM, Ilya Maximets wrote:
> On 19.02.2016 11:36, Xie, Huawei wrote:
>> On 2/19/2016 3:10 PM, Yuanhan Liu wrote:
>>> On Fri, Feb 19, 2016 at 09:32:43AM +0300, Ilya Maximets wrote:
>>>> Signed-off-by: Ilya Maximets 
>>>> ---
>>>>  doc/guides/prog_guide/thread_safety_dpdk_functions.rst | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst 
>>>> b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>> index 403e5fc..13a6c89 100644
>>>> --- a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>> +++ b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>>>> @@ -67,6 +67,7 @@ then locking, or some other form of mutual exclusion, is 
>>>> necessary.
>>>>  The ring library is based on a lockless ring-buffer algorithm that 
>>>> maintains its original design for thread safety.
>>>>  Moreover, it provides high performance for either multi- or 
>>>> single-consumer/producer enqueue/dequeue operations.
>>>>  The mempool library is based on the DPDK lockless ring library and 
>>>> therefore is also multi-thread safe.
>>>> +rte_vhost_enqueue_burst() is also thread safe because based on lockless 
>>>> ring-buffer algorithm like the ring library.
>>> FYI, Huawei meant to make rte_vhost_enqueue_burst() not be thread-safe,
>>> to align with the usage of rte_eth_tx_burst().
>>>
>>> --yliu
>> I have a patch to remove the lockless enqueue. Unless there is strong
>> reason, i prefer vhost PMD to behave like other PMDs, with no internal
>> lockless algorithm. In future, for people who really need it, we could
>> have dynamic/static switch to enable it.

Thomas, what is your opinion on this and my patch removing lockless enqueue?

> OK, got it. So, I think, this documentation patch may be dropped.
> Other patches of series still may be merged to fix existing issues and
> keep code in consistent state for the future.
> Am I right?

Yes.

> Best regards, Ilya Maximets.
>



[dpdk-dev] [PATCH v7 2/4] virtio: Introduce config RTE_VIRTIO_INC_VECTOR

2016-02-22 Thread Xie, Huawei
On 2/19/2016 2:42 PM, Yuanhan Liu wrote:
> On Fri, Feb 19, 2016 at 10:16:42AM +0530, Santosh Shukla wrote:
>> On Tue, Feb 16, 2016 at 8:35 AM, Yuanhan Liu
>>  wrote:
>>> On Mon, Feb 15, 2016 at 04:48:36PM +0530, Santosh Shukla wrote:
 Hi Yuanhan,

 On Mon, Feb 15, 2016 at 4:27 PM, Yuanhan Liu
  wrote:
> On Mon, Feb 15, 2016 at 03:22:11PM +0530, Santosh Shukla wrote:
>> Hi Yuanhan,
>>
>> I guess you are back from vacation.
>>
>> Can you pl. review this patch, Except this patch, rest of patches
>> received ack-by:
> I had a quick glimpse of the comments from Thomas: he made a good point.
> I will have a deeper thought tomorrow, to see what I can do to fix it.
>
 I agree to what Thomas pointed out about runtime mode switch (vectored
 vs non-vectored). I have a proposal in my mind and Like to know you
 opinion:

 - need for apis like is_arch_support_vec().

 if (is_arch_support_vec())
  simpple_ = 1 /* Switch code path to vector mode */
 else
  simple_ = 0  /* Switch code path to non-vector mode */

 That api should reside to arch file. i.e.. arch like i686/arm{for
 implementation not exist so say no supported} will return 0 and for
 x86_64 = 1
>>> I was thinking that Thomas meant to something like below (like what
>>> we did at rte_memcpy.h):
>>>
>>> #ifdef RTE_MACHINE_CPUFLAG_SSE (or whatever)
>>>
>>> /* with vec here */
>>>
>>> #else
>>>
>>> /* without vec here */
>>>
>>> #endif
>>>
>>> I mean, you have to bypass the build first; otherwise, you can't
>>> go that further to runtime, right?
>>>
>> I meant: move virtio_recv_pkt_vec() implementation in
>> lib/libeal_rte/xx/include/arch/xx/virtio_vec.h. virtio driver to check
>> for CPUFLAG supported or not and then use _recv_pkt() call back
>> function from arch files. This approach will avoid #ifdef ARCH
>> clutter.
> Moving virtio stuff to eal looks wrong to me.
>
> Huawei, please comment on this.
>
>   --yliu
>

This issue applies not only to the virtio driver but to all other PMDs,
unless they are assumed to run on only one arch. As we are close to the
release, for the time being I prefer using RTE_MACHINE_CPUFLAG_. Later
we can look for a more elegant solution, such as moving the
arch-specific optimizations into an arch directory under the
drivers/net/virtio/ directory. Other thoughts?
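The RTE_MACHINE_CPUFLAG_ approach discussed here (as in rte_memcpy.h) selects the vector or scalar path at build time. A generic sketch of the pattern: `sum_u32` is an invented example, guarded here by the compiler's own `__SSE2__` flag; the real case would be the virtio receive path guarded by `RTE_MACHINE_CPUFLAG_SSE` or similar:

```c
#include <assert.h>
#include <stddef.h>

/* Compile-time dispatch: the vectorized body is only compiled in when
 * the target advertises the needed ISA; otherwise a scalar fallback is
 * used. Both variants compute the same result. */
#if defined(__SSE2__)
#include <emmintrin.h>
static unsigned sum_u32(const unsigned *v, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)           /* 4 lanes per step */
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(v + i)));
    unsigned lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    unsigned s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                   /* scalar tail */
        s += v[i];
    return s;
}
#else
static unsigned sum_u32(const unsigned *v, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif
```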





[dpdk-dev] [PATCH RFC 4/4] doc: add note about rte_vhost_enqueue_burst thread safety.

2016-02-19 Thread Xie, Huawei
On 2/19/2016 3:10 PM, Yuanhan Liu wrote:
> On Fri, Feb 19, 2016 at 09:32:43AM +0300, Ilya Maximets wrote:
>> Signed-off-by: Ilya Maximets 
>> ---
>>  doc/guides/prog_guide/thread_safety_dpdk_functions.rst | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst 
>> b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>> index 403e5fc..13a6c89 100644
>> --- a/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>> +++ b/doc/guides/prog_guide/thread_safety_dpdk_functions.rst
>> @@ -67,6 +67,7 @@ then locking, or some other form of mutual exclusion, is 
>> necessary.
>>  The ring library is based on a lockless ring-buffer algorithm that 
>> maintains its original design for thread safety.
>>  Moreover, it provides high performance for either multi- or 
>> single-consumer/producer enqueue/dequeue operations.
>>  The mempool library is based on the DPDK lockless ring library and 
>> therefore is also multi-thread safe.
>> +rte_vhost_enqueue_burst() is also thread safe because based on lockless 
>> ring-buffer algorithm like the ring library.
> FYI, Huawei meant to make rte_vhost_enqueue_burst() not be thread-safe,
> to align with the usage of rte_eth_tx_burst().
>
>   --yliu

I have a patch to remove the lockless enqueue. Unless there is strong
reason, i prefer vhost PMD to behave like other PMDs, with no internal
lockless algorithm. In future, for people who really need it, we could
have dynamic/static switch to enable it.




[dpdk-dev] [PATCH RFC 2/4] vhost: make buf vector for scatter RX local.

2016-02-19 Thread Xie, Huawei
On 2/19/2016 3:31 PM, Ilya Maximets wrote:
> On 19.02.2016 10:06, Yuanhan Liu wrote:
>> On Fri, Feb 19, 2016 at 09:32:41AM +0300, Ilya Maximets wrote:
>>> Array of buf_vector's is just an array for temporary storing information
>>> about available descriptors. It used only locally in virtio_dev_merge_rx()
>>> and there is no reason for that array to be shared.
>>>
>>> Fix that by allocating local buf_vec inside virtio_dev_merge_rx().
>>>
>>> Signed-off-by: Ilya Maximets 
>>> ---
>>>  lib/librte_vhost/rte_virtio_net.h |  1 -
>>>  lib/librte_vhost/vhost_rxtx.c | 45 
>>> ---
>>>  2 files changed, 23 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h 
>>> b/lib/librte_vhost/rte_virtio_net.h
>>> index 10dcb90..ae1e4fb 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -91,7 +91,6 @@ struct vhost_virtqueue {
>>> int kickfd; /**< Currently unused 
>>> as polling mode is enabled. */
>>> int enabled;
>>> uint64_treserved[16];   /**< Reserve some 
>>> spaces for future extension. */
>>> -   struct buf_vector   buf_vec[BUF_VECTOR_MAX];/**< for 
>>> scatter RX. */
>>>  } __rte_cache_aligned;
>> I like this kind of cleanup, however, it breaks ABI.
> Should I prepare version of this patch with field above marked as
> deprecated and add note to doc/guides/rel_notes/release_16_04.rst
> about future deletion?

Ilya, you could follow the ABI process:
http://dpdk.org/doc/guides/contributing/versioning.html

>
> Best regards, Ilya Maximets.
>


