[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-13 Thread Yang, Zhiyong
Hi, Yuanhan:

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Wednesday, October 12, 2016 11:22 AM
> To: Michael S. Tsirkin ; Thomas Monjalon
> 
> Cc: Wang, Zhihong ; Maxime Coquelin
> ; Stephen Hemminger
> ; dev at dpdk.org; qemu-
> devel at nongnu.org
> Subject: Re: [dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout
> feature
> 
> On Tue, Oct 11, 2016 at 02:57:49PM +0800, Yuanhan Liu wrote:
> > > > > > > There was an example: the vhost enqueue optmization patchset
> > > > > > > from Zhihong [0] uses memset, and it introduces more than
> > > > > > > 15% drop (IIRC)
> 
> Though it doesn't matter now, I have verified it yesterday (with and
> without memset); the drop could be up to 30+%.
> 
> This is to let you know that it could behave badly if memset is not
> inlined.
> 
> > > > > > > on my Ivybridge server: it has no such issue on his server though.
> > > > > > >
> > > > > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > > > > >
> > > > > > >   --yliu
> > > > > >
> > > > > > I'd say that's weird. what's your config? any chance you are
> > > > > > using an old compiler?
> > > > >
> > > > > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more.
> > > > > IIRC, he said the memset is not well optimized for Ivybridge server.
> > > >
> > > > The dst is remote in that case. It's fine on Haswell but has
> > > > complication in Ivy Bridge which (wasn't supposed to but) causes
> serious frontend issue.
> > > >
> > > > I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.
> > >
> > >
> > > So try something like this then:
> >
> > Yes, I saw memset is inlined when this diff is applied.
> 
> I have another concern though: it's a trick that could get gcc to do the
> inlining, and I am not quite sure whether that's true with other compilers
> (e.g. clang, icc, or even older gcc).
> 
> For this case, I think I still prefer some trick like
> *(struct ..*) = {0, }
> 
> Or even, we could introduce rte_memset(). IIRC, that has been
> proposed before?
> 

I'm trying to introduce rte_memset and have a prototype. It has gotten
some performance enhancement for small sizes; I'm optimizing it further.
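
Roughly, the idea is something like the following (an illustrative sketch
only, not the actual prototype; the name suffix and the small-size cut-off
are placeholders):

#include <stddef.h>
#include <string.h>

/* Sketch only: force inlining so that small, compile-time-constant sizes
 * turn into a few plain stores instead of a call into libc. */
static inline __attribute__((always_inline)) void *
rte_memset_sketch(void *dst, int c, size_t n)
{
        if (__builtin_constant_p(n) && n <= 32) {
                unsigned char *d = dst;

                while (n--)
                        *d++ = (unsigned char)c; /* unrolled for constant n */
                return dst;
        }
        return memset(dst, c, n); /* fall back for large/variable sizes */
}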

--Zhiyong

>   --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-12 Thread Yuanhan Liu
On Tue, Oct 11, 2016 at 02:57:49PM +0800, Yuanhan Liu wrote:
> > > > > > There was an example: the vhost enqueue optmization patchset from
> > > > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)

Though it doesn't matter now, I have verified it yesterday (with and
without memset); the drop could be up to 30+%.

This is to let you know that it could behave badly if memset is not
inlined.

> > > > > > on my Ivybridge server: it has no such issue on his server though.
> > > > > >
> > > > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > > > >
> > > > > > --yliu
> > > > >
> > > > > I'd say that's weird. what's your config? any chance you
> > > > > are using an old compiler?
> > > > 
> > > > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > > > he said the memset is not well optimized for Ivybridge server.
> > > 
> > > The dst is remote in that case. It's fine on Haswell but has complication
> > > in Ivy Bridge which (wasn't supposed to but) causes serious frontend 
> > > issue.
> > > 
> > > I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.
> > 
> > 
> > So try something like this then:
> 
> Yes, I saw memset is inlined when this diff is applied.

I have another concern though: it's a trick that could get gcc to do the
inlining, and I am not quite sure whether that's true with other compilers
(e.g. clang, icc, or even older gcc).

For this case, I think I still prefer some trick like
*(struct ..*) = {0, }

Or even, we could introduce rte_memset(). IIRC, that has been
proposed before?

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 07:39:59AM +0300, Michael S. Tsirkin wrote:
> > > > > > 1. why is this memset expensive?
> > > > >
> > > > > I don't have the exact answer, but just some rough thoughts:
> > > > >
> > > > > It's an external clib function: there is a call stack and the
> > > > > IP register will bounch back and forth.
> > > >
> > > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > > 
> > > Good to know!
> > > 
> > > > > overkill to use that for resetting 14 bytes structure.
> > > > >
> > > > > Some trick like
> > > > > *(struct virtio_net_hdr *)hdr = {0, };
> > > > >
> > > > > Or even
> > > > > hdr->xxx = 0;
> > > > > hdr->yyy = 0;
> > > > >
> > > > > should behaviour better.
> > > > >
> > > > > There was an example: the vhost enqueue optmization patchset from
> > > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > > on my Ivybridge server: it has no such issue on his server though.
> > > > >
> > > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > > >
> > > > >   --yliu
> > > >
> > > > I'd say that's weird. what's your config? any chance you
> > > > are using an old compiler?
> > > 
> > > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > > he said the memset is not well optimized for Ivybridge server.
> > 
> > The dst is remote in that case. It's fine on Haswell but has complication
> > in Ivy Bridge which (wasn't supposed to but) causes serious frontend issue.
> > 
> > I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.
> 
> 
> So try something like this then:

Yes, I saw memset is inlined when this diff is applied.

So, do you mind sending a formal patch? You might want to at least try
to build it: it doesn't build as posted.


> Generally pointer chasing in vq->hw->vtnet_hdr_size can't be good
> for performance. Move fields used on data path into vq
> and use from there to avoid indirections?

Good suggestion!

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Tue, Oct 11, 2016 at 08:39:54AM +0200, Maxime Coquelin wrote:
> 
> 
> On 10/11/2016 08:04 AM, Yuanhan Liu wrote:
> >On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:
> >>
> >>
> >>On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
> >>>On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >>>At that time, a packet always use 2 descs. Since indirect desc is
> >>>enabled (by default) now, the assumption is not true then. What's
> >>>worse, it might even slow things a bit down. That should also be
> >>>part of the reason why performance is slightly worse than before.
> >>>
> >>>   --yliu
> >>
> >>I'm not sure I get what you are saying
> >>
> >>>commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >>>Author: Yuanhan Liu 
> >>>Date:   Mon May 2 17:46:17 2016 -0700
> >>>
> >>>  vhost: optimize dequeue for small packets
> >>>
> >>>  A virtio driver normally uses at least 2 desc buffers for Tx: the
> >>>  first for storing the header, and the others for storing the data.
> >>>
> >>>  Therefore, we could fetch the first data desc buf before the main
> >>>  loop, and do the copy first before the check of "are we done yet?".
> >>>  This could save one check for small packets that just have one data
> >>>  desc buffer and need one mbuf to store it.
> >>>
> >>>  Signed-off-by: Yuanhan Liu 
> >>>  Acked-by: Huawei Xie 
> >>>  Tested-by: Rich Lane 
> >>
> >>This fast-paths the 2-descriptors format but it's not active
> >>for indirect descriptors. Is this what you mean?
> >
> >Yes. It's also not active when ANY_LAYOUT is actually turned on.
> >>Should be a simple matter to apply this optimization for indirect.
> >
> >Might be.
> 
> If I understand the code correctly, indirect descs also benefit from this
> optimization, or am I missing something?
> >>>
> >>>Aha..., you are right!
> >>
> >>The interesting thing is that the patch I send on Thursday that removes
> >>header access when no offload has been negotiated[0] seems to reduce
> >>almost to zero the performance seen with indirect descriptors enabled.
> >
> >Didn't follow that.
> >
> >>I see this with 64 bytes packets using testpmd on both ends.
> >>
> >>When I did the patch, I would have expected the same gain with both
> >>modes, whereas I measured +1% for direct and +4% for indirect.
> >
> >IIRC, I did a test before (remove those offload code piece), and the
> >performance was basically the same before and after that. Well, there
> >might be some small difference, say 1% as you said. But the result has
> >never been steady.
> >
> >Anyway, I think your patch is good to have: I just didn't see v2.
> 
> I waited to gather some comments/feedback before sending the v2.
> I'll send it today or tomorrow.

Interesting, so we had a deadlock then: I haven't looked at the code
carefully since you said there would be a v2, thus I was waiting for it.
However, you were waiting for my review. :)

Anyway, I will take time to look at it shortly.

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:
> 
> 
> On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
> >On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >At that time, a packet always use 2 descs. Since indirect desc is
> >enabled (by default) now, the assumption is not true then. What's
> >worse, it might even slow things a bit down. That should also be
> >part of the reason why performance is slightly worse than before.
> >
> > --yliu
> 
> I'm not sure I get what you are saying
> 
> >commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >Author: Yuanhan Liu 
> >Date:   Mon May 2 17:46:17 2016 -0700
> >
> >   vhost: optimize dequeue for small packets
> >
> >   A virtio driver normally uses at least 2 desc buffers for Tx: the
> >   first for storing the header, and the others for storing the data.
> >
> >   Therefore, we could fetch the first data desc buf before the main
> >   loop, and do the copy first before the check of "are we done yet?".
> >   This could save one check for small packets that just have one data
> >   desc buffer and need one mbuf to store it.
> >
> >   Signed-off-by: Yuanhan Liu 
> >   Acked-by: Huawei Xie 
> >   Tested-by: Rich Lane 
> 
> This fast-paths the 2-descriptors format but it's not active
> for indirect descriptors. Is this what you mean?
> >>>
> >>>Yes. It's also not active when ANY_LAYOUT is actually turned on.
> Should be a simple matter to apply this optimization for indirect.
> >>>
> >>>Might be.
> >>
> >>If I understand the code correctly, indirect descs also benefit from this
> >>optimization, or am I missing something?
> >
> >Aha..., you are right!
> 
> The interesting thing is that the patch I send on Thursday that removes
> header access when no offload has been negotiated[0] seems to reduce
> almost to zero the performance seen with indirect descriptors enabled.

Didn't follow that.

> I see this with 64 bytes packets using testpmd on both ends.
> 
> When I did the patch, I would have expected the same gain with both
> modes, whereas I measured +1% for direct and +4% for indirect.

IIRC, I did a test before (removing those offload code pieces), and the
performance was basically the same before and after. Well, there
might be some small difference, say 1% as you said, but the result has
never been steady.

Anyway, I think your patch is good to have; I just didn't see a v2.

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-11 Thread Maxime Coquelin


On 10/11/2016 08:04 AM, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 04:54:39PM +0200, Maxime Coquelin wrote:
>>
>>
>> On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
>>> On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
>>> At that time, a packet always use 2 descs. Since indirect desc is
>>> enabled (by default) now, the assumption is not true then. What's
>>> worse, it might even slow things a bit down. That should also be
>>> part of the reason why performance is slightly worse than before.
>>>
>>> --yliu
>>
>> I'm not sure I get what you are saying
>>
>>> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
>>> Author: Yuanhan Liu 
>>> Date:   Mon May 2 17:46:17 2016 -0700
>>>
>>>   vhost: optimize dequeue for small packets
>>>
>>>   A virtio driver normally uses at least 2 desc buffers for Tx: the
>>>   first for storing the header, and the others for storing the data.
>>>
>>>   Therefore, we could fetch the first data desc buf before the main
>>>   loop, and do the copy first before the check of "are we done yet?".
>>>   This could save one check for small packets that just have one data
>>>   desc buffer and need one mbuf to store it.
>>>
>>>   Signed-off-by: Yuanhan Liu 
>>>   Acked-by: Huawei Xie 
>>>   Tested-by: Rich Lane 
>>
>> This fast-paths the 2-descriptors format but it's not active
>> for indirect descriptors. Is this what you mean?
>
> Yes. It's also not active when ANY_LAYOUT is actually turned on.
>> Should be a simple matter to apply this optimization for indirect.
>
> Might be.

 If I understand the code correctly, indirect descs also benefit from this
 optimization, or am I missing something?
>>>
>>> Aha..., you are right!
>>
>> The interesting thing is that the patch I send on Thursday that removes
>> header access when no offload has been negotiated[0] seems to reduce
>> almost to zero the performance seen with indirect descriptors enabled.
>
> Didn't follow that.
>
>> I see this with 64 bytes packets using testpmd on both ends.
>>
>> When I did the patch, I would have expected the same gain with both
>> modes, whereas I measured +1% for direct and +4% for indirect.
>
> IIRC, I did a test before (remove those offload code piece), and the
> performance was basically the same before and after that. Well, there
> might be some small difference, say 1% as you said. But the result has
> never been steady.
>
> Anyway, I think your patch is good to have: I just didn't see v2.

I waited to gather some comments/feedback before sending the v2.
I'll send it today or tomorrow.

Thanks,
Maxime


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> >>>At that time, a packet always use 2 descs. Since indirect desc is
> >>>enabled (by default) now, the assumption is not true then. What's
> >>>worse, it might even slow things a bit down. That should also be
> >>>part of the reason why performance is slightly worse than before.
> >>>
> >>>   --yliu
> >>
> >>I'm not sure I get what you are saying
> >>
> >>>commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> >>>Author: Yuanhan Liu 
> >>>Date:   Mon May 2 17:46:17 2016 -0700
> >>>
> >>>vhost: optimize dequeue for small packets
> >>>
> >>>A virtio driver normally uses at least 2 desc buffers for Tx: the
> >>>first for storing the header, and the others for storing the data.
> >>>
> >>>Therefore, we could fetch the first data desc buf before the main
> >>>loop, and do the copy first before the check of "are we done yet?".
> >>>This could save one check for small packets that just have one data
> >>>desc buffer and need one mbuf to store it.
> >>>
> >>>Signed-off-by: Yuanhan Liu 
> >>>Acked-by: Huawei Xie 
> >>>Tested-by: Rich Lane 
> >>
> >>This fast-paths the 2-descriptors format but it's not active
> >>for indirect descriptors. Is this what you mean?
> >
> >Yes. It's also not active when ANY_LAYOUT is actually turned on.
> >>Should be a simple matter to apply this optimization for indirect.
> >
> >Might be.
> 
> If I understand the code correctly, indirect descs also benefit from this
> optimization, or am I missing something?

Aha..., you are right!

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Maxime Coquelin


On 10/10/2016 04:42 PM, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 02:40:44PM +0200, Maxime Coquelin wrote:
> At that time, a packet always use 2 descs. Since indirect desc is
> enabled (by default) now, the assumption is not true then. What's
> worse, it might even slow things a bit down. That should also be
> part of the reason why performance is slightly worse than before.
>
>   --yliu

 I'm not sure I get what you are saying

> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> Author: Yuanhan Liu 
> Date:   Mon May 2 17:46:17 2016 -0700
>
>vhost: optimize dequeue for small packets
>
>A virtio driver normally uses at least 2 desc buffers for Tx: the
>first for storing the header, and the others for storing the data.
>
>Therefore, we could fetch the first data desc buf before the main
>loop, and do the copy first before the check of "are we done yet?".
>This could save one check for small packets that just have one data
>desc buffer and need one mbuf to store it.
>
>Signed-off-by: Yuanhan Liu 
>Acked-by: Huawei Xie 
>Tested-by: Rich Lane 

 This fast-paths the 2-descriptors format but it's not active
 for indirect descriptors. Is this what you mean?
>>>
>>> Yes. It's also not active when ANY_LAYOUT is actually turned on.
 Should be a simple matter to apply this optimization for indirect.
>>>
>>> Might be.
>>
>> If I understand the code correctly, indirect descs also benefit from this
>> optimization, or am I missing something?
>
> Aha..., you are right!

The interesting thing is that the patch I sent on Thursday, which removes
the header access when no offload has been negotiated [0], seems to reduce
almost to zero the performance penalty seen with indirect descriptors enabled.
I see this with 64 bytes packets using testpmd on both ends.

When I did the patch, I would have expected the same gain with both
modes, whereas I measured +1% for direct and +4% for indirect.

Maxime
[0]: http://dpdk.org/dev/patchwork/patch/16423/


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Maxime Coquelin


On 10/10/2016 06:22 AM, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
>> On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
>>> On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
>> And the same is done is done in DPDK:
>>
>> static inline int __attribute__((always_inline))
>> copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
>>   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
>>   struct rte_mempool *mbuf_pool)
>> {
>> ...
>> /*
>>  * A virtio driver normally uses at least 2 desc buffers
>>  * for Tx: the first for storing the header, and others
>>  * for storing the data.
>>  */
>> if (likely((desc->len == dev->vhost_hlen) &&
>>(desc->flags & VRING_DESC_F_NEXT) != 0)) {
>> desc = &descs[desc->next];
>> if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
>> return -1;
>>
>> desc_addr = gpa_to_vva(dev, desc->addr);
>> if (unlikely(!desc_addr))
>> return -1;
>>
>> rte_prefetch0((void *)(uintptr_t)desc_addr);
>>
>> desc_offset = 0;
>> desc_avail  = desc->len;
>> nr_desc+= 1;
>>
>> PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
>> } else {
>> desc_avail  = desc->len - dev->vhost_hlen;
>> desc_offset = dev->vhost_hlen;
>> }
>
> Actually, the header is parsed in DPDK vhost implementation.
> But as Virtio PMD provides a zero'ed header, we could just parse
> the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.

 host can always skip the header parse if it wants to.
 It didn't seem worth it to add branches there but
 if I'm wrong, by all means code it up.
>>>
>>> It's added by following commit, which yields about 10% performance
>>> boosts for PVP case (with 64B packet size).
>>>
>>> At that time, a packet always use 2 descs. Since indirect desc is
>>> enabled (by default) now, the assumption is not true then. What's
>>> worse, it might even slow things a bit down. That should also be
>>> part of the reason why performance is slightly worse than before.
>>>
>>> --yliu
>>
>> I'm not sure I get what you are saying
>>
>>> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
>>> Author: Yuanhan Liu 
>>> Date:   Mon May 2 17:46:17 2016 -0700
>>>
>>> vhost: optimize dequeue for small packets
>>>
>>> A virtio driver normally uses at least 2 desc buffers for Tx: the
>>> first for storing the header, and the others for storing the data.
>>>
>>> Therefore, we could fetch the first data desc buf before the main
>>> loop, and do the copy first before the check of "are we done yet?".
>>> This could save one check for small packets that just have one data
>>> desc buffer and need one mbuf to store it.
>>>
>>> Signed-off-by: Yuanhan Liu 
>>> Acked-by: Huawei Xie 
>>> Tested-by: Rich Lane 
>>
>> This fast-paths the 2-descriptors format but it's not active
>> for indirect descriptors. Is this what you mean?
>
> Yes. It's also not active when ANY_LAYOUT is actually turned on.
>> Should be a simple matter to apply this optimization for indirect.
>
> Might be.

If I understand the code correctly, indirect descs also benefit from 
this optimization, or am I missing something?

Maxime


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
> On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> > On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > > And the same is done is done in DPDK:
> > > > > 
> > > > > static inline int __attribute__((always_inline))
> > > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > >   struct rte_mempool *mbuf_pool)
> > > > > {
> > > > > ...
> > > > > /*
> > > > >  * A virtio driver normally uses at least 2 desc buffers
> > > > >  * for Tx: the first for storing the header, and others
> > > > >  * for storing the data.
> > > > >  */
> > > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > desc = &descs[desc->next];
> > > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > return -1;
> > > > > 
> > > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > if (unlikely(!desc_addr))
> > > > > return -1;
> > > > > 
> > > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > > 
> > > > > desc_offset = 0;
> > > > > desc_avail  = desc->len;
> > > > > nr_desc+= 1;
> > > > > 
> > > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > > } else {
> > > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > > desc_offset = dev->vhost_hlen;
> > > > > }
> > > > 
> > > > Actually, the header is parsed in DPDK vhost implementation.
> > > > But as Virtio PMD provides a zero'ed header, we could just parse
> > > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > > 
> > > host can always skip the header parse if it wants to.
> > > It didn't seem worth it to add branches there but
> > > if I'm wrong, by all means code it up.
> > 
> > It's added by following commit, which yields about 10% performance
> > boosts for PVP case (with 64B packet size).
> > 
> > At that time, a packet always use 2 descs. Since indirect desc is
> > enabled (by default) now, the assumption is not true then. What's
> > worse, it might even slow things a bit down. That should also be
> > part of the reason why performance is slightly worse than before.
> > 
> > --yliu
> 
> I'm not sure I get what you are saying
> 
> > commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> > Author: Yuanhan Liu 
> > Date:   Mon May 2 17:46:17 2016 -0700
> > 
> > vhost: optimize dequeue for small packets
> > 
> > A virtio driver normally uses at least 2 desc buffers for Tx: the
> > first for storing the header, and the others for storing the data.
> > 
> > Therefore, we could fetch the first data desc buf before the main
> > loop, and do the copy first before the check of "are we done yet?".
> > This could save one check for small packets that just have one data
> > desc buffer and need one mbuf to store it.
> > 
> > Signed-off-by: Yuanhan Liu 
> > Acked-by: Huawei Xie 
> > Tested-by: Rich Lane 
> 
> This fast-paths the 2-descriptors format but it's not active
> for indirect descriptors. Is this what you mean?

Yes. It's also not active when ANY_LAYOUT is actually turned on.

> Should be a simple matter to apply this optimization for indirect.

Might be.

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > And the same is done is done in DPDK:
> > > 
> > > static inline int __attribute__((always_inline))
> > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > >   struct rte_mempool *mbuf_pool)
> > > {
> > > ...
> > > /*
> > >  * A virtio driver normally uses at least 2 desc buffers
> > >  * for Tx: the first for storing the header, and others
> > >  * for storing the data.
> > >  */
> > > if (likely((desc->len == dev->vhost_hlen) &&
> > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > desc = &descs[desc->next];
> > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > return -1;
> > > 
> > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > if (unlikely(!desc_addr))
> > > return -1;
> > > 
> > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > 
> > > desc_offset = 0;
> > > desc_avail  = desc->len;
> > > nr_desc+= 1;
> > > 
> > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > } else {
> > > desc_avail  = desc->len - dev->vhost_hlen;
> > > desc_offset = dev->vhost_hlen;
> > > }
> > 
> > Actually, the header is parsed in DPDK vhost implementation.
> > But as Virtio PMD provides a zero'ed header, we could just parse
> > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> 
> host can always skip the header parse if it wants to.
> It didn't seem worth it to add branches there but
> if I'm wrong, by all means code it up.

It was added by the following commit, which yields about a 10% performance
boost for the PVP case (with 64B packet size).

At that time, a packet always used 2 descs. Since indirect desc is
enabled (by default) now, the assumption no longer holds. What's
worse, it might even slow things down a bit. That should also be
part of the reason why performance is slightly worse than before.

--yliu

commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
Author: Yuanhan Liu 
Date:   Mon May 2 17:46:17 2016 -0700

vhost: optimize dequeue for small packets

A virtio driver normally uses at least 2 desc buffers for Tx: the
first for storing the header, and the others for storing the data.

Therefore, we could fetch the first data desc buf before the main
loop, and do the copy first before the check of "are we done yet?".
This could save one check for small packets that just have one data
desc buffer and need one mbuf to store it.

Signed-off-by: Yuanhan Liu 
Acked-by: Huawei Xie 
Tested-by: Rich Lane 



[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > 
> > > > 
> > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > Yes but two points.
> > > 
> > > 1. why is this memset expensive?
> > 
> > I don't have the exact answer, but just some rough thoughts:
> > 
> > It's an external clib function: there is a call stack and the
> > IP register will bounch back and forth.
> 
> for memset 0?  gcc 5.3.1 on fedora happily inlines it.

Good to know!

> > overkill to use that for resetting 14 bytes structure.
> > 
> > Some trick like
> > *(struct virtio_net_hdr *)hdr = {0, };
> > 
> > Or even 
> > hdr->xxx = 0;
> > hdr->yyy = 0;
> > 
> > should behaviour better.
> > 
> > There was an example: the vhost enqueue optmization patchset from
> > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > on my Ivybridge server: it has no such issue on his server though.
> > 
> > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > 
> > --yliu
> 
> I'd say that's weird. what's your config? any chance you
> are using an old compiler?

Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
he said the memset is not well optimized for Ivybridge server.

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> >so doing this unconditionally would be a spec violation, but if you see
> >value in this, we can add a feature bit.
> Right it would be a spec violation, so it should be done conditionally.
> If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
> It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
> If negotiated, we wouldn't need to prepend a header.

If we could skip the Tx header, I think we could also skip the Rx header,
in the case when mrg-rx is also turned off?

> From the micro-benchmarks results, we can expect +10% compared to
> indirect descriptors, and + 5% compared to using 2 descs in the
> virtqueue.
> Also, it should have the same benefits as indirect descriptors for 0%
> pkt loss (as we can fill 2x more packets in the virtqueue).
> 
> What do you think?

I would vote for this. It should yield maximum performance for the case
where it's guaranteed that the packet size will always fit in a typical MTU
(1500).
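
For what it's worth, the negotiation side would just be a feature-bit check,
roughly like the sketch below (entirely hypothetical: the name and bit number
are only the proposal being discussed, nothing defined in the spec or in DPDK):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: placeholder bit number for the proposed feature. */
#define VIRTIO_NET_F_NO_TX_HEADER_PROPOSED      63

static inline bool
tx_header_can_be_skipped(uint64_t negotiated_features)
{
        /* Only meaningful if CSUM/GSO features are not negotiated,
         * per the discussion above. */
        return (negotiated_features &
                (1ULL << VIRTIO_NET_F_NO_TX_HEADER_PROPOSED)) != 0;
}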

--yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Yuanhan Liu
On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > 
> > 
> > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> Yes but two points.
> 
> 1. why is this memset expensive?

I don't have the exact answer, but just some rough thoughts:

It's an external C library function: there is a call stack and the
IP register will bounce back and forth. BTW, it's kind of
overkill to use that for resetting a 14-byte structure.

Some trick like
*(struct virtio_net_hdr *)hdr = {0, };

Or even 
hdr->xxx = 0;
hdr->yyy = 0;

should behave better.
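
(In standard C, the first trick would be written as a plain struct assignment
from a compound literal. A minimal sketch, assuming the struct virtio_net_hdr
definition from the Linux uapi header:)

#include <linux/virtio_net.h>   /* struct virtio_net_hdr, VIRTIO_NET_HDR_GSO_NONE */

/* Sketch only: both forms let the compiler emit a handful of inline stores
 * for the small header instead of calling into the C library. */
static inline void
clear_virtio_net_hdr(struct virtio_net_hdr *hdr)
{
        *hdr = (struct virtio_net_hdr){ 0 };    /* compound-literal assignment */
        /* or field by field:
         *   hdr->flags = 0;
         *   hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
         *   hdr->hdr_len = hdr->gso_size = 0;
         *   hdr->csum_start = hdr->csum_offset = 0;
         */
}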

There was an example: the vhost enqueue optimization patchset from
Zhihong [0] uses memset, and it introduces a more than 15% drop (IIRC)
on my Ivybridge server; it has no such issue on his server though.

[0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html

--yliu

> Is the test completely skipping looking
>at the packet otherwise?
> 
> 2. As long as we are doing this, see
>   Alignment vs. Networking
>   
> in Documentation/unaligned-memory-access.txt
> 
> 
> > From the micro-benchmarks results, we can expect +10% compared to
> > indirect descriptors, and + 5% compared to using 2 descs in the
> > virtqueue.
> > Also, it should have the same benefits as indirect descriptors for 0%
> > pkt loss (as we can fill 2x more packets in the virtqueue).
> > 
> > What do you think?
> > 
> > Thanks,
> > Maxime


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin 
> > Cc: Maxime Coquelin ; Stephen Hemminger
> > ; dev at dpdk.org; qemu-
> > devel at nongnu.org; Wang, Zhihong 
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounch back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting 14 bytes structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = {0, };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behaviour better.
> > > >
> > > > There was an example: the vhost enqueue optmization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has complication
> in Ivy Bridge which (wasn't supposed to but) causes serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.


So try something like this then:

Signed-off-by: Michael S. Tsirkin 


diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index dd7693f..7a3f88e 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -292,6 +292,16 @@ vtpci_with_feature(struct virtio_hw *hw, uint64_t bit)
return (hw->guest_features & (1ULL << bit)) != 0;
 }

+static inline int
+vtnet_hdr_size(struct virtio_hw *hw)
+{
+   if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
+   vtpci_with_feature(hw, VIRTIO_F_VERSION_1))
+   return sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   return sizeof(struct virtio_net_hdr);
+}
+
 /*
  * Function declaration from virtio_pci.c
  */
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index a27208e..21a45e1 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -216,7 +216,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
struct vring_desc *start_dp;
uint16_t seg_num = cookie->nb_segs;
uint16_t head_idx, idx;
-   uint16_t head_size = vq->hw->vtnet_hdr_size;
+   uint16_t head_size = vtnet_hdr_size(vq->hw);
unsigned long offs;

head_idx = vq->vq_desc_head_idx;

Generally pointer chasing in vq->hw->vtnet_hdr_size can't be good
for performance. Move fields used on data path into vq
and use from there to avoid indirections?
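
Something along these lines, perhaps (purely illustrative struct and field
names, not the actual virtio PMD layout):

#include <stdint.h>

/* Illustrative only: cache the header size in the per-queue structure at
 * setup time, so the Tx hot path avoids the extra vq->hw dereference. */
struct hw_sketch { uint16_t vtnet_hdr_size; /* ... */ };

struct vq_sketch {
        struct hw_sketch *hw;
        uint16_t hdr_size;      /* cached copy of hw->vtnet_hdr_size */
        /* ... other data-path fields ... */
};

static void
vq_setup_cache(struct vq_sketch *vq)
{
        vq->hdr_size = vq->hw->vtnet_hdr_size;  /* one indirection, done once */
}

/* the data path then reads vq->hdr_size instead of vq->hw->vtnet_hdr_size */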



[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:22:09PM +0800, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> > > On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > > > And the same is done is done in DPDK:
> > > > > > 
> > > > > > static inline int __attribute__((always_inline))
> > > > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > > >   struct rte_mempool *mbuf_pool)
> > > > > > {
> > > > > > ...
> > > > > > /*
> > > > > >  * A virtio driver normally uses at least 2 desc buffers
> > > > > >  * for Tx: the first for storing the header, and others
> > > > > >  * for storing the data.
> > > > > >  */
> > > > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > > desc = &descs[desc->next];
> > > > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > > return -1;
> > > > > > 
> > > > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > > if (unlikely(!desc_addr))
> > > > > > return -1;
> > > > > > 
> > > > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > > > 
> > > > > > desc_offset = 0;
> > > > > > desc_avail  = desc->len;
> > > > > > nr_desc+= 1;
> > > > > > 
> > > > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > > > } else {
> > > > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > > > desc_offset = dev->vhost_hlen;
> > > > > > }
> > > > > 
> > > > > Actually, the header is parsed in DPDK vhost implementation.
> > > > > But as Virtio PMD provides a zero'ed header, we could just parse
> > > > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > > > 
> > > > host can always skip the header parse if it wants to.
> > > > It didn't seem worth it to add branches there but
> > > > if I'm wrong, by all means code it up.
> > > 
> > > It's added by following commit, which yields about 10% performance
> > > boosts for PVP case (with 64B packet size).
> > > 
> > > At that time, a packet always use 2 descs. Since indirect desc is
> > > enabled (by default) now, the assumption is not true then. What's
> > > worse, it might even slow things a bit down. That should also be
> > > part of the reason why performance is slightly worse than before.
> > > 
> > >   --yliu
> > 
> > I'm not sure I get what you are saying
> > 
> > > commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> > > Author: Yuanhan Liu 
> > > Date:   Mon May 2 17:46:17 2016 -0700
> > > 
> > > vhost: optimize dequeue for small packets
> > > 
> > > A virtio driver normally uses at least 2 desc buffers for Tx: the
> > > first for storing the header, and the others for storing the data.
> > > 
> > > Therefore, we could fetch the first data desc buf before the main
> > > loop, and do the copy first before the check of "are we done yet?".
> > > This could save one check for small packets that just have one data
> > > desc buffer and need one mbuf to store it.
> > > 
> > > Signed-off-by: Yuanhan Liu 
> > > Acked-by: Huawei Xie 
> > > Tested-by: Rich Lane 
> > 
> > This fast-paths the 2-descriptors format but it's not active
> > for indirect descriptors. Is this what you mean?
> 
> Yes. It's also not active when ANY_LAYOUT is actually turned on.

It's not needed there though - you only use 1 desc, no point in
fetching the next one.


> > Should be a simple matter to apply this optimization for indirect.
> 
> Might be.
> 
>   --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin 
> > Cc: Maxime Coquelin ; Stephen Hemminger
> > ; dev at dpdk.org; qemu-
> > devel at nongnu.org; Wang, Zhihong 
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounch back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting 14 bytes structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = {0, };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behaviour better.
> > > >
> > > > There was an example: the vhost enqueue optmization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has complication
> in Ivy Bridge which (wasn't supposed to but) causes serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.

I just wrote some test code and compiled with gcc -O2,
it did get inlined.

It's probably this:
uint16_t head_size = vq->hw->vtnet_hdr_size;
...
memset(hdr, 0, head_size);
IOW, head_size is not known to the compiler.

Try sticking a bool there instead of vtnet_hdr_size, and
 memset(hdr, 0, bigheader ? 10 : 12);
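
For illustration, the idea could look like this, with the branch written out
so each size is a literal constant (assuming the usual 10-byte virtio_net_hdr
and 12-byte virtio_net_hdr_mrg_rxbuf sizes; a sketch, not a patch):

#include <stdbool.h>
#include <string.h>

/* Sketch only: with a compile-time-constant size on each branch, gcc can
 * expand the memset into a couple of inline stores. */
static inline void
clear_tx_hdr(void *hdr, bool mrg_rxbuf)
{
        if (mrg_rxbuf)
                memset(hdr, 0, 12);     /* struct virtio_net_hdr_mrg_rxbuf */
        else
                memset(hdr, 0, 10);     /* struct virtio_net_hdr */
}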


> > 
> > --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > And the same is done is done in DPDK:
> > > > 
> > > > static inline int __attribute__((always_inline))
> > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > >   struct rte_mempool *mbuf_pool)
> > > > {
> > > > ...
> > > > /*
> > > >  * A virtio driver normally uses at least 2 desc buffers
> > > >  * for Tx: the first for storing the header, and others
> > > >  * for storing the data.
> > > >  */
> > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > desc = &descs[desc->next];
> > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > return -1;
> > > > 
> > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > if (unlikely(!desc_addr))
> > > > return -1;
> > > > 
> > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > 
> > > > desc_offset = 0;
> > > > desc_avail  = desc->len;
> > > > nr_desc+= 1;
> > > > 
> > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > } else {
> > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > desc_offset = dev->vhost_hlen;
> > > > }
> > > 
> > > Actually, the header is parsed in DPDK vhost implementation.
> > > But as Virtio PMD provides a zero'ed header, we could just parse
> > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > 
> > host can always skip the header parse if it wants to.
> > It didn't seem worth it to add branches there but
> > if I'm wrong, by all means code it up.
> 
> It's added by following commit, which yields about 10% performance
> boosts for PVP case (with 64B packet size).
> 
> At that time, a packet always use 2 descs. Since indirect desc is
> enabled (by default) now, the assumption is not true then. What's
> worse, it might even slow things a bit down. That should also be
> part of the reason why performance is slightly worse than before.
> 
>   --yliu

I'm not sure I get what you are saying

> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> Author: Yuanhan Liu 
> Date:   Mon May 2 17:46:17 2016 -0700
> 
> vhost: optimize dequeue for small packets
> 
> A virtio driver normally uses at least 2 desc buffers for Tx: the
> first for storing the header, and the others for storing the data.
> 
> Therefore, we could fetch the first data desc buf before the main
> loop, and do the copy first before the check of "are we done yet?".
> This could save one check for small packets that just have one data
> desc buffer and need one mbuf to store it.
> 
> Signed-off-by: Yuanhan Liu 
> Acked-by: Huawei Xie 
> Tested-by: Rich Lane 

This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?
Should be a simple matter to apply this optimization for indirect.




[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > 
> > > 
> > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > Yes but two points.
> > 
> > 1. why is this memset expensive?
> 
> I don't have the exact answer, but just some rough thoughts:
> 
> It's an external clib function: there is a call stack and the
> IP register will bounch back and forth.

for memset 0?  gcc 5.3.1 on fedora happily inlines it.

> BTW, It's kind of an
> overkill to use that for resetting 14 bytes structure.
> 
> Some trick like
> *(struct virtio_net_hdr *)hdr = {0, };
> 
> Or even 
> hdr->xxx = 0;
> hdr->yyy = 0;
> 
> should behaviour better.
> 
> There was an example: the vhost enqueue optmization patchset from
> Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> on my Ivybridge server: it has no such issue on his server though.
> 
> [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> 
>   --yliu

I'd say that's weird. what's your config? any chance you
are using an old compiler?


> > Is the test completely skipping looking
> >at the packet otherwise?
> > 
> > 2. As long as we are doing this, see
> > Alignment vs. Networking
> > 
> > in Documentation/unaligned-memory-access.txt
> > 
> > 
> > > From the micro-benchmarks results, we can expect +10% compared to
> > > indirect descriptors, and + 5% compared to using 2 descs in the
> > > virtqueue.
> > > Also, it should have the same benefits as indirect descriptors for 0%
> > > pkt loss (as we can fill 2x more packets in the virtqueue).
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > Maxime


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, October 10, 2016 11:59 AM
> To: Michael S. Tsirkin 
> Cc: Maxime Coquelin ; Stephen Hemminger
> ; dev at dpdk.org; qemu-
> devel at nongnu.org; Wang, Zhihong 
> Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> 
> On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > >
> > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > Yes but two points.
> > > >
> > > > 1. why is this memset expensive?
> > >
> > > I don't have the exact answer, but just some rough thoughts:
> > >
> > > It's an external clib function: there is a call stack and the
> > > IP register will bounch back and forth.
> >
> > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> 
> Good to know!
> 
> > > overkill to use that for resetting 14 bytes structure.
> > >
> > > Some trick like
> > > *(struct virtio_net_hdr *)hdr = {0, };
> > >
> > > Or even
> > > hdr->xxx = 0;
> > > hdr->yyy = 0;
> > >
> > > should behaviour better.
> > >
> > > There was an example: the vhost enqueue optmization patchset from
> > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > on my Ivybridge server: it has no such issue on his server though.
> > >
> > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > >
> > >   --yliu
> >
> > I'd say that's weird. what's your config? any chance you
> > are using an old compiler?
> 
> Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> he said the memset is not well optimized for Ivybridge server.

The dst is remote in that case. It's fine on Haswell but has a complication
on Ivy Bridge which (wasn't supposed to, but) causes a serious frontend issue.

I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.

> 
>   --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-03 Thread Maxime Coquelin


On 09/29/2016 10:21 PM, Michael S. Tsirkin wrote:
> On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
>> >
>> >
>> > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
>>> > > On Thu, Sep 29, 2016 at 05:30:53PM +0200, Maxime Coquelin wrote:
>> > ...
 > > >
 > > > Before enabling anything by default, we should first optimize the 1 
 > > > slot
 > > > case. Indeed, micro-benchmark using testpmd in txonly[0] shows ~17%
 > > > perf regression for 64 bytes case:
 > > >  - 2 descs per packet: 11.6Mpps
 > > >  - 1 desc per packet: 9.6Mpps
 > > >
 > > > This is due to the virtio header clearing in 
 > > > virtqueue_enqueue_xmit().
 > > > Removing it, we get better results than with 2 descs (1.20Mpps).
 > > > Since the Virtio PMD doesn't support offloads, I wonder whether we 
 > > > can
 > > > just drop the memset?
>>> > >
>>> > > What will happen? Will the header be uninitialized?
>> > Yes..
>> > I didn't look closely at the spec, but just looked at DPDK's and Linux
>> > vhost implementations. IIUC, the header is just skipped in the two
>> > implementations.
> In linux guest skbs are initialized AFAIK. See virtio_net_hdr_from_skb
> first thing it does is
> memset(hdr, 0, sizeof(*hdr));
>
>
>
>>> > >
>>> > > The spec says:
>>> > > The driver can send a completely checksummed packet. In this 
>>> > > case, flags
>>> > > will be zero, and gso_type
>>> > > will be VIRTIO_NET_HDR_GSO_NONE.
>>> > >
>>> > > and
>>> > > The driver MUST set num_buffers to zero.
>>> > > If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set 
>>> > > flags to
>>> > > zero and SHOULD supply a fully
>>> > > checksummed packet to the device.
>>> > >
>>> > > and
>>> > > If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have 
>>> > > been
>>> > > negotiated, the driver MUST
>>> > > set gso_type to VIRTIO_NET_HDR_GSO_NONE.
>>> > >
>>> > > so doing this unconditionally would be a spec violation, but if you see
>>> > > value in this, we can add a feature bit.
>> > Right it would be a spec violation, so it should be done conditionally.
>> > If a feature bit is to be added, what about VIRTIO_NET_F_NO_TX_HEADER?
>> > It would imply VIRTIO_NET_F_CSUM not set, and no GSO features set.
>> > If negotiated, we wouldn't need to prepend a header.
> Yes but two points.
>
> 1. why is this memset expensive? Is the test completely skipping looking
>at the packet otherwise?
>
> 2. As long as we are doing this, see
>   Alignment vs. Networking
>   
> in Documentation/unaligned-memory-access.txt

This change will not have an impact on the IP header alignment,
as its offset in the mbuf will not change.

Regards,
Maxime