[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-14 Thread Yuanhan Liu
On Thu, Oct 13, 2016 at 11:23:44AM +0200, Maxime Coquelin wrote:
> I was going to re-run some PVP benchmark with 0% pkt loss, as I had
> some strange results last week.
> 
> Problem is that your series no more apply cleanly due to
> next-virtio's master branch history rewrite.
> Any chance you send me a rebased version so that I can apply the series?

I think it's pointless to do that now: it won't be merged after all.
We have sent out the refactored new series; please help review it if
you have time :)

BTW, apologies that I forgot to include your Reviewed-by for the
first patch. I intended to do that ...

--yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Yuanhan Liu
On Wed, Oct 12, 2016 at 12:22:08PM +, Wang, Zhihong wrote:
> > > >> >  3. How many percentage of drop are you seeing?
> > > The testing result:
> > > size (bytes) improvement (%)
> > > 64   3.92
> > > 128 11.51
> > > 256  24.16
> > > 512  -13.79
> > > 1024-22.51
> > > 1500-12.22
> > > A correction is that performance is dropping if byte size is larger than 
> > > 512.
> > 
> > I have thought of this twice. Unfortunately, I think I may need NACK this
> > series.
> > 
> > Merging two code path into one is really good: as you stated, it improves
> > the maintainability. But only if we see no performance regression on both
> > path after the refactor. Unfortunately, that's not the case here: it hurts
> > the performance for one code path (non-mrg Rx).
> > 
> > That makes me think we may should not do the code path merge at all. I think
> > that also aligns with what you have said before (internally): we could do 
> > the
> > merge if it gives comparable performance before and after that.
> > 
> > Besides that, I don't quite like the way you did in patch 2 (rewrite 
> > enqueue):
> > you made a lot of changes in one patch. That means if something wrong
> > happened,
> > it is hard to narrow down which change introduces that regression. Badly,
> > that's exactly what we met here. Weeks have been passed, I see no progress.
> > 
> > That's the reason we like the idea of "one patch only does one thing, an
> > atomic thing".
> 
> 
> Yuanhan, folks,
> 
> Thanks for the analysis. I disagree here though.
> 
> I analyze, develop, benchmark on x86 platforms, where this patch
> works great.

Yes, that's a great effort! Thanks to your hard work, we know what the
bottleneck is and how it could be improved.

However, you don't have to refactor the code (merge the two code paths
into one) to apply those improvements. From what I can tell, there are
two factors in this patchset that improve the performance:

- copy the hdr together with the packet data

- shadow the used ring updates and flush them at once

The overall performance boost I got with your v6 patchset on the mrg-Rx
code path is about 27% (in the PVP case). I have just applied the 1st
optimization alone, and it yields about a 20% boost. The rest could be
covered if we apply the 2nd optimization (I guess); a rough sketch of
that one follows below.
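
To illustrate the second point, here is a rough sketch of the shadow
used ring idea (illustrative only, not the actual patch; struct and
helper names are made up):

    #include <stdint.h>

    struct used_elem { uint32_t id; uint32_t len; };

    struct shadow_used_ring {
        struct used_elem elems[64];   /* per-burst scratch buffer */
        uint16_t         count;
    };

    /* record one used entry locally instead of writing the real ring */
    static inline void
    shadow_used_ring_add(struct shadow_used_ring *s,
                         uint32_t desc_idx, uint32_t len)
    {
        s->elems[s->count].id  = desc_idx;
        s->elems[s->count].len = len;
        s->count++;
    }

    /* flush the whole burst to the real used ring in one tight loop;
     * ring_size is assumed to be a power of two */
    static inline void
    shadow_used_ring_flush(struct shadow_used_ring *s,
                           struct used_elem *used_ring,
                           uint16_t used_idx, uint16_t ring_size)
    {
        uint16_t i;

        for (i = 0; i < s->count; i++)
            used_ring[(used_idx + i) & (ring_size - 1)] = s->elems[i];
        s->count = 0;
    }

The point is that the guest-visible used ring is then touched once per
burst instead of once per packet, which is where the cache benefit
comes from.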

That would be a clean way to optimize the vhost mergeable Rx path:

- you don't touch the non-mrg Rx path (well, you could still apply the
  shadow_used_ring trick to it as well)

  This would at least make sure we see no such performance regression
  as the one reported by the ARM folks.

- you don't refactor the code

  The rewrite from scratch could introduce other issues besides the
  performance regression. We may just not know about them yet.


Does that make sense to you? If you agree, I think we could still make
it into this release: they would be fairly small changes after all. For
example, below is a patch that applies the 1st optimization on top of
dpdk-next-virtio.

--yliu

---
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8a151af..0ddb5af 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct 
vhost_virtqueue *vq,
uint16_t end_idx, struct rte_mbuf *m,
struct buf_vector *buf_vec)
 {
-   struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
+   struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
uint32_t vec_idx = 0;
uint16_t start_idx = vq->last_used_idx;
uint16_t cur_idx = start_idx;
@@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct 
vhost_virtqueue *vq,
uint32_t desc_offset, desc_avail;
uint32_t cpy_len;
uint16_t desc_idx, used_idx;
+   uint16_t num_buffers = end_idx - start_idx;
+   int hdr_copied = 0;

if (unlikely(m == NULL))
return 0;
@@ -399,16 +401,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, 
struct vhost_virtqueue *vq,
if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
return 0;

-   rte_prefetch0((void *)(uintptr_t)desc_addr);
+   virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr;
+   rte_prefetch0(virtio_hdr);

-   virtio_hdr.num_buffers = end_idx - start_idx;
LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
-   dev->vid, virtio_hdr.num_buffers);
-
-   virtio_enqueue_offload(m, &virtio_hdr.hdr);
-   copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
-   vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
-   PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
+   dev->vid, num_buffers);

desc_avail  = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
desc_offset = 

[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Jianbo Liu
Hi Thomas,

On 12 October 2016 at 23:31, Thomas Monjalon  
wrote:
> Sorry guys, you lost me in the discussion.
>
> Is there some regression only on ARM?
> Does it need some work specifically on memcpy for ARM,

I don't know if there is a common way to improve memcpy on different ARM
hardware. Even if there is, it could take time.
I have tried to do that using NEON (SSE-like) instructions, but without success.
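
Just to show what I mean, a minimal NEON copy loop would look roughly
like this (illustrative only, not the exact code I tried):

    #include <stddef.h>
    #include <stdint.h>
    #include <arm_neon.h>

    /* copy 16 bytes per iteration with NEON, plus a scalar tail */
    static inline void
    neon_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0;

        for (; i + 16 <= n; i += 16)
            vst1q_u8(dst + i, vld1q_u8(src + i));
        for (; i < n; i++)
            dst[i] = src[i];
    }

As said above, this kind of approach did not give me any improvement.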

> or vhost for ARM?
> Who can work on ARM optimization?
>


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Maxime Coquelin


On 10/13/2016 09:54 AM, Maxime Coquelin wrote:
>
>
> On 10/13/2016 08:02 AM, Wang, Zhihong wrote:
>>> > Yes, that's great effort! With your hardwork, we know what the
>>> bottleneck
>>> > is and how it could be improved.
>>> >
>>> > However, you don't have to do code refactor (merge two code path to
>>> one)
>>> > to apply those improvements. From what I know, in this patchset, there
>>> > are two factors could improve the performance:
>>> >
>>> > - copy hdr together with packet data
>>> >
>>> > - shadow used ring update and update at once
>>> >
>>> > The overall performance boost I got with your v6 patchset with mrg-Rx
>>> > code path is about 27% (in PVP case). And I have just applied the 1st
>>> > optimization, it yields about 20% boosts. The left could be covered if
>>> > we apply the 2nd optimization (I guess).
>>> >
>>> > That would be a clean way to optimize vhost mergeable Rx path:
>>> >
>>> > - you don't touch non-mrg Rx path (well, you may could apply the
>>> >   shadow_used_ring trick to it as wel)
>>> >
>>> >   This would at least make sure we will have no such performance
>>> >   regression issue reported by ARM guys.
>>> >
>>> > - you don't refactor the code
>>> >
>>> >   The rewrite from scratch could introduce other issues, besides the
>>> >   performance regression. We may just don't know it yet.
>>> >
>>> >
>>> > Make sense to you? If you agree, I think we could still make it in
>>> > this release: they would be some small changes after all. For example,
>>> > below is the patch applies the 1st optimization tip on top of
>>> > dpdk-next-virtio
>>
>> Thanks for this great idea. I think it's a better way to do it.
>> I'll start to make the patch then.
>>
>>
>
> I personally find having two paths better for maintenance as it is
> easier to understand (IMHO).
> So if we can have the performance gain while keeping the two paths,
> I definitely support the idea.

I was going to re-run some PVP benchmark with 0% pkt loss, as I had
some strange results last week.

The problem is that your series no longer applies cleanly due to the
history rewrite on next-virtio's master branch.
Any chance you could send me a rebased version so that I can apply the series?

Thanks,
Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Maxime Coquelin


On 10/13/2016 08:02 AM, Wang, Zhihong wrote:
>> > Yes, that's great effort! With your hardwork, we know what the bottleneck
>> > is and how it could be improved.
>> >
>> > However, you don't have to do code refactor (merge two code path to one)
>> > to apply those improvements. From what I know, in this patchset, there
>> > are two factors could improve the performance:
>> >
>> > - copy hdr together with packet data
>> >
>> > - shadow used ring update and update at once
>> >
>> > The overall performance boost I got with your v6 patchset with mrg-Rx
>> > code path is about 27% (in PVP case). And I have just applied the 1st
>> > optimization, it yields about 20% boosts. The left could be covered if
>> > we apply the 2nd optimization (I guess).
>> >
>> > That would be a clean way to optimize vhost mergeable Rx path:
>> >
>> > - you don't touch non-mrg Rx path (well, you may could apply the
>> >   shadow_used_ring trick to it as wel)
>> >
>> >   This would at least make sure we will have no such performance
>> >   regression issue reported by ARM guys.
>> >
>> > - you don't refactor the code
>> >
>> >   The rewrite from scratch could introduce other issues, besides the
>> >   performance regression. We may just don't know it yet.
>> >
>> >
>> > Make sense to you? If you agree, I think we could still make it in
>> > this release: they would be some small changes after all. For example,
>> > below is the patch applies the 1st optimization tip on top of
>> > dpdk-next-virtio
>
> Thanks for this great idea. I think it's a better way to do it.
> I'll start to make the patch then.
>
>

I personally find having two paths better for maintenance as it is
easier to understand (IMHO).
So if we can have the performance gain while keeping the two paths,
I definitely support the idea.

Thanks,
Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong 
> Cc: Jianbo Liu ; Thomas Monjalon
> ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Wed, Oct 12, 2016 at 12:22:08PM +, Wang, Zhihong wrote:
> > > > >> >  3. How many percentage of drop are you seeing?
> > > > The testing result:
> > > > size (bytes) improvement (%)
> > > > 64   3.92
> > > > 128 11.51
> > > > 256  24.16
> > > > 512  -13.79
> > > > 1024-22.51
> > > > 1500-12.22
> > > > A correction is that performance is dropping if byte size is larger than
> 512.
> > >
> > > I have thought of this twice. Unfortunately, I think I may need NACK this
> > > series.
> > >
> > > Merging two code path into one is really good: as you stated, it improves
> > > the maintainability. But only if we see no performance regression on both
> > > path after the refactor. Unfortunately, that's not the case here: it hurts
> > > the performance for one code path (non-mrg Rx).
> > >
> > > That makes me think we may should not do the code path merge at all. I
> think
> > > that also aligns with what you have said before (internally): we could do
> the
> > > merge if it gives comparable performance before and after that.
> > >
> > > Besides that, I don't quite like the way you did in patch 2 (rewrite
> enqueue):
> > > you made a lot of changes in one patch. That means if something wrong
> > > happened,
> > > it is hard to narrow down which change introduces that regression. Badly,
> > > that's exactly what we met here. Weeks have been passed, I see no
> progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, benchmark on x86 platforms, where this patch
> > works great.
> 
> Yes, that's great effort! With your hardwork, we know what the bottleneck
> is and how it could be improved.
> 
> However, you don't have to do code refactor (merge two code path to one)
> to apply those improvements. From what I know, in this patchset, there
> are two factors could improve the performance:
> 
> - copy hdr together with packet data
> 
> - shadow used ring update and update at once
> 
> The overall performance boost I got with your v6 patchset with mrg-Rx
> code path is about 27% (in PVP case). And I have just applied the 1st
> optimization, it yields about 20% boosts. The left could be covered if
> we apply the 2nd optimization (I guess).
> 
> That would be a clean way to optimize vhost mergeable Rx path:
> 
> - you don't touch non-mrg Rx path (well, you may could apply the
>   shadow_used_ring trick to it as wel)
> 
>   This would at least make sure we will have no such performance
>   regression issue reported by ARM guys.
> 
> - you don't refactor the code
> 
>   The rewrite from scratch could introduce other issues, besides the
>   performance regression. We may just don't know it yet.
> 
> 
> Make sense to you? If you agree, I think we could still make it in
> this release: they would be some small changes after all. For example,
> below is the patch applies the 1st optimization tip on top of
> dpdk-next-virtio


Thanks for this great idea. I think it's a better way to do it.
I'll start to make the patch then.


> 
>   --yliu
> 
> ---
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
>   uint16_t end_idx, struct rte_mbuf *m,
>   struct buf_vector *buf_vec)
>  {
> - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> + struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>   uint32_t vec_idx = 0;
>   uint16_t start_idx = vq->last_used_idx;
>   uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
>   uint32

[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, October 12, 2016 11:31 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Jianbo Liu
> ; Maxime Coquelin ;
> dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> Sorry guys, you lost me in the discussion.
> 
> Is there some regression only on ARM?

ARM is where we see it; there is no info on PPC yet.

> Does it need some work specifically on memcpy for ARM,
> or vhost for ARM?
> Who can work on ARM optimization?

These are still open questions. Jianbo, who reported this, doesn't
have time for more testing now according to his reply.

I'm trying to do some tests in the hope of identifying the root cause.

> 
> More comments below.
> 
> 2016-10-12 12:22, Wang, Zhihong:
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > > > It's an ARM server.
> > > >
> > > > >> >  3. How many percentage of drop are you seeing?
> > > > The testing result:
> > > > size (bytes) improvement (%)
> > > > 64   3.92
> > > > 128 11.51
> > > > 256  24.16
> > > > 512  -13.79
> > > > 1024-22.51
> > > > 1500-12.22
> > > > A correction is that performance is dropping if byte size is larger than
> 512.
> > >
> > > I have thought of this twice. Unfortunately, I think I may need NACK this
> > > series.
> > >
> > > Merging two code path into one is really good: as you stated, it improves
> > > the maintainability. But only if we see no performance regression on both
> > > path after the refactor. Unfortunately, that's not the case here: it hurts
> > > the performance for one code path (non-mrg Rx).
> 
> +1
> 
> > > That makes me think we may should not do the code path merge at all. I
> think
> > > that also aligns with what you have said before (internally): we could do
> the
> > > merge if it gives comparable performance before and after that.
> > >
> > > Besides that, I don't quite like the way you did in patch 2 (rewrite
> enqueue):
> > > you made a lot of changes in one patch. That means if something wrong
> > > happened,
> > > it is hard to narrow down which change introduces that regression. Badly,
> > > that's exactly what we met here. Weeks have been passed, I see no
> progress.
> 
> +1, it is important to have simple patches making changes step by step.
> 
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, benchmark on x86 platforms, where this patch
> > works great.
> >
> > I've been trying to analyze on ARM too but it takes time and I've
> > had a schedule. Also since the ARM perf issue comes when it's
> > v6 already, I might not be able to make it in time. However
> > that's what I have to do for this patch to be merged in this
> > or the next release.
> >
> > In the meantime, may I suggest we consider the possibility to
> > have dedicated codes for **perf critical paths** for different
> > kinds of architecture?
> 
> Yes that's what we do in several parts of DPDK.
> 
> > It can be hard for a person to have both the knowledge and the
> > development environment for multiple archs at the same time.
> 
> Yes we do not expect you work on ARM.
> So if nobody work on the ARM issue, you could make 2 code paths
> in order to allow your optimization for x86 only.
> But that's not the preferred way.
> And you must split your rework to better identify which part is
> a regression on ARM.
> 
> > Moreover, different optimization techniques might be required for
> > different archs, so it's hard and unnecessary to make a function
> > works for all archs, sometimes it's just not the right thing to do.
> 
> Yes sometimes. Please help us to be convinced for this case.
> 
> > > So I will apply the first patch (it's a bug fixing patch) and ask you to
> > > refactor the rest, without the code path merge.
> > >
> > > I think we could still have a good maintainability code base if we 
> > > introduce
> > > more common helper functions that can be used on both Rx path, or
> even on
> > > Tx path (such as update_used_ring, or shadow_used_ring).
> 
> Yes it is a good step.
> And the code path merge could be reconsidered later.
> 
> > > It's a bit late for too many changes for v16.11. I think you could just
> > > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> > > if that also helps the performance? Let us handle the left in next 
> > > release,
> > > such as shadow used ring.
> 
> Thank you


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-12 Thread Thomas Monjalon
Sorry guys, you lost me in the discussion.

Is there some regression only on ARM?
Does it need some work specifically on memcpy for ARM,
or vhost for ARM?
Who can work on ARM optimization?

More comments below.

2016-10-12 12:22, Wang, Zhihong:
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > > It's an ARM server.
> > >
> > > >> >  3. How many percentage of drop are you seeing?
> > > The testing result:
> > > size (bytes) improvement (%)
> > > 64   3.92
> > > 128 11.51
> > > 256  24.16
> > > 512  -13.79
> > > 1024-22.51
> > > 1500-12.22
> > > A correction is that performance is dropping if byte size is larger than 
> > > 512.
> > 
> > I have thought of this twice. Unfortunately, I think I may need NACK this
> > series.
> > 
> > Merging two code path into one is really good: as you stated, it improves
> > the maintainability. But only if we see no performance regression on both
> > path after the refactor. Unfortunately, that's not the case here: it hurts
> > the performance for one code path (non-mrg Rx).

+1

> > That makes me think we may should not do the code path merge at all. I think
> > that also aligns with what you have said before (internally): we could do 
> > the
> > merge if it gives comparable performance before and after that.
> > 
> > Besides that, I don't quite like the way you did in patch 2 (rewrite 
> > enqueue):
> > you made a lot of changes in one patch. That means if something wrong
> > happened,
> > it is hard to narrow down which change introduces that regression. Badly,
> > that's exactly what we met here. Weeks have been passed, I see no progress.

+1, it is important to have simple patches making changes step by step.

> > That's the reason we like the idea of "one patch only does one thing, an
> > atomic thing".
> 
> 
> Yuanhan, folks,
> 
> Thanks for the analysis. I disagree here though.
> 
> I analyze, develop, benchmark on x86 platforms, where this patch
> works great.
> 
> I've been trying to analyze on ARM too but it takes time and I've
> had a schedule. Also since the ARM perf issue comes when it's
> v6 already, I might not be able to make it in time. However
> that's what I have to do for this patch to be merged in this
> or the next release.
> 
> In the meantime, may I suggest we consider the possibility to
> have dedicated codes for **perf critical paths** for different
> kinds of architecture?

Yes that's what we do in several parts of DPDK.

> It can be hard for a person to have both the knowledge and the
> development environment for multiple archs at the same time.

Yes, we do not expect you to work on ARM.
So if nobody works on the ARM issue, you could make 2 code paths
in order to allow your optimization for x86 only.
But that's not the preferred way.
And you must split your rework to better identify which part causes
the regression on ARM.

> Moreover, different optimization techniques might be required for
> different archs, so it's hard and unnecessary to make a function
> works for all archs, sometimes it's just not the right thing to do.

Yes, sometimes. Please help convince us in this case.

> > So I will apply the first patch (it's a bug fixing patch) and ask you to
> > refactor the rest, without the code path merge.
> > 
> > I think we could still have a good maintainability code base if we introduce
> > more common helper functions that can be used on both Rx path, or even on
> > Tx path (such as update_used_ring, or shadow_used_ring).

Yes it is a good step.
And the code path merge could be reconsidered later.

> > It's a bit late for too many changes for v16.11. I think you could just
> > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> > if that also helps the performance? Let us handle the left in next release,
> > such as shadow used ring.

Thank you


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-12 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, October 12, 2016 10:53 AM
> To: Wang, Zhihong ; Jianbo Liu  linaro.org>
> Cc: Maxime Coquelin ; dev at dpdk.org; Thomas
> Monjalon 
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote:
> > On 22 September 2016 at 10:29, Yuanhan Liu 
> wrote:
> > > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > >> >> > My setup consists of one host running a guest.
> > >> >> > The guest generates as much 64bytes packets as possible using
> > >> >>
> > >> >> Have you tested with other different packet size?
> > >> >> My testing shows that performance is dropping when packet size is more
> > >> >> than 256.
> > >> >
> > >> >
> > >> > Hi Jianbo,
> > >> >
> > >> > Thanks for reporting this.
> > >> >
> > >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > >> >
> > Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >
> > >> >  2. Could you please specify what CPU you're running? Is it Haswell
> > >> > or Ivy Bridge?
> > >> >
> > It's an ARM server.
> >
> > >> >  3. How many percentage of drop are you seeing?
> > The testing result:
> > size (bytes) improvement (%)
> > 64   3.92
> > 128 11.51
> > 256  24.16
> > 512  -13.79
> > 1024-22.51
> > 1500-12.22
> > A correction is that performance is dropping if byte size is larger than 
> > 512.
> 
> I have thought of this twice. Unfortunately, I think I may need NACK this
> series.
> 
> Merging two code path into one is really good: as you stated, it improves
> the maintainability. But only if we see no performance regression on both
> path after the refactor. Unfortunately, that's not the case here: it hurts
> the performance for one code path (non-mrg Rx).
> 
> That makes me think we may should not do the code path merge at all. I think
> that also aligns with what you have said before (internally): we could do the
> merge if it gives comparable performance before and after that.
> 
> Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue):
> you made a lot of changes in one patch. That means if something wrong
> happened,
> it is hard to narrow down which change introduces that regression. Badly,
> that's exactly what we met here. Weeks have been passed, I see no progress.
> 
> That's the reason we like the idea of "one patch only does one thing, an
> atomic thing".


Yuanhan, folks,

Thanks for the analysis. I disagree here, though.

I analyze, develop, and benchmark on x86 platforms, where this patch
works great.

I've been trying to analyze on ARM too, but it takes time and I have a
schedule to keep. Also, since the ARM perf issue only showed up when the
series was already at v6, I might not be able to make it in time. However,
that's what I'll have to do for this patch to be merged in this
or the next release.

In the meantime, may I suggest we consider the possibility of having
dedicated code for **perf-critical paths** on different kinds of
architectures?

It can be hard for one person to have both the knowledge and the
development environment for multiple archs at the same time.

Moreover, different optimization techniques might be required for
different archs, so it's hard and unnecessary to make one function
work for all archs; sometimes it's just not the right thing to do.
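
For example, a dedicated path could be selected at build time with the
existing arch macros, something like the sketch below (illustrative
only; virtio_dev_rx_optimized is a made-up name, not an actual function):

    #ifdef RTE_ARCH_X86_64
        /* x86-only tuned enqueue path */
        count = virtio_dev_rx_optimized(dev, queue_id, pkts, nb_pkts);
    #else
        /* keep the current generic path for other archs */
        count = virtio_dev_rx(dev, queue_id, pkts, nb_pkts);
    #endif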


Thanks
Zhihong


> 
> So I will apply the first patch (it's a bug fixing patch) and ask you to
> refactor the rest, without the code path merge.
> 
> I think we could still have a good maintainability code base if we introduce
> more common helper functions that can be used on both Rx path, or even on
> Tx path (such as update_used_ring, or shadow_used_ring).
> 
> It's a bit late for too many changes for v16.11. I think you could just
> grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> if that also helps the performance? Let us handle the left in next release,
> such as shadow used ring.
> 
> Thanks.
> 
>   --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-12 Thread Yuanhan Liu
On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote:
> On 22 September 2016 at 10:29, Yuanhan Liu  
> wrote:
> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> > My setup consists of one host running a guest.
> >> >> > The guest generates as much 64bytes packets as possible using
> >> >>
> >> >> Have you tested with other different packet size?
> >> >> My testing shows that performance is dropping when packet size is more
> >> >> than 256.
> >> >
> >> >
> >> > Hi Jianbo,
> >> >
> >> > Thanks for reporting this.
> >> >
> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >
> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> 
> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> > or Ivy Bridge?
> >> >
> It's an ARM server.
> 
> >> >  3. How many percentage of drop are you seeing?
> The testing result:
> size (bytes) improvement (%)
> 64   3.92
> 128 11.51
> 256  24.16
> 512  -13.79
> 1024-22.51
> 1500-12.22
> A correction is that performance is dropping if byte size is larger than 512.

I have thought about this twice. Unfortunately, I think I may need to NACK
this series.

Merging the two code paths into one is really good: as you stated, it improves
the maintainability. But only if we see no performance regression on either
path after the refactor. Unfortunately, that's not the case here: it hurts
the performance of one code path (non-mrg Rx).

That makes me think we perhaps should not do the code path merge at all. I think
that also aligns with what you said before (internally): we could do the
merge if it gives comparable performance before and after.

Besides that, I don't quite like the approach you took in patch 2 (rewrite enqueue):
you made a lot of changes in one patch. That means if something goes wrong,
it is hard to narrow down which change introduced the regression. Sadly,
that's exactly what we hit here. Weeks have passed, and I see no progress.

That's the reason we like the idea of "one patch only does one thing, an
atomic thing".

So I will apply the first patch (it's a bug-fix patch) and ask you to
refactor the rest, without the code path merge.

I think we could still have a maintainable code base if we introduce
more common helper functions that can be used on both Rx paths, or even on
the Tx path (such as update_used_ring, or shadow_used_ring).
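
To be clear, I mean helpers as simple as the following rough sketch
(illustrative only, reusing the existing vhost_virtqueue/vring types;
not meant to be the final form):

    /* one used-ring entry update, shared by both Rx paths */
    static inline void
    update_used_ring(struct vhost_virtqueue *vq, uint16_t used_idx,
                     uint32_t desc_idx, uint32_t len)
    {
        vq->used->ring[used_idx].id  = desc_idx;
        vq->used->ring[used_idx].len = len;
    }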

It's a bit late for too many changes in v16.11. I think you could just
apply patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
if that also helps the performance? Let us handle the rest, such as the
shadow used ring, in the next release.

Thanks.

--yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Jianbo Liu
On 10 October 2016 at 14:22, Wang, Zhihong  wrote:
>
>
>> -Original Message-
>> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
>> Sent: Monday, October 10, 2016 1:32 PM
>> To: Yuanhan Liu 
>> Cc: Wang, Zhihong ; Maxime Coquelin
>> ; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 10 October 2016 at 10:44, Yuanhan Liu 
>> wrote:
>> > On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
>> >> > > > Tested with testpmd, host: txonly, guest: rxonly
>> >> > > > size (bytes) improvement (%)
>> >> > > > 64   4.12
>> >> > > > 128   6
>> >> > > > 256   2.65
>> >> > > > 512   -1.12
>> >> > > > 1024 -7.02
>> >> > >
>> >> > > There is a difference between Zhihong's code and the old I spotted in
>> >> > > the first time: Zhihong removed the avail_idx prefetch. I understand
>> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
>> considered;
>> >> > > thus, I didn't comment on that.
>> >> > >
>> >> > > That's one of the difference that, IMO, could drop a regression. I 
>> >> > > then
>> >> > > finally got a chance to add it back.
>> >> > >
>> >> > > A rough test shows it improves the performance of 1400B packet size
>> >> > greatly
>> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number I
>> get
>> >> > > with my test server (Ivybridge).
>> >> >
>> >> > Thanks Yuanhan! I'll validate this on x86.
>> >>
>> >> Hi Yuanhan,
>> >>
>> >> Seems your code doesn't perform correctly. I write a new version
>> >> of avail idx prefetch but didn't see any perf benefit.
>> >>
>> >> To be honest I doubt the benefit of this idea. The previous mrg_off
>> >> code has this method but doesn't give any benefits.
>> >
>> > Good point. I thought of that before, too. But you know that I made it
>> > in rush, that I didn't think further and test more.
>> >
>> > I looked the code a bit closer this time, and spotted a bug: the prefetch
>> > actually didn't happen, due to following code piece:
>> >
>> > if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
>> > prefetch_avail_idx(vq);
>> > ...
>> > }
>> >
>> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
>> > prefetch_avail_idx() will be called. The fix is easy though: just put
>> > prefetch_avail_idx before invoking enqueue_packet.
>> >
>> > In summary, Zhihong is right, I see no more gains with that fix :(
>> >
>> > However, as stated, that's kind of the only difference I found between
>> > yours and the old code, that maybe it's still worthwhile to have a
>> > test on ARM, Jianbo?
>> >
>> I haven't tested it, but I think it could be no improvement for ARM either.
>>
>> A smalll suggestion for enqueue_packet:
>>
>> .
>> +   /* start copy from mbuf to desc */
>> +   while (mbuf_avail || mbuf->next) {
>> .
>>
>> Considering pkt_len is in the first cache line (same as data_len),
>> while next pointer is in the second cache line,
>> is it better to check the total packet len, instead of the last mbuf's
>> next pointer to jump out of while loop and avoid possible cache miss?
>
> Jianbo,
>
> Thanks for the reply!
>
> This idea sounds good, but it won't help the general perf in my
> opinion, since the 2nd cache line is accessed anyway prior in
> virtio_enqueue_offload.
>
Yes, you are right. I'm thinking of prefetching beforehand.
And if it's a chained mbuf, virtio_enqueue_offload will not be called
in the subsequent loop iterations.
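
Something along these lines is what I have in mind (illustrative only):

    /* while copying the current segment, warm up the next one so the
     * later mbuf->next walk does not stall on a cache miss */
    if (m->next != NULL)
        rte_prefetch0(rte_pktmbuf_mtod(m->next, void *));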

> Also this would bring a NULL check when actually access mbuf->next.
>
> BTW, could you please publish the number of:
>
>  1. mrg_rxbuf=on, comparison between original and original + this patch
>
>  2. mrg_rxbuf=off, comparison between original and original + this patch
>
> So we can have a whole picture of how this patch impact on ARM platform.
>
I think you have already got many results from my previous emails.
Sorry, I can't test right now as I'm busy with other things.


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Jianbo Liu
On 10 October 2016 at 10:44, Yuanhan Liu  wrote:
> On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
>> > > > Tested with testpmd, host: txonly, guest: rxonly
>> > > > size (bytes) improvement (%)
>> > > > 64   4.12
>> > > > 128   6
>> > > > 256   2.65
>> > > > 512   -1.12
>> > > > 1024 -7.02
>> > >
>> > > There is a difference between Zhihong's code and the old I spotted in
>> > > the first time: Zhihong removed the avail_idx prefetch. I understand
>> > > the prefetch becomes a bit tricky when mrg-rx code path is considered;
>> > > thus, I didn't comment on that.
>> > >
>> > > That's one of the difference that, IMO, could drop a regression. I then
>> > > finally got a chance to add it back.
>> > >
>> > > A rough test shows it improves the performance of 1400B packet size
>> > greatly
>> > > in the "txonly in host and rxonly in guest" case: +33% is the number I 
>> > > get
>> > > with my test server (Ivybridge).
>> >
>> > Thanks Yuanhan! I'll validate this on x86.
>>
>> Hi Yuanhan,
>>
>> Seems your code doesn't perform correctly. I write a new version
>> of avail idx prefetch but didn't see any perf benefit.
>>
>> To be honest I doubt the benefit of this idea. The previous mrg_off
>> code has this method but doesn't give any benefits.
>
> Good point. I thought of that before, too. But you know that I made it
> in rush, that I didn't think further and test more.
>
> I looked the code a bit closer this time, and spotted a bug: the prefetch
> actually didn't happen, due to following code piece:
>
> if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> prefetch_avail_idx(vq);
> ...
> }
>
> Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> prefetch_avail_idx() will be called. The fix is easy though: just put
> prefetch_avail_idx before invoking enqueue_packet.
>
> In summary, Zhihong is right, I see no more gains with that fix :(
>
> However, as stated, that's kind of the only difference I found between
> yours and the old code, that maybe it's still worthwhile to have a
> test on ARM, Jianbo?
>
I haven't tested it, but I think there might be no improvement on ARM either.

A small suggestion for enqueue_packet:

.
+   /* start copy from mbuf to desc */
+   while (mbuf_avail || mbuf->next) {
.

Considering that pkt_len is in the first cache line (same as data_len),
while the next pointer is in the second cache line,
would it be better to check the total packet length, instead of the last
mbuf's next pointer, to jump out of the while loop and avoid a possible
cache miss?
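
In other words, something roughly like this (illustrative only, a
simplified version of the copy loop):

    uint32_t pkt_left = rte_pktmbuf_pkt_len(m);  /* 1st cache line */

    while (pkt_left) {
        if (mbuf_avail == 0) {                   /* segment consumed */
            m = m->next;                         /* 2nd cache line, touched only here */
            mbuf_avail  = rte_pktmbuf_data_len(m);
            mbuf_offset = 0;
        }
        cpy_len = RTE_MIN(desc_avail, mbuf_avail);
        /* ... copy cpy_len bytes from the mbuf to the descriptor ... */
        mbuf_avail -= cpy_len;
        pkt_left   -= cpy_len;
    }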


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Yuanhan Liu
On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
> > > > Tested with testpmd, host: txonly, guest: rxonly
> > > > size (bytes) improvement (%)
> > > > 64   4.12
> > > > 128   6
> > > > 256   2.65
> > > > 512   -1.12
> > > > 1024 -7.02
> > >
> > > There is a difference between Zhihong's code and the old I spotted in
> > > the first time: Zhihong removed the avail_idx prefetch. I understand
> > > the prefetch becomes a bit tricky when mrg-rx code path is considered;
> > > thus, I didn't comment on that.
> > >
> > > That's one of the difference that, IMO, could drop a regression. I then
> > > finally got a chance to add it back.
> > >
> > > A rough test shows it improves the performance of 1400B packet size
> > greatly
> > > in the "txonly in host and rxonly in guest" case: +33% is the number I get
> > > with my test server (Ivybridge).
> > 
> > Thanks Yuanhan! I'll validate this on x86.
> 
> Hi Yuanhan,
> 
> Seems your code doesn't perform correctly. I write a new version
> of avail idx prefetch but didn't see any perf benefit.
> 
> To be honest I doubt the benefit of this idea. The previous mrg_off
> code has this method but doesn't give any benefits.

Good point. I thought of that before, too. But, you know, I made it
in a rush and didn't think further or test more.

I looked at the code a bit closer this time, and spotted a bug: the prefetch
actually didn't happen, due to the following code piece:

if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
prefetch_avail_idx(vq);
...
}

Since vq->next_avail_idx is set to 0 at the entrance of the enqueue path,
prefetch_avail_idx() will not be called there. The fix is easy though: just put
prefetch_avail_idx() before invoking enqueue_packet().
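
That is, roughly (illustrative only; the surrounding loop is simplified,
the function names are the ones from the patch under discussion):

    for (i = 0; i < count; i++) {
        /* issue the prefetch unconditionally, before each packet */
        prefetch_avail_idx(vq);
        enqueue_packet(dev, vq, pkts[i]);
    }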

In summary, Zhihong is right: I see no more gains with that fix :(

However, as stated, that's about the only difference I found between
your code and the old code, so maybe it's still worthwhile to have a
test on ARM, Jianbo?

--yliu

> Even if this is useful, the benefits should be more significant for
> small packets, it's unlikely this simple idx prefetch could bring
> over 30% perf gain for large packets like 1400B ones.
> 
> But if you really do work it out like that I'll be very glad to see.
> 
> Thanks
> Zhihong
> 
> > 
> > >
> > > I guess this might/would help your case as well. Mind to have a test
> > > and tell me the results?
> > >
> > > BTW, I made it in rush; I haven't tested the mrg-rx code path yet.
> > >
> > > Thanks.
> > >
> > >   --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 14:22, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Monday, October 10, 2016 1:32 PM
> >> To: Yuanhan Liu 
> >> Cc: Wang, Zhihong ; Maxime Coquelin
> >> ; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 10 October 2016 at 10:44, Yuanhan Liu 
> >> wrote:
> >> > On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
> >> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> >> > > > size (bytes) improvement (%)
> >> >> > > > 64   4.12
> >> >> > > > 128   6
> >> >> > > > 256   2.65
> >> >> > > > 512   -1.12
> >> >> > > > 1024 -7.02
> >> >> > >
> >> >> > > There is a difference between Zhihong's code and the old I spotted
> in
> >> >> > > the first time: Zhihong removed the avail_idx prefetch. I
> understand
> >> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> >> considered;
> >> >> > > thus, I didn't comment on that.
> >> >> > >
> >> >> > > That's one of the difference that, IMO, could drop a regression. I
> then
> >> >> > > finally got a chance to add it back.
> >> >> > >
> >> >> > > A rough test shows it improves the performance of 1400B packet
> size
> >> >> > greatly
> >> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number
> I
> >> get
> >> >> > > with my test server (Ivybridge).
> >> >> >
> >> >> > Thanks Yuanhan! I'll validate this on x86.
> >> >>
> >> >> Hi Yuanhan,
> >> >>
> >> >> Seems your code doesn't perform correctly. I write a new version
> >> >> of avail idx prefetch but didn't see any perf benefit.
> >> >>
> >> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> >> code has this method but doesn't give any benefits.
> >> >
> >> > Good point. I thought of that before, too. But you know that I made it
> >> > in rush, that I didn't think further and test more.
> >> >
> >> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> >> > actually didn't happen, due to following code piece:
> >> >
> >> > if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >> > prefetch_avail_idx(vq);
> >> > ...
> >> > }
> >> >
> >> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> >> > prefetch_avail_idx() will be called. The fix is easy though: just put
> >> > prefetch_avail_idx before invoking enqueue_packet.
> >> >
> >> > In summary, Zhihong is right, I see no more gains with that fix :(
> >> >
> >> > However, as stated, that's kind of the only difference I found between
> >> > yours and the old code, that maybe it's still worthwhile to have a
> >> > test on ARM, Jianbo?
> >> >
> >> I haven't tested it, but I think it could be no improvement for ARM either.
> >>
> >> A smalll suggestion for enqueue_packet:
> >>
> >> .
> >> +   /* start copy from mbuf to desc */
> >> +   while (mbuf_avail || mbuf->next) {
> >> .
> >>
> >> Considering pkt_len is in the first cache line (same as data_len),
> >> while next pointer is in the second cache line,
> >> is it better to check the total packet len, instead of the last mbuf's
> >> next pointer to jump out of while loop and avoid possible cache miss?
> >
> > Jianbo,
> >
> > Thanks for the reply!
> >
> > This idea sounds good, but it won't help the general perf in my
> > opinion, since the 2nd cache line is accessed anyway prior in
> > virtio_enqueue_offload.
> >
> Yes, you are right. I'm thinking of prefetching beforehand.
> And if it's a chained mbuf, virtio_enqueue_offload will not be called
> in next loop.
> 
> > Also this would bring a NULL check when actually access mbuf->next.
> >
> > BTW, could you please publish the number of:
> >
> >  1. mrg_rxbuf=on, comparison between original and original + this patch
> >
> >  2. mrg_rxbuf=off, comparison between original and original + this patch
> >
> > So we can have a whole picture of how this patch impact on ARM platform.
> >
> I think you already have got many results in my previous emails.
> Sorry I can't test right now and busy with other things.

We're still missing the mrg_rxbuf=on data.



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 1:32 PM
> To: Yuanhan Liu 
> Cc: Wang, Zhihong ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 10:44, Yuanhan Liu 
> wrote:
> > On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> > > > size (bytes) improvement (%)
> >> > > > 64   4.12
> >> > > > 128   6
> >> > > > 256   2.65
> >> > > > 512   -1.12
> >> > > > 1024 -7.02
> >> > >
> >> > > There is a difference between Zhihong's code and the old I spotted in
> >> > > the first time: Zhihong removed the avail_idx prefetch. I understand
> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> considered;
> >> > > thus, I didn't comment on that.
> >> > >
> >> > > That's one of the difference that, IMO, could drop a regression. I then
> >> > > finally got a chance to add it back.
> >> > >
> >> > > A rough test shows it improves the performance of 1400B packet size
> >> > greatly
> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number I
> get
> >> > > with my test server (Ivybridge).
> >> >
> >> > Thanks Yuanhan! I'll validate this on x86.
> >>
> >> Hi Yuanhan,
> >>
> >> Seems your code doesn't perform correctly. I write a new version
> >> of avail idx prefetch but didn't see any perf benefit.
> >>
> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> code has this method but doesn't give any benefits.
> >
> > Good point. I thought of that before, too. But you know that I made it
> > in rush, that I didn't think further and test more.
> >
> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> > actually didn't happen, due to following code piece:
> >
> > if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> > prefetch_avail_idx(vq);
> > ...
> > }
> >
> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> > prefetch_avail_idx() will be called. The fix is easy though: just put
> > prefetch_avail_idx before invoking enqueue_packet.
> >
> > In summary, Zhihong is right, I see no more gains with that fix :(
> >
> > However, as stated, that's kind of the only difference I found between
> > yours and the old code, that maybe it's still worthwhile to have a
> > test on ARM, Jianbo?
> >
> I haven't tested it, but I think it could be no improvement for ARM either.
> 
> A smalll suggestion for enqueue_packet:
> 
> .
> +   /* start copy from mbuf to desc */
> +   while (mbuf_avail || mbuf->next) {
> .
> 
> Considering pkt_len is in the first cache line (same as data_len),
> while next pointer is in the second cache line,
> is it better to check the total packet len, instead of the last mbuf's
> next pointer to jump out of while loop and avoid possible cache miss?

Jianbo,

Thanks for the reply!

This idea sounds good, but it won't help the general perf in my
opinion, since the 2nd cache line is accessed anyway beforehand in
virtio_enqueue_offload.

Also, this would bring in a NULL check when actually accessing mbuf->next.

BTW, could you please publish the numbers for:

 1. mrg_rxbuf=on, comparison between original and original + this patch

 2. mrg_rxbuf=off, comparison between original and original + this patch

so we can have a whole picture of how this patch impacts the ARM platform.

Thanks
Zhihong



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-09 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, September 28, 2016 12:45 AM
> To: Yuanhan Liu ; Jianbo Liu
> 
> Cc: Maxime Coquelin ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Tuesday, September 27, 2016 6:21 PM
> > To: Jianbo Liu 
> > Cc: Wang, Zhihong ; Maxime Coquelin
> > ; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >
> > On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> > > On 22 September 2016 at 14:58, Wang, Zhihong
> 
> > wrote:
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> > > >> Sent: Thursday, September 22, 2016 1:48 PM
> > > >> To: Yuanhan Liu 
> > > >> Cc: Wang, Zhihong ; Maxime Coquelin
> > > >> ; dev at dpdk.org
> > > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> > > >>
> > > >> On 22 September 2016 at 10:29, Yuanhan Liu
> 
> > > >> wrote:
> > > >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > > >> >> >> > My setup consists of one host running a guest.
> > > >> >> >> > The guest generates as much 64bytes packets as possible
> using
> > > >> >> >>
> > > >> >> >> Have you tested with other different packet size?
> > > >> >> >> My testing shows that performance is dropping when packet
> size is
> > > >> more
> > > >> >> >> than 256.
> > > >> >> >
> > > >> >> >
> > > >> >> > Hi Jianbo,
> > > >> >> >
> > > >> >> > Thanks for reporting this.
> > > >> >> >
> > > >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > > >> >> >
> > > >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> > > >>
> > > >> >> >  2. Could you please specify what CPU you're running? Is it
> Haswell
> > > >> >> > or Ivy Bridge?
> > > >> >> >
> > > >> It's an ARM server.
> > > >>
> > > >> >> >  3. How many percentage of drop are you seeing?
> > > >> The testing result:
> > > >> size (bytes) improvement (%)
> > > >> 64   3.92
> > > >> 128 11.51
> > > >> 256  24.16
> > > >> 512  -13.79
> > > >> 1024-22.51
> > > >> 1500-12.22
> > > >> A correction is that performance is dropping if byte size is larger 
> > > >> than
> 512.
> > > >
> > > >
> > > > Jianbo,
> > > >
> > > > Could you please verify does this patch really cause enqueue perf to
> drop?
> > > >
> > > > You can test the enqueue path only by set guest to do rxonly, and
> compare
> > > > the mpps by show port stats all in the guest.
> > > >
> > > >
> > > Tested with testpmd, host: txonly, guest: rxonly
> > > size (bytes) improvement (%)
> > > 64   4.12
> > > 128   6
> > > 256   2.65
> > > 512   -1.12
> > > 1024 -7.02
> >
> > There is a difference between Zhihong's code and the old I spotted in
> > the first time: Zhihong removed the avail_idx prefetch. I understand
> > the prefetch becomes a bit tricky when mrg-rx code path is considered;
> > thus, I didn't comment on that.
> >
> > That's one of the difference that, IMO, could drop a regression. I then
> > finally got a chance to add it back.
> >
> > A rough test shows it improves the performance of 1400B packet size
> greatly
> > in the "txonly in host and rxonly in guest" case: +33% is the number I get
> > with my test server (Ivybridge).
> 
> Thanks Yuanhan! I'll validate this on x86.

Hi Yuanhan,

It seems your code doesn't perform correctly. I wrote a new version
of the avail idx prefetch but didn't see any perf benefit.

To be honest, I doubt the benefit of this idea. The previous mrg_off
code had this method but didn't give any benefits.
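
For reference, what I mean by an avail idx prefetch is basically the
following (illustrative only, not my actual version):

    /* warm up the cache line holding avail->idx before it is read */
    rte_prefetch0(&vq->avail->idx);
    /* ... other per-packet work ... */
    avail_idx = *((volatile uint16_t *)&vq->avail->idx);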

Even if this is useful, the benefits should be more significant for
small packets; it's unlikely this simple idx prefetch could bring
over a 30% perf gain for large packets like 1400B ones.

But if you really do work it out like that, I'll be very glad to see it.

Thanks
Zhihong

> 
> >
> > I guess this might/would help your case as well. Mind to have a test
> > and tell me the results?
> >
> > BTW, I made it in rush; I haven't tested the mrg-rx code path yet.
> >
> > Thanks.
> >
> > --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-24 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Monday, August 22, 2016 6:35 PM
> To: Maxime Coquelin ; Wang, Zhihong
> ; yuanhan.liu at linux.intel.com
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-08-22 12:01, Maxime Coquelin:
> > I forgot to add that before this series, I think we should first fix the
> windows bug.
> > Else we will need a dedicated fix for the stable branch.
> 
> This is a funny situation :)
> If Zhihong had reworked the code without mentioning it is fixing a scenario
> with Windows guests, maybe that nobody would have notice ;) That's
> probably why it is not written in v2/v3. But thanks to the v1, we all know it:
>   "It also fixes the issue working with Windows VMs."

I thought it'd be more appropriate to send a dedicated fix for the stable branch.
So I removed this info.

> 
> So yes, it would be a lot better to find the root cause and try to have a
> minimal fix for 16.07, then rework the code for performance in 16.11.
> I think we must avoid silent fixes, and even more, avoid writing specific 
> fixes
> for stable branches without validating them in the master branch and its large
> users base.

Okay, that's also what Maxime and Yuanhan suggest.

BTW, the root cause has been identified and the fix will be in v4.

> 
> Thanks for your good works guys, DPDK vhost is improving very well.


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Yuanhan Liu
On Tue, Aug 23, 2016 at 10:43:36AM +, Wang, Zhihong wrote:
> > > I forgot to add that before this series, I think we should first fix the 
> > > windows
> > bug.
> > > Else we will need a dedicated fix for the stable branch.
> > 
> > Okay I'll try to fix it, though I can't make any promises at present.
> > 
> > Have tried once but stopped since we don't have enough debug info from the
> > frontend side so basically I was debugging the backend based on guesses.
> 
> Hi Maxime, Yuanhan,
> 
> I've identified the root cause, do you think it makes sense to put the fix
> in the same patch set? Or send it as a separated patch?

Great!

Yes, it's okay to put it in the patch set (normally, as the first patch,
before the rewrite).

Please also add the following line before your Signed-off-by in the commit
log:

Cc: 

--yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Maxime Coquelin


On 08/23/2016 12:43 PM, Wang, Zhihong wrote:
>
>
>> -Original Message-
>> From: Wang, Zhihong
>> Sent: Tuesday, August 23, 2016 10:31 AM
>> To: Maxime Coquelin ; dev at dpdk.org
>> Cc: yuanhan.liu at linux.intel.com
>> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
>>
>>
>>
>>> -Original Message-
>>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
>>> Sent: Monday, August 22, 2016 6:02 PM
>>> To: Wang, Zhihong ; dev at dpdk.org
>>> Cc: yuanhan.liu at linux.intel.com
>>> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
..

>>> I forgot to add that before this series, I think we should first fix the 
>>> windows
>> bug.
>>> Else we will need a dedicated fix for the stable branch.
>>
>> Okay I'll try to fix it, though I can't make any promises at present.
>>
>> Have tried once but stopped since we don't have enough debug info from the
>> frontend side so basically I was debugging the backend based on guesses.
>
> Hi Maxime, Yuanhan,
>
> I've identified the root cause, do you think it makes sense to put the fix
> in the same patch set? Or send it as a separated patch?

Good work!

Agree with Yuanhan, send it before the optimization series.

Thanks,
Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong


> -Original Message-
> From: Wang, Zhihong
> Sent: Tuesday, August 23, 2016 10:31 AM
> To: Maxime Coquelin ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com
> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> 
> > -Original Message-
> > From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> > Sent: Monday, August 22, 2016 6:02 PM
> > To: Wang, Zhihong ; dev at dpdk.org
> > Cc: yuanhan.liu at linux.intel.com
> > Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> >
> >
> > On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > > Hi Zhihong,
> > >
> > > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > > This patch set optimizes the vhost enqueue function.
> > > >
> > > > It implements the vhost logic from scratch into a single function
> > > > designed
> > > > for high performance and good maintainability, and improves CPU
> > > > efficiency
> > > > significantly by optimizing cache access, which means:
> > > >
> > > >  *  For fast frontends (eg. DPDK virtio pmd), higher performance
> > (maximum
> > > > throughput) can be achieved.
> > > >
> > > >  *  For slow frontends (eg. kernel virtio-net), better scalability can 
> > > > be
> > > > achieved, each vhost core can support more connections since it 
> > > > takes
> > > > less cycles to handle each single frontend.
> > > >
> > > > The main optimization techniques are:
> > > >
> > > >  1. Reorder code to reduce CPU pipeline stall cycles.
> > > >
> > > >  2. Batch update the used ring for better efficiency.
> > > >
> > > >  3. Prefetch descriptor to hide cache latency.
> > > >
> > > >  4. Remove useless volatile attribute to allow compiler optimization.
> > >
> > > Thanks for these details, this is helpful to understand where the perf
> > > gain comes from.
> > > I would suggest to add these information as comments in the code
> > > where/if it makes sense. If more a general comment, at least add it in
> > > the commit message of the patch introducing it.
> > > Indeed, adding it to the cover letter is fine, but the information is
> > > lost as soon as the series is applied.
> > >
> > > You don't mention any figures, so I set up a benchmark on my side to
> > > evaluate your series. It indeed shows an interesting performance gain.
> > >
> > > My setup consists of one host running a guest.
> > > The guest generates as much 64bytes packets as possible using
> > > pktgen-dpdk. The hosts forwards received packets back to the guest
> > > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > > physical CPUs.
> > >
> > > I tested it with and without your v1 patch, with and without
> > > rx-mergeable feature turned ON.
> > > Results are the average of 8 runs of 60 seconds:
> > >
> > > Rx-Mergeable ON : 7.72Mpps
> > > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > > Rx-Mergeable OFF: 10.52Mpps
> > > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> > >
> > I forgot to add that before this series, I think we should first fix the 
> > windows
> bug.
> > Else we will need a dedicated fix for the stable branch.
> 
> Okay I'll try to fix it, though I can't make any promises at present.
> 
> Have tried once but stopped since we don't have enough debug info from the
> frontend side so basically I was debugging the backend based on guesses.

Hi Maxime, Yuanhan,

I've identified the root cause. Do you think it makes sense to put the fix
in the same patch set, or should I send it as a separate patch?


Thanks
Zhihong

> 
> 
> >
> > Regards,
> > Maxime



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, August 22, 2016 6:02 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > This patch set optimizes the vhost enqueue function.
> > >
> > > It implements the vhost logic from scratch into a single function
> > > designed
> > > for high performance and good maintainability, and improves CPU
> > > efficiency
> > > significantly by optimizing cache access, which means:
> > >
> > >  *  For fast frontends (eg. DPDK virtio pmd), higher performance
> (maximum
> > > throughput) can be achieved.
> > >
> > >  *  For slow frontends (eg. kernel virtio-net), better scalability can be
> > > achieved, each vhost core can support more connections since it takes
> > > less cycles to handle each single frontend.
> > >
> > > The main optimization techniques are:
> > >
> > >  1. Reorder code to reduce CPU pipeline stall cycles.
> > >
> > >  2. Batch update the used ring for better efficiency.
> > >
> > >  3. Prefetch descriptor to hide cache latency.
> > >
> > >  4. Remove useless volatile attribute to allow compiler optimization.
> >
> > Thanks for these details, this is helpful to understand where the perf
> > gain comes from.
> > I would suggest to add these information as comments in the code
> > where/if it makes sense. If more a general comment, at least add it in
> > the commit message of the patch introducing it.
> > Indeed, adding it to the cover letter is fine, but the information is
> > lost as soon as the series is applied.
> >
> > You don't mention any figures, so I set up a benchmark on my side to
> > evaluate your series. It indeed shows an interesting performance gain.
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> I forgot to add that before this series, I think we should first fix the 
> windows bug.
> Else we will need a dedicated fix for the stable branch.

Okay, I'll try to fix it, though I can't make any promises at present.

I tried once but stopped, since we don't have enough debug info from the
frontend side; basically I was debugging the backend based on guesses.


> 
> Regards,
> Maxime



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> 
> Hi Zhihong,
> 
[...]
> > The main optimization techniques are:
> >
> >  1. Reorder code to reduce CPU pipeline stall cycles.
> >
> >  2. Batch update the used ring for better efficiency.
> >
> >  3. Prefetch descriptor to hide cache latency.
> >
> >  4. Remove useless volatile attribute to allow compiler optimization.
> 
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.

Hi Maxime,

I did add this information in the later optimization patches to explain each
optimization technique. The v1 was indeed hard to read.


> 
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
> 
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
> 

Thanks for doing the test!

I didn't publish any numbers since the gain varies across platforms
and test setups.

In my phy-to-VM test on both IVB and HSW, where testpmd in the host receives
from the NIC and enqueues to the guest, the v3 patch makes the enqueue path
2.4x and 1.4x as efficient (in cycles per packet) as the current code, for
mergeable on and mergeable off respectively.
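
As a rough illustration of how a cycles-per-packet figure like this can be
collected (a sketch only: rte_rdtsc() is DPDK's TSC read, while
do_enqueue_burst() is a hypothetical stand-in for the enqueue path under
test, not the actual benchmark code):

#include <stdint.h>

#include <rte_cycles.h>     /* rte_rdtsc() */
#include <rte_mbuf.h>

/* Hypothetical stand-in for the enqueue path being measured. */
extern uint16_t do_enqueue_burst(void *vq, struct rte_mbuf **pkts,
                                 uint16_t count);

/* Return the cost, in TSC cycles per packet, of one enqueue burst. */
static inline double
cycles_per_packet(void *vq, struct rte_mbuf **pkts, uint16_t count)
{
        uint64_t start = rte_rdtsc();
        uint16_t sent = do_enqueue_burst(vq, pkts, count);
        uint64_t end = rte_rdtsc();

        return sent ? (double)(end - start) / sent : 0.0;
}

Averaged over many bursts, comparing this figure for the old and new code
gives ratios like the 2.4x/1.4x quoted above.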


> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
> 
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> 
> Regards,
> Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-22 Thread Thomas Monjalon
2016-08-22 12:01, Maxime Coquelin:
> I forgot to add that before this series, I think we should first fix the 
> windows bug.
> Else we will need a dedicated fix for the stable branch.

This is a funny situation :)
If Zhihong had reworked the code without mentioning that it fixes a scenario
with Windows guests, maybe nobody would have noticed ;)
That's probably why it is not written in v2/v3. But thanks to the v1,
we all know it:
"It also fixes the issue working with Windows VMs."

So yes, it would be a lot better to find the root cause and try to have a
minimal fix for 16.07, then rework the code for performance in 16.11.
I think we must avoid silent fixes and, even more, avoid writing specific
fixes for stable branches without validating them in the master branch and
its large user base.

Thanks for your good works guys, DPDK vhost is improving very well.


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-22 Thread Maxime Coquelin

On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> Hi Zhihong,
>
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > This patch set optimizes the vhost enqueue function.
> >
> > It implements the vhost logic from scratch into a single function
> > designed
> > for high performance and good maintainability, and improves CPU
> > efficiency
> > significantly by optimizing cache access, which means:
> >
> >  *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
> > throughput) can be achieved.
> >
> >  *  For slow frontends (eg. kernel virtio-net), better scalability can be
> > achieved, each vhost core can support more connections since it takes
> > less cycles to handle each single frontend.
> >
> > The main optimization techniques are:
> >
> >  1. Reorder code to reduce CPU pipeline stall cycles.
> >
> >  2. Batch update the used ring for better efficiency.
> >
> >  3. Prefetch descriptor to hide cache latency.
> >
> >  4. Remove useless volatile attribute to allow compiler optimization.
>
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.
>
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
>
> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>
I forgot to add that, before this series, I think we should first fix the
Windows bug.
Otherwise, we will need a dedicated fix for the stable branch.

Regards,
Maxime



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-22 Thread Maxime Coquelin
Hi Zhihong,

On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function designed
> for high performance and good maintainability, and improves CPU efficiency
> significantly by optimizing cache access, which means:
>
>  *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
> throughput) can be achieved.
>
>  *  For slow frontends (eg. kernel virtio-net), better scalability can be
> achieved, each vhost core can support more connections since it takes
> less cycles to handle each single frontend.
>
> The main optimization techniques are:
>
>  1. Reorder code to reduce CPU pipeline stall cycles.
>
>  2. Batch update the used ring for better efficiency.
>
>  3. Prefetch descriptor to hide cache latency.
>
>  4. Remove useless volatile attribute to allow compiler optimization.

Thanks for these details; they are helpful to understand where the perf
gain comes from.
I would suggest adding this information as comments in the code
where/if it makes sense. If it is more of a general comment, at least add it
to the commit message of the patch introducing it.
Indeed, adding it to the cover letter is fine, but the information is
lost as soon as the series is applied.

You don't mention any figures, so I set up a benchmark on my side to
evaluate your series. It indeed shows an interesting performance gain.

My setup consists of one host running a guest.
The guest generates as many 64-byte packets as possible using
pktgen-dpdk. The host forwards received packets back to the guest
using testpmd on a vhost PMD interface. The guest's vCPUs are pinned to
physical CPUs.

I tested it with and without your v1 patch, with and without the
rx-mergeable feature turned ON.
Results are the average of 8 runs of 60 seconds:

Rx-Mergeable ON : 7.72Mpps
Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
Rx-Mergeable OFF: 10.52Mpps
Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps

Regards,
Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-19 Thread Zhihong Wang
This patch set optimizes the vhost enqueue function.

It reimplements the vhost enqueue logic from scratch as a single function
designed for high performance and good maintainability, and significantly
improves CPU efficiency by optimizing cache access, which means:

 *  For fast frontends (e.g. the DPDK virtio PMD), higher performance (maximum
    throughput) can be achieved.

 *  For slow frontends (e.g. kernel virtio-net), better scalability can be
    achieved: each vhost core can support more connections since it takes
    fewer cycles to handle each frontend.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.
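
As an illustration of the batched used-ring update, here is a minimal sketch
of the general idea; the names (vq_sketch, shadow, flush_shadow_used_ring,
...) are illustrative and are not the identifiers used in the patches. It
also shows why the volatile qualifier on the cached used index becomes
unnecessary: the index lives in a plain field that is read once and written
back once per burst.

#include <stdint.h>

/* Sketch only: batch used-ring updates instead of touching the
 * guest-visible ring once per packet. */
struct used_elem_sketch {
        uint32_t id;    /* head descriptor index */
        uint32_t len;   /* bytes written */
};

struct vq_sketch {
        struct used_elem_sketch *used_ring;  /* guest-visible used ring */
        volatile uint16_t *used_idx;         /* guest-visible used index */
        uint16_t size;                       /* ring size, a power of two */

        struct used_elem_sketch shadow[256]; /* per-burst shadow; sized for
                                              * illustration only */
        uint16_t shadow_count;
        uint16_t last_used_idx;              /* cached copy, no volatile */
};

/* Record one completed descriptor locally; no guest-visible write yet. */
static inline void
update_shadow_used_ring(struct vq_sketch *vq, uint32_t desc_id, uint32_t len)
{
        vq->shadow[vq->shadow_count].id = desc_id;
        vq->shadow[vq->shadow_count].len = len;
        vq->shadow_count++;
}

/* Flush once per burst: one pass over the used ring, one write barrier,
 * one store to the used index. */
static inline void
flush_shadow_used_ring(struct vq_sketch *vq)
{
        uint16_t i;

        for (i = 0; i < vq->shadow_count; i++) {
                uint16_t slot = (vq->last_used_idx + i) & (vq->size - 1);

                vq->used_ring[slot] = vq->shadow[i];
        }

        /* DPDK code would use rte_smp_wmb() here. */
        __atomic_thread_fence(__ATOMIC_RELEASE);

        vq->last_used_idx += vq->shadow_count;
        *vq->used_idx = vq->last_used_idx;
        vq->shadow_count = 0;
}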

In the existing code there are two callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Also, having two separate functions increases
the maintenance effort.
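
A very rough sketch of what such a single enqueue path can look like follows;
it is only an illustration of the structure (descriptor accounting is
simplified and all helper names are made up, reusing the shadow-ring helpers
sketched above), and it shows where the descriptor prefetch fits:

#include <stdint.h>

#include <rte_mbuf.h>
#include <rte_prefetch.h>   /* rte_prefetch0() */

/* Illustrative stand-ins for the real vhost structures and helpers;
 * none of these names come from the actual patches. */
struct vq_sketch;
struct desc_sketch;

extern uint16_t avail_count(struct vq_sketch *vq);
extern struct desc_sketch *desc_at(struct vq_sketch *vq, uint16_t slot);
extern int copy_mbuf_to_desc(struct vq_sketch *vq, struct desc_sketch *d,
                             struct rte_mbuf *m, int mergeable);
extern void update_shadow_used_ring(struct vq_sketch *vq, uint32_t desc_id,
                                    uint32_t len);
extern void flush_shadow_used_ring(struct vq_sketch *vq);

/* One entry point for both cases; "mergeable" reflects whether
 * VIRTIO_NET_F_MRG_RXBUF was negotiated. */
static uint16_t
enqueue_burst_sketch(struct vq_sketch *vq, int mergeable,
                     struct rte_mbuf **pkts, uint16_t count)
{
        uint16_t avail = avail_count(vq);
        uint16_t i;

        if (count > avail)
                count = avail;

        for (i = 0; i < count; i++) {
                struct desc_sketch *d = desc_at(vq, i);

                /* Prefetch the descriptor the next iteration will use so its
                 * cache miss overlaps with copying the current packet. */
                if (i + 1 < count)
                        rte_prefetch0(desc_at(vq, i + 1));

                /* The helper hides the difference between the two layouts;
                 * with mergeable Rx it may use more than one buffer per mbuf. */
                if (copy_mbuf_to_desc(vq, d, pkts[i], mergeable) < 0)
                        break;

                update_shadow_used_ring(vq, i, pkts[i]->pkt_len);
        }

        flush_shadow_used_ring(vq);     /* single batched used-ring update */

        return i;
}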

---
Changes in v3:

 1. Remove an unnecessary memset which causes frontend stalls on SNB & IVB.

 2. Rename variables to follow naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (5):
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 573 +++---
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 220 insertions(+), 374 deletions(-)

-- 
2.7.4