[Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-09 Thread Liang Li
buffer_find_nonzero_offset() is a hot function during live migration.
It currently uses SSE2 instructions for optimization. On platforms that
support AVX2, using AVX2 instructions instead improves performance by
about 30% compared to SSE2.
The zero page check is faster with this optimization; test results
show that for an idle guest with 8GB of RAM, this patch shortens the
total live migration time by about 6%.

This patch uses the ifunc mechanism to select the proper function at
run time: on platforms that support AVX2 the AVX2 code is executed,
otherwise the original code is executed.

With this patch, a QEMU binary built with AVX2 enabled can run on
platforms both with and without AVX2 support.

If the QEMU binary is built with AVX2 disabled, or if the compiler does
not support AVX2, the binary will not contain any AVX2 instructions, and
it can likewise run on platforms both with and without AVX2 support.
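
As a rough illustration of the run-time selection described above, here is
a minimal sketch. It is not the code from this series: it assumes GCC or
Clang, and it uses a plain function pointer plus __builtin_cpu_supports()
rather than a real ifunc resolver. The names is_zero_avx2(),
is_zero_generic(), select_is_zero() and buffer_is_zero_dispatch() are made
up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical AVX2 variant: tests 32 bytes per iteration. */
__attribute__((target("avx2")))
static bool is_zero_avx2(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    size_t i = 0;

    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        /* _mm256_testz_si256(v, v) yields 1 only if every bit of v is 0. */
        if (!_mm256_testz_si256(v, v)) {
            return false;
        }
    }
    /* Check the tail that does not fill a whole 32-byte vector. */
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}
#endif

/* Portable fallback, standing in for "the original code". */
static bool is_zero_generic(const void *buf, size_t len)
{
    const uint8_t *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

typedef bool (*is_zero_fn)(const void *buf, size_t len);

/* Pick an implementation based on what the running CPU supports. */
static is_zero_fn select_is_zero(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        return is_zero_avx2;
    }
#endif
    return is_zero_generic;
}

bool buffer_is_zero_dispatch(const void *buf, size_t len)
{
    static is_zero_fn impl;   /* lazily initialised; not thread-safe */

    assert(buf != NULL || len == 0);
    if (!impl) {
        impl = select_is_zero();
    }
    return impl(buf, len);
}
```

With ifunc the selection would instead happen once, when the dynamic linker
resolves the symbol, so callers would not pay for the pointer check on each
call.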

 
Liang Li (2):
  cutils: add avx2 instruction optimization
  configure: add options to config avx2

 configure | 29 ++
 include/qemu-common.h | 28 +++--
 util/Makefile.objs|  2 ++
 util/avx2.c   | 69 +++
 util/cutils.c | 53 +--
 5 files changed, 172 insertions(+), 9 deletions(-)
 create mode 100644 util/avx2.c

-- 
1.9.1




Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2016-04-07 Thread Dr. David Alan Gilbert
* Eric Blake (ebl...@redhat.com) wrote:
> On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> 
> >> One thing I still can't understand, why the unit test in host environment 
> >> shows
> >> 'memcmp()' have better performance?
> 
> Have you tried running under a profiler, to see if there are hotspots or
> at least get an idea of where the time is being spent?
> 
> > 
> > Are you aware of any program other than QEMU that also wants to do something
> > similar?  Finding whether a block of memory is zero, sounds like something
> > that would be useful in lots of places, I just can't think which ones.
> 
> At least dd, cp, and probably several other utilities.  It would be nice
> to post an RFE to glibc to see if they can come up with a dedicated
> interface that is faster than memcmp(), although that still only helps
> us when targetting a system new enough to have that interface.

I've just posted that RFE:
https://sourceware.org/bugzilla/show_bug.cgi?id=19920

Dave

> -- 
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2016-04-07 Thread Michael S. Tsirkin
On Thu, Apr 07, 2016 at 12:09:52PM +0100, Dr. David Alan Gilbert wrote:
> * Eric Blake (ebl...@redhat.com) wrote:
> > On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> > 
> > >> One thing I still can't understand, why the unit test in host 
> > >> environment shows
> > >> 'memcmp()' have better performance?
> > 
> > Have you tried running under a profiler, to see if there are hotspots or
> > at least get an idea of where the time is being spent?
> > 
> > > 
> > > Are you aware of any program other than QEMU that also wants to do 
> > > something
> > > similar?  Finding whether a block of memory is zero, sounds like something
> > > that would be useful in lots of places, I just can't think which ones.
> > 
> > At least dd, cp, and probably several other utilities.  It would be nice
> > to post an RFE to glibc to see if they can come up with a dedicated
> > interface that is faster than memcmp(), although that still only helps
> > us when targetting a system new enough to have that interface.
> 
> I've just posted that RFE:
> https://sourceware.org/bugzilla/show_bug.cgi?id=19920
> 
> Dave

Have you guys seen the discussion in
http://rusty.ozlabs.org/?p=560#respond

In particular it claims this is close to optimal:


char check_zero(char *p, int len)
{
    char res = 0;
    int i;

    for (i = 0; i < len; i++) {
        res = res | p[i];
    }

    return res;
}


That is, if you compile this function with -ftree-vectorize and -funroll-loops.

Now, this version always scans all of the buffer, so
it will be slower when buffer is *not* all-zeroes.

Which might indicate that you need to know what your
workload is to implement compare to zero efficiently,
and if that is the case, it's not clear this is appropriate for libc.


> > -- 
> > Eric Blake   eblake redhat com+1-919-301-3266
> > Libvirt virtualization library http://libvirt.org
> > 
> 
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2016-04-07 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Thu, Apr 07, 2016 at 12:09:52PM +0100, Dr. David Alan Gilbert wrote:
> > * Eric Blake (ebl...@redhat.com) wrote:
> > > On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> > > 
> > > >> One thing I still can't understand, why the unit test in host 
> > > >> environment shows
> > > >> 'memcmp()' have better performance?
> > > 
> > > Have you tried running under a profiler, to see if there are hotspots or
> > > at least get an idea of where the time is being spent?
> > > 
> > > > 
> > > > Are you aware of any program other than QEMU that also wants to do 
> > > > something
> > > > similar?  Finding whether a block of memory is zero, sounds like 
> > > > something
> > > > that would be useful in lots of places, I just can't think which ones.
> > > 
> > > At least dd, cp, and probably several other utilities.  It would be nice
> > > to post an RFE to glibc to see if they can come up with a dedicated
> > > interface that is faster than memcmp(), although that still only helps
> > > us when targetting a system new enough to have that interface.
> > 
> > I've just posted that RFE:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19920
> > 
> > Dave
> 
> Have you guys seen the discussion in
> http://rusty.ozlabs.org/?p=560#respond
> 
> In particular it claims this is close to optimal:
> 
> 
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
> 
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
> 
>     return res;
> }
> 
> 
> That is, if you compile this function with -ftree-vectorize and -funroll-loops.
> 
> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.
> 
> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,
> and if that is the case, it's not clear this is appropriate for libc.

On the contrary; anything that needs a couple of carefully chosen
compiler switches and assumes a particular workload is much
better optimised in a library for the general workload.

Dave

> 
> > > -- 
> > > Eric Blake   eblake redhat com+1-919-301-3266
> > > Libvirt virtualization library http://libvirt.org
> > > 
> > 
> > 
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2016-04-07 Thread Paolo Bonzini


On 07/04/2016 14:54, Michael S. Tsirkin wrote:
> 
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
> 
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
> 
>     return res;
> }
> 
> 
> That is, if you compile this function with -ftree-vectorize and -funroll-loops.

What you get then is exactly the same as what we already have in QEMU,
except for:

- the QEMU one has 128 extra instructions (32 times pcmpeq, movmsk, cmp,
je) in the loop.  Those extra instructions probably are free because, in
the case where the function goes through the whole buffer, the cache
misses dominate despite the efforts of the hardware prefetcher

- the QEMU one has an extra small loop at the beginning that proceeds a
word at a time to catch the case where almost everything in the page is
nonzero.
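
To make the shape described in the two points above concrete, here is a
minimal sketch in portable C. It is not the QEMU function: it uses plain
unsigned long words instead of SSE2 vectors, and the name
buffer_is_zero_two_phase() and the HEAD_WORDS and CHUNK_WORDS values are
assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HEAD_WORDS  4    /* hypothetical: words checked one at a time first */
#define CHUNK_WORDS 16   /* hypothetical: words OR-ed together per test */

bool buffer_is_zero_two_phase(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t nwords = len / sizeof(unsigned long);
    size_t i = 0;
    unsigned long w;

    assert(buf != NULL || len == 0);

    /* Phase 1: small loop, one word at a time, so that pages which are
     * non-zero near the start are rejected with very little setup cost. */
    for (; i < nwords && i < HEAD_WORDS; i++) {
        memcpy(&w, p + i * sizeof(w), sizeof(w));
        if (w) {
            return false;
        }
    }

    /* Phase 2: OR a whole chunk together and branch once per chunk. */
    while (i + CHUNK_WORDS <= nwords) {
        unsigned long acc = 0;

        for (size_t j = 0; j < CHUNK_WORDS; j++) {
            memcpy(&w, p + (i + j) * sizeof(w), sizeof(w));
            acc |= w;
        }
        if (acc) {
            return false;
        }
        i += CHUNK_WORDS;
    }

    /* Leftover whole words, then leftover bytes. */
    for (; i < nwords; i++) {
        memcpy(&w, p + i * sizeof(w), sizeof(w));
        if (w) {
            return false;
        }
    }
    for (size_t b = nwords * sizeof(w); b < len; b++) {
        if (p[b]) {
            return false;
        }
    }
    return true;
}
```

The per-chunk branch is what lets this version stop early, unlike the
check_zero() loop quoted above, which always reads the whole buffer.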

> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.

This is by far the common case.

> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,

Not necessarily.  The two cases (unrolled/higher setup cost, and
non-unrolled/lower setup cost) are the same as the "parallel" and
"sequential" parts in Amdahl's law, and they optimize for completely
opposite workloads.  Amdahl's law then tells you that by making the
non-unrolled part small enough you can get very close to the absolute
maximum speedup.

Now of course if you know that your workload is "almost everything is
zero except a few bytes at the end of the page" then you have the
problem that your workload sucks and you should hate the guy who wrote
the software running in the guest. :)

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-09 Thread Eric Blake
On 11/09/2015 07:51 PM, Liang Li wrote:
> buffer_find_nonzero_offset() is a hot function during live migration.
> Now it use SSE2 intructions for optimization. For platform supports
> AVX2 instructions, use the AVX2 instructions for optimization can help
> to improve the performance about 30% comparing to SSE2.

Rather than trying to cater to multiple assembly instruction
implementations ourselves, have you tried taking the ideas in this
earlier thread?
https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html

Ideally, libc's memcmp() will already be using the most efficient
assembly instructions without us having to reproduce the work of picking
the instructions that work best.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-09 Thread Li, Liang Z
> Rather than trying to cater to multiple assembly instruction implementations
> ourselves, have you tried taking the ideas in this earlier thread?
> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html
> 
> Ideally, libc's memcmp() will already be using the most efficient assembly
> instructions without us having to reproduce the work of picking the 
> instructions
> that work best.
> 

Eric, thanks for your information. I hadn't noticed that discussion before.

I rewrote buffer_find_nonzero_offset() using 'memeqzero4_paolo', then wrote
a test program that checks a large number of zero pages, and used 'time' to
record the time taken by each optimization. The test results are as follows:

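SSE2:
-------------------------------
          | test 1  | test 2
-------------------------------
Time (s): | 13.696  | 13.533
-------------------------------


AVX2:
-------------------------------
          | test 1  | test 2
-------------------------------
Time (s): | 10.583  | 10.306
-------------------------------

memeqzero4_paolo:
-------------------------------
          | test 1  | test 2
-------------------------------
Time (s): |  9.718  |  9.817
-------------------------------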


Paolo's implementation has the best performance. It seems that we can remove 
the SSE2 related Intrinsics.
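
For anyone who wants to repeat this kind of measurement, below is a minimal
sketch of such a test program. It is not the program I actually ran: the
names bench_zero_check(), is_zero_bytewise() and BENCH_PAGE_SIZE are made up
for the example, it times with clock() instead of the 'time' command, and
the nonzero_every parameter is an extra so that a mix of zero and non-zero
pages can be tried as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define BENCH_PAGE_SIZE 4096   /* assumed page size for the sketch */

typedef bool (*zero_check_fn)(const void *buf, size_t len);

/* Trivial checker, only here so the harness has something to time. */
static bool is_zero_bytewise(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Run `check` over `npages` pages, `rounds` times.  If nonzero_every is
 * not 0, every nonzero_every-th page (starting with page 0) gets one
 * non-zero byte at an arbitrary offset.  Returns how many pages were
 * reported as zero and stores the CPU time used in *elapsed_sec.
 */
size_t bench_zero_check(zero_check_fn check, size_t npages, unsigned rounds,
                        size_t nonzero_every, double *elapsed_sec)
{
    unsigned char *mem;
    size_t zero_pages = 0;
    clock_t start, end;

    assert(npages > 0);
    mem = calloc(npages, BENCH_PAGE_SIZE);
    if (!mem) {
        *elapsed_sec = 0.0;
        return 0;
    }
    if (nonzero_every) {
        for (size_t i = 0; i < npages; i += nonzero_every) {
            mem[i * BENCH_PAGE_SIZE + 64] = 1;
        }
    }

    start = clock();
    for (unsigned r = 0; r < rounds; r++) {
        for (size_t i = 0; i < npages; i++) {
            if (check(mem + i * BENCH_PAGE_SIZE, BENCH_PAGE_SIZE)) {
                zero_pages++;
            }
        }
    }
    end = clock();

    *elapsed_sec = (double)(end - start) / CLOCKS_PER_SEC;
    free(mem);
    return zero_pages;
}
```

Passing nonzero_every = 0 gives the all-zero case measured above. The
returned count is only a sanity check that the function under test gives
the right answers.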

Liang
> --
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Juan Quintela
"Li, Liang Z"  wrote:
>> Rather than trying to cater to multiple assembly instruction implementations
>> ourselves, have you tried taking the ideas in this earlier thread?
>> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html
>> 
>> Ideally, libc's memcmp() will already be using the most efficient assembly
>> instructions without us having to reproduce the work of picking the 
>> instructions
>> that work best.
>> 
>
> Eric, thanks for you information. I didn't notice that discussion before.
>
>
> I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo 
> length'
> then write a test program to check a large amount of zero pages, and
> use the 'time' to
> recode the time takes by different optimization. Test result is like this:
>
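> SSE2:
> -------------------------------
>           | test 1  | test 2
> -------------------------------
> Time (s): | 13.696  | 13.533
> -------------------------------
>
>
> AVX2:
> -------------------------------
>           | test 1  | test 2
> -------------------------------
> Time (s): | 10.583  | 10.306
> -------------------------------
>
> memeqzero4_paolo:
> -------------------------------
>           | test 1  | test 2
> -------------------------------
> Time (s): |  9.718  |  9.817
> -------------------------------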
>
>
> Paolo's implementation has the best performance. It seems that we can
> remove the SSE2 related Intrinsics.

How should I understand that comment?  That you are about to send an
email to remove the sse2 support and that I can forget about this patch?

Thanks, Juan.


>
> Liang
>> --
>> Eric Blake   eblake redhat com+1-919-301-3266
>> Libvirt virtualization library http://libvirt.org



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Paolo Bonzini


On 10/11/2015 10:13, Juan Quintela wrote:
>> > I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo 
>> > length'
>> > then write a test program to check a large amount of zero pages, and
>> > use the 'time' to
>> > recode the time takes by different optimization. Test result is like this:
>> >
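>> > SSE2:
>> > -------------------------------
>> >           | test 1  | test 2
>> > -------------------------------
>> > Time (s): | 13.696  | 13.533
>> > -------------------------------
>> >
>> >
>> > AVX2:
>> > -------------------------------
>> >           | test 1  | test 2
>> > -------------------------------
>> > Time (s): | 10.583  | 10.306
>> > -------------------------------
>> >
>> > memeqzero4_paolo:
>> > -------------------------------
>> >           | test 1  | test 2
>> > -------------------------------
>> > Time (s): |  9.718  |  9.817
>> > -------------------------------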
>> >
>> >
>> > Paolo's implementation has the best performance. It seems that we can
>> > remove the SSE2 related Intrinsics.

Note that you can simplify my implementation a lot, because
buffer_find_nonzero_offset already assumes that the buffer is aligned to
sizeof(VECTYPE), i.e. 16 bytes.  For example you can just check the
first 4 unsigned longs against zero and then call memcmp.

Paolo

> How should I understand that comment?  That you are about to send an
> email to remove the sse2 support and that I can forget about this patch?



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Li, Liang Z
> > Eric, thanks for you information. I didn't notice that discussion before.
> >
> >
> > I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo
> length'
> > then write a test program to check a large amount of zero pages, and
> > use the 'time' to recode the time takes by different optimization.
> > Test result is like this:
> >
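> > SSE2:
> > -------------------------------
> >           | test 1  | test 2
> > -------------------------------
> > Time (s): | 13.696  | 13.533
> > -------------------------------
> >
> >
> > AVX2:
> > -------------------------------
> >           | test 1  | test 2
> > -------------------------------
> > Time (s): | 10.583  | 10.306
> > -------------------------------
> >
> > memeqzero4_paolo:
> > -------------------------------
> >           | test 1  | test 2
> > -------------------------------
> > Time (s): |  9.718  |  9.817
> > -------------------------------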
> >
> >
> > Paolo's implementation has the best performance. It seems that we can
> > remove the SSE2 related Intrinsics.
> 
> How should I understand that comment?  That you are about to send an email
> to remove the sse2 support and that I can forget about this patch?
> 
> Thanks, Juan.
> 

I don't know Paolo's opinion on how to deal with the SSE2 intrinsics; he is
the author. From my personal view, now that we have found a better way, why
use such low-level SSE2/AVX2 intrinsics? I don't know if someone else is
working on this. If not, and the related maintainer agrees to remove them, I
am happy to send out a new patch.

Let's forget my patch at the moment.

Liang



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Paolo Bonzini


On 10/11/2015 10:26, Li, Liang Z wrote:
> I don't know Paolo's opinion about how to deal with the SSE2
> Intrinsics, he is the author. From my personal view, now that we have
> found a better way, why to use such low level SSE2/AVX2 Intrinsics.

I totally agree. :)

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Li, Liang Z
> On 10/11/2015 10:26, Li, Liang Z wrote:
> > I don't know Paolo's opinion about how to deal with the SSE2
> > Intrinsics, he is the author. From my personal view, now that we have
> > found a better way, why to use such low level SSE2/AVX2 Intrinsics.
> 
> I totally agree. :)
> 
> Paolo

Hi Paolo,

It seems you are the right person to remove them, since you are the author of
both the 'SSE2 Intrinsics' and 'memeqzero4_paolo'.
Please forget my patch totally.

Liang



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Paolo Bonzini


On 10/11/2015 10:41, Li, Liang Z wrote:
>> On 10/11/2015 10:26, Li, Liang Z wrote:
>>> I don't know Paolo's opinion about how to deal with the SSE2 
>>> Intrinsics, he is the author. From my personal view, now that we
>>> have found a better way, why to use such low level SSE2/AVX2
>>> Intrinsics.
>> 
>> I totally agree. :)
> 
> It seems you are the right person to remove them, you are the author
> for both the 'SSE2 Intrinsics' and 'memeqzero4_paolo'. Please forget
> my patch totally.

I agree that your patch can be dropped, but go ahead and submit your
improvements!

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Li, Liang Z
> On 10/11/2015 10:41, Li, Liang Z wrote:
> >> On 10/11/2015 10:26, Li, Liang Z wrote:
> >>> I don't know Paolo's opinion about how to deal with the SSE2
> >>> Intrinsics, he is the author. From my personal view, now that we
> >>> have found a better way, why to use such low level SSE2/AVX2
> >>> Intrinsics.
> >>
> >> I totally agree. :)
> >
> > It seems you are the right person to remove them, you are the author
> > for both the 'SSE2 Intrinsics' and 'memeqzero4_paolo'. Please forget
> > my patch totally.
> 
> I agree that your patch can be dropped, but go ahead and submit your
> improvements!
> 
> Paolo

You mean I do this work? 
If you are busy, I can do this. I really hope the related improvement can be 
merged into QEMU 2.5.0.

Liang


Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Paolo Bonzini


On 10/11/2015 10:56, Li, Liang Z wrote:
> > I agree that your patch can be dropped, but go ahead and submit your
> > improvements!
> 
> You mean I do this work? 
> If you are busy, I can do this.

It's not that I'm busy, it's that it's your idea.  It doesn't matter if
I (and Peter Lieven too, actually) originally did the optimizations.

You also have the infrastructure to benchmark the improvements.

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-10 Thread Li, Liang Z
> On 10/11/2015 10:56, Li, Liang Z wrote:
> > > I agree that your patch can be dropped, but go ahead and submit your
> > > improvements!
> >
> > You mean I do this work?
> > If you are busy, I can do this.
> 
> It's not that I'm busy, it's that it's your idea.  It doesn't matter if I 
> (and Peter
> Lieven too, actually) originally did the optimizations.
> 
> You also have the infrastructure to benchmark the improvements.
> 
> Paolo

OK. I will rework and send a new patch.

Liang


Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-11 Thread Li, Liang Z
> 
> On 10/11/2015 10:26, Li, Liang Z wrote:
> > I don't know Paolo's opinion about how to deal with the SSE2
> > Intrinsics, he is the author. From my personal view, now that we have
> > found a better way, why to use such low level SSE2/AVX2 Intrinsics.
> 
> I totally agree. :)
> 
> Paolo

Hi Paolo,

I am very surprised by the live migration performance result when I use
your 'memeqzero4_paolo' instead of the SSE2 intrinsics to check for zero
pages.
The total live migration time increased by about 8%, rather than decreasing,
although in the unit test your 'memeqzero4_paolo' performs better. Any idea why?

Liang


  



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Paolo Bonzini


On 12/11/2015 03:49, Li, Liang Z wrote:
> I am very surprised about the live migration performance  result when
> I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
> check the zero pages.

What code were you using?  Remember I suggested using only unsigned long
checks, like

    unsigned long *p = ...
    if (p[0] || p[1] || p[2] || p[3]
        || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
        return BUFFER_NOT_ZERO;
    else
        return BUFFER_ZERO;

> The total live migration time increased about
> 8%!   Not decreased.  Although in the unit test your '
> memeqzero4_paolo'  has better performance, any idea?

You only tested the case of zero pages.  But real pages usually are not
zero, even if they have a few zero bytes at the beginning.  It's very
important to optimize the initial check before the memcmp call.
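
For reference, the unsigned long check plus memcmp suggested earlier in this
message can be written out as a complete function. This is only a sketch:
the name buffer_is_zero_memcmp() and the HEAD_WORDS constant are made up
here, it returns a bool instead of BUFFER_ZERO/BUFFER_NOT_ZERO, and it
assumes the buffer holds at least HEAD_WORDS unsigned longs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HEAD_WORDS 4   /* words checked by hand before calling memcmp */

bool buffer_is_zero_memcmp(const void *buf, size_t size)
{
    const unsigned char *p = buf;
    const size_t head = HEAD_WORDS * sizeof(unsigned long);
    unsigned long w;

    assert(size >= head);

    /* Cheap initial check: most non-zero pages are rejected here. */
    for (size_t i = 0; i < HEAD_WORDS; i++) {
        memcpy(&w, p + i * sizeof(w), sizeof(w));
        if (w) {
            return false;
        }
    }

    /*
     * Bytes [0, head) are now known to be zero.  Comparing the buffer
     * with itself shifted by `head` bytes checks p[k + head] == p[k]
     * for every k, so by induction every remaining byte is zero too.
     * The two regions overlap, which is fine because memcmp only reads.
     */
    return memcmp(p + head, p, size - head) == 0;
}
```

How many words to check by hand is a tuning knob; 4 is simply the value
used in the snippet above.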

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Li, Liang Z
> On 12/11/2015 03:49, Li, Liang Z wrote:
> > I am very surprised about the live migration performance  result when
> > I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
> > check the zero pages.
> 
> What code were you using?  Remember I suggested using only unsigned long
> checks, like
> 
>     unsigned long *p = ...
>     if (p[0] || p[1] || p[2] || p[3]
>         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>         return BUFFER_NOT_ZERO;
>     else
>         return BUFFER_ZERO;
> 



I use the following code:


bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    if (!length)
        return true;

    /* Check len bytes not aligned on a word.  */
    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
        if (!length)
            return true;
    }

    /* Check up to 16 bytes a word at a time.  */
    for (;;) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        p += sizeof(word);
        length -= sizeof(word);
        if (!length)
            return true;
        if (__builtin_expect(length & 15, 0) == 0)
            break;
    }

    /* Now we know that's zero, memcmp with self. */
    return memcmp(data, p, length) == 0;
}

> > The total live migration time increased about
> > 8%!   Not decreased.  Although in the unit test your '
> > memeqzero4_paolo'  has better performance, any idea?
> 
> You only tested the case of zero pages.  But real pages usually are not zero,
> even if they have a few zero bytes at the beginning.  It's very important to
> optimize the initial check before the memcmp call.
> 

In the unit test I also tested only zero pages, and there 'memeqzero4_paolo'
performs better.
But when merged into QEMU, it caused a performance drop. Why?

> Paolo


Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Paolo Bonzini


On 12/11/2015 09:53, Li, Liang Z wrote:
>> On 12/11/2015 03:49, Li, Liang Z wrote:
>>> I am very surprised about the live migration performance  result when
>>> I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
>>> check the zero pages.
>>
>> What code were you using?  Remember I suggested using only unsigned long
>> checks, like
>>
>>     unsigned long *p = ...
>>     if (p[0] || p[1] || p[2] || p[3]
>>         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>>         return BUFFER_NOT_ZERO;
>>     else
>>         return BUFFER_ZERO;
>>
> 
> I use the following code:
> 
> 
> bool memeqzero4_paolo(const void *data, size_t length)
> {
>  ...
> }

The code you used is very generic and not optimized for the kind of data
you see during migration, hence the existing code in QEMU fares better.

>>> The total live migration time increased about
>>> 8%!   Not decreased.  Although in the unit test your '
>>> memeqzero4_paolo'  has better performance, any idea?
>>
>> You only tested the case of zero pages.  But real pages usually are not zero,
>> even if they have a few zero bytes at the beginning.  It's very important to
>> optimize the initial check before the memcmp call.
>>
> 
> In the unit test, I only test zero pages too, and the performance of  
> 'memeqzero4_paolo' is better.
> But when merged into QEMU, it caused performance drop. Why?

Because QEMU is not migrating zero pages only.

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Li, Liang Z
> >>> I am very surprised about the live migration performance  result
> >>> when I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics
> >>> to check the zero pages.
> >>
> >> What code were you using?  Remember I suggested using only unsigned
> >> long checks, like
> >>
> >>     unsigned long *p = ...
> >>     if (p[0] || p[1] || p[2] || p[3]
> >>         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> >>         return BUFFER_NOT_ZERO;
> >>     else
> >>         return BUFFER_ZERO;
> >>
> >
> > I use the following code:
> >
> >
> > bool memeqzero4_paolo(const void *data, size_t length) {
> >  ...
> > }
> 
> The code you used is very generic and not optimized for the kind of data you
> see during migration, hence the existing code in QEMU fares better.
> 

I migrated an idle guest with 8GB of RAM; I think most of its pages are zero pages.

I used your new code:
-----------------------------------------
    unsigned long *p = ...
    if (p[0] || p[1] || p[2] || p[3]
        || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
        return BUFFER_NOT_ZERO;
    else
        return BUFFER_ZERO;
-----------------------------------------
and the result is almost the same.  I also tried checking 8 and 16 longs at
the beginning, with the same result.

> >>> The total live migration time increased about
> >>> 8%!   Not decreased.  Although in the unit test your '
> >>> memeqzero4_paolo'  has better performance, any idea?
> >>
> >> You only tested the case of zero pages.  But real pages usually are
> >> not zero, even if they have a few zero bytes at the beginning.  It's
> >> very important to optimize the initial check before the memcmp call.
> >>
> >
> > In the unit test, I only test zero pages too, and the performance of
> 'memeqzero4_paolo' is better.
> > But when merged into QEMU, it caused performance drop. Why?
> 
> Because QEMU is not migrating zero pages only.
> 
> Paolo


Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Paolo Bonzini


On 12/11/2015 10:40, Li, Liang Z wrote:
> I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
> 
> I use your new code:
> -
>     unsigned long *p = ...
>     if (p[0] || p[1] || p[2] || p[3]
>         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>         return BUFFER_NOT_ZERO;
>     else
>         return BUFFER_ZERO;
> ---
> and the result is almost the same.  I also tried the check 8, 16 long data at 
> the beginning, 
> same result.

Interesting...  Well, all I can say is that I applaud you for testing your
hypothesis with the benchmark.

Probably the setup cost of memcmp is too high, because the testing loop
is already very optimized.

Please submit the AVX2 version if it helps!

Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Li, Liang Z
> On 12/11/2015 10:40, Li, Liang Z wrote:
> > I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
> >
> > I use your new code:
> > -
> >     unsigned long *p = ...
> >     if (p[0] || p[1] || p[2] || p[3]
> >         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> >         return BUFFER_NOT_ZERO;
> >     else
> >         return BUFFER_ZERO;
> > ---
> > and the result is almost the same.  I also tried the check 8, 16 long
> > data at the beginning, same result.
> 
> Interesting...  Well, all I can say is that applaud you for testing your 
> hypothesis
> with the benchmark.
> 
> Probably the setup cost of memcmp is too high, because the testing loop is
> already very optimized.
> 
> Please submit the AVX2 version if it helps!

Yes, the AVX2 version really helps. I have already submitted it, could you help 
to review it?

I am curious about the original intention behind adding the SSE2 intrinsics;
was it for the same reason?

I even suspect the VM may affect 'memcmp()' performance. Is that possible?

Liang

> Paolo


Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Juan Quintela
"Li, Liang Z"  wrote:
>> On 12/11/2015 10:40, Li, Liang Z wrote:
>> > I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
>> >
>> > I use your new code:
>> > -
>> >     unsigned long *p = ...
>> >     if (p[0] || p[1] || p[2] || p[3]
>> >         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>> >         return BUFFER_NOT_ZERO;
>> >     else
>> >         return BUFFER_ZERO;
>> > ---
>> > and the result is almost the same.  I also tried the check 8, 16 long
>> > data at the beginning, same result.
>> 
>> Interesting...  Well, all I can say is that applaud you for testing
>> your hypothesis
>> with the benchmark.
>> 
>> Probably the setup cost of memcmp is too high, because the testing loop is
>> already very optimized.
>> 
>> Please submit the AVX2 version if it helps!

I read the email in the wrong order.  Forget about my other email.

Sorry, Juan.


>
> Yes, the AVX2 version really helps. I have already submitted it, could
> you help to review it?
>
> I am curious about the original intention to add the SSE2 Intrinsics,
> is the same reason?
>
> I even suspect the VM may impact the 'memcmp()' performance, is it possible?
>
> Liang
>
>> Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Li, Liang Z
> >> >
> >> > I use your new code:
> >> > -
> >> >     unsigned long *p = ...
> >> >     if (p[0] || p[1] || p[2] || p[3]
> >> >         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> >> >         return BUFFER_NOT_ZERO;
> >> >     else
> >> >         return BUFFER_ZERO;
> >> > ---
> >> > and the result is almost the same.  I also tried the check 8, 16
> >> > long data at the beginning, same result.
> >>
> >> Interesting...  Well, all I can say is that applaud you for testing
> >> your hypothesis with the benchmark.
> >>
> >> Probably the setup cost of memcmp is too high, because the testing
> >> loop is already very optimized.
> >>
> >> Please submit the AVX2 version if it helps!
> 
> I read the email in the wrong order.  Forget about my other email.
> 
> Sorry, Juan.
> 

One thing I still can't understand: why does the unit test in the host
environment show that 'memcmp()' has better performance?

Liang
> 
> >
> > Yes, the AVX2 version really helps. I have already submitted it, could
> > you help to review it?
> >
> > I am curious about the original intention to add the SSE2 Intrinsics,
> > is the same reason?
> >
> > I even suspect the VM may impact the 'memcmp()' performance, is it
> possible?
> >
> > Liang
> >
> >> Paolo



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > >> >
> > >> > I use your new code:
> > >> > -
> > >> >     unsigned long *p = ...
> > >> >     if (p[0] || p[1] || p[2] || p[3]
> > >> >         || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> > >> >         return BUFFER_NOT_ZERO;
> > >> >     else
> > >> >         return BUFFER_ZERO;
> > >> > ---
> > >> > and the result is almost the same.  I also tried the check 8, 16
> > >> > long data at the beginning, same result.
> > >>
> > >> Interesting...  Well, all I can say is that applaud you for testing
> > >> your hypothesis with the benchmark.
> > >>
> > >> Probably the setup cost of memcmp is too high, because the testing
> > >> loop is already very optimized.
> > >>
> > >> Please submit the AVX2 version if it helps!
> > 
> > I read the email in the wrong order.  Forget about my other email.
> > 
> > Sorry, Juan.
> > 
> 
> One thing I still can't understand, why the unit test in host environment 
> shows
> 'memcmp()' have better performance?

Are you aware of any program other than QEMU that also wants to do something
similar?  Finding whether a block of memory is zero, sounds like something
that would be useful in lots of places, I just can't think which ones.

Dave

> 
> Liang
> > 
> > >
> > > Yes, the AVX2 version really helps. I have already submitted it, could
> > > you help to review it?
> > >
> > > I am curious about the original intention to add the SSE2 Intrinsics,
> > > is the same reason?
> > >
> > > I even suspect the VM may impact the 'memcmp()' performance, is it
> > possible?
> > >
> > > Liang
> > >
> > >> Paolo
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization

2015-11-12 Thread Eric Blake
On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:

>> One thing I still can't understand, why the unit test in host environment 
>> shows
>> 'memcmp()' have better performance?

Have you tried running under a profiler, to see if there are hotspots or
at least get an idea of where the time is being spent?

> 
> Are you aware of any program other than QEMU that also wants to do something
> similar?  Finding whether a block of memory is zero, sounds like something
> that would be useful in lots of places, I just can't think which ones.

At least dd, cp, and probably several other utilities.  It would be nice
to post an RFE to glibc to see if they can come up with a dedicated
interface that is faster than memcmp(), although that still only helps
us when targeting a system new enough to have that interface.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


