Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-02-06 Thread Jitendra Kolhe
On 1/30/2017 2:02 PM, Jitendra Kolhe wrote:
> On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
>> * Jitendra Kolhe (jitendra.ko...@hpe.com) wrote:
>>> Using the "-mem-prealloc" option for a very large guest leads to huge
>>> guest start-up and migration times. This is because with "-mem-prealloc"
>>> qemu tries to map every guest page (create the address translations)
>>> and make sure the pages are available during runtime. virsh/libvirt
>>> seems to use the "-mem-prealloc" option by default when the guest is
>>> configured to use huge pages. The patch tries to map all guest pages
>>> simultaneously by spawning multiple threads. Given that the problem is
>>> more prominent for large guests, the patch limits the change to guests
>>> with at least 64GB of memory. Currently the change is limited to the
>>> QEMU library functions on POSIX-compliant hosts only, as we are not
>>> sure whether the problem exists on win32. Below are some stats with the
>>> "-mem-prealloc" option for a guest configured to use huge pages.
>>>
>>> ----------------------------------------------------------------------
>>> Idle Guest      | Start-up time | Migration time
>>> ----------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - single threaded (existing code)
>>> ----------------------------------------------------------------------
>>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>>> ----------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>>> ----------------------------------------------------------------------
>>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>>> ----------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>>> ----------------------------------------------------------------------
>>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>>> ----------------------------------------------------------------------
>>
>> That's a nice improvement.
>>
>>> Signed-off-by: Jitendra Kolhe 
>>> ---
>>>  util/oslib-posix.c | 64 +++---
>>>  1 file changed, 61 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>> index f631464..a8bd7c2 100644
>>> --- a/util/oslib-posix.c
>>> +++ b/util/oslib-posix.c
>>> @@ -55,6 +55,13 @@
>>>  #include "qemu/error-report.h"
>>>  #endif
>>>  
>>> +#define PAGE_TOUCH_THREAD_COUNT 8
>>
>> It seems a shame to fix that number as a constant.
>>
> 
> Yes, as per the comments received, we will update the patch to incorporate
> the vcpu count.
> 
>>> +typedef struct {
>>> +    char *addr;
>>> +    uint64_t numpages;
>>> +    uint64_t hpagesize;
>>> +} PageRange;
>>> +
>>>  int qemu_get_thread_id(void)
>>>  {
>>>  #if defined(__linux__)
>>> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>>>      siglongjmp(sigjump, 1);
>>>  }
>>>  
>>> +static void *do_touch_pages(void *arg)
>>> +{
>>> +    PageRange *range = (PageRange *)arg;
>>> +    char *start_addr = range->addr;
>>> +    uint64_t numpages = range->numpages;
>>> +    uint64_t hpagesize = range->hpagesize;
>>> +    uint64_t i = 0;
>>> +
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(start_addr + (hpagesize * i), 0, 1);
>>> +    }
>>> +    qemu_thread_exit(NULL);
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
>>> +{
>>> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
>>> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
>>> +    uint64_t numpage_per_thread, size_per_thread;
>>> +    int i = 0, tcount = 0;
>>> +
>>> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
>>> +    size_per_thread = (hpagesize * numpage_per_thread);
>>> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
>>> +        page_range[i].addr = area;
>>> +        page_range[i].numpages = numpage_per_thread;
>>> +        page_range[i].hpagesize = hpagesize;
>>> +
>>> +        qemu_thread_create(page_threads + i, "touch_pages",
>>> +                           do_touch_pages, (page_range + i),
>>> +                           QEMU_THREAD_JOINABLE);
>>> +        tcount++;
>>> +        area += size_per_thread;
>>> +        numpages -= numpage_per_thread;
>>> +    }
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(area + (hpagesize * i), 0, 1);
>>> +    }
>>> +    for (i = 0; i < tcount; i++) {
>>> +        qemu_thread_join(page_threads + i);
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>>  {

Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-02-03 Thread Paolo Bonzini


On 02/02/2017 01:35, Jitendra Kolhe wrote:
>> Of course you'd still need the memset() trick if qemu was given
>> non-hugepages in combination with --mem-prealloc, as you don't
>> want to lock normal pages into ram permanently.
>>
> Given the above numbers, I think we can stick to the memset() implementation
> for both the hugepage and non-hugepage cases?

Yes, of course!

Paolo



Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-02-02 Thread Jitendra Kolhe
On 1/27/2017 6:56 PM, Daniel P. Berrange wrote:
> On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
>> Using the "-mem-prealloc" option for a very large guest leads to huge
>> guest start-up and migration times. This is because with "-mem-prealloc"
>> qemu tries to map every guest page (create the address translations)
>> and make sure the pages are available during runtime. virsh/libvirt
>> seems to use the "-mem-prealloc" option by default when the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given that the problem is
>> more prominent for large guests, the patch limits the change to guests
>> with at least 64GB of memory. Currently the change is limited to the
>> QEMU library functions on POSIX-compliant hosts only, as we are not
>> sure whether the problem exists on win32. Below are some stats with the
>> "-mem-prealloc" option for a guest configured to use huge pages.
>>
>> ----------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>> ----------------------------------------------------------------------
> 
> For comparison, what is the performance like if you replace the memset() in
> the current code with a call to mlock()?
> 

It doesn't look like we get much benefit from replacing the memset() for-loop
with a single mlock() over the entire range. Here are some numbers from my
system (a sketch of the mlock() variant follows the table).

#hugepages    | memset      | memset      | memset       | mlock (entire range)
(2M size)     | (1 thread)  | (8 threads) | (16 threads) | (1 thread)
--------------|-------------|-------------|--------------|---------------------
1048576 (2TB) | 1790661 ms  | 105577 ms   | 37331 ms     | 1789580 ms
524288 (1TB)  | 895119 ms   | 52795 ms    | 18686 ms     | 894199 ms
131072 (256G) | 173081 ms   | 9337 ms     | 4667 ms      | 172506 ms
------------------------------------------------------------------------------
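
For reference, the mlock() variant measured above amounts to roughly the
following. This is only a minimal sketch: prealloc_with_mlock() is a made-up
name, and 'area'/'memory' stand for the arguments os_mem_prealloc() already
receives; it is not the exact code used for the numbers.

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Sketch: preallocate by locking the whole mapping in one call instead of
 * touching one byte per page. */
static void prealloc_with_mlock(char *area, size_t memory)
{
    /* mlock() faults in every page of [area, area + memory)... */
    if (mlock(area, memory) < 0) {
        perror("mlock");
    }
    /* ...but it also pins the range: fine for huge pages, which are not
     * swappable anyway, and undesirable for normal pages, as noted in the
     * quoted text below. */
}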

> IIUC, huge pages are non-swappable once allocated, so it feels like
> we ought to be able to just call mlock() to preallocate them with
> no downside, rather than spawning many threads to memset() them.
> 

Yes, for me too it looks like mlock() should do the job in the case of
hugepages.

> Of course you'd still need the memset() trick if qemu was given
> non-hugepages in combination with --mem-prealloc, as you don't
> want to lock normal pages into ram permanently.
> 

Given the above numbers, I think we can stick to the memset() implementation
for both the hugepage and non-hugepage cases?

Thanks,
- Jitendra

> Regards,
> Daniel
> 



Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-30 Thread Jitendra Kolhe
On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
> * Jitendra Kolhe (jitendra.ko...@hpe.com) wrote:
>> Using the "-mem-prealloc" option for a very large guest leads to huge
>> guest start-up and migration times. This is because with "-mem-prealloc"
>> qemu tries to map every guest page (create the address translations)
>> and make sure the pages are available during runtime. virsh/libvirt
>> seems to use the "-mem-prealloc" option by default when the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given that the problem is
>> more prominent for large guests, the patch limits the change to guests
>> with at least 64GB of memory. Currently the change is limited to the
>> QEMU library functions on POSIX-compliant hosts only, as we are not
>> sure whether the problem exists on win32. Below are some stats with the
>> "-mem-prealloc" option for a guest configured to use huge pages.
>>
>> ----------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>> ----------------------------------------------------------------------
> 
> That's a nice improvement.
> 
>> Signed-off-by: Jitendra Kolhe 
>> ---
>>  util/oslib-posix.c | 64 +++---
>>  1 file changed, 61 insertions(+), 3 deletions(-)
>>
>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> index f631464..a8bd7c2 100644
>> --- a/util/oslib-posix.c
>> +++ b/util/oslib-posix.c
>> @@ -55,6 +55,13 @@
>>  #include "qemu/error-report.h"
>>  #endif
>>  
>> +#define PAGE_TOUCH_THREAD_COUNT 8
> 
> It seems a shame to fix that number as a constant.
> 

Yes, as per the comments received, we will update the patch to incorporate
the vcpu count.

>> +typedef struct {
>> +    char *addr;
>> +    uint64_t numpages;
>> +    uint64_t hpagesize;
>> +} PageRange;
>> +
>>  int qemu_get_thread_id(void)
>>  {
>>  #if defined(__linux__)
>> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>>      siglongjmp(sigjump, 1);
>>  }
>>  
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
>> +    }
>> +    qemu_thread_exit(NULL);
>> +
>> +    return NULL;
>> +}
>> +
>> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
>> +{
>> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
>> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
>> +    uint64_t numpage_per_thread, size_per_thread;
>> +    int i = 0, tcount = 0;
>> +
>> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
>> +    size_per_thread = (hpagesize * numpage_per_thread);
>> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
>> +        page_range[i].addr = area;
>> +        page_range[i].numpages = numpage_per_thread;
>> +        page_range[i].hpagesize = hpagesize;
>> +
>> +        qemu_thread_create(page_threads + i, "touch_pages",
>> +                           do_touch_pages, (page_range + i),
>> +                           QEMU_THREAD_JOINABLE);
>> +        tcount++;
>> +        area += size_per_thread;
>> +        numpages -= numpage_per_thread;
>> +    }
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(area + (hpagesize * i), 0, 1);
>> +    }
>> +    for (i = 0; i < tcount; i++) {
>> +        qemu_thread_join(page_threads + i);
>> +    }
>> +    return 0;
>> +}
>> +
>>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>  {
>>      int ret;
>> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>  size_t hpagesize = qemu_fd_getpagesiz

Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-30 Thread Jitendra Kolhe
On 1/27/2017 6:23 PM, Juan Quintela wrote:
> Jitendra Kolhe  wrote:
>> Using the "-mem-prealloc" option for a very large guest leads to huge
>> guest start-up and migration times. This is because with "-mem-prealloc"
>> qemu tries to map every guest page (create the address translations)
>> and make sure the pages are available during runtime. virsh/libvirt
>> seems to use the "-mem-prealloc" option by default when the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given that the problem is
>> more prominent for large guests, the patch limits the change to guests
>> with at least 64GB of memory. Currently the change is limited to the
>> QEMU library functions on POSIX-compliant hosts only, as we are not
>> sure whether the problem exists on win32. Below are some stats with the
>> "-mem-prealloc" option for a guest configured to use huge pages.
>>
>> ----------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>                    ^^
> 
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> ----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> ----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>                    ^
> 
> Impressive, it's not every day one gets a speedup of 20 O:-)
> 
> 
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
> 
> I would use the range->addr and similar here directly, but it is just a
> question of taste.
> 

Thanks for your response,
I will update this in my next patch.

>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
> 
> A 64GB guest already took quite a bit of time; I think I would always make it
> min(num_vcpus, 16), so that we always execute the multi-threaded
> codepath?
> 

It sounds like a good idea to have a heuristic based on the vcpu count; I will
add it in my next version. But shouldn't we also restrict it based on guest
RAM size, to avoid the overhead of spawning multiple threads for thin guests?
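
Something along the following lines is what I have in mind; just a sketch
(get_touch_thread_count() and MAX_TOUCH_THREADS are made-up names here, and
the 64GB cut-off reuses the 1ULL << 36 check from the RFC patch), not the
final version:

#include <stddef.h>
#include <stdint.h>

#define MAX_TOUCH_THREADS 16

/* Sketch: derive the number of page-touching threads from the vcpu count,
 * as suggested, but keep a single thread for guests below 64GB so that
 * thin guests do not pay the thread-spawning overhead. */
static int get_touch_thread_count(size_t memory, int smp_cpus)
{
    if (memory < (1ULL << 36)) {    /* guest RAM below 64GB */
        return 1;
    }
    return smp_cpus < MAX_TOUCH_THREADS ? smp_cpus : MAX_TOUCH_THREADS;
}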

Thanks,
- Jitendra

> But very nice, thanks.
> 
> Later, Juan.
> 



Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-27 Thread Daniel P. Berrange
On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
> Using the "-mem-prealloc" option for a very large guest leads to huge
> guest start-up and migration times. This is because with "-mem-prealloc"
> qemu tries to map every guest page (create the address translations)
> and make sure the pages are available during runtime. virsh/libvirt
> seems to use the "-mem-prealloc" option by default when the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given that the problem is
> more prominent for large guests, the patch limits the change to guests
> with at least 64GB of memory. Currently the change is limited to the
> QEMU library functions on POSIX-compliant hosts only, as we are not
> sure whether the problem exists on win32. Below are some stats with the
> "-mem-prealloc" option for a guest configured to use huge pages.
> 
> ----------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> ----------------------------------------------------------------------

For comparison, what is the performance like if you replace the memset() in
the current code with a call to mlock()?

IIUC, huge pages are non-swappable once allocated, so it feels like
we ought to be able to just call mlock() to preallocate them with
no downside, rather than spawning many threads to memset() them.

Of course you'd still need the memset() trick if qemu was given
non-hugepages in combination with --mem-prealloc, as you don't
want to lock normal pages into ram permanently.

Regards,
Daniel
-- 
|: http://berrange.com  -o-  http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org :|
|: http://entangle-photo.org  -o-  http://search.cpan.org/~danberr/ :|



Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-27 Thread Paolo Bonzini


On 27/01/2017 13:53, Juan Quintela wrote:
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
> 
> I would use the range->addr and similar here directly, but it is just a
> question of taste.
> 
>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
> 
> A 64GB guest already took quite a bit of time; I think I would always make it
> min(num_vcpus, 16), so that we always execute the multi-threaded
> codepath?

I too would like some kind of heuristic to choose the number of threads.
Juan's suggestion of using the VCPU count (smp_cpus) is a good one.

Paolo



Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-27 Thread Dr. David Alan Gilbert
* Jitendra Kolhe (jitendra.ko...@hpe.com) wrote:
> Using the "-mem-prealloc" option for a very large guest leads to huge
> guest start-up and migration times. This is because with "-mem-prealloc"
> qemu tries to map every guest page (create the address translations)
> and make sure the pages are available during runtime. virsh/libvirt
> seems to use the "-mem-prealloc" option by default when the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given that the problem is
> more prominent for large guests, the patch limits the change to guests
> with at least 64GB of memory. Currently the change is limited to the
> QEMU library functions on POSIX-compliant hosts only, as we are not
> sure whether the problem exists on win32. Below are some stats with the
> "-mem-prealloc" option for a guest configured to use huge pages.
> 
> ----------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> ----------------------------------------------------------------------

That's a nice improvement.

> Signed-off-by: Jitendra Kolhe 
> ---
>  util/oslib-posix.c | 64 +++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index f631464..a8bd7c2 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -55,6 +55,13 @@
>  #include "qemu/error-report.h"
>  #endif
>  
> +#define PAGE_TOUCH_THREAD_COUNT 8

It seems a shame to fix that number as a constant.

> +typedef struct {
> +    char *addr;
> +    uint64_t numpages;
> +    uint64_t hpagesize;
> +} PageRange;
> +
>  int qemu_get_thread_id(void)
>  {
>  #if defined(__linux__)
> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>      siglongjmp(sigjump, 1);
>  }
>  
> +static void *do_touch_pages(void *arg)
> +{
> +    PageRange *range = (PageRange *)arg;
> +    char *start_addr = range->addr;
> +    uint64_t numpages = range->numpages;
> +    uint64_t hpagesize = range->hpagesize;
> +    uint64_t i = 0;
> +
> +    for (i = 0; i < numpages; i++) {
> +        memset(start_addr + (hpagesize * i), 0, 1);
> +    }
> +    qemu_thread_exit(NULL);
> +
> +    return NULL;
> +}
> +
> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
> +{
> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
> +    uint64_t numpage_per_thread, size_per_thread;
> +    int i = 0, tcount = 0;
> +
> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
> +    size_per_thread = (hpagesize * numpage_per_thread);
> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
> +        page_range[i].addr = area;
> +        page_range[i].numpages = numpage_per_thread;
> +        page_range[i].hpagesize = hpagesize;
> +
> +        qemu_thread_create(page_threads + i, "touch_pages",
> +                           do_touch_pages, (page_range + i),
> +                           QEMU_THREAD_JOINABLE);
> +        tcount++;
> +        area += size_per_thread;
> +        numpages -= numpage_per_thread;
> +    }
> +    for (i = 0; i < numpages; i++) {
> +        memset(area + (hpagesize * i), 0, 1);
> +    }
> +    for (i = 0; i < tcount; i++) {
> +        qemu_thread_join(page_threads + i);
> +    }
> +    return 0;
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>  {
>      int ret;
> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>          size_t hpagesize = qemu_fd_getpagesize(fd);
>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>  
> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +/* touch pages simult

Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.

2017-01-27 Thread Juan Quintela
Jitendra Kolhe  wrote:
> Using the "-mem-prealloc" option for a very large guest leads to huge
> guest start-up and migration times. This is because with "-mem-prealloc"
> qemu tries to map every guest page (create the address translations)
> and make sure the pages are available during runtime. virsh/libvirt
> seems to use the "-mem-prealloc" option by default when the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given that the problem is
> more prominent for large guests, the patch limits the change to guests
> with at least 64GB of memory. Currently the change is limited to the
> QEMU library functions on POSIX-compliant hosts only, as we are not
> sure whether the problem exists on win32. Below are some stats with the
> "-mem-prealloc" option for a guest configured to use huge pages.
>
> ----------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
                    ^^

> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> ----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> ----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
                    ^

Impressive, it's not every day one gets a speedup of 20 O:-)


> +static void *do_touch_pages(void *arg)
> +{
> +    PageRange *range = (PageRange *)arg;
> +    char *start_addr = range->addr;
> +    uint64_t numpages = range->numpages;
> +    uint64_t hpagesize = range->hpagesize;
> +    uint64_t i = 0;
> +
> +    for (i = 0; i < numpages; i++) {
> +        memset(start_addr + (hpagesize * i), 0, 1);

I would use the range->addr and similar here directly, but it is just a
question of taste.

> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +        /* touch pages simultaneously for memory >= 64G */
> +        if (memory < (1ULL << 36)) {

A 64GB guest already took quite a bit of time; I think I would always make it
min(num_vcpus, 16), so that we always execute the multi-threaded
codepath?

But very nice, thanks.

Later, Juan.