Re: About the performance of hyper-v

2021-06-01 Thread Liang Li
==
> > Analyze events for all VMs, all VCPUs:
> >              VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> >
> >            MSR_WRITE     924045    89.96%    81.10%      0.42us     68.42us      1.26us ( +-   0.07% )
> >            DR_ACCESS      44669     4.35%     2.36%      0.32us     50.74us      0.76us ( +-   0.32% )
> >   EXTERNAL_INTERRUPT      29809     2.90%     6.42%      0.66us     70.75us      3.10us ( +-   0.54% )
> >               VMCALL      17819     1.73%     5.21%      0.75us     15.64us      4.20us ( +-   0.33% )
> >
> > Total Samples:1027227, Total events handled time:1436343.94us.
> > ===
> >
> > The result shows the overhead increased.  Enabling APICv can help to
> > reduce the VM exits caused by interrupt injection, but on the other
> > hand there are a lot of VM exits caused by APIC_EOI.
> >
> > When turning off Hyper-V and using KVM's APICv, there is no such
> > overhead.
>
> I think I know what's happening. We've asked Windows to use synthetic
> MSRs to access APIC (HV_APIC_ACCESS_RECOMMENDED) and this can't be
> accelerated in hardware.
>
> Could you please try the following hack (KVM):
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c8f2592ccc99..66ee85a83e9a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -145,6 +145,13 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>vcpu->arch.ia32_misc_enable_msr &
>MSR_IA32_MISC_ENABLE_MWAIT);
> }
> +
> +   /* Dirty hack: force HV_DEPRECATING_AEOI_RECOMMENDED. Not to be merged! */
> +   best = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_ENLIGHTMENT_INFO, 0);
> +   if (best) {
> +   best->eax &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
> +   best->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
> +   }
>  }
>  EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>
> > It seems turning on Hyper-V related features is not always the best
> > choice for a Windows guest.
>
> Generally it is, we'll just need to make QEMU smarter when setting
> 'recommendation' bits.
>

Hi Vitaly,

I have tried your patch and found it can help to reduce the overhead.
It works as well as setting the "" option in the libvirt XML.

=== with your patch and stimer enabled ===
Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

          APIC_WRITE     172232    78.36%    68.99%      0.70us     47.71us      1.48us ( +-   0.18% )
           DR_ACCESS      19136     8.71%     4.42%      0.55us      4.42us      0.85us ( +-   0.32% )
  EXTERNAL_INTERRUPT      15921     7.24%    13.84%      0.87us     55.28us      3.21us ( +-   0.55% )
              VMCALL       6971     3.17%    10.34%      1.16us     12.02us      5.48us ( +-   0.49% )

Total Samples:219802, Total events handled time:369310.30us.

=== with hypervisor disabled ===

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

          APIC_WRITE     200482    78.51%    68.62%      0.64us     49.51us      1.37us ( +-   0.16% )
           DR_ACCESS      24235     9.49%     4.92%      0.55us      3.65us      0.81us ( +-   0.26% )
  EXTERNAL_INTERRUPT      17084     6.69%    13.20%      0.89us     56.38us      3.09us ( +-   0.53% )
              VMCALL       7124     2.79%     9.87%      1.26us     12.39us      5.54us ( +-   0.49% )
         EOI_INDUCED       5066     1.98%     1.36%      0.66us      2.64us      1.07us ( +-   0.25% )
      IO_INSTRUCTION        591     0.23%     1.27%      3.37us    673.23us      8.59us ( +-  13.69% )

Total Samples:255363, Total events handled time:399954.27us.


Thanks!
Liang



Re: About the performance of hyper-v

2021-05-23 Thread Liang Li
> >> > Analyze events for all VMs, all VCPUs:
> >> >              VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> >> >
> >> >   EXTERNAL_INTERRUPT     471831    59.89%    68.58%      0.64us     65.42us      2.34us ( +-   0.11% )
> >> >            MSR_WRITE     238932    30.33%    23.07%      0.48us     41.05us      1.56us ( +-   0.14% )
> >> >
> >> > Total Samples:787803, Total events handled time:1611193.84us.
> >> >
> >> > I tried turning off Hyper-V for the same workload and repeated the
> >> > test; the overall virtualization overhead was reduced by about 50%:
> >> >
> >> > ---
> >> >
> >> > Analyze events for all VMs, all VCPUs:
> >> >
> >> >              VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> >> >
> >> >           APIC_WRITE     255152    74.43%    50.72%      0.49us     50.01us      1.42us ( +-   0.14% )
> >> >        EPT_MISCONFIG      39967    11.66%    40.58%      1.55us    686.05us      7.27us ( +-   0.43% )
> >> >            DR_ACCESS      35003    10.21%     4.64%      0.32us     40.03us      0.95us ( +-   0.32% )
> >> >   EXTERNAL_INTERRUPT       6622     1.93%     2.08%      0.70us     57.38us      2.25us ( +-   1.42% )
> >> >
> >> > Total Samples:342788, Total events handled time:715695.62us.
> >> >
> >> > For this scenario, Hyper-V works really badly.  stimer works better
> >> > than HPET, but on the other hand it relies on SynIC, which has
> >> > negative effects for IPI-intensive workloads.
> >> > Do you have any plans for improvement?
> >> >
> >>
> >> Hey,
> >>
> >> the above can be caused by the fact that when 'hv-synic' is enabled, KVM
> >> automatically disables APICv and this can explain the overhead and the
> >> fact that you're seeing more vmexits. KVM disables APICv because SynIC's
> >> 'AutoEOI' feature is incompatible with it. We can, however, tell Windows
> >> to not use AutoEOI ('Recommend deprecating AutoEOI' bit) and only
> >> inhibit APICv if the recommendation was ignored. This is implemented in
> >> the following KVM patch series:
> >> https://lore.kernel.org/kvm/20210518144339.1987982-1-vkuzn...@redhat.com/
> >>
> >> It will, however, require a new 'hv-something' flag to QEMU. For now, it
> >> can be tested with 'hv-passthrough'.
> >>
> >> It would be great if you could give it a spin!
> >>
> >> --
> >> Vitaly
> >
> > It's great to know that you already have a solution for this. :)
> >
> > By the way,  is there any requirement for the version of windows or
> > windows updates for the new feature to work?
>
> AFAIR, 'Recommend deprecating AutoEOI' bit appeared in WS2012 so I'd
> expect WS2008 to ignore it completely (and thus SynIC will always be
> disabling APICv for it).
>

Hi Vitaly,
  I tried your patchset and found it does not help to reduce the
virtualization overhead.
Here is some perf data with the same workload:

===
Analyze events for all VMs, all VCPUs:
             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

           MSR_WRITE     924045    89.96%    81.10%      0.42us     68.42us      1.26us ( +-   0.07% )
           DR_ACCESS      44669     4.35%     2.36%      0.32us     50.74us      0.76us ( +-   0.32% )
  EXTERNAL_INTERRUPT      29809     2.90%     6.42%      0.66us     70.75us      3.10us ( +-   0.54% )
              VMCALL      17819     1.73%     5.21%      0.75us     15.64us      4.20us ( +-   0.33% )

Total Samples:1027227, Total events handled time:1436343.94us.
===

The result shows the overhead increased.  Enabling APICv can help to
reduce the VM exits caused by interrupt injection, but on the other
hand there are a lot of VM exits caused by APIC_EOI.

When turning off Hyper-V and using KVM's APICv, there is no such
overhead. It seems turning on Hyper-V related features is not always
the best choice for a Windows guest.
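For reference, the direction of Vitaly's series linked above is to keep APICv
active and only inhibit it if the guest really ignores the AutoEOI deprecation
recommendation. A rough, illustrative sketch of that idea (not the actual
patch; the helper names are approximations of the KVM APIs of that time):

/*
 * Sketch only: inhibit APICv when the guest programs a SynIC SINT with the
 * AutoEOI flag, i.e. when it ignored HV_DEPRECATING_AEOI_RECOMMENDED,
 * instead of unconditionally whenever SynIC is enabled.
 */
static void synic_sint_update_apicv(struct kvm_vcpu *vcpu, u64 sint_value)
{
	if (sint_value & HV_SYNIC_SINT_AUTO_EOI)
		kvm_request_apicv_update(vcpu->kvm, false,
					 APICV_INHIBIT_REASON_HYPERV);
}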

Thanks!
Liang



Re: About the performance of hyper-v

2021-05-21 Thread Liang Li
> > Hi Vitaly,
> >
> > I found a case where the virtualization overhead was almost doubled
> > when turning on Hyper-V related features, compared to running without
> > any Hyper-V feature.  It happens when running a 3D game in a Windows
> > guest in a QEMU/KVM environment.
> >
> > By investigation, I found there are a lot of IPIs triggered by the
> > guest when the Hyper-V related features, including stimer, are turned
> > on.  Since APICv is turned off, at least two VM exits are needed to
> > process a single IPI.
> >
> >
> > perf stat will show something like below [recorded for 5 seconds]
> >
> > -
> >
> > Analyze events for all VMs, all VCPUs:
> >              VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> >
> >   EXTERNAL_INTERRUPT     471831    59.89%    68.58%      0.64us     65.42us      2.34us ( +-   0.11% )
> >            MSR_WRITE     238932    30.33%    23.07%      0.48us     41.05us      1.56us ( +-   0.14% )
> >
> > Total Samples:787803, Total events handled time:1611193.84us.
> >
> > I tried turning off Hyper-V for the same workload and repeated the
> > test; the overall virtualization overhead was reduced by about 50%:
> >
> > ---
> >
> > Analyze events for all VMs, all VCPUs:
> >
> >              VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> >
> >           APIC_WRITE     255152    74.43%    50.72%      0.49us     50.01us      1.42us ( +-   0.14% )
> >        EPT_MISCONFIG      39967    11.66%    40.58%      1.55us    686.05us      7.27us ( +-   0.43% )
> >            DR_ACCESS      35003    10.21%     4.64%      0.32us     40.03us      0.95us ( +-   0.32% )
> >   EXTERNAL_INTERRUPT       6622     1.93%     2.08%      0.70us     57.38us      2.25us ( +-   1.42% )
> >
> > Total Samples:342788, Total events handled time:715695.62us.
> >
> > For this scenario, Hyper-V works really badly.  stimer works better
> > than HPET, but on the other hand it relies on SynIC, which has
> > negative effects for IPI-intensive workloads.
> > Do you have any plans for improvement?
> >
>
> Hey,
>
> the above can be caused by the fact that when 'hv-synic' is enabled, KVM
> automatically disables APICv and this can explain the overhead and the
> fact that you're seeing more vmexits. KVM disables APICv because SynIC's
> 'AutoEOI' feature is incompatible with it. We can, however, tell Windows
> to not use AutoEOI ('Recommend deprecating AutoEOI' bit) and only
> inhibit APICv if the recommendation was ignored. This is implemented in
> the following KVM patch series:
> https://lore.kernel.org/kvm/20210518144339.1987982-1-vkuzn...@redhat.com/
>
> It will, however, require a new 'hv-something' flag to QEMU. For now, it
> can be tested with 'hv-passthrough'.
>
> It would be great if you could give it a spin!
>
> --
> Vitaly

It's great to know that you already have a solution for this. :)

By the way, is there any requirement for the version of Windows or
Windows updates for the new feature to work?

Thanks!

Liang



About the performance of hyper-v

2021-05-19 Thread Liang Li
[resend for missing cc]

Hi Vitaly,

I found a case where the virtualization overhead was almost doubled
when turning on Hyper-V related features, compared to running without
any Hyper-V feature.  It happens when running a 3D game in a Windows
guest in a QEMU/KVM environment.

By investigation, I found there are a lot of IPIs triggered by the
guest when the Hyper-V related features, including stimer, are turned
on.  Since APICv is turned off, at least two VM exits are needed to
process a single IPI.
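To make the cost concrete, here is a guest-side sketch assuming the guest
sends IPIs through plain x2APIC MSRs (the MSR numbers below are the
architectural x2APIC ones, not taken from this trace); with APICv off, both
WRMSRs trap to the host:

/* Illustrative only: one IPI without APICv costs at least two VM exits --
 * the ICR write on the sending vCPU and the EOI write on the receiver --
 * in addition to the external-interrupt exit that delivers it. */
#define MSR_X2APIC_EOI  0x80b
#define MSR_X2APIC_ICR  0x830

static inline void wrmsr64(unsigned int msr, unsigned long long val)
{
	asm volatile("wrmsr" :: "c"(msr), "a"((unsigned int)val),
		     "d"((unsigned int)(val >> 32)));
}

static void send_ipi(unsigned int dest_apic_id, unsigned char vector)
{
	wrmsr64(MSR_X2APIC_ICR,
		((unsigned long long)dest_apic_id << 32) | vector); /* exit #1 */
}

static void ack_ipi(void)
{
	wrmsr64(MSR_X2APIC_EOI, 0);	/* exit #2, on the receiving vCPU */
}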


perf kvm stat shows something like the following [recorded for 5 seconds]:

-

Analyze events for all VMs, all VCPUs:
             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT     471831    59.89%    68.58%      0.64us     65.42us      2.34us ( +-   0.11% )
           MSR_WRITE     238932    30.33%    23.07%      0.48us     41.05us      1.56us ( +-   0.14% )

Total Samples:787803, Total events handled time:1611193.84us.

I tried turning off Hyper-V for the same workload and repeated the
test; the overall virtualization overhead was reduced by about 50%:

---

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

          APIC_WRITE     255152    74.43%    50.72%      0.49us     50.01us      1.42us ( +-   0.14% )
       EPT_MISCONFIG      39967    11.66%    40.58%      1.55us    686.05us      7.27us ( +-   0.43% )
           DR_ACCESS      35003    10.21%     4.64%      0.32us     40.03us      0.95us ( +-   0.32% )
  EXTERNAL_INTERRUPT       6622     1.93%     2.08%      0.70us     57.38us      2.25us ( +-   1.42% )

Total Samples:342788, Total events handled time:715695.62us.

For this scenario, Hyper-V works really badly.  stimer works better
than HPET, but on the other hand it relies on SynIC, which has
negative effects for IPI-intensive workloads.
Do you have any plans for improvement?


Thanks!
Liang



Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Liang Li
> >>> +static int
> >>> +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> >>> +  struct hstate *h, unsigned int nid,
> >>> +  struct scatterlist *sgl, unsigned int *offset)
> >>> +{
> >>> + struct list_head *list = &h->hugepage_freelists[nid];
> >>> + unsigned int page_len = PAGE_SIZE << h->order;
> >>> + struct page *page, *next;
> >>> + long budget;
> >>> + int ret = 0, scan_cnt = 0;
> >>> +
> >>> + /*
> >>> +  * Perform early check, if free area is empty there is
> >>> +  * nothing to process so we can skip this free_list.
> >>> +  */
> >>> + if (list_empty(list))
> >>> + return ret;
> >>
> >> Do note that not all entries on the hugetlb free lists are free.  Reserved
> >> entries are also on the free list.  The actual number of free entries is
> >> 'h->free_huge_pages - h->resv_huge_pages'.
> >> Is the intention to process reserved pages as well as free pages?
> >
> > Yes, reserved pages were treated as 'free pages'
>
> If that is true, then this code breaks hugetlb.  hugetlb code assumes that
> h->free_huge_pages is ALWAYS >= h->resv_huge_pages.  This code would break
> that assumption.  If you really want to add support for hugetlb pages, then
> you will need to take reserved pages into account.

I didn't know that. thanks!

> P.S. There might be some confusion about 'reservations' based on the
> commit message.  My comments are directed at hugetlb reservations described
> in Documentation/vm/hugetlbfs_reserv.rst.
>
> >>> + /* Attempt to pull page from list and place in scatterlist 
> >>> */
> >>> + if (*offset) {
> >>> + isolate_free_huge_page(page, h, nid);
> >>
> >> Once a hugetlb page is isolated, it can not be used and applications that
> >> depend on hugetlb pages can start to fail.
> >> I assume that is acceptable/expected behavior.  Correct?
> >> On some systems, hugetlb pages are a precious resource and the sysadmin
> >> carefully configures the number needed by applications.  Removing a hugetlb
> >> page (even for a very short period of time) could cause serious application
> >> failure.
> >
> > That's true, especially for 1G pages. Any suggestions?
> > Let the hugepage allocator be aware of this situation and retry ?
>
> I would hate to add that complexity to the allocator.
>
> This question is likely based on my lack of understanding of virtio-balloon
> usage and this reporting mechanism.  But, why do the hugetlb pages have to
> be 'temporarily' allocated for reporting purposes?

The link here will give you more detail about how page reporting
works: https://www.kernel.org/doc/html/latest//vm/free_page_reporting.html
The virtio-balloon driver is based on this framework and will report the
free page information to QEMU; the host can then unmap the memory
region corresponding to the reported free pages and reclaim the memory
for other use, which is useful for memory overcommit.
Allocating the pages 'temporarily' before reporting is necessary; it makes
sure the guest will not use a page while the host side unmaps the region,
otherwise it would break the guest.
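For completeness, a minimal sketch of what the host side conceptually does
with a reported range (QEMU's virtio-balloon handler goes through
ram_block_discard_range(); plain madvise() here is only an approximation of
that):

#include <sys/mman.h>

/* Drop the host backing for a guest range reported as free.  The guest
 * guarantees it won't touch these pages while they are isolated; if it
 * faults them in again later, anonymous memory comes back as zero pages. */
static int discard_reported_range(void *host_addr, size_t len)
{
	return madvise(host_addr, len, MADV_DONTNEED);
}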

Now I realize we should solve this issue first; it seems adding a lock
will help.

Thanks



Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Liang Li
> > > > +   spin_lock_irq(&hugetlb_lock);
> > > > +
> > > > +   if (huge_page_order(h) > MAX_ORDER)
> > > > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > > > +   else
> > > > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> > >
> > > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > > don't even really need budget since this should probably be pulling
> > > out no more than one hugepage at a time.
> >
> > I want to disting a 2M page and 1GB page here. The order of 1GB page is 
> > greater
> > than MAX_ORDER while 2M page's order is less than MAX_ORDER.
>
> The budget here is broken. When I put the budget in page reporting it
> was so that we wouldn't try to report all of the memory in a given
> region. It is meant to hold us to no more than one pass through 1/16
> of the free memory. So essentially we will be slowly processing all of
> memory and it will take 16 calls (32 seconds) for us to process a
> system that is sitting completely idle. It is meant to pace us so we
> don't spend a ton of time doing work that will be undone, not to
> prevent us from burying a CPU which is what seems to be implied here.
>
> Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it
> in the original definition because it was how many pages we could
> scoop out at a time and then I was aiming for a 16th of that. Here you
> are arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the
> amount of work you will doo since you are using it as a multiple
> instead of a divisor.
>
> > >
> > > > +   /* loop through free list adding unreported pages to sg list */
> > > > +   list_for_each_entry_safe(page, next, list, lru) {
> > > > +   /* We are going to skip over the reported pages. */
> > > > +   if (PageReported(page)) {
> > > > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > > > +   ret = scan_cnt;
> > > > +   break;
> > > > +   }
> > > > +   continue;
> > > > +   }
> > > > +
> > >
> > > It would probably have been better to place this set before your new
> > > set. I don't see your new set necessarily being the best use for page
> > > reporting.
> >
> > I haven't really latched on to what you mean, could you explain it again?
>
> It would be better for you to spend time understanding how this patch
> set works before you go about expanding it to do other things.
> Mistakes like the budget one above kind of point out the fact that you
> don't understand how this code was supposed to work and just kind of
> shoehorned you page zeroing code onto it.
>
> It would be better to look at trying to understand this code first
> before you extend it to support your zeroing use case. So adding huge
> pages first might make more sense than trying to zero and push the
> order down. The fact is the page reporting extension should be minimal
> for huge pages since they are just passed as a scatterlist so you
> should only need to add a small bit to page_reporting.c to extend it
> to support this use case.
>
> > >
> > > > +   /*
> > > > +* If we fully consumed our budget then update our
> > > > +* state to indicate that we are requesting additional
> > > > +* processing and exit this list.
> > > > +*/
> > > > +   if (budget < 0) {
> > > > +   atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
> > > > +   next = page;
> > > > +   break;
> > > > +   }
> > > > +
> > >
> > > If budget is only ever going to be 1 then we probably could just look
> > > at making this the default case for any time we find a non-reported
> > > page.
> >
> > and here again.
>
> It comes down to the fact that the changes you made have a significant
> impact on how this is supposed to function. Reducing the scatterlist
> to a size of one makes the whole point of doing batching kind of
> pointless. Basically the code should be rewritten with the assumption
> that if you find a page you report it.
>
> The old code would batch things up because there is significant
> overhead to be addressed when going to the hypervisor to report said
> memory. Your code doesn't seem to really take anything like that into
> account and instead is using an arbitrary budget value based on the
> page size.
>
> > > > +   /* Attempt to pull page from list and place in 
> > > > scatterlist */
> > > > +   if (*offset) {
> > > > +   isolate_free_huge_page(page, h, nid);
> > > > +   /* Add page to scatter list */
> > > > +   --(*offset);
> > > > +   sg_set_page(&sgl[*offset], page, page_len, 0);
> > > > +
> > > > +   continue;
> > > > +   }
> > > > +
> > >
> > > There is no point in the 

Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> On 12/21/20 11:46 PM, Liang Li wrote:
> > Free page reporting only supports buddy pages, it can't report the
> > free pages reserved for hugetlbfs case. On the other hand, hugetlbfs
> > is a good choice for a system with a huge amount of RAM, because it
> > can help to reduce the memory management overhead and improve system
> > performance.
> > This patch add the support for reporting hugepages in the free list
> > of hugetlb, it canbe used by virtio_balloon driver for memory
> > overcommit and pre zero out free pages for speeding up memory population.
>
> My apologies as I do not follow virtio_balloon driver.  Comments from
> the hugetlb perspective.

Any comments are welcome.


> >  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > @@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
> > address, pgd_t *pgd, int fla
> >   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
> > PAGE_SHIFT);
> >  }
> >
> > +bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
>
> Looks like this always returns true.  Should it be type void?

will change in the next revision.

> > +{
> > + bool ret = true;
> > +
> > + VM_BUG_ON_PAGE(!PageHead(page), page);
> > +
> > + list_move(>lru, >hugepage_activelist);
> > + set_page_refcounted(page);
> > + h->free_huge_pages--;
> > + h->free_huge_pages_node[nid]--;
> > +
> > + return ret;
> > +}
> > +
>
> ...

> > +static void
> > +hugepage_reporting_drain(struct page_reporting_dev_info *prdev,
> > +  struct hstate *h, struct scatterlist *sgl,
> > +  unsigned int nents, bool reported)
> > +{
> > + struct scatterlist *sg = sgl;
> > +
> > + /*
> > +  * Drain the now reported pages back into their respective
> > +  * free lists/areas. We assume at least one page is populated.
> > +  */
> > + do {
> > + struct page *page = sg_page(sg);
> > +
> > + putback_isolate_huge_page(h, page);
> > +
> > + /* If the pages were not reported due to error skip flagging 
> > */
> > + if (!reported)
> > + continue;
> > +
> > + __SetPageReported(page);
> > + } while ((sg = sg_next(sg)));
> > +
> > + /* reinitialize scatterlist now that it is empty */
> > + sg_init_table(sgl, nents);
> > +}
> > +
> > +/*
> > + * The page reporting cycle consists of 4 stages, fill, report, drain, and
> > + * idle. We will cycle through the first 3 stages until we cannot obtain a
> > + * full scatterlist of pages, in that case we will switch to idle.
> > + */
>
> As mentioned, I am not familiar with virtio_balloon and the overall design.
> So, some of this does not make sense to me.
>
> > +static int
> > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > +  struct hstate *h, unsigned int nid,
> > +  struct scatterlist *sgl, unsigned int *offset)
> > +{
> > +   struct list_head *list = &h->hugepage_freelists[nid];
> > + unsigned int page_len = PAGE_SIZE << h->order;
> > + struct page *page, *next;
> > + long budget;
> > + int ret = 0, scan_cnt = 0;
> > +
> > + /*
> > +  * Perform early check, if free area is empty there is
> > +  * nothing to process so we can skip this free_list.
> > +  */
> > + if (list_empty(list))
> > + return ret;
>
> Do note that not all entries on the hugetlb free lists are free.  Reserved
> entries are also on the free list.  The actual number of free entries is
> 'h->free_huge_pages - h->resv_huge_pages'.
> Is the intention to process reserved pages as well as free pages?

Yes, reserved pages were treated as 'free pages'

> > +
> > +   spin_lock_irq(&hugetlb_lock);
> > +
> > + if (huge_page_order(h) > MAX_ORDER)
> > + budget = HUGEPAGE_REPORTING_CAPACITY;
> > + else
> > + budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> > +
> > + /* loop through free list adding unreported pages to sg list */
> > + list_for_each_entry_safe(page, next, list, lru) {
> > + /* We are going to skip over the reported pages. */
> > + if (PageReported(page)) {
> > + if (++scan_cnt >= MAX_SCAN_NUM) {
> > + 

Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> On 12/22/20 11:59 AM, Alexander Duyck wrote:
> > On Mon, Dec 21, 2020 at 11:47 PM Liang Li  
> > wrote:
> >> +
> >> +   if (huge_page_order(h) > MAX_ORDER)
> >> +   budget = HUGEPAGE_REPORTING_CAPACITY;
> >> +   else
> >> +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> >
> > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > don't even really need budget since this should probably be pulling
> > out no more than one hugepage at a time.
>
> On standard x86_64 configs, 2MB huge pages are of order 9 < MAX_ORDER (11).
> What is important for hugetlb is the largest order that can be allocated
> from buddy.  Anything bigger is considered a gigantic page and has to be
> allocated differently.
>
> If the code above is trying to distinguish between huge and gigantic pages,
> it is off by 1.  The largest order that can be allocated from the buddy is
> (MAX_ORDER - 1).  So, the check should be '>='.
>
> --
> Mike Kravetz

Yes, you're right!  Thanks.
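In other words, a small sketch of the corrected check, reusing the existing
hstate_is_gigantic() helper (which already encodes huge_page_order(h) >=
MAX_ORDER):

	/* gigantic (e.g. 1GB) pages cannot come from the buddy allocator;
	 * the largest buddy order is MAX_ORDER - 1, hence '>=' not '>' */
	if (hstate_is_gigantic(h))
		budget = HUGEPAGE_REPORTING_CAPACITY;
	else	/* e.g. 2MB pages: order 9 < MAX_ORDER (11) on x86_64 */
		budget = HUGEPAGE_REPORTING_CAPACITY * 32;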

Liang



Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Liang Li
> > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > +struct hstate *h, unsigned int nid,
> > +struct scatterlist *sgl, unsigned int *offset)
> > +{
> > +   struct list_head *list = &h->hugepage_freelists[nid];
> > +   unsigned int page_len = PAGE_SIZE << h->order;
> > +   struct page *page, *next;
> > +   long budget;
> > +   int ret = 0, scan_cnt = 0;
> > +
> > +   /*
> > +* Perform early check, if free area is empty there is
> > +* nothing to process so we can skip this free_list.
> > +*/
> > +   if (list_empty(list))
> > +   return ret;
> > +
> > +   spin_lock_irq(&hugetlb_lock);
> > +
> > +   if (huge_page_order(h) > MAX_ORDER)
> > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > +   else
> > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
>
> Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> don't even really need budget since this should probably be pulling
> out no more than one hugepage at a time.

I want to distinguish a 2MB page from a 1GB page here. The order of a 1GB
page is greater than MAX_ORDER, while a 2MB page's order is less than MAX_ORDER.

>
> > +   /* loop through free list adding unreported pages to sg list */
> > +   list_for_each_entry_safe(page, next, list, lru) {
> > +   /* We are going to skip over the reported pages. */
> > +   if (PageReported(page)) {
> > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > +   ret = scan_cnt;
> > +   break;
> > +   }
> > +   continue;
> > +   }
> > +
>
> It would probably have been better to place this set before your new
> set. I don't see your new set necessarily being the best use for page
> reporting.

I haven't really latched on to what you mean, could you explain it again?

>
> > +   /*
> > +* If we fully consumed our budget then update our
> > +* state to indicate that we are requesting additional
> > +* processing and exit this list.
> > +*/
> > +   if (budget < 0) {
> > +   atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
> > +   next = page;
> > +   break;
> > +   }
> > +
>
> If budget is only ever going to be 1 then we probably could just look
> at making this the default case for any time we find a non-reported
> page.

and here again.

> > +   /* Attempt to pull page from list and place in scatterlist 
> > */
> > +   if (*offset) {
> > +   isolate_free_huge_page(page, h, nid);
> > +   /* Add page to scatter list */
> > +   --(*offset);
> > +   sg_set_page(&sgl[*offset], page, page_len, 0);
> > +
> > +   continue;
> > +   }
> > +
>
> There is no point in the continue case if we only have a budget of 1.
> We should probably just tighten up the loop so that all it does is
> search until it finds the 1 page it can pull, pull it, and then return
> it. The scatterlist doesn't serve much purpose and could be reduced to
> just a single entry.

I will think about it more.
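Something like the following is probably the shape being suggested (a sketch
only; it reuses isolate_free_huge_page() from patch 1/3 and must run under
hugetlb_lock):

/* With a capacity of one there is no point in batching: find the first
 * unreported free hugepage on this node, isolate it and hand it back for
 * reporting.  Caller holds hugetlb_lock. */
static struct page *pull_one_unreported_hugepage(struct hstate *h, int nid)
{
	struct page *page;

	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
		if (PageReported(page))
			continue;
		isolate_free_huge_page(page, h, nid);
		return page;
	}
	return NULL;
}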

> > +static int
> > +hugepage_reporting_process_hstate(struct page_reporting_dev_info *prdev,
> > +   struct scatterlist *sgl, struct hstate *h)
> > +{
> > +   unsigned int leftover, offset = HUGEPAGE_REPORTING_CAPACITY;
> > +   int ret = 0, nid;
> > +
> > +   for (nid = 0; nid < MAX_NUMNODES; nid++) {
> > +   ret = hugepage_reporting_cycle(prdev, h, nid, sgl, &offset);
> > +
> > +   if (ret < 0)
> > +   return ret;
> > +   }
> > +
> > +   /* report the leftover pages before going idle */
> > +   leftover = HUGEPAGE_REPORTING_CAPACITY - offset;
> > +   if (leftover) {
> > +   sgl = &sgl[offset];
> > +   ret = prdev->report(prdev, sgl, leftover);
> > +
> > +   /* flush any remaining pages out from the last report */
> > +   spin_lock_irq(&hugetlb_lock);
> > +   hugepage_reporting_drain(prdev, h, sgl, leftover, !ret);
> > +   spin_unlock_irq(&hugetlb_lock);
> > +   }
> > +
> > +   return ret;
> > +}
> > +
>
> If HUGEPAGE_REPORTING_CAPACITY is 1 it would make more sense to
> rewrite this code to just optimize for a find and process a page
> approach rather than trying to batch pages.

Yes, I will make a change. Thanks for your comments!

Liang



Re: [RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-22 Thread Liang Li
> > Free page reporting in virtio-balloon doesn't give you any guarantees
> > regarding zeroing of pages. Take a look at the QEMU implementation -
> > e.g., with vfio all reports are simply ignored.
> >
> > Also, I am not sure if mangling such details ("zeroing of pages") into
> > the page reporting infrastructure is a good idea.
> >
>
> Oh, now I get what you are doing here, you rely on zero_free_pages of
> your other patch series and are not relying on virtio-balloon free page
> reporting to do the zeroing.
>
> You really should have mentioned that this patch series relies on the
> other one and in which way.

I am sorry for that. After I sent out the patch, I realized I should
mention that, so I sent out an updated version which added the
information you mentioned :)

Thanks !
Liang



Re: [RFC PATCH 2/3] virtio-balloon: add support for providing free huge page reports to host

2020-12-22 Thread Liang Li
On Tue, Dec 22, 2020 at 4:28 PM David Hildenbrand  wrote:
>
> On 22.12.20 08:48, Liang Li wrote:
> > Free page reporting only supports buddy pages, it can't report the
> > free pages reserved for hugetlbfs case. On the other hand, hugetlbfs
>
> The virtio-balloon free page reporting interface accepts a generic sg,
> so it isn't glue to buddy pages. There is no need for a new interface.

OK, then there will be two workers accessing the same vq; we can add a
lock for concurrent access.
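A minimal sketch of what that could look like (illustrative only; the lock
and helper names are made up here, but the add/kick/wait pattern is the one
already used by virtballoon_free_page_report(), with completion signalled via
balloon_ack()):

static DEFINE_MUTEX(report_vq_lock);

/* Serialize the buddy and hugetlb reporting workers on the shared
 * reporting vq; the report callbacks run in workqueue context and may
 * sleep, so a mutex is enough. */
static int report_sg_serialized(struct virtio_balloon *vb,
				struct scatterlist *sg, unsigned int nents)
{
	unsigned int unused;
	int err;

	mutex_lock(&report_vq_lock);
	err = virtqueue_add_inbuf(vb->reporting_vq, sg, nents, vb,
				  GFP_NOWAIT | __GFP_NOWARN);
	if (!err) {
		virtqueue_kick(vb->reporting_vq);
		wait_event(vb->acked,
			   virtqueue_get_buf(vb->reporting_vq, &unused));
	}
	mutex_unlock(&report_vq_lock);
	return err;
}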

Thanks!

Liang



[RFC PATCH 0/3 updated] add support for free hugepage reporting

2020-12-22 Thread Liang Li
A typical usage of hugetlbfs is to reserve an amount of memory during
kernel boot, and the reserved pages are unlikely to return to the
buddy system. When an application needs hugepages, the kernel will
allocate them from the reserved pool; when the application terminates,
the huge pages return to the reserved pool and are kept in the hugetlb
free list. These free pages will not return to the buddy freelist
unless the size of the reserved pool is changed.
Free page reporting only supports buddy pages, it can't report the
free pages reserved for hugetlbfs. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch set adds support for reporting hugepages in the free list
of hugetlb; it can be used by the virtio_balloon driver for memory
overcommit and for pre zeroing out free pages to speed up memory
population and page fault handling.

Most of the code is 'copied' from free page reporting because they
work in the same way. So the code can be refined to remove the
duplicated code. Since this is an RFC, I didn't do that.

For the virtio_balloon driver, changes to the virtio spec are needed.
Before that, I need the feedback of the community about this new feature.

This RFC is based on my previous series:
  '[RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO'

Liang Li (3):
  mm: support hugetlb free page reporting
  virtio-balloon: add support for providing free huge page reports to
host
  mm: support free hugepage pre zero out

 drivers/virtio/virtio_balloon.c |  61 ++
 include/linux/hugetlb.h |   3 +
 include/linux/page_reporting.h  |   5 +
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/hugetlb.c|  29 +++
 mm/page_prezero.c   |  17 ++
 mm/page_reporting.c | 287 
 mm/page_reporting.h |  34 
 8 files changed, 437 insertions(+)

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
-- 
2.18.2




[RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-21 Thread Liang Li
This patch adds support for pre zeroing out free hugepages; we can use
this feature to speed up page population and page fault handling.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 mm/page_prezero.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/mm/page_prezero.c b/mm/page_prezero.c
index c8ce720bfc54..dff4e0adf402 100644
--- a/mm/page_prezero.c
+++ b/mm/page_prezero.c
@@ -26,6 +26,7 @@ static unsigned long delay_millisecs = 1000;
 static unsigned long zeropage_enable __read_mostly;
 static DEFINE_MUTEX(kzeropaged_mutex);
 static struct page_reporting_dev_info zero_page_dev_info;
+static struct page_reporting_dev_info zero_hugepage_dev_info;
 
 inline void clear_zero_page_flag(struct page *page, int order)
 {
@@ -69,9 +70,17 @@ static int start_kzeropaged(void)
zero_page_dev_info.delay_jiffies = msecs_to_jiffies(delay_millisecs);
 
err = page_reporting_register(&zero_page_dev_info);
+
+   zero_hugepage_dev_info.report = zero_free_pages;
+   zero_hugepage_dev_info.mini_order = mini_page_order;
+   zero_hugepage_dev_info.batch_size = batch_size;
+   zero_hugepage_dev_info.delay_jiffies = msecs_to_jiffies(delay_millisecs);
+
+   err |= hugepage_reporting_register(&zero_hugepage_dev_info);
pr_info("Zero page enabled\n");
} else {
page_reporting_unregister(&zero_page_dev_info);
+   hugepage_reporting_unregister(&zero_hugepage_dev_info);
pr_info("Zero page disabled\n");
}
 
@@ -90,7 +99,15 @@ static int restart_kzeropaged(void)
zero_page_dev_info.batch_size = batch_size;
zero_page_dev_info.delay_jiffies = msecs_to_jiffies(delay_millisecs);
 
+   hugepage_reporting_unregister(&zero_hugepage_dev_info);
+
+   zero_hugepage_dev_info.report = zero_free_pages;
+   zero_hugepage_dev_info.mini_order = mini_page_order;
+   zero_hugepage_dev_info.batch_size = batch_size;
+   zero_hugepage_dev_info.delay_jiffies = msecs_to_jiffies(delay_millisecs);
+
err = page_reporting_register(&zero_page_dev_info);
+   err |= hugepage_reporting_register(&zero_hugepage_dev_info);
pr_info("Zero page enabled\n");
}
 
-- 
2.18.2




[RFC PATCH 2/3] virtio-balloon: add support for providing free huge page reports to host

2020-12-21 Thread Liang Li
Free page reporting only supports buddy pages; it can't report the
free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.  This patch adds support for reporting free hugepages to
the host when the guest uses hugetlbfs.
A new feature bit and a new vq are added for this new feature.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 drivers/virtio/virtio_balloon.c | 61 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 62 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index a298517079bb..61363dfd3c2d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -52,6 +52,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+   VIRTIO_BALLOON_VQ_HPG_REPORTING,
VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -126,6 +127,10 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+   /* Free hugepage reporting device */
+   struct virtqueue *hpg_reporting_vq;
+   struct page_reporting_dev_info hpr_dev_info;
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -192,6 +197,33 @@ static int virtballoon_free_page_report(struct 
page_reporting_dev_info *pr_dev_i
return 0;
 }
 
+static int virtballoon_free_hugepage_report(struct page_reporting_dev_info *hpr_dev_info,
+  struct scatterlist *sg, unsigned int nents)
+{
+   struct virtio_balloon *vb =
+   container_of(hpr_dev_info, struct virtio_balloon, hpr_dev_info);
+   struct virtqueue *vq = vb->hpg_reporting_vq;
+   unsigned int unused, err;
+
+   /* We should always be able to add these buffers to an empty queue. */
+   err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
+
+   /*
+* In the extremely unlikely case that something has occurred and we
+* are able to trigger an error we will simply display a warning
+* and exit without actually processing the pages.
+*/
+   if (WARN_ON_ONCE(err))
+   return err;
+
+   virtqueue_kick(vq);
+
+   /* When host has read buffer, this completes via balloon_ack */
+   wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+
+   return 0;
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
  __virtio32 pfns[], struct page *page)
 {
@@ -515,6 +547,7 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
+   names[VIRTIO_BALLOON_VQ_HPG_REPORTING] = NULL;
 
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -531,6 +564,11 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
}
 
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING)) {
+   names[VIRTIO_BALLOON_VQ_HPG_REPORTING] = "hpg_reporting_vq";
+   callbacks[VIRTIO_BALLOON_VQ_HPG_REPORTING] = balloon_ack;
+   }
+
err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 vqs, callbacks, names, NULL, NULL);
if (err)
@@ -566,6 +604,8 @@ static int init_vqs(struct virtio_balloon *vb)
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
 
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING))
+   vb->hpg_reporting_vq = vqs[VIRTIO_BALLOON_VQ_HPG_REPORTING];
return 0;
 }
 
@@ -1001,6 +1041,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
+   vb->hpr_dev_info.report = virtballoon_free_hugepage_report;
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HPG_REPORTING)) {
+   unsigned int capacity;
+
+   capacity = virtqueue_get_vring_size(vb->hpg_reporting_vq);
+   if (capacity < PAGE_REPORTING_CAPACITY) {
+   err = -ENOSPC;
+   goto out_unregister_oom;
+   }
+
+   vb->hpr_dev_info.mini_order = 0;
+   vb-

[RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-21 Thread Liang Li
Free page reporting only supports buddy pages; it can't report the
free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch adds support for reporting hugepages in the free list
of hugetlb; it can be used by the virtio_balloon driver for memory
overcommit and for pre zeroing out free pages to speed up memory population.

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
Signed-off-by: Liang Li 
---
 include/linux/hugetlb.h|   3 +
 include/linux/page_reporting.h |   5 +
 mm/hugetlb.c   |  29 
 mm/page_reporting.c| 287 +
 mm/page_reporting.h|  34 
 5 files changed, 358 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..a72ad25501d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct ctl_table;
 struct user_struct;
@@ -114,6 +115,8 @@ int hugetlb_treat_movable_handler(struct ctl_table *, int, 
void *, size_t *,
 int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
 
+bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid);
+void putback_isolate_huge_page(struct hstate *h, struct page *page);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 struct page **, struct vm_area_struct **,
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index 63e1e9fbcaa2..0da3d1a6f0cc 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -7,6 +7,7 @@
 
 /* This value should always be a power of 2, see page_reporting_cycle() */
 #define PAGE_REPORTING_CAPACITY32
+#define HUGEPAGE_REPORTING_CAPACITY1
 
 struct page_reporting_dev_info {
/* function that alters pages to make them "reported" */
@@ -26,4 +27,8 @@ struct page_reporting_dev_info {
 /* Tear-down and bring-up for page reporting devices */
 void page_reporting_unregister(struct page_reporting_dev_info *prdev);
 int page_reporting_register(struct page_reporting_dev_info *prdev);
+
+/* Tear-down and bring-up for hugepage reporting devices */
+void hugepage_reporting_unregister(struct page_reporting_dev_info *prdev);
+int hugepage_reporting_register(struct page_reporting_dev_info *prdev);
 #endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cbf32d2824fd..de6ce147dfe2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include "page_reporting.h"
 #include "internal.h"
 
 int hugetlb_max_hstate __read_mostly;
@@ -1028,6 +1029,11 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
+   if (hugepage_reported(page)) {
+   __ClearPageReported(page);
+   pr_info("%s, free_huge_pages=%ld\n", __func__, h->free_huge_pages);
+   }
+   hugepage_reporting_notify_free(h->order);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int fla
return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
 }
 
+bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
+{
+   bool ret = true;
+
+   VM_BUG_ON_PAGE(!PageHead(page), page);
+
+   list_move(&page->lru, &h->hugepage_activelist);
+   set_page_refcounted(page);
+   h->free_huge_pages--;
+   h->free_huge_pages_node[nid]--;
+
+   return ret;
+}
+
+void putback_isolate_huge_page(struct hstate *h, struct page *page)
+{
+   int nid = page_to_nid(page);
+   pr_info("%s, free_huge_pages=%ld\n", __func__, h->free_huge_pages);
+   list_move(&page->lru, &h->hugepage_freelists[nid]);
+   h->free_huge_pages++;
+   h->free_huge_pages_node[nid]++;
+}
+
 bool isolate_huge_page(struct page *page, struct list_head *list)
 {
bool ret = true;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 20ec3fb1afc4..15d4b5372df8 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "page_reporting.h"
 #include "internal.h"
@@ -16,6 +17

[RFC PATCH 0/3] add support for free hugepage reporting

2020-12-21 Thread Liang Li
A typical usage of hugetlbfs is to reserve an amount of memory during
the kernel boot stage, and the reserved pages are unlikely to return
to the buddy system. When an application needs hugepages, the kernel
will allocate them from the reserved pool; when the application
terminates, the huge pages return to the reserved pool and are kept in
the hugetlb free list. These free pages will not return to the buddy
freelist unless the size of the reserved pool is changed.
Free page reporting only supports buddy pages, it can't report the
free pages reserved for hugetlbfs. On the other hand, hugetlbfs
is a good choice for a system with a huge amount of RAM, because it
can help to reduce the memory management overhead and improve system
performance.
This patch set adds support for reporting hugepages in the free list
of hugetlb; it can be used by the virtio_balloon driver for memory
overcommit and for pre zeroing out free pages to speed up memory
population and page fault handling.

Most of the code is 'copied' from free page reporting because they
work in the same way. So the code can be refined to remove the
duplicated code. Since this is an RFC, I didn't do that.

For the virtio_balloon driver, changes to the virtio spec are needed.
Before that, I need the feedback of the community about this new feature.

Liang Li (3):
  mm: support hugetlb free page reporting
  virtio-balloon: add support for providing free huge page reports to
host
  mm: support free hugepage pre zero out

 drivers/virtio/virtio_balloon.c |  61 ++
 include/linux/hugetlb.h |   3 +
 include/linux/page_reporting.h  |   5 +
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/hugetlb.c|  29 +++
 mm/page_prezero.c   |  17 ++
 mm/page_reporting.c | 287 
 mm/page_reporting.h |  34 
 8 files changed, 437 insertions(+)

Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand   
Cc: Michal Hocko  
Cc: Andrew Morton 
Cc: Alex Williamson 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Mike Kravetz 
Cc: Liang Li 
-- 
2.18.2




Re: [Qemu-devel] [PATCH 0/2] buffer and delay backup COW write operation

2019-05-05 Thread Liang Li
On Tue, Apr 30, 2019 at 10:35:32AM +, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2019 13:01, Liang Li wrote:
> > If the backup target is a slow device like ceph rbd, the backup
> > process will affect guest BLK write IO performance seriously,
> > it's cause by the drawback of COW mechanism, if guest overwrite the
> > backup BLK area, the IO can only be processed after the data has
> > been written to backup target.
> > The impact can be relieved by buffering data read from backup
> > source and writing to backup target later, so the guest BLK write
> > IO can be processed in time.
> > Data area with no overwrite will be process like before without
> > buffering, in most case, we don't need a very large buffer.
> > 
> > An fio test was done when the backup was going on, the test resut
> > show a obvious performance improvement by buffering.
> 
> Hi Liang!
> 
> Good thing. Something like this I've briefly mentioned in my KVM Forum 2018
> report as "RAM Cache", and I'd really prefer this functionality to be a 
> separate
> filter, instead of complication of backup code. Further more, write notifiers
> will go away from backup code, after my backup-top series merged.
> 
> v5: https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg06211.html
> and separated preparing refactoring v7: 
> https://lists.gnu.org/archive/html/qemu-devel/2019-04/msg04813.html
> 
> RAM Cache should be a filter driver, with an in-memory buffer(s) for data 
> written to it
> and with ability to flush data to underlying backing file.
> 
> Also, here is another approach for the problem, which helps if guest writing 
> activity
> is really high and long and buffer will be filled and performance will 
> decrease anyway:
> 
> 1. Create local temporary image, and COWs will go to it. (previously 
> considered on list, that we should call
> these backup operations issued by guest writes CBW = copy-before-write, as 
> copy-on-write
> is generally another thing, and using this term in backup is confusing).
> 
> 2. We also set original disk as a backing for temporary image, and start 
> another backup from
> temporary to real target.
> 
> This scheme is almost possible now, you need to start backup(sync=none) from 
> source to temp,
> to do [1]. Some patches are still needed to allow such scheme. I didn't send 
> them, as I want
> my other backup patches go first anyway. But I can. On the other hand if 
> approach with in-memory
> buffer works for you it may be better.
> 
> Also, I'm not sure for now, should we really do this thing through two backup 
> jobs, or we just
> need one separate backup-top filter and one backup job without filter, or we 
> need an additional
> parameter for backup job to set cache-block-node.
> 

Hi Vladimir,

   Thanks for your valuable information. I didn't notice that you were already
working on this, so my patch will conflict with your work. We thought about
approach [2] and gave it up because it would affect local storage performance.
   I have read your slides from KVM Forum 2018 and the related patches; your
solution can help to solve the issues in backup. I am not sure whether the
"RAM cache" is a qcow2 file in RAM? If so, will your implementation free the
RAM space occupied by BLK data once it has been written to the far target in
time? Or do we need a large cache to make things work?
   Two backup jobs seem complex and not user friendly; is it possible to make
my patch cowork with CBW?

Liang



[Qemu-devel] [PATCH 1/2] backup: buffer COW request and delay the write operation

2019-04-28 Thread Liang Li
If the backup target is a slow device like Ceph RBD, the backup
process will seriously affect guest BLK write IO performance.
This is caused by the drawback of the COW mechanism: if the guest
overwrites a not-yet-copied BLK area, the IO can only be processed
after the old data has been written to the backup target.
The impact can be relieved by buffering the data read from the backup
source and writing it to the backup target later, so the guest BLK
write IO can be processed in time.
Data areas with no overwrite will be processed as before, without
buffering; in most cases we don't need a very large buffer.

An fio test was done while the backup was going on; the test result
shows an obvious performance improvement from buffering.

Test result(1GB buffer):

fio setting:
[random-writers]
ioengine=libaio
iodepth=8
rw=randwrite
bs=32k
direct=1
size=1G
numjobs=1

result:
                     IOPS    AVG latency
       no backup:    19389   410 us
          backup:     1402   5702 us
backup w/ buffer:     8684   918 us
==

Cc: John Snow 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Wen Congyang 
Cc: Xie Changlong 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Fam Zheng 
Signed-off-by: Liang Li 
---
 block/backup.c | 117 ++---
 1 file changed, 104 insertions(+), 13 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 9988753249..d436f9e4ee 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -35,6 +35,12 @@ typedef struct CowRequest {
 CoQueue wait_queue; /* coroutines blocked on this request */
 } CowRequest;
 
+typedef struct CowBufReq {
+int64_t offset;
+struct iovec iov;
+QTAILQ_ENTRY(CowBufReq) next;
+} CowBufReq;
+
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
@@ -56,9 +62,14 @@ typedef struct BackupBlockJob {
 int64_t copy_range_size;
 
 bool serialize_target_writes;
+QTAILQ_HEAD(, CowBufReq) buf_reqs;
+int64_t cow_buf_used;
+int64_t cow_buf_size;
+int64_t buf_cow_total;
 } BackupBlockJob;
 
 static const BlockJobDriver backup_job_driver;
+static bool coroutine_fn yield_and_check(BackupBlockJob *job);
 
 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
@@ -97,6 +108,46 @@ static void cow_request_end(CowRequest *req)
 qemu_co_queue_restart_all(&req->wait_queue);
 }
 
+static int write_buffer_reqs(BackupBlockJob *job, bool *error_is_read)
+{
+int ret = 0;
+CowBufReq *req, *next_req;
+QEMUIOVector qiov;
+
+QTAILQ_FOREACH_SAFE(req, &job->buf_reqs, next, next_req) {
+if (req->iov.iov_base == NULL) {
+ret = blk_co_pwrite_zeroes(job->target, req->offset,
+   req->iov.iov_len, BDRV_REQ_MAY_UNMAP);
+} else {
+qemu_iovec_init_external(&qiov, &req->iov, 1);
+ret = blk_co_pwritev(job->target, req->offset,
+ req->iov.iov_len, &qiov,
+ job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+}
+if (ret < 0) {
+trace_backup_do_cow_write_fail(job, req->offset, ret);
+if (error_is_read) {
+*error_is_read = false;
+}
+ret = -1;
+break;
+}
+job_progress_update(&job->common.job, req->iov.iov_len);
+QTAILQ_REMOVE(&job->buf_reqs, req, next);
+if (req->iov.iov_base) {
+job->cow_buf_used -= job->cluster_size;
+assert(job->cow_buf_used >= 0);
+g_free(req->iov.iov_base);
+}
+g_free(req);
+if (yield_and_check(job)) {
+break;
+}
+}
+
+return ret;
+}
+
 /* Copy range to target with a bounce buffer and return the bytes copied. If
  * error occurred, return a negative error number */
 static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
@@ -129,20 +180,35 @@ static int coroutine_fn 
backup_cow_with_bounce_buffer(BackupBlockJob *job,
 goto fail;
 }
 
-if (qemu_iovec_is_zero(&qiov)) {
-ret = blk_co_pwrite_zeroes(job->target, start,
-   qiov.size, write_flags | BDRV_REQ_MAY_UNMAP);
+if (is_write_notifier &&
+job->cow_buf_used <= job->cow_buf_size - job->cluster_size) {
+CowBufReq *cow_req = g_malloc0(sizeof(CowBufReq));
+cow_req->offset = start;
+cow_req->iov = *qiov.iov;
+if (qemu_iovec_is_zero(&qiov)) {
+cow_req->iov.iov_base = NULL;
+} else {
+job->cow_buf_used += job->cluster_size;
+*bounce_buffer = NULL;
+}
+QTAILQ_INSERT_TAIL(&job->buf_reqs, cow_req, next);
+job->buf_cow_total++;
 } else {
-ret = blk_co_pwritev(job->tar

[Qemu-devel] [PATCH 2/2] qapi: add interface for setting backup cow buffer size

2019-04-28 Thread Liang Li
Cc: John Snow 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Wen Congyang 
Cc: Xie Changlong 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Fam Zheng 
Signed-off-by: Liang Li 
---
 block/backup.c| 3 ++-
 block/replication.c   | 2 +-
 blockdev.c| 5 +
 include/block/block_int.h | 2 ++
 qapi/block-core.json  | 5 +
 5 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index d436f9e4ee..9a04003968 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -652,6 +652,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
   MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
   bool compress,
+  int buf_size,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
   int creation_flags,
@@ -748,7 +749,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 job->sync_bitmap = sync_mode == MIRROR_SYNC_MODE_INCREMENTAL ?
sync_bitmap : NULL;
 job->compress = compress;
-job->cow_buf_size = 0;
+job->cow_buf_size = buf_size;
 
 /* Detect image-fleecing (and similar) schemes */
 job->serialize_target_writes = bdrv_chain_contains(target, bs);
diff --git a/block/replication.c b/block/replication.c
index 3d4dedddfc..5ec6911355 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -540,7 +540,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
 
 job = backup_job_create(NULL, s->secondary_disk->bs, 
s->hidden_disk->bs,
-0, MIRROR_SYNC_MODE_NONE, NULL, false,
+0, MIRROR_SYNC_MODE_NONE, NULL, false, 0,
 BLOCKDEV_ON_ERROR_REPORT,
 BLOCKDEV_ON_ERROR_REPORT, JOB_INTERNAL,
 backup_job_completed, bs, NULL, _err);
diff --git a/blockdev.c b/blockdev.c
index 79fbac8450..15d96fe25c 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3449,6 +3449,9 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 if (!backup->has_compress) {
 backup->compress = false;
 }
+if (!backup->has_buffer) {
+backup->buffer = 0;
+}
 
 bs = qmp_get_root_bs(backup->device, errp);
 if (!bs) {
@@ -3550,6 +3553,7 @@ static BlockJob *do_drive_backup(DriveBackup *backup, 
JobTxn *txn,
 
 job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
 backup->sync, bmap, backup->compress,
+backup->buffer,
 backup->on_source_error, backup->on_target_error,
 job_flags, NULL, NULL, txn, _err);
 bdrv_unref(target_bs);
@@ -3660,6 +3664,7 @@ BlockJob *do_blockdev_backup(BlockdevBackup *backup, 
JobTxn *txn,
 }
 job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
 backup->sync, bmap, backup->compress,
+backup->buffer,
 backup->on_source_error, backup->on_target_error,
 job_flags, NULL, NULL, txn, _err);
 if (local_err != NULL) {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 01e855a066..17c7f26b84 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1137,6 +1137,7 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
  * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
  * @sync_mode: What parts of the disk image should be copied to the 
destination.
  * @sync_bitmap: The dirty bitmap if sync_mode is MIRROR_SYNC_MODE_INCREMENTAL.
+ * @buffer: Size of buffer used to save data for delayed writing.
  * @on_source_error: The action to take upon error reading from the source.
  * @on_target_error: The action to take upon error writing to the target.
  * @creation_flags: Flags that control the behavior of the Job lifetime.
@@ -1153,6 +1154,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 MirrorSyncMode sync_mode,
 BdrvDirtyBitmap *sync_bitmap,
 bool compress,
+int buffer,
 BlockdevOnError on_source_error,
 BlockdevOnError on_target_error,
 int creation_flags,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..726c04c02a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1377,6 +1377,7 @@
 '*format': 'str', 'sync': 'MirrorSyncMod

[Qemu-devel] [PATCH 0/2] buffer and delay backup COW write operation

2019-04-28 Thread Liang Li
If the backup target is a slow device like ceph rbd, the backup
process will seriously affect guest BLK write IO performance. This is
caused by a drawback of the COW mechanism: if the guest overwrites a
not-yet-backed-up BLK area, the guest IO can only be completed after
the COW data has been written to the backup target.
The impact can be relieved by buffering the data read from the backup
source and writing it to the backup target later, so the guest BLK
write IO can be completed in time.
Data areas that are never overwritten are processed as before, without
buffering, and in most cases we don't need a very large buffer.
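For example, assuming the default 64 KiB backup cluster size (an
assumption here; the actual cluster size depends on the target), a 1 GB
buffer holds roughly 1 GiB / 64 KiB = 16384 delayed COW clusters, which
is usually enough to absorb a burst of guest overwrites.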

An fio test was run in the guest while the backup was going on; the
result shows an obvious performance improvement from buffering.

Test result(1GB buffer):

fio setting:
[random-writers]
ioengine=libaio
iodepth=8
rw=randwrite
bs=32k
direct=1
size=1G
numjobs=1

result:
                    IOPS    AVG latency
       no backup:  19389    410 us
          backup:   1402    5702 us
backup w/ buffer:   8684    918 us
==

Cc: John Snow 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Wen Congyang 
Cc: Xie Changlong 
Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Fam Zheng 

Liang Li (2):
  backup: buffer COW request and delay the write operation
  qapi: add interface for setting backup cow buffer size

 block/backup.c| 118 +-
 block/replication.c   |   2 +-
 blockdev.c|   5 ++
 include/block/block_int.h |   2 +
 qapi/block-core.json  |   5 ++
 5 files changed, 118 insertions(+), 14 deletions(-)

-- 
2.14.1




Re: [Qemu-devel] [PATCH] vhost-user: fix qemu crash caused by failed backend

2018-10-29 Thread Liang Li
On Tue, Oct 02, 2018 at 01:54:25PM +0400, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Sep 27, 2018 at 7:37 PM Liang Li  wrote:
> >
> > During live migration, when stopping vhost-user device, 'vhost_dev_stop'
> > will be called, 'vhost_dev_stop' will call a batch of 'vhost_user_read'
> > and 'vhost_user_write'. If a previous 'vhost_user_read' or 
> > 'vhost_user_write'
> > failed because the vhost user backend failed, the 'CHR_EVENT_CLOSED' event
> > will be triggerd, followed by the call chain 
> > chr_closed_bh()->vhost_user_stop()->
> > vhost_net_cleanup()->vhost_dev_cleanup()
> >
> > vhost_dev_cleanup will clear vhost_dev struct, so the later 
> > 'vhost_user_read'
> > or 'vhost_user_read' will reference null pointer and cause qemu crash
> 
> Do you have a backtrace to help understand the issue?
> thanks
> 

Sorry for the late response.

Yes, I have, but it's a backtrace from qemu-kvm-2.10.
We found this issue during a pressure test; it was triggered by a buggy
ovs-dpdk backend, with an ovs-dpdk coredump followed by a qemu coredump.

the backtrace is like bellow:

 ==
0  0x7f0af85ea069 in vhost_user_read (msg=msg@entry=0x7f0a2a4b5300, 
dev=0x7f0afaee0340) at /usr/src/debug/qemu-2.10.0/hw/virtio/vhost-user.c:139
1  0x7f0af85ea2df in vhost_user_get_vring_base (dev=0x7f0afaee0340, 
ring=0x7f0a2a4b5450) at /usr/src/debug/qemu-2.10.0/hw/virtio/vhost-user.c:458
2  0x7f0af85e715e in vhost_virtqueue_stop 
(dev=dev@entry=0x7f0afaee0340, vdev=vdev@entry=0x7f0afcba0170, 
vq=0x7f0afaee05d0, idx=1)
at /usr/src/debug/qemu-2.10.0/hw/virtio/vhost.c:1138
3  0x7f0af85e8e24 in vhost_dev_stop (hdev=hdev@entry=0x7f0afaee0340, 
vdev=vdev@entry=0x7f0afcba0170) at 
/usr/src/debug/qemu-2.10.0/hw/virtio/vhost.c:1601
4  0x7f0af85d1418 in vhost_net_stop_one (net=0x7f0afaee0340, 
dev=0x7f0afcba0170) at /usr/src/debug/qemu-2.10.0/hw/net/vhost_net.c:289
5  0x7f0af85d191b in vhost_net_stop (dev=dev@entry=0x7f0afcba0170, 
ncs=, total_queues=total_queues@entry=1) at 
/usr/src/debug/qemu-2.10.0/hw/net/vhost_net.c:3
6  0x7f0af85ceba6 in virtio_net_set_status (status=, 
n=0x7f0afcba0170) at /usr/src/debug/qemu-2.10.0/hw/net/virtio-net.c:180
7  0x7f0af85ceba6 in virtio_net_set_status (vdev=0x7f0afcba0170, 
status=15 '\017') at /usr/src/debug/qemu-2.10.0/hw/net/virtio-net.c:254
8  0x7f0af85e0f2c in virtio_set_status (vdev=0x7f0afcba0170, 
val=) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:1147
9  0x7f0af866dce2 in vm_state_notify (running=running@entry=0, 
state=state@entry=RUN_STATE_FINISH_MIGRATE) at vl.c:1623
10 0x7f0af858f11a in do_vm_stop (state=RUN_STATE_FINISH_MIGRATE, 
send_stop=send_stop@entry=true) at /usr/src/debug/qemu-2.10.0/cpus.c:941
11 0x7f0af858f159 in vm_stop (state=) at 
/usr/src/debug/qemu-2.10.0/cpus.c:1818
12 0x7f0af858f296 in vm_stop_force_state 
(state=state@entry=RUN_STATE_FINISH_MIGRATE) at 
/usr/src/debug/qemu-2.10.0/cpus.c:1868
13 0x7f0af87551d7 in migration_thread (start_time=, 
old_vm_running=, current_active_state=4, s=0x7f0afaf00500)
at migration/migration.c:1956
14 0x7f0af87551d7 in migration_thread (opaque=0x7f0afaf00500) at 
migration/migration.c:2129
15 0x7f0af217fdc5 in start_thread () at /lib64/libpthread.so.0
16 0x7f0af1eae73d in clone () at /lib64/libc.so.6

(gdb) l
134 }
135
136 static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
137 {
138 struct vhost_user *u = dev->opaque;
139 CharBackend *chr = u->chr;
140 uint8_t *p = (uint8_t *) msg;
141 int r, size = VHOST_USER_HDR_SIZE;
142
143 r = qemu_chr_fe_read_all(chr, p, size);
144 if (r != size) {
145 error_report("Failed to read msg header. Read %d instead of %d."
146  " Original request %d.", r, size, msg->request);
147 goto fail;
148 }
149
150 /* validate received flags */
151 if (msg->flags != (VHOST_USER_REPLY_MASK | VHOST_USER_VERSION)) {
152 error_report("Failed to read msg header."
153 " Flags 0x%x instead of 0x%x.", msg->flags,
(gdb) p u
$1 = (struct vhost_user *) 0x0
(gdb) p dev
$2 = (struct vhost_dev *) 0x7f0afaee0340
(gdb) p *dev
$3 = {vdev = 0x0, memory_listener = {begin = 0x0, commit = 0x0, region_add 
= 0x0, region_del = 0x0, region_nop = 0x0, log_start = 0x0, log_stop = 0x0, 
log_sync = 0x0,
log_global_start = 0x0, log_global_stop = 0x0, eventfd_add = 0x0, 
eventfd_del = 0x0, coalesced_mmio_add = 0x0, coalesced_mmio_del = 0x0, priority 
= 0, address_space = 0x0,
link = {tqe_next = 0x0, tqe_prev = 0x0}, link_as = {tqe_n

[Qemu-devel] [PATCH] migration: fix concurrent call of multifd_save_cleanup

2018-10-29 Thread Liang Li
Concurrent call of multifd_save_cleanup() is unsafe, it will lead to
null pointer dereference. 'multifd_save_cleanup()' should not be called
in multifd_new_send_channel_async(), move it to ram_save_cleanup() like
other features do.

Signed-off-by: Liang Li 
---
 migration/migration.c | 5 -
 migration/ram.c   | 7 +++
 migration/ram.h   | 2 +-
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 8b36e7f..f422218 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1372,7 +1372,6 @@ static void migrate_fd_cleanup(void *opaque)
 qemu_savevm_state_cleanup();
 
 if (s->to_dst_file) {
-Error *local_err = NULL;
 QEMUFile *tmp;
 
 trace_migrate_fd_cleanup();
@@ -1382,10 +1381,6 @@ static void migrate_fd_cleanup(void *opaque)
 s->migration_thread_running = false;
 }
 qemu_mutex_lock_iothread();
-
-if (multifd_save_cleanup(&local_err) != 0) {
-error_report_err(local_err);
-}
 qemu_mutex_lock(&s->qemu_file_lock);
 tmp = s->to_dst_file;
 s->to_dst_file = NULL;
diff --git a/migration/ram.c b/migration/ram.c
index 7e7deec..a232b9c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -917,7 +917,7 @@ static void multifd_send_terminate_threads(Error *err)
 }
 }
 
-int multifd_save_cleanup(Error **errp)
+int multifd_save_cleanup(void)
 {
 int i;
 int ret = 0;
@@ -1071,9 +1071,7 @@ static void multifd_new_send_channel_async(QIOTask *task, 
gpointer opaque)
 Error *local_err = NULL;
 
 if (qio_task_propagate_error(task, &local_err)) {
-if (multifd_save_cleanup(&local_err) != 0) {
-migrate_set_error(migrate_get_current(), local_err);
-}
+migrate_set_error(migrate_get_current(), local_err);
 } else {
 p->c = QIO_CHANNEL(sioc);
 qio_channel_set_delay(p->c, false);
@@ -2542,6 +2540,7 @@ static void ram_save_cleanup(void *opaque)
 
 xbzrle_cleanup();
 compress_threads_save_cleanup();
+multifd_save_cleanup();
 ram_state_cleanup(rsp);
 }
 
diff --git a/migration/ram.h b/migration/ram.h
index 83ff1bc..c4fafea 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -43,7 +43,7 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_total(void);
 
 int multifd_save_setup(void);
-int multifd_save_cleanup(Error **errp);
+int multifd_save_cleanup(void);
 int multifd_load_setup(void);
 int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
-- 
1.8.3.1




[Qemu-devel] [PATCH] vhost-user: fix qemu crash caused by failed backend

2018-09-27 Thread Liang Li
During live migration, when stopping a vhost-user device, 'vhost_dev_stop'
is called, and 'vhost_dev_stop' issues a batch of 'vhost_user_read' and
'vhost_user_write' calls. If a previous 'vhost_user_read' or
'vhost_user_write' failed because the vhost-user backend died, the
'CHR_EVENT_CLOSED' event is triggered, followed by the call chain
chr_closed_bh()->vhost_user_stop()->vhost_net_cleanup()->vhost_dev_cleanup().

vhost_dev_cleanup() clears the vhost_dev struct, so a later 'vhost_user_read'
or 'vhost_user_write' will dereference a null pointer and crash qemu.

Signed-off-by: Liang Li 
---
 hw/net/vhost_net.c|  6 ++
 hw/virtio/vhost-user.c| 15 +--
 include/hw/virtio/vhost.h |  1 +
 include/net/vhost_net.h   |  1 +
 net/vhost-user.c  |  3 +++
 5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e037db6..77994e9 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -113,6 +113,11 @@ uint64_t vhost_net_get_features(struct vhost_net *net, 
uint64_t features)
 features);
 }
 
+void vhost_net_mark_break_down(struct vhost_net *net)
+{
+net->dev.break_down = true;
+}
+
 void vhost_net_ack_features(struct vhost_net *net, uint64_t features)
 {
 net->dev.acked_features = net->dev.backend_features;
@@ -156,6 +161,7 @@ struct vhost_net *vhost_net_init(VhostNetOptions *options)
 net->dev.max_queues = 1;
 net->dev.nvqs = 2;
 net->dev.vqs = net->vqs;
+net->dev.break_down = false;
 
 if (backend_kernel) {
 r = vhost_net_get_fd(options->net_backend);
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index b041343..1394719 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -213,14 +213,20 @@ static bool ioeventfd_enabled(void)
 static int vhost_user_read(struct vhost_dev *dev, VhostUserMsg *msg)
 {
 struct vhost_user *u = dev->opaque;
-CharBackend *chr = u->user->chr;
+CharBackend *chr;
 uint8_t *p = (uint8_t *) msg;
 int r, size = VHOST_USER_HDR_SIZE;
 
+if (dev->break_down) {
+goto fail;
+}
+
+chr = u->user->chr;
 r = qemu_chr_fe_read_all(chr, p, size);
 if (r != size) {
 error_report("Failed to read msg header. Read %d instead of %d."
  " Original request %d.", r, size, msg->hdr.request);
+dev->break_down = true;
 goto fail;
 }
 
@@ -299,9 +305,12 @@ static int vhost_user_write(struct vhost_dev *dev, 
VhostUserMsg *msg,
 int *fds, int fd_num)
 {
 struct vhost_user *u = dev->opaque;
-CharBackend *chr = u->user->chr;
+CharBackend *chr;
 int ret, size = VHOST_USER_HDR_SIZE + msg->hdr.size;
 
+if (dev->break_down) {
+return -1;
+}
 /*
  * For non-vring specific requests, like VHOST_USER_SET_MEM_TABLE,
  * we just need send it once in the first time. For later such
@@ -312,6 +321,7 @@ static int vhost_user_write(struct vhost_dev *dev, 
VhostUserMsg *msg,
 return 0;
 }
 
+chr = u->user->chr;
 if (qemu_chr_fe_set_msgfds(chr, fds, fd_num) < 0) {
 error_report("Failed to set msg fds.");
 return -1;
@@ -319,6 +329,7 @@ static int vhost_user_write(struct vhost_dev *dev, 
VhostUserMsg *msg,
 
 ret = qemu_chr_fe_write_all(chr, (const uint8_t *) msg, size);
 if (ret != size) {
+dev->break_down = true;
 error_report("Failed to write msg."
  " Wrote %d instead of %d.", ret, size);
 return -1;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a7f449f..86d0dc5 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -74,6 +74,7 @@ struct vhost_dev {
 bool started;
 bool log_enabled;
 uint64_t log_size;
+bool break_down;
 Error *migration_blocker;
 const VhostOps *vhost_ops;
 void *opaque;
diff --git a/include/net/vhost_net.h b/include/net/vhost_net.h
index 77e4739..06f2c08 100644
--- a/include/net/vhost_net.h
+++ b/include/net/vhost_net.h
@@ -27,6 +27,7 @@ void vhost_net_cleanup(VHostNetState *net);
 
 uint64_t vhost_net_get_features(VHostNetState *net, uint64_t features);
 void vhost_net_ack_features(VHostNetState *net, uint64_t features);
+void vhost_net_mark_break_down(VHostNetState *net);
 
 bool vhost_net_virtqueue_pending(VHostNetState *net, int n);
 void vhost_net_virtqueue_mask(VHostNetState *net, VirtIODevice *dev,
diff --git a/net/vhost-user.c b/net/vhost-user.c
index a39f9c9..b99e20b 100644
--- a/net/vhost-user.c
+++ b/net/vhost-user.c
@@ -270,6 +270,9 @@ static void net_vhost_user_event(void *opaque, int event)
 if (s->watch) {
 AioContext *ctx = qemu_get_current_aio_context();
 
+if (s->vhost_net) {
+vhost_net_mark_break_down(s->vhost_ne

Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory

2018-06-27 Thread Liang Li
On Mon, Apr 16, 2018 at 11:00:11PM +0800, Lai Jiangshan wrote:
> 
>  migration/migration.c | 22 ++
>  migration/migration.h |  1 +
>  migration/ram.c   | 27 ++-
>  qapi/migration.json   |  6 +-
>  4 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
>  return false;
>  }
>  
> +if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> +/* Bypass and postcopy are quite conflicting ways
> + * to get memory in the destination.  And there
> + * is not code to discriminate the differences and
> + * handle the conflicts currently.  It should be possible
> + * to fix, but it is generally useless when both ways
> + * are used together.
> + */
> +error_setg(errp, "Bypass is not currently compatible "
> +   "with postcopy");
> +return false;
> +}
> +
>  /* This check is reasonably expensive, so only when it's being
>   * set the first time, also it's only the destination that needs
>   * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
>  return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>  }
>  
> +bool migrate_bypass_shared_memory(void)
> +{
> +MigrationState *s;
> +
> +s = migrate_get_current();
> +
> +return 
> s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
>  bool migrate_postcopy_ram(void)
>  {
>  MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>  
>  bool migrate_postcopy(void);
>  
> +bool migrate_bypass_shared_memory(void);
>  bool migrate_release_ram(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, 
> RAMBlock *rb,
>  unsigned long *bitmap = rb->bmap;
>  unsigned long next;
>  
> +/* when this ramblock is requested bypassing */
> +if (!bitmap) {
> +return size;
> +}
> +
>  if (rs->ram_bulk_stage && start > 0) {
>  next = start + 1;
>  } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>  qemu_mutex_lock(>bitmap_mutex);
>  rcu_read_lock();
>  RAMBLOCK_FOREACH(block) {
> -migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> +migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +}
>  }
>  rcu_read_unlock();
>  qemu_mutex_unlock(>bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>  qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>  QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>  
> -/*
> - * Count the total number of pages used by ram blocks not including any
> - * gaps due to alignment or unplugs.
> - */
> -(*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
>  ram_state_reset(*rsp);
>  
>  return 0;
>  }
>  
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
>  {
>  RAMBlock *block;
>  unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>  /* Skip setting bitmap if there is no RAM */
>  if (ram_bytes_total()) {
>  QLIST_FOREACH_RCU(block, _list.blocks, next) {
> +if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) 
> {
> +continue;
> +}
>  pages = block->max_length >> TARGET_PAGE_BITS;
>  block->bmap = bitmap_new(pages);
>  bitmap_set(block->bmap, 0, pages);
> +/*
> + * Count the total number of pages used by ram blocks not
> + * including any gaps due to alignment or unplugs.
> + */
> +rs->migration_dirty_pages += pages;
Hi Jiangshan,

I think you should use 'block->used_length >> TARGET_PAGE_BITS' instead of pages
here.
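
A minimal sketch of the adjustment I mean (illustration only, not a
tested patch) against the hunk above: keep sizing the bitmap from
max_length, but account the dirty pages from used_length so the count
matches what migration_bitmap_sync_range() actually scans:

    pages = block->max_length >> TARGET_PAGE_BITS;
    block->bmap = bitmap_new(pages);
    bitmap_set(block->bmap, 0, pages);
    /* count only the pages that are really in use */
    rs->migration_dirty_pages += block->used_length >> TARGET_PAGE_BITS;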

As I have said before, we should also skip the dirty-logging related
operations for the shared memory to speed up the live migration process.
More importantly, skipping dirty logging avoids splitting the EPT entries
from 2M/1G to 4K when transparent hugepages are used, and thus avoids
performance degradation after migration.

Some other things we should pay 

Re: [Qemu-devel] [PATCH v2 3/3] virtio-balloon: add a timer to limit the free page report waiting time

2018-02-27 Thread Liang Li
On Tue, Feb 27, 2018 at 06:10:47PM +0800, Wei Wang wrote:
> On 02/27/2018 08:50 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 26, 2018 at 12:35:31PM +0800, Wei Wang wrote:
> > > On 02/09/2018 08:15 PM, Dr. David Alan Gilbert wrote:
> > > > * Wei Wang (wei.w.w...@intel.com) wrote:
> > 
> > I think all this is premature optimization. It is not at all clear that
> > anything is gained by delaying migration. Just ask for hints and start
> > sending pages immediately.  If guest tells us a page is free before it's
> > sent, we can skip sending it.  OTOH if migration is taking less time to
> > complete than it takes for guest to respond, then we are better off just
> > ignoring the hint.
> 
> OK, I'll try to create a thread for the free page optimization. We create
> the thread to poll for free pages at the beginning of the bulk stage, and
> stops at the end of bulk stage.
> There are also comments about postcopy support with this feature, I plan to
> leave that as the second step (that support seems not urgent for now).
> 
> 
> Best,
> Wei

You can make use of the current migration thread instead of creating a new one.

Liang



[Qemu-devel] [PATCH v2 resend] block/mirror: change the semantic of 'force' of block-job-cancel

2018-02-26 Thread Liang Li
When doing drive mirror to a low speed shared storage, if there is a heavy
BLK IO write workload in the VM after the 'ready' event, the drive mirror
block job can't be canceled immediately; it keeps running until the heavy
BLK IO workload in the VM stops.

Libvirt depends on the current block-job-cancel semantics, which is that
when used without a flag after the 'ready' event, the command blocks
until data is in sync.  However, these semantics are awkward in other
situations, for example, people may use drive mirror for realtime
backups while still wanting to use block live migration.  Libvirt cannot
start a block live migration while another drive mirror is in progress,
but the user would rather abandon the backup attempt as broken and
proceed with the live migration than be stuck waiting for the current
drive mirror backup to finish.

The drive-mirror command already includes a 'force' flag, which libvirt
does not use, although it documented the flag as only being useful to
quit a job which is paused.  However, since quitting a paused job has
the same effect as abandoning a backup in a non-paused job (namely, the
destination file is not in sync, and the command completes immediately),
we can just improve the documentation to make the force flag obviously
useful.

Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Cc: Eric Blake <ebl...@redhat.com>
Cc: John Snow <js...@redhat.com>
Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>
---
 block/mirror.c| 10 --
 blockdev.c|  4 ++--
 blockjob.c| 12 +++-
 hmp-commands.hx   |  3 ++-
 include/block/blockjob.h  |  9 -
 qapi/block-core.json  |  5 +++--
 tests/test-blockjob-txn.c |  8 
 7 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c9badc1..9190b1c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -869,11 +869,8 @@ static void coroutine_fn mirror_run(void *opaque)
 
 ret = 0;
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
-if (!s->synced) {
-block_job_sleep_ns(&s->common, delay_ns);
-if (block_job_is_cancelled(&s->common)) {
-break;
-}
+if (block_job_is_cancelled(&s->common) && s->common.force) {
+break;
 } else if (!should_complete) {
 delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
block_job_sleep_ns(&s->common, delay_ns);
@@ -887,7 +884,8 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
+assert(ret < 0 || ((s->common.force || !s->synced) &&
+   block_job_is_cancelled(&s->common)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/blockdev.c b/blockdev.c
index 8e977ee..039f156 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -145,7 +145,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 aio_context_acquire(aio_context);
 
 if (bs->job) {
-block_job_cancel(bs->job);
+block_job_cancel(bs->job, false);
 }
 
 aio_context_release(aio_context);
@@ -3802,7 +3802,7 @@ void qmp_block_job_cancel(const char *device,
 }
 
 trace_qmp_block_job_cancel(job);
-block_job_cancel(job);
+block_job_cancel(job, force);
 out:
 aio_context_release(aio_context);
 }
diff --git a/blockjob.c b/blockjob.c
index f5cea84..9b0b1a4 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -365,7 +365,7 @@ static void block_job_completed_single(BlockJob *job)
 block_job_unref(job);
 }
 
-static void block_job_cancel_async(BlockJob *job)
+static void block_job_cancel_async(BlockJob *job, bool force)
 {
 if (job->iostatus != BLOCK_DEVICE_IO_STATUS_OK) {
 block_job_iostatus_reset(job);
@@ -376,6 +376,8 @@ static void block_job_cancel_async(BlockJob *job)
 job->pause_count--;
 }
 job->cancelled = true;
+/* To prevent 'force == false' overriding a previous 'force == true' */
+job->force |= force;
 }
 
 static int block_job_finish_sync(BlockJob *job,
@@ -437,7 +439,7 @@ static void block_job_completed_txn_abort(BlockJob *job)
  * on the caller, so leave it. */
QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
 if (other_job != job) {
-block_job_cancel_async(other_job);
+block_job_cancel_async(other_job, false);
 }
 }
while (!QLIST_EMPTY(&txn->jobs)) {
@@ -542,10 +544,10 @@ void block_jo

[Qemu-devel] [PATCH v2] block/mirror: change the semantic of 'force' of block-job-cancel

2018-02-05 Thread Liang Li
When doing drive mirror to a low speed shared storage, if there is a heavy
BLK IO write workload in the VM after the 'ready' event, the drive mirror
block job can't be canceled immediately; it keeps running until the heavy
BLK IO workload in the VM stops.

Libvirt depends on the current block-job-cancel semantics, which is that
when used without a flag after the 'ready' event, the command blocks
until data is in sync.  However, these semantics are awkward in other
situations, for example, people may use drive mirror for realtime
backups while still wanting to use block live migration.  Libvirt cannot
start a block live migration while another drive mirror is in progress,
but the user would rather abandon the backup attempt as broken and
proceed with the live migration than be stuck waiting for the current
drive mirror backup to finish.

The drive-mirror command already includes a 'force' flag, which libvirt
does not use, although it documented the flag as only being useful to
quit a job which is paused.  However, since quitting a paused job has
the same effect as abandoning a backup in a non-paused job (namely, the
destination file is not in sync, and the command completes immediately),
we can just improve the documentation to make the force flag obviously
useful.

Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Cc: Eric Blake <ebl...@redhat.com>
Cc: John Snow <js...@redhat.com>
Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>
---
 block/mirror.c| 10 --
 blockdev.c|  4 ++--
 blockjob.c| 12 +++-
 hmp-commands.hx   |  3 ++-
 include/block/blockjob.h  |  9 -
 qapi/block-core.json  |  5 +++--
 tests/test-blockjob-txn.c |  8 
 7 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c9badc1..9190b1c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -869,11 +869,8 @@ static void coroutine_fn mirror_run(void *opaque)
 
 ret = 0;
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
-if (!s->synced) {
-block_job_sleep_ns(&s->common, delay_ns);
-if (block_job_is_cancelled(&s->common)) {
-break;
-}
+if (block_job_is_cancelled(&s->common) && s->common.force) {
+break;
 } else if (!should_complete) {
 delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
block_job_sleep_ns(&s->common, delay_ns);
@@ -887,7 +884,8 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
+assert(ret < 0 || ((s->common.force || !s->synced) &&
+   block_job_is_cancelled(&s->common)));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/blockdev.c b/blockdev.c
index 8e977ee..039f156 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -145,7 +145,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 aio_context_acquire(aio_context);
 
 if (bs->job) {
-block_job_cancel(bs->job);
+block_job_cancel(bs->job, false);
 }
 
 aio_context_release(aio_context);
@@ -3802,7 +3802,7 @@ void qmp_block_job_cancel(const char *device,
 }
 
 trace_qmp_block_job_cancel(job);
-block_job_cancel(job);
+block_job_cancel(job, force);
 out:
 aio_context_release(aio_context);
 }
diff --git a/blockjob.c b/blockjob.c
index f5cea84..9b0b1a4 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -365,7 +365,7 @@ static void block_job_completed_single(BlockJob *job)
 block_job_unref(job);
 }
 
-static void block_job_cancel_async(BlockJob *job)
+static void block_job_cancel_async(BlockJob *job, bool force)
 {
 if (job->iostatus != BLOCK_DEVICE_IO_STATUS_OK) {
 block_job_iostatus_reset(job);
@@ -376,6 +376,8 @@ static void block_job_cancel_async(BlockJob *job)
 job->pause_count--;
 }
 job->cancelled = true;
+/* To prevent 'force == false' overriding a previous 'force == true' */
+job->force |= force;
 }
 
 static int block_job_finish_sync(BlockJob *job,
@@ -437,7 +439,7 @@ static void block_job_completed_txn_abort(BlockJob *job)
  * on the caller, so leave it. */
QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
 if (other_job != job) {
-block_job_cancel_async(other_job);
+block_job_cancel_async(other_job, false);
 }
 }
while (!QLIST_EMPTY(&txn->jobs)) {
@@ -542,10 +544,10 @@ void block_jo

Re: [Qemu-devel] [PATCH] block/mirror: change the semantic of 'force' of block-job-cancel

2018-02-05 Thread Liang Li
On Mon, Feb 05, 2018 at 02:28:55PM -0500, John Snow wrote:
> 
> 
> On 01/31/2018 09:19 PM, Liang Li wrote:
> > On Tue, Jan 30, 2018 at 03:18:31PM -0500, John Snow wrote:
> >>
> >>
> >> On 01/30/2018 03:38 AM, Liang Li wrote:
> >>> When doing drive mirror to a low speed shared storage, if there was heavy
> >>> BLK IO write workload in VM after the 'ready' event, drive mirror block 
> >>> job
> >>> can't be canceled immediately, it would keep running until the heavy BLK 
> >>> IO
> >>> workload stopped in the VM.
> >>>
> >>> Because libvirt depends on block-job-cancel for block live migration, the
> >>> current block-job-cancel has the semantic to make sure data is in sync 
> >>> after
> >>> the 'ready' event.  This semantic can't meet some requirement, for 
> >>> example,
> >>> people may use drive mirror for realtime backup while need the ability of
> >>> block live migration. If drive mirror can't not be cancelled immediately,
> >>> it means block live migration need to wait, because libvirt make use drive
> >>> mirror to implement block live migration and only one drive mirror block
> >>> job is allowed at the same time for a give block dev.
> >>>
> >>> We need a new interface for 'force cancel', which could quit block job
> >>> immediately if don't care about whether data is in sync or not.
> >>>
> >>> 'force' is not used by libvirt currently, to make things simple, change
> >>> it's semantic slightly, hope it will not break some use case which need 
> >>> its
> >>> original semantic.
> >>>
> >>> Cc: Paolo Bonzini <pbonz...@redhat.com>
> >>> Cc: Jeff Cody <jc...@redhat.com>
> >>> Cc: Kevin Wolf <kw...@redhat.com>
> >>> Cc: Max Reitz <mre...@redhat.com>
> >>> Cc: Eric Blake <ebl...@redhat.com>
> >>> Cc: John Snow <js...@redhat.com>
> >>> Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
> >>> Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
> >>> Signed-off-by: Liang Li <liliang...@didichuxing.com>
> >>> ---
> >>> block/mirror.c|  9 +++--
> >>> blockdev.c|  4 ++--
> >>> blockjob.c| 11 ++-
> >>> hmp-commands.hx   |  3 ++-
> >>> include/block/blockjob.h  |  9 -
> >>> qapi/block-core.json  |  6 --
> >>> tests/test-blockjob-txn.c |  8 
> >>> 7 files changed, 29 insertions(+), 21 deletions(-)
> >>>
> >>> diff --git a/block/mirror.c b/block/mirror.c
> >>> index c9badc1..c22dff9 100644
> >>> --- a/block/mirror.c
> >>> +++ b/block/mirror.c
> >>> @@ -869,11 +869,8 @@ static void coroutine_fn mirror_run(void *opaque)
> >>>
> >>> ret = 0;
> >>> trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
> >>> -if (!s->synced) {
> >>> -block_job_sleep_ns(>common, delay_ns);
> >>> -if (block_job_is_cancelled(>common)) {
> >>> -break;
> >>> -}
> >>> +if (block_job_is_cancelled(>common) && s->common.force) {
> >>> +break;
> >>
> >> what's the justification for removing the sleep in the case that
> >> !s->synced && !block_job_is_cancelled(...) ?
> >>
> > if !block_job_is_cancelled() satisfied, the code in 'if (!should_complete) 
> > {}'
> > will execute, there is a block_job_sleep_ns there.
> > 
> > block_job_sleep_ns is for rate throttling, if there is no more data to 
> > sync, 
> > sleep is not needed, right?
> > 
> >>> } else if (!should_complete) {
> >>> delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
> >>> block_job_sleep_ns(>common, delay_ns);
> >>> @@ -887,7 +884,7 @@ immediate_exit:
> >>>  * or it was cancelled prematurely so that we do not guarantee 
> >>> that
> >>>  * the target is a copy of the source.
> >>>  */
> >>> -assert(ret < 0 || (!s->synced && 
> >>> block_job_is_cancelled(>common)));
> >>> +assert(ret < 0 || block_job_is_cancelled(>co

Re: [Qemu-devel] [PATCH] block/mirror: change the semantic of 'force' of block-job-cancel

2018-01-31 Thread Liang Li
On Tue, Jan 30, 2018 at 03:18:31PM -0500, John Snow wrote:
> 
> 
> On 01/30/2018 03:38 AM, Liang Li wrote:
>> When doing drive mirror to a low speed shared storage, if there was heavy
>> BLK IO write workload in VM after the 'ready' event, drive mirror block job
>> can't be canceled immediately, it would keep running until the heavy BLK IO
>> workload stopped in the VM.
>> 
>> Because libvirt depends on block-job-cancel for block live migration, the
>> current block-job-cancel has the semantic to make sure data is in sync after
>> the 'ready' event.  This semantic can't meet some requirement, for example,
>> people may use drive mirror for realtime backup while need the ability of
>> block live migration. If drive mirror can't not be cancelled immediately,
>> it means block live migration need to wait, because libvirt make use drive
>> mirror to implement block live migration and only one drive mirror block
>> job is allowed at the same time for a give block dev.
>> 
>> We need a new interface for 'force cancel', which could quit block job
>> immediately if don't care about whether data is in sync or not.
>> 
>> 'force' is not used by libvirt currently, to make things simple, change
>> it's semantic slightly, hope it will not break some use case which need its
>> original semantic.
>> 
>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>> Cc: Jeff Cody <jc...@redhat.com>
>> Cc: Kevin Wolf <kw...@redhat.com>
>> Cc: Max Reitz <mre...@redhat.com>
>> Cc: Eric Blake <ebl...@redhat.com>
>> Cc: John Snow <js...@redhat.com>
>> Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
>> Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
>> Signed-off-by: Liang Li <liliang...@didichuxing.com>
>> ---
>> block/mirror.c|  9 +++--
>> blockdev.c|  4 ++--
>> blockjob.c| 11 ++-
>> hmp-commands.hx   |  3 ++-
>> include/block/blockjob.h  |  9 -
>> qapi/block-core.json  |  6 --
>> tests/test-blockjob-txn.c |  8 
>> 7 files changed, 29 insertions(+), 21 deletions(-)
>> 
>> diff --git a/block/mirror.c b/block/mirror.c
>> index c9badc1..c22dff9 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -869,11 +869,8 @@ static void coroutine_fn mirror_run(void *opaque)
>> 
>> ret = 0;
>> trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
>> -if (!s->synced) {
>> -block_job_sleep_ns(>common, delay_ns);
>> -if (block_job_is_cancelled(>common)) {
>> -break;
>> -}
>> +if (block_job_is_cancelled(>common) && s->common.force) {
>> +break;
> 
> what's the justification for removing the sleep in the case that
> !s->synced && !block_job_is_cancelled(...) ?
> 
If !block_job_is_cancelled() is satisfied, the code in 'if (!should_complete) {}'
will execute, and there is a block_job_sleep_ns there.

block_job_sleep_ns is for rate throttling; if there is no more data to sync,
the sleep is not needed, right?

>> } else if (!should_complete) {
>> delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
>> block_job_sleep_ns(>common, delay_ns);
>> @@ -887,7 +884,7 @@ immediate_exit:
>>  * or it was cancelled prematurely so that we do not guarantee that
>>  * the target is a copy of the source.
>>  */
>> -assert(ret < 0 || (!s->synced && 
>> block_job_is_cancelled(>common)));
>> +assert(ret < 0 || block_job_is_cancelled(>common));
> 
> This assertion gets weaker in the case where force isn't provided, is
> that desired?
> 
Yes. If force quit is used, the following condition can be true:

(ret >= 0) && (s->synced) && (block_job_is_cancelled(&s->common))

so the above assert should be changed, or it will fail.

>> assert(need_drain);
>> mirror_wait_for_all_io(s);
>> }
>> diff --git a/blockdev.c b/blockdev.c
>> index 8e977ee..039f156 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -145,7 +145,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
>> aio_context_acquire(aio_context);
>> 
>> if (bs->job) {
>> -block_job_cancel(bs->job);
>> +block_job_cancel(bs->job, false);
>> }
>> 
>> aio_context_release(aio_context);
>> @@ -3802,7 +3802,7 @@

Re: [Qemu-devel] [PATCH] block/mirror: change the semantic of 'force' of block-job-cancel

2018-01-31 Thread Liang Li
On Tue, Jan 30, 2018 at 08:20:03AM -0600, Eric Blake wrote:
> On 01/30/2018 02:38 AM, Liang Li wrote:
>> When doing drive mirror to a low speed shared storage, if there was heavy
>> BLK IO write workload in VM after the 'ready' event, drive mirror block job
>> can't be canceled immediately, it would keep running until the heavy BLK IO
>> workload stopped in the VM.
> 
> So far so good.   But the grammar and explanation in the rest of the
> commit is a bit hard to read; let me give a shot at an alternative wording:
> 
> Libvirt depends on the current block-job-cancel semantics, which is that
> when used without a flag after the 'ready' event, the command blocks
> until data is in sync.  However, these semantics are awkward in other
> situations, for example, people may use drive mirror for realtime
> backups while still wanting to use block live migration.  Libvirt cannot
> start a block live migration while another drive mirror is in progress,
> but the user would rather abandon the backup attempt as broken and
> proceed with the live migration than be stuck waiting for the current
> drive mirror backup to finish.
> 
> The drive-mirror command already includes a 'force' flag, which libvirt
> does not use, although it documented the flag as only being useful to
> quit a job which is paused.  However, since quitting a paused job has
> the same effect as abandoning a backup in a non-paused job (namely, the
> destination file is not in sync, and the command completes immediately),
> we can just improve the documentation to make the force flag obviously
> useful.
> 

much better, will include in the v2. Thanks!
>> 
>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>> Cc: Jeff Cody <jc...@redhat.com>
>> Cc: Kevin Wolf <kw...@redhat.com>
>> Cc: Max Reitz <mre...@redhat.com>
>> Cc: Eric Blake <ebl...@redhat.com>
>> Cc: John Snow <js...@redhat.com>
>> Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
>> Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
>> Signed-off-by: Liang Li <liliang...@didichuxing.com>
>> ---
> 
> 
>> +++ b/hmp-commands.hx
>> @@ -106,7 +106,8 @@ ETEXI
>> .args_type  = "force:-f,device:B",
>> .params = "[-f] device",
>> .help   = "stop an active background block operation (use -f"
>> -  "\n\t\t\t if the operation is currently paused)",
>> +  "\n\t\t\t if you want to abort the operation 
>> immediately"
>> +  "\n\t\t\t instead of keep running until data is in 
>> sync )",
> 
> s/sync )/sync)/
> 

done
>> .cmd= hmp_block_job_cancel,
>> },
>> 
>> diff --git a/include/block/blockjob.h b/include/block/blockjob.h
>> index 00403d9..4a96c42 100644
>> --- a/include/block/blockjob.h
>> +++ b/include/block/blockjob.h
>> @@ -63,6 +63,12 @@ typedef struct BlockJob {
>> bool cancelled;
>> 
>> /**
>> + * Set to true if the job should be abort immediately without waiting
> 
> s/be //

done
> 
>> + * for data is in sync.
> 
> s/is/to be/
> 

done
>> + */
>> +bool force;
>> +
>> +/**
>>  * Counter for pause request. If non-zero, the block job is either 
>> paused,
>>  * or if busy == true will pause itself as soon as possible.
>>  */
>> @@ -218,10 +224,11 @@ void block_job_start(BlockJob *job);
>> /**
>>  * block_job_cancel:
>>  * @job: The job to be canceled.
>> + * @force: Quit a job without waiting data is in sync.
> 
> s/data is/for data to be/
> 

done
>> +++ b/qapi/block-core.json
>> @@ -2098,8 +2098,10 @@
>> #  the name of the parameter), but since QEMU 2.7 it can have
>> #  other values.
>> #
>> -# @force: whether to allow cancellation of a paused job (default
>> -# false).  Since 1.3.
>> +# @force: #optional whether to allow cancellation a job without waiting 
>> data is
> 
> The '#optional' tag should no longer be added.
> 
>> +# in sync, please not that since 2.12 it's semantic is not exactly 
>> the
>> +# same as before, from 1.3 to 2.11 it means whether to allow 
>> cancellation
>> +# of a paused job (default false).  Since 1.3.
> 
> Reads awkwardly.  I suggest:
> 
> @force: If true, and the job has already emitted the event
> BLOCK_JOB_READY, abandon the job immediately (even if it is paused)
> instead of waiting for the destination to complete its 

[Qemu-devel] [PATCH] block/mirror: change the semantic of 'force' of block-job-cancel

2018-01-30 Thread Liang Li
When doing drive mirror to a low speed shared storage, if there is a heavy
BLK IO write workload in the VM after the 'ready' event, the drive mirror
block job can't be canceled immediately; it keeps running until the heavy
BLK IO workload in the VM stops.

Because libvirt depends on block-job-cancel for block live migration, the
current block-job-cancel has the semantic of making sure data is in sync
after the 'ready' event.  This semantic can't meet some requirements; for
example, people may use drive mirror for realtime backup while still
needing the ability to do block live migration. If the drive mirror can't
be cancelled immediately, block live migration has to wait, because libvirt
uses drive mirror to implement block live migration and only one drive
mirror block job is allowed at a time for a given block dev.

We need a new interface for 'force cancel', which quits the block job
immediately when we don't care whether data is in sync or not.

'force' is not used by libvirt currently, so to keep things simple, change
its semantic slightly; hopefully this will not break any use case which
needs its original semantic.

Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Cc: Eric Blake <ebl...@redhat.com>
Cc: John Snow <js...@redhat.com>
Reported-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Huaitong Han <huanhuait...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>
---
 block/mirror.c|  9 +++--
 blockdev.c|  4 ++--
 blockjob.c| 11 ++-
 hmp-commands.hx   |  3 ++-
 include/block/blockjob.h  |  9 -
 qapi/block-core.json  |  6 --
 tests/test-blockjob-txn.c |  8 
 7 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c9badc1..c22dff9 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -869,11 +869,8 @@ static void coroutine_fn mirror_run(void *opaque)
 
 ret = 0;
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
-if (!s->synced) {
-block_job_sleep_ns(&s->common, delay_ns);
-if (block_job_is_cancelled(&s->common)) {
-break;
-}
+if (block_job_is_cancelled(&s->common) && s->common.force) {
+break;
 } else if (!should_complete) {
 delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
block_job_sleep_ns(&s->common, delay_ns);
@@ -887,7 +884,7 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
+assert(ret < 0 || block_job_is_cancelled(&s->common));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
diff --git a/blockdev.c b/blockdev.c
index 8e977ee..039f156 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -145,7 +145,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 aio_context_acquire(aio_context);
 
 if (bs->job) {
-block_job_cancel(bs->job);
+block_job_cancel(bs->job, false);
 }
 
 aio_context_release(aio_context);
@@ -3802,7 +3802,7 @@ void qmp_block_job_cancel(const char *device,
 }
 
 trace_qmp_block_job_cancel(job);
-block_job_cancel(job);
+block_job_cancel(job, force);
 out:
 aio_context_release(aio_context);
 }
diff --git a/blockjob.c b/blockjob.c
index f5cea84..0aacb50 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -365,7 +365,7 @@ static void block_job_completed_single(BlockJob *job)
 block_job_unref(job);
 }
 
-static void block_job_cancel_async(BlockJob *job)
+static void block_job_cancel_async(BlockJob *job, bool force)
 {
 if (job->iostatus != BLOCK_DEVICE_IO_STATUS_OK) {
 block_job_iostatus_reset(job);
@@ -376,6 +376,7 @@ static void block_job_cancel_async(BlockJob *job)
 job->pause_count--;
 }
 job->cancelled = true;
+job->force = force;
 }
 
 static int block_job_finish_sync(BlockJob *job,
@@ -437,7 +438,7 @@ static void block_job_completed_txn_abort(BlockJob *job)
  * on the caller, so leave it. */
QLIST_FOREACH(other_job, &txn->jobs, txn_list) {
 if (other_job != job) {
-block_job_cancel_async(other_job);
+block_job_cancel_async(other_job, true);
 }
 }
while (!QLIST_EMPTY(&txn->jobs)) {
@@ -542,10 +543,10 @@ void block_job_user_resume(BlockJob *job)
 }
 }
 
-void block_job_cancel(BlockJob *job)
+void block_job_cancel(BlockJob *job, bool force)
 {
 if (block_job_started(job)) {
-block_job_cancel_async(job);
+block_job_cancel_async(job, force);
 block_job_enter(job);

Re: [Qemu-devel] [PATCH] block/mirror: fix fail to cancel when VM has heavy BLK IO

2018-01-28 Thread Liang Li
On Fri, Jan 26, 2018 at 08:04:08AM -0600, Eric Blake wrote:
> On 01/26/2018 12:46 AM, Liang Li wrote:
> > The current QMP command is:
> > 
> > { 'command': 'block-job-cancel', 'data': { 'device': 'str', '*force': 
> > 'bool' } }
> > 
> > 'force' has other meaning which is not used by libvirt, for the change, 
> > there
> > are 3 options:
> > 
> > a. Now that 'force' is not used by libvirt and it current semantic is not 
> > very useful,
> > we can change it's semantic to force-quit without syncing.
> 
> The current semantics are:
> 
> # @force: whether to allow cancellation of a paused job (default
> # false).  Since 1.3.
> 
> You are right that libvirt is not using it at the moment; but that
> doesn't tell us whether someone else is using it.  On the other hand, it
> is a fairly easy argument to make that "a job which is paused is not
> complete, so forcing it to cancel means an unclean image left behind",
> which can then be reformulated as "the force flag says to cancel
> immediately, whether the job is paused or has pending data, and thus
> leave an unclean image behind".  In other words, I don't think it is too
> bad to just tidy up the wording, and allow the existing 'force':true
> parameter to be enabled to quit a job that won't converge.
> 
> > 
> > b. change 'force' from bool to flag, and bit 0 is used for it's original 
> > meaning.
> 
> Not possible.  You can't change from 'force':true to 'force':1 in JSON,
> at least not without rewriting the command to use an alternate that
> accepts both bool and int (actually, I seem to recall that we tightened
> QAPI to not permit alternates that might be ambiguous when parsed by
> QemuOpts, which may mean that is not even possible - although I haven't
> tried to see if it works or gives an error).
> 
> > 
> > c. add another bool parameter.
> 
> Also doable, if we are concerned that existing semantics of 'force'
> affecting only paused jobs must be preserved.
> 
> > 
> > 
> > which is the best one?
> 
> 1 is slightly less code, but 3 is more conservative.  I'd be okay with
> option 1 if no one else can provide a reason why it would break something.
> 

OK. I will send a patch based on the first option.

Thanks!

Liang
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 






Re: [Qemu-devel] [PATCH] block/mirror: fix fail to cancel when VM has heavy BLK IO

2018-01-25 Thread Liang Li
On Thu, Jan 25, 2018 at 08:48:22AM -0600, Eric Blake wrote:
> On 01/24/2018 10:59 PM, Liang Li wrote:
> >>
> >> There's ongoing work on adding async mirroring; this may be a better
> >> solution to the issue you are seeing.
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg05419.html
> >>
> > Hi Eric,
> > 
> > Thinks for your information, I didn't know libvirt depends on 
> > 'block-job-cancel'
> > for some of the block related operations.
> > 
> > It's seems a new interface should provided by qemu for use case that just
> > for aborting block job and don't care abort the mirror data integrality, and
> > libvirt can make use of this new interface.
> > 
> > Do you think this is the right direction?
> 
> I don't know if it is better to wait for the new async mirroring code to
> land, or to just propose a new QMP command that can force-quit an
> ongoing mirror in the READY state, but you are correct that the only
> safe way to do it is by adding a new command (or a new optional flag to
> the existing block-job-cancel command).
> 

Active sync does not conflict with the new QMP command, no need to wait.
The current QMP command is:

{ 'command': 'block-job-cancel', 'data': { 'device': 'str', '*force': 'bool' } }

'force' has another meaning which is not used by libvirt; for the change,
there are 3 options:

a. Now that 'force' is not used by libvirt and its current semantic is not
very useful, we can change its semantic to force-quit without syncing.

b. Change 'force' from a bool to a flags field, with bit 0 keeping its
original meaning.

c. Add another bool parameter.


which is the best one?

Thanks!

Liang 


> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 






Re: [Qemu-devel] [PATCH] block/mirror: fix fail to cancel when VM has heavy BLK IO

2018-01-24 Thread Liang Li
On Wed, Jan 24, 2018 at 01:16:39PM -0600, Eric Blake wrote:
> On 01/24/2018 12:17 AM, Liang Li wrote:
> > We found that when doing drive mirror to a low speed shared storage,
> > if there was heavy BLK IO write workload in VM after the 'ready' event,
> > drive mirror block job can't be canceled immediately, it would keep
> > running until the heavy BLK IO workload stopped in the VM. This patch
> > fixed this issue.
> 
> I think you are breaking semantics here.  Libvirt relies on
> 'block-job-cancel' after the 'ready' event to be a clean point-in-time
> snapshot, but that is only possible if there is no out-of-order pending
> I/O at the time the action takes place.  Breaking in the middle of the
> loop, without using bdrv_drain(), risks leaving an inconsistent copy of
> data in the mirror not corresponding to any point-in-time on the source.
> 
> There's ongoing work on adding async mirroring; this may be a better
> solution to the issue you are seeing.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg05419.html
> 
Hi Eric,

Thanks for your information, I didn't know libvirt depends on 'block-job-cancel'
for some of the block related operations.

It seems a new interface should be provided by QEMU for the use case of just
aborting the block job without caring about the mirror data integrity, and
libvirt can make use of this new interface.

Do you think this is the right direction?

Liang
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
> 





[Qemu-devel] [PATCH] block/mirror: fix fail to cancel when VM has heavy BLK IO

2018-01-23 Thread Liang Li
We found that when doing drive mirror to a low speed shared storage,
if there was a heavy block I/O write workload in the VM after the 'ready'
event, the drive mirror block job can't be canceled immediately; it would
keep running until the heavy block I/O workload stopped in the VM. This
patch fixes the issue.

Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Signed-off-by: Huaitong Han <hanhuait...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>
---
 block/mirror.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c9badc1..3bc49a5 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -869,11 +869,9 @@ static void coroutine_fn mirror_run(void *opaque)
 
 ret = 0;
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
-if (!s->synced) {
-block_job_sleep_ns(&s->common, delay_ns);
-if (block_job_is_cancelled(&s->common)) {
-break;
-}
+
+if (block_job_is_cancelled(&s->common)) {
+break;
 } else if (!should_complete) {
 delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
 block_job_sleep_ns(&s->common, delay_ns);
@@ -887,7 +885,7 @@ immediate_exit:
  * or it was cancelled prematurely so that we do not guarantee that
  * the target is a copy of the source.
  */
-assert(ret < 0 || (!s->synced && block_job_is_cancelled(&s->common)));
+assert(ret < 0 || block_job_is_cancelled(&s->common));
 assert(need_drain);
 mirror_wait_for_all_io(s);
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH] hbitmap: fix missing restore count when finish deserialization

2018-01-18 Thread Liang Li
The .count field of HBitmap is not set in
hbitmap_deserialize_finish(); let's set it to the right value.
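
As a standalone illustration of the bug class (a toy structure, not the QEMU
HBitmap API): when a structure caches its population count, bulk-loading the
raw bits without restoring the cache leaves count queries stale until the
cache is recomputed, which is exactly what the one-line fix below does for
HBitmap:

#include <stdio.h>
#include <string.h>

/* Toy bitmap with a cached population count, mimicking HBitmap's .count. */
struct toy_bitmap {
    unsigned long bits[4];
    unsigned long count;   /* cached number of set bits */
};

static unsigned long recount(const struct toy_bitmap *b)
{
    unsigned long i, n = 0;
    for (i = 0; i < 4; i++)
        n += __builtin_popcountl(b->bits[i]);
    return n;
}

int main(void)
{
    struct toy_bitmap src = { { 0xf0f0, 0, 1, 0 }, 0 }, dst = { { 0 }, 0 };

    src.count = recount(&src);
    /* "Deserialize": copy the raw levels but forget to restore the count. */
    memcpy(dst.bits, src.bits, sizeof(dst.bits));
    printf("stale count: %lu, real count: %lu\n", dst.count, recount(&dst));
    dst.count = recount(&dst);          /* the equivalent of the fix */
    printf("fixed count: %lu\n", dst.count);
    return 0;
}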

Cc: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
Cc: Fam Zheng <f...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Cc: John Snow <js...@redhat.com>
Signed-off-by: weiping zhang <zhangweip...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>
---
 util/hbitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util/hbitmap.c b/util/hbitmap.c
index 289778a..58a2c93 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -630,6 +630,7 @@ void hbitmap_deserialize_finish(HBitmap *bitmap)
 }
 
 bitmap->levels[0][0] |= 1UL << (BITS_PER_LONG - 1);
+bitmap->count = hb_count_between(bitmap, 0, bitmap->size - 1);
 }
 
 void hbitmap_free(HBitmap *hb)
-- 
1.8.3.1




[Qemu-devel] [PATCH resend] hbitmap: fix missing restore count when finish deserialization

2018-01-18 Thread Liang Li
The .count field of HBitmap is not set in
hbitmap_deserialize_finish(); let's set it to the right value.

Cc: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
Cc: Fam Zheng <f...@redhat.com>
Cc: Max Reitz <mre...@redhat.com>
Cc: John Snow <js...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
Signed-off-by: Weiping Zhang <zhangweip...@didichuxing.com>
Signed-off-by: Liang Li <liliang...@didichuxing.com>

---
 util/hbitmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util/hbitmap.c b/util/hbitmap.c
index 289778a..58a2c93 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -630,6 +630,7 @@ void hbitmap_deserialize_finish(HBitmap *bitmap)
 }
 
 bitmap->levels[0][0] |= 1UL << (BITS_PER_LONG - 1);
+bitmap->count = hb_count_between(bitmap, 0, bitmap->size - 1);
 }
 
 void hbitmap_free(HBitmap *hb)
-- 
1.8.3.1




[Qemu-devel] [PATCH v4 qemu 6/6] migration: skip unused pages during live migration

2017-01-11 Thread Liang Li
After sending out the request for unused pages, the live migration
process will start without waiting for the unused page bitmap to be
ready. If the unused page bitmap is not ready when doing the 1st
migration_bitmap_sync() after ram_save_setup(), the unused page
bitmap will be ignored; this means the unused pages will not be
filtered out in this case.
The current implementation can not work with postcopy: if postcopy
is enabled, we simply ignore the unused pages. Will make it
work later.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 migration/ram.c | 86 -
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index a1c8089..f029512 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -44,6 +44,8 @@
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
+#include "sysemu/balloon.h"
+#include "sysemu/kvm.h"
 
 #ifdef DEBUG_MIGRATION_RAM
 #define DPRINTF(fmt, ...) \
@@ -229,6 +231,8 @@ static QemuMutex migration_bitmap_mutex;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
 static bool ram_bulk_stage;
+static bool ignore_unused_page;
+static uint64_t unused_page_req_id;
 
 /* used by the search for pages to send */
 struct PageSearchStatus {
@@ -245,6 +249,7 @@ static struct BitmapRcu {
 struct rcu_head rcu;
 /* Main migration bitmap */
 unsigned long *bmap;
+unsigned long *unused_page_bmap;
 /* bitmap of pages that haven't been sent even once
  * only maintained and used in postcopy at the moment
  * where it's used to send the dirtymap at the start
@@ -637,6 +642,7 @@ static void migration_bitmap_sync(void)
 rcu_read_unlock();
 qemu_mutex_unlock(&migration_bitmap_mutex);
 
+ignore_unused_page = true;
 trace_migration_bitmap_sync_end(migration_dirty_pages
 - num_dirty_pages_init);
 num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
@@ -1483,6 +1489,76 @@ void migration_bitmap_extend(ram_addr_t old, ram_addr_t 
new)
 }
 }
 
+static void filter_out_unused_pages(unsigned long *raw_bmap, long nbits)
+{
+long i, page_count = 0, len;
+unsigned long *new_bmap;
+
+tighten_guest_free_page_bmap(raw_bmap);
+qemu_mutex_lock(&migration_bitmap_mutex);
+new_bmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
+slow_bitmap_complement(new_bmap, raw_bmap, nbits);
+
+len = (last_ram_offset() >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+for (i = 0; i < len; i++) {
+page_count += hweight_long(new_bmap[i]);
+}
+
+migration_dirty_pages = page_count;
+qemu_mutex_unlock(&migration_bitmap_mutex);
+}
+
+static void ram_get_unused_pages(unsigned long *bmap, unsigned long max_pfn)
+{
+BalloonReqStatus status;
+
+unused_page_req_id++;
+status = balloon_get_unused_pages(bmap, max_pfn / BITS_PER_BYTE,
+  unused_page_req_id);
+if (status == REQ_START) {
+ignore_unused_page = false;
+}
+}
+
+static void ram_handle_unused_page(void)
+{
+unsigned long nbits, req_id = 0;
+RAMBlock *pc_ram_block;
+BalloonReqStatus status;
+
+status = balloon_unused_page_ready(&req_id);
+switch (status) {
+case REQ_DONE:
+if (req_id != unused_page_req_id) {
+return;
+}
+rcu_read_lock();
+pc_ram_block = QLIST_FIRST_RCU(&ram_list.blocks);
+nbits = pc_ram_block->used_length >> TARGET_PAGE_BITS;
+filter_out_unused_pages(migration_bitmap_rcu->unused_page_bmap, nbits);
+rcu_read_unlock();
+
+qemu_mutex_lock_iothread();
+migration_bitmap_sync();
+qemu_mutex_unlock_iothread();
+/*
+ * bulk stage assumes in (migration_bitmap_find_and_reset_dirty) that
+ * every page is dirty, that's no longer true at this point.
+ */
+ram_bulk_stage = false;
+last_seen_block = NULL;
+last_sent_block = NULL;
+last_offset = 0;
+break;
+case REQ_ERROR:
+ignore_unused_page = true;
+error_report("failed to get unused page");
+break;
+default:
+break;
+}
+}
+
 /*
  * 'expected' is the value you expect the bitmap mostly to be full
  * of; it won't bother printing lines that are all this value.
@@ -1962,8 +2038,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
  }
 }
 
-rcu_read_lock();
+if (balloon_unused_pages_support() && !migrate_postcopy_ram()) {
+unsigned long max_pfn = get_guest_max_pfn();
+migration_bitmap_rcu->unused_page_bmap = bitmap_new(max_pfn);
+ram_get_unused_pages(migration_bitmap_rcu->unused_page_bmap, max_pfn);
+}
 
+rcu_read_lock();
 qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
 
 QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
@@ -2004,6 +20

[Qemu-devel] [PATCH v4 qemu 5/6] kvm.c: Add two new arch specific functions

2017-01-11 Thread Liang Li
Add a new function to get the VM's max pfn and a new function
to filter out the holes in the raw free page bitmap to get
a tight free page bitmap. They are implemented on x86 and should
be implemented on other arches for live migration optimization.
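
As a worked example of the x86 case handled below (assuming an illustrative
guest with 2 GiB below the 4 GiB hole, 4 GiB above it, and 4 KiB pages): the
raw guest bitmap has a gap covering [2 GiB, 4 GiB), so the bits for the
above-4G pages are moved down to start right after the below-4G pages and the
now-stale tail is cleared:

#include <stdio.h>

#define TARGET_PAGE_BITS 12            /* 4 KiB pages (assumption) */
#define GiB (1ULL << 30)

int main(void)
{
    unsigned long long below_4g = 2 * GiB;   /* illustrative memory split */
    unsigned long long above_4g = 4 * GiB;
    unsigned long long _4g = 4 * GiB;

    /* Source of the move: first PFN above the 4 GiB boundary. */
    unsigned long long src_pfn = _4g >> TARGET_PAGE_BITS;
    /* Destination: right after the last below-4G PFN. */
    unsigned long long dst_pfn = below_4g >> TARGET_PAGE_BITS;
    /* Number of page bits to move, then the hole-sized tail to clear. */
    unsigned long long move_pages  = above_4g >> TARGET_PAGE_BITS;
    unsigned long long clear_start = (below_4g + above_4g) >> TARGET_PAGE_BITS;
    unsigned long long clear_pages = (_4g - below_4g) >> TARGET_PAGE_BITS;

    printf("move %llu page bits from PFN %llu down to PFN %llu\n",
           move_pages, src_pfn, dst_pfn);
    printf("clear %llu trailing bits starting at PFN %llu\n",
           clear_pages, clear_start);
    return 0;
}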

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 include/sysemu/kvm.h | 18 ++
 target/arm/kvm.c | 14 ++
 target/i386/kvm.c| 37 +
 target/mips/kvm.c| 14 ++
 target/ppc/kvm.c | 14 ++
 target/s390x/kvm.c   | 14 ++
 6 files changed, 111 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index df67cc0..ef91053 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -238,6 +238,24 @@ int kvm_remove_breakpoint(CPUState *cpu, target_ulong addr,
   target_ulong len, int type);
 void kvm_remove_all_breakpoints(CPUState *cpu);
 int kvm_update_guest_debug(CPUState *cpu, unsigned long reinject_trap);
+
+/**
+ * tighten_guest_free_page_bmap - process the free page bitmap from
+ * guest to get a tight page bitmap which does not contain
+ * holes.
+ * @bmap: undressed guest free page bitmap
+ * Returns: a tight guest free page bitmap, the n th bit in the
+ * returned bitmap and the n th bit in the migration bitmap
+ * should correspond to the same guest RAM page.
+ */
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap);
+
+/**
+ * get_guest_max_pfn - get the max pfn of guest
+ * Returns: the max pfn of guest
+ */
+unsigned long get_guest_max_pfn(void);
+
 #ifndef _WIN32
 int kvm_set_signal_mask(CPUState *cpu, const sigset_t *sigset);
 #endif
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index c00b94e..785e969 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -638,3 +638,17 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 return (data - 32) & 0xffff;
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 10a9cd8..b8dbe3d 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -3541,3 +3541,40 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+#define _4G (1ULL << 32)
+
+unsigned long get_guest_max_pfn(void)
+{
+PCMachineState *pcms = PC_MACHINE(current_machine);
+ram_addr_t above_4g_mem = pcms->above_4g_mem_size;
+unsigned long max_pfn;
+
+if (above_4g_mem) {
+max_pfn = (_4G + above_4g_mem) >> TARGET_PAGE_BITS;
+} else {
+max_pfn = pcms->below_4g_mem_size >> TARGET_PAGE_BITS;
+}
+
+return max_pfn;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+PCMachineState *pcms = PC_MACHINE(current_machine);
+ram_addr_t above_4g_mem = pcms->above_4g_mem_size;
+
+if (above_4g_mem) {
+unsigned long *src, *dst, len, pos;
+ram_addr_t below_4g_mem = pcms->below_4g_mem_size;
+src = bmap + (_4G >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+dst = bmap + (below_4g_mem >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+bitmap_move(dst, src, above_4g_mem >> TARGET_PAGE_BITS);
+
+pos = (above_4g_mem + below_4g_mem) >> TARGET_PAGE_BITS;
+len = (_4G - below_4g_mem) >> TARGET_PAGE_BITS;
+bitmap_clear(bmap, pos, len);
+}
+
+return bmap;
+}
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index dcf5fbb..2feb406 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -1058,3 +1058,17 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 9c4834c..a130d3a 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -2672,3 +2672,17 @@ int kvmppc_enable_hwrng(void)
 
 return kvmppc_enable_hcall(kvm_state, H_RANDOM);
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index 97afe02..181c59c 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -2651,3 +2651,17 @@ void kvm_s390_apply_cpu_model(const S390CPUModel *model, 
Error **errp)
 }
 }
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
-- 
1.9.1




[Qemu-devel] [PATCH v4 qemu 4/6] bitmap: Add a new bitmap_move function

2017-01-11 Thread Liang Li
Sometimes it is needed to move a portion of a bitmap to another place
in a large bitmap. If the source and destination overlap, bitmap_copy()
can not work correctly, so we need a new function to do this work.
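
A standalone demonstration of why the new helper needs memmove() semantics:
when the destination overlaps the source (here the range is shifted upwards),
a plain forward copy loop (similar in spirit to bitmap_copy(), which must not
be used on overlapping ranges) corrupts the data, while memmove() handles the
overlap correctly:

#include <stdio.h>
#include <string.h>

/* Forward word-by-word copy, i.e. what a naive copy loop would do. */
static void forward_copy(unsigned long *dst, const unsigned long *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

int main(void)
{
    unsigned long a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    unsigned long b[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    /* Shift words [0..5] up by two positions; source and dest overlap. */
    memmove(&a[2], &a[0], 6 * sizeof(unsigned long));  /* correct result */
    forward_copy(&b[2], &b[0], 6);                     /* visibly corrupts */

    for (int i = 0; i < 8; i++)
        printf("a[%d]=%lu  b[%d]=%lu\n", i, a[i], i, b[i]);
    return 0;
}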

Signed-off-by: Liang Li <liang.z...@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
---
 include/qemu/bitmap.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 63ea2d0..775d05e 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -37,6 +37,7 @@
  * bitmap_set(dst, pos, nbits) Set specified bit area
  * bitmap_set_atomic(dst, pos, nbits)   Set specified bit area with atomic ops
  * bitmap_clear(dst, pos, nbits)   Clear specified bit area
+ * bitmap_move(dst, src, nbits) Move *src to *dst
  * bitmap_test_and_clear_atomic(dst, pos, nbits)Test and clear area
  * bitmap_find_next_zero_area(buf, len, pos, n, mask)  Find bit free area
  */
@@ -129,6 +130,18 @@ static inline void bitmap_copy(unsigned long *dst, const 
unsigned long *src,
 }
 }
 
+static inline void bitmap_move(unsigned long *dst, const unsigned long *src,
+   long nbits)
+{
+if (small_nbits(nbits)) {
+unsigned long tmp = *src;
+*dst = tmp;
+} else {
+long len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
+memmove(dst, src, len);
+}
+}
+
 static inline int bitmap_and(unsigned long *dst, const unsigned long *src1,
  const unsigned long *src2, long nbits)
 {
-- 
1.9.1




[Qemu-devel] [PATCH v4 qemu 3/6] balloon: get unused page info from guest

2017-01-11 Thread Liang Li
Add a new feature to get the unused page information from the guest;
the unused page information is saved in {pfn|length} arrays.
Please note that 'unused page' means the page is not in use at some point
after the host sets the value of the request ID and before it receives the
response with the same ID.
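
A standalone model (illustrative names) of how the request ID bounds that
validity window: the consumer only trusts a response whose ID matches the most
recent request it issued, so answers that raced with a newer request are simply
discarded, as the QEMU-side handling later in this patch does:

#include <stdio.h>
#include <stdbool.h>

/* Model: the host bumps an ID per request and ignores stale responses. */
static unsigned long current_req_id;

static unsigned long send_unused_page_request(void)
{
    return ++current_req_id;            /* id handed to the guest */
}

static bool response_usable(unsigned long resp_id)
{
    /* Only the answer to the latest request describes pages that were
     * unused inside the current request/response window. */
    return resp_id == current_req_id;
}

int main(void)
{
    unsigned long old_id = send_unused_page_request();
    unsigned long new_id = send_unused_page_request(); /* newer request issued */

    printf("stale response usable: %d\n", response_usable(old_id)); /* 0 */
    printf("fresh response usable: %d\n", response_usable(new_id)); /* 1 */
    return 0;
}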

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 balloon.c  |  47 +++-
 hw/virtio/virtio-balloon.c | 149 -
 include/hw/virtio/virtio-balloon.h |  18 -
 include/sysemu/balloon.h   |  18 -
 4 files changed, 227 insertions(+), 5 deletions(-)

diff --git a/balloon.c b/balloon.c
index f2ef50c..8efabe1 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,8 @@
 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonGetUnusedPage *balloon_get_unused_page_fn;
+static QEMUBalloonUnusedPageReady *balloon_unused_page_ready_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
 
@@ -65,9 +67,13 @@ static bool have_balloon(Error **errp)
 }
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonGetUnusedPage *get_unused_page_func,
+ QEMUBalloonUnusedPageReady 
*unused_page_ready_func,
+ void *opaque)
 {
-if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+if (balloon_event_fn || balloon_stat_fn || balloon_get_unused_page_fn
+|| balloon_unused_page_ready_fn || balloon_opaque) {
 /* We're already registered one balloon handler.  How many can
  * a guest really have?
  */
@@ -75,6 +81,8 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
 }
 balloon_event_fn = event_func;
 balloon_stat_fn = stat_func;
+balloon_get_unused_page_fn = get_unused_page_func;
+balloon_unused_page_ready_fn = unused_page_ready_func;
 balloon_opaque = opaque;
 return 0;
 }
@@ -86,6 +94,8 @@ void qemu_remove_balloon_handler(void *opaque)
 }
 balloon_event_fn = NULL;
 balloon_stat_fn = NULL;
+balloon_get_unused_page_fn = NULL;
+balloon_unused_page_ready_fn = NULL;
 balloon_opaque = NULL;
 }
 
@@ -116,3 +126,36 @@ void qmp_balloon(int64_t target, Error **errp)
 trace_balloon_event(balloon_opaque, target);
 balloon_event_fn(balloon_opaque, target);
 }
+
+bool balloon_unused_pages_support(void)
+{
+return balloon_get_unused_page_fn ? true : false;
+}
+
+BalloonReqStatus balloon_get_unused_pages(unsigned long *bitmap,
+  unsigned long len,
+  unsigned long req_id)
+{
+if (!balloon_get_unused_page_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!bitmap) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_get_unused_page_fn(balloon_opaque, bitmap, len, req_id);
+}
+
+BalloonReqStatus balloon_unused_page_ready(unsigned long *req_id)
+{
+if (!balloon_unused_page_ready_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!req_id) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_unused_page_ready_fn(balloon_opaque, req_id);
+}
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 4ab65ba..71c7e49 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -143,6 +143,13 @@ static bool balloon_page_ranges_supported(const 
VirtIOBalloon *s)
 return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_RANGE);
 }
 
+static bool balloon_host_request_vq_supported(const VirtIOBalloon *s)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
+return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+}
+
 static bool balloon_stats_enabled(const VirtIOBalloon *s)
 {
 return s->stats_poll_interval > 0;
@@ -394,6 +401,72 @@ out:
 }
 }
 
+static void virtio_balloon_handle_resp(VirtIODevice *vdev, VirtQueue *vq)
+{
+VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+VirtQueueElement *elem;
+size_t offset = 0;
+struct virtio_balloon_resp_hdr hdr;
+uint64_t range;
+
+elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+if (!elem) {
+s->req_status = REQ_ERROR;
+return;
+}
+
+s->host_req_vq_elem = elem;
+if (!elem->out_num) {
+return;
+}
+
+iov_to_buf(elem->out_sg, elem->out_num, offset,
+   &hdr, sizeof(hdr));
+offset += sizeof(hdr);
+
+switch (hdr.cmd) {
+case BALLOON_GET_UNUSED_PAGES:
+if (hdr.id == s->host_req.param) {
+if (s->bmap_len < hdr.data_len) {
+ hdr.data_len = s->bmap_len;
+}
+
+while (offset < hdr.data_len + sizeof(hdr)) {
+unsigned long pfn, nr_page;
+
+iov_

[Qemu-devel] [PATCH v4 qemu 0/6] Fast (de)inflating & fast live migration

2017-01-11 Thread Liang Li
This patch set intends to do two optimizations: one is to speed up
the (de)inflating process of virtio balloon, and the other is to
speed up the live migration process. We put them together
because both of them require changes to the virtio balloon spec.
 
The main idea of speeding up the (de)inflating process is to use
{pfn|length} to send the page information to the host instead of the PFNs,
to reduce the overhead of virtio data transmission, address translation
and madvise(). This can help to improve the performance by about 85%.
 
The idea of speeding up live migration is to skip processing the guest's
unused pages in the first round of data copy, to reduce needless
data processing; this can help to save quite a lot of CPU cycles and
network bandwidth. We get the guest's unused page information through the
virt queue of virtio-balloon, and filter out these unused pages during
live migration. For an idle 8GB guest, this can help to shorten the
total live migration time from 2 seconds to about 500ms in a 10Gbps
network environment.
 
Changes from v3 to v4:
* Update kernel head file because of ABI change 
* Change the code to get the page information

Changes from v2 to v3:
* Merged two patches for kernel head file updating into one 
* Removed one patch which was unrelated with this feature 
* Removed the patch to migrate the vq elem, use a new way instead

Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Get a struct from vq instead of separate variables.
* Use two separate APIs to request free pages and query the status.
* Changed the virtio balloon interface.
* Addressed some of the comments of v1.

Liang Li (6):
  virtio-balloon: update linux head file
  virtio-balloon: speed up inflating & deflating process
  balloon: get unused page info from guest
  bitmap: Add a new bitmap_move function
  kvm.c: Add two new arch specific functions
  migration: skip unused pages during live migration

 balloon.c   |  47 +++-
 hw/virtio/virtio-balloon.c  | 291 +---
 include/hw/virtio/virtio-balloon.h  |  18 +-
 include/qemu/bitmap.h   |  13 ++
 include/standard-headers/linux/virtio_balloon.h |  34 +++
 include/sysemu/balloon.h|  18 +-
 include/sysemu/kvm.h|  18 ++
 migration/ram.c |  86 ++-
 target/arm/kvm.c|  14 ++
 target/i386/kvm.c   |  37 +++
 target/mips/kvm.c   |  14 ++
 target/ppc/kvm.c|  14 ++
 target/s390x/kvm.c  |  14 ++
 13 files changed, 587 insertions(+), 31 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v4 qemu 2/6] virtio-balloon: speed up inflating & deflating process

2017-01-11 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If using {pfn|length} arrays to send the page info instead of the
PFNs, we can reduce the overhead in stage b quite a lot. Furthermore,
we can do address translation and call madvise() with a bulk of
RAM pages instead of the current page-by-page way, so the overhead
of stage c and stage d can also be reduced a lot.

This patch is the QEMU side implementation, which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.
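
For illustration, a standalone sketch of the {pfn|length} encoding this
feature uses, matching the decode in balloon_bulk_pages() below: each 64-bit
entry carries the start PFN in the upper bits and the run length in the low
VIRTIO_BALLOON_NR_PFN_BITS bits, and a length field of 0 means the real count
did not fit and follows in the next 64-bit word (the exact cutoff here is an
assumption consistent with that decoder):

#include <stdio.h>
#include <stdint.h>

#define NR_PFN_BITS 12                          /* VIRTIO_BALLOON_NR_PFN_BITS */
#define NR_PFN_MASK ((1ULL << NR_PFN_BITS) - 1)

/* Encode one range; returns how many 64-bit words were written (1 or 2). */
static int encode_range(uint64_t *out, uint64_t base_pfn, uint64_t nr_pages)
{
    if (nr_pages < (1ULL << NR_PFN_BITS)) {
        out[0] = (base_pfn << NR_PFN_BITS) | nr_pages;
        return 1;
    }
    out[0] = base_pfn << NR_PFN_BITS;           /* low bits = 0: count follows */
    out[1] = nr_pages;
    return 2;
}

int main(void)
{
    uint64_t buf[4];
    int n = 0;

    n += encode_range(buf + n, 0x12340, 16);     /* short run, one word  */
    n += encode_range(buf + n, 0x80000, 65536);  /* long run, two words  */

    for (int i = 0; i < n; ) {
        uint64_t base = buf[i] >> NR_PFN_BITS;
        uint64_t len = buf[i] & NR_PFN_MASK;
        i++;
        if (len == 0) {                          /* same rule as the decoder */
            len = buf[i];
            i++;
        }
        printf("range: base pfn 0x%llx, %llu pages\n",
               (unsigned long long)base, (unsigned long long)len);
    }
    return 0;
}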

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
---
 hw/virtio/virtio-balloon.c | 142 +
 1 file changed, 117 insertions(+), 25 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a705e0e..4ab65ba 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -31,6 +31,7 @@
 #include "hw/virtio/virtio-access.h"
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
+#define BALLOON_NR_PFN_MASK ((1 << VIRTIO_BALLOON_NR_PFN_BITS) - 1)
 
 static void balloon_page(void *addr, int deflate)
 {
@@ -52,6 +53,69 @@ static const char *balloon_stat_names[] = {
[VIRTIO_BALLOON_S_NR] = NULL
 };
 
+static void do_balloon_bulk_pages(ram_addr_t base_pfn,
+  ram_addr_t size, bool deflate)
+{
+ram_addr_t processed, chunk, base;
+MemoryRegionSection section = {.mr = NULL};
+
+base = base_pfn * TARGET_PAGE_SIZE;
+
+for (processed = 0; processed < size; processed += chunk) {
+chunk = size - processed;
+while (chunk >= TARGET_PAGE_SIZE) {
+section = memory_region_find(get_system_memory(),
+ base + processed, chunk);
+if (!section.mr) {
+chunk = QEMU_ALIGN_DOWN(chunk / 2, TARGET_PAGE_SIZE);
+} else {
+break;
+}
+}
+
+if (!section.mr || !int128_nz(section.size) ||
+!memory_region_is_ram(section.mr) ||
+memory_region_is_rom(section.mr) ||
+memory_region_is_romd(section.mr)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid guest RAM range [0x%lx, 0x%lx]\n",
+  base + processed, chunk);
+chunk = TARGET_PAGE_SIZE;
+} else {
+void *addr = section.offset_within_region +
+   memory_region_get_ram_ptr(section.mr);
+
+qemu_madvise(addr, chunk,
+ deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+}
+}
+}
+
+static void balloon_bulk_pages(struct virtio_balloon_resp_hdr *hdr,
+   uint64_t *pages, bool deflate)
+{
+ram_addr_t base_pfn;
+unsigned long current = 0, nr_pfn, len = hdr->data_len;
+uint64_t *range;
+
+if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
+ kvm_has_sync_mmu())) {
+while (current < len / sizeof(uint64_t)) {
+range = pages + current;
+base_pfn = *range >> VIRTIO_BALLOON_NR_PFN_BITS;
+nr_pfn = *range & BALLOON_NR_PFN_MASK;
+current++;
+if (nr_pfn == 0) {
+nr_pfn = *(range + 1);
+current++;
+}
+
+do_balloon_bulk_pages(base_pfn, nr_pfn * TARGET_PAGE_SIZE,
+  deflate);
+}
+}
+}
+
 /*
  * reset_stats - Mark all items in the stats array as unset
  *
@@ -72,6 +136,13 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
 return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
 }
 
+static bool balloon_page_ranges_supported(const VirtIOBalloon *s)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
+return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_RANGE);
+}
+
 static bool balloon_stats_enabled(const VirtIOBalloon *s)
 {
 return s->stats_poll_interval > 0;
@@ -218,32 +289,51 @@ static void virtio_balloon_handle_output(VirtIODevice 
*vdev, VirtQueue *vq)
 return;
 }
 
-while (iov_to_buf(elem->out_sg, elem->out_num, offset, , 4) == 4) {
-ram_addr_t pa;
-ram_

[Qemu-devel] [PATCH v4 qemu 1/6] virtio-balloon: update linux head file

2017-01-11 Thread Liang Li
Update the Linux header file to keep it consistent with the kernel side.
The new definitions will be used in the following patches.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 include/standard-headers/linux/virtio_balloon.h | 34 +
 1 file changed, 34 insertions(+)

diff --git a/include/standard-headers/linux/virtio_balloon.h 
b/include/standard-headers/linux/virtio_balloon.h
index 9d06ccd..c15d592 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,10 +34,15 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_RANGE 3 /* Send page info with ranges */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+/* Bits width for the length of the pfn range */
+#define VIRTIO_BALLOON_NR_PFN_BITS 12
+
 struct virtio_balloon_config {
/* Number of pages host wants Guest to give up. */
uint32_t num_pages;
@@ -82,4 +87,33 @@ struct virtio_balloon_stat {
__virtio64 val;
 } QEMU_PACKED;
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   uint64_t cmd : 8; /* Distinguish different requests type */
+   uint64_t flag: 8; /* Mark status for a specific request type */
+   uint64_t id : 16; /* Distinguish requests of a specific type */
+   uint64_t data_len: 32; /* Length of the following data, in bytes */
+};
+
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   uint16_t cmd;
+   /* Reserved */
+   uint16_t reserved[3];
+   /* Request parameter */
+   uint64_t param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1




[Qemu-devel] [PATCH v6 kernel 5/5] virtio-balloon: tell host vm's unused page info

2016-12-20 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm to get the unused page information.
The virtio balloon driver will use this new API to get the
unused page info and send it to the hypervisor (QEMU) to speed up live
migration. While the bitmap is being sent, some of the pages may be modified
and used by the guest; this inaccuracy can be corrected by the
dirty page logging mechanism.

The other is to add support for the host's request for the VM's unused page
information. QEMU can make use of the unused page information and the dirty
page logging mechanism to skip the transportation of some of these unused
pages; this is very helpful to reduce the network traffic and speed
up the live migration process.
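
A standalone model of why the reported information only needs to be
approximately correct: a page wrongly treated as unused is skipped in the
first pass, but as soon as the guest touches it again dirty page logging marks
it and a later pass resends it, so no in-use data is lost, only part of the
saving (page counts below are illustrative):

#include <stdio.h>
#include <stdbool.h>

#define NPAGES 8

int main(void)
{
    /* Snapshot from the guest: which pages looked unused at request time. */
    bool reported_unused[NPAGES] = { 0, 1, 1, 0, 1, 0, 0, 1 };
    /* Pages the guest wrote to after the report (tracked by dirty logging). */
    bool dirtied_later[NPAGES]   = { 0, 0, 1, 0, 0, 0, 0, 0 };
    bool sent[NPAGES] = { false };

    /* Pass 1: skip pages that were reported unused. */
    for (int p = 0; p < NPAGES; p++)
        if (!reported_unused[p])
            sent[p] = true;

    /* Later passes: dirty logging forces re-sending anything touched since. */
    for (int p = 0; p < NPAGES; p++)
        if (dirtied_later[p])
            sent[p] = true;

    for (int p = 0; p < NPAGES; p++)
        printf("page %d: unused=%d dirtied=%d sent=%d\n",
               p, reported_unused[p], dirtied_later[p], sent[p]);
    return 0;
}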

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 144 ++--
 include/linux/mm.h  |   3 +
 mm/page_alloc.c | 120 +
 3 files changed, 261 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 03383b3..b67f865 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -85,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -505,6 +507,80 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void __send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id, unsigned int pos, bool done)
+{
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   struct virtqueue *vq = vb->req_vq;
+
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (!done)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   if (pos > 0 || done)
+   send_resp_data(vb, vq, true);
+
+}
+
+static void send_unused_pages(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned int pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   int ret, order;
+   struct zone *zone = NULL;
+   bool part_fill = false;
+
+   mutex_lock(&vb->balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   ret = mark_unused_pages(&zone, order, vb->resp_data,
+vb->resp_buf_size / sizeof(__le64),
+&pos, VIRTIO_BALLOON_NR_PFN_BITS, part_fill);
+   if (ret == -ENOSPC) {
+   if (pos == 0) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   } else {
+   part_fill = true;
+   dev_warn(&vb->vdev->dev,
+"%s: part fill order: %d\n",
+__func__, order);
+   }
+   } else {
+   __send_unused_pages(vb, req_id, pos, false);
+   pos = 0;
+   }
+
+   if (!part_fill) {
+   order++;
+   continue;
+   }
+   } else
+   zone = NULL;
+
+   if

[Qemu-devel] [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2016-12-20 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If using a {pfn|length} array to send the page info instead of the
PFNs, we can reduce the overhead in stage b quite a lot.
Furthermore, we can do the address translation and call madvise()
with a range of memory instead of the current page-by-page way, so
the overhead of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.
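
For reference, the sizing arithmetic behind the split page bitmaps used below
(assuming 4 KiB pages): each BALLOON_BMAP_SIZE chunk of 8 pages tracks
PFNS_PER_BMAP page frames, i.e. 1 GiB of guest memory, and BALLOON_BMAP_COUNT
such chunks cover up to 32 GiB before the driver has to process the address
space in several passes:

#include <stdio.h>

#define PAGE_SIZE          4096UL        /* assumption: 4 KiB pages */
#define BITS_PER_BYTE      8UL
#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
#define PFNS_PER_BMAP      (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
#define BALLOON_BMAP_COUNT 32UL

int main(void)
{
    unsigned long long per_chunk_bytes =
        (unsigned long long)PFNS_PER_BMAP * PAGE_SIZE;
    unsigned long long total_gib =
        (BALLOON_BMAP_COUNT * per_chunk_bytes) >> 30;

    printf("one 32 KiB chunk tracks %lu PFNs = %llu MiB of guest memory\n",
           PFNS_PER_BMAP, per_chunk_bytes >> 20);
    printf("%lu chunks track up to %llu GiB\n", BALLOON_BMAP_COUNT, total_gib);
    return 0;
}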

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 348 
 1 file changed, 320 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..03383b3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,20 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   __le64 *resp_data;
+   /* Size of response data buffer. */
+   unsigned int resp_buf_size;
+   /* Pointer offset of the response data. */
+   unsigned int resp_pos;
+   /* Bitmap used to save the pfns info */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +128,180 @@ static void balloon_ack(struct virtqueue *vq)
 wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i+

[Qemu-devel] [PATCH v6 kernel 4/5] virtio-balloon: define flags and head for host request vq

2016-12-20 Thread Liang Li
Define the flags and header struct for a new host request virtual
queue. The guest can get requests from the host and then respond to them on
this new virtual queue.
The host can make use of this virtual queue to request the guest to do some
operations, e.g. drop page cache, synchronize file system, etc.
And the hypervisor can get some of the guest's runtime information
through this virtual queue too, e.g. the guest's unused page
information, which can be used for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 2f850bf..b367020 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_RANGE 3 /* Send page info with ranges */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -94,4 +95,25 @@ struct virtio_balloon_resp_hdr {
__le64 data_len: 32; /* Length of the following data, in bytes */
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1




[Qemu-devel] [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2016-12-20 Thread Liang Li
Add a new feature which supports sending the page information
with a range array. The current implementation uses a PFN array,
which is not very efficient. Using ranges can improve the
performance of inflating/deflating significantly.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..2f850bf 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,10 +34,14 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_RANGE 3 /* Send page info with ranges */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+/* Bits width for the length of the pfn range */
+#define VIRTIO_BALLOON_NR_PFN_BITS 12
+
 struct virtio_balloon_config {
/* Number of pages host wants Guest to give up. */
__u32 num_pages;
@@ -82,4 +86,12 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1




[Qemu-devel] [PATCH v6 kernel 1/5] virtio-balloon: rework deflate to add page to a list

2016-12-20 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is not
efficient when inflating/deflating a large amount of memory because of too
many repetitions of the following operations:

1. Virtio data transmission
2. Page allocate/free
3. Address translation (GPA->HVA)
4. madvise

The overhead of these operations will consume a lot of CPU cycles and
will take a long time to complete; it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. if there are several pages whose addresses are physically
contiguous in the guest, these pages can be processed in one
operation.

The main idea for the optimization is to reduce the above operations as
much as possible, and it can be achieved by using a {pfn|length} array
instead of a PFN array. Compared with a PFN array, a {pfn|length} array can
represent more pages and is better suited for batch processing.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using the {pfn|length} down the
road. balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Cc: David Hildenbrand <da...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
 struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
 mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.9.1




[Qemu-devel] [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-20 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process;
the main idea of this optimization is to use {pfn|length} to represent
the page information instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.
 
The other change is for speeding up live migration. By skipping the
processing of the guest's unused pages in the first round of data copy,
to reduce needless data processing, this can help to save quite a lot of
CPU cycles and network bandwidth. We put the guest's unused page
information in a {pfn|length} array and send it to the host with the virt
queue of virtio-balloon. For an idle guest with 8GB RAM, this can help to
shorten the total live migration time from 2 seconds to about 500ms in a
10Gbps network environment. For a guest with quite a lot of page cache and
few unused pages, it's possible to let the guest drop its page cache before
live migration; this case can benefit from this new feature too.
 
Changes from v5 to v6:
* Drop the bitmap from the virtio ABI, use {pfn|length} only.
* Enhance the API to get the unused page information from mm. 

Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code which is missed in v3 to handle migrate page. 
* Free the memory for bitmap intime once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 510 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  34 +++
 mm/page_alloc.c | 120 +
 4 files changed, 621 insertions(+), 46 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq

2016-11-30 Thread Liang Li
Define the flags and header struct for a new host request virtual
queue. The guest can get requests from the host and then respond to them on
this new virtual queue.
The host can make use of this virtual queue to request the guest to do some
operations, e.g. drop page cache, synchronize file system, etc.
And the hypervisor can get some of the guest's runtime information
through this virtual queue too, e.g. the guest's unused page
information, which can be used for live migration optimization.
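
A standalone sketch of the chunking convention the two flags express: a long
answer to one request is streamed in several response buffers, all tagged with
the same request ID, intermediate chunks flagged BALLOON_FLAG_CONT and the
final one BALLOON_FLAG_DONE, so the host knows when the response for that ID
is complete (buffer capacity and counts below are illustrative):

#include <stdio.h>

enum { FLAG_CONT, FLAG_DONE };           /* mirrors virtio_balloon_flag */

/* Pretend each response buffer can carry at most 'cap' entries. */
static void stream_response(unsigned req_id, unsigned total, unsigned cap)
{
    unsigned sent = 0;

    while (sent < total) {
        unsigned n = (total - sent > cap) ? cap : total - sent;
        int flag = (sent + n < total) ? FLAG_CONT : FLAG_DONE;

        printf("resp id=%u entries=%u flag=%s\n", req_id, n,
               flag == FLAG_CONT ? "CONT" : "DONE");
        sent += n;
    }
}

int main(void)
{
    stream_response(7, 10, 4);           /* 3 chunks: CONT, CONT, DONE */
    return 0;
}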

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 1be4b1f..5ac3a40 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process

2016-11-30 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() with a bulk of
RAM pages instead of the current page-by-page way, so the overhead
of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 395 +---
 1 file changed, 367 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..c3ddec3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,228 @@ static void balloon_ack(struct virtqueue *vq)
 wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+   unsigned long nr_pfn)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i &

[Qemu-devel] [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct

2016-11-30 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, page bitmap length and
start pfn.
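
For illustration, a standalone sketch of the 64-bit header layout described
above, packed with explicit shifts instead of C bitfields (the field widths
52/6/6 come from the struct below; the actual bit ordering of the in-kernel
bitfield depends on the ABI, so this is only a model):

#include <stdio.h>
#include <stdint.h>

/* Widths from virtio_balloon_bmap_hdr: start_pfn:52, page_shift:6, bmap_len:6 */
#define START_PFN_BITS  52
#define PAGE_SHIFT_BITS 6

static uint64_t pack_hdr(uint64_t start_pfn, uint64_t page_shift,
                         uint64_t bmap_len)
{
    return (start_pfn & ((1ULL << START_PFN_BITS) - 1)) |
           ((page_shift & 0x3f) << START_PFN_BITS) |
           ((bmap_len & 0x3f) << (START_PFN_BITS + PAGE_SHIFT_BITS));
}

int main(void)
{
    uint64_t h = pack_hdr(0x123456, 12, 16); /* PFN, 4 KiB pages, 16-byte bitmap */

    printf("header word: 0x%016llx\n", (unsigned long long)h);
    printf("start_pfn : 0x%llx\n",
           (unsigned long long)(h & ((1ULL << START_PFN_BITS) - 1)));
    printf("page_shift: %llu\n",
           (unsigned long long)((h >> START_PFN_BITS) & 0x3f));
    printf("bmap_len  : %llu\n",
           (unsigned long long)(h >> (START_PFN_BITS + PAGE_SHIFT_BITS)));
    return 0;
}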

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..1be4b1f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bytes */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-11-30 Thread Liang Li
This patch contains two parts:

One is to add a new API to mm to get the unused page information.
The virtio balloon driver will use this new API to get the
unused page info and send it to the hypervisor (QEMU) to speed up live
migration. While the bitmap is being sent, some of the pages may be modified
and used by the guest; this inaccuracy can be corrected by the
dirty page logging mechanism.

The other is to add support for the host's request for the VM's unused page
information. QEMU can make use of the unused page information and the dirty
page logging mechanism to skip the transportation of some of these unused
pages; this is very helpful to reduce the network traffic and speed
up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 126 +---
 include/linux/mm.h  |   3 +-
 mm/page_alloc.c |  72 +++
 3 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c3ddec3..2626cc0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -75,6 +75,8 @@ struct virtio_balloon {
void *resp_hdr;
/* Pointer to the start address of response data. */
unsigned long *resp_data;
+   /* Size of response data buffer. */
+   unsigned long resp_buf_size;
/* Pointer offset of the response data. */
unsigned long resp_pos;
/* Bitmap and bitmap count used to tell the host the pages */
@@ -83,6 +85,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -551,6 +555,58 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pos = 0;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret, order;
+
+   mutex_lock(&vb->balloon_lock);
+
+   for (order = MAX_ORDER - 1; order >= 0; order--) {
+   pos = 0;
+   ret = get_unused_pages(vb->resp_data,
+vb->resp_buf_size / sizeof(unsigned long),
+order, &pos);
+   if (ret == -ENOSPC) {
+   void *new_resp_data;
+
+   new_resp_data = kmalloc(2 * vb->resp_buf_size,
+   GFP_KERNEL);
+   if (new_resp_data) {
+   kfree(vb->resp_data);
+   vb->resp_data = new_resp_data;
+   vb->resp_buf_size *= 2;
+   order++;
+   continue;
+   } else
+   dev_warn(&vb->vdev->dev,
+"%s: omit some %d order pages\n",
+__func__, order);
+   }
+
+   if (pos > 0) {
+   vb->resp_pos = pos;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   if (order > 0)
+   hdr->flag = BALLOON_FLAG_CONT;
+   else
+   hdr->flag = BALLOON_FLAG_DONE;
+
+   send_resp_data(vb, vq, true);
+   }
+   }
+
+   mutex_unlock(&vb->balloon_lock);
+   sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in r

[Qemu-devel] [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list

2016-11-30 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is not
efficient when inflating/deflating a large amount of memory, because it
repeats the following operations too many times:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete; it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. if there are several pages whose addresses are physically
contiguous in the guest, these pages can be processed in one bulk
operation.

The main idea of the optimization is to reduce the above operations as
much as possible, and it can be achieved by using a bitmap instead of a
PFN array. Compared with a PFN array, for a buffer of a given size, a
bitmap can represent more pages, which is very important for batch
processing.

Using a bitmap instead of PFNs is not very helpful when inflating/deflating
a small amount of pages; in this case, using PFNs is better. But using
a bitmap will not impact the QoS of the guest or host heavily, because
the operation will be completed very soon for a small amount of pages,
and we will use some methods to make sure the efficiency does not drop
too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-30 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process.
The main idea of this optimization is to use a bitmap to send the page
information to the host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's unused pages in the first round of data copy to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put the guest's unused page information in a bitmap
and send it to the host with the virt queue of virtio-balloon. For an idle
guest with 8GB RAM, this can help to shorten the total live migration
time from 2s to about 500ms in a 10Gbps network environment.
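
As a rough sanity check of the bitmap-vs-PFN claim (a sketch only, assuming
4KB guest pages and the 32-bit PFN encoding used by the existing interface),
a single 4KB transfer buffer covers about 32 times more guest memory when it
carries a bitmap than when it carries a PFN array:

#include <stdio.h>

int main(void)
{
	const unsigned long long buf_bytes = 4096;  /* one 4KB transfer buffer */
	const unsigned long long page_size = 4096;  /* assumed guest page size */
	unsigned long long pfn_pages = buf_bytes / 4;   /* 4 bytes per 32-bit PFN */
	unsigned long long bmap_pages = buf_bytes * 8;  /* 1 bit per page */

	printf("PFN array: %llu pages (%llu MB) per buffer\n",
	       pfn_pages, pfn_pages * page_size >> 20);
	printf("bitmap   : %llu pages (%llu MB) per buffer\n",
	       bmap_pages, bmap_pages * page_size >> 20);
	return 0;
}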
 
Changes from v4 to v5:
* Drop the code to get the max_pfn, use another way instead.
* Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code missed in v3 to handle page migration.
* Free the memory for the bitmap in time once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 539 
 include/linux/mm.h  |   3 +-
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  72 +
 4 files changed, 607 insertions(+), 48 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-02 Thread Liang Li
Support the host's request for the vm's unused page information and
respond with a page bitmap. QEMU can make use of this bitmap and the
dirty page logging mechanism to skip the transportation of some of
these unused pages; this is very helpful to reduce the network traffic
and speed up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 128 +---
 1 file changed, 121 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6c94b6..ba2d37b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -83,6 +83,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct virtio_balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -552,6 +554,63 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in;
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->req_vq;
+   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+   int ret = 1, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(&vb->balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+   hdr->id = req_id;
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = BALLOON_FLAG_DONE;
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = BALLOON_FLAG_CONT;
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+   set_bulk_pages(vb, vq, pfn + i * PFNS_PER_BMAP,
+vb->page_bitmap[i], bmap_size, true);
+   }
+   if (vb->resp_pos > 0)
+   send_resp_data(vb, vq, true);
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(&vb->balloon_lock);
+   sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -686,18 +745,56 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+   struct virtio_balloon_req_hdr *ptr_hdr;
+   unsigned int len;
+
+   ptr_hdr = virtqueue_get_buf(vb->req_vq, &len);
+   if (!ptr_hdr || len != sizeof(vb->req_hdr))
+   return;
+
+   switch (ptr_hdr->cmd) {
+   case BALLOON_GET_UNUSED_PAGES:
+   send_unused_pages_info(vb, ptr_hdr->param);
+   break;
+   default:
+   break;
+   }
+}
+
+static void misc

[Qemu-devel] [PATCH kernel v4 6/7] virtio-balloon: define flags and head for host request vq

2016-11-02 Thread Liang Li
Define the flags and head struct for a new host request virtual
queue. The guest can get requests from the host and then respond to
them on this new virtual queue.
The host can make use of this virtual queue to request that the guest
do some operations, e.g. drop page cache, synchronize file system, etc.
And the hypervisor can get some of the guest's runtime information
through this virtual queue too, e.g. the guest's unused page
information, which can be used for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index bed6f41..c4e34d0 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ   4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+   /* Get unused page information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+   /* Used to distinguish different requests */
+   __le16 cmd;
+   /* Reserved */
+   __le16 reserved[3];
+   /* Request parameter */
+   __le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v4 4/7] virtio-balloon: speed up inflate/deflate process

2016-11-02 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest is:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() with a bulk of
RAM pages, instead of the current page-per-page way, so the overhead
of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 398 +---
 1 file changed, 369 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c6c94b6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer to the response header. */
+   void *resp_hdr;
+   /* Pointer to the start address of response data. */
+   unsigned long *resp_data;
+   /* Pointer offset of the response data. */
+   unsigned long resp_pos;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   /* Number of split page bitmaps */
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,20 +126,227 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-   struct scatterlist sg;
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+   vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count;
+   unsigned long bmap_len;
+
+   bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
+   bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+   bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+BALLOON_BMAP_COUNT);
+
+   for (i = 1; i < bmap_count; i++) {
+   vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+   if (vb->page_bitmap[i])
+   vb->nr_page_bmap++;
+   else
+   break;
+   }
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+   int i, bmap_count = vb->nr_page_bmap;
+
+
+   for (i = 1; i < bmap_count; i++) {
+   kfree(vb->page_bitmap[i]);
+   vb->page_bitmap[i] = NULL;
+   vb->nr_page_bmap--;
+   }
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+   kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+   int i;
+
+   for (i = 0; i < vb->nr_page_bmap; i++)
+  
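
The constants at the top of this patch determine how much guest RAM one
request can describe. A quick back-of-the-envelope check (a sketch only,
assuming 4KB pages): one BALLOON_BMAP_SIZE bitmap of 8 pages holds 262144
bits, i.e. 1GB of guest RAM, and the full set of BALLOON_BMAP_COUNT
bitmaps covers 32GB per pass:

#include <stdio.h>

int main(void)
{
	const unsigned long long page_size = 4096;              /* assumed 4KB pages  */
	const unsigned long long bmap_size = 8 * page_size;     /* BALLOON_BMAP_SIZE  */
	const unsigned long long pfns_per_bmap = bmap_size * 8; /* PFNS_PER_BMAP      */
	const unsigned long long bmap_count = 32;               /* BALLOON_BMAP_COUNT */

	printf("one bitmap  : %llu pfns = %llu GB of guest RAM\n",
	       pfns_per_bmap, pfns_per_bmap * page_size >> 30);
	printf("%llu bitmaps : %llu GB of guest RAM\n",
	       bmap_count, bmap_count * pfns_per_bmap * page_size >> 30);
	return 0;
}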

[Qemu-devel] [PATCH kernel v4 5/7] mm: add the related functions to get unused page

2016-11-02 Thread Liang Li
Save the unused page info into a split page bitmap. The virtio
balloon driver will use this new API to get the unused page bitmap
and send the bitmap to the hypervisor (QEMU) to speed up live migration.
While the bitmap is being sent, some of the pages may be modified and
are no longer free; this inaccuracy can be corrected by the dirty
page logging mechanism.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c| 85 ++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f47862a..7014d8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,8 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12cc8ed..72537cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4438,6 +4438,91 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits,
+   unsigned int nr_bmap)
+{
+   unsigned long pfn, flags, nr_pg, pos, *bmap;
+   unsigned int order, i, t, bmap_idx;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+
+   end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+   spin_lock_irqsave(&zone->lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, &zone->free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn < start_pfn || pfn >= end_pfn)
+   continue;
+   nr_pg = 1UL << order;
+   if (pfn + nr_pg > end_pfn)
+   nr_pg = end_pfn - pfn;
+   bmap_idx = (pfn - start_pfn) / bits;
+   if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+   bmap = bitmap[bmap_idx];
+   pos = (pfn - start_pfn) % bits;
+   bitmap_set(bmap, pos, nr_pg);
+   } else
+   for (i = 0; i < nr_pg; i++) {
+   pos = pfn - start_pfn + i;
+   bmap_idx = pos / bits;
+   bmap = bitmap[bmap_idx];
+   pos = pos % bits;
+   bitmap_set(bmap, pos, 1);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * During live migration, a page is always discardable unless its
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+bits == 0 || start_pfn > end_pfn)
+   return -1;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_
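
Given the return convention spelled out in the comment above (-1 for invalid
parameters, 1 while end_pfn is still below max_pfn, 0 once it is not), a
caller would scan guest RAM window by window. A minimal caller sketch
(hypothetical helper name, kernel context assumed; the real consumer is
send_unused_pages_info() in the balloon driver):

static void scan_all_unused_pages(unsigned long *bitmap[], unsigned long bits,
				  unsigned int nr_bmap)
{
	unsigned long start_pfn = 0;
	unsigned long window = (unsigned long)nr_bmap * bits;
	unsigned int i;
	int ret;

	do {
		for (i = 0; i < nr_bmap; i++)
			bitmap_zero(bitmap[i], bits);	/* fresh window */

		ret = get_unused_pages(start_pfn, start_pfn + window,
				       bitmap, bits, nr_bmap);
		if (ret < 0)
			break;				/* invalid parameters */

		/* ... consume/send the bitmaps for this pfn window here ... */

		start_pfn += window;
	} while (ret);		/* ret == 1: end_pfn was still below max_pfn */
}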

[Qemu-devel] [PATCH kernel v4 3/7] mm: add a function to get the max pfn

2016-11-02 Thread Liang Li
Expose the function to get the max pfn, so it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough; if the device driver is built as a module, referring
to max_pfn directly leads to a build failure.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..f47862a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1772,6 +1772,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa..12cc8ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4428,6 +4428,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hotplug, so it's only good
+ * as a hint, e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v4 2/7] virtio-balloon: define new feature bit and head struct

2016-11-02 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, page bitmap length and
start pfn.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..bed6f41 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+   __le64 cmd : 8; /* Distinguish different requests type */
+   __le64 flag: 8; /* Mark status for a specific request type */
+   __le64 id : 16; /* Distinguish requests of a specific type */
+   __le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+   struct {
+   __le64 start_pfn : 52; /* start pfn for the bitmap */
+   __le64 page_shift : 6; /* page shift width, in bits */
+   __le64 bmap_len : 6;  /* bitmap length, in bytes */
+   } head;
+   __le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v4 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-11-02 Thread Liang Li
This patch set contains two parts of changes to the virtio-balloon.
 
One is the change for speeding up the inflating & deflating process.
The main idea of this optimization is to use a bitmap to send the page
information to the host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's unused pages in the first round of data copy to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put the guest's unused page information in a bitmap
and send it to the host with the virt queue of virtio-balloon. For an idle
guest with 8GB RAM, this can help to shorten the total live migration
time from 2s to about 500ms in a 10Gbps network environment.
 
Changes from v3 to v4:
* Use the new scheme suggested by Dave Hansen to encode the bitmap.
* Add code missed in v3 to handle page migration.
* Free the memory for the bitmap in time once the operation is done.
* Address some of the comments in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 546 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c |  95 +++
 4 files changed, 636 insertions(+), 49 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH kernel v4 1/7] virtio-balloon: rework deflate to add page to a list

2016-11-02 Thread Liang Li
When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This way is not
efficient when inflating/deflating a large amount of memory, because it
repeats the following operations too many times:

1. Virtio data transmission
2. Page allocate/free
3. Address translation(GPA->HVA)
4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete; it may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. if there are several pages whose addresses are physically
contiguous in the guest, these pages can be processed in one bulk
operation.

The main idea of the optimization is to reduce the above operations as
much as possible, and it can be achieved by using a bitmap instead of a
PFN array. Compared with a PFN array, for a buffer of a given size, a
bitmap can represent more pages, which is very important for batch
processing.

Using a bitmap instead of PFNs is not very helpful when inflating/deflating
a small amount of pages; in this case, using PFNs is better. But using
a bitmap will not impact the QoS of the guest or host heavily, because
the operation will be completed very soon for a small amount of pages,
and we will use some methods to make sure the efficiency does not drop
too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH qemu v3 6/6] migration: skip free pages during live migration

2016-10-21 Thread Liang Li
After sending out the request for free pages, the live migration
process will start without waiting for the free page bitmap to be
ready. If the free page bitmap is not ready when doing the 1st
migration_bitmap_sync() after ram_save_setup(), the free page
bitmap will be ignored; this means the free pages will not be
filtered out in this case.
The current implementation can not work with postcopy; if postcopy
is enabled, we simply ignore the free pages. This will be made to
work later.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 migration/ram.c | 86 +
 1 file changed, 86 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index bc6154f..00ce97e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -43,6 +43,8 @@
 #include "trace.h"
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
+#include "sysemu/balloon.h"
+#include "sysemu/kvm.h"
 
 #ifdef DEBUG_MIGRATION_RAM
 #define DPRINTF(fmt, ...) \
@@ -228,6 +230,8 @@ static QemuMutex migration_bitmap_mutex;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
 static bool ram_bulk_stage;
+static bool ignore_freepage_rsp;
+static uint64_t free_page_req_id;
 
 /* used by the search for pages to send */
 struct PageSearchStatus {
@@ -244,6 +248,7 @@ static struct BitmapRcu {
 struct rcu_head rcu;
 /* Main migration bitmap */
 unsigned long *bmap;
+unsigned long *free_page_bmap;
 /* bitmap of pages that haven't been sent even once
  * only maintained and used in postcopy at the moment
  * where it's used to send the dirtymap at the start
@@ -636,6 +641,7 @@ static void migration_bitmap_sync(void)
 rcu_read_unlock();
qemu_mutex_unlock(&migration_bitmap_mutex);
 
+ignore_freepage_rsp = true;
 trace_migration_bitmap_sync_end(migration_dirty_pages
 - num_dirty_pages_init);
 num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
@@ -1411,6 +1417,7 @@ static void migration_bitmap_free(struct BitmapRcu *bmap)
 {
 g_free(bmap->bmap);
 g_free(bmap->unsentmap);
+g_free(bmap->free_page_bmap);
 g_free(bmap);
 }
 
@@ -1481,6 +1488,77 @@ void migration_bitmap_extend(ram_addr_t old, ram_addr_t 
new)
 }
 }
 
+static void filter_out_guest_free_page(unsigned long *free_page_bmap,
+   long nbits)
+{
+long i, page_count = 0, len;
+unsigned long *bitmap;
+
+tighten_guest_free_page_bmap(free_page_bmap);
+qemu_mutex_lock(&migration_bitmap_mutex);
+bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
+slow_bitmap_complement(bitmap, free_page_bmap, nbits);
+
+len = (last_ram_offset() >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+for (i = 0; i < len; i++) {
+page_count += hweight_long(bitmap[i]);
+}
+
+migration_dirty_pages = page_count;
+qemu_mutex_unlock(&migration_bitmap_mutex);
+}
+
+static void ram_request_free_page(unsigned long *bmap, unsigned long max_pfn)
+{
+BalloonReqStatus status;
+
+free_page_req_id++;
+status = balloon_get_free_pages(bmap, max_pfn / BITS_PER_BYTE,
+free_page_req_id);
+if (status == REQ_START) {
+ignore_freepage_rsp = false;
+}
+}
+
+static void ram_handle_free_page(void)
+{
+unsigned long nbits, req_id = 0;
+RAMBlock *pc_ram_block;
+BalloonReqStatus status;
+
+status = balloon_free_page_ready(&req_id);
+switch (status) {
+case REQ_DONE:
+if (req_id != free_page_req_id) {
+return;
+}
+rcu_read_lock();
+pc_ram_block = QLIST_FIRST_RCU(&ram_list.blocks);
+nbits = pc_ram_block->used_length >> TARGET_PAGE_BITS;
+filter_out_guest_free_page(migration_bitmap_rcu->free_page_bmap, 
nbits);
+rcu_read_unlock();
+
+qemu_mutex_lock_iothread();
+migration_bitmap_sync();
+qemu_mutex_unlock_iothread();
+/*
+ * bulk stage assumes in (migration_bitmap_find_and_reset_dirty) that
+ * every page is dirty, that's no longer true at this point.
+ */
+ram_bulk_stage = false;
+last_seen_block = NULL;
+last_sent_block = NULL;
+last_offset = 0;
+break;
+case REQ_ERROR:
+ignore_freepage_rsp = true;
+error_report("failed to get free page");
+break;
+default:
+break;
+}
+}
+
 /*
  * 'expected' is the value you expect the bitmap mostly to be full
  * of; it won't bother printing lines that are all this value.
@@ -1946,6 +2024,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 qemu_mutex_unlock_ramlist();
 qemu_mutex_unlock_iothread();
 
+if (balloon_free_pages_support() && !migrate_postcopy_ram()) {
+unsigned long max_pfn = get_guest_max_pfn();
+migration_bitmap_rcu->free_page_bmap = bitmap_new(max_pfn);
+ram_

[Qemu-devel] [PATCH qemu v3 1/6] virtio-balloon: update linux head file

2016-10-21 Thread Liang Li
Update the new feature bit definitions for virtio balloon and
the page bitmap header and request header structs to keep them
consistent with the kernel side.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 include/standard-headers/linux/virtio_balloon.h | 41 +
 1 file changed, 41 insertions(+)

diff --git a/include/standard-headers/linux/virtio_balloon.h 
b/include/standard-headers/linux/virtio_balloon.h
index 9d06ccd..797a868 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,6 +34,8 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ   4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +84,43 @@ struct virtio_balloon_stat {
__virtio64 val;
 } QEMU_PACKED;
 
+/* Page bitmap header structure */
+struct balloon_bmap_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Shift width of page in the bitmap */
+   __virtio16 page_shift;
+   /* flag used to identify different status */
+   __virtio16 flag;
+   /* Reserved */
+   __virtio16 reserved;
+   /* ID of the request */
+   __virtio64 req_id;
+   /* The pfn of 0 bit in the bitmap */
+   __virtio64 start_pfn;
+   /* The length of the bitmap, in bytes */
+   __virtio64 bmap_len;
+};
+
+enum balloon_req_id {
+   /* Get free pages information */
+   BALLOON_GET_FREE_PAGES,
+};
+
+enum balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Reserved */
+   __virtio16 reserved[3];
+   /* Request parameter */
+   __virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH qemu v3 5/6] kvm: Add two new arch specific functions

2016-10-21 Thread Liang Li
Add a new function to get the vm's max pfn and a new function
to filter out the holes in the undressed free page bitmap to get
a tight free page bitmap. They are implemented on X86 and should
be implemented on other arches for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
---
 include/sysemu/kvm.h | 18 ++
 target-arm/kvm.c | 14 ++
 target-i386/kvm.c| 37 +
 target-mips/kvm.c| 14 ++
 target-ppc/kvm.c | 14 ++
 target-s390x/kvm.c   | 14 ++
 6 files changed, 111 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index df67cc0..ef91053 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -238,6 +238,24 @@ int kvm_remove_breakpoint(CPUState *cpu, target_ulong addr,
   target_ulong len, int type);
 void kvm_remove_all_breakpoints(CPUState *cpu);
 int kvm_update_guest_debug(CPUState *cpu, unsigned long reinject_trap);
+
+/**
+ * tighten_guest_free_page_bmap - process the free page bitmap from
+ * guest to get a tight page bitmap which does not contain
+ * holes.
+ * @bmap: undressed guest free page bitmap
+ * Returns: a tight guest free page bitmap, the n th bit in the
+ * returned bitmap and the n th bit in the migration bitmap
+ * should correspond to the same guest RAM page.
+ */
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap);
+
+/**
+ * get_guest_max_pfn - get the max pfn of guest
+ * Returns: the max pfn of guest
+ */
+unsigned long get_guest_max_pfn(void);
+
 #ifndef _WIN32
 int kvm_set_signal_mask(CPUState *cpu, const sigset_t *sigset);
 #endif
diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index c00b94e..785e969 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -638,3 +638,17 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 return (data - 32) & 0x;
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0472f45..32dd627 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -3527,3 +3527,40 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+#define _4G (1ULL << 32)
+
+unsigned long get_guest_max_pfn(void)
+{
+PCMachineState *pcms = PC_MACHINE(current_machine);
+ram_addr_t above_4g_mem = pcms->above_4g_mem_size;
+unsigned long max_pfn;
+
+if (above_4g_mem) {
+max_pfn = (_4G + above_4g_mem) >> TARGET_PAGE_BITS;
+} else {
+max_pfn = pcms->below_4g_mem_size >> TARGET_PAGE_BITS;
+}
+
+return max_pfn;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+PCMachineState *pcms = PC_MACHINE(current_machine);
+ram_addr_t above_4g_mem = pcms->above_4g_mem_size;
+
+if (above_4g_mem) {
+unsigned long *src, *dst, len, pos;
+ram_addr_t below_4g_mem = pcms->below_4g_mem_size;
+src = bmap + (_4G >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+dst = bmap + (below_4g_mem >> TARGET_PAGE_BITS) / BITS_PER_LONG;
+bitmap_move(dst, src, above_4g_mem >> TARGET_PAGE_BITS);
+
+pos = (above_4g_mem + below_4g_mem) >> TARGET_PAGE_BITS;
+len = (_4G - below_4g_mem) >> TARGET_PAGE_BITS;
+bitmap_clear(bmap, pos, len);
+}
+
+return bmap;
+}
diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index dcf5fbb..2feb406 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -1058,3 +1058,17 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 9c4834c..a130d3a 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -2672,3 +2672,17 @@ int kvmppc_enable_hwrng(void)
 
 return kvmppc_enable_hcall(kvm_state, H_RANDOM);
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index 7f74572..b285efb 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -2651,3 +2651,17 @@ void kvm_s390_apply_cpu_model(const S390CPUModel *model, 
Error **errp)
 }
 }
 }
+
+unsigned long get_guest_max_pfn(void)
+{
+/* To be done */
+
+return 0;
+}
+
+unsigned long *tighten_guest_free_page_bmap(unsigned long *bmap)
+{
+/* To be done */
+
+return bmap;
+}
-- 
1.8.3.1




[Qemu-devel] [PATCH qemu v3 4/6] bitmap: Add a new bitmap_move function

2016-10-21 Thread Liang Li
Sometimes it is necessary to move a portion of a bitmap to another
place in a large bitmap. If the source and destination overlap,
bitmap_copy() cannot work correctly, so we need a new function to do
this work.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 include/qemu/bitmap.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 63ea2d0..775d05e 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -37,6 +37,7 @@
  * bitmap_set(dst, pos, nbits) Set specified bit area
  * bitmap_set_atomic(dst, pos, nbits)   Set specified bit area with atomic ops
  * bitmap_clear(dst, pos, nbits)   Clear specified bit area
+ * bitmap_move(dst, src, nbits) Move *src to *dst
  * bitmap_test_and_clear_atomic(dst, pos, nbits)Test and clear area
  * bitmap_find_next_zero_area(buf, len, pos, n, mask)  Find bit free area
  */
@@ -129,6 +130,18 @@ static inline void bitmap_copy(unsigned long *dst, const 
unsigned long *src,
 }
 }
 
+static inline void bitmap_move(unsigned long *dst, const unsigned long *src,
+   long nbits)
+{
+if (small_nbits(nbits)) {
+unsigned long tmp = *src;
+*dst = tmp;
+} else {
+long len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
+memmove(dst, src, len);
+}
+}
+
 static inline int bitmap_and(unsigned long *dst, const unsigned long *src1,
  const unsigned long *src2, long nbits)
 {
-- 
1.8.3.1




[Qemu-devel] [PATCH qemu v3 0/6] Fast (de)inflating & fast live migration

2016-10-21 Thread Liang Li
This patch set intends to do two optimizations: one is to speed up
the (de)inflating process of virtio balloon, and the other is to
speed up the live migration process. We put them together because
both of them require changes to the virtio balloon spec.

The main idea of speeding up the (de)inflating process is to use a
bitmap to send the page information to the host instead of the PFNs,
to reduce the overhead of virtio data transmission, address translation
and madvise(). This can help to improve the performance by about 85%.

The idea of speeding up live migration is to skip processing the
guest's free pages in the first round of data copy, to reduce needless
data processing; this can help to save quite a lot of CPU cycles and
network bandwidth. We get the guest's free page information through the
virt queue of virtio-balloon, and filter out these free pages during
live migration. For an idle 8GB guest, this can help to shorten the
total live migration time from 2s to about 500ms in a 10Gbps
network environment.
 
Changes from v2 to v3:
* Merged two patches for kernel head file updating into one 
* Removed one patch which was unrelated with this feature 
* Removed the patch to migrate the vq elem, use a new way instead

Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Get a struct from vq instead of separate variables.
* Use two separate APIs to request free pages and query the status.
* Changed the virtio balloon interface.
* Addressed some of the comments of v1.

Liang Li (6):
  virtio-balloon: update linux head file
  virtio-balloon: speed up inflating & deflating process
  balloon: get free page info from guest
  bitmap: Add a new bitmap_move function
  kvm: Add two new arch specific functions
  migration: skip free pages during live migration

 balloon.c   |  47 +++-
 hw/virtio/virtio-balloon.c  | 273 ++--
 include/hw/virtio/virtio-balloon.h  |  18 +-
 include/qemu/bitmap.h   |  13 ++
 include/standard-headers/linux/virtio_balloon.h |  41 
 include/sysemu/balloon.h|  18 +-
 include/sysemu/kvm.h|  18 ++
 migration/ram.c |  86 
 target-arm/kvm.c|  14 ++
 target-i386/kvm.c   |  37 
 target-mips/kvm.c   |  14 ++
 target-ppc/kvm.c|  14 ++
 target-s390x/kvm.c  |  14 ++
 13 files changed, 581 insertions(+), 26 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH qemu v3 3/6] balloon: get free page info from guest

2016-10-21 Thread Liang Li
Add a new feature to get the free page information from the guest;
the free page information is saved in a bitmap. Please note that
'free page' means the page was free at some point after the host set
the request ID and before it received the response with the same ID.

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 balloon.c  |  47 +-
 hw/virtio/virtio-balloon.c | 129 -
 include/hw/virtio/virtio-balloon.h |  18 +-
 include/sysemu/balloon.h   |  18 +-
 4 files changed, 207 insertions(+), 5 deletions(-)

diff --git a/balloon.c b/balloon.c
index f2ef50c..d6a3791 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,8 @@
 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonGetFreePage *balloon_get_free_page_fn;
+static QEMUBalloonFreePageReady *balloon_free_page_ready_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
 
@@ -65,9 +67,13 @@ static bool have_balloon(Error **errp)
 }
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonGetFreePage *get_free_page_func,
+ QEMUBalloonFreePageReady *free_page_ready_func,
+ void *opaque)
 {
-if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+if (balloon_event_fn || balloon_stat_fn || balloon_get_free_page_fn
+|| balloon_free_page_ready_fn || balloon_opaque) {
 /* We're already registered one balloon handler.  How many can
  * a guest really have?
  */
@@ -75,6 +81,8 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
 }
 balloon_event_fn = event_func;
 balloon_stat_fn = stat_func;
+balloon_get_free_page_fn = get_free_page_func;
+balloon_free_page_ready_fn = free_page_ready_func;
 balloon_opaque = opaque;
 return 0;
 }
@@ -86,6 +94,8 @@ void qemu_remove_balloon_handler(void *opaque)
 }
 balloon_event_fn = NULL;
 balloon_stat_fn = NULL;
+balloon_get_free_page_fn = NULL;
+balloon_free_page_ready_fn = NULL;
 balloon_opaque = NULL;
 }
 
@@ -116,3 +126,36 @@ void qmp_balloon(int64_t target, Error **errp)
 trace_balloon_event(balloon_opaque, target);
 balloon_event_fn(balloon_opaque, target);
 }
+
+bool balloon_free_pages_support(void)
+{
+return balloon_get_free_page_fn ? true : false;
+}
+
+BalloonReqStatus balloon_get_free_pages(unsigned long *bitmap,
+unsigned long len,
+unsigned long req_id)
+{
+if (!balloon_get_free_page_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!bitmap) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_get_free_page_fn(balloon_opaque, bitmap, len, req_id);
+}
+
+BalloonReqStatus balloon_free_page_ready(unsigned long *req_id)
+{
+if (!balloon_free_page_ready_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!req_id) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_free_page_ready_fn(balloon_opaque, req_id);
+}
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 3fa80a4..a003033 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -150,6 +150,13 @@ static bool balloon_page_bitmap_supported(const 
VirtIOBalloon *s)
 return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 }
 
+static bool balloon_misc_vq_supported(const VirtIOBalloon *s)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
+return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ);
+}
+
 static bool balloon_stats_enabled(const VirtIOBalloon *s)
 {
 return s->stats_poll_interval > 0;
@@ -399,6 +406,52 @@ out:
 }
 }
 
+static void virtio_balloon_handle_resp(VirtIODevice *vdev, VirtQueue *vq)
+{
+VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+VirtQueueElement *elem;
+size_t offset = 0;
+struct balloon_bmap_hdr hdr;
+
+elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+if (!elem) {
+s->req_status = REQ_ERROR;
+return;
+}
+
+s->misc_vq_elem = elem;
+if (!elem->out_num) {
+return;
+}
+
+iov_to_buf(elem->out_sg, elem->out_num, offset,
+   &hdr, sizeof(hdr));
+offset += sizeof(hdr);
+
+switch (hdr.cmd) {
+case BALLOON_GET_FREE_PAGES:
+if (hdr.req_id == s->misc_req.param) {
+if (s->bmap_len < hdr.start_pfn / BITS_PER_BYTE + hdr.bmap_len) {
+ hdr.bmap_len = s->bmap_len - hdr.start_pfn / BITS_PER_BYTE;
+}
+
+iov_to_buf(elem->out_sg, elem->out_num, offset,
+   s->free_page_bmap + hdr.start_pfn / BITS_PER_LONG,
+   hdr.bmap_len);
+if (hdr

[Qemu-devel] [PATCH qemu v3 2/6] virtio-balloon: speed up inflating & deflating process

2016-10-21 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest is:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() with a bulk of
RAM pages, instead of the current page-per-page way, so the overhead
of stage c and stage d can also be reduced a lot.

This patch is the QEMU side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
---
 hw/virtio/virtio-balloon.c | 144 ++---
 1 file changed, 123 insertions(+), 21 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 1d77028..3fa80a4 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -52,6 +52,77 @@ static const char *balloon_stat_names[] = {
[VIRTIO_BALLOON_S_NR] = NULL
 };
 
+static void do_balloon_bulk_pages(ram_addr_t base_pfn, uint16_t page_shift,
+  unsigned long len, bool deflate)
+{
+ram_addr_t size, processed, chunk, base;
+MemoryRegionSection section = {.mr = NULL};
+
+size = len << page_shift;
+base = base_pfn << page_shift;
+
+for (processed = 0; processed < size; processed += chunk) {
+chunk = size - processed;
+while (chunk >= TARGET_PAGE_SIZE) {
+section = memory_region_find(get_system_memory(),
+ base + processed, chunk);
+if (!section.mr) {
+chunk = QEMU_ALIGN_DOWN(chunk / 2, TARGET_PAGE_SIZE);
+} else {
+break;
+}
+}
+
+if (section.mr &&
+(int128_nz(section.size) && memory_region_is_ram(section.mr))) {
+void *addr = section.offset_within_region +
+   memory_region_get_ram_ptr(section.mr);
+qemu_madvise(addr, chunk,
+ deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+} else {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid guest RAM range [0x%lx, 0x%lx]\n",
+  base + processed, chunk);
+chunk = TARGET_PAGE_SIZE;
+}
+}
+}
+
+static void balloon_bulk_pages(struct balloon_bmap_hdr *hdr,
+   unsigned long *bitmap, bool deflate)
+{
+ram_addr_t base_pfn = hdr->start_pfn;
+uint16_t page_shift = hdr->page_shift;
+unsigned long len = hdr->bmap_len;
+unsigned long current = 0, end = len * BITS_PER_BYTE;
+
+if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
+ kvm_has_sync_mmu())) {
+while (current < end) {
+unsigned long one = find_next_bit(bitmap, end, current);
+
+if (one < end) {
+unsigned long pages, zero;
+
+zero = find_next_zero_bit(bitmap, end, one + 1);
+if (zero >= end) {
+pages = end - one;
+} else {
+pages = zero - one;
+}
+
+if (pages) {
+do_balloon_bulk_pages(base_pfn + one, page_shift,
+  pages, deflate);
+}
+current = one + pages;
+} else {
+current = one;
+}
+}
+}
+}
+
 /*
  * reset_stats - Mark all items in the stats array as unset
  *
@@ -72,6 +143,13 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
 return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
 }
 
+static bool balloon_page_bitmap_supported(const VirtIOBalloon *s)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
+return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+}
+
 static bool balloon_stats_enabled(const VirtIOBalloon *s)
 {
 return s->stats_poll_interval > 0;
@@ -213,32 +291,54 @@ static void virtio_balloon_handle_output(VirtIODevice 
*vdev, VirtQueue *vq)
 for (;;) {
 size_t offset = 0;
 uint32_t pfn;
+
 elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
 if (!elem) {

[Qemu-devel] [RESEND PATCH v3 kernel 5/7] mm: add the related functions to get unused page

2016-10-21 Thread Liang Li
Save the unused page info into a page bitmap. The virtio balloon
driver calls this new API to get the unused page bitmap and sends
the bitmap to the hypervisor (QEMU) to speed up live migration.
While the bitmap is being sent, some of the pages may be modified
and are no longer free; this inaccuracy can be corrected by the
dirty page logging mechanism.
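
A minimal sketch, not from this patch, of how the hypervisor side can
combine the reported bitmap with dirty logging: a page may only be
skipped if it was reported unused and has not been dirtied since the
report. The helper and bitmap names below are assumptions; test_bit()
is the usual bitmap accessor:

    /* Sketch only: skip a page iff it was reported unused and is still clean. */
    static bool can_skip_page(unsigned long pfn,
                              const unsigned long *unused_bmap,
                              const unsigned long *dirty_bmap)
    {
        if (!test_bit(pfn, unused_bmap))    /* not reported as unused */
            return false;
        return !test_bit(pfn, dirty_bmap);  /* dirtied since the report? */
    }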

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c| 84 ++
 2 files changed, 86 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a89da0e..84f56ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1777,6 +1777,8 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5f63a9..848bb85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,90 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits,
+   unsigned int nr_bmap)
+{
+   unsigned long pfn, flags, nr_pg, pos, *bmap;
+   unsigned int order, i, t, bmap_idx;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+
+   end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+   spin_lock_irqsave(&zone->lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, &zone->free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn < start_pfn || pfn >= end_pfn)
+   continue;
+   nr_pg = 1UL << order;
+   if (pfn + nr_pg > end_pfn)
+   nr_pg = end_pfn - pfn;
+   bmap_idx = (pfn - start_pfn) / bits;
+   if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+   bmap = bitmap[bmap_idx];
+   pos = (pfn - start_pfn) % bits;
+   bitmap_set(bmap, pos, nr_pg);
+   } else
+   for (i = 0; i < nr_pg; i++) {
+   bmap_idx = pos / bits;
+   bmap = bitmap[bmap_idx];
+   pos = pos % bits;
+   bitmap_set(bmap, pos, 1);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * During live migration, page is always discardable unless it's
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+bits == 0 || start_pfn > end_pfn)
+   return -1;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_populated_zone(zone)
+   mark_unused_pages_bitmap(zone, start_pfn, end_pfn, bitmap,
+

[Qemu-devel] [RESEND PATCH v3 kernel 6/7] virtio-balloon: define feature bit and head for misc virt queue

2016-10-21 Thread Liang Li
Define a new feature bit which supports a new virtual queue. This
new virtual queue is for information exchange between the hypervisor
and the guest. The VMM hypervisor can make use of this virtual queue
to request that the guest perform some operations, e.g. drop the page
cache, synchronize the file system, etc. And the VMM hypervisor can
get some of the guest's runtime information through this virtual
queue, e.g. the guest's unused page information, which can be used
for live migration optimization.
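
As a minimal sketch (not part of this patch) of how the guest might act
on such a request once it has been read into the driver's request
header: the function name handle_misc_request() is an assumption, the
command and fields are the ones defined below, and the actual response
path is send_unused_pages_info() from patch 7/7:

    /* Sketch only: interpret a host request received on the misc queue. */
    static void handle_misc_request(struct virtio_balloon *vb)
    {
        u16 cmd    = virtio16_to_cpu(vb->vdev, vb->req_hdr.cmd);
        u64 req_id = virtio64_to_cpu(vb->vdev, vb->req_hdr.param);

        switch (cmd) {
        case BALLOON_GET_UNUSED_PAGES:
            /* respond with a balloon_bmap_hdr + page bitmaps (patch 7/7) */
            send_unused_pages_info(vb, req_id);
            break;
        default:
            break;  /* unknown commands are ignored */
        }
    }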

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d3b182a..3a9d633 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM   2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ   4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct balloon_bmap_hdr {
__virtio64 bmap_len;
 };
 
+enum balloon_req_id {
+   /* Get unused pages information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Reserved */
+   __virtio16 reserved[3];
+   /* Request parameter */
+   __virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [RESEND PATCH v3 kernel 4/7] virtio-balloon: speed up inflate/deflate process

2016-10-21 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest is:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If a bitmap is used to send the page info instead of the PFNs, the
overhead of stage b can be reduced quite a lot. Furthermore, the
address translation and the madvise() call can be done on a bulk of
RAM pages instead of page by page, so the overhead of stage c and
stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 233 +++-
 1 file changed, 209 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c31839c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,13 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer of the bitmap header. */
+   void *bmap_hdr;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,16 +121,66 @@ static void balloon_ack(struct virtqueue *vq)
 wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   if (balloon_pfn < vb->min_pfn)
+   vb->min_pfn = balloon_pfn;
+   if (balloon_pfn > vb->max_pfn)
+   vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
-   struct scatterlist sg;
-   unsigned int len;
+   struct scatterlist sg, sg2[BALLOON_BMAP_COUNT + 1];
+   unsigned int len, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   unsigned long bmap_len;
+   int nr_pfn, nr_used_bmap, nr_buf;
+
+   nr_pfn = vb->end_pfn - vb->start_pfn + 1;
+   nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
+   nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   nr_buf = nr_used_bmap + 1;
+
+   /* cmd, reserved and req_id are init to 0, unused here */
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   sg_init_table(sg2, nr_buf);
+   sg_set_buf(&sg2[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < nr_used_bmap; i++) {
+   unsigned int  buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == nr_used_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg2[i + 1], vb->page_bitmap[i], buf_len);
+   }
 
-   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+   while (vq->n

[Qemu-devel] [RESEND PATCH v3 kernel 7/7] virtio-balloon: tell host vm's unused page info

2016-10-21 Thread Liang Li
Support the request for the VM's unused page information and respond
with a page bitmap. QEMU can make use of this bitmap and the dirty
page logging mechanism to skip the transfer of these unused pages;
this is very helpful for speeding up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 143 +---
 1 file changed, 134 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c31839c..f10bb8b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *misc_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -78,6 +78,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -423,6 +425,78 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in, sg_out[BALLOON_BMAP_COUNT + 1];
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->misc_vq;
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   int ret = 1, nr_buf, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(&vb->balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_UNUSED_PAGES);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+BALLOON_FLAG_DONE);
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+   BALLOON_FLAG_CONT);
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   nr_buf = used_nr_bmap + 1;
+   sg_init_table(sg_out, nr_buf);
+   sg_set_buf(&sg_out[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg_out[i + 1], vb->page_bitmap[i], buf_len);
+   }
+
+   while (vq->num_free < nr_buf)
+   msleep(2);
+   if (virtqueue_add_outbuf(vq, sg_out, nr_buf, vb,
+GFP_KERNEL) == 0) {
+   virtqueue_kick(vq);
+   while (!virtqueue_get_buf(vq, &len)
+   && !virtqueue_is_broken(vq))
+   cpu_relax();
+   }
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(&vb->balloon_lock);
+   sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+   vir

[Qemu-devel] [RESEND PATCH v3 kernel 3/7] mm: add a function to get the max pfn

2016-10-21 Thread Liang Li
Expose a function to get the max pfn, so it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough; if the device driver is built as a module, directly
referring to max_pfn leads to a build failure.
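
In other words, modules are expected to go through the accessor rather
than the symbol itself; a one-line sketch of the intended call site
(the balloon driver patch later in this series uses it the same way):

    /* From a module, max_pfn is not exported, so call the accessor: */
    unsigned long last_pfn = get_max_pfn();  /* upper bound for the bitmaps */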

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffbd729..2a89da0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1776,6 +1776,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b3bf67..e5f63a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4426,6 +4426,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1




[Qemu-devel] [RESEND PATCH v3 kernel 2/7] virtio-balloon: define new feature bit and page bitmap head

2016-10-21 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, page bitmap length and
start pfn.
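
For reference, a minimal sketch (not part of the patch) of how a
receiver can turn a set bit in the bitmap back into a guest physical
address using these header fields; the helper name is an assumption,
and start_pfn/page_shift are the header values after virtio byte-order
conversion:

    /* Sketch only: map bit index 'bit' back to a guest physical address. */
    static uint64_t bmap_bit_to_gpa(uint64_t start_pfn, uint16_t page_shift,
                                    unsigned long bit)
    {
        return (start_pfn + bit) << page_shift;
    }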

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..d3b182a 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST   0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM   2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Page bitmap header structure */
+struct balloon_bmap_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Shift width of page in the bitmap */
+   __virtio16 page_shift;
+   /* flag used to identify different status */
+   __virtio16 flag;
+   /* Reserved */
+   __virtio16 reserved;
+   /* ID of the request */
+   __virtio64 req_id;
+   /* The pfn of 0 bit in the bitmap */
+   __virtio64 start_pfn;
+   /* The length of the bitmap, in bytes */
+   __virtio64 bmap_len;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-21 Thread Liang Li
This patch set contains two sets of changes to the virtio-balloon.

One is the change for speeding up the inflating & deflating process;
the main idea of this optimization is to use a bitmap to send the page
information to the host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's free pages in the first round of data copy, we reduce needless
data processing, which helps to save quite a lot of CPU cycles and
network bandwidth. We put the guest's free page information in a bitmap
and send it to the host with a virt queue of the virtio-balloon. For an
idle 8GB guest, this can help to shorten the total live migration time
from 2s to about 500ms in a 10Gbps network environment.

Dave Hansen suggested a new scheme to encode the data structure;
because of the additional complexity, it's not implemented in v3.
 
Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap.
* Fix overwriting the page bitmap after kicking.
* Some of MST's comments for v2.
 
Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap
* Address the issues referred in MST's comments

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and page bitmap head
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define feature bit and head for misc virt queue
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 390 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 
 mm/page_alloc.c |  94 +
 4 files changed, 485 insertions(+), 43 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [RESEND PATCH v3 kernel 1/7] virtio-balloon: rework deflate to add page to a list

2016-10-21 Thread Liang Li
Will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
 struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
 mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH repost] virtio-balloon: Remove needless precompiled directive

2016-08-08 Thread Liang Li
Since there is a wrapper around madvise(), the virtio-balloon
code is able to work without the precompiled directive, so the
directive can be removed.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Thomas Huth <th...@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
---
 hw/virtio/virtio-balloon.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 5af429a..61325f2 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -34,13 +34,11 @@
 
 static void balloon_page(void *addr, int deflate)
 {
-#if defined(__linux__)
 if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
  kvm_has_sync_mmu())) {
 qemu_madvise(addr, BALLOON_PAGE_SIZE,
 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
 }
-#endif
 }
 
 static const char *balloon_stat_names[] = {
-- 
1.9.1




[Qemu-devel] [PATCH] migration: fix live migration failure with compression

2016-08-08 Thread Liang Li
Commit 11808bb0c422 removed some condition checks on
'f->ops->writev_buffer', so 'qemu_put_qemu_file' should be enhanced
to clear 'f_src->iovcnt'; otherwise 'f_src->iovcnt' may exceed
MAX_IOV_SIZE, which will break live migration. This should be fixed.

Signed-off-by: Liang Li <liang.z...@intel.com>
Reported-by: Jinshi Zhang <jinshi.c.zh...@intel.com>
---
 migration/qemu-file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index bbc565e..e9fae31 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -668,6 +668,7 @@ int qemu_put_qemu_file(QEMUFile *f_des, QEMUFile *f_src)
 len = f_src->buf_index;
 qemu_put_buffer(f_des, f_src->buf, f_src->buf_index);
 f_src->buf_index = 0;
+f_src->iovcnt = 0;
 }
 return len;
 }
-- 
1.9.1




[Qemu-devel] [PATCH v3 kernel 5/7] mm: add the related functions to get unused page

2016-08-08 Thread Liang Li
Save the unused page info into a page bitmap. The virtio balloon
driver calls this new API to get the unused page bitmap and sends
the bitmap to the hypervisor (QEMU) to speed up live migration.
While the bitmap is being sent, some of the pages may be modified
and are no longer free; this inaccuracy can be corrected by the
dirty page logging mechanism.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c| 84 ++
 2 files changed, 86 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5873057..d181864 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1789,6 +1789,8 @@ extern void free_area_init_node(int nid, unsigned long * 
zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3373704..1b5419d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4401,6 +4401,90 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits,
+   unsigned int nr_bmap)
+{
+   unsigned long pfn, flags, nr_pg, pos, *bmap;
+   unsigned int order, i, t, bmap_idx;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+
+   end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+   spin_lock_irqsave(&zone->lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, &zone->free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn < start_pfn || pfn >= end_pfn)
+   continue;
+   nr_pg = 1UL << order;
+   if (pfn + nr_pg > end_pfn)
+   nr_pg = end_pfn - pfn;
+   bmap_idx = (pfn - start_pfn) / bits;
+   if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+   bmap = bitmap[bmap_idx];
+   pos = (pfn - start_pfn) % bits;
+   bitmap_set(bmap, pos, nr_pg);
+   } else
+   for (i = 0; i < nr_pg; i++) {
+   bmap_idx = pos / bits;
+   bmap = bitmap[bmap_idx];
+   pos = pos % bits;
+   bitmap_set(bmap, pos, 1);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * During live migration, page is always discardable unless it's
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+bits == 0 || start_pfn > end_pfn)
+   return -1;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_populated_zone(zone)
+   mark_unused_pages_bitmap(zone, start_pfn, 

[Qemu-devel] [PATCH v3 kernel 3/7] mm: add a function to get the max pfn

2016-08-08 Thread Liang Li
Expose a function to get the max pfn, so it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough; if the device driver is built as a module, directly
referring to max_pfn leads to a build failure.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 08ed53e..5873057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1788,6 +1788,7 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb975ce..3373704 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4391,6 +4391,16 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 kernel 7/7] virtio-balloon: tell host vm's unused page info

2016-08-08 Thread Liang Li
Support the request for the VM's unused page information and respond
with a page bitmap. QEMU can make use of this bitmap and the dirty
page logging mechanism to skip the transfer of these unused pages;
this is very helpful for speeding up the live migration process.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 143 +---
 1 file changed, 134 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c31839c..f10bb8b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@ static struct vfsmount *balloon_mnt;
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *misc_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -78,6 +78,8 @@ struct virtio_balloon {
unsigned int nr_page_bmap;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -423,6 +425,78 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in, sg_out[BALLOON_BMAP_COUNT + 1];
+   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+   struct virtqueue *vq = vb->misc_vq;
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   int ret = 1, nr_buf, used_nr_bmap = 0, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+   vb->nr_page_bmap == 1)
+   extend_page_bitmap(vb);
+
+   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+   mutex_lock(&vb->balloon_lock);
+   last_pfn = get_max_pfn();
+
+   while (ret) {
+   clear_page_bitmap(vb);
+   ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+PFNS_PER_BMAP, vb->nr_page_bmap);
+   if (ret < 0)
+   break;
+   hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_UNUSED_PAGES);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
+   bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+   if (!ret) {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+BALLOON_FLAG_DONE);
+   nr_pfn = last_pfn - pfn;
+   used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+   if (nr_pfn % PFNS_PER_BMAP)
+   used_nr_bmap++;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   } else {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+   BALLOON_FLAG_CONT);
+   used_nr_bmap = vb->nr_page_bmap;
+   }
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   nr_buf = used_nr_bmap + 1;
+   sg_init_table(sg_out, nr_buf);
+   sg_set_buf(&sg_out[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < used_nr_bmap; i++) {
+   unsigned int buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == used_nr_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg_out[i + 1], vb->page_bitmap[i], buf_len);
+   }
+
+   while (vq->num_free < nr_buf)
+   msleep(2);
+   if (virtqueue_add_outbuf(vq, sg_out, nr_buf, vb,
+GFP_KERNEL) == 0) {
+   virtqueue_kick(vq);
+   while (!virtqueue_get_buf(vq, &len)
+   && !virtqueue_is_broken(vq))
+   cpu_relax();
+   }
+   pfn += pfn_limit;
+   }
+
+   mutex_unlock(&vb->balloon_lock);
+   sg_init_one(_in, >req_hdr, sizeof(vb->req

[Qemu-devel] [PATCH v3 kernel 2/7] virtio-balloon: define new feature bit and page bitmap head

2016-08-08 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, page bitmap length and
start pfn.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..d3b182a 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST   0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM   2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Page bitmap header structure */
+struct balloon_bmap_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Shift width of page in the bitmap */
+   __virtio16 page_shift;
+   /* flag used to identify different status */
+   __virtio16 flag;
+   /* Reserved */
+   __virtio16 reserved;
+   /* ID of the request */
+   __virtio64 req_id;
+   /* The pfn of 0 bit in the bitmap */
+   __virtio64 start_pfn;
+   /* The length of the bitmap, in bytes */
+   __virtio64 bmap_len;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 kernel 6/7] virtio-balloon: define feature bit and head for misc virt queue

2016-08-08 Thread Liang Li
Define a new feature bit which supports a new virtual queue. This
new virtual queue is for information exchange between the hypervisor
and the guest. The VMM hypervisor can make use of this virtual queue
to request that the guest perform some operations, e.g. drop the page
cache, synchronize the file system, etc. And the VMM hypervisor can
get some of the guest's runtime information through this virtual
queue, e.g. the guest's unused page information, which can be used
for live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d3b182a..3a9d633 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM   2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ   4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct balloon_bmap_hdr {
__virtio64 bmap_len;
 };
 
+enum balloon_req_id {
+   /* Get unused pages information */
+   BALLOON_GET_UNUSED_PAGES,
+};
+
+enum balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Reserved */
+   __virtio16 reserved[3];
+   /* Request parameter */
+   __virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 kernel 1/7] virtio-balloon: rework deflate to add page to a list

2016-08-08 Thread Liang Li
Will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
 struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
 mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.8.3.1




[Qemu-devel] [PATCH v3 kernel 4/7] virtio-balloon: speed up inflate/deflate process

2016-08-08 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient; the time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest is:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If a bitmap is used to send the page info instead of the PFNs, the
overhead of stage b can be reduced quite a lot. Furthermore, the
address translation and the madvise() call can be done on a bulk of
RAM pages instead of page by page, so the overhead of stage c and
stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
Cc: Dave Hansen <dave.han...@intel.com>
---
 drivers/virtio/virtio_balloon.c | 233 +++-
 1 file changed, 209 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c31839c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE  (8 * PAGE_SIZE)
+#define PFNS_PER_BMAP  (BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT 32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,13 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer of the bitmap header. */
+   void *bmap_hdr;
+   /* Bitmap and bitmap count used to tell the host the pages */
+   unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+   unsigned int nr_page_bmap;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -110,16 +121,66 @@ static void balloon_ack(struct virtqueue *vq)
 wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   if (balloon_pfn < vb->min_pfn)
+   vb->min_pfn = balloon_pfn;
+   if (balloon_pfn > vb->max_pfn)
+   vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
-   struct scatterlist sg;
-   unsigned int len;
+   struct scatterlist sg, sg2[BALLOON_BMAP_COUNT + 1];
+   unsigned int len, i;
+
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   unsigned long bmap_len;
+   int nr_pfn, nr_used_bmap, nr_buf;
+
+   nr_pfn = vb->end_pfn - vb->start_pfn + 1;
+   nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
+   nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
+   bmap_len = nr_pfn / BITS_PER_BYTE;
+   nr_buf = nr_used_bmap + 1;
+
+   /* cmd, reserved and req_id are init to 0, unused here */
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   sg_init_table(sg2, nr_buf);
+   sg_set_buf(&sg2[0], hdr, sizeof(struct balloon_bmap_hdr));
+   for (i = 0; i < nr_used_bmap; i++) {
+   unsigned int  buf_len = BALLOON_BMAP_SIZE;
+
+   if (i + 1 == nr_used_bmap)
+   buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
+   sg_set_buf(&sg2[i + 1], vb->page_bitmap[i], buf_len);
+   }
 
-   sg_init_one(, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pf

[Qemu-devel] [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-08 Thread Liang Li
This patch set contains two sets of changes to the virtio-balloon.

One is the change for speeding up the inflating & deflating process;
the main idea of this optimization is to use a bitmap to send the page
information to the host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping the
guest's free pages in the first round of data copy, we reduce needless
data processing, which helps to save quite a lot of CPU cycles and
network bandwidth. We put the guest's free page information in a bitmap
and send it to the host with a virt queue of the virtio-balloon. For an
idle 8GB guest, this can help to shorten the total live migration time
from 2s to about 500ms in a 10Gbps network environment.

Dave Hansen suggested a new scheme to encode the data structure;
because of the additional complexity, it's not implemented in v3.

Changes from v2 to v3:
* Change the name of 'free page' to 'unused page'.
* Use the scatter & gather bitmap instead of a 1MB page bitmap. 
* Fix overwriting the page bitmap after kicking. 
* Some of MST's comments for v2. 

Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap 
* Address the issues referred in MST's comments


Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and page bitmap head
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define feature bit and head for misc virt queue
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c | 390 
 include/linux/mm.h  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 
 mm/page_alloc.c |  94 +
 4 files changed, 485 insertions(+), 43 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-26 Thread Liang Li
Save the free page info into a page bitmap; it will be used in the
virtio balloon device driver.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 mm/page_alloc.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7da61ad..3ad8b10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
+   unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
+{
+   unsigned long pfn, flags, page_num;
+   unsigned int order, t;
+   struct list_head *curr;
+
+   if (zone_is_empty(zone))
+   return;
+   end_pfn = min(start_pfn + len, end_pfn);
+   spin_lock_irqsave(&zone->lock, flags);
+
+   for_each_migratetype_order(order, t) {
+   list_for_each(curr, &zone->free_area[order].free_list[t]) {
+   pfn = page_to_pfn(list_entry(curr, struct page, lru));
+   if (pfn >= start_pfn && pfn <= end_pfn) {
+   page_num = 1UL << order;
+   if (pfn + page_num > end_pfn)
+   page_num = end_pfn - pfn;
+   bitmap_set(bitmap, pfn - start_pfn, page_num);
+   }
+   }
+   }
+
+   spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap, unsigned long len)
+{
+   struct zone *zone;
+   int ret = 0;
+
+   if (bitmap == NULL || start_pfn > end_pfn || start_pfn >= max_pfn)
+   return 0;
+   if (end_pfn < max_pfn)
+   ret = 1;
+   if (end_pfn >= max_pfn)
+   ret = 0;
+
+   for_each_populated_zone(zone)
+   mark_free_pages_bitmap(zone, start_pfn, end_pfn, bitmap, len);
+   return ret;
+}
+EXPORT_SYMBOL(get_free_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.9.1




[Qemu-devel] [PATCH v2 repost 5/7] virtio-balloon: define feature bit and head for misc virt queue

2016-07-26 Thread Liang Li
Define a new feature bit which supports a new virtual queue. This
new virtual queue is for information exchange between the hypervisor
and the guest. The VMM hypervisor can make use of this virtual queue
to request that the guest perform some operations, e.g. drop the page
cache, synchronize the file system, etc. And the VMM hypervisor can
get some of the guest's runtime information through this virtual
queue, e.g. the guest's free page information, which can be used for
live migration optimization.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index d3b182a..be4880f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM   2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ   4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct balloon_bmap_hdr {
__virtio64 bmap_len;
 };
 
+enum balloon_req_id {
+   /* Get free pages information */
+   BALLOON_GET_FREE_PAGES,
+};
+
+enum balloon_flag {
+   /* Have more data for a request */
+   BALLOON_FLAG_CONT,
+   /* No more data for a request */
+   BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Reserved */
+   __virtio16 reserved[3];
+   /* Request parameter */
+   __virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1




[Qemu-devel] [PATCH v2 repost 1/7] virtio-balloon: rework deflate to add page to a list

2016-07-26 Thread Liang Li
will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z...@intel.com>
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3..8d649a2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -98,12 +98,6 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-   BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-   return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -176,18 +170,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, 
size_t num)
return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+struct list_head *pages)
 {
-   unsigned int i;
-   struct page *page;
+   struct page *page, *next;
 
-   /* Find pfns pointing at start of each page, get pages and free them. */
-   for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-   page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-  vb->pfns[i]));
+   list_for_each_entry_safe(page, next, pages, lru) {
if (!virtio_has_feature(vb->vdev,
VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
adjust_managed_page_count(page, 1);
+   list_del(&page->lru);
put_page(page); /* balloon reference */
}
 }
@@ -197,6 +189,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
unsigned num_freed_pages;
struct page *page;
struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+   LIST_HEAD(pages);
 
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +201,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
if (!page)
break;
set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+   list_add(&page->lru, &pages);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
 
@@ -219,7 +213,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, 
size_t num)
 */
if (vb->num_pfns != 0)
tell_host(vb, vb->deflate_vq);
-   release_pages_balloon(vb);
+   release_pages_balloon(vb, &pages);
	mutex_unlock(&vb->balloon_lock);
return num_freed_pages;
 }
-- 
1.9.1




[Qemu-devel] [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info

2016-07-26 Thread Liang Li
Support the request for the VM's free page information and respond
with a page bitmap. QEMU can make use of this free page bitmap to
speed up the live migration process by skipping the processing of
free pages.
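
As a hedged sketch of what the QEMU side could do with such a bitmap
(the helper below is illustrative, not from this series; test_bit and
clear_bit are assumed from qemu/bitops.h): pages the guest reports as
free need not be sent in the first pass, e.g. by clearing their bits
in the migration bitmap.

/* Illustrative only: drop guest-reported free pages from the set of
 * pages to be sent, assuming both bitmaps are indexed by guest pfn. */
static void skip_free_pages(unsigned long *migration_bmap,
                            const unsigned long *free_page_bmap,
                            unsigned long nr_pages)
{
    unsigned long pfn;

    for (pfn = 0; pfn < nr_pages; pfn++) {
        if (test_bit(pfn, free_page_bmap)) {
            clear_bit(pfn, migration_bmap);
        }
    }
}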

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 104 +---
 1 file changed, 98 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 2d18ff6..5ca4ad3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -62,10 +62,13 @@ module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
 extern unsigned long get_max_pfn(void);
+extern int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
+   unsigned long *bitmap, unsigned long len);
+
 
 struct virtio_balloon {
struct virtio_device *vdev;
-   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *misc_vq;
 
/* The balloon servicing is delegated to a freezable workqueue. */
struct work_struct update_balloon_stats_work;
@@ -89,6 +92,8 @@ struct virtio_balloon {
unsigned long pfn_limit;
/* Used to record the processed pfn range */
unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+   /* Request header */
+   struct balloon_req_hdr req_hdr;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -373,6 +378,49 @@ static void update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(available));
 }
 
+static void update_free_pages_stats(struct virtio_balloon *vb,
+   unsigned long req_id)
+{
+   struct scatterlist sg_in, sg_out;
+   unsigned long pfn = 0, bmap_len, max_pfn;
+   struct virtqueue *vq = vb->misc_vq;
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   int ret = 1;
+
+   max_pfn = get_max_pfn();
+   mutex_lock(&vb->balloon_lock);
+   while (pfn < max_pfn) {
+   memset(vb->page_bitmap, 0, vb->bmap_len);
+   ret = get_free_pages(pfn, pfn + vb->pfn_limit,
+   vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
+   hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_FREE_PAGES);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
+   bmap_len = vb->pfn_limit / BITS_PER_BYTE;
+   if (!ret) {
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+   BALLOON_FLAG_DONE);
+   if (pfn + vb->pfn_limit > max_pfn)
+   bmap_len = (max_pfn - pfn) / BITS_PER_BYTE;
+   } else
+   hdr->flag = cpu_to_virtio16(vb->vdev,
+   BALLOON_FLAG_CONT);
+   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+   sg_init_one(&sg_out, hdr,
+sizeof(struct balloon_bmap_hdr) + bmap_len);
+
+   virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
+   virtqueue_kick(vq);
+   pfn += vb->pfn_limit;
+   }
+
+   sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+   virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+   virtqueue_kick(vq);
+   mutex_unlock(&vb->balloon_lock);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the 
hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -511,18 +559,49 @@ static void update_balloon_size_func(struct work_struct 
*work)
queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+   struct balloon_req_hdr *ptr_hdr;
+   unsigned int len;
+
+   ptr_hdr = virtqueue_get_buf(vb->misc_vq, &len);
+   if (!ptr_hdr || len != sizeof(vb->req_hdr))
+   return;
+
+   switch (ptr_hdr->cmd) {
+   case BALLOON_GET_FREE_PAGES:
+   update_free_pages_stats(vb, ptr_hdr->param);
+   break;
+   default:
+   break;
+   }
+}
+
+static void misc_request(struct virtqueue *vq)
+{
+   struct virtio_balloon *vb = vq

[Qemu-devel] [PATCH v2 repost 3/7] mm: add a function to get the max pfn

2016-07-26 Thread Liang Li
Expose the function to get the max pfn, so it can be used in the
virtio-balloon device driver.

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 mm/page_alloc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b3e134..7da61ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4517,6 +4517,12 @@ void show_free_areas(unsigned int filter)
show_swap_cache_info();
 }
 
+unsigned long get_max_pfn(void)
+{
+   return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
zoneref->zone = zone;
-- 
1.9.1




[Qemu-devel] [PATCH v2 repost 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-26 Thread Liang Li
This patchset is for the kernel and contains two sets of changes to
virtio-balloon.

One change speeds up the inflating & deflating process. The main idea
of this optimization is to use a bitmap instead of the PFNs to send
the page information to the host, which reduces the overhead of
virtio data transmission, address translation and madvise(). This
improves the performance by about 85%.

The other change speeds up live migration. By skipping the guest's
free pages in the first round of data copy, needless data processing
is avoided, which saves a lot of CPU cycles and network bandwidth. We
put the guest's free page information in a bitmap and send it to the
host over a virtio-balloon virtqueue. For an idle 8GB guest, this
shortens the total live migration time from about 2 seconds to about
500 ms in a 10Gbps network environment.
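
As a back-of-the-envelope illustration (not taken from the patchset)
of why a bitmap is so much cheaper than a PFN array for a guest of
this size, assuming 4KB pages and 32-bit balloon PFNs:

#include <stdio.h>

int main(void)
{
    unsigned long long guest_bytes = 8ULL << 30; /* 8GB guest */
    unsigned long long page_size   = 4096;       /* 4KB pages */
    unsigned long long pages       = guest_bytes / page_size;

    /* virtio-balloon PFNs are 32-bit, i.e. 4 bytes each */
    printf("PFN array: %llu KB\n", pages * 4 / 1024); /* ~8192 KB */
    printf("bitmap   : %llu KB\n", pages / 8 / 1024); /* ~256 KB  */
    return 0;
}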


Changes from v1 to v2:
* Abandon the patch for dropping page cache.
* Put some structures to uapi head file.
* Use a new way to determine the page bitmap size.
* Use a unified way to send the free page information with the bitmap.
* Address the issues referred to in MST's comments.

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and page bitmap head
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define feature bit and head for misc virt queue
  mm: add the related functions to get free page info
  virtio-balloon: tell host vm's free page info

 drivers/virtio/virtio_balloon.c | 306 +++-
 include/uapi/linux/virtio_balloon.h |  41 +
 mm/page_alloc.c |  52 ++
 3 files changed, 359 insertions(+), 40 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-26 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient. The time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

By using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() on a bulk of
RAM pages instead of the current page-by-page way, so the overhead
of stage c and stage d can also be reduced a lot.
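
A minimal sketch of that bulk idea on the host side (hypothetical
helper, not from this series; test_bit is assumed from qemu/bitops.h
and madvise from <sys/mman.h>): find runs of set bits in the bitmap
and issue one madvise() per run instead of one per page.

/* Illustrative only: madvise whole runs of pages found in the bitmap.
 * ram_base is the host mapping of the guest RAM described by bmap. */
static void discard_free_runs(char *ram_base, const unsigned long *bmap,
                              unsigned long npages, unsigned long page_size)
{
    unsigned long i = 0;

    while (i < npages) {
        unsigned long run_start, run_len;

        while (i < npages && !test_bit(i, bmap))  /* skip used pages */
            i++;
        run_start = i;
        while (i < npages && test_bit(i, bmap))   /* collect a free run */
            i++;
        run_len = i - run_start;
        if (run_len)
            madvise(ram_base + run_start * page_size,
                    run_len * page_size, MADV_DONTNEED);
    }
}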

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z...@intel.com>
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: Mel Gorman <mgor...@techsingularity.net>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 184 +++-
 1 file changed, 162 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8d649a2..2d18ff6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -41,10 +41,28 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/*
+ * VIRTIO_BALLOON_PFNS_LIMIT limits the size of the page bitmap, for
+ * two reasons:
+ * 1) to save memory.
+ * 2) allocating a large bitmap may fail.
+ *
+ * The actual pfn limit is determined by:
+ * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
+ *
+ * If the system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, the page
+ * list is scanned and the PFNs are sent in several passes. To reduce the
+ * overhead of scanning the page list, VIRTIO_BALLOON_PFNS_LIMIT should
+ * be set to a value which covers most cases.
+ */
+#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+extern unsigned long get_max_pfn(void);
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -62,6 +80,15 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer of the bitmap header. */
+   void *bmap_hdr;
+   /* Bitmap and length used to tell the host the pages */
+   unsigned long *page_bitmap;
+   unsigned long bmap_len;
+   /* Pfn limit */
+   unsigned long pfn_limit;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -105,12 +132,45 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   if (balloon_pfn < vb->min_pfn)
+   vb->min_pfn = balloon_pfn;
+   if (balloon_pfn > vb->max_pfn)
+   vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
struct scatterlist sg;
unsigned int len;
 
-   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   unsigned long bmap_len;
+
+   /* cmd and req_id are not used here, set them to 0 */
+   hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, 0)

[Qemu-devel] [PATCH v2 repost 2/7] virtio-balloon: define new feature bit and page bitmap head

2016-07-26 Thread Liang Li
Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header is used to tell the host some information
about the page bitmap, e.g. the page size, the bitmap length and the
start pfn.
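
In other words, a set bit at index i in the bitmap describes the page
at pfn start_pfn + i, whose size is (1 << page_shift) bytes. A minimal
decoding sketch, assuming hdr and bitmap are the received header and
bitmap buffer, test_bit is available, and the endianness conversion of
the header fields is omitted for brevity:

/* Sketch only: walk one balloon_bmap_hdr plus its bitmap. */
unsigned long nbits = hdr->bmap_len * 8;  /* bmap_len is in bytes */
unsigned long i;

for (i = 0; i < nbits; i++) {
	if (test_bit(i, bitmap)) {
		unsigned long pfn = hdr->start_pfn + i;
		/* the page at pfn is covered by this report */
	}
}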

Signed-off-by: Liang Li <liang.z...@intel.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
Cc: Amit Shah <amit.s...@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 343d7dd..d3b182a 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP   3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
__virtio64 val;
 } __attribute__((packed));
 
+/* Page bitmap header structure */
+struct balloon_bmap_hdr {
+   /* Used to distinguish different request */
+   __virtio16 cmd;
+   /* Shift width of page in the bitmap */
+   __virtio16 page_shift;
+   /* flag used to identify different status */
+   __virtio16 flag;
+   /* Reserved */
+   __virtio16 reserved;
+   /* ID of the request */
+   __virtio64 req_id;
+   /* The pfn of 0 bit in the bitmap */
+   __virtio64 start_pfn;
+   /* The length of the bitmap, in bytes */
+   __virtio64 bmap_len;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1




[Qemu-devel] [QEMU v2 5/9] balloon: get free page info from guest

2016-07-14 Thread Liang Li
Add a new feature to get the free page information from the guest;
the free page information is saved in a bitmap. Please note that
'free page' means the page is free at some point after the host sets
the request ID and before it receives the response with the same ID.
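
A hedged sketch of how a caller (e.g. the migration path) might drive
the two new entry points added here; only the function names and the
error statuses (REQ_UNSUPPORT, REQ_INVALID_PARAM, REQ_ERROR) come from
this patch, and the polling loop is purely illustrative:

/* Illustrative caller, not part of this patch. */
static void try_skip_free_pages(unsigned long *free_page_bitmap,
                                unsigned long bitmap_len,
                                unsigned long req_id)
{
    BalloonReqStatus st;
    unsigned long ready_id = 0;

    if (!balloon_free_pages_support()) {
        return;
    }

    st = balloon_get_free_pages(free_page_bitmap, bitmap_len, req_id);
    if (st == REQ_UNSUPPORT || st == REQ_INVALID_PARAM || st == REQ_ERROR) {
        return;
    }

    /* poll until the guest has answered this particular request */
    do {
        st = balloon_free_page_ready(&ready_id);
    } while (st != REQ_ERROR && ready_id != req_id);
}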

Signed-off-by: Liang Li <liang.z...@intel.com>
---
 balloon.c  |  47 +-
 hw/virtio/virtio-balloon.c | 122 -
 include/hw/virtio/virtio-balloon.h |  18 +-
 include/sysemu/balloon.h   |  18 +-
 4 files changed, 200 insertions(+), 5 deletions(-)

diff --git a/balloon.c b/balloon.c
index f2ef50c..d6a3791 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,8 @@
 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonGetFreePage *balloon_get_free_page_fn;
+static QEMUBalloonFreePageReady *balloon_free_page_ready_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
 
@@ -65,9 +67,13 @@ static bool have_balloon(Error **errp)
 }
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonGetFreePage *get_free_page_func,
+ QEMUBalloonFreePageReady *free_page_ready_func,
+ void *opaque)
 {
-if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+if (balloon_event_fn || balloon_stat_fn || balloon_get_free_page_fn
+|| balloon_free_page_ready_fn || balloon_opaque) {
 /* We're already registered one balloon handler.  How many can
  * a guest really have?
  */
@@ -75,6 +81,8 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
 }
 balloon_event_fn = event_func;
 balloon_stat_fn = stat_func;
+balloon_get_free_page_fn = get_free_page_func;
+balloon_free_page_ready_fn = free_page_ready_func;
 balloon_opaque = opaque;
 return 0;
 }
@@ -86,6 +94,8 @@ void qemu_remove_balloon_handler(void *opaque)
 }
 balloon_event_fn = NULL;
 balloon_stat_fn = NULL;
+balloon_get_free_page_fn = NULL;
+balloon_free_page_ready_fn = NULL;
 balloon_opaque = NULL;
 }
 
@@ -116,3 +126,36 @@ void qmp_balloon(int64_t target, Error **errp)
 trace_balloon_event(balloon_opaque, target);
 balloon_event_fn(balloon_opaque, target);
 }
+
+bool balloon_free_pages_support(void)
+{
+return balloon_get_free_page_fn ? true : false;
+}
+
+BalloonReqStatus balloon_get_free_pages(unsigned long *bitmap,
+unsigned long len,
+unsigned long req_id)
+{
+if (!balloon_get_free_page_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!bitmap) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_get_free_page_fn(balloon_opaque, bitmap, len, req_id);
+}
+
+BalloonReqStatus balloon_free_page_ready(unsigned long *req_id)
+{
+if (!balloon_free_page_ready_fn) {
+return REQ_UNSUPPORT;
+}
+
+if (!req_id) {
+return REQ_INVALID_PARAM;
+}
+
+return balloon_free_page_ready_fn(balloon_opaque, req_id);
+}
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a7152c8..b0c09a7 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -150,6 +150,13 @@ static bool balloon_page_bitmap_supported(const 
VirtIOBalloon *s)
 return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 }
 
+static bool balloon_misc_vq_supported(const VirtIOBalloon *s)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
+return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ);
+}
+
 static bool balloon_stats_enabled(const VirtIOBalloon *s)
 {
 return s->stats_poll_interval > 0;
@@ -399,6 +406,52 @@ out:
 }
 }
 
+static void virtio_balloon_handle_resp(VirtIODevice *vdev, VirtQueue *vq)
+{
+VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+VirtQueueElement *elem;
+size_t offset = 0;
+struct balloon_bmap_hdr hdr;
+
+elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+if (!elem) {
+s->req_status = REQ_ERROR;
+return;
+}
+
+s->misc_vq_elem = elem;
+if (!elem->out_num) {
+return;
+}
+
+iov_to_buf(elem->out_sg, elem->out_num, offset,
   &hdr, sizeof(hdr));
+offset += sizeof(hdr);
+
+switch (hdr.cmd) {
+case BALLOON_GET_FREE_PAGES:
+if (hdr.req_id == s->misc_req.param) {
+if (s->bmap_len < hdr.start_pfn / BITS_PER_BYTE + hdr.bmap_len) {
+hdr.bmap_len = s->bmap_len - hdr.start_pfn / BITS_PER_BYTE;
+}
+
+iov_to_buf(elem->out_sg, elem->out_num, offset,
+   s->free_page_bmap + hdr.start_pfn / BITS_PER_LONG,
+   hdr.bmap_len);
+if (hdr

  1   2   3   4   >