RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-02-03 Thread Li, Liang Z
> 
> 
> > +static void free_extended_page_bitmap(struct virtio_balloon *vb) {
> > +   int i, bmap_count = vb->nr_page_bmap;
> > +
> > +   for (i = 1; i < bmap_count; i++) {
> > +   kfree(vb->page_bitmap[i]);
> > +   vb->page_bitmap[i] = NULL;
> > +   vb->nr_page_bmap--;
> > +   }
> > +}
> > +
> > +static void kfree_page_bitmap(struct virtio_balloon *vb) {
> > +   int i;
> > +
> > +   for (i = 0; i < vb->nr_page_bmap; i++)
> > +   kfree(vb->page_bitmap[i]);
> > +}
> 
> It might be worth commenting that pair of functions to make it clear why
> they are so different; I guess the kfree_page_bitmap is used just before you
> free the structure above it so you don't need to keep the count/pointers
> updated?
> 

Yes. I will add some comments for that. Thanks!

Liang
 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
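For reference, a commented version of the pair might read like this (a sketch based on Dave's reading above, not the actual follow-up patch):

static void free_extended_page_bitmap(struct virtio_balloon *vb)
{
        int i, bmap_count = vb->nr_page_bmap;

        /*
         * Free only the bitmaps allocated on demand, keeping the first
         * one.  The balloon keeps running after this, so the count and
         * pointers must stay consistent.
         */
        for (i = 1; i < bmap_count; i++) {
                kfree(vb->page_bitmap[i]);
                vb->page_bitmap[i] = NULL;
                vb->nr_page_bmap--;
        }
}

static void kfree_page_bitmap(struct virtio_balloon *vb)
{
        int i;

        /*
         * Called only on teardown, just before the virtio_balloon
         * structure itself is freed, so there is no need to keep the
         * count/pointers updated.
         */
        for (i = 0; i < vb->nr_page_bmap; i++)
                kfree(vb->page_bitmap[i]);
}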


RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-18 Thread Li, Liang Z
> On Wed, Jan 18, 2017 at 04:56:58AM +0000, Li, Liang Z wrote:
> > > > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > > -   virtqueue_kick(vq);
> > > > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > > > +   unsigned long base_pfn, int pages)
> > > >
> > > > -   /* When host has read buffer, this completes via balloon_ack */
> > > > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > +{
> > > > +   __le64 *range = vb->resp_data + vb->resp_pos;
> > > >
> > > > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > > > +   /* when the length field can't contain pages, set it to 0 to
> > >
> > > /*
> > >  * Multi-line
> > >  * comments
> > >  * should look like this.
> > >  */
> > >
> > > Also, pls start sentences with an upper-case letter.
> > >
> >
> > Sorry for that.
> >
> > > > +* indicate the actual length is in the next __le64;
> > > > +*/
> > >
> > > This is part of the interface so should be documented as such.
> > >
> > > > +   *range = cpu_to_le64((base_pfn <<
> > > > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > > > +   *(range + 1) = cpu_to_le64(pages);
> > > > +   vb->resp_pos += 2;
> > >
> > > Pls use structs for this kind of stuff.
> >
> > I am not sure if you mean to use
> >
> > struct range {
> > __le64 pfn : 52;
> > __le64 nr_page : 12;
> > };
> > instead of the shift operation?
> 
> Not just that. You want to add a pages field as well.
> 

A 'pages' field? Could you give more hints?

> Generally describe the format in the header in some way so host and guest
> can easily stay in sync.

'VIRTIO_BALLOON_NR_PFN_BITS' is for this purpose and it will be passed to the
related function in page_alloc.c as a parameter.

Thanks!
Liang
> All the pointer math and void * means we get zero type safety and I'm not
> happy about it.
> 
> It's not good that virtio format seeps out to page_alloc anyway.
> If unavoidable it is not a good idea to try to hide this fact, people will
> assume they can change the format at will.
> 
> --
> MST
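For reference, one possible reading of the struct suggestion, sketched under my own assumptions (the struct and helper names below are made up, not taken from the thread): give the record a type instead of doing raw __le64 pointer math.

/* One response slot; a 'pages' value of 0 in the packed word means the
 * real page count follows in the next slot. */
struct balloon_page_range {
        __le64 pfn_len;  /* base_pfn << VIRTIO_BALLOON_NR_PFN_BITS | pages */
};

/* Returns the number of slots consumed: 1, or 2 for long ranges. */
static unsigned int set_page_range(struct balloon_page_range *r,
                                   unsigned long base_pfn,
                                   unsigned long pages)
{
        if (pages >= (1UL << VIRTIO_BALLOON_NR_PFN_BITS)) {
                r[0].pfn_len = cpu_to_le64(base_pfn <<
                                           VIRTIO_BALLOON_NR_PFN_BITS);
                r[1].pfn_len = cpu_to_le64(pages);
                return 2;
        }
        r[0].pfn_len = cpu_to_le64((base_pfn <<
                                    VIRTIO_BALLOON_NR_PFN_BITS) | pages);
        return 1;
}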


RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-18 Thread Li, Liang Z
> > > > > > Signed-off-by: Liang Li 
> > > > > > Cc: Michael S. Tsirkin 
> > > > > > Cc: Paolo Bonzini 
> > > > > > Cc: Cornelia Huck 
> > > > > > Cc: Amit Shah 
> > > > > > Cc: Dave Hansen 
> > > > > > Cc: Andrea Arcangeli 
> > > > > > Cc: David Hildenbrand 
> > > > > > ---
> > > > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > > > >  1 file changed, 12 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > > > b/include/uapi/linux/virtio_balloon.h
> > > > > > index 343d7dd..2f850bf 100644
> > > > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > > > @@ -34,10 +34,14 @@
> > > > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> > > > > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> > > > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > > > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> > > > > >
> > > > > >  /* Size of a PFN in the balloon interface. */
> > > > > >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> > > > > >
> > > > > > +/* Bits width for the length of the pfn range */
> > > > >
> > > > > What does this mean? Couldn't figure it out.
> > > > >
> > > > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > > > +
> > > > > >  struct virtio_balloon_config {
> > > > > > /* Number of pages host wants Guest to give up. */
> > > > > > __u32 num_pages;
> > > > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > > > __virtio64 val;
> > > > > >  } __attribute__((packed));
> > > > > >
> > > > > > +/* Response header structure */
> > > > > > +struct virtio_balloon_resp_hdr {
> > > > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > > > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> > > > >
> > > > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > > > >
> > > >
> > > > Got it, will change in the next version.
> > > >
> > > > And could you help take a look at the other parts, as well as the QEMU part?
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > Yes but first I would like to understand how come no fields in this
> > > new structure come up if I search for them in the following patch. I
> > > don't see why
> >
> > It's not true, all of the fields will be referenced in the following
> > patches except the 'reserved' field.
> 
> But none of these are used in the following patch 3.

Yes. Only 'data_len' is used in patch 3, and for extensibility maybe at least
'cmd' is needed too. I should set it in patch 3 to some default value even if
it's not currently useful. 'flag' and 'id' are for patch 4. I just want to
reuse 'struct virtio_balloon_resp_hdr' and make the code simpler.

Thanks!
Liang



RE: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-18 Thread Li, Liang Z
> Am 21.12.2016 um 07:52 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use {pfn|length} to present
> > the page information instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping the
> > processing of the guest's unused pages in the first round of data copy,
> > to reduce needless data processing, this can help to save quite a lot
> > of CPU cycles and network bandwidth. We put the guest's unused page
> > information in a {pfn|length} array and send it to the host with the
> > virtqueue of virtio-balloon. For an idle guest with 8GB RAM, this can
> > help to shorten the total live migration time from 2s to about 500ms in
> > a 10Gbps network environment. For a guest with quite a lot of page
> > cache and few unused pages, it's possible to let the guest drop its
> > page cache before live migration; this case can benefit from this
> > new feature too.
> 
> I agree that both changes make sense (although the second change just
> smells very racy, as you also pointed out in the patch description), however I
> am not sure if virtio-balloon is really the right place for the latter change.
> 
> virtio-balloon is all about ballooning, nothing else. What you're doing is 
> using
> it as a way to communicate balloon-unrelated data from/to the hypervisor.
> Yes, it is also about guest memory, but completely unrelated to the purpose
> of the balloon device.
> 
> Maybe using virtio-balloon for this purpose is okay - I have mixed feelings
> (especially as I can't tell where else this could go). I would like to get a 
> second
> opinion on this.
> 

We discussed the implementation for a long time; making use of the current
virtio balloon seems better than the other solutions and is recommended by
Michael.

Thanks!
Liang
> --
> 
> David


RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-17 Thread Li, Liang Z
> > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > -   virtqueue_kick(vq);
> > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > +   unsigned long base_pfn, int pages)
> >
> > -   /* When host has read buffer, this completes via balloon_ack */
> > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +{
> > +   __le64 *range = vb->resp_data + vb->resp_pos;
> >
> > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > +   /* when the length field can't contain pages, set it to 0 to
> 
> /*
>  * Multi-line
>  * comments
>  * should look like this.
>  */
> 
> Also, pls start sentences with an upper-case letter.
> 

Sorry for that.

> > +* indicate the actual length is in the next __le64;
> > +*/
> 
> This is part of the interface so should be documented as such.
> 
> > +   *range = cpu_to_le64((base_pfn <<
> > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > +   *(range + 1) = cpu_to_le64(pages);
> > +   vb->resp_pos += 2;
> 
> Pls use structs for this kind of stuff.

I am not sure if you mean to use

struct range {
__le64 pfn : 52;
__le64 nr_page : 12;
};
instead of the shift operation?

I didn't do it this way because I don't want to include 'virtio-balloon.h' in
page_alloc.c, or copy the definition of this struct into page_alloc.c.

Thanks!
Liang
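For clarity, the bitfield form under discussion would look like this once made compilable. Note that bit-field layout is compiler- and ABI-dependent, which is one reason explicit shifts are usually preferred for a guest/host wire format; the comments here are illustrative:

struct range {
        __le64 pfn : 52;      /* start pfn of the range */
        __le64 nr_page : 12;  /* 0: real length in the next __le64 */
};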


RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-17 Thread Li, Liang Z
> Sent: Wednesday, January 18, 2017 3:11 AM
> To: Li, Liang Z
> Cc: k...@vger.kernel.org; virtio-...@lists.oasis-open.org; qemu-
> de...@nongnu.org; linux...@kvack.org; linux-ker...@vger.kernel.org;
> virtualization@lists.linux-foundation.org; amit.s...@redhat.com; Hansen,
> Dave; cornelia.h...@de.ibm.com; pbonz...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com
> Subject: Re: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new
> feature bit and head struct
> 
> On Fri, Jan 13, 2017 at 09:24:22AM +, Li, Liang Z wrote:
> > > On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > > > Add a new feature which supports sending the page information with
> > > > range array. The current implementation uses PFNs array, which is
> > > > not very efficient. Using ranges can improve the performance of
> > > > inflating/deflating significantly.
> > > >
> > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > Cc: Michael S. Tsirkin <m...@redhat.com>
> > > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > > Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
> > > > Cc: Amit Shah <amit.s...@redhat.com>
> > > > Cc: Dave Hansen <dave.han...@intel.com>
> > > > Cc: Andrea Arcangeli <aarca...@redhat.com>
> > > > Cc: David Hildenbrand <da...@redhat.com>
> > > > ---
> > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > b/include/uapi/linux/virtio_balloon.h
> > > > index 343d7dd..2f850bf 100644
> > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > @@ -34,10 +34,14 @@
> > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> > > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> > > >
> > > >  /* Size of a PFN in the balloon interface. */
> > > >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> > > >
> > > > +/* Bits width for the length of the pfn range */
> > >
> > > What does this mean? Couldn't figure it out.
> > >
> > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > +
> > > >  struct virtio_balloon_config {
> > > > /* Number of pages host wants Guest to give up. */
> > > > __u32 num_pages;
> > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > __virtio64 val;
> > > >  } __attribute__((packed));
> > > >
> > > > +/* Response header structure */
> > > > +struct virtio_balloon_resp_hdr {
> > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> > >
> > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > >
> >
> > Got it, will change in the next version.
> >
> > And could you help take a look at the other parts, as well as the QEMU part?
> >
> > Thanks!
> > Liang
> 
> Yes but first I would like to understand how come no fields in this new
> structure come up if I search for them in the following patch. I don't see why

It's not true, all of the fields will be referenced in the following patches
except the 'reserved' field.

> should I waste time on reviewing the implementation if the interface isn't
> reasonable. You don't have to waste it too - just send RFC patches with the
> header until we can agree on it.

OK. I will post the header part separately.

Thanks!
Liang
> 
> --
> MST
> 



RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-13 Thread Li, Liang Z
> On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > Add a new feature which supports sending the page information with
> > range array. The current implementation uses PFNs array, which is not
> > very efficient. Using ranges can improve the performance of
> > inflating/deflating significantly.
> >
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Paolo Bonzini 
> > Cc: Cornelia Huck 
> > Cc: Amit Shah 
> > Cc: Dave Hansen 
> > Cc: Andrea Arcangeli 
> > Cc: David Hildenbrand 
> > ---
> >  include/uapi/linux/virtio_balloon.h | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/include/uapi/linux/virtio_balloon.h
> > b/include/uapi/linux/virtio_balloon.h
> > index 343d7dd..2f850bf 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -34,10 +34,14 @@
> >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> >
> >  /* Size of a PFN in the balloon interface. */
> >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> >
> > +/* Bits width for the length of the pfn range */
> 
> What does this mean? Couldn't figure it out.
> 
> > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > +
> >  struct virtio_balloon_config {
> > /* Number of pages host wants Guest to give up. */
> > __u32 num_pages;
> > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > __virtio64 val;
> >  } __attribute__((packed));
> >
> > +/* Response header structure */
> > +struct virtio_balloon_resp_hdr {
> > +   __le64 cmd : 8; /* Distinguish different requests type */
> > +   __le64 flag: 8; /* Mark status for a specific request type */
> > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> 
> This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> 

Got it, will change in the next version. 

And could you help take a look at the other parts, as well as the QEMU part?

Thanks!
Liang
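Rewritten with the suggested fixed-width types, the header might look roughly like this (a sketch of the direction, not necessarily the final layout that was posted later):

struct virtio_balloon_resp_hdr {
        __u8 cmd;        /* Distinguish different request types */
        __u8 flag;       /* Mark status for a specific request type */
        __le16 id;       /* Distinguish requests of a specific type */
        __le32 data_len; /* Length of the following data, in bytes */
} __attribute__((packed));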


RE: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-09 Thread Li, Liang Z
Hi guys,

Could you help to review this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, December 21, 2016 2:52 PM
> To: k...@vger.kernel.org
> Cc: virtio-...@lists.oasis-open.org; qemu-de...@nongnu.org; linux-
> m...@kvack.org; linux-ker...@vger.kernel.org; virtualization@lists.linux-
> foundation.org; amit.s...@redhat.com; Hansen, Dave;
> cornelia.h...@de.ibm.com; pbonz...@redhat.com; m...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use {pfn|length} to present the page
> information instead of the PFNs, to reduce the overhead of virtio data
> transmission, address translation and madvise(). This can help to improve the
> performance by about 85%.
> 
> Another change is for speeding up live migration. By skipping the processing
> of the guest's unused pages in the first round of data copy, to reduce
> needless data processing, this can help to save quite a lot of CPU cycles
> and network bandwidth. We put the guest's unused page information in a
> {pfn|length} array and send it to the host with the virtqueue of
> virtio-balloon. For an idle guest with 8GB RAM, this can help to shorten the
> total live migration time from 2s to about 500ms in a 10Gbps network
> environment. For a guest with quite a lot of page cache and few unused
> pages, it's possible to let the guest drop its page cache before live
> migration; this case can benefit from this new feature too.
> 
> Changes from v5 to v6:
> * Drop the bitmap from the virtio ABI, use {pfn|length} only.
> * Enhance the API to get the unused page information from mm.
> 
> Changes from v4 to v5:
> * Drop the code to get the max_pfn, use another way instead.
> * Simplify the API to get the unused page information from mm.
> 
> Changes from v3 to v4:
> * Use the new scheme suggested by Dave Hansen to encode the bitmap.
> * Add code which is missed in v3 to handle migrate page.
> * Free the memory for the bitmap in time once the operation is done.
> * Address some of the comments in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (5):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and head struct
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define flags and head for host request vq
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 510
> 
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  34 +++
>  mm/page_alloc.c | 120 +
>  4 files changed, 621 insertions(+), 46 deletions(-)
> 
> --
> 1.9.1



RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > >
> > > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not
> > > long enough for the 'length', set the 'length' to a special value to
> > > indicate the "actual length is in the next 8 bytes".
> > >
> > > That will be much simpler. Right?
> >
> > Sounds fine to me.
> >
> 
> Sounds fine to me too indeed.
> 
> I'm only wondering what is the major point for compressing gpfn+len in
> 8 bytes in the common case, you already use sg_init_table to send down two
> pages, we could send three as well and avoid all math and bit shifts and ors,
> or not?
> 

Yes, we can use more pages for that.

> I agree with the above because from a performance prospective I tend to
> think the above proposal will run at least theoretically faster because the
> other way is to waste double amount of CPU cache, and bit mangling in the
> encoding and the later decoding on qemu side should be faster than
> accessing an array of double size, but then I'm not sure if it's measurable
> optimization. So I'd be curious to know the exact motivation and if it is to
> reduce the CPU cache usage or if there's some other fundamental reason to
> compress it.
> The header already tells qemu how big is the array payload, couldn't we just
> add more pages if one isn't enough?
> 

The original intention of compressing the PFN and length is to reduce the
memory required. Even though the code has changed a lot since the previous
versions, I think this is still true.

Now we allocate a buffer of a specified size to save the 'PFN|length'; when
the buffer is not big enough to hold all the page info for a specified order,
a double-size buffer will be allocated. This is what we want to avoid, because
the allocation may fail and allocation takes some time. For fast live
migration, time is a critical factor we have to consider: taking more time
means more unnecessary pages are sent, because live migration starts before
the request for unused pages gets a response.

Thanks

Liang

> Thanks,
> Andrea


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> > There still exists the case where MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with a huge amount of memory; then there are
> > only 28 bits left for the pfn, which is not enough.
> 
> Not related to the balloon but how would it help to set MAX_ORDER to 36?
> 

My point here is that MAX_ORDER may be configured to a big value.

> What the MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
> that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
> 
> The higher the MAX_ORDER the slower the kernel runs simply so the smaller
> the MAX_ORDER the better.
> 
> > Should we limit MAX_ORDER? I don't think so.
> 
> We shouldn't strictly depend on MAX_ORDER value but it's mostly limited
> already even if configurable at build time.
> 

I didn't know that and will take a look, thanks for your information.


Liang
> We definitely need it to reach at least the hugepage size, then it's mostly
> driver issue, but drivers requiring large contiguous allocations should rely 
> on
> CMA only or vmalloc if they only require it virtually contiguous, and not rely
> on larger MAX_ORDER that would slowdown all kernel allocations/freeing.


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> >
> > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not long
> > enough for the 'length', set the 'length' to a special value to indicate
> > the "actual length is in the next 8 bytes".
> >
> > That will be much simpler. Right?
> 
> Sounds fine to me.

Thanks for your inspiration!

Liang


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bits for the pfn, and the bits left for the
> >>> length are not enough. How about keeping 45 bits for the pfn and 19
> >>> bits for the length? 45 bits for the pfn can cover a 57-bit physical
> >>> address; that should be enough in the near future.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> >> is enough for
> >> x86 for a while.  Other architectures who knows?
> 
> Thinking about this some more...  There are really only two cases that
> matter: 4k pages and "much bigger" ones.
> 
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any cases.  For bigger ranges, 8
> vs 16 bytes means *nothing*.  And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
> 
> How about this:
>  * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
>  * One special 'length' value to mean "actual length in next 8 bytes"
> 
> That should be pretty simple to produce and decode.  We have two record
> sizes, but I think it is manageable.

It works. Now that we intend to use another 8 bytes for the length,
why not:

Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not long
enough for the 'length', set the 'length' to a special value to indicate the
"actual length is in the next 8 bytes".

That will be much simpler. Right?

Liang
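A sketch of the decode side of that scheme, assuming 52 pfn bits, 12 length bits and 0 as the escape value (names are illustrative, not from the patches):

/* Decode one record starting at p; returns the number of __le64 words
 * consumed: 1, or 2 when the escape value 0 was used. */
static int decode_range(const __le64 *p, u64 *pfn, u64 *len)
{
        u64 word = le64_to_cpu(p[0]);

        *pfn = word >> 12;
        *len = word & ((1ULL << 12) - 1);
        if (*len != 0)
                return 1;
        *len = le64_to_cpu(p[1]);  /* escape: the real length follows */
        return 2;
}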


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistics before deciding whether to rip the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it.  I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid.  I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length.  But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > > I just found that MAX_ORDER should be limited to 12 if we use length
> > > instead of order; if MAX_ORDER is configured to a value bigger than
> > > 12, it will make things more complex to handle this case.
> > >
> > > If we use order, we need to break a large memory range whose length is
> > > not a power of 2 into several small ranges; it also makes the code
> > > complex.
> >
> > I can't imagine it makes the code that much more complex.  It adds a
> > for loop.  Right?
> >
> > > It seems we leave too many bits for the pfn, and the bits left for the
> > > length are not enough. How about keeping 45 bits for the pfn and 19
> > > bits for the length? 45 bits for the pfn can cover a 57-bit physical
> > > address; that should be enough in the near future.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> > is enough for x86 for a while.  Other architectures who knows?
> 
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
> 
> I think we really want both length and order.
> 
> I understand how you are trying to pack them as tightly as possible.
> 
> However, I thought of a trick, we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
> 
> guest can program the sizes for each order through config space.
> 
> We will have 10 bits left for length.
> 

Please don't; we just got rid of the bitmap for simplification. :)

> It might make sense to also allow guest to program the number of bits used
> for order, this will make it easy to extend without host changes.
> 

There still exists the case where MAX_ORDER is configured to a large value,
e.g. 36 for a system with a huge amount of memory; then there are only 28
bits left for the pfn, which is not enough.
Should we limit MAX_ORDER? I don't think so.

It seems using order is better.

Thanks!
Liang
> --
> MST
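For illustration only, the 2-bit order scheme MST sketches above could be decoded roughly like this; the bit layout and all names here are my own assumptions, not part of the proposal:

/* Page size for each 2-bit order code, programmed by the guest through
 * config space; example values 4K/2M/1G, with one code spare. */
static u64 order_size[4] = { 4096, 2 * 1024 * 1024, 1024 * 1024 * 1024, 0 };

static u64 range_bytes(u64 record)
{
        u64 len = record & ((1ULL << 10) - 1);  /* 10 bits of length */
        u64 order = (record >> 10) & 0x3;       /* 2-bit order code */

        return len * order_size[order];
}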


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon
> >> for fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistics before deciding whether to rip the bitmap from the
> >>> ABI, am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove
> >> the bitmap, unless we have some data to support keeping it.  I don't
> >> feel as strongly about it, but I think their critique of it is pretty
> >> valid.  I think the consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or
> >> a length.  But, if we have 12 bits, then the argument for doing
> >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> >
> > I just found that MAX_ORDER should be limited to 12 if we use length
> > instead of order; if MAX_ORDER is configured to a value bigger than 12,
> > it will make things more complex to handle this case.
> >
> > If we use order, we need to break a large memory range whose length is
> > not a power of 2 into several small ranges; it also makes the code complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for loop.
> Right?
> 

Yes, just a little. :)

> > It seems we leave too many bits for the pfn, and the bits left for the
> > length are not enough. How about keeping 45 bits for the pfn and 19 bits
> > for the length? 45 bits for the pfn can cover a 57-bit physical address;
> > that should be enough in the near future.
> >
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is 
> enough for
> x86 for a while.  Other architectures who knows?

Yes. 


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistics before deciding whether to rip the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 
> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.

I just found that MAX_ORDER should be limited to 12 if we use length instead
of order; if MAX_ORDER is configured to a value bigger than 12, it will make
things more complex to handle this case.

If we use order, we need to break a large memory range whose length is not a
power of 2 into several small ranges; it also makes the code complex.

It seems we leave too many bits for the pfn, and the bits left for the length
are not enough. How about keeping 45 bits for the pfn and 19 bits for the
length? 45 bits for the pfn can cover a 57-bit physical address; that should
be enough in the near future.

What's your opinion?


thanks!
Liang

 


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> fast (de)inflating & fast live migration
> 
> Hello,
> 
> On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote:
> > > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > > What's the conclusion of your discussion? It seems you want some
> > > > statistics before deciding whether to rip the bitmap from the
> > > > ABI, am I right?
> > >
> > > I think Andrea and David feel pretty strongly that we should remove
> > > the bitmap, unless we have some data to support keeping it.  I don't
> > > feel as strongly about it, but I think their critique of it is
> > > pretty valid.  I think the consensus is that the bitmap needs to go.
> > >
> >
> > Thanks for your clarification.
> >
> > > The only real question IMNHO is whether we should do a power-of-2 or
> > > a length.  But, if we have 12 bits, then the argument for doing
> > > length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > So each item can represent at most 16MB; that seems not very big, but it
> > is enough for most cases.
> > Things become much simpler without the bitmap, and I like simple
> > solutions too. :)
> >
> > I will prepare the v6 and remove all the bitmap-related stuff. Thank you
> > all!
> 
> Sounds great!
> 
> I suggested to check the statistics, because collecting those stats looked
> simpler and quicker than removing all bitmap related stuff from the patchset.
> However if you prefer to prepare a v6 without the bitmap another perhaps
> more interesting way to evaluate the usefulness of the bitmap is to just run
> the same benchmark and verify that there is no regression compared to the
> bitmap enabled code.
> 
> The other issue with the bitmap is, the best case for the bitmap is ever less
> likely to materialize the more RAM is added to the guest. It won't regress
> linearly because after all there can be some locality bias in the buddy 
> splits,
> but if sync compaction is used in the large order allocations tried before
> reaching order 0, the bitmap payoff will regress close to linearly with the
> increase of RAM.
> 
> So it'd be good to check the stats or the benchmark on large guests, at least
> one hundred gigabytes or so.
> 
> Changing topic but still about the ABI features needed, so it may be relevant
> for this discussion:
> 
> 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
>memory from, using alloc_pages_node in guest. So you can ask to
>take X pages from vnode A, Y pages from vnode B, in one vmenter.
> 
> 2) allowing qemu to tell the guest to stop inflating the balloon and
>report a fragmentation limit being hit, when sync compaction
>powered allocations fails at certain power-of-two order granularity
>passed by qemu to the guest. This order constraint will be passed
>by default for hugetlbfs guests with 2MB hpage size, while it can
>be used optionally on THP backed guests. This option with THP
>guests would allow a highlevel management software to provide a
>"don't reduce guest performance" while shrinking the memory size of
>the guest from the GUI. If you deselect the option, you can shrink
>down to the last freeable 4k guest page, but doing so may have to
>split THP in the host (you don't know for sure if they were really
>THP but they could have been), and it may regress
>performance. Inflating the balloon while passing a minimum
>granularity "order" of the pages being zapped, will guarantee
>inflating the balloon cannot decrease guest performance
>instead. Plus it's needed for hugetlbfs anyway as far as I can
>tell. hugetlbfs would not be host enforceable even if the idea is
>not to free memory but only reduce the available memory of the
>guest (not without major changes that maps a hugetlb page with 4k
>ptes at least). While for a more cooperative usage of hugetlbfs
>guests, it's simply not useful to inflate the balloon at anything
>less than the "HPAGE_SIZE" hugetlbfs granularity.
> 
> We also plan to use userfaultfd to make the balloon driver host enforced (will
> work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to
> the ABI so it's not strictly relevant for this discussion.
> 
> On a side note, registering userfaultfd on the ballooned range, will keep
> khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED
> zapped sub-THP fragments no matter the sysfs tunings.
> 

Thanks for your elaboration!

> Thanks!
> Andrea


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistics before deciding whether to rip the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 

Thanks for your clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
So each item can represent at most 16MB; that seems not very big, but it is
enough for most cases.
Things become much simpler without the bitmap, and I like simple solutions
too. :)

I will prepare the v6 and remove all the bitmap-related stuff. Thank you all!

Liang




RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> > 1. Current patches do a hypercall for each order in the allocator.
> >This is inefficient, but independent from the underlying data
> >structure in the ABI, unless bitmaps are in play, which they aren't.
> > 2. Should we have bitmaps in the ABI, even if they are not in use by the
> >guest implementation today?  Andrea says they have zero benefits
> >over a pfn/len scheme.  Dave doesn't think they have zero benefits
> >but isn't that attached to them.  QEMU's handling gets more
> >complicated when using a bitmap.
> > 3. Should the ABI contain records each with a pfn/len pair or a
> >pfn/order pair?
> >3a. 'len' is more flexible, but will always be a power-of-two anyway
> > for high-order pages (the common case)
> 
> Len wouldn't be a power of two practically only if we detect adjacent pages
> of smaller order that may merge into larger orders we already allocated (or
> the other way around).
> 
> [addr=2M, len=2M] allocated at order 9 pass [addr=4M, len=1M] allocated at
> order 8 pass -> merge as [addr=2M, len=3M]
> 
> Not sure if it would be worth it, but that unless we do this, page-order or 
> len
> won't make much difference.
> 
> >3b. if we decide not to have a bitmap, then we basically have plenty
> > of space for 'len' and should just do it
> >3c. It's easiest for the hypervisor to turn pfn/len into the
> >madvise() calls that it needs.
> >
> > Did I miss anything?
> 
> I think you summarized fine all my arguments in your summary.
> 
> > FWIW, I don't feel that strongly about the bitmap.  Li had one
> > originally, but I think the code thus far has demonstrated a huge
> > benefit without even having a bitmap.
> >
> > I've got no objections to ripping the bitmap out of the ABI.
> 
> I think we need to see a statistic showing the number of bits set in each
> bitmap in average, after some uptime and lru churn, like running stresstest
> app for a while with I/O and then inflate the balloon and
> count:
> 
> 1) how many bits were set vs total number of bits used in bitmaps
> 
> 2) how many times bitmaps were used vs bitmap_len = 0 case of single
>page
> 
> My guess would be like very low percentage for both points.
> 

> So there is a connection with the MAX_ORDER..0 allocation loop and the ABI
> change, but I agree any of the ABIs proposed would still allow for this
> logic to be used. Bitmap or not bitmap, the loop would still work.

Hi guys,

What's the conclusion of your discussion?
It seems you want some statistics before deciding whether to rip the bitmap
from the ABI, am I right?

Thanks!
Liang 


RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 12/07/2016 05:35 AM, Li, Liang Z wrote:
> >> Am 30.11.2016 um 09:43 schrieb Liang Li:
> >> IOW in real examples, do we have really large consecutive areas or
> >> are all pages just completely distributed over our memory?
> >
> > The buddy system of the Linux kernel memory management shows there
> > should be quite a lot of consecutive pages as long as a portion of the
> > guest's memory is free.
> ...
> > If all pages were completely distributed over memory, it would mean the
> > memory fragmentation is very serious, and the kernel has mechanisms to
> > avoid that happening.
> 
> While it is correct that the kernel has anti-fragmentation mechanisms, I don't
> think it invalidates the question as to whether a bitmap would be too sparse
> to be effective.
> 
> > On the other hand, inflating should not happen at that point
> > because the guest is almost 'out of memory'.
> 
> I don't think this is correct.  Most systems try to run with relatively 
> little free
> memory all the time, using the bulk of it as page cache.  We have no reason
> to expect that ballooning will only occur when there is lots of actual free
> memory and that it will not occur when that same memory is in use as page
> cache.
> 

Yes.
> In these patches, you're effectively still sending pfns.  You're just sending
> one pfn per high-order page which is giving a really nice speedup.  IMNHO,
> you're avoiding doing a real bitmap because creating a bitmap means either
> have a really big bitmap, or you would have to do some sorting (or multiple
> passes) of the free lists before populating a smaller bitmap.
> 
> Like David, I would still like to see some data on whether the choice between
> bitmaps and pfn lists is ever clearly in favor of bitmaps.  You haven't
> convinced me, at least, that the data isn't even worth collecting.

I will try to get some data with a real workload and share it with you guys.

Thanks!
Liang


RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Li, Liang Z
> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar: filled the balloon with 15GB for a 16GB idle guest.
Using the bitmap, the madvise() count was reduced to 605; using the PFNs,
the madvise() count was 3932160. It means there are quite a lot of
consecutive bits set in the bitmap.
I didn't test a guest with a heavy memory workload.

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of the Linux kernel memory management shows there should be
quite a lot of consecutive pages as long as a portion of the guest's memory
is free.
If all pages were completely distributed over memory, it would mean the
memory fragmentation is very serious, and the kernel has mechanisms to avoid
that happening.
On the other hand, inflating should not happen at that point because the
guest is almost 'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David


RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-05 Thread Li, Liang Z
> >>> + mutex_lock(&vb->balloon_lock);
> >>> +
> >>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API 
> > simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> + if (ret == -ENOSPC) {
> >>> + void *new_resp_data;
> >>> +
> >>> + new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> + GFP_KERNEL);
> >>> + if (new_resp_data) {
> >>> + kfree(vb->resp_data);
> >>> + vb->resp_data = new_resp_data;
> >>> + vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not in use.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed. Some filled data may exist for the previous zone. Should we change
the API to
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
int order, unsigned long *pos, struct zone *zone)'?

Then we can use the 'zone' to record the zone to retry and not discard the
filled data.
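Independently of the API change, the doubling itself could preserve the already-collected entries by using krealloc() instead of kmalloc()+kfree() (a sketch of the idea, not code from the patch):

static int grow_resp_buf(struct virtio_balloon *vb)
{
        void *new_resp_data;

        /* krealloc() copies the entries already filled in, so a retry
         * can continue from the current position instead of starting
         * over. */
        new_resp_data = krealloc(vb->resp_data, 2 * vb->resp_buf_size,
                                 GFP_KERNEL);
        if (!new_resp_data)
                return -ENOMEM;

        vb->resp_data = new_resp_data;
        vb->resp_buf_size *= 2;
        return 0;
}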

> >> ...
> >>> +struct page_info_item {
> >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> + __le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> + __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8-byte and 0-byte bitmaps; should we support
> > more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks 
> take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the 
> extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code: just reserve
'bmap_len == 0x3f' for future extension; it's not used by the current code.

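With the escape value reserved, the record might look like this (a sketch; the macro name is made up, and a record using the escape would carry its real bitmap length in the following __le64):

#define BMAP_LEN_EXTENDED 0x3f /* real length carried in the next __le64 */

struct page_info_item {
        __le64 start_pfn : 52;  /* start pfn for the bitmap */
        __le64 page_shift : 6;  /* page shift width */
        __le64 bmap_len : 6;    /* bitmap length in bytes, or the escape */
};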
> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> + unsigned long *unused_pages, unsigned long size,
> >>> + int order, unsigned long *pos)
> >>> +{
> >>> + unsigned long pfn, flags;
> >>> + unsigned int t;
> >>> + struct list_head *curr;
> >>> + struct page_info_item *info;
> >>> +
> >>> + if (zone_is_empty(zone))
> >>> + return 0;
> >>> +
> >>> + spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> + if (*pos + zone->free_area[order].nr_free > size)
> >>> + return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes. My initial implementation partially fills; it's better for the
> > worst case.
> > I thought the above code is more efficient for most cases...
> > Do you think partially filling the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
---
> So, what is the nr_free pages limit where we no longer fit in the
> kmalloc()'d buffer, where this simply won't work?
---
There is no such limit: if the buffer is not big enough to save 'nr_free'
pages, get_unused_pages() will return '-ENOSPC', and the following code will
try to allocate a buffer of twice the size for retrying, until a buffer of
the proper size is allocated. The current order will not be skipped unless
the buffer allocation fails.

> Because if you don't get this right, it could mean that there are systems that
> simply *fail* here.



RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-04 Thread Li, Liang Z
> On 11/30/2016 12:43 AM, Liang Li wrote:
> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pos = 0;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret, order;
> > +
> > +   mutex_lock(&vb->balloon_lock);
> > +
> > +   for (order = MAX_ORDER - 1; order >= 0; order--) {
> 
> I scratched my head for a bit on this one.  Why are you walking over orders,
> *then* zones.  I *think* you're doing it because you can efficiently fill the
> bitmaps at a given order for all zones, then move to a new bitmap.  But, it
> would be interesting to document this.
> 

Yes, using the order is somewhat strange, but it's helpful to keep the API
simple.
Do you think it's acceptable?

> > +   pos = 0;
> > +   ret = get_unused_pages(vb->resp_data,
> > +vb->resp_buf_size / sizeof(unsigned long),
> > +order, &pos);
> 
> FWIW, get_unused_pages() is a pretty bad name.  "get" usually implies
> bumping reference counts or consuming something.  You're just "recording"
> or "marking" them.
> 

Will change to mark_unused_pages().

> > +   if (ret == -ENOSPC) {
> > +   void *new_resp_data;
> > +
> > +   new_resp_data = kmalloc(2 * vb->resp_buf_size,
> > +   GFP_KERNEL);
> > +   if (new_resp_data) {
> > +   kfree(vb->resp_data);
> > +   vb->resp_data = new_resp_data;
> > +   vb->resp_buf_size *= 2;
> 
> What happens to the data in ->resp_data at this point?  Doesn't this just
> throw it away?
> 

Yes, so we should make sure the data in resp_data is not in use.

> ...
> > +struct page_info_item {
> > +   __le64 start_pfn : 52; /* start pfn for the bitmap */
> > +   __le64 page_shift : 6; /* page shift width, in bytes */
> > +   __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> 
> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> 

Currently, we just use the 8-byte and 0-byte bitmaps; should we support more
than 64 bytes?

> > +static int  mark_unused_pages(struct zone *zone,
> > +   unsigned long *unused_pages, unsigned long size,
> > +   int order, unsigned long *pos)
> > +{
> > +   unsigned long pfn, flags;
> > +   unsigned int t;
> > +   struct list_head *curr;
> > +   struct page_info_item *info;
> > +
> > +   if (zone_is_empty(zone))
> > +   return 0;
> > +
> > +   spin_lock_irqsave(&zone->lock, flags);
> > +
> > +   if (*pos + zone->free_area[order].nr_free > size)
> > +   return -ENOSPC;
> 
> Urg, so this won't partially fill?  So, what the nr_free pages limit where we 
> no
> longer fit in the kmalloc()'d buffer where this simply won't work?
> 

Yes. My initial implementation partially fills; it's better for the worst
case.
I thought the above code is more efficient for most cases...
Do you think partially filling the bitmap is better?
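For comparison, a partial-fill variant of the quoted function might look like this (a sketch; it also releases the zone lock on the out-of-space path, which the quoted code above does not):

static int mark_unused_pages(struct zone *zone,
                unsigned long *unused_pages, unsigned long size,
                int order, unsigned long *pos)
{
        unsigned long pfn, flags;
        unsigned int t;
        struct list_head *curr;
        struct page_info_item *info;
        int ret = 0;

        if (zone_is_empty(zone))
                return 0;

        spin_lock_irqsave(&zone->lock, flags);

        for (t = 0; t < MIGRATE_TYPES; t++) {
                list_for_each(curr, &zone->free_area[order].free_list[t]) {
                        /* Record as much as fits; the caller can grow
                         * the buffer and retry from *pos. */
                        if (*pos >= size) {
                                ret = -ENOSPC;
                                goto out;
                        }
                        pfn = page_to_pfn(list_entry(curr, struct page, lru));
                        info = (struct page_info_item *)(unused_pages + *pos);
                        info->start_pfn = pfn;
                        info->page_shift = order + PAGE_SHIFT;
                        *pos += 1;
                }
        }
out:
        spin_unlock_irqrestore(&zone->lock, flags);
        return ret;
}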
 
> > +   for (t = 0; t < MIGRATE_TYPES; t++) {
> > +   list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +   pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +   info = (struct page_info_item *)(unused_pages +
> *pos);
> > +   info->start_pfn = pfn;
> > +   info->page_shift = order + PAGE_SHIFT;
> > +   *pos += 1;
> > +   }
> > +   }
> 
> Do we need to fill in ->bmap_len here?

For integrity, the bmap_len should be filled; I will add it.
I omitted this step just because QEMU assumes ->bmap_len is 0 and ignores
this field.

Thanks for your comment!

Liang


RE: [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-07 Thread Li, Liang Z
> On 11/06/2016 07:37 PM, Li, Liang Z wrote:
> >> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> >> On a 1TB system, that's 256 passes through the top-level loop.
> >> The bottom-level lists have tens of thousands of pages in them, even
> >> on my laptop.  Only 1/256 of these pages will get consumed in a given pass.
> >>
> > Your description is not exact.
> > A 32k bitmap is used only when there is little free memory left in the
> > system and extend_page_bitmap() failed to allocate more memory for the
> > bitmap. Otherwise dozens of 32k split bitmaps will be used; this version
> > limits the bitmap count to 32, which means we can use at most 32*32kB
> > for the bitmaps, which can cover 128GB of RAM. We can increase
> > the bitmap count limit to a larger value if 32 is not big enough.
> 
> OK, so it tries to allocate a large bitmap.  But, if it fails, it will try to 
> work with a
> smaller bitmap.  Correct?
> 
Yes.

> So, what's the _worst_ case?  It sounds like it is even worse than I was
> positing.
> 

The worst case is when only a 32KB bitmap can be allocated and there is a huge
amount of low-order (<3) free pages.

> >> That's an awfully inefficient way of doing it.  This patch
> >> essentially changed the data structure without changing the algorithm to
> populate it.
> >>
> >> Please change the *algorithm* to use the new data structure efficiently.
> >>  Such a change would only do a single pass through each freelist, and
> >> would choose whether to use the extent-based (pfn -> range) or
> >> bitmap-based approach based on the contents of the free lists.
> >
> > Saving the free page info to a raw bitmap first and then processing the
> > raw bitmap to get the proper 'extent-based' and 'bitmap-based' forms is
> > the most efficient way I can come up with to reduce the virtio data
> > transmission. Do you have a better idea?
> 
> That's kinda my point.  This patch *does* processing to try to pack the
> bitmaps full of pages from the various pfn ranges.  It's a form of processing
> that gets *REALLY*, *REALLY* bad in some (admittedly obscure) cases.
> 
> Let's not pretend that making an essentially unlimited number of passes over
> the free lists is not processing.
> 
> 1. Allocate as large of a bitmap as you can. (what you already do)
> 2. Iterate from the largest freelist order.  Store those pages in the
>    bitmap.
> 3. If you can no longer fit pages in the bitmap, return the list that
>    you have.
> 4. Make an approximation about where the bitmap does not make sense any
>    more, and fall back to listing individual PFNs.  This would make
>    sense, for instance, in a large zone with very few free order-0
>    pages left.
> 
Sounds good.  Should we ignore some of the order-0 pages in step 4 if the
bitmap is full, or should we retry to get a complete list of order-0 pages?
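To make sure I understand steps 1-4, a rough per-zone sketch
(bitmap_has_room() and record_pfn_list() are placeholder names, not existing
functions):

	int order, t;
	struct page *page;

	/* assumes zone->lock is held */
	for (order = MAX_ORDER - 1; order >= 0; order--) {
		for (t = 0; t < MIGRATE_TYPES; t++) {
			list_for_each_entry(page,
					&zone->free_area[order].free_list[t], lru) {
				unsigned long pfn = page_to_pfn(page);

				if (!bitmap_has_room(bmap, pfn, order))
					return record_pfn_list(zone, order); /* 3/4 */
				bitmap_set(bmap, pfn - base_pfn, 1UL << order);
			}
		}
	}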

> 
> > It seems the benefit we get for this feature is not as big as that in fast
> balloon inflating/deflating.
> >>
> >> You should not be using get_max_pfn().  Any patch set that continues
> >> to use it is not likely to be using a proper algorithm.
> >
> > Do you have any suggestion about how to avoid it?
> 
> Yes: get the pfns from the page free lists alone.  Don't derive them from the
> pfn limits of the system or zones.

get_max_pfn() can be avoided in this patch, but I think we can't avoid it
completely. We need it as a hint for allocating a properly sized bitmap. No?
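For example, mirroring the existing probe-time code, the hint would only be
used once, to size the buffer:

	pfn_limit = min(VIRTIO_BALLOON_PFNS_LIMIT, get_max_pfn());
	bmap_len = ALIGN(pfn_limit, BITS_PER_LONG) / BITS_PER_BYTE;

and never to drive the scan itself.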

Thanks!
Liang


RE: [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-06 Thread Li, Liang Z
> Please squish this and patch 5 together.  It makes no sense to separate them.
> 

OK.

> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret = 1, used_nr_bmap = 0, i;
> > +
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
> > +	    vb->nr_page_bmap == 1)
> > +   extend_page_bitmap(vb);
> > +
> > +   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
> > +	mutex_lock(&vb->balloon_lock);
> > +   last_pfn = get_max_pfn();
> > +
> > +   while (ret) {
> > +   clear_page_bitmap(vb);
> > +		ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
> > +			PFNS_PER_BMAP, vb->nr_page_bmap);
> 
> This changed the underlying data structure without changing the way that
> the structure is populated.
> 
> This algorithm picks a "PFNS_PER_BMAP * vb->nr_page_bmap"-sized set of
> pfns, allocates a bitmap for them, the loops through all zones looking for
> pages in any free list that are in that range.
> 
> Unpacking all the indirection, it looks like this:
> 
> for (pfn = 0; pfn < get_max_pfn(); pfn += BITMAP_SIZE_IN_PFNS)
>   for_each_populated_zone(zone)
>   for_each_migratetype_order(order, t)
> 			list_for_each(..., &zone->free_area[order])...
> 
> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> On a 1TB system, that's 256 passes through the top-level loop.
> The bottom-level lists have tens of thousands of pages in them, even on my
> laptop.  Only 1/256 of these pages will get consumed in a given pass.
> 
Your description is not exact.
A 32k bitmap is used only when there is little free memory left in the system
and extend_page_bitmap() has failed to allocate more memory for the bitmap.
Otherwise dozens of split 32k bitmaps will be used; this version limits the
bitmap count to 32, which means we can use at most 32*32 kB for the bitmaps,
which can cover 128GB of RAM. We can increase the bitmap count limit to a
larger value if 32 is not big enough.

> That's an awfully inefficient way of doing it.  This patch essentially changed
> the data structure without changing the algorithm to populate it.
> 
> Please change the *algorithm* to use the new data structure efficiently.
>  Such a change would only do a single pass through each freelist, and would
> choose whether to use the extent-based (pfn -> range) or bitmap-based
> approach based on the contents of the free lists.

Saving the free page info to a raw bitmap first and then processing the raw
bitmap to get the proper 'extent-based' and 'bitmap-based' forms is the most
efficient way I can come up with to reduce the virtio data transmission.
Do you have a better idea?


In QEMU, no matter how we encode the bitmap, the raw format bitmap will be
used in the end.  But what I did in this version is:
   kernel: get the raw bitmap  --> encode the bitmap
   QEMU: decode the bitmap --> get the raw bitmap

Is it worth doing this kind of job here? We save on the virtio data
transmission, but at the same time we do extra work.

It seems the benefit we get for this feature is not as big as that in fast 
balloon inflating/deflating.
> 
> You should not be using get_max_pfn().  Any patch set that continues to use
> it is not likely to be using a proper algorithm.

Do you have any suggestion about how to avoid it?

Thanks!
Liang



RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/26/2016 03:06 AM, Li, Liang Z wrote:
> > I am working on Dave's new bitmap scheme. I have finished the part of
> > getting the 'hybrid scheme bitmap' and found the complexity was more
> > than I expected. The main issue is that more memory is required to save
> > the 'hybrid scheme bitmap' besides that used to save the raw page
> > bitmap; in the worst case, the memory required is 3 times that of the
> > previous implementation.
> 
> Really?  Could you please describe the scenario where this occurs?
> > I am wondering if I should continue. As an alternative solution, how
> > about using a PFN array when inflating/deflating only a few pages?
> > Things would be much simpler.
> 
> Yes, using pfn lists is more efficient than using bitmaps for sparse bitmaps.
> Yes, there will be cases where it is preferable to just use pfn lists vs. any 
> kind
> of bitmap.
> 
> But, what does it matter?  At least with your current scheme where we go
> out and collect get_unused_pages(), we do the allocation up front.  The
> space efficiency doesn't matter at all for small sizes since we do the 
> constant-
> size allocation *anyway*.
> 
> I'm also pretty sure you can pack the pfn and page order into a single 64-bit
> word and have no bitmap for a given record.  That would make it pack just as
> well as the old pfns alone.  Right?

Yes, thanks for the reminder. I am using 128 bits now; I will change it to 64
bits. Let me finish the v4 first.

Thanks!
Liang

 


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> Cc: linux-ker...@vger.kernel.org; virtualization@lists.linux-foundation.org;
> linux...@kvack.org; virtio-...@lists.oasis-open.org; k...@vger.kernel.org;
> qemu-de...@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> pbonz...@redhat.com; cornelia.h...@de.ibm.com; amit.s...@redhat.com
> Subject: Re: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> (de)inflating & fast live migration
> 
> On 10/26/2016 03:13 AM, Li, Liang Z wrote:
'3 times the memory required' is not accurate, please ignore this, sorry ...
> > The complexity is the point.
> 
> What is making it so complex?  Can you describe the problems?

I plan to complete it first and send out the patch set, then discuss whether
it is worth it.  I need some time.

Thanks!
Liang


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has with
> a third version?
> >
> 
> Hi Dave & Michael,
> 
> I am working on Dave's new bitmap scheme. I have finished the part of
> getting the 'hybrid scheme bitmap' and found the complexity was more than
> I expected.  The main issue is that more memory is required to save the
> 'hybrid scheme bitmap' besides that used to save the raw page bitmap; in
> the worst case, the memory required is 3 times that of the previous
> implementation.
> 

'3 times the memory required' is not accurate, please ignore this, sorry ...
The complexity is the point. 

> I am wondering if I should continue. As an alternative solution, how about
> using a PFN array when inflating/deflating only a few pages? Things would
> be much simpler.
> 
> 
> Thanks!
> Liang
> 
> 
> 
> 



RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> 
> So, what do you want done with this patch set?  Do you want it applied as-is
> so that we can introduce a new host/guest ABI that we must support until
> the end of time?  Then, we go back in a year or two and add the newer
> format that addresses the deficiencies that this ABI has with a third version?
> 

Hi Dave & Michael,

I am working on Dave's new bitmap scheme. I have finished the part of getting
the 'hybrid scheme bitmap' and found the complexity was more than I expected.
The main issue is that more memory is required to save the 'hybrid scheme
bitmap' besides that used to save the raw page bitmap; in the worst case, the
memory required is 3 times that of the previous implementation.

I am wondering if I should continue. As an alternative solution, how about
using a PFN array when inflating/deflating only a few pages? Things would be
much simpler.


Thanks!
Liang 





RE: [RESEND PATCH v3 kernel 1/7] virtio-balloon: rework deflate to add page to a list

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Will allow faster notifications using a bitmap down the road.
> > balloon_pfn_to_page() can be removed because it's useless.
> 
> This is a pretty terse description of what's going on here.  Could you try to
> elaborate a bit?  What *is* the current approach?  Why does it not work
> going forward?  What do you propose instead?  Why is it better?

Sure. The description will be clearer if it's written as you suggest.
Thanks!

Liang 


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-23 Thread Li, Liang Z
> On Fri, Oct 21, 2016 at 10:25:21AM -0700, Dave Hansen wrote:
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has
> > with a third version?
> >
> 
> Exactly my questions.

Hi Dave & Michael,

In the V2, both of you thought that the memory I allocated for the bitmap was
too large and gave some suggestions, so I changed the implementation and used
scattered pages for the bitmap instead of a large physically contiguous
allocation. I didn't get any comments about the changes, so I am not sure
whether they are OK or not; that's why I resent the V3, I just want your
opinions about that part.

I will implement the new scheme as Dave suggested in V4. Before that, could
you take a look at this version and give some comments?

Thanks!
Liang



RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> 2016-08-08 14:35 GMT+08:00 Liang Li :
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> 
> I just read the slides of this feature for the recent KVM forum. The cloud
> providers care more about live migration downtime than total time, to
> avoid customers' perception; however, this feature increases downtime
> while acquiring the benefit of reduced total time. Maybe it would be more
> acceptable if there were no downside for downtime.
> 
> Regards,
> Wanpeng Li

In theory, there is no factor that will increase the downtime. There is no
additional operation and no more data copying during the stop-and-copy stage.
But in the test, the downtime increases, and this can be reproduced. I think
a busy network link may be the reason. With this optimization, a huge amount
of data is written to the socket in a shorter time, so some of the write
operations may need to wait. Without this optimization, zero page checking
takes more time and the network is not so busy.

If the guest is not an idle one, I think the gap in the downtime will not be
so obvious.  Anyway, the downtime is still less than the max_down_time set by
the user.

Thanks!
Liang


RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
Hi Michael,

I know you are very busy. If you have time, could you help to take a look at 
this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Thursday, August 18, 2016 9:06 AM
> To: Michael S. Tsirkin
> Cc: virtualization@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-de...@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; linux-
> ker...@vger.kernel.org
> Subject: RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> Hi Michael,
> 
> Could you help to review this version when you have time?
> 
> Thanks!
> Liang
> 
> > -Original Message-
> > From: Li, Liang Z
> > Sent: Monday, August 08, 2016 2:35 PM
> > To: linux-ker...@vger.kernel.org
> > Cc: virtualization@lists.linux-foundation.org; linux...@kvack.org;
> > virtio- d...@lists.oasis-open.org; k...@vger.kernel.org;
> > qemu-de...@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> > Hansen, Dave; Li, Liang Z
> > Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> > (de)inflating & fast live migration
> >
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> >
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> >
> > Changes from v2 to v3:
> > * Change the name of 'free page' to 'unused page'.
> > * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> > * Fix overwriting the page bitmap after kicking.
> > * Some of MST's comments for v2.
> >
> > Changes from v1 to v2:
> > * Abandon the patch for dropping page cache.
> > * Put some structures to uapi head file.
> > * Use a new way to determine the page bitmap size.
> > * Use a unified way to send the free page information with the bitmap
> > * Address the issues referred in MST's comments
> >
> >
> > Liang Li (7):
> >   virtio-balloon: rework deflate to add page to a list
> >   virtio-balloon: define new feature bit and page bitmap head
> >   mm: add a function to get the max pfn
> >   virtio-balloon: speed up inflate/deflate process
> >   mm: add the related functions to get unused page
> >   virtio-balloon: define feature bit and head for misc virt queue
> >   virtio-balloon: tell host vm's unused page info
> >
> >  drivers/virtio/virtio_balloon.c | 390
> > 
> >  include/linux/mm.h  |   3 +
> >  include/uapi/linux/virtio_balloon.h |  41 
> >  mm/page_alloc.c |  94 +
> >  4 files changed, 485 insertions(+), 43 deletions(-)
> >
> > --
> > 1.8.3.1



RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-17 Thread Li, Liang Z
Hi Michael,

Could you help to review this version when you have time? 

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Monday, August 08, 2016 2:35 PM
> To: linux-ker...@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-de...@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; Li, Liang Z
> Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use bitmap to send the page information to
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> free pages in the first round of data copy, to reduce needless data 
> processing,
> this can help to save quite a lot of CPU cycles and network bandwidth. We
> put guest's free page information in bitmap and send it to host with the virt
> queue of virtio-balloon. For an idle 8GB guest, this can help to shorten the
> total live migration time from 2Sec to about 500ms in the 10Gbps network
> environment.
> 
> Dave Hansen suggested a new scheme to encode the data structure,
> because of additional complexity, it's not implemented in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   mm: add the related functions to get unused page
>   virtio-balloon: define feature bit and head for misc virt queue
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 390
> 
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  41 
>  mm/page_alloc.c |  94 +
>  4 files changed, 485 insertions(+), 43 deletions(-)
> 
> --
> 1.8.3.1



RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-08 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 08/07/2016 11:35 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> 
> FWIW, I don't think it takes any additional complexity here, at least in the
> guest implementation side.  The thing I suggested would just mean explicitly
> calling out that there was a single bitmap instead of implying it in the ABI.
> 
> Do you think the scheme I suggested is the way to go?

Yes, I think so.  And I will do that in a later version. In this V3, I just
want to solve the issue caused by the large page bitmap in v2.

Liang


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-08-01 Thread Li, Liang Z
> > It's only small because it makes you rescan the free list.
> > So maybe you should do something else.
> > I looked at it a bit. Instead of scanning the free list, how about
> > scanning actual page structures? If page is unused, pass it to host.
> > Solves the problem of rescanning multiple times, does it not?
> 
> FWIW, I think the new data structure needs some work.
> 
> Before, we had a potentially very long list of 4k areas.  Now, we've just got 
> a
> very large bitmap.  The bitmap might not even be very dense if we are
> ballooning relatively few things.
> 
> Can I suggest an alternate scheme?  I think you actually need a hybrid
> scheme that has bitmaps but also allows more flexibility in the pfn ranges.
> The payload could be a number of records each containing 3 things:
> 
>   pfn, page order, length of bitmap (maybe in powers of 2)
> 
> Each record is followed by the bitmap.  Or, if the bitmap length is 0,
> immediately followed by another record.  A bitmap length of 0 implies a
> bitmap with the least significant bit set.  Page order specifies how many
> pages each bit represents.
> 
> This scheme could easily encode the new data structure you are proposing
> by just setting pfn=0, order=0, and a very long bitmap length.  But, it could
> handle sparse bitmaps much better *and* represent large pages much more
> efficiently.
> 
> There's plenty of space to fit a whole record in 64 bits.

I like your idea; it's more flexible, and it's very useful if we want to
optimize the page allocating stage further. I believe memory fragmentation
will not be very serious, so the performance won't be too bad even in the
worst case.
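Just to make the packing concrete, a sketch of one such 64-bit record (the
52/6/6 field split is my reading of your proposal, not a fixed ABI):

	/* [63:12] start pfn | [11:6] page order | [5:0] bitmap length */
	static inline __le64 pack_record(u64 pfn, u32 order, u32 bmap_len)
	{
		return cpu_to_le64((pfn << 12) | ((u64)order << 6) | bmap_len);
	}

A record with bmap_len == 0 then stands alone, and a non-zero bmap_len tells
the receiver how many bitmap bytes follow.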

Thanks!
Liang



RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len,
> GFP_KERNEL);
> > > >
> > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > How big was the pfn buffer before?
> > >
> > >
> > > Yes I would limit this to 1G memory in a go, will result in a 32KByte 
> > > bitmap.
> > >
> > > --
> > > MST
> >
> > Limiting to 1G is bad for performance; I sent you the test results
> > several weeks ago.
> >
> > Pasted below:
> > ---
> > About the size of the page bitmap, I have tested the performance of
> > filling the balloon to 15GB with a 16GB RAM VM.
> >
> > ===
> > 32K Byte (covers 1GB of RAM)
> >
> > Time spent on inflating: 2031ms
> > ---
> > 64K Byte (covers 2GB of RAM)
> >
> > Time spent on inflating: 1507ms
> > ---
> > 512K Byte (covers 16GB of RAM)
> >
> > Time spent on inflating: 1237ms
> > 
> >
> > If possible, a big bitmap is better for performance.
> >
> > Liang
> 
> Earlier you said:
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> Here sending PFNs to host with 512K Byte map should be almost free.
> 
> So is something else taking up the time?
> 
I just want to show you the benefits of using a big bitmap. :)
I did not measure the time spent on each stage after the optimization (I will
do it later), but I have tried allocating pages in big chunks and found it can
make things faster. Without big-chunk allocation, the performance improvement
is about 85%; with big-chunk allocation, the improvement is about 94%.

Liang

> 
> --
> MST


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 06:36:18AM +0000, Li, Liang Z wrote:
> > > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > > How big was the pfn buffer before?
> > > >
> > > > Yes, it is if the max pfn is more than 32GB.
> > > > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's
> > > > too small, and it's the main reason for bad performance.
> > > > Use the max 1MB kmalloc is a balance between performance and
> > > > flexibility, a large page bitmap covers the range of all the
> > > > memory is no good for a system with huge amount of memory. If the
> > > > bitmap is too small, it means we have to traverse a long list for
> > > > many times, and it's bad
> > > for performance.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > There are all your implementation decisions though.
> > >
> > > If guest memory is so fragmented that you only have order 0 4k
> > > pages, then allocating a huge 1M contigious chunk is very problematic in
> and of itself.
> > >
> >
> > The memory is allocated in the probe stage. This will not happen if
> > the driver is  loaded when booting the guest.
> >
> > > Most people rarely migrate and do not care how fast that happens.
> > > Wasting a large chunk of memory (and it's zeroed for no good reason,
> > > so you actually request host memory for it) for everyone to speed it
> > > up when it does happen is not really an option.
> > >
> > If people don't plan to do inflating/deflating, they should not enable
> > the virtio-balloon in the first place; once they decide to use it, the
> > driver should provide the best performance it can.
> 
> The reason people inflate/deflate is so they can overcommit memory.
> Do they need to overcommit very quickly? I don't see why.
> So let's get what we can for free but I don't really believe people would want
> to pay for it.
> 
> > 1MB is a very small portion for a VM with more than 32GB of memory, and
> > it's the *worst case*; for a VM with less than 32GB of memory, the
> > amount of RAM used depends on the VM's memory size and will be less
> > than 1MB.
> 
> It's guest memory so it might all be in swap and never touched; your memset
> at probe time will fault it in and make the hypervisor actually pay for it.
> 
> > If 1MB is too big, how about 512K, or 256K?  32K seems too small.
> >
> > Liang
> 
> It's only small because it makes you rescan the free list.
> So maybe you should do something else.
> I looked at it a bit. Instead of scanning the free list, how about scanning 
> actual
> page structures? If page is unused, pass it to host.
> Solves the problem of rescanning multiple times, does it not?
> 

Yes, agree.
> 
> Another idea: allocate a small bitmap at probe time (e.g. for deflate), 
> allocate
> a bunch more on each request. Use something like GFP_ATOMIC and a
> scatter/gather, if that fails use the smaller bitmap.
> 

So, the aim of v3 is to use a smaller bitmap without too heavy a performance
penalty.
Thanks a lot!

Liang




RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 03:06:37AM +0000, Li, Liang Z wrote:
> > > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page
> > > > +bitmap
> > > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > > + * 1) to save memory.
> > > > + * 2) allocate a large bitmap may fail.
> > > > + *
> > > > + * The actual limit of pfn is determined by:
> > > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > > + *
> > > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we
> > > > +will scan
> > > > + * the page list and send the PFNs with several times. To reduce
> > > > +the
> > > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > > > +should
> > > > + * be set with a value which can cover most cases.
> > >
> > > So what if it covers 1/32 of the memory? We'll do 32 exits and not
> > > 1, still not a big deal for a big guest.
> > >
> >
> > The issue here is that the overhead of scanning the page list 32 times
> > is too high.
> > Isn't limiting the page bitmap size to a fixed value better for a big
> > guest?
> >
> 
> I'd say avoid scanning free lists completely. Scan pages themselves and check
> the refcount to see whether they are free.
> This way each page needs to be tested once.
> 
> And skip the whole optimization if less than e.g. 10% is free.

That's better than rescanning the free list. Will change in next version.
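Roughly, the single pass would look like this (sketch; assumes the pfn range
and the bitmap are already set up, and that a zero refcount means the page was
free around the time of the scan):

	unsigned long pfn;

	for (pfn = min_pfn; pfn < max_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		if (page_count(page) == 0)
			set_bit(pfn - min_pfn, bitmap);
	}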

Thanks!

Liang




RE: [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info

2016-07-28 Thread Li, Liang Z
> >  }
> >
> > +static void update_free_pages_stats(struct virtio_balloon *vb,
> 
> why _stats?

Will change.

> > +	max_pfn = get_max_pfn();
> > +	mutex_lock(&vb->balloon_lock);
> > +	while (pfn < max_pfn) {
> > +		memset(vb->page_bitmap, 0, vb->bmap_len);
> > +		ret = get_free_pages(pfn, pfn + vb->pfn_limit,
> > +			vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
> > +		hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_FREE_PAGES);
> > +		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +		hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
> > +		hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
> > +		bmap_len = vb->pfn_limit / BITS_PER_BYTE;
> > +		if (!ret) {
> > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > +						    BALLOON_FLAG_DONE);
> > +			if (pfn + vb->pfn_limit > max_pfn)
> > +				bmap_len = (max_pfn - pfn) / BITS_PER_BYTE;
> > +		} else
> > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > +						    BALLOON_FLAG_CONT);
> > +		hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +		sg_init_one(&sg_out, hdr,
> > +			    sizeof(struct balloon_bmap_hdr) + bmap_len);
> 
> Wait a second. This adds the same buffer multiple times in a loop.
> We will overwrite the buffer without waiting for hypervisor to process it.
> What did I miss?

I am not quite sure about this part. I thought virtqueue_kick(vq) would
prevent the buffer from being overwritten; I realize now that's wrong.
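For v3 I will reuse the buffer only after the host has consumed it, along the
lines of the balloon_ack/vb->acked pattern the driver already uses for the
other queues (sketch):

	virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
	virtqueue_kick(vq);
	/* when host has read the buffer, this completes via balloon_ack */
	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
	/* only now is it safe to memset() and refill the bitmap */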

> > +
> > +		virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
> 
> this can fail. you want to maybe make sure vq has enough space before you
> use it or check error and wait.
> 
> > +   virtqueue_kick(vq);
> 
> why kick here within loop? wait until done. in fact kick outside lock is 
> better
> for smp.

I will change this part in v3.

> 
> > +   pfn += vb->pfn_limit;
> > +	static const char * const names[] = { "inflate", "deflate", "stats", "misc" };
> > int err, nvqs;
> >
> > /*
> >  * We expect two virtqueues: inflate and deflate, and
> >  * optionally stat.
> >  */
> > -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> > +   nvqs = 4;
> 
> Does misc vq depend on stats vq feature then? if yes please validate that.

Yes. What do you mean by 'validate' that?

> 
> 
> > +   else
> > +		nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> 
> Replace that ? with else too pls.

Will change.

Thanks!
Liang


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> >
> > Yes, it is if the max pfn is more than 32GB.
> > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too
> > small, and it's the main reason for bad performance.
> > Use the max 1MB kmalloc is a balance between performance and
> > flexibility, a large page bitmap covers the range of all the memory is
> > no good for a system with huge amount of memory. If the bitmap is too
> > small, it means we have to traverse a long list for many times, and it's bad
> for performance.
> >
> > Thanks!
> > Liang
> 
> There are all your implementation decisions though.
> 
> If guest memory is so fragmented that you only have order 0 4k pages, then
> allocating a huge 1M contiguous chunk is very problematic in and of itself.
> 

The memory is allocated in the probe stage. This will not happen if the driver 
is
 loaded when booting the guest.

> Most people rarely migrate and do not care how fast that happens.
> Wasting a large chunk of memory (and it's zeroed for no good reason, so you
> actually request host memory for it) for everyone to speed it up when it
> does happen is not really an option.
> 
If people don't plan to do inflating/deflating, they should not enable the
virtio-balloon in the first place; once they decide to use it, the driver
should provide the best performance it can.

1MB is a very small portion for a VM with more than 32GB of memory, and that
is the *worst case*; for a VM with less than 32GB of memory, the amount of RAM
used depends on the VM's memory size and will be less than 1MB.

If 1MB is too big, how about 512K, or 256K?  32K seems too small.

Liang

> --
> MST
> 



RE: [virtio-dev] Re: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7da61ad..3ad8b10
> > 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)  }
> > EXPORT_SYMBOL(get_max_pfn);
> >
> > +static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
> > +	unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
> > +{
> > +   unsigned long pfn, flags, page_num;
> > +   unsigned int order, t;
> > +   struct list_head *curr;
> > +
> > +   if (zone_is_empty(zone))
> > +   return;
> > +   end_pfn = min(start_pfn + len, end_pfn);
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +   for_each_migratetype_order(order, t) {
> 
> Why not do each order separately? This way you can use a single bit to pass a
> huge page to host.
> 

I thought about that before, and did not do it because of the complexity and
the small benefit. Using separate page bitmaps for each order won't help
reduce the data traffic, except by ignoring the pages with small order.

> Not a requirement but hey.
> 
> Alternatively (and maybe that is a better idea0 if you wanted to, you could
> just skip lone 4K pages.
> It's not clear that they are worth bothering with.
> Add a flag to start with some reasonably large order and go from there.
> 
One of the main reasons for this patch is to reduce the network traffic as
much as possible, so it looks strange to skip the lone 4K pages. Can skipping
these pages make live migration faster? I think it depends on the amount of
lone 4K pages.

On the other hand, it's faster to send one bit through virtio than to send 4K
bytes through even a 10Gbps network, is that true?
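A rough wire-time estimate (my numbers, best case): a 4KB page over a 10Gbps
link costs 4096 * 8 / 10^10 s, about 3.3us; reporting the same page free costs
one bit of bitmap over virtio, roughly 32,000x less data. So on raw bytes the
bitmap always wins; the real question is whether the per-page bookkeeping for
lone 4K pages eats up that saving.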

Liang




RE: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> + for_each_migratetype_order(order, t) {
> >>> + list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >>> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> + if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> + page_num = 1UL << order;
> >>> + if (pfn + page_num > end_pfn)
> >>> + page_num = end_pfn - pfn;
> >>> + bitmap_set(bitmap, pfn - start_pfn, page_num);
> >>> + }
> >>> + }
> >>> + }
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it
> >> nr_pages or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's send up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> >
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> >
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime
> > between request and response, so they are safe to free if they are
> > unmodified since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It wants
> to know which pages the guest thinks have good data and need to move.
> But, the list of free pages is (likely) smaller than the list of pages with 
> good
> data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether it
> was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.  So we never lose data, even if this "get free page info"
> stuff is totally out of date.
> 
> The patch description and code comments are, um, a _bit_ light for this level
> of subtlety. :)

I will add more description about this in v3.

Thanks!
Liang


RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > +/*
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT)
> > +/* 32GB */
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> 
> Please just include the correct header. No need for this hackery.
> 

Will change. Thanks!

Liang


RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> >
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > How big was the pfn buffer before?
> 
> 
> Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> 
> --
> MST

Limiting to 1G is bad for performance; I sent you the test results several
weeks ago.

Pasted below:

About the size of the page bitmap: I have tested the performance of filling
the balloon to 15GB with a 16GB RAM VM.

===
32K Byte (covers 1GB of RAM)

Time spent on inflating: 2031ms
---
64K Byte (covers 2GB of RAM)

Time spent on inflating: 1507ms
---
512K Byte (covers 16GB of RAM)

Time spent on inflating: 1237ms


If possible, a big bitmap is better for performance.

Liang



RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> 
> So what if it covers 1/32 of the memory? We'll do 32 exits and not 1, still 
> not a
> big deal for a big guest.
> 

The issue here is that the overhead of scanning the page list 32 times is too
high. Isn't limiting the page bitmap size to a fixed value better for a big
guest?

> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT)
> > +/* 32GB */
> 
> I already said this with a smaller limit.
> 
>   2<< 30  is 2G but that is not a useful comment.
>   pls explain what is the reason for this selection.
> 
> Still applies here.
> 

I will add the comment for this.

> > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +   unsigned long bmap_len;
> > +
> > +   /* cmd and req_id are not used here, set them to 0 */
> > +   hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> 
> no need to swap 0, just fill it in. in fact you allocated all 0s so no need 
> to touch
> these fields at all.
> 

Will change in v3.

> > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct
> > balloon_dev_info *vb_dev_info,  static int virtballoon_probe(struct
> > virtio_device *vdev)  {
> > struct virtio_balloon *vb;
> > -   int err;
> > +   int err, hdr_len;
> >
> > if (!vdev->config->get) {
> > dev_err(>dev, "%s failure: config access disabled\n",
> @@
> > -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > 	spin_lock_init(&vb->stop_update_lock);
> > vb->stop_update = false;
> > vb->num_pages = 0;
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> 
> What are these 2 longs in aid of?
> 
The rounddown(vb->start_pfn, BITS_PER_LONG) and roundup(vb->end_pfn,
BITS_PER_LONG) may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit, so we
need extra space to store the bitmap in that case; 2 longs are enough.
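(A quick worked example with BITS_PER_LONG = 64: for start_pfn = 100,
rounddown(100, 64) = 64 adds 36 bits at the front, and the roundup() at the
end adds at most 63 bits at the back; together the two extensions are always
less than 2 * 64 bits, hence 2 extra longs always suffice.)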

> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> So it can go up to 1MByte but adding header size etc you need a higher order
> allocation. This is a waste, there is no need to have a power of two 
> allocation.
> Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> Subtract the header and you get bitmap size.
> Calculate the pfn limit from there.
> 

Indeed, will change. Thanks a lot!

Liang



RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> was the pfn buffer before?

Yes, it is, if the max pfn is more than 32GB.
The size of the pfn buffer used before was 256*4 = 1024 bytes; that's too
small, and it's the main reason for the bad performance.
Using at most a 1MB kmalloc is a balance between performance and flexibility;
a large page bitmap covering the range of all the memory is no good for a
system with a huge amount of memory. If the bitmap is too small, we have to
traverse a long list many times, and that's bad for performance.

Thanks!
Liang   



RE: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 6/7] mm: add the related functions to get free
> page info
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   for_each_migratetype_order(order, t) {
> > +	list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +   pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +   if (pfn >= start_pfn && pfn <= end_pfn) {
> > +   page_num = 1UL << order;
> > +   if (pfn + page_num > end_pfn)
> > +   page_num = end_pfn - pfn;
> > +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> > +   }
> > +   }
> > +   }
> 
> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> number of bits being set in the bitmap.  Seems like calling it nr_pages or
> num_pages would be more appropriate.
> 

You are right,  will change.

> Isn't this bitmap out of date by the time it's send up to the hypervisor?  Is
> there something that makes the inaccuracy OK here?

Yes. The dirty page logging will be used to correct the inaccuracy.
The dirty page logging should be started before getting the free page bitmap;
then, if some of the free pages are written to and become non-free, those
pages will be tracked by the dirty page logging mechanism.
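On the QEMU side the ordering would look roughly like this (pseudo-code;
request_free_page_bitmap() is an illustrative name, not an existing API):

	memory_global_dirty_log_start();    /* 1. start tracking writes     */
	request_free_page_bitmap(balloon);  /* 2. then ask guest for bitmap */
	/* 3. a reported-free page is skipped only while it stays clean;
	 *    any later write marks it dirty and it is migrated as usual.
	 */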

Thanks!
Liang


RE: [virtio-dev] Re: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-26 Thread Li, Liang Z
> So I'm fine with this patchset, but I noticed it was not yet reviewed by MM
> people. And that is not surprising since you did not copy memory
> management mailing list on it.
> 
> I added linux...@kvack.org Cc on this mail but this might not be enough.
> 
> Please repost (e.g. [PATCH v2 repost]) copying the relevant mailing list so we
> can get some reviews.
> 

I will repost. Thanks!

Liang


RE: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-05 Thread Li, Liang Z
Ping ...

Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, June 29, 2016 6:32 PM
> To: m...@redhat.com
> Cc: linux-ker...@vger.kernel.org; virtualization@lists.linux-foundation.org;
> k...@vger.kernel.org; qemu-de...@nongnu.org; virtio-dev@lists.oasis-
> open.org; dgilb...@redhat.com; quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use bitmap to send the page information to
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping process guest's
> free pages in the first round of data copy, to reduce needless data 
> processing,
> this can help to save quite a lot of CPU cycles and network bandwidth. We
> put guest's free page information in bitmap and send it to host with the virt
> queue of virtio-balloon. For an idle 8GB guest, this can help to shorten the
> total live migration time from 2Sec to about 500ms in the 10Gbps network
> environment.
> 
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define feature bit and head for misc virt queue
>   mm: add the related functions to get free page info
>   virtio-balloon: tell host vm's free page info
> 
>  drivers/virtio/virtio_balloon.c | 306
> +++-
>  include/uapi/linux/virtio_balloon.h |  41 +
>  mm/page_alloc.c |  52 ++
>  3 files changed, 359 insertions(+), 40 deletions(-)
> 
> --
> 1.8.3.1



RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> > > Interesting. How about instead of tell host, we do multiple scans,
> > > each time ignoring pages out of range?
> > >
> > >   for (pfn = min pfn; pfn < max pfn; pfn += 1G) {
> > >   foreach page
> > >   if page pfn < pfn || page pfn >= pfn + 1G
> > >   continue
> > >   set bit
> > >   tell host
> > >   }
> > >
> >
> > That means we have to allocate/free all the requested pages first, and
> > then tell the host.
> > It works fine for inflating, but for deflating, because the page has
> > been deleted from vb->vb_dev_info->pages, we have to use a
> > struct to save the dequeued pages before calling
> > release_pages_balloon(),
> 
> struct list_head?  I think you can just replace set_page_pfns with
> list_add(&page->lru, &page_list);
> 
> 

That is fine, I will retry and get back to you.

Thanks!

Liang
> 
> >  I think a page bitmap is the best struct to save these pages, because it
> consumes less memory.
> > And that bitmap should be large enough to save pfn 0 to max_pfn.
> >
> > If the above is true, then we are back to square one. We really need a
> > large page bitmap. Right?
> >
> > Liang
> 
> These look like implementation issues to me.
> 
> I think the below might be helpful (completely untested), your work can go
> on top.
> 
> --->
> 
> virtio-balloon: rework deflate to add page to a tmp list
> 
> Will allow faster notifications using a bitmap down the road.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 476c0e3..44050a3 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -195,8 +195,9 @@ static void release_pages_balloon(struct virtio_balloon *vb)
>  static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  {
>   unsigned num_freed_pages;
> - struct page *page;
> + struct page *page, *next;
>   struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
> + LIST_HEAD(pages);   /* Pages dequeued for handing to Host */
> 
>   /* We can only do one array worth at a time. */
>   num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -207,10 +208,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>   page = balloon_page_dequeue(vb_dev_info);
>   if (!page)
>   break;
> - set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> + list_add(&page->lru, &pages);
>   vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>   }
> 
> + list_for_each_entry_safe(page, next, &pages, lru)
> + set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +
>   num_freed_pages = vb->num_pfns;
>   /*
>* Note that if
> --
> MST


RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> > > >
> > > > Hi MST,
> > > >
> > > > I have measured the performance when using a 32K page bitmap,
> > >
> > > Just to make sure. Do you mean a 32Kbyte bitmap?
> > > Covering 1Gbyte of memory?
> > Yes.
> >
> > >
> > > > and inflate the balloon to 3GB
> > > > of an idle guest with 4GB RAM.
> > >
> > > Should take 3 requests then, right?
> > >
> >
> > No, we can't choose the PFN when allocating a page in the balloon driver,
> > so the PFNs of the allocated pages may span a large range. We will
> > tell the host once pfn_max - pfn_min >= 0x40000 (a 1GB range), so the
> > request count is most likely to be more than 3.
> >
> > > > Now:
> > > > total inflating time: 338ms
> > > > the count of virtio data transmission:  373
> > >
> > > Why was this so high? I would expect 3 transmissions.
> >
> > I follow your suggestion:
> > -----------------------------------------------------------------
> > Suggestion to address all above comments:
> > 1. allocate a bunch of pages and link them up,
> >    calculating the min and the max pfn.
> >    if max-min exceeds the allocated bitmap size,
> >    tell host.
> > 2. limit allocated bitmap size to something reasonable.
> >    How about 32Kbytes? This is 256kilo bit in the map, which comes
> >    out to 1Giga bytes of memory in the balloon.
> > -----------------------------------------------------------------
> > Because the PFNs of the allocated pages are not linearly increased,
> > 3 transmissions are impossible.
> >
> >
> > Liang
> 
> Interesting. How about instead of tell host, we do multiple scans, each time
> ignoring pages out of range?
> 
>   for (pfn = min pfn; pfn < max pfn; pfn += 1G) {
>   foreach page
>   if page pfn < pfn || page pfn >= pfn + 1G
>   continue
>   set bit
>   tell host
>   }
> 

That means we have to allocate/free all the requested pages first, and then 
tell the host.
It works fine for inflating, but for deflating, because the pages have been 
deleted from vb->vb_dev_info->pages, we have to use some structure to save 
the dequeued pages before calling release_pages_balloon().
I think a page bitmap is the best structure to save these pages, because it 
consumes less memory.
And that bitmap should be large enough to cover pfn 0 to max_pfn.

If the above is true, then we are back to square one. We really need a 
large page bitmap. Right?
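
For a rough sense of the sizes involved, a sketch (the sizing expression 
mirrors the balloon_page_bitmap_init() code quoted elsewhere in this 
thread; the worked numbers in the comment are mine):

/* One bit per page frame, covering pfn 0..max_pfn. */
bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
/*
 * e.g. a 4GB guest with 4KB pages: max_pfn = 0x100000, so the
 * bitmap is 1M bits = 128KB; a 1TB guest would need 32MB.
 */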

Liang




RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> On Wed, May 25, 2016 at 08:48:17AM +0000, Li, Liang Z wrote:
> > > > > Suggestion to address all above comments:
> > > > >   1. allocate a bunch of pages and link them up,
> > > > >  calculating the min and the max pfn.
> > > > >  if max-min exceeds the allocated bitmap size,
> > > > >  tell host.
> > > >
> > > > I am not sure if it works well in some cases, e.g. The allocated
> > > > pages are across a wide range and the max-min > limit is very
> > > > frequently to be
> > > true.
> > > > Then, there will be many times of virtio transmission and it's bad
> > > > for performance improvement. Right?
> > >
> > > It's a tradeoff for sure. Measure it, see what the overhead is.
> > >
> >
> > Hi MST,
> >
> > I have measured the performance when using a 32K page bitmap,
> 
> Just to make sure. Do you mean a 32Kbyte bitmap?
> Covering 1Gbyte of memory?
Yes.

> 
> > and inflate the balloon to 3GB
> > of an idle guest with 4GB RAM.
> 
> Should take 3 requests then, right?
> 

No, we can't choose the PFN when allocating a page in the balloon driver,
so the PFNs of the allocated pages may span a large range. We will
tell the host once pfn_max - pfn_min >= 0x40000 (a 1GB range),
so the request count is most likely to be more than 3. 
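
To make the flush rule above concrete, a rough sketch (illustrative only,
not the exact patch code):

/* A 32KB bitmap holds 256K bits, i.e. 0x40000 pfns = 1GB of memory.
 * Flush to the host whenever the tracked pfn window outgrows that. */
if (balloon_pfn < vb->start_pfn)
        vb->start_pfn = balloon_pfn;
if (balloon_pfn > vb->end_pfn)
        vb->end_pfn = balloon_pfn;
if (vb->end_pfn - vb->start_pfn >= 0x40000) {
        tell_host(vb, vb->inflate_vq);             /* send current bitmap */
        vb->start_pfn = vb->end_pfn = balloon_pfn; /* restart the window  */
}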


> > Now:
> > total inflating time: 338ms
> > the count of virtio data transmission:  373
> 
> Why was this so high? I would expect 3 transmissions.

I follow your suggestion:

Suggestion to address all above comments:
1. allocate a bunch of pages and link them up,
   calculating the min and the max pfn.
   if max-min exceeds the allocated bitmap size,
   tell host.
2. limit allocated bitmap size to something reasonable.
   How about 32Kbytes? This is 256kilo bit in the map, which comes
   out to 1Giga bytes of memory in the balloon.
-----------------------------------------------------------------
Because the PFNs of the allocated pages are not linearly increased, 3 
transmissions are impossible.


Liang

> 
> > the call count of madvise: 865
> >
> > before:
> > total inflating time: 175ms
> > the count of virtio data transmission: 1 the call count of madvise: 42
> >
> > Maybe the result will be worse if the guest is not idle, or the guest has
> more RAM.
> > Do you want more data?
> >
> > Is it worth to do that?
> >
> > Liang
> 
> Either my math is wrong or there's an implementation bug.
> 
> > > >
> > > > >   2. limit allocated bitmap size to something reasonable.
> > > > >  How about 32Kbytes? This is 256kilo bit in the map, which 
> > > > > comes
> > > > >  out to 1Giga bytes of memory in the balloon.
> > > >
> > > > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> > > memory.
> > > > Maybe it's better to use a big page bitmap the save the pages
> > > > allocated by balloon, and split the big page bitmap to 32K bytes
> > > > unit, then
> > > transfer one unit at a time.
> > >
> > > How is this different from what I said?
> > >
> > > >
> > > > Should we use a page bitmap to replace 'vb->pages' ?
> > > >
> > > > How about rolling back to use PFNs if the count of requested pages
> > > > is a
> > > small number?
> > > >
> > > > Liang
> > >
> > > That's why we have start pfn. you can use that to pass even a single
> > > page without a lot of overhead.
> > >
> > > > > > --
> > > > > > 1.9.1
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe kvm"
> > > > > in the body of a message to majord...@vger.kernel.org More
> > > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-25 Thread Li, Liang Z
> > > Suggestion to address all above comments:
> > >   1. allocate a bunch of pages and link them up,
> > >  calculating the min and the max pfn.
> > >  if max-min exceeds the allocated bitmap size,
> > >  tell host.
> >
> > I am not sure if it works well in some cases, e.g. The allocated pages
> > are across a wide range and the max-min > limit is very frequently to be
> true.
> > Then, there will be many times of virtio transmission and it's bad for
> > performance improvement. Right?
> 
> It's a tradeoff for sure. Measure it, see what the overhead is.
> 

Hi MST,

I have measured the performance when using a 32K page bitmap, and inflating 
the balloon to 3GB of an idle guest with 4GB RAM.

Now: 
total inflating time: 338ms
the count of virtio data transmission:  373
the call count of madvise: 865

before:
total inflating time: 175ms
the count of virtio data transmission: 1
the call count of madvise: 42

Maybe the result will be worse if the guest is not idle, or the guest has 
more RAM.
Do you want more data?

Is it worth doing that?

Liang

> >
> > >   2. limit allocated bitmap size to something reasonable.
> > >  How about 32Kbytes? This is 256kilo bit in the map, which comes
> > >  out to 1Giga bytes of memory in the balloon.
> >
> > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> memory.
> > Maybe it's better to use a big page bitmap the save the pages
> > allocated by balloon, and split the big page bitmap to 32K bytes unit, then
> transfer one unit at a time.
> 
> How is this different from what I said?
> 
> >
> > Should we use a page bitmap to replace 'vb->pages' ?
> >
> > How about rolling back to use PFNs if the count of requested pages is a
> small number?
> >
> > Liang
> 
> That's why we have start pfn. you can use that to pass even a single page
> without a lot of overhead.
> 
> > > > --
> > > > 1.9.1
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majord...@vger.kernel.org More majordomo
> > > info at http://vger.kernel.org/majordomo-info.html


RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> > > > > > > This is grossly inefficient if you only requested a single page.
> > > > > > > And it's also allocating memory very aggressively without
> > > > > > > ever telling the host what is going on.
> > > > > >
> > > > > > If only requested a single page, there is no need  to send the
> > > > > > entire page bitmap, This RFC patch has already considered about
> this.
> > > > >
> > > > > where's that addressed in code?
> > > > >
> > > >
> > > > By recording the start_pfn and end_pfn.
> > > >
> > > > The start_pfn & end_pfn will be updated in set_page_bitmap() and
> > > > will be used in the function tell_host():
> > > >
> > > > ------------------------------------------------------------------
> > > > +static void set_page_bitmap(struct virtio_balloon *vb, struct page *page)
> > > > +{
> > > > +   unsigned int i;
> > > > +   unsigned long *bitmap = vb->page_bitmap;
> > > > +   unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > > > +
> > > > +   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> > > > +   set_bit(balloon_pfn + i, bitmap);
> > >
> > > BTW, there's a page size value in header so there is no longer need
> > > to set multiple bits per page.
> >
> > Yes, you are right.
> >
> > >
> > > > +   if (balloon_pfn < vb->start_pfn)
> > > > +   vb->start_pfn = balloon_pfn;
> > > > +   if (balloon_pfn > vb->end_pfn)
> > > > +   vb->end_pfn = balloon_pfn;
> > > > +}
> > >
> > > Sounds good, but you also need to limit by allocated bitmap size.
> >
> > Why should we limit the page bitmap size? Is it no good to send a large
> page bitmap?
> > or to save the memory used for page bitmap? Or some other reason?
> 
> To save memory. First, allocating a large bitmap can fail; second, this is
> pinned memory that is wasted - it's unused most of the time while the guest
> is running.

Makes sense.

> 
> > >
> > > >
> > > > +   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > > +   struct scatterlist sg[5];
> > > > +
> > > > +   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > > +   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > > > +
> > > > +   sg_init_table(sg, 5);
> > > > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > > > +(start_pfn / BITS_PER_LONG), bmap_len);
> > >
> > > Looks wrong. start_pfn should start at offset 0 I think ...
> >
> > I don't know what is wrong here, could you tell me why?
> >
> > Thanks!
> >
> > Liang
> 
> start_pfn should mean "bit 0 in bitmap refers to pfn X".
> So it does not make sense to also add it as offset within bitmap.

I got it, but in my code it's correct, because the page_bitmap
ranges from pfn 0 to max_pfn. 
It should be changed if we are going to use a small page bitmap.
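
In other words, an illustrative comparison of the two layouts:

/* Full-range bitmap (bit N means pfn N): indexing by start_pfn is fine. */
sg_set_buf(&sg[4], vb->page_bitmap + (start_pfn / BITS_PER_LONG), bmap_len);

/* Small bitmap (bit 0 means pfn == start_pfn): the buffer must start
 * at offset 0. */
sg_set_buf(&sg[4], vb->page_bitmap, bmap_len);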

Thanks!

Liang


> 
> --
> MST


RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> > > > > This can be pre-initialized, correct?
> > > >
> > > > pre-initialized? I am not quite understand your mean.
> > >
> > > I think you can maintain sg as part of device state and init sg with the
> bitmap.
> > >
> >
> > I got it.
> >
> > > > > This is grossly inefficient if you only requested a single page.
> > > > > And it's also allocating memory very aggressively without ever
> > > > > telling the host what is going on.
> > > >
> > > > If only a single page is requested, there is no need to send the
> > > > entire page bitmap; this RFC patch has already considered this.
> > >
> > > where's that addressed in code?
> > >
> >
> > By recording the start_pfn and end_pfn.
> >
> > The start_pfn & end_pfn will be updated in set_page_bitmap() and will
> > be used in the function tell_host():
> >
> > ------------------------------------------------------------------
> > +static void set_page_bitmap(struct virtio_balloon *vb, struct page *page)
> > +{
> > +   unsigned int i;
> > +   unsigned long *bitmap = vb->page_bitmap;
> > +   unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > +
> > +   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
> > +   set_bit(balloon_pfn + i, bitmap);
> 
> BTW, there's a page size value in header so there is no longer need to set
> multiple bits per page.

Yes, you are right.

> 
> > +   if (balloon_pfn < vb->start_pfn)
> > +   vb->start_pfn = balloon_pfn;
> > +   if (balloon_pfn > vb->end_pfn)
> > +   vb->end_pfn = balloon_pfn;
> > +}
> 
> Sounds good, but you also need to limit by allocated bitmap size.

Why should we limit the page bitmap size? Is it not good to send a large page 
bitmap? Or is it to save the memory used for the page bitmap? Or some other 
reason?
> 
> >
> > +   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > +   struct scatterlist sg[5];
> > +
> > +   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > +   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > +
> > +   sg_init_table(sg, 5);
> > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > +(start_pfn / BITS_PER_LONG), bmap_len);
> 
> Looks wrong. start_pfn should start at offset 0 I think ...

I don't know what is wrong here, could you tell me why?

Thanks!

Liang




RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> > > >  {
> > > > -   struct scatterlist sg;
> > > > unsigned int len;
> > > >
> > > > -   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > > +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > > +   u32 page_shift = PAGE_SHIFT;
> > > > +   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > > +   struct scatterlist sg[5];
> > > > +
> > > > +   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > > +   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > > > +
> > > > +   sg_init_table(sg, 5);
> > > > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > > > +(start_pfn / BITS_PER_LONG), bmap_len);
> > >
> > > This can be pre-initialized, correct?
> >
> > pre-initialized? I am not quite understand your mean.
> 
> I think you can maintain sg as part of device state and init sg with the 
> bitmap.
> 

I got it.

> > > This is grossly inefficient if you only requested a single page.
> > > And it's also allocating memory very aggressively without ever
> > > telling the host what is going on.
> >
> > If only a single page is requested, there is no need to send the entire
> > page bitmap; this RFC patch has already considered this.
> 
> where's that addressed in code?
> 

By recording the start_pfn and end_pfn.

The start_pfn & end_pfn will be updated in set_page_bitmap()
and will be used in the function tell_host():

------------------------------------------------------------------
+static void set_page_bitmap(struct virtio_balloon *vb, struct page *page)
+{
+   unsigned int i;
+   unsigned long *bitmap = vb->page_bitmap;
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
+   set_bit(balloon_pfn + i, bitmap);
+   if (balloon_pfn < vb->start_pfn)
+   vb->start_pfn = balloon_pfn;
+   if (balloon_pfn > vb->end_pfn)
+   vb->end_pfn = balloon_pfn;
+}


+   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
+   struct scatterlist sg[5];
+
+   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
+   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
+   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
+
+   sg_init_table(sg, 5);
+   sg_set_buf(&sg[0], &flags, sizeof(flags));
+   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
+   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
+   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
+   sg_set_buf(&sg[4], vb->page_bitmap +
+(start_pfn / BITS_PER_LONG), bmap_len);
+   virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
---
> > But it can work very well if requesting several pages which are across
> > a large range.
> 
> Some kind of limit on range would make sense though.
> It need not cover max pfn.
> 

Yes, agree.

> > > Suggestion to address all above comments:
> > >   1. allocate a bunch of pages and link them up,
> > >  calculating the min and the max pfn.
> > >  if max-min exceeds the allocated bitmap size,
> > >  tell host.
> >
> > I am not sure if it works well in some cases, e.g. The allocated pages
> > are across a wide range and the max-min > limit is very frequently to be
> true.
> > Then, there will be many times of virtio transmission and it's bad for
> > performance improvement. Right?
> 
> It's a tradeoff for sure. Measure it, see what the overhead is.

OK, I will try and get back to you.

> 
> >
> > >   2. limit allocated bitmap size to something reasonable.
> > >  How about 32Kbytes? This is 256kilo bit in the map, which comes
> > >  out to 1Giga bytes of memory in the balloon.
> >
> > So, even the VM has 1TB of RAM, the page bitmap will take 32MB of
> memory.
> > Maybe it's better to use a big page bitmap the save the pages
> > allocated by balloon, and split the big page bitmap to 32K bytes unit, then
> transfer one unit at a time.
> 
> How is this different from what I said?
> 

It's good if it's the same as you said.

Thanks!
Liang

> >
> > Should we use a page bitmap to replace 'vb->pages' ?
> >
> > How about rolling back to use PFNs if the count of requested pages is a
> small number?
> >
> > Liang
> 
> That's why we have start pfn. you can use that to pass even a single page
> without a lot of overhead.
> 
> > > > --

RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> > On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient. Below is a test result of the time spent on inflating
> > > the balloon to 3GB of a 4GB idle guest:
> > >
> > > a. allocating pages (6.5%, 103ms)
> > > b. sending PFNs to host (68.3%, 787ms)
> > > c. address translation (6.1%, 96ms)
> > > d. madvise (19%, 300ms)
> > >
> > > It takes about 1577ms for the whole inflating process to complete.
> > > The test shows that the bottleneck is stage b and stage d.
> > >
> > > If using a bitmap to send the page info instead of the PFNs, we can
> > > reduce the overhead spent on stage b quite a lot. Furthermore, it's
> > > possible to do the address translation and the madvise with a bulk
> > > of pages, instead of the current page-per-page way, so the overhead
> > > of stage c and stage d can also be reduced a lot.
> > >
> > > This patch is the kernel side implementation which is intended to
> > > speed up the inflating & deflating process by adding a new feature
> > > to the virtio-balloon device. And now, inflating the balloon to 3GB
> > > of a 4GB idle guest only takes 175ms, it's about 9 times as fast as 
> > > before.
> > >
> > > TODO: optimize stage a by allocating/freeing a chunk of pages
> > > instead of a single page at a time.
> > >
> > > Signed-off-by: Liang Li 
> > > ---
> > >  drivers/virtio/virtio_balloon.c     | 199 ++--
> > >  include/uapi/linux/virtio_balloon.h |   1 +
> > >  mm/page_alloc.c                     |   6 ++
> > >  3 files changed, 198 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index 7b6d74f..5330b6f 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -45,6 +45,8 @@ static int oom_pages =
> > OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > +extern unsigned long get_max_pfn(void);
> > > +
> > >  struct virtio_balloon {
> > >   struct virtio_device *vdev;
> > >   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > > @@ -62,6 +64,9 @@ struct virtio_balloon {
> > >
> > >   /* Number of balloon pages we've told the Host we're not using. */
> > >   unsigned int num_pages;
> > > + unsigned long *page_bitmap;
> > > + unsigned long start_pfn, end_pfn;
> > > + unsigned long bmap_len;
> > >   /*
> > >* The pages we've told the Host we're not using are enqueued
> > >* at vb_dev_info->pages list.
> > > @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
> > >   wake_up(&vb->acked);
> > >  }
> > >
> > > +static int balloon_page_bitmap_init(struct virtio_balloon *vb)
> > > +{
> > > + unsigned long max_pfn, bmap_bytes;
> > > +
> > > + max_pfn = get_max_pfn();
> >
> > This is racy. max_pfn could be increased by memory hotplug after you got it.
> >
> >
> > > + bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> > > + if (!vb->page_bitmap)
> > > + vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> >
> > Likely to fail for a huge busy guest.
> > Why not init on device probe?
> > this way
> > - probe will fail, or we can clear the feature bit
> > - free memory is more likely to be available
> >
> 
> Very good suggestion!
> 
> >
> > > + else {
> > > + if (bmap_bytes <= vb->bmap_len)
> > > + memset(vb->page_bitmap, 0, bmap_bytes);
> > > + else {
> > > + kfree(vb->page_bitmap);
> > > + vb->page_bitmap = kzalloc(bmap_bytes,
> > GFP_KERNEL);
> > > + }
> > > + }
> > > + if (!vb->page_bitmap) {
> > > + dev_err(&vb->vdev->dev, "%s failure: allocate page
> > bitmap\n",
> > > +  __func__);
> > > + return -ENOMEM;
> > > + }
> > > + vb->bmap_len = bmap_bytes;
> > > + vb->start_pfn = max_pfn;
> > > + vb->end_pfn = 0;
> > > +
> > > + return 0;
> > > +}
> > > +
> >
> > >  {
> > > - struct scatterlist sg;
> > >   unsigned int len;
> > >
> > > - sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > + if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > + u32 page_shift = PAGE_SHIFT;
> > > + unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > > + struct scatterlist sg[5];
> > > +
> > > + start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > > + end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > > +
> > > +   sg_init_table(sg, 5);
> > > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > > +(start_pfn / BITS_PER_LONG), bmap_len);

RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> On Fri, May 20, 2016 at 05:59:46PM +0800, Liang Li wrote:
> > The implementation of the current virtio-balloon is not very
> > efficient. Below is a test result of the time spent on inflating
> > the balloon to 3GB of a 4GB idle guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms)
> > c. address translation (6.1%, 96ms)
> > d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete.
> > The test shows that the bottleneck is stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spent on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and the madvise with a bulk of
> > pages, instead of the current page-per-page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the kernel side implementation which is intended to
> > speed up the inflating & deflating process by adding a new feature to
> > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> >
> > Signed-off-by: Liang Li 
> > ---
> >  drivers/virtio/virtio_balloon.c     | 199 ++--
> >  include/uapi/linux/virtio_balloon.h |   1 +
> >  mm/page_alloc.c                     |   6 ++
> >  3 files changed, 198 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c
> > b/drivers/virtio/virtio_balloon.c index 7b6d74f..5330b6f 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -45,6 +45,8 @@ static int oom_pages =
> OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> >  struct virtio_balloon {
> > struct virtio_device *vdev;
> > struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> > @@ -62,6 +64,9 @@ struct virtio_balloon {
> >
> > /* Number of balloon pages we've told the Host we're not using. */
> > unsigned int num_pages;
> > +   unsigned long *page_bitmap;
> > +   unsigned long start_pfn, end_pfn;
> > +   unsigned long bmap_len;
> > /*
> >  * The pages we've told the Host we're not using are enqueued
> >  * at vb_dev_info->pages list.
> > @@ -111,15 +116,66 @@ static void balloon_ack(struct virtqueue *vq)
> > wake_up(&vb->acked);
> >  }
> >
> > +static int balloon_page_bitmap_init(struct virtio_balloon *vb)
> > +{
> > +   unsigned long max_pfn, bmap_bytes;
> > +
> > +   max_pfn = get_max_pfn();
> 
> This is racy. max_pfn could be increased by memory hotplug after you got it.
> 
> 
> > +   bmap_bytes = ALIGN(max_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
> > +   if (!vb->page_bitmap)
> > +   vb->page_bitmap = kzalloc(bmap_bytes, GFP_KERNEL);
> 
> Likely to fail for a huge busy guest.
> Why not init on device probe?
> this way
>   - probe will fail, or we can clear the feature bit
>   - free memory is more likely to be available
> 

Very good suggestion!

> 
> > +   else {
> > +   if (bmap_bytes <= vb->bmap_len)
> > +   memset(vb->page_bitmap, 0, bmap_bytes);
> > +   else {
> > +   kfree(vb->page_bitmap);
> > +   vb->page_bitmap = kzalloc(bmap_bytes,
> GFP_KERNEL);
> > +   }
> > +   }
> > +   if (!vb->page_bitmap) {
> > +   dev_err(&vb->vdev->dev, "%s failure: allocate page
> bitmap\n",
> > +__func__);
> > +   return -ENOMEM;
> > +   }
> > +   vb->bmap_len = bmap_bytes;
> > +   vb->start_pfn = max_pfn;
> > +   vb->end_pfn = 0;
> > +
> > +   return 0;
> > +}
> > +
> 
> >  {
> > -   struct scatterlist sg;
> > unsigned int len;
> >
> > -   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   u32 page_shift = PAGE_SHIFT;
> > +   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > +   struct scatterlist sg[5];
> > +
> > +   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > +   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > +
> > +   sg_init_table(sg, 5);
> > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > +(start_pfn / BITS_PER_LONG), bmap_len);
> 
> This can be pre-initialized, correct?

Pre-initialized? I don't quite understand what you mean.

> 
> > 

RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> On 20/05/2016 11:59, Liang Li wrote:
> > +
> > +   sg_init_table(sg, 5);
> > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> 
> These four should probably be placed in a single struct and therefore a single
> sg entry.  It might even be faster to place it together with the bitmap, thus
> avoiding the use of indirect descriptors.
> 

Yes, thanks for your suggestion.
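
Something like this, perhaps (a sketch only; the struct name and exact
layout are hypothetical, not from the patch):

struct balloon_bmap_hdr {
        __le64 flags;
        __le64 start_pfn;
        __le32 page_shift;
        __le32 reserved;
        __le64 bmap_len;
};

/* One sg entry for the header, one for the bitmap payload. */
sg_init_table(sg, 2);
sg_set_buf(&sg[0], &hdr, sizeof(hdr));
sg_set_buf(&sg[1], vb->page_bitmap + (start_pfn / BITS_PER_LONG), bmap_len);
virtqueue_add_outbuf(vq, sg, 2, vb, GFP_KERNEL);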

> You should also test ballooning of a 64GB guest after filling in the page 
> cache,
> not just ballooning of a freshly booted 4GB guest.  This will give you a much
> more sparse bitmap.  Still, the improvement in sending PFNs to the host are

I will include the test result for that case in the next version.


Thanks,

Liang

> impressive.
> 
> Thanks,
> 
> Paolo
> 
> > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > +(start_pfn / BITS_PER_LONG), bmap_len);
> > +   virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);


RE: [PATCH RFC kernel] balloon: speed up inflating/deflating process

2016-05-24 Thread Li, Liang Z
> On Fri, 20 May 2016 17:59:46 +0800
> Liang Li  wrote:
> 
> > The implementation of the current virtio-balloon is not very
> > efficient. Below is a test result of the time spent on inflating
> > the balloon to 3GB of a 4GB idle guest:
> >
> > a. allocating pages (6.5%, 103ms)
> > b. sending PFNs to host (68.3%, 787ms)
> > c. address translation (6.1%, 96ms)
> > d. madvise (19%, 300ms)
> >
> > It takes about 1577ms for the whole inflating process to complete.
> > The test shows that the bottleneck is stage b and stage d.
> >
> > If using a bitmap to send the page info instead of the PFNs, we can
> > reduce the overhead spent on stage b quite a lot. Furthermore, it's
> > possible to do the address translation and the madvise with a bulk of
> > pages, instead of the current page-per-page way, so the overhead of
> > stage c and stage d can also be reduced a lot.
> >
> > This patch is the kernel side implementation which is intended to
> > speed up the inflating & deflating process by adding a new feature to
> > the virtio-balloon device. And now, inflating the balloon to 3GB of a
> > 4GB idle guest only takes 175ms, it's about 9 times as fast as before.
> >
> > TODO: optimize stage a by allocating/freeing a chunk of pages instead
> > of a single page at a time.
> 
> Not commenting on the approach, but...
> 
> >
> > Signed-off-by: Liang Li 
> > ---
> >  drivers/virtio/virtio_balloon.c     | 199 ++--
> >  include/uapi/linux/virtio_balloon.h |   1 +
> >  mm/page_alloc.c                     |   6 ++
> >  3 files changed, 198 insertions(+), 8 deletions(-)
> >
> 
> >  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >  {
> > -   struct scatterlist sg;
> > unsigned int len;
> >
> > -   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   u32 page_shift = PAGE_SHIFT;
> > +   unsigned long start_pfn, end_pfn, flags = 0, bmap_len;
> > +   struct scatterlist sg[5];
> > +
> > +   start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
> > +   end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
> > +   bmap_len = (end_pfn - start_pfn) / BITS_PER_LONG * sizeof(long);
> > +
> > +   sg_init_table(sg, 5);
> > +   sg_set_buf(&sg[0], &flags, sizeof(flags));
> > +   sg_set_buf(&sg[1], &start_pfn, sizeof(start_pfn));
> > +   sg_set_buf(&sg[2], &page_shift, sizeof(page_shift));
> > +   sg_set_buf(&sg[3], &bmap_len, sizeof(bmap_len));
> > +   sg_set_buf(&sg[4], vb->page_bitmap +
> > +(start_pfn / BITS_PER_LONG), bmap_len);
> > +   virtqueue_add_outbuf(vq, sg, 5, vb, GFP_KERNEL);
> > +
> 
> ...you need to take care of the endianness of the data you put on the queue,
> otherwise virtio-1 on big endian won't work. (There's just been a patch for
> that problem.)

OK, thanks for the reminder.
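
A sketch of what that fix could look like, using the standard virtio
byteorder helpers (the variable names here are illustrative):

__virtio64 sflags = cpu_to_virtio64(vb->vdev, flags);
__virtio64 spfn   = cpu_to_virtio64(vb->vdev, start_pfn);
__virtio32 sshift = cpu_to_virtio32(vb->vdev, page_shift);
__virtio64 slen   = cpu_to_virtio64(vb->vdev, bmap_len);

sg_set_buf(&sg[0], &sflags, sizeof(sflags));
sg_set_buf(&sg[1], &spfn, sizeof(spfn));
sg_set_buf(&sg[2], &sshift, sizeof(sshift));
sg_set_buf(&sg[3], &slen, sizeof(slen));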

Liang



RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Li, Liang Z
> > > > > >   I'm just catching back up on this thread; so without
> > > > > > reference to any particular previous mail in the thread.
> > > > > >
> > > > > >   1) How many of the free pages do we tell the host about?
> > > > > >  Your main change is telling the host about all the
> > > > > >  free pages.
> > > > >
> > > > > Yes, all the guest's free pages.
> > > > >
> > > > > >  If we tell the host about all the free pages, then we might
> > > > > >  end up needing to allocate more pages and update the host
> > > > > >  with pages we now want to use; that would have to wait for the
> > > > > >  host to acknowledge that use of these pages, since if we don't
> > > > > >  wait for it then it might have skipped migrating a page we
> > > > > >  just started using (I don't understand how your series solves 
> > > > > > that).
> > > > > >  So the guest probably needs to keep some free pages - how
> many?
> > > > >
> > > > > Actually, there is no need to care about whether the free pages
> > > > > will be
> > > used by the host.
> > > > > We only care about some of the free pages we get reused by the
> > > > > guest,
> > > right?
> > > > >
> > > > > The dirty page logging can be used to solve this, starting the
> > > > > dirty page logging before getting the free page information from the guest.
> > > > > Even some of the free pages are modified by the guest during the
> > > > > process of getting the free pages information, these modified
> > > > > pages will
> > > be traced by the dirty page logging mechanism. So in the following
> > > migration_bitmap_sync() function.
> > > > > The pages in the free pages bitmap, but latter was modified,
> > > > > will be reset to dirty. We won't omit any dirtied pages.
> > > > >
> > > > > So, guest doesn't need to keep any free pages.
> > > >
> > > > OK, yes, that works; so we do:
> > > >   * enable dirty logging
> > > >   * ask guest for free pages
> > > >   * initialise the migration bitmap as everything-free
> > > >   * then later we do the normal sync-dirty bitmap stuff and it all just
> works.
> > > >
> > > > That's nice and simple.
> > >
> > > This works once, sure. But the issue is that you have to defer
> > > migration until you get the free page list, and this only works
> > > once. So you end up with heuristics about how long to wait.
> > >
> > > Instead I propose:
> > >
> > > - mark all pages dirty as we do now.
> > >
> > > - at start of migration, start tracking dirty
> > >   pages in kvm, and tell guest to start tracking free pages
> > >
> > > we can now introduce any kind of delay, for example wait for ack
> > > from guest, or do whatever else, or even just start migrating pages
> > >
> > > - repeatedly:
> > >   - get list of free pages from guest
> > >   - clear them in migration bitmap
> > >   - get dirty list from kvm
> > >
> > > - at end of migration, stop tracking writes in kvm,
> > >   and tell guest to stop tracking free pages
> >
> > I had thought of filtering out the free pages in each migration bitmap
> synchronization.
> > The advantage is we can skip process as many free pages as possible. Not
> just once.
> > The disadvantage is that we should change the current memory
> > management code to track the free pages, instead of traversing the free
> page list to construct the free pages bitmap, to reduce the overhead to get
> the free pages bitmap.
> > I am not sure the if the Kernel people would like it.
> >
> > If keeping the traversing mechanism, because of the overhead, maybe it's
> not worth to filter out the free pages repeatedly.
> 
> Well, Michael's idea of not waiting for the dirty bitmap to be filled does 
> make
> that idea of constnatly using the free-bitmap better.
> 

Not waiting is a good idea.
Actually, we could shorten the waiting time by pre-allocating the free page 
bitmap and updating it when the guest allocates/frees pages. It requires 
modifying the mm related code. I don't know whether the kernel people would 
like this.
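
A very rough sketch of that idea (the hook names below are hypothetical;
nothing like this exists in mm today):

/* One bit per pfn, kept current by the page allocator. */
static unsigned long *free_page_bmap;

/* Called when a page leaves the free lists. */
static inline void note_pfn_allocated(unsigned long pfn)
{
        clear_bit(pfn, free_page_bmap);
}

/* Called when a page returns to the free lists. */
static inline void note_pfn_freed(unsigned long pfn)
{
        set_bit(pfn, free_page_bmap);
}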

> In that case, is it easier if something (guest/host?) allocates some memory in
> the guests physical RAM space and just points the host to it, rather than
> having an explicit 'send'.
> 

Good idea too.

Liang
> Dave


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Li, Liang Z
> On Mon, Mar 14, 2016 at 05:03:34PM +, Dr. David Alan Gilbert wrote:
> > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > >
> > > > Hi,
> > > >   I'm just catching back up on this thread; so without reference
> > > > to any particular previous mail in the thread.
> > > >
> > > >   1) How many of the free pages do we tell the host about?
> > > >  Your main change is telling the host about all the
> > > >  free pages.
> > >
> > > Yes, all the guest's free pages.
> > >
> > > >  If we tell the host about all the free pages, then we might
> > > >  end up needing to allocate more pages and update the host
> > > >  with pages we now want to use; that would have to wait for the
> > > >  host to acknowledge that use of these pages, since if we don't
> > > >  wait for it then it might have skipped migrating a page we
> > > >  just started using (I don't understand how your series solves 
> > > > that).
> > > >  So the guest probably needs to keep some free pages - how many?
> > >
> > > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > > We only care about some of the free pages we get reused by the guest,
> right?
> > >
> > > The dirty page logging can be used to solve this, starting the dirty
> > > page logging before getting the free page information from the guest.
> > > Even some of the free pages are modified by the guest during the
> > > process of getting the free pages information, these modified pages will
> be traced by the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > > The pages in the free pages bitmap, but latter was modified, will be
> > > reset to dirty. We won't omit any dirtied pages.
> > >
> > > So, guest doesn't need to keep any free pages.
> >
> > OK, yes, that works; so we do:
> >   * enable dirty logging
> >   * ask guest for free pages
> >   * initialise the migration bitmap as everything-free
> >   * then later we do the normal sync-dirty bitmap stuff and it all just 
> > works.
> >
> > That's nice and simple.
> 
> This works once, sure. But the issue is that you have to defer migration
> until you get the free page list, and this only works once. So you end up
> with heuristics about how long to wait.
> 
> Instead I propose:
> 
> - mark all pages dirty as we do now.
> 
> - at start of migration, start tracking dirty
>   pages in kvm, and tell guest to start tracking free pages
> 
> we can now introduce any kind of delay, for example wait for ack from guest,
> or do whatever else, or even just start migrating pages
> 
> - repeatedly:
>   - get list of free pages from guest
>   - clear them in migration bitmap
>   - get dirty list from kvm
> 
> - at end of migration, stop tracking writes in kvm,
>   and tell guest to stop tracking free pages

I had thought of filtering out the free pages in each migration bitmap 
synchronization. The advantage is that we can skip as many free pages as 
possible, not just once. The disadvantage is that, to reduce the overhead of 
getting the free page bitmap, we would have to change the current memory 
management code to track the free pages, instead of traversing the free page 
list to construct the bitmap. I am not sure the kernel people would like that.

If we keep the traversing mechanism, then because of the overhead, it is 
probably not worth filtering out the free pages repeatedly.
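
If we did go that way, the per-sync filtering itself would be cheap; a
QEMU-side sketch (the function name is hypothetical):

/* Clear guest-reported free pages from the migration bitmap, so they
 * are neither scanned nor sent in this round. */
static void filter_guest_free_pages(unsigned long *migration_bitmap,
                                    const unsigned long *free_page_bitmap,
                                    long nbits)
{
        bitmap_andnot(migration_bitmap, migration_bitmap,
                      free_page_bitmap, nbits);
}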

Liang






RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Li, Liang Z
> 
> Hi,
>   I'm just catching back up on this thread; so without reference to any
> particular previous mail in the thread.
> 
>   1) How many of the free pages do we tell the host about?
>  Your main change is telling the host about all the
>  free pages.

Yes, all the guest's free pages.

>  If we tell the host about all the free pages, then we might
>  end up needing to allocate more pages and update the host
>  with pages we now want to use; that would have to wait for the
>  host to acknowledge that use of these pages, since if we don't
>  wait for it then it might have skipped migrating a page we
>  just started using (I don't understand how your series solves that).
>  So the guest probably needs to keep some free pages - how many?

Actually, there is no need to care about whether the free pages will be used 
by the host.
We only care about the free pages that get reused by the guest, right?

The dirty page logging can be used to solve this: start the dirty page 
logging before getting the free page information from the guest. Even if 
some of the free pages are modified by the guest during the process of 
getting the free page information, these modified pages will be traced by 
the dirty page logging mechanism. So, in the following 
migration_bitmap_sync() function, the pages that are in the free page 
bitmap but were modified later will be reset to dirty. We won't omit any 
dirtied pages.

So, the guest doesn't need to keep any free pages.
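
The ordering on the QEMU side would roughly be (a sketch;
request_guest_free_pages() is a hypothetical name):

memory_global_dirty_log_start();        /* 1. enable dirty logging first    */
request_guest_free_pages(balloon);      /* 2. then ask guest for the bitmap */
bitmap_fill(migration_bitmap, nbits);   /* 3. start from everything-dirty   */
bitmap_andnot(migration_bitmap, migration_bitmap,
              free_page_bitmap, nbits); /* 4. drop the reported free pages  */
/* Later, migration_bitmap_sync() re-dirties any free page the guest
 * reused in the meantime, so nothing is lost. */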

>   2) Clearing out caches
>  Does it make sense to clean caches?  They're apparently useful data
>  so if we clean them it's likely to slow the guest down; I guess
>  they're also likely to be fairly static data - so at least fairly
>  easy to migrate.
>  The answer here partially depends on what you want from your migration;
>  if you're after the fastest possible migration time it might make
>  sense to clean the caches and avoid migrating them; but that might
>  be at the cost of more disruption to the guest - there's a trade off
>  somewhere and it's not clear to me how you set that depending on your
>  guest/network/reqirements.
> 

Yes, cleaning the caches is an option. Let the users decide whether to use it or not.

>   3) Why is ballooning slow?
>  You've got a figure of 5s to balloon on an 8GB VM - but an
>  8GB VM isn't huge; so I worry about how long it would take
>  on a big VM.   We need to understand why it's slow
>* is it due to the guest shuffling pages around?
>* is it due to the virtio-balloon protocol sending one page
>  at a time?
>  + Do balloon pages normally clump in physical memory
> - i.e. would a 'large balloon' message help
> - or do we need a bitmap because it tends not to clump?
> 

I didn't do a comprehensive test. But I found most of the time is spent
on allocating the pages and sending the PFNs to the host; I don't know which
is the more time consuming operation, allocating the pages or sending the PFNs.

>* is it due to the madvise on the host?
>  If we were using the normal balloon messages, then we
>  could, during migration, just route those to the migration
>  code rather than bothering with the madvise.
>  If they're clumping together we could just turn that into
>  one big madvise; if they're not then would we benefit from
>  a call that lets us madvise lots of areas?
> 

My test showed madvise() is not the main reason for the long time; it takes 
only 10% of the total balloon inflating operation time.
A big madvise can more or less improve the performance.
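
For illustration, a "big madvise" on the host side would look roughly like
this (host_addr is the userspace mapping of a contiguous run of ballooned
guest pages; illustrative only, needs <sys/mman.h>):

/* One madvise() call over the whole run instead of one per 4KB page. */
size_t len = nr_pages * 4096;   /* assuming 4KB pages */
if (madvise(host_addr, len, MADV_DONTNEED) < 0)
        perror("madvise");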

>   4) Speeding up the migration of those free pages
> You're using the bitmap to avoid migrating those free pages; HPe's
> patchset is reconstructing a bitmap from the balloon data;  OK, so
> this all makes sense to avoid migrating them - I'd also been thinking
> of using pagemap to spot zero pages that would help find other zero'd
> pages, but perhaps ballooned is enough?
> 
Could you describe your idea in more detail?

>   5) Second-migrate
> Given a VM where you've done all those tricks on, what happens when
> you migrate it a second time?   I guess you're aiming for the guest
> to update it's bitmap;  HPe's solution is to migrate it's balloon
> bitmap along with the migration data.

Nothing is special in the second migration. QEMU will request the free page 
information from the guest, and the guest will traverse its current free 
page list to construct a new free page bitmap and send it to QEMU, just 
like in the first migration.

Liang
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Li, Liang Z
> >  Could you provide more information on how to use virtio-serial to
> > exchange data? A thread, wiki, or code are all OK.
> >  I have not found any useful information yet.
> 
> See this commit in the Linux sources:
> 
> 108fc82596e3b66b819df9d28c1ebbc9ab5de14c
> 
> that adds a way to send guest trace data over to the host.  I think that's the
> most relevant to your use-case.  However, you'll have to add an in-kernel
> user of virtio-serial (like the virtio-console code
> -- the code that deals with tty and hvc currently).  There's no other non-tty
> user right now, and this is the right kind of use-case to add one for!
> 
> For many other (userspace) use-cases, see the qemu-guest-agent in the
> qemu sources.
> 
> The API is documented in the wiki:
> 
> http://www.linux-kvm.org/page/Virtio-serial_API
> 
> and the feature pages have some information that may help as well:
> 
> https://fedoraproject.org/wiki/Features/VirtioSerial
> 
> There are some links in here too:
> 
> http://log.amitshah.net/2010/09/communication-between-guests-and-
> hosts/
> 
> Hope this helps.
> 
> 
>   Amit

Thanks a lot !!

Liang


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we 
> don't
> end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange 
> free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
>   Amit

Hi Amit,

 Could you provide more information on how to use virtio-serial to exchange 
data? A thread, wiki, or code are all OK. 
 I have not found any useful information yet.

Thanks
Liang



RE: [Qemu-devel] [RFC kernel 0/2]A PV solution for KVM live migration optimization

2016-03-09 Thread Li, Liang Z
> On 3/8/2016 4:44 PM, Amit Shah wrote:
> > On (Fri) 04 Mar 2016 [15:02:47], Jitendra Kolhe wrote:
> 
>  * Liang Li (liang.z...@intel.com) wrote:
> > The current QEMU live migration implementation marks all the
> > guest's RAM pages as dirtied in the ram bulk stage; all these
> > pages will be processed, and that takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, it doesn't care about the content in
> > free pages. We can make use of this fact and skip processing the
> > free pages in the ram bulk stage; it can save a lot of CPU cycles and
> > reduce the network traffic significantly while speeding up the live
> > migration process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use
> > it to filter out the guest's free pages in the ram bulk stage.
> > This make the live migration process much more efficient.
> 
>  Hi,
>    An interesting solution; I know a few different people have been
>  looking at how to speed up ballooned VM migration.
> 
> >>>
> >>> Ooh, different solutions for the same purpose, and both based on the
> balloon.
> >>
> >> We were also tying to address similar problem, without actually
> >> needing to modify the guest driver. Please find patch details under mail
> with subject.
> >> migration: skip sending ram pages released by virtio-balloon driver
> >
> > The scope of this patch series seems to be wider: don't send free
> > pages to a dest at all, vs. don't send pages that are ballooned out.
> >
> > Amit
> 
> Hi,
> 
> Thanks for your response. The scope of this patch series doesn’t seem to
> take care of ballooned out pages. To balloon out a guest ram page the guest
> balloon driver does a alloc_page() and then return the guest pfn to Qemu, so
> ballooned out pages will not be seen as free ram pages by the guest.
> Thus we will still end up scanning (for zero page) for ballooned out pages
> during migration. It would be ideal if we could have both solutions.
> 

Agreed. For users who care about performance, just skip the free pages.
For users who have already turned on virtio-balloon, your solution can take 
effect.

Liang
> Thanks,
> - Jitendra

RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-09 Thread Li, Liang Z
> > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > The problem is the poor performance, this PV solution
> > > >
> > > > Balloon is always PV. And do not call patches solutions please.
> > > >
> > > > > is aimed to make it more
> > > > > efficient and reduce the performance impact on guest.
> > > >
> > > > We need to get a bit beyond this.  You are making multiple
> > > > changes, it seems to make sense to split it all up, and analyse
> > > > each change separately.
> > >
> > > Couldn't agree more.
> > >
> > > There are three stages in this optimization:
> > >
> > > 1) choosing which pages to skip
> > >
> > > 2) communicating them from guest to host
> > >
> > > 3) skip transferring uninteresting pages to the remote side on
> > > migration
> > >
> > > For (3) there seems to be a low-hanging fruit to amend
> > > migration/ram.c:iz_zero_range() to consult /proc/self/pagemap.  This
> > > would work for guest RAM that hasn't been touched yet or which has
> > > been ballooned out.
> > >
> > > For (1) I've been trying to make a point that skipping clean pages
> > > is much more likely to result in noticable benefit than free pages only.
> > >
> >
> > I am considering to drop the pagecache before getting the free pages.
> >
> > > As for (2), we do seem to have a problem with the existing balloon:
> > > according to your measurements it's very slow; besides, I guess it
> > > plays badly
> >
> > I didn't say communicating is slow. Even if it is very slow, my
> > solution uses a bitmap instead of PFNs; there is less data traffic, so
> > it's faster than the existing balloon which uses PFNs.
> 
> By how much?
> 

Haven't measured yet. 
To identify a page, 1 bit is needed if using a bitmap, while 4 bytes (32 
bits) are needed if using a PFN.

For a guest with 8GB RAM, the corresponding free page bitmap size is 256KB,
and the corresponding total PFN size is 8192KB. Assuming the inflating size
is 7GB, the total PFN size is 7168KB.
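
(Checking the arithmetic: 8GB / 4KB = 2,097,152 pages. As a bitmap that is
2,097,152 bits / 8 = 262,144 bytes = 256KB; as PFNs it is 2,097,152 * 4
bytes = 8192KB, and 7/8 of that is 7168KB.)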

Maybe this is not the point.

Liang

> > > with transparent huge pages (as both the guest and the host work
> > > with one 4k page at a time).  This is a problem for other use cases
> > > of balloon (e.g. as a facility for resource management); tackling
> > > that appears a more natural application for optimization efforts.
> > >
> > > Thanks,
> > > Roman.


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> >
> >> > The memory usage will keep increasing due to ever growing caches,
> >> > etc, so you'll be left with very little free memory fairly soon.
> >> >
> > I don't think so.
> >
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical) machine
> which was booted about 30 minutes ago, and which is running disk-heavy
> workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not have any
> free memory.
> 
> Paolo

I have a VM which has 2GB of RAM; when the guest booted, there were about 
1.4GB of free pages.
Then I tried to download a large file from the internet with the browser; 
after the download finished, there were only 72MB of free pages left. As 
Roman pointed out, there was quite a lot of cached memory.
Then I tried to compile QEMU; after the compilation finished, there were 
about 1.3GB of free pages.

So even if the cache grows to a large amount, it will be freed if there are 
some other specific workloads. The cache memory is a big issue that should 
be taken into consideration.
How about reclaiming some cache before getting the free page information?
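
For example, the guest could drop clean caches first - the equivalent of
"echo 3 > /proc/sys/vm/drop_caches". A sketch in guest userspace (doing it
from the balloon driver would need the corresponding kernel interface):

#include <stdio.h>

static int drop_caches(void)
{
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

        if (!f)
                return -1;
        fputs("3\n", f);    /* 3 = drop page cache + reclaimable slab */
        fclose(f);
        return 0;
}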

Liang 


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 03:13:03PM +0000, Li, Liang Z wrote:
> > > > Maybe I am not clear enough.
> > > >
> > > > I mean if we inflate balloon before live migration, for a 8GB
> > > > guest, it takes
> > > about 5 Seconds for the inflating operation to finish.
> > >
> > > And these 5 seconds are spent where?
> > >
> >
> > The time is spent on allocating the pages and sending the allocated
> > pages' PFNs to QEMU through virtio.
> 
> What if we skip allocating pages but use the existing interface to send pfns 
> to
> QEMU?
>

I think it would be much faster; allocating the pages is the main reason the
operation takes so long.
An experiment is needed to get the exact time spent on sending the PFNs.

Liang


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-08 Thread Li, Liang Z
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
> 
> On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> > The current QEMU live migration implementation marks all the
> > guest's RAM pages as dirtied in the ram bulk stage; all these pages
> > will be processed, and that takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, it doesn't care about the content of
> > free pages. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage; this saves a lot of CPU cycles and reduces
> > the network traffic significantly while speeding up the live migration
> > process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This makes
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take post-copy and RDMA into
> > consideration; maybe both of them can benefit from this PV solution
> > with some extra modifications.
> 
> I like the idea, just have to prove (review) and test it a lot to ensure we 
> don't
> end up skipping pages that matter.
> 
> However, there are a couple of points:
> 
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon.  First, there's nothing related to the balloon here.
> It just happens to be memory info.  Second, I would never enable balloon in
> a guest that I want to be performance-sensitive.  So even if you add this as
> part of balloon, you'll find no one is using this solution.
> 
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange 
> free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec).  You can see how spice, vnc, etc., use virtio-serial to exchange data.
> 
> 
>   Amit

I don't like using virtio-balloon for this either, and it's confusing.
It would be great if virtio-serial could be used; I will take a look at it.

Thanks for your suggestion!

Liang


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-07 Thread Li, Liang Z
> Cc: Roman Kagan; Dr. David Alan Gilbert; ehabk...@redhat.com;
> k...@vger.kernel.org; quint...@redhat.com; linux-ker...@vger.kernel.org;
> qemu-de...@nongnu.org; linux...@kvack.org; amit.s...@redhat.com;
> pbonz...@redhat.com; a...@linux-foundation.org;
> virtualization@lists.linux-foundation.org; r...@twiddle.net; r...@redhat.com
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
> 
> On Mon, Mar 07, 2016 at 06:49:19AM +, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live
> > > > migration code is
> > > in migration/ram.c.
> > >
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we
> can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > >
> >
> > Yes, we really can teach qemu to skip these pages and it's not hard.
> > The problem is the poor performance; this PV solution
> 
> Balloon is always PV. And do not call patches solutions please.
> 

OK.
  
> > is aimed to make it more
> > efficient and reduce the performance impact on the guest.
> 
> We need to get a bit beyond this.  You are making multiple changes, it seems
> to make sense to split it all up, and analyse each change separately.  If you
> don't this patchset will be stuck: as you have seen people aren't convinced it
> actually helps with real workloads.
> 
Indeed, changing the virtio spec requires good justification.

> > > > >
> > > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > > migration' is simple,
> > > > > > > nothing more.
> > > > > > >
> > > > > > > That's a big advantage.  Another one is that it does
> > > > > > > something useful in real- world scenarios.
> > > > > > >
> > > > > >
> > > > > > I don't think the heavy performance impact is something
> > > > > > useful in real-world scenarios.
> > > > > >
> > > > > > Liang
> > > > > > > Roman.
> > > > >
> > > > > So fix the performance then. You will have to try harder if you
> > > > > want to convince people that the performance is due to bad
> > > > > host/guest interface, and so we have to change *that*.
> > > > >
> > > >
> > > > Actually, the PV solution is unrelated to the balloon
> > > > mechanism; I just use it to transfer information between host and
> > > > guest.
> > > > I am not sure if I should implement a new virtio device, and I
> > > > want to get the answer from the community.
> > > > In this RFC patch, to make things simple, I choose to extend the
> > > > virtio-balloon and use the extended interface to transfer the
> > > > request and
> > > free_page_bimap content.
> > > >
> > > > I am not intend to change the current virtio-balloon implementation.
> > > >
> > > > Liang
> > >
> > > And the answer would depend on the answer to my question above.
> > > Does balloon need an interface passing page bitmaps around?
> >
> > Yes, I need a new interface.
> 
> Possibly, but you will need to justify this at some level if you care about
> upstreaming your patches.
> 
> > > Does this speed up any operations?
> >
> > No, a new interface will not speed up anything, but it is the easiest way to
> solve the compatibility issue.
> 
> A bunch of new code is often easier to write than to figure out the old one,
> but if we keep piling it up we'll end up with an unmaintainable mess. So we
> are rather careful about adding new interfaces, and we try to make them
> generic sometimes even at cost of slight inefficiencies.
> 
> > > OTOH what if you use the regular balloon interface with your patches?
> > >
> >
> > The regular balloon interfaces have their specific functions, and I
> > can't use them in my patches.
> > If I used those regular interfaces, I would have to make a lot of
> > changes to preserve compatibility.
> 
> Why can't you?
> 
> What exactly do we need to change?
> 
> If we put things in terms of the balloon, that supports adding and removing
> pages.
> 
> Using these terms, let's enumerate:
> - a new method (e.g. new virtqueue) that adds and immediately removes
> page in a balloon
>   clearly,

RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-06 Thread Li, Liang Z
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> >
> > >> > The memory usage will keep increasing due to ever growing caches,
> > >> > etc, so you'll be left with very little free memory fairly soon.
> > >> >
> > > I don't think so.
> > >
> >
> > Roman is right.  For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> >
> > Since I have started writing this email (2 minutes?), the amount of
> > free memory has already gone down from 37 GB to 33 GB.  I expect that
> > by the time I have finished running the workload, in two hours, it
> > will not have any free memory.
> 
> But what about a VM sitting idle, or one that just has more RAM assigned
> to it than it is currently using?
>  I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96536        1146       44834         184       50555       94735
> 
> I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
> it's got a big chunk of cache as well.
> 
> Dave
> 
> >
> > Paolo

I begin to see Roman's point. The PV solution can't handle the cache memory,
while inflating the balloon could.
On the other hand, inflating the balloon just to skip the cache memory is no
good for the guest's performance.

How much free memory the guest has depends on the workload in the VM and on
how long the VM has already run before live migration. Even if the memory
usage keeps increasing due to ever-growing caches, we don't know when the
live migration will happen, so assuming there are no or very few free pages
in the guest is not quite right.

The advantage of the PV solution is the smaller performance impact compared
with inflating the balloon.

Liang





RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > > > > > Only detecting the unmapped/zero-mapped pages is not enough.
> > > > > > Consider a situation like case 2; it can't achieve the same result.
> > > > >
> > > > > Your case 2 doesn't exist in the real world.  If people could
> > > > > stop their main memory consumer in the guest prior to migration
> > > > > they wouldn't need live migration at all.
> > > >
> > > > The case 2 is just a simplified scenario, not a real case.
> > > > As long as the guest's memory usage does not keep increasing, or
> > > > does not always run out, it is covered by case 2.
> > >
> > > The memory usage will keep increasing due to ever growing caches,
> > > etc, so you'll be left with very little free memory fairly soon.
> > >
> >
> > I don't think so.
> 
> Here's my laptop:
> KiB Mem : 16048560 total,  8574956 free,  3360532 used,  4113072 buff/cache
> 
> But here's a server:
> KiB Mem:  32892768 total, 20092812 used, 12799956 free,   368704 buffers
> 
> What is the difference? A ton of tiny daemons not doing anything, staying
> resident in memory.
> 
> > > > > I tend to think you can safely assume there's no free memory in
> > > > > the guest, so there's little point optimizing for it.
> > > >
> > > > If this is true, we should not inflate the balloon either.
> > >
> > > We certainly should if there's "available" memory, i.e. not free but
> > > cheap to reclaim.
> > >
> >
> > What do you mean by "available" memory? If the pages are not free, I
> > don't think reclaiming them is cheap.
> 
> clean pages are cheap to drop as they don't have to be written.
> Whether they will ever be used again is another matter.
> 
> > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > that's made up, in particular, by the ballon, and consider
> > > > > inflating the balloon right before migration unless you already
> > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > global resource manager
> > > optimizing the VM density).
> > > > >
> > > >
> > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > take the performance impact into consideration?
> > > > For an 8G guest, it takes about 5s to inflate the balloon, but it
> > > > only takes 20ms to traverse the free_list and construct the free
> > > > pages bitmap.
> > >
> > > I don't have any feeling of how important the difference is.  And if
> > > the limiting factor for balloon inflation speed is the granularity
> > > of communication it may be worth optimizing that, because quick
> > > balloon reaction may be important in certain resource management
> scenarios.
> > >
> > > > By inflating the balloon, all the guest's pages will still be
> > > > processed (zero-page checking).
> > >
> > > Not sure what you mean.  If you describe the current state of
> > > affairs that's exactly the suggested optimization point: skip unmapped
> pages.
> > >
> >
> > You'd better check the live migration code.
> 
> What's there to check in migration code?
> Here's the extent of what balloon does on output:
> 
> 
> while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
>     ram_addr_t pa;
>     ram_addr_t addr;
>     int p = virtio_ldl_p(vdev, &pfn);
>
>     pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
>     offset += 4;
>
>     /* FIXME: remove get_system_memory(), but how? */
>     section = memory_region_find(get_system_memory(), pa, 1);
>     if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
>         continue;
>
>     trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                        pa);
>     /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>        should be OK because we only want a single page.  */
>     addr = section.offset_within_region;
>     balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
>                  !!(vq == s->dvq));
>     memory_region_unref(section.mr);
> }
> 
> so all that happens when we get a page is balloon_page.
> and
> 
> static void balloon_page(void *addr, int deflate)
> {
> #if defined(__linux__)
>     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
>                                          kvm_has_sync_mmu())) {
>         qemu_madvise(addr, TARGET_PAGE_SIZE,
>                      deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>     }
> #endif
> }
> 
> 
> Do you see anything that tracks pages to help migration skip the ballooned
> memory? I don't.
> 

No, and that's exactly what I mean: the ballooned memory is still processed
during live migration without skipping. The live migration code is in
migration/ram.c.

> 
> > > > The only advantage of ' inflating the balloon before live
> > > > migration' is simple,
> > > nothing more.
> > >
> > > That's a big advantage.  Another one is that it does something
> > > useful in real- world 

RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > Maybe I was not clear enough.
> >
> > I mean that if we inflate the balloon before live migration, for an 8GB
> > guest it takes about 5 seconds for the inflating operation to finish.
> 
> And these 5 seconds are spent where?
> 

The time is spent on allocating the pages and sending the allocated pages'
PFNs to QEMU through virtio.

> > For the PV solution, there is no need to inflate the balloon before live
> > migration; the only cost is traversing the free_list to construct
> > the free pages bitmap, which takes about 20ms for an 8GB idle guest
> > (less if there are fewer free pages). Passing the free pages info to
> > the host takes about an extra 3ms.
> >
> >
> > Liang
> 
> So now let's please stop talking about solutions at a high level and discuss 
> the
> interface changes you make in detail.
> What makes it faster? Better host/guest interface? No need to go through
> buddy allocator within guest? Less interrupts? Something else?
> 

I assume you are familiar with the current virtio-balloon and how it works.
The new interface is very simple: QEMU sends a request to the virtio-balloon
driver, the driver traverses '&zone->free_area[order].free_list[t]' to
construct a 'free_page_bitmap', and then sends the content of
'free_page_bitmap' back to QEMU. That is all the new interface does; there
are no 'alloc_page'-related operations involved, so it's faster.


Some code snippet:
--
+static void mark_free_pages_bitmap(struct zone *zone,
+		unsigned long *free_page_bitmap, unsigned long pfn_gap)
+{
+	unsigned long pfn, flags, i;
+	unsigned int order, t;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			for (i = 0; i < (1UL << order); i++) {
+				if ((pfn + i) >= PFN_4G)
+					set_bit_le(pfn + i - pfn_gap,
+						   free_page_bitmap);
+				else
+					set_bit_le(pfn + i, free_page_bitmap);
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}

Sorry for my poor English; if the description is still unclear, you could
glance at the patch, which is about 400 lines in total.
> 
> > > --
> > > MST


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
> 
> On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > > On Fri, Mar 04, 2016 at 01:52:53AM +, Li, Liang Z wrote:
> > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > same result?
> > > >
> > > > Only detecting the unmapped/zero-mapped pages is not enough.
> > > > Consider a situation like case 2; it can't achieve the same result.
> > >
> > > Your case 2 doesn't exist in the real world.  If people could stop
> > > their main memory consumer in the guest prior to migration they
> > > wouldn't need live migration at all.
> >
> > The case 2 is just a simplified scenario, not a real case.
> > As long as the guest's memory usage does not keep increasing, or does
> > not always run out, it is covered by case 2.
> 
> The memory usage will keep increasing due to ever growing caches, etc, so
> you'll be left with very little free memory fairly soon.
> 

I don't think so.

> > > I tend to think you can safely assume there's no free memory in the
> > > guest, so there's little point optimizing for it.
> >
> > If this is true, we should not inflate the balloon either.
> 
> We certainly should if there's "available" memory, i.e. not free but cheap to
> reclaim.
> 

What do you mean by "available" memory? If the pages are not free, I don't
think reclaiming them is cheap.

> > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > that's made up, in particular, by the ballon, and consider inflating
> > > the balloon right before migration unless you already maintain it at
> > > the optimal size for other reasons (like e.g. a global resource manager
> optimizing the VM density).
> > >
> >
> > Yes, I believe the current balloon works and it's simple. Do you take
> > the performance impact into consideration?
> > For an 8G guest, it takes about 5s to inflate the balloon, but it
> > only takes 20ms to traverse the free_list and construct the free pages
> > bitmap.
> 
> I don't have any feeling of how important the difference is.  And if the
> limiting factor for balloon inflation speed is the granularity of 
> communication
> it may be worth optimizing that, because quick balloon reaction may be
> important in certain resource management scenarios.
> 
> > By inflating the balloon, all the guest's pages will still be processed
> > (zero-page checking).
> 
> Not sure what you mean.  If you describe the current state of affairs that's
> exactly the suggested optimization point: skip unmapped pages.
> 

You'd better check the live migration code.

> > The only advantage of 'inflating the balloon before live migration' is
> > simplicity, nothing more.
> 
> That's a big advantage.  Another one is that it does something useful in real-
> world scenarios.
> 

I don't think such a heavy performance impact is acceptable in real-world
scenarios.

Liang
> Roman.


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 09:12:12AM +0000, Li, Liang Z wrote:
> > > Although I wonder which is cheaper; that would be fairly expensive
> > > for the guest wouldn't it? And you'd somehow have to kick the guest
> > > before migration to do the ballooning - and how long would you wait for
> it to finish?
> >
> > About 5 seconds for an 8G guest ballooning down to 1G. Getting the free
> > pages bitmap takes about 20ms for an 8G idle guest.
> >
> > Liang
> 
> Where is the time spent though? allocating within guest?
> Or passing the info to host?
> If the former, we can use existing inflate/deflate vqs:
> Have guest put each free page on inflate vq, then on deflate vq.
> 

Maybe I was not clear enough.

I mean that if we inflate the balloon before live migration, for an 8GB guest
it takes about 5 seconds for the inflating operation to finish.

For the PV solution, there is no need to inflate the balloon before live
migration; the only cost is traversing the free_list to construct the free
pages bitmap, which takes about 20ms for an 8GB idle guest (less if there are
fewer free pages). Passing the free pages info to the host takes about an
extra 3ms.


Liang
> --
> MST


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> > > * Liang Li (liang.z...@intel.com) wrote:
> > > > The current QEMU live migration implementation marks all the
> > > > guest's RAM pages as dirtied in the ram bulk stage; all these
> > > > pages will be processed, and that takes quite a lot of CPU cycles.
> > > >
> > > > From the guest's point of view, it doesn't care about the content
> > > > of free pages. We can make use of this fact and skip processing the
> > > > free pages in the ram bulk stage; this saves a lot of CPU cycles and
> > > > reduces the network traffic significantly while speeding up the live
> > > > migration process.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use
> > > > it to filter out the guest's free pages in the ram bulk stage.
> > > > This makes the live migration process much more efficient.
> > >
> > > Hi,
> > >   An interesting solution; I know a few different people have been
> > > looking at how to speed up ballooned VM migration.
> > >
> >
> > Ooh, different solutions for the same purpose, and both based on the
> balloon.
> 
> We were also trying to address a similar problem, without actually needing
> to modify the guest driver. Please find patch details under the mail with
> the subject:
> migration: skip sending ram pages released by virtio-balloon driver
> 
> Thanks,
> - Jitendra
> 

Great! Thanks for your information.

Liang
> >
> > >   I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> > >


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> * Roman Kagan (rka...@virtuozzo.com) wrote:
> > On Fri, Mar 04, 2016 at 08:23:09AM +0000, Li, Liang Z wrote:
> > > > On Thu, Mar 03, 2016 at 05:46:15PM +, Dr. David Alan Gilbert wrote:
> > > > > * Liang Li (liang.z...@intel.com) wrote:
> > > > > > The current QEMU live migration implementation marks all
> > > > > > the guest's RAM pages as dirtied in the ram bulk stage; all
> > > > > > these pages will be processed, and that takes quite a lot of CPU
> > > > > > cycles.
> > > > > >
> > > > > > From the guest's point of view, it doesn't care about the
> > > > > > content of free pages. We can make use of this fact and skip
> > > > > > processing the free pages in the ram bulk stage; this saves a
> > > > > > lot of CPU cycles and reduces the network traffic significantly
> > > > > > while speeding up the live migration process.
> > > > > >
> > > > > > This patch set is the QEMU side implementation.
> > > > > >
> > > > > > The virtio-balloon is extended so that QEMU can get the free
> > > > > > pages information from the guest through virtio.
> > > > > >
> > > > > > After getting the free pages information (a bitmap), QEMU can
> > > > > > use it to filter out the guest's free pages in the ram bulk
> > > > > > stage. This makes the live migration process much more efficient.
> > > > >
> > > > > Hi,
> > > > >   An interesting solution; I know a few different people have
> > > > > been looking at how to speed up ballooned VM migration.
> > > > >
> > > > >   I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > > same result?
> > > >
> > > > Yes I was about to suggest the same thing: it's simple and makes
> > > > use of the existing infrastructure.  And you wouldn't need to care
> > > > if the pages were unmapped by ballooning or anything else
> > > > (alternative balloon implementations, not yet touched by the
> > > > guest, etc.).  Besides, you wouldn't need to synchronize with the guest.
> > > >
> > > > Roman.
> > >
> > > The unmapped/zero-mapped pages can be detected by parsing
> > > /proc/self/pagemap, but the free pages can't be detected this way.
> > > Imagine an application that allocates a large amount of memory and,
> > > after using it, frees the memory; then live migration happens. All
> > > these free pages will be processed and sent to the destination, which
> > > is not optimal.
> >
> > First, the likelihood of such a situation is marginal, there's no
> > point optimizing for it specifically.
> >
> > And second, even if that happens, you inflate the balloon right before
> > the migration and the free memory will get umapped very quickly, so
> > this case is covered nicely by the same technique that works for more
> > realistic cases, too.
> 
> Although I wonder which is cheaper; that would be fairly expensive for the
> guest wouldn't it? And you'd somehow have to kick the guest before
> migration to do the ballooning - and how long would you wait for it to finish?

About 5 seconds for an 8G guest ballooning down to 1G. Getting the free pages
bitmap takes about 20ms for an 8G idle guest.

Liang

> 
> Dave
> 
> >
> > Roman.
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Li, Liang Z
> On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > >   I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> >
> > Only detecting the unmapped/zero-mapped pages is not enough. Consider
> > a situation like case 2; it can't achieve the same result.
> 
> Your case 2 doesn't exist in the real world.  If people could stop their main
> memory consumer in the guest prior to migration they wouldn't need live
> migration at all.

The case 2 is just a simplified scenario, not a real case.
As long as the guest's memory usage does not keep increasing, or does not
always run out, it is covered by case 2.

> I tend to think you can safely assume there's no free memory in the guest, so
> there's little point optimizing for it.

If this is true, we should not inflate the balloon either.

> OTOH it makes perfect sense optimizing for the unmapped memory that's
> made up, in particular, by the ballon, and consider inflating the balloon 
> right
> before migration unless you already maintain it at the optimal size for other
> reasons (like e.g. a global resource manager optimizing the VM density).
> 

Yes, I believe the current balloon works and it's simple. But do you take the
performance impact into consideration?
For an 8G guest, it takes about 5s to inflate the balloon, while it only
takes 20ms to traverse the free_list and construct the free pages bitmap.
During the inflation period, the guest is very busy.

By inflating the balloon, all the guest's pages will still be processed
(zero-page checking).

The only advantage of 'inflating the balloon before live migration' is
simplicity, nothing more.

Liang

> Roman.


RE: [Qemu-devel] [RFC qemu 4/4] migration: filter out guest's free pages in ram bulk stage

2016-03-03 Thread Li, Liang Z
> On Thu, Mar 03, 2016 at 06:44:28PM +0800, Liang Li wrote:
> > Get the free pages information through virtio and filter out the free
> > pages in the ram bulk stage. This can significantly reduce the total
> > live migration time as well as network traffic.
> >
> > Signed-off-by: Liang Li 
> > ---
> >  migration/ram.c | 52
> > ++--
> >  1 file changed, 46 insertions(+), 6 deletions(-)
> 
> > @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void
> *opaque)
> >  DIRTY_MEMORY_MIGRATION);
> >  }
> >  memory_global_dirty_log_start();
> > +
> > +if (balloon_free_pages_support() &&
> > +    balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > +                           &free_pages_count) == 0) {
> > +    qemu_mutex_unlock_iothread();
> > +    while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > +                                  &free_pages_count) == 0) {
> > +        usleep(1000);
> > +    }
> > +    qemu_mutex_lock_iothread();
> > +
> > +    filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
> > +}
> 
> IIUC, this code is synchronous wrt the guest OS balloon driver, i.e. it is
> asking the guest for free pages and waiting for a response. If the guest OS
> has crashed, this is going to mean QEMU waits forever and thus migration
> won't complete. Similarly you need to consider that the guest OS may be
> malicious and simply never respond.
> 
> So if the migration code is going to use the guest balloon driver to get info
> about free pages it has to be done in an asynchronous manner so that
> migration can never be stalled by a slow/crashed/malicious guest driver.
> 
> Regards,
> Daniel

You are right, thanks a lot! The wait does need to be bounded.
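A rough sketch of how the wait could be bounded, based on the RFC code (the
one-second timeout is arbitrary, and the fallback path is hypothetical):

    int64_t start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

    qemu_mutex_unlock_iothread();
    while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
                                  &free_pages_count) == 0) {
        if (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start > 1000) {
            /* Guest is slow, crashed or malicious: give up and
             * migrate everything instead of stalling forever. */
            free_pages_count = 0;
            break;
        }
        usleep(1000);
    }
    qemu_mutex_lock_iothread();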

Liang


RE: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device

2016-03-03 Thread Li, Liang Z
> On Thu,  3 Mar 2016 18:44:26 +0800
> Liang Li  wrote:
> 
> > Extend the virtio balloon device to support a new feature. This new
> > feature can help to get the guest's free pages information, which can
> > be used for live migration optimization.
> 
> Do you have a spec for this, e.g. as a patch to the virtio spec?

Not yet.
> 
> >
> > Signed-off-by: Liang Li 
> > ---
> >  balloon.c   | 30 -
> >  hw/virtio/virtio-balloon.c  | 81 
> > -
> >  include/hw/virtio/virtio-balloon.h  | 17 +-
> >  include/standard-headers/linux/virtio_balloon.h |  1 +
> >  include/sysemu/balloon.h| 10 ++-
> >  5 files changed, 134 insertions(+), 5 deletions(-)
> 
> > +static int virtio_balloon_free_pages(void *opaque,
> > +                                     unsigned long *free_pages_bitmap,
> > +                                     unsigned long *free_pages_count)
> > +{
> > +    VirtIOBalloon *s = opaque;
> > +    VirtIODevice *vdev = VIRTIO_DEVICE(s);
> > +    VirtQueueElement *elem = s->free_pages_vq_elem;
> > +    int len;
> > +
> > +    if (!balloon_free_pages_supported(s)) {
> > +        return -1;
> > +    }
> > +
> > +    if (s->req_status == NOT_STARTED) {
> > +        s->free_pages_bitmap = free_pages_bitmap;
> > +        s->req_status = STARTED;
> > +        s->mem_layout.low_mem =
> > +            pc_get_lowmem(PC_MACHINE(current_machine));
> 
> Please don't leak pc-specific information into generic code.

I have already noticed that and just left it here in this initial RFC
version; the hard part of this solution is how to handle different
architectures ...

Thanks!

Liang


RE: [RFC qemu 4/4] migration: filter out guest's free pages in ram bulk stage

2016-03-03 Thread Li, Liang Z
> On Thu,  3 Mar 2016 18:44:28 +0800
> Liang Li  wrote:
> 
> > Get the free pages information through virtio and filter out the free
> > pages in the ram bulk stage. This can significantly reduce the total
> > live migration time as well as network traffic.
> >
> > Signed-off-by: Liang Li 
> > ---
> >  migration/ram.c | 52
> > ++--
> >  1 file changed, 46 insertions(+), 6 deletions(-)
> >
> 
> > @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void
> *opaque)
> >  DIRTY_MEMORY_MIGRATION);
> >  }
> >  memory_global_dirty_log_start();
> > +
> > +if (balloon_free_pages_support() &&
> > +    balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > +                           &free_pages_count) == 0) {
> > +    qemu_mutex_unlock_iothread();
> > +    while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > +                                  &free_pages_count) == 0) {
> > +        usleep(1000);
> > +    }
> > +    qemu_mutex_lock_iothread();
> > +
> > +    filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
> > +}
> 
> A general comment: Using the ballooner to get information about pages that
> can be filtered out is too limited (there may be other ways to do this; we
> might be able to use cmma on s390, for example), and I don't like hardcoding
> to a specific method.
> 
> What about the reverse approach: Code may register a handler that
> populates the free_pages_bitmap which is called during this stage?

Good suggestion, thanks!
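Something like the following registration API might work (all names here are
hypothetical, just to illustrate the idea):

    /* Called during the ram bulk stage; each registered handler marks
     * pages in the bitmap that migration is allowed to skip. */
    typedef void (*SkipPagesHandler)(unsigned long *free_pages_bmap,
                                     unsigned long nr_pages);

    void qemu_register_skip_pages_handler(SkipPagesHandler handler);

The balloon code would register such a handler, and cmma on s390 could
register its own.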

Liang



RE: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon device

2016-03-03 Thread Li, Liang Z
> Subject: Re: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon
> device
> 
> On Thu, Mar 03, 2016 at 06:44:26PM +0800, Liang Li wrote:
> > Extend the virtio balloon device to support a new feature. This new
> > feature can help to get the guest's free pages information, which can
> > be used for live migration optimization.
> >
> > Signed-off-by: Liang Li 
> 
> I don't understand why we need a new interface.
> Balloon already sends free pages to host.
> Just teach host to skip these pages.
> 

I just make use of the current virtio-balloon implementation; it's more
complicated to invent a new virtio device...
Actually, there is no need to inflate the balloon before live migration, so
the host has no information about the guest's free pages; that's why I added
a new interface.

> Maybe instead of starting with code, you should send a high level description
> to the virtio tc for consideration?
> 
> You can do it through the mailing list or using the web form:
> http://www.oasis-
> open.org/committees/comments/form.php?wg_abbrev=virtio
> 

Thanks for your information and suggestion.

Liang


RE: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Li, Liang Z
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
> 
> * Liang Li (liang.z...@intel.com) wrote:
> > The current QEMU live migration implementation marks all the
> > guest's RAM pages as dirtied in the ram bulk stage; all these pages
> > will be processed, and that takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, it doesn't care about the content of
> > free pages. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage; this saves a lot of CPU cycles and reduces
> > the network traffic significantly while speeding up the live migration
> > process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This makes
> > the live migration process much more efficient.
> 
> Hi,
>   An interesting solution; I know a few different people have been looking at
> how to speed up ballooned VM migration.
> 

Ooh, different solutions for the same purpose, and both based on the balloon.

>   I wonder if it would be possible to avoid the kernel changes by parsing
> /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> pages in the guest ram, would it achieve the same result?
> 

Only detecting the unmapped/zero-mapped pages is not enough. Consider a
situation like case 2; it can't achieve the same result.
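For reference, a minimal sketch of checking whether a page is mapped via
/proc/self/pagemap (each 8-byte entry describes one virtual page; bit 63
means the page is present in RAM). It tells us nothing about pages the guest
has allocated and then freed:

    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Return 1 if the page containing vaddr is mapped, 0 if not,
     * -1 on error. */
    static int page_is_mapped(uintptr_t vaddr)
    {
        uint64_t entry;
        long page_size = sysconf(_SC_PAGESIZE);
        off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return -1;
        if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
            close(fd);
            return -1;
        }
        close(fd);
        return (entry >> 63) & 1;   /* bit 63: page present in RAM */
    }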

> > This RFC version doesn't take post-copy and RDMA into
> > consideration; maybe both of them can benefit from this PV solution
> > with some extra modifications.
> 
> For postcopy to be safe, you would still need to send a message to the
> destination telling it that there were zero pages, otherwise the destination
> can't tell if it's supposed to request the page from the source or treat the
> page as zero.
> 
> Dave

I will consider this later, thanks, Dave.

Liang

> 
> >
> > Performance data
> > 
> >
> > Test environment:
> >
> > CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz Host RAM: 64GB
> > Host Linux Kernel:  4.2.0   Host OS: CentOS 7.1
> > Guest Linux Kernel:  4.5.rc6Guest OS: CentOS 6.6
> > Network:  X540-AT2 with 10 Gigabit connection Guest RAM: 8GB
> >
> > Case 1: Idle guest just boots:
> > ---------------------------------------------
> >                     | original  |     pv
> > ---------------------------------------------
> > total time(ms)      |    1894   |    421
> > ---------------------------------------------
> > transferred ram(KB) |   398017  |  353242
> > ---------------------------------------------
> >
> >
> > Case 2: The guest has run some memory-consuming workload, and the
> > workload is terminated just before live migration.
> > ---------------------------------------------
> >                     | original  |     pv
> > ---------------------------------------------
> > total time(ms)      |    7436   |    552
> > ---------------------------------------------
> > transferred ram(KB) |  8146291  |  361375
> > ---------------------------------------------
> >


RE: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Li, Liang Z
> On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> > The current QEMU live migration implementation marks all the
> > guest's RAM pages as dirtied in the ram bulk stage; all these pages
> > will be processed, and that takes quite a lot of CPU cycles.
> >
> > From the guest's point of view, it doesn't care about the content of
> > free pages. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage; this saves a lot of CPU cycles and reduces
> > the network traffic significantly while speeding up the live migration
> > process.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This makes
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take post-copy and RDMA into
> > consideration; maybe both of them can benefit from this PV solution
> > with some extra modifications.
> >
> > Performance data
> > 
> >
> > Test environment:
> >
> > CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz Host RAM: 64GB
> > Host Linux Kernel:  4.2.0   Host OS: CentOS 7.1
> > Guest Linux Kernel:  4.5.rc6Guest OS: CentOS 6.6
> > Network:  X540-AT2 with 10 Gigabit connection Guest RAM: 8GB
> >
> > Case 1: Idle guest just boots:
> > ---------------------------------------------
> >                     | original  |     pv
> > ---------------------------------------------
> > total time(ms)      |    1894   |    421
> > ---------------------------------------------
> > transferred ram(KB) |   398017  |  353242
> > ---------------------------------------------
> >
> >
> > Case 2: The guest has run some memory-consuming workload, and the
> > workload is terminated just before live migration.
> > ---------------------------------------------
> >                     | original  |     pv
> > ---------------------------------------------
> > total time(ms)      |    7436   |    552
> > ---------------------------------------------
> > transferred ram(KB) |  8146291  |  361375
> > ---------------------------------------------
> 
> Both cases look very artificial to me.  Normally you migrate VMs which have
> started long ago and which can't have their services terminated before the
> migration, so I wouldn't expect any useful amount of free pages obtained
> this way.
> 

Yes, it's somewhat artificial, just to emphasize the effect. And I think
these two cases are very easy to reproduce. Using real workloads and doing
the test in a production environment would be more convincing.

We can predict that as long as the guest doesn't run out of memory, this
solution can still take effect and shorten the total live migration time.
(Of course, we should take the time cost of the virtio communication into
account.)

> OTOH I don't see why you can't just inflate the balloon before the migration,
> and really optimize the amount of transferred data this way?
> With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
> good estimate of the optimal balloon size, and with the recently merged
> balloon deflation on OOM it's a safe thing to do without exposing the guest
> workloads to OOM risks.
> 
> Roman.

Thanks for your information. The size of the free page bitmap is not very
large; for a guest with 8GB RAM, only 256KB of extra memory is required.
Compared to this solution, inflating the balloon is more expensive. If the
balloon size is not optimal and the guest requests more memory during live
migration, the guest's performance will be impacted.

Liang
