RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-02-03 Thread Li, Liang Z
> 
> 
> > +static void free_extended_page_bitmap(struct virtio_balloon *vb)
> > +{
> > +   int i, bmap_count = vb->nr_page_bmap;
> > +
> > +   for (i = 1; i < bmap_count; i++) {
> > +   kfree(vb->page_bitmap[i]);
> > +   vb->page_bitmap[i] = NULL;
> > +   vb->nr_page_bmap--;
> > +   }
> > +}
> > +
> > +static void kfree_page_bitmap(struct virtio_balloon *vb)
> > +{
> > +   int i;
> > +
> > +   for (i = 0; i < vb->nr_page_bmap; i++)
> > +   kfree(vb->page_bitmap[i]);
> > +}
> 
> It might be worth commenting that pair of functions to make it clear why
> they are so different; I guess the kfree_page_bitmap is used just before you
> free the structure above it so you don't need to keep the count/pointers
> updated?
> 

Yes. I will add some comments for that. Thanks!

Liang
 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
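
For illustration, a minimal sketch of how the requested comments could read; the wording here is an assumption, not the final patch:

/*
 * Free all but the first page bitmap, keeping nr_page_bmap and the
 * pointer array consistent: the device keeps running afterwards, so
 * the remaining state must stay valid.
 */
static void free_extended_page_bitmap(struct virtio_balloon *vb)
{
	int i, bmap_count = vb->nr_page_bmap;

	for (i = 1; i < bmap_count; i++) {
		kfree(vb->page_bitmap[i]);
		vb->page_bitmap[i] = NULL;
		vb->nr_page_bmap--;
	}
}

/*
 * Called only on the teardown path, just before 'vb' itself is freed,
 * so there is no need to clear the pointers or update nr_page_bmap.
 */
static void kfree_page_bitmap(struct virtio_balloon *vb)
{
	int i;

	for (i = 0; i < vb->nr_page_bmap; i++)
		kfree(vb->page_bitmap[i]);
}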


RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-18 Thread Li, Liang Z
> On Wed, Jan 18, 2017 at 04:56:58AM +0000, Li, Liang Z wrote:
> > > > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > > -   virtqueue_kick(vq);
> > > > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > > > +   unsigned long base_pfn, int pages)
> > > >
> > > > -   /* When host has read buffer, this completes via balloon_ack */
> > > > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > > +{
> > > > +   __le64 *range = vb->resp_data + vb->resp_pos;
> > > >
> > > > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > > > +   /* when the length field can't contain pages, set it to 0 to
> > >
> > > /*
> > >  * Multi-line
> > >  * comments
> > >  * should look like this.
> > >  */
> > >
> > > Also, pls start sentences with an upper-case letter.
> > >
> >
> > Sorry for that.
> >
> > > > +* indicate the actual length is in the next __le64;
> > > > +*/
> > >
> > > This is part of the interface so should be documented as such.
> > >
> > > > +   *range = cpu_to_le64((base_pfn <<
> > > > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > > > +   *(range + 1) = cpu_to_le64(pages);
> > > > +   vb->resp_pos += 2;
> > >
> > > Pls use structs for this kind of stuff.
> >
> > I am not sure if you mean to use
> >
> > struct range {
> > 	__le64 pfn : 52;
> > 	__le64 nr_page : 12;
> > };
> > instead of the shift operation?
> 
> Not just that. You want to add a pages field as well.
> 

A 'pages' field? Could you give more hints?

> Generally describe the format in the header in some way so host and guest
> can easily stay in sync.

'VIRTIO_BALLOON_NR_PFN_BITS' is for this purpose and it will be passed to the
related function in page_alloc.c as a parameter.

Thanks!
Liang
> All the pointer math and void * means we get zero type safety and I'm not
> happy about it.
> 
> It's not good that virtio format seeps out to page_alloc anyway.
> If unavoidable it is not a good idea to try to hide this fact, people will
> assume they can change the format at will.
> 
> --
> MST
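
For reference, one struct-based layout that the request above could lead to; the type name and comments are assumptions, and on the wire the second word would only be transmitted when the packed length is 0, mirroring the code quoted above:

/*
 * One {pfn|length} entry: the high bits hold the start PFN, the low
 * VIRTIO_BALLOON_NR_PFN_BITS bits hold the number of pages, and a
 * length of 0 means the count is carried in the following __le64.
 */
struct virtio_balloon_page_range {
	__le64 pfn_and_len;
	__le64 pages;	/* Only valid when the length bits above are 0 */
};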


RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-18 Thread Li, Liang Z
> > > > > > Signed-off-by: Liang Li 
> > > > > > Cc: Michael S. Tsirkin 
> > > > > > Cc: Paolo Bonzini 
> > > > > > Cc: Cornelia Huck 
> > > > > > Cc: Amit Shah 
> > > > > > Cc: Dave Hansen 
> > > > > > Cc: Andrea Arcangeli 
> > > > > > Cc: David Hildenbrand 
> > > > > > ---
> > > > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > > > >  1 file changed, 12 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > > > b/include/uapi/linux/virtio_balloon.h
> > > > > > index 343d7dd..2f850bf 100644
> > > > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > > > @@ -34,10 +34,14 @@
> > > > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> > > > > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> > > > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > > > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> > > > > >
> > > > > >  /* Size of a PFN in the balloon interface. */
> > > > > >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> > > > > >
> > > > > > +/* Bits width for the length of the pfn range */
> > > > >
> > > > > What does this mean? Couldn't figure it out.
> > > > >
> > > > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > > > +
> > > > > >  struct virtio_balloon_config {
> > > > > > /* Number of pages host wants Guest to give up. */
> > > > > > __u32 num_pages;
> > > > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > > > __virtio64 val;
> > > > > >  } __attribute__((packed));
> > > > > >
> > > > > > +/* Response header structure */
> > > > > > +struct virtio_balloon_resp_hdr {
> > > > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > > > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> > > > >
> > > > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > > > >
> > > >
> > > > Got it, will change in the next version.
> > > >
> > > > And could you help take a look at the other parts, as well as the QEMU part?
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > Yes but first I would like to understand how come no fields in this
> > > new structure come up if I search for them in the following patch. I
> > > don't see why
> >
> > That's not true; all of the fields will be referenced in the following
> > patches except the 'reserved' field.
> 
> But none of these are used in the following patch 3.

Yes. Only 'data_len' is used in patch 3, and for extensibility maybe at least
'cmd' is needed too. I should set it in patch 3 to some default value even
though it's not currently useful. 'flag' and 'id' are for patch 4. I just want
to reuse 'struct virtio_balloon_resp_hdr' and make the code simpler.

Thanks!
Liang
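
Based on the comments above, a sketch of what the next version of the header could look like, assuming the field order stays the same:

/* Response header structure */
struct virtio_balloon_resp_hdr {
	__u8 cmd;	/* Distinguish different request types */
	__u8 flag;	/* Mark status for a specific request type */
	__le16 id;	/* Distinguish requests of a specific type */
	__le32 data_len;	/* Length of the following data, in bytes */
} __attribute__((packed));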



RE: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-18 Thread Li, Liang Z
> Am 21.12.2016 um 07:52 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use {pfn|length} to represent
> > the page information instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping the
> > guest's unused pages in the first round of data copy, to reduce
> > needless data processing, this can help to save quite a lot of CPU
> > cycles and network bandwidth. We put the guest's unused page information
> > in a {pfn|length} array and send it to the host with the virtqueue of
> > virtio-balloon. For an idle guest with 8GB RAM, this can help to
> > shorten the total live migration time from 2s to about 500ms in a
> > 10Gbps network environment. For a guest with quite a lot of page
> > cache and few unused pages, it's possible to let the guest drop
> > its page cache before live migration; this case can benefit from
> > this new feature too.
> 
> I agree that both changes make sense (although the second change just
> smells very racy, as you also pointed out in the patch description), however I
> am not sure if virtio-balloon is really the right place for the latter change.
> 
> virtio-balloon is all about ballooning, nothing else. What you're doing is
> using it as a way to communicate balloon-unrelated data from/to the hypervisor.
> Yes, it is also about guest memory, but completely unrelated to the purpose
> of the balloon device.
> 
> Maybe using virtio-balloon for this purpose is okay - I have mixed feelings
> (especially as I can't tell where else this could go). I would like to get a
> second opinion on this.
> 

We have discussed the implementation for a long time; making use of the
current virtio-balloon seems better than the other solutions, and it was
recommended by Michael.

Thanks!
Liang
> --
> 
> David


RE: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-17 Thread Li, Liang Z
> > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > -   virtqueue_kick(vq);
> > +static void do_set_resp_bitmap(struct virtio_balloon *vb,
> > +   unsigned long base_pfn, int pages)
> >
> > -   /* When host has read buffer, this completes via balloon_ack */
> > -   wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > +{
> > +   __le64 *range = vb->resp_data + vb->resp_pos;
> >
> > +   if (pages > (1 << VIRTIO_BALLOON_NR_PFN_BITS)) {
> > +   /* when the length field can't contain pages, set it to 0 to
> 
> /*
>  * Multi-line
>  * comments
>  * should look like this.
>  */
> 
> Also, pls start sentences with an upper-case letter.
> 

Sorry for that.

> > +* indicate the actual length is in the next __le64;
> > +*/
> 
> This is part of the interface so should be documented as such.
> 
> > +   *range = cpu_to_le64((base_pfn <<
> > +   VIRTIO_BALLOON_NR_PFN_BITS) | 0);
> > +   *(range + 1) = cpu_to_le64(pages);
> > +   vb->resp_pos += 2;
> 
> Pls use structs for this kind of stuff.

I am not sure if you mean to use

struct range {
	__le64 pfn : 52;
	__le64 nr_page : 12;
};

instead of the shift operation?

I didn't do it this way because I don't want to include 'virtio_balloon.h' in
page_alloc.c, or copy the definition of this struct into page_alloc.c.

Thanks!
Liang
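
A sketch of what the parameterized packing could look like on the page_alloc.c side; the helper name and signature are made up for illustration, only the {pfn|length} encoding follows the discussion above:

/* Pack one free-page range without including virtio_balloon.h; the
 * driver passes the encoding width in via 'len_bits'. Returns the
 * number of __le64 slots consumed, or -ENOSPC if the buffer is full. */
static int record_unused_range(__le64 *buf, unsigned long pos,
			       unsigned long buf_len, unsigned long pfn,
			       unsigned long nr_pages, int len_bits)
{
	if (nr_pages >= (1UL << len_bits)) {
		/* Length field too small: 0 means "length in next __le64" */
		if (pos + 2 > buf_len)
			return -ENOSPC;
		buf[pos] = cpu_to_le64(pfn << len_bits);
		buf[pos + 1] = cpu_to_le64(nr_pages);
		return 2;
	}
	if (pos + 1 > buf_len)
		return -ENOSPC;
	buf[pos] = cpu_to_le64((pfn << len_bits) | nr_pages);
	return 1;
}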


RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-17 Thread Li, Liang Z
> Sent: Wednesday, January 18, 2017 3:11 AM
> To: Li, Liang Z
> Cc: k...@vger.kernel.org; virtio-...@lists.oasis-open.org; qemu-
> de...@nongnu.org; linux...@kvack.org; linux-kernel@vger.kernel.org;
> virtualizat...@lists.linux-foundation.org; amit.s...@redhat.com; Hansen,
> Dave; cornelia.h...@de.ibm.com; pbonz...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com
> Subject: Re: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new
> feature bit and head struct
> 
> On Fri, Jan 13, 2017 at 09:24:22AM +, Li, Liang Z wrote:
> > > On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > > > Add a new feature which supports sending the page information with a
> > > > range array. The current implementation uses a PFN array, which is
> > > > not very efficient. Using ranges can improve the performance of
> > > > inflating/deflating significantly.
> > > >
> > > > Signed-off-by: Liang Li <liang.z...@intel.com>
> > > > Cc: Michael S. Tsirkin <m...@redhat.com>
> > > > Cc: Paolo Bonzini <pbonz...@redhat.com>
> > > > Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
> > > > Cc: Amit Shah <amit.s...@redhat.com>
> > > > Cc: Dave Hansen <dave.han...@intel.com>
> > > > Cc: Andrea Arcangeli <aarca...@redhat.com>
> > > > Cc: David Hildenbrand <da...@redhat.com>
> > > > ---
> > > >  include/uapi/linux/virtio_balloon.h | 12 
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/virtio_balloon.h
> > > > b/include/uapi/linux/virtio_balloon.h
> > > > index 343d7dd..2f850bf 100644
> > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > @@ -34,10 +34,14 @@
> > > >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> > > >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> > > >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > > > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> > > >
> > > >  /* Size of a PFN in the balloon interface. */
> > > >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> > > >
> > > > +/* Bits width for the length of the pfn range */
> > >
> > > What does this mean? Couldn't figure it out.
> > >
> > > > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > > > +
> > > >  struct virtio_balloon_config {
> > > > /* Number of pages host wants Guest to give up. */
> > > > __u32 num_pages;
> > > > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > > > __virtio64 val;
> > > >  } __attribute__((packed));
> > > >
> > > > +/* Response header structure */
> > > > +struct virtio_balloon_resp_hdr {
> > > > +   __le64 cmd : 8; /* Distinguish different requests type */
> > > > +   __le64 flag: 8; /* Mark status for a specific request type */
> > > > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > > > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> > >
> > > This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> > >
> >
> > Got it, will change in the next version.
> >
> > And could you help take a look at the other parts, as well as the QEMU part?
> >
> > Thanks!
> > Liang
> 
> Yes but first I would like to understand how come no fields in this new
> structure come up if I search for them in the following patch. I don't see why

That's not true; all of the fields will be referenced in the following patches
except the 'reserved' field.

> should I waste time on reviewing the implementation if the interface isn't
> reasonable. You don't have to waste it too - just send RFC patches with the
> header until we can agree on it.

OK. I will post the header part separately.

Thanks!
Liang
> 
> --
> MST
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



RE: [PATCH RFC 0/4] 5-level EPT

2017-01-16 Thread Li, Liang Z
> On 29/12/2016 10:25, Liang Li wrote:
> > x86-64 currently limits the physical address width to 46 bits, which
> > can support 64 TiB of memory. Some vendors require support for more in
> > some use cases. Intel plans to extend the physical address width to
> > 52 bits in some of its future products.
> >
> > The current EPT implementation only supports a 4-level page table, which
> > can support a maximum 48-bit physical address width, so the EPT needs to
> > be extended to 5 levels to support a 52-bit physical address width.
> >
> > This patchset has been tested in the SIMICS environment for a 5-level
> > paging guest, which was patched with Kirill's patchset for enabling
> > the 5-level page table, with both EPT and shadow paging support. I only
> > covered the boot process; the guest can boot successfully.
> >
> > Some parts of this patchset can be improved. Any comments on the
> > design or the patches would be appreciated.
> 
> I will review the patches.  They seem fairly straightforward.
> 
> However, I am worried about the design of the 5-level page table feature
> with respect to migration.
> 
> Processors that support the new LA57 mode can write 57-canonical/48-
> noncanonical linear addresses to some registers even when LA57 mode is
> inactive.  This is true even of unprivileged instructions, in particular
> WRFSBASE/WRGSBASE.
> 
> This is fairly bad because, if a guest performs such a write (because of a bug
> or because of malice), it will not be possible to migrate the virtual machine
> to a machine that lacks LA57 mode.
> 
> Ordinarily, hypervisors trap CPUID to hide features that are only present in
> some processors of a heterogeneous cluster, and the hypervisor also traps
> for example CR4 writes to prevent enabling features that were masked away.
> In this case, however, the only way for the hypervisor to prevent the write
> would be to run the guest with
> CR4.FSGSBASE=0 and trap all executions of WRFSBASE/WRGSBASE.  This
> might have negative effects on performance for workloads that use the
> instructions.
> 
> Of course, this is a problem even without your patches.  However, I think it
> should be addressed first.  I am seriously thinking of blacklisting FSGSBASE
> completely on LA57 machines until the above is fixed in hardware.
> 
> Paolo

The issue has already been forwarded to the hardware guys; we are still waiting
for their feedback.

Thanks!
Liang


RE: [virtio-dev] Re: [PATCH v6 kernel 2/5] virtio-balloon: define new feature bit and head struct

2017-01-13 Thread Li, Liang Z
> On Wed, Dec 21, 2016 at 02:52:25PM +0800, Liang Li wrote:
> > Add a new feature which supports sending the page information with a
> > range array. The current implementation uses a PFN array, which is not
> > very efficient. Using ranges can improve the performance of
> > inflating/deflating significantly.
> >
> > Signed-off-by: Liang Li 
> > Cc: Michael S. Tsirkin 
> > Cc: Paolo Bonzini 
> > Cc: Cornelia Huck 
> > Cc: Amit Shah 
> > Cc: Dave Hansen 
> > Cc: Andrea Arcangeli 
> > Cc: David Hildenbrand 
> > ---
> >  include/uapi/linux/virtio_balloon.h | 12 
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/include/uapi/linux/virtio_balloon.h
> > b/include/uapi/linux/virtio_balloon.h
> > index 343d7dd..2f850bf 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -34,10 +34,14 @@
> >  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
> >  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
> >  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> > +#define VIRTIO_BALLOON_F_PAGE_RANGE	3 /* Send page info with ranges */
> >
> >  /* Size of a PFN in the balloon interface. */
> >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> >
> > +/* Bits width for the length of the pfn range */
> 
> What does this mean? Couldn't figure it out.
> 
> > +#define VIRTIO_BALLOON_NR_PFN_BITS 12
> > +
> >  struct virtio_balloon_config {
> > /* Number of pages host wants Guest to give up. */
> > __u32 num_pages;
> > @@ -82,4 +86,12 @@ struct virtio_balloon_stat {
> > __virtio64 val;
> >  } __attribute__((packed));
> >
> > +/* Response header structure */
> > +struct virtio_balloon_resp_hdr {
> > +   __le64 cmd : 8; /* Distinguish different requests type */
> > +   __le64 flag: 8; /* Mark status for a specific request type */
> > +   __le64 id : 16; /* Distinguish requests of a specific type */
> > +   __le64 data_len: 32; /* Length of the following data, in bytes */
> 
> This use of __le64 makes no sense.  Just use u8/le16/le32 pls.
> 

Got it, will change in the next version. 

And could you help take a look at the other parts, as well as the QEMU part?

Thanks!
Liang
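
To the "what does this mean" question above, a clearer comment for the define might read as follows; the wording is a suggestion, not the posted patch:

/*
 * Each __le64 entry of the response data packs a start PFN in the high
 * bits and the number of pages in the low VIRTIO_BALLOON_NR_PFN_BITS
 * bits; a length of 0 means the actual length follows in the next __le64.
 */
#define VIRTIO_BALLOON_NR_PFN_BITS 12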


RE: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2017-01-09 Thread Li, Liang Z
Hi guys,

Could you help to review this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, December 21, 2016 2:52 PM
> To: k...@vger.kernel.org
> Cc: virtio-...@lists.oasis-open.org; qemu-de...@nongnu.org; linux-
> m...@kvack.org; linux-kernel@vger.kernel.org; virtualization@lists.linux-
> foundation.org; amit.s...@redhat.com; Hansen, Dave;
> cornelia.h...@de.ibm.com; pbonz...@redhat.com; m...@redhat.com;
> da...@redhat.com; aarca...@redhat.com; dgilb...@redhat.com;
> quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v6 kernel 0/5] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process, the main
> idea of this optimization is to use {pfn|length} to represent the page
> information instead of the PFNs, to reduce the overhead of virtio data
> transmission, address translation and madvise(). This can help to improve the
> performance by about 85%.
> 
> Another change is for speeding up live migration. By skipping the guest's
> unused pages in the first round of data copy, to reduce needless data
> processing, this can help to save quite a lot of CPU cycles and network
> bandwidth. We put the guest's unused page information in a {pfn|length} array
> and send it to the host with the virtqueue of virtio-balloon. For an idle
> guest with 8GB RAM, this can help to shorten the total live migration time
> from 2s to about 500ms in a 10Gbps network environment. For a guest with
> quite a lot of page cache and few unused pages, it's possible to let the
> guest drop its page cache before live migration; this case can benefit from
> this new feature too.
> 
> Changes from v5 to v6:
> * Drop the bitmap from the virtio ABI, use {pfn|length} only.
> * Enhance the API to get the unused page information from mm.
> 
> Changes from v4 to v5:
> * Drop the code to get the max_pfn, use another way instead.
> * Simplify the API to get the unused page information from mm.
> 
> Changes from v3 to v4:
> * Use the new scheme suggested by Dave Hansen to encode the bitmap.
> * Add code which is missed in v3 to handle migrate page.
> * Free the memory for the bitmap in time once the operation is done.
> * Address some of the comments in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (5):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and head struct
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define flags and head for host request vq
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 510
> 
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  34 +++
>  mm/page_alloc.c | 120 +
>  4 files changed, 621 insertions(+), 46 deletions(-)
> 
> --
> 1.9.1



RE: [PATCH RFC 0/4] 5-level EPT

2016-12-29 Thread Li, Liang Z
> Subject: Re: [PATCH RFC 0/4] 5-level EPT
> 
> On Thu, 29 Dec 2016 17:25:59 +0800, Liang Li said:
> > x86-64 currently limits the physical address width to 46 bits, which
> > can support 64 TiB of memory. Some vendors require support for more in
> > some use cases. Intel plans to extend the physical address width to
> > 52 bits in some of its future products.
> 
> Can you explain why this patchset mentions 52 bits in some places, and 57 in
> others?  Is it because there are currently in-process chipsets that will do
> 52, but you want to future-proof it by extending it to 57 so future chipsets
> won't need more work?  Or is there some other reason?

The 57 bits I referred to in this patch set mean the virtual address width that
will be supported in future CPUs with a 52-bit physical address width.
5-level EPT can support a maximum 57-bit physical address width (each paging
level resolves 9 bits on top of the 12-bit page offset, so 12 + 5*9 = 57);
as long as future CPUs use no more than 57 bits of physical address, no more
work is needed.

Thanks!
Liang



RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > >
> > > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not long
> > > enough for the 'length',
> > > set the 'length' to a special value to indicate the "actual length is in
> > > the next 8 bytes".
> > >
> > > That will be much simpler. Right?
> >
> > Sounds fine to me.
> >
> 
> Sounds fine to me too indeed.
> 
> I'm only wondering what is the major point for compressing gpfn+len in
> 8 bytes in the common case, you already use sg_init_table to send down two
> pages, we could send three as well and avoid all math and bit shifts and ors,
> or not?
> 

Yes, we can use more pages for that.

> I agree with the above because from a performance prospective I tend to
> think the above proposal will run at least theoretically faster because the
> other way is to waste double amount of CPU cache, and bit mangling in the
> encoding and the later decoding on qemu side should be faster than
> accessing an array of double size, but then I'm not sure if it's measurable
> optimization. So I'd be curious to know the exact motivation and if it is to
> reduce the CPU cache usage or if there's some other fundamental reason to
> compress it.
> The header already tells qemu how big is the array payload, couldn't we just
> add more pages if one isn't enough?
> 

The original intention of compressing the PFN and length was to reduce the
memory required. Even though the code has changed a lot from the previous
versions, I think this is still true.

Now we allocate a buffer of a specified size to save the 'PFN|length' data;
when the buffer is not big enough to save all the page info for a specified
order, a double-size buffer will be allocated. This is what we want to avoid,
because the allocation may fail and allocation takes some time. For fast live
migration, time is a critical factor we have to consider: more time taken
means more unnecessary pages are sent, because live migration starts before
the request for unused pages gets a response.

Thanks

Liang

> Thanks,
> Andrea
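
To put the saving in rough numbers (assuming 4KB pages): an idle 8GB guest has
about 8GB / 4KB = 2M free pages, which as a plain 8-bytes-per-PFN array is
roughly 16MB of metadata, while the same memory described as {pfn|length}
ranges typically needs only a handful of 8-byte entries.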


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-17 Thread Li, Liang Z
> On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> > There still exists the case where MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with a huge amount of memory; then there are
> > only 28 bits left for the pfn, which is not enough.
> 
> Not related to the balloon but how would it help to set MAX_ORDER to 36?
> 

My point here is that MAX_ORDER may be configured to a big value.

> What the MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
> that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
> 
> The higher the MAX_ORDER the slower the kernel runs simply so the smaller
> the MAX_ORDER the better.
> 
> > Should  we limit the MAX_ORDER? I don't think so.
> 
> We shouldn't strictly depend on MAX_ORDER value but it's mostly limited
> already even if configurable at build time.
> 

I didn't know that and will take a look, thanks for your information.


Liang
> We definitely need it to reach at least the hugepage size, then it's mostly
> driver issue, but drivers requiring large contiguous allocations should rely
> on CMA only or vmalloc if they only require it virtually contiguous, and not rely
> on larger MAX_ORDER that would slowdown all kernel allocations/freeing.


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> >
> > Use 52 bits for 'pfn', 12 bits for 'length'; when the 12 bits are not long
> > enough for the 'length',
> > set the 'length' to a special value to indicate the "actual length is in
> > the next 8 bytes".
> >
> > That will be much simpler. Right?
> 
> Sounds fine to me.

Thanks for your inspiration!

Liang


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bits for the pfn, and the bits left for
> >>> length are not enough. How about keeping 45 bits for the pfn and 19 bits
> >>> for length? 45 bits for the pfn can cover a 57-bit physical address; that
> >>> should be enough in the near future.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> >> is enough for x86 for a while.  Other architectures, who knows?
> 
> Thinking about this some more...  There are really only two cases that
> matter: 4k pages and "much bigger" ones.
> 
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any cases.  For bigger ranges, 8
> vs 16 bytes means *nothing*.  And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
> 
> How about this:
>  * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
>  * One special 'length' value to mean "actual length in next 8 bytes"
> 
> That should be pretty simple to produce and decode.  We have two record
> sizes, but I think it is manageable.

It works. Now that we intend to use another 8 bytes for the length,

why not:

use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are not long
enough for the 'length',
set the 'length' to a special value to indicate the "actual length is in the
next 8 bytes".

That will be much simpler. Right?

Liang
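
A minimal sketch of the proposed 52/12 encoding; the helper names are made up for illustration:

#define RANGE_LEN_BITS	12
#define RANGE_LEN_MASK	((1ULL << RANGE_LEN_BITS) - 1)

/* Pack one range; a length of 0 signals "actual length in the next 8 bytes". */
static inline u64 range_encode(u64 pfn, u64 len)
{
	return (pfn << RANGE_LEN_BITS) | (len > RANGE_LEN_MASK ? 0 : len);
}

static inline void range_decode(u64 entry, u64 *pfn, u64 *len)
{
	*pfn = entry >> RANGE_LEN_BITS;
	*len = entry & RANGE_LEN_MASK;	/* 0: caller must read the next u64 */
}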


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistic before deciding whether to  ripping the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it.  I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid.  I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length.  But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > > Just found the MAX_ORDER should be limited to 12 if use length
> > > instead of order, If the MAX_ORDER is configured to a value bigger
> > > than 12, it will make things more complex to handle this case.
> > >
> > > If use order, we need to break a large memory range whose length is
> > > not the power of 2 into several small ranges, it also make the code
> complex.
> >
> > I can't imagine it makes the code that much more complex.  It adds a
> > for loop.  Right?
> >
> > > It seems we leave too many bit  for the pfn, and the bits leave for
> > > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > > for length, 45 bits for pfn can cover 57 bits physical address, that 
> > > should
> be enough in the near feature.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> > is enough for x86 for a while.  Other architectures who knows?
> 
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
> 
> I think we really want both length and order.
> 
> I understand how you are trying to pack them as tightly as possible.
> 
> However, I thought of a trick, we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
> 
> guest can program the sizes for each order through config space.
> 
> We will have 10 bits left for length.
> 

Please don't; we just got rid of the bitmap for simplification. :)

> It might make sense to also allow guest to program the number of bits used
> for order, this will make it easy to extend without host changes.
> 

There still exists the case where MAX_ORDER is configured to a large value,
e.g. 36 for a system with a huge amount of memory; then only 28 bits are
left for the pfn, which is not enough.
Should we limit MAX_ORDER? I don't think so.

It seems using the order is better.

Thanks!
Liang
> --
> MST


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-15 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon
> >> for fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistic before deciding whether to  ripping the bitmap from the
> >>> ABI, am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove
> >> the bitmap, unless we have some data to support keeping it.  I don't
> >> feel as strongly about it, but I think their critique of it is pretty
> >> valid.  I think the consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or
> >> a length.  But, if we have 12 bits, then the argument for doing
> >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> >
> > Just found the MAX_ORDER should be limited to 12 if use length instead
> > of order, If the MAX_ORDER is configured to a value bigger than 12, it
> > will make things more complex to handle this case.
> >
> > If use order, we need to break a large memory range whose length is
> > not the power of 2 into several small ranges, it also make the code complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for loop.
> Right?
> 

Yes, just a little. :)

> > It seems we leave too many bit  for the pfn, and the bits leave for
> > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > for length, 45 bits for pfn can cover 57 bits physical address, that should 
> > be
> enough in the near feature.
> >
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is 
> enough for
> x86 for a while.  Other architectures who knows?

Yes. 


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 
> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.

I just found that MAX_ORDER would have to be limited to 12 if we use a
length instead of an order: if MAX_ORDER is configured to a value bigger
than 12, handling that case makes things more complex.

If we use an order, we need to break a large memory range whose length is
not a power of 2 into several small ranges (see the sketch below), which
also makes the code complex.
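Roughly, the split is just this (a sketch; it assumes each emitted chunk
must be naturally aligned to its order, and report_one() is a hypothetical
emit helper, not something from the patch):

#include <stdint.h>

void report_one(uint64_t pfn, unsigned int order);      /* hypothetical */

/* Break [pfn, pfn + nr_pages) into power-of-two, naturally aligned chunks. */
static void report_range_by_order(uint64_t pfn, uint64_t nr_pages)
{
        while (nr_pages) {
                unsigned int align = pfn ? __builtin_ctzll(pfn) : 63;
                unsigned int fit = 63 - __builtin_clzll(nr_pages);
                unsigned int order = align < fit ? align : fit;

                report_one(pfn, order);
                pfn += 1ULL << order;
                nr_pages -= 1ULL << order;
        }
}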

It seems we leave too many bits for the pfn and not enough bits for the
length. How about keeping 45 bits for the pfn and 19 bits for the length?
45 bits for the pfn can cover a 57-bit physical address, which should be
enough in the near future.

What's your opinion?


Thanks!
Liang

 


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-14 Thread Li, Liang Z
> fast (de)inflating & fast live migration
> 
> Hello,
> 
> On Fri, Dec 09, 2016 at 05:35:45AM +, Li, Liang Z wrote:
> > > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > > What's the conclusion of your discussion? It seems you want some
> > > > statistic before deciding whether to  ripping the bitmap from the
> > > > ABI, am I right?
> > >
> > > I think Andrea and David feel pretty strongly that we should remove
> > > the bitmap, unless we have some data to support keeping it.  I don't
> > > feel as strongly about it, but I think their critique of it is
> > > pretty valid.  I think the consensus is that the bitmap needs to go.
> > >
> >
> > Thanks for you clarification.
> >
> > > The only real question IMNHO is whether we should do a power-of-2 or
> > > a length.  But, if we have 12 bits, then the argument for doing
> > > length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > So each item can max represent 16MB Bytes, seems not big enough, but
> > enough for most case.
> > Things became much more simple without the bitmap, and I like simple
> > solution too. :)
> >
> > I will prepare the v6 and remove all the bitmap related stuffs. Thank you 
> > all!
> 
> Sounds great!
> 
> I suggested to check the statistics, because collecting those stats looked
> simpler and quicker than removing all bitmap related stuff from the patchset.
> However if you prefer to prepare a v6 without the bitmap another perhaps
> more interesting way to evaluate the usefulness of the bitmap is to just run
> the same benchmark and verify that there is no regression compared to the
> bitmap enabled code.
> 
> The other issue with the bitmap is, the best case for the bitmap is ever less
> likely to materialize the more RAM is added to the guest. It won't regress
> linearly because after all there can be some locality bias in the buddy 
> splits,
> but if sync compaction is used in the large order allocations tried before
> reaching order 0, the bitmap payoff will regress close to linearly with the
> increase of RAM.
> 
> So it'd be good to check the stats or the benchmark on large guests, at least
> one hundred gigabytes or so.
> 
> Changing topic but still about the ABI features needed, so it may be relevant
> for this discussion:
> 
> 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
>memory from, using alloc_pages_node in guest. So you can ask to
>take X pages from vnode A, Y pages from vnode B, in one vmenter.
> 
> 2) allowing qemu to tell the guest to stop inflating the balloon and
>report a fragmentation limit being hit, when sync compaction
>powered allocations fails at certain power-of-two order granularity
>passed by qemu to the guest. This order constraint will be passed
>by default for hugetlbfs guests with 2MB hpage size, while it can
>be used optionally on THP backed guests. This option with THP
>guests would allow a highlevel management software to provide a
>"don't reduce guest performance" while shrinking the memory size of
>the guest from the GUI. If you deselect the option, you can shrink
>down to the last freeable 4k guest page, but doing so may have to
>split THP in the host (you don't know for sure if they were really
>THP but they could have been), and it may regress
>performance. Inflating the balloon while passing a minimum
>granularity "order" of the pages being zapped, will guarantee
>inflating the balloon cannot decrease guest performance
>instead. Plus it's needed for hugetlbfs anyway as far as I can
>tell. hugetlbfs would not be host enforceable even if the idea is
>not to free memory but only reduce the available memory of the
>guest (not without major changes that maps a hugetlb page with 4k
>ptes at least). While for a more cooperative usage of hugetlbfs
>guests, it's simply not useful to inflate the balloon at anything
>less than the "HPAGE_SIZE" hugetlbfs granularity.
> 
> We also plan to use userfaultfd to make the balloon driver host enforced (will
> work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to
> the ABI so it's not strictly relevant for this discussion.
> 
> On a side note, registering userfaultfd on the ballooned range, will keep
> khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED
> zapped sub-THP fragments no matter the sysfs tunings.
> 

Thanks for your elaboration!

> Thanks!
> Andrea


RE: [RFC PATCH 0/4] KVM: Emulate UMIP (or almost do so)

2016-12-12 Thread Li, Liang Z
> UMIP (User-Mode Instruction Prevention) is a feature of future Intel
> processors (Cannonlake?) that blocks SLDT, SGDT, STR, SIDT and SMSW from
> user-mode processes.
> 
> The idea here is to use virtualization intercepts to emulate UMIP; it slows
> down the instructions when they're executed in ring 0, but they are really
> never executed in practice.  On AMD systems it's possible to emulate it
> entirely; instead on Intel systems it's *almost* possible to emulate it,
> because SMSW doesn't cause a vmexit, and hence SMSW will not fault.
> 
> This patch series provides the infrastructure and implements it on Intel.  I
> tested it through kvm-unit-tests.
> 
> Still I think the idea is interesting, even if it's buggy for current Intel
> processors.  Any opinions?

Hi Paolo,

We intended to enable UMIP for KVM and found you had already worked on it.
Do you have any plan for this patch set going forward? Is there anything
else you expect us to help with?

Thanks!
Liang


RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think 
> the
> consensus is that the bitmap needs to go.
> 

Thanks for your clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
So each item can represent at most 16MB (2^12 pages of 4KB each), which
seems not that big, but is enough for most cases.
Things become much simpler without the bitmap, and I like a simple
solution too. :)

I will prepare the v6 and remove all the bitmap-related stuff. Thank you all!

Liang




RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> > 1. Current patches do a hypercall for each order in the allocator.
> >This is inefficient, but independent from the underlying data
> >structure in the ABI, unless bitmaps are in play, which they aren't.
> > 2. Should we have bitmaps in the ABI, even if they are not in use by the
> >guest implementation today?  Andrea says they have zero benefits
> >over a pfn/len scheme.  Dave doesn't think they have zero benefits
> >but isn't that attached to them.  QEMU's handling gets more
> >complicated when using a bitmap.
> > 3. Should the ABI contain records each with a pfn/len pair or a
> >pfn/order pair?
> >3a. 'len' is more flexible, but will always be a power-of-two anyway
> > for high-order pages (the common case)
> 
> Len wouldn't be a power of two practically only if we detect adjacent pages
> of smaller order that may merge into larger orders we already allocated (or
> the other way around).
> 
> [addr=2M, len=2M] allocated at order 9 pass [addr=4M, len=1M] allocated at
> order 8 pass -> merge as [addr=2M, len=3M]
> 
> Not sure if it would be worth it, but that unless we do this, page-order or 
> len
> won't make much difference.
> 
> >3b. if we decide not to have a bitmap, then we basically have plenty
> > of space for 'len' and should just do it
> >3c. It's easiest for the hypervisor to turn pfn/len into the
> >madvise() calls that it needs.
> >
> > Did I miss anything?
> 
> I think you summarized fine all my arguments in your summary.
> 
> > FWIW, I don't feel that strongly about the bitmap.  Li had one
> > originally, but I think the code thus far has demonstrated a huge
> > benefit without even having a bitmap.
> >
> > I've got no objections to ripping the bitmap out of the ABI.
> 
> I think we need to see a statistic showing the number of bits set in each
> bitmap in average, after some uptime and lru churn, like running stresstest
> app for a while with I/O and then inflate the balloon and
> count:
> 
> 1) how many bits were set vs total number of bits used in bitmaps
> 
> 2) how many times bitmaps were used vs bitmap_len = 0 case of single
>page
> 
> My guess would be like very low percentage for both points.
> 

> So there is a connection with the MAX_ORDER..0 allocation loop and the ABI
> change, but I agree any of the ABI proposed would still allow for it this 
> logic to
> be used. Bitmap or not bitmap, the loop would still work.

Hi guys,

What's the conclusion of your discussion?
It seems you want some statistics before deciding whether to rip the bitmap
out of the ABI, am I right?

Thanks!
Liang 


RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-08 Thread Li, Liang Z
> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 12/07/2016 05:35 AM, Li, Liang Z wrote:
> >> Am 30.11.2016 um 09:43 schrieb Liang Li:
> >> IOW in real examples, do we have really large consecutive areas or
> >> are all pages just completely distributed over our memory?
> >
> > The buddy system of Linux kernel memory management shows there
> should
> > be quite a lot of consecutive pages as long as there are a portion of
> > free memory in the guest.
> ...
> > If all pages just completely distributed over our memory, it means the
> > memory fragmentation is very serious, the kernel has the mechanism to
> > avoid this happened.
> 
> While it is correct that the kernel has anti-fragmentation mechanisms, I don't
> think it invalidates the question as to whether a bitmap would be too sparse
> to be effective.
> 
> > In the other hand, the inflating should not happen at this time
> > because the guest is almost 'out of memory'.
> 
> I don't think this is correct.  Most systems try to run with relatively 
> little free
> memory all the time, using the bulk of it as page cache.  We have no reason
> to expect that ballooning will only occur when there is lots of actual free
> memory and that it will not occur when that same memory is in use as page
> cache.
> 

Yes.
> In these patches, you're effectively still sending pfns.  You're just sending
> one pfn per high-order page which is giving a really nice speedup.  IMNHO,
> you're avoiding doing a real bitmap because creating a bitmap means either
> have a really big bitmap, or you would have to do some sorting (or multiple
> passes) of the free lists before populating a smaller bitmap.
> 
> Like David, I would still like to see some data on whether the choice between
> bitmaps and pfn lists is ever clearly in favor of bitmaps.  You haven't
> convinced me, at least, that the data isn't even worth collecting.

I will try to get some data with a real workload and share it with you guys.

Thanks!
Liang


RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-12-07 Thread Li, Liang Z
> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar: filled the balloon with 15GB in a 16GB idle guest.
Using the bitmap, the madvise() count was reduced to 605; using the PFNs,
the madvise() count was 3932160 (exactly 15GB / 4KB, i.e. one call per 4KB
page). It means there are quite a lot of consecutive bits set in the bitmap.
I didn't test a guest with a heavy memory workload.

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of the Linux kernel's memory management suggests there
should be quite a lot of consecutive pages as long as a portion of the
guest's memory is free.
If all pages were completely scattered over memory, the memory
fragmentation would be very serious, and the kernel has mechanisms to
avoid that happening.
On the other hand, inflating should not happen at that point, because the
guest would be almost 'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David


RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-05 Thread Li, Liang Z
> >>> + mutex_lock(&vb->balloon_lock);
> >>> +
> >>> + for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API 
> > simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> + if (ret == -ENOSPC) {
> >>> + void *new_resp_data;
> >>> +
> >>> + new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> + GFP_KERNEL);
> >>> + if (new_resp_data) {
> >>> + kfree(vb->resp_data);
> >>> + vb->resp_data = new_resp_data;
> >>> + vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed. Some filled data may exist for the previous zone. Should we
change the API to
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
int order, unsigned long *pos, struct zone *zone)'?

Then we can use 'zone' to record the zone to retry, and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> + __le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> + __le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> + __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks 
> take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the 
> extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code: just
reserve 'bmap_len == 0x3f' for future extension; it is not used by the
current code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> + unsigned long *unused_pages, unsigned long size,
> >>> + int order, unsigned long *pos)
> >>> +{
> >>> + unsigned long pfn, flags;
> >>> + unsigned int t;
> >>> + struct list_head *curr;
> >>> + struct page_info_item *info;
> >>> +
> >>> + if (zone_is_empty(zone))
> >>> + return 0;
> >>> +
> >>> + spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> + if (*pos + zone->free_area[order].nr_free > size)
> >>> + return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the 
> > worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
---
> So, what the nr_free pages limit where we no longer fit in the
> kmalloc()'d buffer where this simply won't work?
---
There is no such limit. If the buffer is not big enough to hold 'nr_free'
pages, get_unused_pages() will return -ENOSPC, and the calling code will
allocate a buffer twice the size and retry, until a buffer of the proper
size is allocated. The current order will not be skipped unless the
buffer allocation fails.
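Put together, the doubling retry described above looks roughly like this
(a sketch assembled from the snippets quoted in this thread, not a
verbatim copy of the patch):

        for (;;) {
                pos = 0;
                ret = get_unused_pages(vb->resp_data,
                                       vb->resp_buf_size / sizeof(unsigned long),
                                       order, &pos);
                if (ret != -ENOSPC)
                        break;          /* filled, or a real error */

                new_resp_data = kmalloc(2 * vb->resp_buf_size, GFP_KERNEL);
                if (!new_resp_data)
                        break;          /* cannot grow: give up on this order */

                kfree(vb->resp_data);
                vb->resp_data = new_resp_data;
                vb->resp_buf_size *= 2;
        }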

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.



RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

2016-12-04 Thread Li, Liang Z
> On 11/30/2016 12:43 AM, Liang Li wrote:
> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pos = 0;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret, order;
> > +
> > +   mutex_lock(&vb->balloon_lock);
> > +
> > +   for (order = MAX_ORDER - 1; order >= 0; order--) {
> 
> I scratched my head for a bit on this one.  Why are you walking over orders,
> *then* zones.  I *think* you're doing it because you can efficiently fill the
> bitmaps at a given order for all zones, then move to a new bitmap.  But, it
> would be interesting to document this.
> 

Yes, using the order is somewhat strange, but it helps to keep the API simple.
Do you think it's acceptable?

> > +   pos = 0;
> > +   ret = get_unused_pages(vb->resp_data,
> > +   vb->resp_buf_size / sizeof(unsigned long),
> > +   order, &pos);
> 
> FWIW, get_unsued_pages() is a pretty bad name.  "get" usually implies
> bumping reference counts or consuming something.  You're just "recording"
> or "marking" them.
> 

Will change to mark_unused_pages().

> > +   if (ret == -ENOSPC) {
> > +   void *new_resp_data;
> > +
> > +   new_resp_data = kmalloc(2 * vb->resp_buf_size,
> > +   GFP_KERNEL);
> > +   if (new_resp_data) {
> > +   kfree(vb->resp_data);
> > +   vb->resp_data = new_resp_data;
> > +   vb->resp_buf_size *= 2;
> 
> What happens to the data in ->resp_data at this point?  Doesn't this just
> throw it away?
> 

Yes, so we should make sure the data in resp_data is not in use.

> ...
> > +struct page_info_item {
> > +   __le64 start_pfn : 52; /* start pfn for the bitmap */
> > +   __le64 page_shift : 6; /* page shift width, in bytes */
> > +   __le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> 
> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> 

Currently we only use 8-byte and 0-byte bitmaps; should we support more
than 64 bytes?

> > +static int  mark_unused_pages(struct zone *zone,
> > +   unsigned long *unused_pages, unsigned long size,
> > +   int order, unsigned long *pos)
> > +{
> > +   unsigned long pfn, flags;
> > +   unsigned int t;
> > +   struct list_head *curr;
> > +   struct page_info_item *info;
> > +
> > +   if (zone_is_empty(zone))
> > +   return 0;
> > +
> > +   spin_lock_irqsave(&zone->lock, flags);
> > +
> > +   if (*pos + zone->free_area[order].nr_free > size)
> > +   return -ENOSPC;
> 
> Urg, so this won't partially fill?  So, what the nr_free pages limit where we 
> no
> longer fit in the kmalloc()'d buffer where this simply won't work?
> 

Yes. My initial implementation filled partially, which is better for the
worst case.
I thought the above code would be more efficient for most cases...
Do you think partially filling the bitmap is better?
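For reference, a partially filling variant of the loop quoted below might
look like this (a sketch only: it stops when the buffer is full, keeps
what was already recorded via *pos, and leaves it to the caller to grow
the buffer and rescan):

        for (t = 0; t < MIGRATE_TYPES; t++) {
                list_for_each(curr, &zone->free_area[order].free_list[t]) {
                        if (*pos >= size) {
                                spin_unlock_irqrestore(&zone->lock, flags);
                                return -ENOSPC; /* buffer full, *pos kept */
                        }
                        pfn = page_to_pfn(list_entry(curr, struct page, lru));
                        info = (struct page_info_item *)(unused_pages + *pos);
                        info->start_pfn = pfn;
                        info->page_shift = order + PAGE_SHIFT;
                        *pos += 1;
                }
        }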
 
> > +   for (t = 0; t < MIGRATE_TYPES; t++) {
> > +   list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +   pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +   info = (struct page_info_item *)(unused_pages + *pos);
> > +   info->start_pfn = pfn;
> > +   info->page_shift = order + PAGE_SHIFT;
> > +   *pos += 1;
> > +   }
> > +   }
> 
> Do we need to fill in ->bmap_len here?

For integrity, bmap_len should be filled in; I will add that.
This step was omitted only because QEMU assumes ->bmap_len is 0 and
ignores this field.

Thanks for your comment!

Liang


RE: [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-07 Thread Li, Liang Z
> On 11/06/2016 07:37 PM, Li, Liang Z wrote:
> >> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> >> On a 1TB system, that's 256 passes through the top-level loop.
> >> The bottom-level lists have tens of thousands of pages in them, even
> >> on my laptop.  Only 1/256 of these pages will get consumed in a given pass.
> >>
> > Your description is not exact.
> > A 32k bitmap is used only when there is little free memory left in the
> > system and extend_page_bitmap() failed to allocate more memory for the
> > bitmap. Otherwise dozens of 32k split bitmaps will be used; this version
> > limits the bitmap count to 32, which means we can use at most 32*32 kB
> > for the bitmaps, which can cover 128GB of RAM. We can increase the
> > bitmap count limit to a larger value if 32 is not big enough.
> 
> OK, so it tries to allocate a large bitmap.  But, if it fails, it will try to 
> work with a
> smaller bitmap.  Correct?
> 
Yes.

> So, what's the _worst_ case?  It sounds like it is even worse than I was
> positing.
> 

The worst case is when only a 32KB bitmap can be allocated and there is a huge
number of low-order (<3) free pages; a 32KB bitmap covers 256K PFNs (1GB with
4KB pages) per pass.

> >> That's an awfully inefficient way of doing it.  This patch
> >> essentially changed the data structure without changing the algorithm to
> populate it.
> >>
> >> Please change the *algorithm* to use the new data structure efficiently.
> >>  Such a change would only do a single pass through each freelist, and
> >> would choose whether to use the extent-based (pfn -> range) or
> >> bitmap-based approach based on the contents of the free lists.
> >
> > Saving the free page info to a raw bitmap first and then processing the
> > raw bitmap to get the proper 'extent-based' and 'bitmap-based' forms is
> > the most efficient way I can come up with to reduce the virtio data
> > transmission. Do you have a better idea?
> 
> That's kinda my point.  This patch *does* processing to try to pack the
> bitmaps full of pages from the various pfn ranges.  It's a form of processing
> that gets *REALLY*, *REALLY* bad in some (admittedly obscure) cases.
> 
> Let's not pretend that making an essentially unlimited number of passes over
> the free lists is not processing.
> 
> 1. Allocate as large of a bitmap as you can. (what you already do)
> 2. Iterate from the largest freelist order.  Store those pages in the
>    bitmap.
> 3. If you can no longer fit pages in the bitmap, return the list that
>    you have.
> 4. Make an approximation about where the bitmap does not make sense any
>    more, and fall back to listing individual PFNs.  This would make sense,
>    for instance, in a large zone with very few free order-0 pages left.
> 
Sounds good. Should we ignore some of the order-0 pages in step 4 if the
bitmap is full? Or should we retry to get a complete list of order-0 pages?
(Sketch below.)
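To make sure I understand, a rough sketch of those four steps (the helper
name and the packed pfn/order record are only illustrative, not the
proposed ABI):

static int encode_free_pages(struct zone *zone, u64 *buf,
		unsigned long size, unsigned long *pos)
{
	unsigned long flags;
	struct page *page;
	unsigned int t;
	int order;

	spin_lock_irqsave(&zone->lock, flags);
	/* Step 2: walk the freelists from the largest order down. */
	for (order = MAX_ORDER - 1; order >= 0; order--) {
		for (t = 0; t < MIGRATE_TYPES; t++) {
			list_for_each_entry(page,
					&zone->free_area[order].free_list[t], lru) {
				if (*pos >= size) {
					/* Step 3: buffer full, return what we have. */
					spin_unlock_irqrestore(&zone->lock, flags);
					return -ENOSPC;
				}
				/*
				 * Step 4 degenerates to the same record for
				 * order 0: one u64 holding pfn and order.
				 */
				buf[(*pos)++] = (u64)page_to_pfn(page) << 6 | order;
			}
		}
	}
	spin_unlock_irqrestore(&zone->lock, flags);
	return 0;
}

Step 1 (allocating as large a buffer as possible) would stay in the caller,
as it is today.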

> 
> > It seems the benefit we get for this feature is not as big as that in fast
> balloon inflating/deflating.
> >>
> >> You should not be using get_max_pfn().  Any patch set that continues
> >> to use it is not likely to be using a proper algorithm.
> >
> > Do you have any suggestion about how to avoid it?
> 
> Yes: get the pfns from the page free lists alone.  Don't derive them from the
> pfn limits of the system or zones.

get_max_pfn() can be avoided in this patch, but I don't think we can avoid it
completely. We need it as a hint for allocating a properly sized bitmap. No?

Thanks!
Liang


RE: [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-06 Thread Li, Liang Z
> Please squish this and patch 5 together.  It makes no sense to separate them.
> 

OK.

> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +   unsigned long req_id)
> > +{
> > +   struct scatterlist sg_in;
> > +   unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
> > +   struct virtqueue *vq = vb->req_vq;
> > +   struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +   int ret = 1, used_nr_bmap = 0, i;
> > +
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP) &&
> > +   vb->nr_page_bmap == 1)
> > +   extend_page_bitmap(vb);
> > +
> > +   pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
> > +   mutex_lock(>balloon_lock);
> > +   last_pfn = get_max_pfn();
> > +
> > +   while (ret) {
> > +   clear_page_bitmap(vb);
> > +           ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
> > +                                  PFNS_PER_BMAP, vb->nr_page_bmap);
> 
> This changed the underlying data structure without changing the way that
> the structure is populated.
> 
> This algorithm picks a "PFNS_PER_BMAP * vb->nr_page_bmap"-sized set of
> pfns, allocates a bitmap for them, the loops through all zones looking for
> pages in any free list that are in that range.
> 
> Unpacking all the indirection, it looks like this:
> 
> for (pfn = 0; pfn < get_max_pfn(); pfn += BITMAP_SIZE_IN_PFNS)
>   for_each_populated_zone(zone)
>       for_each_migratetype_order(order, t)
>           list_for_each(..., &zone->free_area[order])...
> 
> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> On a 1TB system, that's 256 passes through the top-level loop.
> The bottom-level lists have tens of thousands of pages in them, even on my
> laptop.  Only 1/256 of these pages will get consumed in a given pass.
> 
Your description is not exact.
A 32k bitmap is used only when there is little free memory left in the system
and extend_page_bitmap() failed to allocate more memory for the bitmap.
Otherwise dozens of 32k split bitmaps will be used; this version limits the
bitmap count to 32, which means we can use at most 32*32 kB for the bitmaps,
which can cover 128GB of RAM. We can increase the bitmap count limit to a
larger value if 32 is not big enough.

> That's an awfully inefficient way of doing it.  This patch essentially changed
> the data structure without changing the algorithm to populate it.
> 
> Please change the *algorithm* to use the new data structure efficiently.
>  Such a change would only do a single pass through each freelist, and would
> choose whether to use the extent-based (pfn -> range) or bitmap-based
> approach based on the contents of the free lists.

Saving the free page info to a raw bitmap first and then processing the raw
bitmap to get the proper 'extent-based' and 'bitmap-based' forms is the most
efficient way I can come up with to reduce the virtio data transmission. Do
you have a better idea?


In QEMU, no matter how we encode the bitmap, the raw-format bitmap will be
used in the end. But what I did in this version is:
   kernel: get the raw bitmap  --> encode the bitmap
   QEMU: decode the bitmap --> get the raw bitmap

Is it worth doing this kind of work here? We save virtio data transmission,
but at the same time we do extra work. (A sketch of the encode step is below.)
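For reference, the encode step is essentially run-length encoding of the raw
bitmap. A sketch, where the (start_pfn, nr_pages) record layout is just an
example and not the proposed wire format:

/* Returns the number of u64 words written, or -ENOSPC if out is too small. */
static int encode_bitmap(const unsigned long *bmap, unsigned long nbits,
		unsigned long start_pfn, u64 *out, int max_out)
{
	unsigned long bit = 0, end;
	int n = 0;

	for (;;) {
		bit = find_next_bit(bmap, nbits, bit);
		if (bit >= nbits)
			break;
		end = find_next_zero_bit(bmap, nbits, bit);
		if (n + 2 > max_out)
			return -ENOSPC;
		out[n++] = start_pfn + bit;	/* first pfn of the run */
		out[n++] = end - bit;		/* run length, in pages */
		bit = end;
	}
	return n;
}

QEMU would run the inverse loop to rebuild the raw bitmap.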

It seems the benefit we get from this feature is not as big as that of the
fast balloon inflating/deflating.
> 
> You should not be using get_max_pfn().  Any patch set that continues to use
> it is not likely to be using a proper algorithm.

Do you have any suggestions about how to avoid it?

Thanks!
Liang



RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/26/2016 03:06 AM, Li, Liang Z wrote:
> > I am working on Dave's new bitmap schema, I have finished the part of
> > getting the 'hybrid scheme bitmap' and found the complexity was more
> > than I expected. The main issue is more memory is required to save the
> > 'hybrid scheme bitmap' beside that used to save the raw page bitmap,
> > for the worst case, the memory required is 3 times than that in the
> > previous implementation.
> 
> Really?  Could you please describe the scenario where this occurs?
> > I am wondering if I should continue, as an alternative solution, how
> > about using PFNs array when inflating/deflating only a few pages?
> > Things will be much more simple.
> 
> Yes, using pfn lists is more efficient than using bitmaps for sparse bitmaps.
> Yes, there will be cases where it is preferable to just use pfn lists vs. any 
> kind
> of bitmap.
> 
> But, what does it matter?  At least with your current scheme where we go
> out and collect get_unused_pages(), we do the allocation up front.  The
> space efficiency doesn't matter at all for small sizes since we do the 
> constant-
> size allocation *anyway*.
> 
> I'm also pretty sure you can pack the pfn and page order into a single 64-bit
> word and have no bitmap for a given record.  That would make it pack just as
> well as the old pfns alone.  Right?

Yes, thanks for the reminder. I am using 128 bits now; I will change it to
64 bits. Let me finish the v4 first. (A sketch of the 64-bit packing is
below.)

Thanks!
Liang
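For the record, the 64-bit packing could look like this (the 6-bit order
field is an assumption for illustration, not the final ABI):

#define RECORD_ORDER_BITS	6
#define RECORD_ORDER_MASK	((1ULL << RECORD_ORDER_BITS) - 1)

/* 58 bits of pfn plus 6 bits of buddy order in one word. */
static inline u64 pack_record(unsigned long pfn, unsigned int order)
{
	return ((u64)pfn << RECORD_ORDER_BITS) | (order & RECORD_ORDER_MASK);
}

static inline unsigned long record_pfn(u64 rec)
{
	return rec >> RECORD_ORDER_BITS;
}

static inline unsigned int record_order(u64 rec)
{
	return rec & RECORD_ORDER_MASK;
}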

 


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> Cc: linux-kernel@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> linux...@kvack.org; virtio-...@lists.oasis-open.org; k...@vger.kernel.org;
> qemu-de...@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> pbonz...@redhat.com; cornelia.h...@de.ibm.com; amit.s...@redhat.com
> Subject: Re: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> (de)inflating & fast live migration
> 
> On 10/26/2016 03:13 AM, Li, Liang Z wrote:
> > 3 times memory required is not accurate, please ignore this. sorry ...
> > The complexity is the point.
> 
> What is making it so complex?  Can you describe the problems?

I plan to complete it first and send out the patch set, then discuss whether
it is worth it. I need some time.

Thanks!
Liang


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has with
> a third version?
> >
> 
> Hi Dave & Michael,
> 
> I am working on Dave's new bitmap schema, I have finished the part of
> getting the 'hybrid scheme bitmap'
> and found the complexity was more than I expected.  The main issue is more
> memory is required to  save the 'hybrid scheme bitmap' beside that used to
> save the raw page bitmap, for the worst case, the memory required is 3
> times than that in the previous implementation.
> 

The '3 times the memory' figure is not accurate, please ignore it, sorry...
The complexity is the point.

> I am wondering if I should continue, as an alternative solution, how about
> using PFNs array when inflating/deflating only a few pages? Things will be
> much more simple.
> 
> 
> Thanks!
> Liang



RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-26 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> 
> So, what do you want done with this patch set?  Do you want it applied as-is
> so that we can introduce a new host/guest ABI that we must support until
> the end of time?  Then, we go back in a year or two and add the newer
> format that addresses the deficiencies that this ABI has with a third version?
> 

Hi Dave & Michael,

I am working on Dave's new bitmap scheme. I have finished the part that
builds the 'hybrid scheme bitmap' and found the complexity was more than I
expected. The main issue is that more memory is required to save the 'hybrid
scheme bitmap' besides what is used to save the raw page bitmap; in the worst
case, the memory required is 3 times that of the previous implementation.

I am wondering if I should continue. As an alternative solution, how about
using a PFN array when inflating/deflating only a few pages? Things would be
much simpler.


Thanks!
Liang 





RE: [RESEND PATCH v3 kernel 4/7] virtio-balloon: speed up inflate/deflate process

2016-10-25 Thread Li, Liang Z
> > +static inline void init_pfn_range(struct virtio_balloon *vb) {
> > +   vb->min_pfn = ULONG_MAX;
> > +   vb->max_pfn = 0;
> > +}
> > +
> > +static inline void update_pfn_range(struct virtio_balloon *vb,
> > +struct page *page)
> > +{
> > +   unsigned long balloon_pfn = page_to_balloon_pfn(page);
> > +
> > +   if (balloon_pfn < vb->min_pfn)
> > +   vb->min_pfn = balloon_pfn;
> > +   if (balloon_pfn > vb->max_pfn)
> > +   vb->max_pfn = balloon_pfn;
> > +}
> > +
> 
> rename to hint these are all bitmap related.

Will change in v4.

> 
> 
> >  static void tell_host(struct virtio_balloon *vb, struct virtqueue
> > *vq)  {
> > -   struct scatterlist sg;
> > -   unsigned int len;
> > +   struct scatterlist sg, sg2[BALLOON_BMAP_COUNT + 1];
> > +   unsigned int len, i;
> > +
> > +   if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +   unsigned long bmap_len;
> > +   int nr_pfn, nr_used_bmap, nr_buf;
> > +
> > +   nr_pfn = vb->end_pfn - vb->start_pfn + 1;
> > +   nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
> > +   nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
> > +   bmap_len = nr_pfn / BITS_PER_BYTE;
> > +   nr_buf = nr_used_bmap + 1;
> > +
> > +   /* cmd, reserved and req_id are init to 0, unused here */
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
> > +   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +   sg_init_table(sg2, nr_buf);
> > +           sg_set_buf(&sg2[0], hdr, sizeof(struct balloon_bmap_hdr));
> > +           for (i = 0; i < nr_used_bmap; i++) {
> > +                   unsigned int buf_len = BALLOON_BMAP_SIZE;
> > +
> > +                   if (i + 1 == nr_used_bmap)
> > +                           buf_len = bmap_len - BALLOON_BMAP_SIZE * i;
> > +                   sg_set_buf(&sg2[i + 1], vb->page_bitmap[i], buf_len);
> > +   }
> >
> > -   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   while (vq->num_free < nr_buf)
> > +   msleep(2);
> 
> 
> What's going on here? Who is expected to update num_free?
> 

I just want to wait until the vq has enough space to write the bitmap. I
thought the QEMU side would update vq->num_free; is that wrong? (See the
note below.)
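For reference, vq->num_free is driver-side bookkeeping: the host never
writes it, and it only grows when the driver itself reclaims used
descriptors with virtqueue_get_buf(). So a sketch of waiting without a
blind sleep, assuming the existing balloon_ack callback wakes vb->acked:

	unsigned int len;

	while (vq->num_free < nr_buf)
		wait_event(vb->acked, virtqueue_get_buf(vq, &len));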

> 
> 
> > +           if (virtqueue_add_outbuf(vq, sg2, nr_buf, vb, GFP_KERNEL) == 0)
> > +   virtqueue_kick(vq);
> >
> > -   /* We should always be able to add one buffer to an empty queue. */
> > -   virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > -   virtqueue_kick(vq);
> > +   } else {
> > +           sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +
> > +   /* We should always be able to add one buffer to an empty
> > +* queue. */
> 
> Pls use a multi-line comment style consistent with kernel coding style.

Will change in next version.

> 
> > +           virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > +   virtqueue_kick(vq);
> > +   }
> >
> > /* When host has read buffer, this completes via balloon_ack */
> > wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > @@ -138,13 +199,93 @@ static void set_page_pfns(struct virtio_balloon *vb,
> >   page_to_balloon_pfn(page) + i);
> > }
> >
> > -static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
> > +static void extend_page_bitmap(struct virtio_balloon *vb) {
> > +   int i;
> > +   unsigned long bmap_len, bmap_count;
> > +
> > +   bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
> > +   bmap_count = bmap_len / BALLOON_BMAP_SIZE;
> > +   if (bmap_len % BALLOON_BMAP_SIZE)
> > +   bmap_count++;
> > +   if (bmap_count > BALLOON_BMAP_COUNT)
> > +   bmap_count = BALLOON_BMAP_COUNT;
> > +
> 
> This is doing simple things in tricky ways.
> Please use macros such as ALIGN and max instead of if.
> 

Will change.

> 
> > +   for (i = 1; i < bmap_count; i++) {
> 
> why 1?

In the probe stage, one bitmap has already been allocated.

> 
> > +           vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_ATOMIC);
> 
> why GFP_ATOMIC?

Yes, GFP_ATOMIC is not necessary.

> and what will free the previous buffer?

The previous buffer will not be freed.

> 
> 
> > +   if (vb->page_bitmap[i])
> > +   vb->nr_page_bmap++;
> > +   else
> > +   break;
> 
> and what will happen then?

I plan to use the previously allocated buffers to save the bitmap. Is more
code needed to handle the kmalloc failure? (A sketch is below.)
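Putting both points together, a sketch of extend_page_bitmap() with the
sizing done by standard macros and kmalloc() failure tolerated (the caller
then simply works with however many bitmaps vb->nr_page_bmap says were
allocated):

static void extend_page_bitmap(struct virtio_balloon *vb)
{
	unsigned long bmap_len, bmap_count;
	int i;

	bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
	bmap_count = min_t(unsigned long,
			DIV_ROUND_UP(bmap_len, BALLOON_BMAP_SIZE),
			BALLOON_BMAP_COUNT);

	/* Bitmap 0 was already allocated at probe time, so start from 1. */
	for (i = 1; i < bmap_count; i++) {
		vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
		if (!vb->page_bitmap[i])
			break;	/* use however many we already have */
		vb->nr_page_bmap++;
	}
}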

> > -static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
> > +static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num,
> > +   bool use_bmap)
> 
> this is just a feature bit - why not get it internally?

Indeed.

> > @@ -218,8 +374,14 @@ static unsigned 

RE: [RESEND PATCH v3 kernel 3/7] mm: add a function to get the max pfn

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Expose the function to get the max pfn, so it can be used in the
> > virtio-balloon device driver. Simply including 'linux/bootmem.h'
> > is not enough; if the device driver is built as a module, directly
> > referring to max_pfn leads to a build failure.
> 
> I'm not sure the rest of the set is worth reviewing.  I think a lot of it will
> change pretty fundamentally once you have those improved data structures
> in place.

That's true. I will send out the v4 as soon as possible.

Liang


RE: [RESEND PATCH v3 kernel 2/7] virtio-balloon: define new feature bit and page bitmap head

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Add a new feature which supports sending the page information with a
> > bitmap. The current implementation uses PFNs array, which is not very
> > efficient. Using bitmap can improve the performance of
> > inflating/deflating significantly
> 
> Why is it not efficient?  How is using a bitmap more efficient?  What kinds of
> cases is the bitmap inefficient?
> 
> > The page bitmap header will be used to tell the host some information
> > about the page bitmap. e.g. the page size, page bitmap length and
> > start pfn.
> 
> Why did you choose to add these features to the structure?  What benefits
> do they add?
> 
> Could you describe your solution a bit here, and describe its strengths and
> weaknesses?
> 

Will elaborate the solution in V4.

> >  /* Size of a PFN in the balloon interface. */
> >  #define VIRTIO_BALLOON_PFN_SHIFT 12
> > @@ -82,4 +83,22 @@ struct virtio_balloon_stat {
> >     __virtio64 val;
> >  } __attribute__((packed));
> >
> > +/* Page bitmap header structure */
> > +struct balloon_bmap_hdr {
> > +   /* Used to distinguish different request */
> > +   __virtio16 cmd;
> > +   /* Shift width of page in the bitmap */
> > +   __virtio16 page_shift;
> > +   /* flag used to identify different status */
> > +   __virtio16 flag;
> > +   /* Reserved */
> > +   __virtio16 reserved;
> > +   /* ID of the request */
> > +   __virtio64 req_id;
> > +   /* The pfn of 0 bit in the bitmap */
> > +   __virtio64 start_pfn;
> > +   /* The length of the bitmap, in bytes */
> > +   __virtio64 bmap_len;
> > +};
> 
> FWIW this is totally unreadable.  Please do something like this:
> 
> > +struct balloon_bmap_hdr {
> > +   __virtio16 cmd; /* Used to distinguish different ...
> > +   __virtio16 page_shift;  /* Shift width of page in the bitmap */
> > +   __virtio16 flag;/* flag used to identify different...
> > +   __virtio16 reserved;/* Reserved */
> > +   __virtio64 req_id;  /* ID of the request */
> > +   __virtio64 start_pfn;   /* The pfn of 0 bit in the bitmap */
> > +   __virtio64 bmap_len;/* The length of the bitmap, in bytes */
> > +};
> 
> and please make an effort to add useful comments.  "/* Reserved */"
> seems like a waste of bytes to me.

OK. Maybe 'padding' is better than 'reserved'; see below.
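Something like this, with fuller comments (a sketch, not necessarily the
final layout):

/* Page bitmap header: tells the host how to interpret the bitmap. */
struct balloon_bmap_hdr {
	__virtio16 cmd;		/* distinguishes different requests */
	__virtio16 page_shift;	/* shift width of a page in the bitmap */
	__virtio16 flag;	/* identifies different states of a request */
	__virtio16 padding;	/* unused; keeps the header 8-byte aligned */
	__virtio64 req_id;	/* ID of the request */
	__virtio64 start_pfn;	/* pfn corresponding to bit 0 of the bitmap */
	__virtio64 bmap_len;	/* length of the bitmap, in bytes */
};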

Thanks for your comments!

Liang


RE: [RESEND PATCH v3 kernel 1/7] virtio-balloon: rework deflate to add page to a list

2016-10-24 Thread Li, Liang Z
> On 10/20/2016 11:24 PM, Liang Li wrote:
> > Will allow faster notifications using a bitmap down the road.
> > balloon_pfn_to_page() can be removed because it's useless.
> 
> This is a pretty terse description of what's going on here.  Could you try to
> elaborate a bit?  What *is* the current approach?  Why does it not work
> going forward?  What do you propose instead?  Why is it better?

Sure. The description will be clearer if it's written as you suggest.
Thanks!

Liang 


RE: [RESEND PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-10-23 Thread Li, Liang Z
> On Fri, Oct 21, 2016 at 10:25:21AM -0700, Dave Hansen wrote:
> > On 10/20/2016 11:24 PM, Liang Li wrote:
> > > Dave Hansen suggested a new scheme to encode the data structure,
> > > because of additional complexity, it's not implemented in v3.
> >
> > So, what do you want done with this patch set?  Do you want it applied
> > as-is so that we can introduce a new host/guest ABI that we must
> > support until the end of time?  Then, we go back in a year or two and
> > add the newer format that addresses the deficiencies that this ABI has
> > with a third version?
> >
> 
> Exactly my questions.

Hi Dave & Michael,

In v2, both of you thought that the memory I allocated for the bitmap was too
large and gave some suggestions about the solution, so I changed the
implementation to use scattered pages for the bitmap instead of a large
physically contiguous region. I didn't get comments on those changes, so I am
not sure whether they are OK or not; that's why I resent the v3. I just want
your opinions about that part.

I will implement the new scheme as Dave suggested in v4. Before that, could
you take a look at this version and give some comments?

Thanks!
Liang



RE: [PATCH v2 1/5] mmu: extend the is_present check to 32 bits

2016-09-01 Thread Li, Liang Z
The Intel SDM doesn't describe whether the A bit will be set or not when the
CPU accesses a non-present EPT page table entry. Even if this patch works on
current CPUs, it's not good to make such an assumption.

Should we revert it?

Thanks!
Liang


> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org]
> On Behalf Of Bandan Das
> Sent: Wednesday, July 13, 2016 6:19 AM
> To: k...@vger.kernel.org
> Cc: pbonz...@redhat.com; guangrong.x...@linux.intel.com;
> kernel...@gmail.com; linux-kernel@vger.kernel.org
> Subject: [PATCH v2 1/5] mmu: extend the is_present check to 32 bits
> 
> This is safe because this function is called on host controlled page table and
> non-present/non-MMIO sptes never use bits 1..31. For the EPT case, this
> ensures that cases where only the execute bit is set is marked valid.
> 
> Signed-off-by: Bandan Das 
> ---
>  arch/x86/kvm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index def97b3..87b62dc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -304,7 +304,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
> 
>  static int is_shadow_present_pte(u64 pte)
>  {
> -	return pte & PT_PRESENT_MASK && !is_mmio_spte(pte);
> +	return (pte & 0xFFFFFFFFull) && !is_mmio_spte(pte);
>  }
> 
>  static int is_large_pte(u64 pte)
> --
> 2.5.5


RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> 2016-08-08 14:35 GMT+08:00 Liang Li :
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> 
> I just read the slides of this feature for recent kvm forum, the cloud
> providers more care about live migration downtime to avoid customers'
> perception than total time, however, this feature will increase downtime
> when acquire the benefit of reducing total time, maybe it will be more
> acceptable if there is no downside for downtime.
> 
> Regards,
> Wanpeng Li

In theory, there is no factor that should increase the downtime: there is no
additional operation and no extra data copy during the stop-and-copy stage.
But in the test, the downtime increases, and this can be reproduced. I think
the busy network link may be the reason. With this optimization, a huge
amount of data is written to the socket in a shorter time, so some of the
write operations may need to wait. Without this optimization, zero-page
checking takes more time and the network is not as busy.

If the guest is not an idle one, I think the gap in the downtime will not be
so obvious. Anyway, the downtime is still less than the max_down_time set by
the user.

Thanks!
Liang


RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-31 Thread Li, Liang Z
Hi Michael,

I know you are very busy. If you have time, could you help to take a look at 
this patch set?

Thanks!
Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Thursday, August 18, 2016 9:06 AM
> To: Michael S. Tsirkin
> Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-de...@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; linux-
> ker...@vger.kernel.org
> Subject: RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> Hi Michael,
> 
> Could you help to review this version when you have time?
> 
> Thanks!
> Liang
> 
> > -Original Message-
> > From: Li, Liang Z
> > Sent: Monday, August 08, 2016 2:35 PM
> > To: linux-kernel@vger.kernel.org
> > Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org;
> > virtio- d...@lists.oasis-open.org; k...@vger.kernel.org;
> > qemu-de...@nongnu.org; quint...@redhat.com; dgilb...@redhat.com;
> > Hansen, Dave; Li, Liang Z
> > Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast
> > (de)inflating & fast live migration
> >
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> >
> > Another change is for speeding up live migration. By skipping process
> > guest's free pages in the first round of data copy, to reduce needless
> > data processing, this can help to save quite a lot of CPU cycles and
> > network bandwidth. We put guest's free page information in bitmap and
> > send it to host with the virt queue of virtio-balloon. For an idle 8GB
> > guest, this can help to shorten the total live migration time from
> > 2Sec to about 500ms in the 10Gbps network environment.
> >
> > Dave Hansen suggested a new scheme to encode the data structure,
> > because of additional complexity, it's not implemented in v3.
> >
> > Changes from v2 to v3:
> > * Change the name of 'free page' to 'unused page'.
> > * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> > * Fix overwriting the page bitmap after kicking.
> > * Some of MST's comments for v2.
> >
> > Changes from v1 to v2:
> > * Abandon the patch for dropping page cache.
> > * Put some structures to uapi head file.
> > * Use a new way to determine the page bitmap size.
> > * Use a unified way to send the free page information with the bitmap
> > * Address the issues referred in MST's comments
> >
> >
> > Liang Li (7):
> >   virtio-balloon: rework deflate to add page to a list
> >   virtio-balloon: define new feature bit and page bitmap head
> >   mm: add a function to get the max pfn
> >   virtio-balloon: speed up inflate/deflate process
> >   mm: add the related functions to get unused page
> >   virtio-balloon: define feature bit and head for misc virt queue
> >   virtio-balloon: tell host vm's unused page info
> >
> >  drivers/virtio/virtio_balloon.c | 390
> > 
> >  include/linux/mm.h  |   3 +
> >  include/uapi/linux/virtio_balloon.h |  41 
> >  mm/page_alloc.c |  94 +
> >  4 files changed, 485 insertions(+), 43 deletions(-)
> >
> > --
> > 1.8.3.1



RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-17 Thread Li, Liang Z
Hi Michael,

Could you help to review this version when you have time? 

Thanks!
Liang

> -----Original Message-----
> From: Li, Liang Z
> Sent: Monday, August 08, 2016 2:35 PM
> To: linux-kernel@vger.kernel.org
> Cc: virtualizat...@lists.linux-foundation.org; linux...@kvack.org; virtio-
> d...@lists.oasis-open.org; k...@vger.kernel.org; qemu-de...@nongnu.org;
> quint...@redhat.com; dgilb...@redhat.com; Hansen, Dave; Li, Liang Z
> Subject: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two sets of changes to the virtio-balloon.
> 
> One is the change for speeding up the inflating & deflating process; the main
> idea of this optimization is to use a bitmap to send the page information to
> the host instead of PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve the performance
> by about 85%.
> 
> Another change is for speeding up live migration. By skipping the processing
> of the guest's free pages in the first round of data copy, we reduce needless
> data processing; this can help to save quite a lot of CPU cycles and network
> bandwidth. We put the guest's free page information in a bitmap and send it
> to the host over the virtio-balloon virtqueue. For an idle 8GB guest, this
> can help to shorten the total live migration time from 2s to about 500ms in
> a 10Gbps network environment.
> 
> Dave Hansen suggested a new scheme to encode the data structure; because of
> the additional complexity, it's not implemented in v3.
> 
> Changes from v2 to v3:
> * Change the name of 'free page' to 'unused page'.
> * Use the scatter & gather bitmap instead of a 1MB page bitmap.
> * Fix overwriting the page bitmap after kicking.
> * Some of MST's comments for v2.
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   mm: add the related functions to get unused page
>   virtio-balloon: define feature bit and head for misc virt queue
>   virtio-balloon: tell host vm's unused page info
> 
>  drivers/virtio/virtio_balloon.c | 390
>  include/linux/mm.h  |   3 +
>  include/uapi/linux/virtio_balloon.h |  41 
>  mm/page_alloc.c |  94 +
>  4 files changed, 485 insertions(+), 43 deletions(-)
> 
> --
> 1.8.3.1



RE: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-08-08 Thread Li, Liang Z
> Subject: Re: [PATCH v3 kernel 0/7] Extend virtio-balloon for fast 
> (de)inflating
> & fast live migration
> 
> On 08/07/2016 11:35 PM, Liang Li wrote:
> > Dave Hansen suggested a new scheme to encode the data structure; because
> > of the additional complexity, it's not implemented in v3.
> 
> FWIW, I don't think it takes any additional complexity here, at least in the
> guest implementation side.  The thing I suggested would just mean explicitly
> calling out that there was a single bitmap instead of implying it in the ABI.
> 
> Do you think the scheme I suggested is the way to go?

Yes, I think so, and I will do that in a later version. In this v3, I just
want to solve the issue caused by the large page bitmap in v2.

Liang


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-08-01 Thread Li, Liang Z
> > It's only small because it makes you rescan the free list.
> > So maybe you should do something else.
> > I looked at it a bit. Instead of scanning the free list, how about
> > scanning actual page structures? If page is unused, pass it to host.
> > Solves the problem of rescanning multiple times, does it not?
> 
> FWIW, I think the new data structure needs some work.
> 
> Before, we had a potentially very long list of 4k areas.  Now, we've just got 
> a
> very large bitmap.  The bitmap might not even be very dense if we are
> ballooning relatively few things.
> 
> Can I suggest an alternate scheme?  I think you actually need a hybrid
> scheme that has bitmaps but also allows more flexibility in the pfn ranges.
> The payload could be a number of records each containing 3 things:
> 
>   pfn, page order, length of bitmap (maybe in powers of 2)
> 
> Each record is followed by the bitmap.  Or, if the bitmap length is 0,
> immediately followed by another record.  A bitmap length of 0 implies a
> bitmap with the least significant bit set.  Page order specifies how many
> pages each bit represents.
> 
> This scheme could easily encode the new data structure you are proposing
> by just setting pfn=0, order=0, and a very long bitmap length.  But, it could
> handle sparse bitmaps much better *and* represent large pages much more
> efficiently.
> 
> There's plenty of space to fit a whole record in 64 bits.

I like your idea; it's more flexible, and it will be very useful if we want
to optimize the page allocation stage further. I believe memory fragmentation
will not be very serious, so the performance won't be too bad in the worst
case.

Thanks!
Liang
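
For illustration, one possible packing of the record described above into 64
bits (the struct name and field widths are my assumptions; only "pfn, order,
bitmap length in one 64-bit record" comes from the thread, and a real wire
format would use explicit shifts and masks, since C bitfield layout is
implementation-defined):

struct balloon_range_record {
	__u64 pfn      : 40;	/* first pfn covered by this record */
	__u64 order    : 6;	/* each bitmap bit covers 1 << order pages */
	__u64 bmap_len : 18;	/* bitmap bits following this record;
				 * 0 means no bitmap, LSB implied set */
};
/* Each record is followed by bmap_len bitmap bits (if any), then the
 * next record. */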



RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > > >
> > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > How big was the pfn buffer before?
> > >
> > >
> > > Yes I would limit this to 1G memory in a go, will result in a 32KByte 
> > > bitmap.
> > >
> > > --
> > > MST
> >
> > Limiting to 1GB is bad for performance; I sent you the test results several
> > weeks ago.
> >
> > Paste it below:
> > About the size of the page bitmap, I have tested the performance of
> > filling the balloon to 15GB with a 16GB RAM VM:
> >
> > Bitmap size   RAM covered   Time spent on inflating
> > 32KB          1GB           2031ms
> > 64KB          2GB           1507ms
> > 512KB         16GB          1237ms
> >
> > If possible, a big bitmap is better for performance.
> >
> > Liang
> 
> Earlier you said:
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> Here sending PFNs to host with 512K Byte map should be almost free.
> 
> So is something else taking up the time?
> 
I just want to show you the benefits of using a big bitmap. :)
I did not measure the time spent on each stage after the optimization (I will
do it later), but I have tried allocating pages in big chunks and found it can
make things faster. Without big-chunk allocation, the performance improvement
is about 85%; with it, the improvement is about 94%.

Liang

> 
> --
> MST


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 06:36:18AM +0000, Li, Liang Z wrote:
> > > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > > How big was the pfn buffer before?
> > > >
> > > > Yes, it is if the max pfn is more than 32GB.
> > > > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's
> > > > too small, and it's the main reason for bad performance.
> > > > Use the max 1MB kmalloc is a balance between performance and
> > > > flexibility, a large page bitmap covers the range of all the
> > > > memory is no good for a system with huge amount of memory. If the
> > > > bitmap is too small, it means we have to traverse a long list for
> > > > many times, and it's bad
> > > for performance.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > There are all your implementation decisions though.
> > >
> > > If guest memory is so fragmented that you only have order 0 4k
> > > pages, then allocating a huge 1M contigious chunk is very problematic in
> and of itself.
> > >
> >
> > The memory is allocated in the probe stage. This will not happen if
> > the driver is loaded when booting the guest.
> >
> > > Most people rarely migrate and do not care how fast that happens.
> > > Wasting a large chunk of memory (and it's zeroed for no good reason,
> > > so you actually request host memory for it) for everyone to speed it
> > > up when it does happen is not really an option.
> > >
> > If people don't plan to do inflating/deflating, they should not enable
> > virtio-balloon in the first place; once they decide to use it, the
> > driver should provide the best performance it can.
> 
> The reason people inflate/deflate is so they can overcommit memory.
> Do they need to overcommit very quickly? I don't see why.
> So let's get what we can for free but I don't really believe people would want
> to pay for it.
> 
> > 1MB is a very small portion for a VM with more than 32GB of memory, and
> > it's the *worst case*; for a VM with less than 32GB, the amount of RAM
> > used depends on the VM's memory size and will be less than 1MB.
> 
> It's guest memory so might all be in swap and never touched, your memset
> at probe time will fault it in and make hypervisor actually pay for it.
> 
> > If 1MB is too big, how about 512KB, or 256KB?  32KB seems too small.
> >
> > Liang
> 
> It's only small because it makes you rescan the free list.
> So maybe you should do something else.
> I looked at it a bit. Instead of scanning the free list, how about scanning
> actual page structures? If page is unused, pass it to host.
> Solves the problem of rescanning multiple times, does it not?
> 

Yes, agree.
> 
> Another idea: allocate a small bitmap at probe time (e.g. for deflate),
> allocate a bunch more on each request. Use something like GFP_ATOMIC and a
> scatter/gather, if that fails use the smaller bitmap.
> 

So the aim of v3 is to use a smaller bitmap without too heavy a performance
penalty.
Thanks a lot!

Liang
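
A minimal sketch of that allocate-on-demand idea (the page_bitmap[] and
nr_page_bmap fields match the later v6 driver; the function name, the array
bound and BALLOON_BMAP_SIZE are assumed):

static void extend_page_bitmap(struct virtio_balloon *vb)
{
	int i;

	for (i = 1; i < VIRTIO_BALLOON_NR_PAGE_BMAP; i++) {
		/* Opportunistic: on failure, simply scan with the
		 * bitmaps already allocated, as suggested above. */
		vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_ATOMIC);
		if (!vb->page_bitmap[i])
			break;
		vb->nr_page_bmap++;
	}
}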




RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 03:06:37AM +0000, Li, Liang Z wrote:
> > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of the page bitmap
> > > + * to prevent a very large page bitmap; there are two reasons for this:
> > > + * 1) to save memory.
> > > + * 2) allocating a large bitmap may fail.
> > > + *
> > > + * The actual limit of pfn is determined by:
> > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > + *
> > > + * If the system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > > + * the page list and send the PFNs in several passes. To reduce the
> > > + * overhead of scanning the page list, VIRTIO_BALLOON_PFNS_LIMIT should
> > > + * be set to a value which can cover most cases.
> > >
> > > So what if it covers 1/32 of the memory? We'll do 32 exits and not
> > > 1, still not a big deal for a big guest.
> > >
> >
> > The issue here is that the overhead of scanning the page list 32 times is
> > too high.
> > Isn't limiting the page bitmap size to a fixed value better for a big guest?
> >
> 
> I'd say avoid scanning free lists completely. Scan pages themselves and check
> the refcount to see whether they are free.
> This way each page needs to be tested once.
> 
> And skip the whole optimization if less than e.g. 10% is free.

That's better than rescanning the free list. Will change in next version.

Thanks!

Liang
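
A minimal sketch of that page-structure scan (the helper name is
hypothetical; it ignores compound and reserved pages and takes no zone lock,
so it is inherently racy, which is acceptable because the host only needs
pages that were free at some point between its request and the response):

static unsigned long scan_unused_pages(unsigned long start_pfn,
				       unsigned long end_pfn,
				       unsigned long *bitmap)
{
	unsigned long pfn, nr_free = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		/* A page sitting on the buddy free lists has refcount 0. */
		if (!page_count(page)) {
			bitmap_set(bitmap, pfn - start_pfn, 1);
			nr_free++;
		}
	}
	return nr_free;
}

Each page is tested exactly once, so the cost is linear in the pfn range no
matter how often the free lists churn.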




RE: [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info

2016-07-28 Thread Li, Liang Z
> >  }
> >
> > +static void update_free_pages_stats(struct virtio_balloon *vb,
> 
> why _stats?

Will change.

> > +   max_pfn = get_max_pfn();
> > +   mutex_lock(&vb->balloon_lock);
> > +   while (pfn < max_pfn) {
> > +   memset(vb->page_bitmap, 0, vb->bmap_len);
> > +   ret = get_free_pages(pfn, pfn + vb->pfn_limit,
> > +   vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
> > +   hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_FREE_PAGES);
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
> > +   hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
> > +   bmap_len = vb->pfn_limit / BITS_PER_BYTE;
> > +   if (!ret) {
> > +   hdr->flag = cpu_to_virtio16(vb->vdev, BALLOON_FLAG_DONE);
> > +   if (pfn + vb->pfn_limit > max_pfn)
> > +   bmap_len = (max_pfn - pfn) / BITS_PER_BYTE;
> > +   } else
> > +   hdr->flag = cpu_to_virtio16(vb->vdev, BALLOON_FLAG_CONT);
> > +   hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +   sg_init_one(&sg_out, hdr,
> > +   sizeof(struct balloon_bmap_hdr) + bmap_len);
> 
> Wait a second. This adds the same buffer multiple times in a loop.
> We will overwrite the buffer without waiting for hypervisor to process it.
> What did I miss?

I was not quite sure about this part; I thought virtqueue_kick(vq) would
prevent the buffer from being overwritten, but I realize now that's wrong.
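
For v3, something like this wait-before-reuse sketch (mirroring how the
inflate path completes via balloon_ack; sg_out and len are assumed local
names, error handling elided):

	struct scatterlist sg_out;
	unsigned int len;

	sg_init_one(&sg_out, hdr, sizeof(struct balloon_bmap_hdr) + bmap_len);
	if (!virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL)) {
		virtqueue_kick(vq);
		/* Do not touch hdr or the bitmap again until the host has
		 * consumed the buffer; the vq callback wakes vb->acked. */
		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
	}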

> > +
> > +   virtqueue_add_outbuf(vq, _out, 1, vb, GFP_KERNEL);
> 
> this can fail. you want to maybe make sure vq has enough space before you
> use it or check error and wait.
> 
> > +   virtqueue_kick(vq);
> 
> why kick here within loop? wait until done. in fact kick outside lock is 
> better
> for smp.

I will change this part in v3.

> 
> > +   pfn += vb->pfn_limit;
> > +   static const char * const names[] = { "inflate", "deflate", "stats",
> > +"misc" };
> > int err, nvqs;
> >
> > /*
> >  * We expect two virtqueues: inflate and deflate, and
> >  * optionally stat.
> >  */
> > -   nvqs = virtio_has_feature(vb->vdev,
> > -   VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> > +   nvqs = 4;
> 
> Does misc vq depend on stats vq feature then? if yes please validate that.

Yes. What do you mean by 'validate' that?

> 
> 
> > +   else
> > +   nvqs = virtio_has_feature(vb->vdev,
> > + VIRTIO_BALLOON_F_STATS_VQ) ? 3 :
> 2;
> 
> Replace that ? with else too pls.

Will change.

Thanks!
Liang


RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> >
> > Yes, it is if the max pfn is more than 32GB.
> > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too
> > small, and it's the main reason for bad performance.
> > Use the max 1MB kmalloc is a balance between performance and
> > flexibility, a large page bitmap covers the range of all the memory is
> > no good for a system with huge amount of memory. If the bitmap is too
> > small, it means we have to traverse a long list for many times, and it's bad
> for performance.
> >
> > Thanks!
> > Liang
> 
> There are all your implementation decisions though.
> 
> If guest memory is so fragmented that you only have order 0 4k pages, then
> allocating a huge 1M contigious chunk is very problematic in and of itself.
> 

The memory is allocated in the probe stage. This will not happen if the driver
is loaded when booting the guest.

> Most people rarely migrate and do not care how fast that happens.
> Wasting a large chunk of memory (and it's zeroed for no good reason, so you
> actually request host memory for it) for everyone to speed it up when it
> does happen is not really an option.
> 
If people don't plan to do inflating/deflating, they should not enable
virtio-balloon in the first place; once they decide to use it, the driver
should provide the best performance it can.

1MB is a very small portion for a VM with more than 32GB of memory, and it's
the *worst case*; for a VM with less than 32GB, the amount of RAM used depends
on the VM's memory size and will be less than 1MB.

If 1MB is too big, how about 512KB, or 256KB?  32KB seems too small.

Liang

> --
> MST
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



RE: [virtio-dev] Re: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7da61ad..3ad8b10 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)
> >  }
> >  EXPORT_SYMBOL(get_max_pfn);
> >
> > +static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
> > +   unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
> > +{
> > +   unsigned long pfn, flags, page_num;
> > +   unsigned int order, t;
> > +   struct list_head *curr;
> > +
> > +   if (zone_is_empty(zone))
> > +   return;
> > +   end_pfn = min(start_pfn + len, end_pfn);
> > +   spin_lock_irqsave(&zone->lock, flags);
> > +
> > +   for_each_migratetype_order(order, t) {
> 
> Why not do each order separately? This way you can use a single bit to pass a
> huge page to host.
> 

I thought about that before, and did not do it because of the complexity and
the small benefit. Using separate page bitmaps for each order won't help to
reduce the data traffic, except by ignoring the pages with small orders.

> Not a requirement but hey.
> 
> Alternatively (and maybe that is a better idea) if you wanted to, you could
> just skip lone 4K pages.
> It's not clear that they are worth bothering with.
> Add a flag to start with some reasonably large order and go from there.
> 
One of the main reasons for this patch is to reduce the network traffic as
much as possible, so it looks strange to skip the lone 4K pages. Would
skipping these pages make live migration faster? I think it depends on the
number of lone 4K pages.

On the other hand, it's faster to send one bit through virtio than to send
4KB through even a 10Gbps network, isn't it?
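
As a back-of-envelope check (my numbers, raw wire time only): at 10Gbps, a
4KB page is 32768 bits and takes about 3.3us to transmit, while the one bit
that describes it costs about 0.1ns of bandwidth; that is a factor of 32768
in traffic, before counting the address translation and madvise() overhead
the bitmap also avoids.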

Liang




RE: [PATCH v2 repost 6/7] mm: add the related functions to get free page info

2016-07-27 Thread Li, Liang Z
> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> + for_each_migratetype_order(order, t) {
> >>> + list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >>> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> + if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> + page_num = 1UL << order;
> >>> + if (pfn + page_num > end_pfn)
> >>> + page_num = end_pfn - pfn;
> >>> + bitmap_set(bitmap, pfn - start_pfn, page_num);
> >>> + }
> >>> + }
> >>> + }
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it
> >> nr_pages or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's sent up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> >
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> >
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime
> > between request and response, so they are safe to free if they are
> > unmodified since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It wants
> to know which pages the guest thinks have good data and need to move.
> But, the list of free pages is (likely) smaller than the list of pages with
> good data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether it
> was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.  So we never lose data, even if this "get free page info"
> stuff is totally out of date.
> 
> The patch description and code comments are, um, a _bit_ light for this level
> of subtlety. :)

I will add more description about this in v3.

Thanks!
Liang
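
A sketch of the kind of comment v3 could add, distilled from the exchange
above:

/*
 * The bitmap reports pages that were free at some point between the
 * host's request and this response.  It is a hint, not a guarantee:
 * the host may skip migrating a reported page only if it observes no
 * write to that page after issuing the request.  A stale bitmap is
 * therefore safe; any page written since the request is migrated
 * normally.
 */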


RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > +/*
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of the page bitmap
> > + * to prevent a very large page bitmap; there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocating a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If the system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > + * the page list and send the PFNs in several passes. To reduce the
> > + * overhead of scanning the page list, VIRTIO_BALLOON_PFNS_LIMIT should
> > + * be set to a value which can cover most cases.
> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> 
> Please just include the correct header. No need for this hackery.
> 

Will change. Thanks!

Liang
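
Concretely, "include the correct header" amounts to something like this
sketch (the declaration of get_max_pfn(), which this series adds to
mm/page_alloc.c, moves into include/linux/mm.h):

/* include/linux/mm.h */
extern unsigned long get_max_pfn(void);

/* drivers/virtio/virtio_balloon.c */
#include <linux/mm.h>	/* for get_max_pfn() */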

