Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Li, Liang Z
> On Thu, Jul 28, 2016 at 03:06:37AM +, Li, Liang Z wrote:
> > > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > > + * 1) to save memory.
> > > > + * 2) allocate a large bitmap may fail.
> > > > + *
> > > > + * The actual limit of pfn is determined by:
> > > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > > + *
> > > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > > > + * the page list and send the PFNs with several times. To reduce the
> > > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> > > > + * be set with a value which can cover most cases.
> > >
> > > So what if it covers 1/32 of the memory? We'll do 32 exits and not
> > > 1, still not a big deal for a big guest.
> > >
> >
> > The issue here is the overhead is too high for scanning the page list for 32
> times.
> > Limit the page bitmap size to a fixed value is better for a big guest?
> >
> 
> I'd say avoid scanning free lists completely. Scan pages themselves and check
> the refcount to see whether they are free.
> This way each page needs to be tested once.
> 
> And skip the whole optimization if less than e.g. 10% is free.

That's better than rescanning the free list. Will change in next version.

Thanks!

Liang
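
The single-pass idea agreed on above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: the refcount
array stands in for the per-page reference counts, a count of 0 is taken
to mean "free", and build_free_bitmap() is a hypothetical name, not a
kernel function. The 10% threshold is passed in as a parameter.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Userspace model only: refcount[i] stands in for the reference count of
 * page frame i, and 0 is assumed to mean "free". None of this is kernel API.
 *
 * Tests every page exactly once and sets the matching bit for each free
 * page. Returns false, with the bitmap zeroed, when the bitmap is too
 * small or when fewer than min_free_percent of the pages are free.
 */
static bool build_free_bitmap(const unsigned int *refcount, size_t nr_pages,
                              unsigned long *bitmap, size_t bitmap_longs,
                              unsigned int min_free_percent)
{
    size_t pfn, nr_free = 0;

    memset(bitmap, 0, bitmap_longs * sizeof(*bitmap));

    if (bitmap_longs * BITS_PER_LONG < nr_pages)
        return false; /* bitmap cannot describe this range */

    for (pfn = 0; pfn < nr_pages; pfn++) {
        if (refcount[pfn] != 0)
            continue;
        bitmap[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
        nr_free++;
    }

    /* Skip the whole optimization when too little memory is free. */
    if (nr_free * 100 < (size_t)min_free_percent * nr_pages) {
        memset(bitmap, 0, bitmap_longs * sizeof(*bitmap));
        return false;
    }
    return true;
}
```

Each page is visited once regardless of how small the bitmap window is,
which is the property that rescanning a free list 32 times lacks.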





Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Michael S. Tsirkin
On Thu, Jul 28, 2016 at 03:06:37AM +, Li, Liang Z wrote:
> > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > + * 1) to save memory.
> > > + * 2) allocate a large bitmap may fail.
> > > + *
> > > + * The actual limit of pfn is determined by:
> > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > + *
> > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > > + * the page list and send the PFNs with several times. To reduce the
> > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> > > + * be set with a value which can cover most cases.
> > 
> > So what if it covers 1/32 of the memory? We'll do 32 exits and not 1,
> > still not a big deal for a big guest.
> > 
> 
> The issue here is the overhead is too high for scanning the page list for 32 
> times.
> Limit the page bitmap size to a fixed value is better for a big guest?
> 

I'd say avoid scanning free lists completely. Scan pages themselves and
check the refcount to see whether they are free.
This way each page needs to be tested once.

And skip the whole optimization if less than e.g. 10% is free.

> > > + */
> > > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
> > 
> > I already said this with a smaller limit.
> > 
> > 2<< 30  is 2G but that is not a useful comment.
> > pls explain what is the reason for this selection.
> > 
> > Still applies here.
> > 
> 
> I will add the comment for this.
> 
> > > > - sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > + struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > > + unsigned long bmap_len;
> > > +
> > > + /* cmd and req_id are not used here, set them to 0 */
> > > + hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > > + hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > > + hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > > + hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> > 
> > no need to swap 0, just fill it in. in fact you allocated all 0s so no need
> > to touch these fields at all.
> > 
> 
> Will change in v3.
> 
> > > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct
> > > balloon_dev_info *vb_dev_info,  static int virtballoon_probe(struct
> > > virtio_device *vdev)  {
> > >   struct virtio_balloon *vb;
> > > - int err;
> > > + int err, hdr_len;
> > >
> > >   if (!vdev->config->get) {
> > >   dev_err(>dev, "%s failure: config access disabled\n",
> > @@
> > > -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > >   spin_lock_init(>stop_update_lock);
> > >   vb->stop_update = false;
> > >   vb->num_pages = 0;
> > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > 
> > What are these 2 longs in aid of?
> > 
> The rounddown(vb->start_pfn, BITS_PER_LONG) and roundup(vb->end_pfn,
> BITS_PER_LONG) may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit,
> so we need extra space to save the bitmap for this case. 2 longs are enough.
> 
> > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > 
> > So it can go up to 1MByte but adding header size etc you need a higher order
> > allocation. This is a waste, there is no need to have a power of two 
> > allocation.
> > Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> > Subtract the header and you get bitmap size.
> > Calculate the pfn limit from there.
> > 
> 
> Indeed, will change. Thanks a lot!
> 
> Liang



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-28 Thread Michael S. Tsirkin
On Thu, Jul 28, 2016 at 03:30:09AM +, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> > process
> > 
> > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > >
> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> > 
> > 
> > Yes I would limit this to 1G memory in a go, will result in a 32KByte 
> > bitmap.
> > 
> > --
> > MST
> 
> Limiting to 1G is bad for performance. I sent you the test results several
> weeks ago.
> 
> Pasted below:
> 
> About the size of the page bitmap, I have tested the performance of filling
> the balloon to 15GB with a 16GB RAM VM.
> 
> ===
> 32K Byte (cover 1GB of RAM)
> 
> Time spent on inflating: 2031ms
> ---
> 64K Byte (cover 2GB of RAM)
> 
> Time spent on inflating: 1507ms
> ---
> 512K Byte (cover 16GB of RAM)
> 
> Time spent on inflating: 1237ms
> ===
> 
> If possible, a big bitmap is better for performance.
> 
> Liang

Earlier you said:
a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

Here sending PFNs to host with 512K Byte map
should be almost free.

So is something else taking up the time?


-- 
MST



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > +/*
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> > + * be set with a value which can cover most cases.
> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> 
> Please just include the correct header. No need for this hackery.
> 

Will change. Thanks!

Liang



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> >
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > How big was the pfn buffer before?
> 
> 
> Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> 
> --
> MST

Limiting to 1G is bad for performance. I sent you the test results several
weeks ago.

Pasted below:

About the size of the page bitmap, I have tested the performance of filling
the balloon to 15GB with a 16GB RAM VM.

===
32K Byte (cover 1GB of RAM)

Time spent on inflating: 2031ms
---
64K Byte (cover 2GB of RAM)

Time spent on inflating: 1507ms
---
512K Byte (cover 16GB of RAM)

Time spent on inflating: 1237ms
===

If possible, a big bitmap is better for performance.

Liang
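
For reference, the bitmap sizes quoted in the results follow from one bit
per page. The sketch below assumes 4 KiB pages (a page shift of 12); the
SKETCH_ names and both helpers are hypothetical and exist only for this
illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Bytes of bitmap needed to give one bit to each page of mem_bytes. */
static uint64_t bitmap_bytes_for(uint64_t mem_bytes)
{
    uint64_t pages = mem_bytes >> SKETCH_PAGE_SHIFT;

    return (pages + 7) / 8;
}

/* Bytes of guest memory that a bitmap of bmap_bytes can describe. */
static uint64_t mem_covered_by(uint64_t bmap_bytes)
{
    return (bmap_bytes * 8) << SKETCH_PAGE_SHIFT;
}
```

With these assumptions, 1GB of RAM needs 32 KiB of bitmap, 16GB needs
512 KiB, and the 32GB limit in the patch needs 1 MiB, which matches the
1MB kmalloc() discussed in the thread.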




Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> > + * be set with a value which can cover most cases.
> 
> So what if it covers 1/32 of the memory? We'll do 32 exits and not 1,
> still not a big deal for a big guest.
> 

The issue here is that the overhead of scanning the page list 32 times is
too high.
Is limiting the page bitmap size to a fixed value really better for a big guest?

> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
> 
> I already said this with a smaller limit.
> 
>   2<< 30  is 2G but that is not a useful comment.
>   pls explain what is the reason for this selection.
> 
> Still applies here.
> 

I will add the comment for this.

> > -   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +   unsigned long bmap_len;
> > +
> > +   /* cmd and req_id are not used here, set them to 0 */
> > +   hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +   hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > +   hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> 
> no need to swap 0, just fill it in. in fact you allocated all 0s so no need
> to touch these fields at all.
> 

Will change in v3.

> > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
> >  static int virtballoon_probe(struct virtio_device *vdev)
> >  {
> > struct virtio_balloon *vb;
> > -   int err;
> > +   int err, hdr_len;
> >
> > if (!vdev->config->get) {
> > dev_err(&vdev->dev, "%s failure: config access disabled\n",
> > @@ -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > spin_lock_init(&vb->stop_update_lock);
> > vb->stop_update = false;
> > vb->num_pages = 0;
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> 
> What are these 2 longs in aid of?
> 
The rounddown(vb->start_pfn, BITS_PER_LONG) and roundup(vb->end_pfn,
BITS_PER_LONG) may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit,
so we need extra space to save the bitmap for this case. 2 longs are enough.
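
To make the rounding argument concrete, here is a small userspace sketch.
It assumes the range is half-open, [start_pfn, end_pfn), which the thread
does not state; round_down_to(), round_up_to() and aligned_span_bits() are
hypothetical stand-ins for the kernel's rounddown()/roundup() helpers.

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Userspace stand-ins for the kernel's rounddown()/roundup() helpers. */
static unsigned long round_down_to(unsigned long x, unsigned long m)
{
    return x - x % m;
}

static unsigned long round_up_to(unsigned long x, unsigned long m)
{
    return round_down_to(x + m - 1, m);
}

/*
 * Number of bits the bitmap must hold once [start_pfn, end_pfn) has been
 * widened to whole longs on both sides (half-open range is an assumption).
 */
static unsigned long aligned_span_bits(unsigned long start_pfn,
                                       unsigned long end_pfn)
{
    return round_up_to(end_pfn, BITS_PER_LONG) -
           round_down_to(start_pfn, BITS_PER_LONG);
}
```

Each side can grow by at most BITS_PER_LONG - 1 bits, so a range of
pfn_limit pages never needs more than pfn_limit + 2 * BITS_PER_LONG bits,
i.e. two extra longs of storage.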

> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> So it can go up to 1MByte but adding header size etc you need a higher order
> allocation. This is a waste, there is no need to have a power of two 
> allocation.
> Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> Subtract the header and you get bitmap size.
> Calculate the pfn limit from there.
> 

Indeed, will change. Thanks a lot!

Liang
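
The "start from the other side" suggestion can be sketched as follows:
fix the total allocation first, subtract the header, and derive the pfn
limit from what is left. This is only an illustration; the function name
is hypothetical, the header length is passed in rather than taken from
struct balloon_bmap_hdr, and keeping two longs of slack for alignment is
an assumption carried over from the v2 patch.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Given a total allocation budget and the header size, return how many
 * PFNs one request can describe. Two longs are held back for the
 * alignment slack discussed in the thread (an assumption of this sketch).
 */
static unsigned long pfn_limit_from_budget(size_t alloc_bytes, size_t hdr_len)
{
    size_t slack = 2 * sizeof(unsigned long);
    size_t bmap_bytes;

    if (alloc_bytes <= hdr_len + slack)
        return 0;

    bmap_bytes = alloc_bytes - hdr_len - slack;
    /* Keep the bitmap a whole number of longs. */
    bmap_bytes -= bmap_bytes % sizeof(unsigned long);

    return (unsigned long)(bmap_bytes * CHAR_BIT);
}
```

For example, pfn_limit_from_budget(32 * 1024, hdr_len) yields a limit
slightly below the 262144 pages of a full 1GB window, and header plus
bitmap plus slack together still fit in the 32KByte allocation.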




Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Michael S. Tsirkin
On Thu, Jul 28, 2016 at 01:13:35AM +, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> > process
> > 
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > + hdr_len = sizeof(struct balloon_bmap_hdr);
> > > + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > 
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> > was the pfn buffer before?
> 
> Yes, it is if the max pfn is more than 32GB.
> The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too small, 
> and it's the main reason for bad performance.
> Use the max 1MB kmalloc is a balance between performance and flexibility,
> a large page bitmap covers the range of all the memory is no good for a system
> with huge amount of memory. If the bitmap is too small, it means we have
> to traverse a long list for many times, and it's bad for performance.
> 
> Thanks!
> Liang   

These are all your implementation decisions though.

If guest memory is so fragmented that you only have order 0 4k pages,
then allocating a huge 1M contiguous chunk is very problematic
in and of itself.

Most people rarely migrate and do not care how fast that happens.
Wasting a large chunk of memory (and it's zeroed for no good reason, so you
actually request host memory for it) for everyone to speed it up
when it does happen is not really an option.

-- 
MST



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Li, Liang Z
> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> was the pfn buffer before?

Yes, it is if the max pfn covers more than 32GB.
The pfn buffer used before is 256*4 = 1024 bytes; it's too small,
and that is the main reason for the bad performance.
Using a kmalloc of at most 1MB is a balance between performance and
flexibility: a large page bitmap that covers the range of all the memory is
no good for a system with a huge amount of memory. If the bitmap is too
small, we have to traverse a long list many times, which is bad for
performance.

Thanks!
Liang   
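
As a rough illustration of why the 1024-byte pfn buffer is a bottleneck,
the sketch below counts how many requests are needed when each request
carries at most 256 PFNs. It assumes 4 KiB pages and one request per full
buffer; the SKETCH_ names and the helper are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12        /* assumption: 4 KiB pages */
#define SKETCH_PFNS_PER_REQUEST 256 /* from "256*4 = 1024 bytes" above */

/* Requests needed to report mem_bytes worth of pages, 256 PFNs at a time. */
static uint64_t pfn_array_requests(uint64_t mem_bytes)
{
    uint64_t pages = mem_bytes >> SKETCH_PAGE_SHIFT;

    return (pages + SKETCH_PFNS_PER_REQUEST - 1) / SKETCH_PFNS_PER_REQUEST;
}
```

Under these assumptions each request covers only 1MB of guest memory, so
inflating to 7GB takes 7168 requests.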




Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Michael S. Tsirkin
On Wed, Jul 27, 2016 at 09:23:33AM +0800, Liang Li wrote:
> The implementation of the current virtio-balloon is not very
> efficient, the time spends on different stages of inflating
> the balloon to 7GB of a 8GB idle guest:
> 
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> Debugging shows that the bottle neck are the stage b and stage d.
> 
> If using a bitmap to send the page info instead of the PFNs, we
> can reduce the overhead in stage b quite a lot. Furthermore, we
> can do the address translation and call madvise() with a bulk of
> RAM pages, instead of the current page per page way, the overhead
> of stage c and stage d can also be reduced a lot.
> 
> This patch is the kernel side implementation which is intended to
> speed up the inflating & deflating process by adding a new feature
> to the virtio-balloon device. With this new feature, inflating the
> balloon to 7GB of a 8GB idle guest only takes 590ms, the
> performance improvement is about 85%.
> 
> TODO: optimize stage a by allocating/freeing a chunk of pages
> instead of a single page at a time.
> 
> Signed-off-by: Liang Li 
> Suggested-by: Michael S. Tsirkin 
> Cc: Michael S. Tsirkin 
> Cc: Andrew Morton 
> Cc: Vlastimil Babka 
> Cc: Mel Gorman 
> Cc: Paolo Bonzini 
> Cc: Cornelia Huck 
> Cc: Amit Shah 
> ---
>  drivers/virtio/virtio_balloon.c | 184 +++-
>  1 file changed, 162 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8d649a2..2d18ff6 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -41,10 +41,28 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +/*
> + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> + * to prevent a very large page bitmap, there are two reasons for this:
> + * 1) to save memory.
> + * 2) allocate a large bitmap may fail.
> + *
> + * The actual limit of pfn is determined by:
> + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> + *
> + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> + * the page list and send the PFNs with several times. To reduce the
> + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> + * be set with a value which can cover most cases.
> + */
> +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
> +
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  
> +extern unsigned long get_max_pfn(void);
> +

Please just include the correct header. No need for this hackery.

>  struct virtio_balloon {
>   struct virtio_device *vdev;
>   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -62,6 +80,15 @@ struct virtio_balloon {
>  
>   /* Number of balloon pages we've told the Host we're not using. */
>   unsigned int num_pages;
> + /* Pointer of the bitmap header. */
> + void *bmap_hdr;
> + /* Bitmap and length used to tell the host the pages */
> + unsigned long *page_bitmap;
> + unsigned long bmap_len;
> + /* Pfn limit */
> + unsigned long pfn_limit;
> + /* Used to record the processed pfn range */
> + unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
>   /*
>* The pages we've told the Host we're not using are enqueued
>* at vb_dev_info->pages list.
> @@ -105,12 +132,45 @@ static void balloon_ack(struct virtqueue *vq)
>   wake_up(&vb->acked);
>  }
>  
> +static inline void init_pfn_range(struct virtio_balloon *vb)
> +{
> + vb->min_pfn = ULONG_MAX;
> + vb->max_pfn = 0;
> +}
> +
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +  struct page *page)
> +{
> + unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> + if (balloon_pfn < vb->min_pfn)
> + vb->min_pfn = balloon_pfn;
> + if (balloon_pfn > vb->max_pfn)
> + vb->max_pfn = balloon_pfn;
> +}
> +
>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
>   struct scatterlist sg;
>   unsigned int len;
>  
> - sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> + struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> + unsigned long bmap_len;
> +
> + /* cmd and req_id are not used here, set them to 0 */
> + hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> + hdr->page_shift = 

Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Michael S. Tsirkin
On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +   vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +   vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +   vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > +   hdr_len = sizeof(struct balloon_bmap_hdr);
> > +   vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How
> big was the pfn buffer before?


Yes I would limit this to 1G memory in a go, will result
in a 32KByte bitmap.

-- 
MST



Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Michael S. Tsirkin
On Wed, Jul 27, 2016 at 09:23:33AM +0800, Liang Li wrote:
> The implementation of the current virtio-balloon is not very
> efficient, the time spends on different stages of inflating
> the balloon to 7GB of a 8GB idle guest:
> 
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> Debugging shows that the bottle neck are the stage b and stage d.
> 
> If using a bitmap to send the page info instead of the PFNs, we
> can reduce the overhead in stage b quite a lot. Furthermore, we
> can do the address translation and call madvise() with a bulk of
> RAM pages, instead of the current page per page way, the overhead
> of stage c and stage d can also be reduced a lot.
> 
> This patch is the kernel side implementation which is intended to
> speed up the inflating & deflating process by adding a new feature
> to the virtio-balloon device. With this new feature, inflating the
> balloon to 7GB of a 8GB idle guest only takes 590ms, the
> performance improvement is about 85%.
> 
> TODO: optimize stage a by allocating/freeing a chunk of pages
> instead of a single page at a time.
> 
> Signed-off-by: Liang Li 
> Suggested-by: Michael S. Tsirkin 
> Cc: Michael S. Tsirkin 
> Cc: Andrew Morton 
> Cc: Vlastimil Babka 
> Cc: Mel Gorman 
> Cc: Paolo Bonzini 
> Cc: Cornelia Huck 
> Cc: Amit Shah 
> ---
>  drivers/virtio/virtio_balloon.c | 184 +++-
>  1 file changed, 162 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8d649a2..2d18ff6 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -41,10 +41,28 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +/*
> + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> + * to prevent a very large page bitmap, there are two reasons for this:
> + * 1) to save memory.
> + * 2) allocate a large bitmap may fail.
> + *
> + * The actual limit of pfn is determined by:
> + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> + *
> + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
> + * the page list and send the PFNs with several times. To reduce the
> + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
> + * be set with a value which can cover most cases.

So what if it covers 1/32 of the memory? We'll do 32 exits and not 1,
still not a big deal for a big guest.

> + */
> +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */

I already said this with a smaller limit.

2<< 30  is 2G but that is not a useful comment.
pls explain what is the reason for this selection.

Still applies here.


> +
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  
> +extern unsigned long get_max_pfn(void);
> +
>  struct virtio_balloon {
>   struct virtio_device *vdev;
>   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -62,6 +80,15 @@ struct virtio_balloon {
>  
>   /* Number of balloon pages we've told the Host we're not using. */
>   unsigned int num_pages;
> + /* Pointer of the bitmap header. */
> + void *bmap_hdr;
> + /* Bitmap and length used to tell the host the pages */
> + unsigned long *page_bitmap;
> + unsigned long bmap_len;
> + /* Pfn limit */
> + unsigned long pfn_limit;
> + /* Used to record the processed pfn range */
> + unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
>   /*
>* The pages we've told the Host we're not using are enqueued
>* at vb_dev_info->pages list.
> @@ -105,12 +132,45 @@ static void balloon_ack(struct virtqueue *vq)
>   wake_up(&vb->acked);
>  }
>  
> +static inline void init_pfn_range(struct virtio_balloon *vb)
> +{
> + vb->min_pfn = ULONG_MAX;
> + vb->max_pfn = 0;
> +}
> +
> +static inline void update_pfn_range(struct virtio_balloon *vb,
> +  struct page *page)
> +{
> + unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> + if (balloon_pfn < vb->min_pfn)
> + vb->min_pfn = balloon_pfn;
> + if (balloon_pfn > vb->max_pfn)
> + vb->max_pfn = balloon_pfn;
> +}
> +
>  static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>  {
>   struct scatterlist sg;
>   unsigned int len;
>  
> - sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> + struct balloon_bmap_hdr *hdr 

Re: [Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-27 Thread Dave Hansen
On 07/26/2016 06:23 PM, Liang Li wrote:
> + vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> + vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> + vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> +  BITS_PER_BYTE + 2 * sizeof(unsigned long);
> + hdr_len = sizeof(struct balloon_bmap_hdr);
> + vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);

This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How
big was the pfn buffer before?



[Qemu-devel] [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process

2016-07-26 Thread Liang Li
The implementation of the current virtio-balloon is not very
efficient. The time spent on the different stages of inflating
the balloon to 7GB of an 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If we use a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() on a bulk of
RAM pages instead of the current page-by-page way, so the overhead
of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms; the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li 
Suggested-by: Michael S. Tsirkin 
Cc: Michael S. Tsirkin 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Paolo Bonzini 
Cc: Cornelia Huck 
Cc: Amit Shah 
---
 drivers/virtio/virtio_balloon.c | 184 +++-
 1 file changed, 162 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8d649a2..2d18ff6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -41,10 +41,28 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/*
+ * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
+ * to prevent a very large page bitmap, there are two reasons for this:
+ * 1) to save memory.
+ * 2) allocate a large bitmap may fail.
+ *
+ * The actual limit of pfn is determined by:
+ * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
+ *
+ * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
+ * the page list and send the PFNs with several times. To reduce the
+ * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
+ * be set with a value which can cover most cases.
+ */
+#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+extern unsigned long get_max_pfn(void);
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -62,6 +80,15 @@ struct virtio_balloon {
 
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
+   /* Pointer of the bitmap header. */
+   void *bmap_hdr;
+   /* Bitmap and length used to tell the host the pages */
+   unsigned long *page_bitmap;
+   unsigned long bmap_len;
+   /* Pfn limit */
+   unsigned long pfn_limit;
+   /* Used to record the processed pfn range */
+   unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
/*
 * The pages we've told the Host we're not using are enqueued
 * at vb_dev_info->pages list.
@@ -105,12 +132,45 @@ static void balloon_ack(struct virtqueue *vq)
wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+   vb->min_pfn = ULONG_MAX;
+   vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+struct page *page)
+{
+   unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+   if (balloon_pfn < vb->min_pfn)
+   vb->min_pfn = balloon_pfn;
+   if (balloon_pfn > vb->max_pfn)
+   vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
struct scatterlist sg;
unsigned int len;
 
-   sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+   struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+   unsigned long bmap_len;
+
+   /* cmd and req_id are not used here, set them to 0 */
+   hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
+   hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+   hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
+   hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
+   hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
+   bmap_len = min(vb->bmap_len,
+