On 6/3/19 3:57 PM, David Hildenbrand wrote:
> On 03.06.19 19:03, Nitesh Narayan Lal wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track free pages which
>> can be reported to its hypervisor, so that the hypervisor can free and
>> reuse that memory as it sees fit.
>>
>> While the pages are being processed by the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them; otherwise, data loss
>> is possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The number of pages temporarily
>> removed from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks - "MAX_ORDER - 2" on x86 - are reported to the hypervisor, both
>> to avoid breaking up THP in the hypervisor and to save space. The bits
>> in the bitmap are an indication that a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nit...@redhat.com>
>> ---
>>  drivers/virtio/Kconfig       |   1 +
>>  include/linux/page_hinting.h |  46 +++++++
>>  mm/Kconfig                   |   6 +
>>  mm/Makefile                  |   2 +
>>  mm/page_alloc.c              |  17 +--
>>  mm/page_hinting.c            | 236 +++++++++++++++++++++++++++++++++++
>>  6 files changed, 301 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_hinting.h
>>  create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>>      tristate "Virtio balloon driver"
>>      depends on VIRTIO
>>      select MEMORY_BALLOON
>> +    select PAGE_HINTING
>>      ---help---
>>       This driver supports increasing and decreasing the amount
>>       of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER              (MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and clean up
>> + * isolated pages.
>> + * @prepare:                Callback responsible for allocating an array to hold
>> + *                  the isolated pages.
>> + * @hint_pages:             Callback which reports the isolated pages synchronously
>> + *                  to the host.
>> + * @cleanup:                Callback to free the array used for reporting the
>> + *                  isolated pages.
>> + * @max_pages:              Maximum number of pages that are going to be hinted to
>> + *                  the host at a time, each of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> +    int (*prepare)(void);
>> +    void (*hint_pages)(struct list_head *list);
>> +    void (*cleanup)(void);
>> +    int max_pages;
> If we allocate the array in virtio-balloon differently (e.g. similar to
> bulk inflation/deflation of pfn's right now), we can most probably get
> rid of prepare() and cleanup(), simplifying the code further.
Yeap, as Alexander suggested. I will go for static allocation and remove
this prepare() and cleanup().
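Roughly along these lines (untested sketch; the array size and the names
VIRTIO_BALLOON_MAX_HINT_ENTRIES/virtballoon_hint_pages are placeholders,
not final virtio-balloon code):

static struct isolated_memory {
        __virtio64 phys_addr;
        __virtio64 size;
} hint_arr[VIRTIO_BALLOON_MAX_HINT_ENTRIES];

/* With prepare()/cleanup() dropped from struct page_hinting_cb: */
static const struct page_hinting_cb hint_cb = {
        .hint_pages     = virtballoon_hint_pages,
        .max_pages      = VIRTIO_BALLOON_MAX_HINT_ENTRIES,
};

With a static array there is nothing to allocate or free per hinting
round, so both callbacks can go away entirely.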
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(const struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>>  config ARCH_HAS_PTE_SPECIAL
>>      bool
>>  
>> +# PAGE_HINTING allows the guest to report its free pages to the
>> +# host at regular intervals.
>> +config PAGE_HINTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y                      := filemap.o mempool.o oom_kill.o fadvise.o \
>>                         interval_tree.o list_lru.o workingset.o \
>>                         debug.o $(mmu-y)
>>  
>> +
>>  # Give 'page_alloc' its own module-parameter namespace
>>  page-alloc-y := page_alloc.o
>>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD)       += z3fold.o
>>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>>  obj-$(CONFIG_CMA)   += cma.o
>>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>>  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>   * -- nyc
>>   */
>>  
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>>              unsigned long pfn,
>>              struct zone *zone, unsigned int order,
>> -            int migratetype)
>> +            int migratetype, bool hint)
>>  {
>>      unsigned long combined_pfn;
>>      unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>>  done_merging:
>>      set_page_order(page, order);
>>  
>> +    if (hint)
>> +            page_hinting_enqueue(page, order);
>>      /*
>>       * If this is not the largest possible page, check if the buddy
>>       * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>              if (unlikely(isolated_pageblocks))
>>                      mt = get_pageblock_migratetype(page);
>>  
>> -            __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +            __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>              trace_mm_page_pcpu_drain(page, 0, mt);
>>      }
>>      spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                              struct page *page, unsigned long pfn,
>>                              unsigned int order,
>> -                            int migratetype)
>> +                            int migratetype, bool hint)
>>  {
>>      spin_lock(&zone->lock);
>>      if (unlikely(has_isolate_pageblock(zone) ||
>>              is_migrate_isolate(migratetype))) {
>>              migratetype = get_pfnblock_migratetype(page, pfn);
>>      }
>> -    __free_one_page(page, pfn, zone, order, migratetype);
>> +    __free_one_page(page, pfn, zone, order, migratetype, hint);
>>      spin_unlock(&zone->lock);
>>  }
>>  
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>      migratetype = get_pfnblock_migratetype(page, pfn);
>>      local_irq_save(flags);
>>      __count_vm_events(PGFREE, 1 << order);
>> -    free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +    free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>      local_irq_restore(flags);
>>  }
>>  
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>       */
>>      if (migratetype >= MIGRATE_PCPTYPES) {
>>              if (unlikely(is_migrate_isolate(migratetype))) {
>> -                    free_one_page(zone, page, pfn, 0, migratetype);
>> +                    free_one_page(zone, page, pfn, 0, migratetype, true);
>>                      return;
>>              }
>>              migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nit...@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap pointer which tracks the freed PFNs
>> + * and other required parameters which could help in retrieving the original
>> + * PFN value using the bitmap.
>> + * @bitmap:         Pointer to the bitmap of free PFNs.
>> + * @base_pfn:               Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages:             Tracks the number of free pages of granularity
>> + *                  PAGE_HINTING_MIN_ORDER.
>> + * @nbits:          Indicates the total size of the bitmap in bits allocated
>> + *                  at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> +    unsigned long *bitmap;
>> +    unsigned long base_pfn;
>> +    atomic_t free_pages;
>> +    unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> +                        struct zone *zone, unsigned int order,
>> +                        int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> +    unsigned long nbits = ALIGN(zone->spanned_pages,
>> +                        PAGE_HINTING_MIN_ORDER);
>> +
>> +    nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> +    return nbits;
> I think we can simplify this to
>
> return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;
>
I will check this.
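FWIW, the current version also passes the order itself to ALIGN() rather
than the chunk size (1 << PAGE_HINTING_MIN_ORDER), so a rounded-up
variant (untested) would be something like:

static unsigned long find_bitmap_size(struct zone *zone)
{
        /*
         * One bit per MIN_ORDER chunk; round up so a partial chunk at
         * the end of the zone still gets a bit.
         */
        return DIV_ROUND_UP(zone->spanned_pages,
                            1UL << PAGE_HINTING_MIN_ORDER);
}

which avoids the unconditional +1 when spanned_pages is already a
multiple of the chunk size.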
>> +}
>> +
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> +    struct zone *zone;
>> +    int idx = 0;
>> +    unsigned long bitmap_size = 0;
> You should probably protect enabling via a mutex and return -EINVAL or
> similar if we already have a callback set (if we ever have different
> drivers). But this has very little priority :)
I will have to look into this.
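Probably something like this (untested sketch; assumes we also change
page_hinting_enable() to return an error, and hinting_mutex is a new
static mutex):

static DEFINE_MUTEX(hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
        int ret = 0;

        mutex_lock(&hinting_mutex);
        if (hcb) {
                /* Another backend already registered a callback. */
                ret = -EBUSY;
                goto out;
        }
        /* ... bitmap allocation as before ... */
        hcb = callback;
out:
        mutex_unlock(&hinting_mutex);
        return ret;
}

(I also noticed the early return on bitmap_zalloc() failure currently
leaks zone->lock; I will fix that as well.)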
>> +
>> +    for_each_populated_zone(zone) {
>> +            spin_lock(&zone->lock);
>> +            bitmap_size = find_bitmap_size(zone);
>> +            bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +            if (!bm_zone[idx].bitmap)
>> +                    return;
>> +            bm_zone[idx].nbits = bitmap_size;
>> +            bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> +            spin_unlock(&zone->lock);
>> +            idx++;
>> +    }
>> +    hcb = callback;
>> +    INIT_WORK(&hinting_work, init_hinting_wq);
> There are also possible races when enabling, you will have to take care
> of at one point.
No page will be enqueued until hcb is set.
I can probably move it below INIT_WORK.
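i.e., publish hcb only once everything else is initialized (untested):

        for_each_populated_zone(zone) {
                /* ... bitmap setup as before ... */
        }
        INIT_WORK(&hinting_work, init_hinting_wq);
        /*
         * page_hinting_enqueue() checks hcb before touching the
         * bitmaps, so it must not observe hcb before the bitmaps and
         * the work item are fully set up.
         */
        smp_store_release(&hcb, callback);

with a matching smp_load_acquire() (or at least READ_ONCE()) on the hcb
read in page_hinting_enqueue().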
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> +    struct zone *zone;
>> +    int idx = 0;
>> +
>> +    cancel_work_sync(&hinting_work);
>> +    hcb = NULL;
>> +    for_each_populated_zone(zone) {
>> +            spin_lock(&zone->lock);
>> +            bitmap_free(bm_zone[idx].bitmap);
>> +            bm_zone[idx].base_pfn = 0;
>> +            bm_zone[idx].nbits = 0;
>> +            atomic_set(&bm_zone[idx].free_pages, 0);
>> +            spin_unlock(&zone->lock);
>> +            idx++;
>> +    }
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> +    unsigned long bitnr;
>> +
>> +    bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> +                     >> PAGE_HINTING_MIN_ORDER;
>> +    return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
> maybe "release_isolated_pages", not sure.
>
>> +{
>> +    int mt = 0, zonenum, order;
>> +    struct page *page, *next;
>> +    struct zone *zone;
>> +    unsigned long bitnr;
>> +
>> +    list_for_each_entry_safe(page, next, pages, lru) {
>> +            zonenum = page_zonenum(page);
>> +            zone = page_zone(page);
>> +            bitnr = pfn_to_bit(page, zonenum);
>> +            spin_lock(&zone->lock);
>> +            list_del(&page->lru);
>> +            order = page_private(page);
>> +            set_page_private(page, 0);
>> +            mt = get_pageblock_migratetype(page);
>> +            __free_one_page(page, page_to_pfn(page), zone,
>> +                            order, mt, false);
>> +            spin_unlock(&zone->lock);
>> +    }
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> +    unsigned long bitnr = 0;
>> +    int zonenum = page_zonenum(page);
>> +    struct zone *zone = page_zone(page);
>> +
>> +    lockdep_assert_held(&zone->lock);
>> +    bitnr = pfn_to_bit(page, zonenum);
>> +    if (bm_zone[zonenum].bitmap &&
>> +        bitnr < bm_zone[zonenum].nbits &&
>> +        !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> +            atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> +    unsigned long set_bit, start = 0;
>> +    struct page *page;
>> +    struct zone *zone;
>> +    int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> +    LIST_HEAD(isolated_pages);
>> +
>> +    ret = hcb->prepare();
>> +    if (ret < 0)
>> +            return;
>> +    for (;;) {
>> +            ret = 0;
>> +            set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> +                                    bm_zone[zonenum].nbits, start);
>> +            if (set_bit >= bm_zone[zonenum].nbits)
>> +                    break;
>> +            page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> +                            bm_zone[zonenum].base_pfn);
>> +            if (!page)
>> +                    continue;
> You are not clearing the bit / decrementing the counter.
>
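Good catch, I will fix this. The offline-page case should drop the bit,
fix up the counter, and advance the scan, e.g. (untested):

                if (!page) {
                        if (test_and_clear_bit(set_bit,
                                               bm_zone[zonenum].bitmap))
                                atomic_dec(&bm_zone[zonenum].free_pages);
                        start = set_bit + 1;
                        continue;
                }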
>> +            zone = page_zone(page);
>> +            spin_lock(&zone->lock);
>> +
>> +            if (PageBuddy(page) && page_private(page) >=
>> +                PAGE_HINTING_MIN_ORDER) {
>> +                    order = page_private(page);
>> +                    ret = __isolate_free_page(page, order);
>> +            }
>> +            clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> +            spin_unlock(&zone->lock);
>> +            if (ret) {
>> +                    /*
>> +                     * restoring page order to use it while releasing
>> +                     * the pages back to the buddy.
>> +                     */
>> +                    set_page_private(page, order);
>> +                    list_add_tail(&page->lru, &isolated_pages);
>> +                    isolated_cnt++;
>> +                    if (isolated_cnt == hcb->max_pages) {
>> +                            hcb->hint_pages(&isolated_pages);
>> +                            release_buddy_pages(&isolated_pages);
>> +                            isolated_cnt = 0;
>> +                    }
>> +            }
>> +            start = set_bit + 1;
>> +            scanned_pages++;
>> +    }
>> +    if (isolated_cnt) {
>> +            hcb->hint_pages(&isolated_pages);
>> +            release_buddy_pages(&isolated_pages);
>> +    }
>> +    hcb->cleanup();
>> +    if (scanned_pages > free_pages)
>> +            atomic_sub((scanned_pages - free_pages),
>> +                       &bm_zone[zonenum].free_pages);
> This looks overly complicated. Can't we somehow simply decrement when
> clearing a bit?
>
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> +    int zonenum = 0;
>> +
>> +    for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +            if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> +                            hcb->max_pages)
>> +                    return true;
>> +    }
>> +    return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> +    int zonenum = 0, free_pages = 0;
>> +
>> +    for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> +            free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> +            if (free_pages >= hcb->max_pages) {
>> +                    /* Find a better way to synchronize per zone
>> +                     * free_pages.
>> +                     */
>> +                    atomic_sub(free_pages,
>> +                               &bm_zone[zonenum].free_pages);
> I can't follow yet why we need that information. 
We don't want to enqueue multiple jobs just because we are delaying the
decrementing of free_pages.
I agree, I am not convinced by the approach I have right now either.
I will try to come up with a better way.
> Wouldn't it be enough
> to just track the number of set bits in the bitmap and start hinting
> depending on that count? (there are false positives, but do we really care?)
In an attempt to minimize the false positives, I might have overly
complicated this part.
I will try to simplify this in the next posting.
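One option I am considering: pair every bitmap clear with the counter,
so free_pages is simply the number of set bits (untested helper):

static void bm_clear_bit(int zonenum, unsigned long bitnr)
{
        if (test_and_clear_bit(bitnr, bm_zone[zonenum].bitmap))
                atomic_dec(&bm_zone[zonenum].free_pages);
}

Then scan_hinting_bitmap() can drop the scanned_pages/free_pages
arithmetic at the end, and check_hinting_threshold() keeps comparing
the counter against max_pages.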
>

>> +                    scan_hinting_bitmap(zonenum, free_pages);
>> +            }
>> +    }
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +    if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> +            bm_set_pfn(page);
>> +    else
>> +            return;
>> +
>> +    if (check_hinting_threshold()) {
>> +            int cpu = smp_processor_id();
>> +
>> +            queue_work_on(cpu, system_wq, &hinting_work);
>> +    }
>> +}
>>
>
-- 
Regards
Nitesh
