On 12.02.19 18:24, Nitesh Narayan Lal wrote:
>
> On 2/12/19 4:24 AM, David Hildenbrand wrote:
>> On 12.02.19 10:03, Wang, Wei W wrote:
>>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables the guests with no page
>>>> cache to rapidly free and reclaim memory to and from the host
>>>> respectively.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test-case, executed on a single system and
>>>> single NUMA node with 15GB memory, we were able to successfully launch
>>>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>>>> explanation of the test procedure is provided at the bottom).
>>>>
>>>> Changelog in V8:
>>>> In this patch-series, the earlier approach [1] which was used to capture
>>>> and scan the pages freed by the guest has been changed. The new approach
>>>> is briefly described below:
>>>>
>>>> The patch-set still leverages the existing arch_free_page() to add this
>>>> functionality. It maintains a per CPU array which is used to store the
>>>> pages freed by the guest. The maximum number of entries which it can hold
>>>> is defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled,
>>>> it is scanned and only the pages which are available in the buddy are
>>>> stored. This process continues until the array is filled with pages which
>>>> are part of the buddy free list. After which it wakes up a kernel
>>>> per-cpu-thread.
>>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>>> and if the page is not reallocated and present in the buddy, the kernel
>>>> thread attempts to isolate it from the buddy. If it is successfully
>>>> isolated, the page is added to another per-cpu array.
>>>> Once the entire scanning process is complete, all the isolated pages are
>>>> reported to the host through an existing virtio-balloon driver.
>>> Hi Nitesh,
>>>
>>> Have you guys thought about something like below, which would be simpler:
>> Responding because I'm the first to stumble over this mail, hah! :)
>>
>>> - use bitmaps to record free pages, e.g. xbitmap:
>>>   https://lkml.org/lkml/2018/1/9/304.
>>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by
>>>   all the CPUs;
>> Global means all VCPUs will be competing potentially for a single lock
>> when freeing/allocating a page, no? What if you have 64 VCPUs
>> allocating/freeing memory like crazy?
>>
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page()?)
>>
>> That's the big benefit of taking the pages off the buddy free list. Other
>> VCPUs won't stumble over them, waiting for them to get freed in the
>> hypervisor.
>>
>>> - arch_free_page(): set the bits of the freed pages in the bitmap
>>>   (no per-CPU array with hardcoded fixed length and no per-cpu scanning
>>>   thread)
>>> - arch_alloc_page(): clear the related bits from the bitmap
>>> - expose 2 APIs for the callers:
>>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned
>>>      int nr);
>>>      This API searches for the next free page chunk (@nr of pages),
>>>      starting from @pfn_start.
>>>      Bits of those free pages will be cleared after this function returns.
>>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>>      This API sets the @nr contiguous bits starting from pfn_start.
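[Inline note] The two calls proposed above can be modelled in user space to make their semantics concrete. This is only a minimal sketch under stated assumptions, not code from the patch-set or from xbitmap: the fixed NR_PFNS, the NO_CHUNK sentinel, and the helper names hint_page_freed() and hint_page_allocated() (standing in for the arch_free_page() and arch_alloc_page() hooks) are all hypothetical, and the model is single-threaded, so it sidesteps the locking question raised in the replies.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PFNS 1024UL                 /* hypothetical guest size, in pages */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS ((NR_PFNS + BITS_PER_WORD - 1) / BITS_PER_WORD)
#define NO_CHUNK ULONG_MAX             /* hypothetical "nothing found" value */

/* One bit per guest pfn: 1 = free and usable, 0 = allocated or reported. */
static unsigned long free_map[NR_WORDS];

static bool map_test(unsigned long pfn)
{
    assert(pfn < NR_PFNS);
    return (free_map[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}

static void map_set(unsigned long pfn)
{
    assert(pfn < NR_PFNS);
    free_map[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
}

static void map_clear(unsigned long pfn)
{
    assert(pfn < NR_PFNS);
    free_map[pfn / BITS_PER_WORD] &= ~(1UL << (pfn % BITS_PER_WORD));
}

/* Stand-in for the arch_free_page() hook: mark the page free. */
void hint_page_freed(unsigned long pfn)     { map_set(pfn); }

/* Stand-in for the arch_alloc_page() hook: the page is in use again. */
void hint_page_allocated(unsigned long pfn) { map_clear(pfn); }

/*
 * Find the first run of nr free pages at or after pfn_start, clear its
 * bits (the pages are now "in flight" to the host) and return its first
 * pfn, or NO_CHUNK if no such run exists.
 */
unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
    unsigned long run = 0;

    for (unsigned long pfn = pfn_start; pfn < NR_PFNS; pfn++) {
        if (!map_test(pfn)) {
            run = 0;
            continue;
        }
        if (++run == nr) {
            unsigned long first = pfn + 1 - nr;

            for (unsigned long i = first; i <= pfn; i++)
                map_clear(i);
            return first;
        }
    }
    return NO_CHUNK;
}

/* Host has acked the chunk: make the pages allocatable again. */
void put_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
    for (unsigned int i = 0; i < nr && pfn_start + i < NR_PFNS; i++)
        map_set(pfn_start + i);
}
```

Between get_free_page_hints() and put_free_page_hints() the bits stay clear, which is the window in which an allocator would have to wait for the page. A shared map like this would also need atomic bit operations or a lock once several VCPUs touch it.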
>>>
>>> Usage example with balloon:
>>> 1) host requests to start ballooning;
>>> 2) balloon driver get_free_page_hints and report the hints to host via
>>>    report_vq;
>>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free
>>>    pages and put back pfn_start to ack_vq;
>>> 4) balloon driver receives pfn_start and calls
>>>    put_free_page_hints(pfn_start) to have the related bits from the bitmap
>>>    to be set, indicating that those free pages are ready to be allocated.
>> This sounds more like "the host requests to get free pages once in a
>> while" compared to "the host is always informed about free pages". At
>> the time where the host actually has to ask the guest (e.g. because the
>> host is low on memory), it might be too late to wait for guest action.
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
>> as candidates for removal and if the host is low on memory, only
>> scanning the guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example
>> you describe.
>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>> pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT.
>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>> faults will get a new host page to set up the related EPT entry. If guest
>>> uses that page before the page gets unmapped (i.e. right before step 3), no
>>> EPT violation happens and the guest will use the same physical page that
>>> will be unmapped and given to other host threads. So we need to make sure
>>> that the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages have
>>> "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>> means step 2) above has happened and step 4) hasn't been reached.
>>> In this case, we can either have arch_alloc_page() busywaiting a bit till 4)
>>> is done for that page
>>> Or better to have a balloon callback which prioritizes 3) and 4) to make
>>> this page usable by the guest.
>> Regarding the latter, the VCPU allocating a page cannot do anything if
>> the page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
>>
>>> Using bitmaps to record free page hints doesn't need to take the free pages
>>> off the buddy list and return them later, which needs to go through the
>>> long allocation/free code path.
>>>
>> Yes, but it means that any process is able to get stuck on such a page
>> for as long as it takes to report the free pages to the hypervisor and
>> for it to call madvise(pfn_start, DONTNEED) on any such page.
>>
>> Nice idea, but I think we definitely need something that can potentially
>> be implemented per-cpu without any global locks involved.
>>
>> Thanks!
>>
>>> Best,
>>> Wei
>>>
> Hi Wei,
>
> For your comment, I agree with David. If we have one global per-cpu, we
> will have to acquire a lock.
> Also as David mentioned the idea is to derive the hints from the guest,
> rather than host asking for free pages.
>
> However, I am wondering if having per-cpu bitmaps is possible?
> Using this I can possibly get rid of the fixed array size issue.
>
I assume we will have problems with dynamically sized bitmaps - memory
has to be allocated. Similar to a dynamically sized list. But it is
definitely worth investigating.

-- 

Thanks,

David / dhildenb
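
For reference, the fixed-size per-CPU array that this thread weighs against bitmaps can be sketched in user space as below. It is a rough model based only on the cover-letter text quoted above, not the patch-set's code: struct freed_pages, record_freed_page(), compact_freed_pages(), the in_buddy callback type, and the toy predicate are hypothetical names, and only MAX_FGPT_ENTRIES and its value come from the quoted description. The point is that the array needs no allocation in the free path, at the cost of a hardcoded size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FGPT_ENTRIES 1000   /* value quoted in the cover letter */

/* Hypothetical stand-in for "is this page still on the buddy free list?" */
typedef bool (*in_buddy_fn)(unsigned long pfn);

/* One instance per CPU in the real design; a plain struct here. */
struct freed_pages {
    unsigned long pfn[MAX_FGPT_ENTRIES];
    size_t count;
};

/* Toy predicate for demonstration only: pretend even pfns are still free. */
bool toy_in_buddy(unsigned long pfn)
{
    return pfn % 2 == 0;
}

/* Keep only entries that are still in the buddy; return how many remain. */
size_t compact_freed_pages(struct freed_pages *fp, in_buddy_fn in_buddy)
{
    size_t kept = 0;

    for (size_t i = 0; i < fp->count; i++)
        if (in_buddy(fp->pfn[i]))
            fp->pfn[kept++] = fp->pfn[i];
    fp->count = kept;
    return kept;
}

/*
 * Record a freed page. When the array fills up it is scanned and
 * reallocated pages are dropped. Returns true only once the array is
 * full of pages that are all still in the buddy, which is the point
 * where the per-CPU thread would be woken. The caller must drain the
 * array (reset count) before recording again after a true return.
 */
bool record_freed_page(struct freed_pages *fp, unsigned long pfn,
                       in_buddy_fn in_buddy)
{
    assert(fp->count < MAX_FGPT_ENTRIES);
    fp->pfn[fp->count++] = pfn;
    if (fp->count < MAX_FGPT_ENTRIES)
        return false;
    return compact_freed_pages(fp, in_buddy) == MAX_FGPT_ENTRIES;
}
```

No memory is allocated while recording, which is the property a dynamically sized bitmap or list would have to give up.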

