On 2/12/19 4:24 AM, David Hildenbrand wrote:
> On 12.02.19 10:03, Wang, Wei W wrote:
>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>> The following patch-set proposes an efficient mechanism for handing freed
>>> memory between the guest and the host. It enables guests with no page
>>> cache to rapidly free and reclaim memory to and from the host,
>>> respectively.
>>>
>>> Benefit:
>>> With this patch-series, in our test-case, executed on a single system and
>>> a single NUMA node with 15GB memory, we were able to successfully launch
>>> at least 5 guests when page hinting was enabled and 3 without it. (A
>>> detailed explanation of the test procedure is provided at the bottom.)
>>>
>>> Changelog in V8:
>>> In this patch-series, the earlier approach [1] which was used to capture
>>> and scan the pages freed by the guest has been changed. The new approach
>>> is briefly described below:
>>>
>>> The patch-set still leverages the existing arch_free_page() to add this
>>> functionality. It maintains a per-CPU array which is used to store the
>>> pages freed by the guest. The maximum number of entries which it can hold
>>> is defined by MAX_FGPT_ENTRIES (1000). When the array is completely
>>> filled, it is scanned and only the pages which are available in the buddy
>>> are kept. This process continues until the array is filled with pages
>>> which are part of the buddy free list, after which it wakes up a per-CPU
>>> kernel thread. This kernel thread rescans the per-CPU array for any
>>> re-allocation, and if a page has not been reallocated and is present in
>>> the buddy, it attempts to isolate the page from the buddy. If the page is
>>> successfully isolated, it is added to another per-CPU array. Once the
>>> entire scanning process is complete, all the isolated pages are reported
>>> to the host through the existing virtio-balloon driver.
>>
>> Hi Nitesh,
>>
>> Have you guys thought about something like below, which would be simpler:
>
> Responding because I'm the first to stumble over this mail, hah! :)
>
>> - use bitmaps to record free pages, e.g. xbitmap:
>> https://lkml.org/lkml/2018/1/9/304.
>> The bitmap can be indexed by the guest pfn, and it's globally accessed by
>> all the CPUs;
>
> Global means all VCPUs will potentially be competing for a single lock
> when freeing/allocating a page, no? What if you have 64 VCPUs
> allocating/freeing memory like crazy?
>
> (I assume some kind of locking is required even if the bitmap were
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page?)
>
> That's the big benefit of taking the pages off the buddy free list:
> other VCPUs won't stumble over them, waiting for them to get freed in
> the hypervisor.
>
>> - arch_free_page(): set the bits of the freed pages in the bitmap
>> (no per-CPU array with a hardcoded fixed length and no per-CPU scanning
>> thread)
>> - arch_alloc_page(): clear the related bits in the bitmap
>> - expose 2 APIs for the callers:
>> -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr);
>> This API searches for the next free page chunk (@nr of pages), starting
>> from @pfn_start. Bits of those free pages will be cleared after this
>> function returns.
>> -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>> This API sets the @nr continuous bits starting from @pfn_start.
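
To make the proposed interface concrete, here is a minimal user-space sketch
of how such a bitmap-backed get_free_page_hints()/put_free_page_hints() pair
could behave. The single global lock, the MAX_PFN placeholder and the pthread
locking are assumptions made purely for illustration; this is not code from
the xbitmap series or from the posted patches, and it deliberately exhibits
the contention point David raises above.

/*
 * Illustration only: one global bitmap, one global lock.
 */
#include <limits.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_PFN        (1UL << 20)      /* placeholder guest memory size */
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

static unsigned long free_page_bitmap[MAX_PFN / BITS_PER_LONG];
static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;

static bool hint_bit_set(unsigned long pfn)
{
	return free_page_bitmap[pfn / BITS_PER_LONG] &
	       (1UL << (pfn % BITS_PER_LONG));
}

static void hint_bits_change(unsigned long pfn_start, unsigned int nr, bool set)
{
	for (unsigned long pfn = pfn_start; pfn < pfn_start + nr; pfn++) {
		unsigned long mask = 1UL << (pfn % BITS_PER_LONG);

		if (set)
			free_page_bitmap[pfn / BITS_PER_LONG] |= mask;
		else
			free_page_bitmap[pfn / BITS_PER_LONG] &= ~mask;
	}
}

/* Search for the next chunk of @nr set (free) bits from @pfn_start on and
 * clear it; returns the first pfn of the chunk, or MAX_PFN if none found. */
unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
	unsigned long pfn, found = MAX_PFN;
	unsigned int run = 0;

	pthread_mutex_lock(&bitmap_lock);
	for (pfn = pfn_start; pfn < MAX_PFN; pfn++) {
		run = hint_bit_set(pfn) ? run + 1 : 0;
		if (run == nr) {
			found = pfn - nr + 1;
			hint_bits_change(found, nr, false);
			break;
		}
	}
	pthread_mutex_unlock(&bitmap_lock);
	return found;
}

/* Mark @nr pages starting at @pfn_start as free/usable again (step 4 of the
 * balloon flow quoted below). */
void put_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
	pthread_mutex_lock(&bitmap_lock);
	hint_bits_change(pfn_start, nr, true);
	pthread_mutex_unlock(&bitmap_lock);
}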
>>
>> Usage example with balloon:
>> 1) host requests to start ballooning;
>> 2) balloon driver calls get_free_page_hints and reports the hints to the
>> host via report_vq;
>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free
>> pages and puts pfn_start back on ack_vq;
>> 4) balloon driver receives pfn_start and calls
>> put_free_page_hints(pfn_start) to have the related bits in the bitmap set,
>> indicating that those free pages are ready to be allocated.
>
> This sounds more like "the host requests to get free pages once in a
> while" compared to "the host is always informed about free pages". At
> the time when the host actually has to ask the guest (e.g. because the
> host is low on memory), it might be too late to wait for guest action.
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
> as candidates for removal, and if the host is low on memory, only
> scanning the guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example
> you describe.
>
>> In 2) above, get_free_page_hints clears the bits, which indicates that
>> those pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from the EPT.
>> Normally, when the guest revisits those pages, EPT violations and QEMU
>> page faults will get a new host page to set up the related EPT entry. If
>> the guest uses such a page before it gets unmapped (i.e. right before step
>> 3), no EPT violation happens and the guest will keep using the same
>> physical page that will be unmapped and given to other host threads. So we
>> need to make sure that a guest free page is usable only after step 3
>> finishes.
>>
>> Back to arch_alloc_page(): it needs to check whether the allocated pages
>> have "1" set in the bitmap; if that's true, it just clears the bits.
>> Otherwise, it means step 2) above has happened and step 4) hasn't been
>> reached yet. In this case, we can either have arch_alloc_page() busy-wait
>> a bit till 4) is done for that page, or, better, have a balloon callback
>> which prioritizes 3) and 4) to make this page usable by the guest.
>
> Regarding the latter, the VCPU allocating a page cannot do anything if
> the page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.
>
>> Using bitmaps to record free page hints doesn't require taking the free
>> pages off the buddy list and returning them later, which has to go through
>> the long allocation/free code path.
>
> Yes, but it means that any process is able to get stuck on such a page
> for as long as it takes to report the free pages to the hypervisor and
> for it to call madvise(pfn_start, DONTNEED) on any such page.
>
> Nice idea, but I think we definitely need something that can potentially
> be implemented per-CPU without any global locks involved.
>
> Thanks!
>
>> Best,
>> Wei
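
To illustrate that concern, here is a sketch (continuing the user-space
sketch earlier in this mail, same assumptions) of the free/allocation hooks
Wei describes, including the busy-wait fallback. The hook names are
placeholders for illustration and are not the real
arch_free_page()/arch_alloc_page().

#include <sched.h>

/* Freeing path (cf. arch_free_page() in the proposal): record the freed
 * pages as hints by setting their bits. */
void hint_on_free(unsigned long pfn_start, unsigned int nr)
{
	put_free_page_hints(pfn_start, nr);
}

/* Allocation path (cf. arch_alloc_page()): a page whose bit is clear has
 * been reported (step 2) but not yet acked by the host (step 4), so the
 * allocation must not proceed until put_free_page_hints() runs for it. */
void hint_on_alloc(unsigned long pfn_start, unsigned int nr)
{
	for (unsigned long pfn = pfn_start; pfn < pfn_start + nr; pfn++) {
		for (;;) {
			pthread_mutex_lock(&bitmap_lock);
			if (hint_bit_set(pfn)) {
				/* Not reported, or already acked: clear the
				 * bit and let the allocation go ahead. */
				hint_bits_change(pfn, 1, false);
				pthread_mutex_unlock(&bitmap_lock);
				break;
			}
			pthread_mutex_unlock(&bitmap_lock);
			/* In flight between steps 2 and 4: nothing to do
			 * but wait for the host's ack. */
			sched_yield();
		}
	}
}

With a single global bitmap_lock, every free, every allocation and every
reporting pass serializes on the same lock, and the waiting loop above can
spin for as long as one report/ack round trip to the host takes.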
Hi Wei,

For your comment, I agree with David. If we have one global bitmap, we will
have to acquire a lock. Also, as David mentioned, the idea is to derive the
hints from the guest rather than having the host ask for free pages.
However, I am wondering if having per-CPU bitmaps would be possible? Using
those I could possibly get rid of the fixed array size issue.
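
A very rough illustration of what such per-CPU bitmaps could look like
(reusing the MAX_PFN/BITS_PER_LONG placeholders from the sketches above;
NR_CPUS_SKETCH is invented for this example, and none of this is from the
posted series):

/* Per-CPU bitmaps: each CPU records the pfns it frees in its own bitmap,
 * so the freeing path never touches shared state or takes a global lock. */
#define NR_CPUS_SKETCH 64	/* invented for this illustration */

static unsigned long
percpu_hint_bitmap[NR_CPUS_SKETCH][MAX_PFN / BITS_PER_LONG];

/* Called from the freeing CPU only. */
static void record_free_page(unsigned int cpu, unsigned long pfn)
{
	percpu_hint_bitmap[cpu][pfn / BITS_PER_LONG] |=
			1UL << (pfn % BITS_PER_LONG);
}

/* Reporting side: walk one CPU's bitmap, return the next recorded pfn
 * (or MAX_PFN if there is none) and clear its bit. */
static unsigned long next_recorded_page(unsigned int cpu, unsigned long start)
{
	for (unsigned long pfn = start; pfn < MAX_PFN; pfn++) {
		unsigned long mask = 1UL << (pfn % BITS_PER_LONG);

		if (percpu_hint_bitmap[cpu][pfn / BITS_PER_LONG] & mask) {
			percpu_hint_bitmap[cpu][pfn / BITS_PER_LONG] &= ~mask;
			return pfn;
		}
	}
	return MAX_PFN;
}

The open point, following David's comments, would be the allocation side: a
page recorded on one CPU can be reallocated from another CPU, so that path
would either need a way to reach the other CPU's bit, or the reporting side
would have to revalidate against the buddy before isolating the page, as the
current series does.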
--
Regards
Nitesh