On 2/12/19 4:24 AM, David Hildenbrand wrote:
> On 12.02.19 10:03, Wang, Wei W wrote:
>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>> The following patch-set proposes an efficient mechanism for handing freed
>>> memory between the guest and the host. It enables guests with no page
>>> cache to rapidly free memory to, and reclaim memory from, the host.
>>>
>>> Benefit:
>>> With this patch-series, in our test-case, executed on a single system and
>>> single NUMA node with 15GB memory, we were able to successfully launch
>>> at least 5 guests when page hinting was enabled and 3 without it. (Detailed
>>> explanation of the test procedure is provided at the bottom).
>>>
>>> Changelog in V8:
>>> In this patch-series, the earlier approach [1] which was used to capture and
>>> scan the pages freed by the guest has been changed. The new approach is
>>> briefly described below:
>>>
>>> The patch-set still leverages the existing arch_free_page() to add this
>>> functionality. It maintains a per-CPU array which is used to store the pages
>>> freed by the guest. The maximum number of entries which it can hold is
>>> defined by MAX_FGPT_ENTRIES (1000). When the array is completely filled, it
>>> is scanned and only the pages which are available in the buddy are stored.
>>> This process continues until the array is filled with pages which are part of
>>> the buddy free list, after which a per-CPU kernel thread is woken up.
>>> This kernel thread rescans the per-CPU array for any re-allocation; if a page
>>> has not been reallocated and is still present in the buddy, the kernel
>>> thread attempts to isolate it from the buddy. If it is successfully isolated,
>>> the page is added to another per-CPU array. Once the entire scanning process
>>> is complete, all the isolated pages are reported to the host through the
>>> existing virtio-balloon driver.
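>>>
>>> For illustration only, here is a minimal sketch of the capture path described
>>> above. The names (hinting_obj, free_pages_obj, wake_up_hinting_thread()) are
>>> assumptions for this sketch, not the actual patch code, and locking/preemption
>>> handling is omitted:
>>>
>>> #include <linux/percpu.h>
>>> #include <linux/mm.h>
>>>
>>> #define MAX_FGPT_ENTRIES 1000
>>>
>>> /* Illustrative sketch, not the real patch. */
>>> struct hinting_obj {
>>>         unsigned long pfn;
>>>         unsigned int order;
>>> };
>>>
>>> /* One capture array per CPU, filled from arch_free_page(). */
>>> static DEFINE_PER_CPU(struct hinting_obj, free_pages_obj[MAX_FGPT_ENTRIES]);
>>> static DEFINE_PER_CPU(int, free_pages_idx);
>>>
>>> void arch_free_page(struct page *page, int order)
>>> {
>>>         struct hinting_obj *obj = this_cpu_ptr(free_pages_obj);
>>>         int *idx = this_cpu_ptr(&free_pages_idx);
>>>
>>>         if (*idx < MAX_FGPT_ENTRIES) {
>>>                 obj[*idx].pfn = page_to_pfn(page);
>>>                 obj[*idx].order = order;
>>>                 (*idx)++;
>>>         }
>>>
>>>         /*
>>>          * Once the array is full it is scanned, pages no longer in the
>>>          * buddy are dropped, and the per-CPU hinting thread is woken up
>>>          * to isolate and report the rest (scan/isolate code omitted).
>>>          */
>>>         if (*idx == MAX_FGPT_ENTRIES)
>>>                 wake_up_hinting_thread();       /* hypothetical helper */
>>> }
>>>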
>>  Hi Nitesh,
>>
>> Have you guys thought about something like below, which would be simpler:
> Responding because I'm the first to stumble over this mail, hah! :)
>
>> - use bitmaps to record free pages, e.g. xbitmap:
>>   https://lkml.org/lkml/2018/1/9/304. The bitmap can be indexed by the
>>   guest pfn, and it's globally accessed by all the CPUs;
> Global means all VCPUs will be competing potentially for a single lock
> when freeing/allocating a page, no? What if you have 64 VCPUs
> allocating/freeing memory like crazy?
>
> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)
>
> That's the big benefit of taking the pages off the buddy free list. Other
> VCPUs won't stumble over them, waiting for them to get freed in the
> hypervisor.
>
>> - arch_free_page(): set the bits of the freed pages in the bitmap
>>   (no per-CPU array with a hardcoded fixed length and no per-CPU scanning
>>   thread)
>> - arch_alloc_page(): clear the related bits from the bitmap
>> - expose 2 APIs for the callers:
>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>      This API searches for the next free page chunk (@nr of pages), starting
>>      from @pfn_start. Bits of those free pages will be cleared after this
>>      function returns.
>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>      This API sets the @nr contiguous bits starting from @pfn_start.
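>>
>> A minimal sketch of what these two could look like on top of a flat bitmap
>> protected by a single lock (illustrative only; the bitmap, the lock and the
>> search loop are assumptions here, and an xbitmap-based version would differ):
>>
>> #include <linux/bitmap.h>
>> #include <linux/spinlock.h>
>>
>> static unsigned long *free_page_bitmap;        /* one bit per guest pfn */
>> static unsigned long free_page_bitmap_bits;    /* number of bits, i.e. max_pfn */
>> static DEFINE_SPINLOCK(free_page_bitmap_lock);
>>
>> /*
>>  * Find the next run of @nr set bits (free pages) at or after @pfn_start,
>>  * clear them (the pages are "in flight" until put_free_page_hints()), and
>>  * return the first pfn of the run, or -1UL if nothing was found.
>>  */
>> unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr)
>> {
>>         unsigned long pfn, ret = -1UL;
>>
>>         spin_lock(&free_page_bitmap_lock);
>>         for (pfn = find_next_bit(free_page_bitmap, free_page_bitmap_bits, pfn_start);
>>              pfn + nr <= free_page_bitmap_bits;
>>              pfn = find_next_bit(free_page_bitmap, free_page_bitmap_bits, pfn + 1)) {
>>                 /* Are all @nr bits starting at @pfn set? */
>>                 if (find_next_zero_bit(free_page_bitmap, pfn + nr, pfn) >= pfn + nr) {
>>                         bitmap_clear(free_page_bitmap, pfn, nr);
>>                         ret = pfn;
>>                         break;
>>                 }
>>         }
>>         spin_unlock(&free_page_bitmap_lock);
>>         return ret;
>> }
>>
>> /* Mark @nr pages starting at @pfn_start as free and usable again. */
>> void put_free_page_hints(unsigned long pfn_start, unsigned int nr)
>> {
>>         spin_lock(&free_page_bitmap_lock);
>>         bitmap_set(free_page_bitmap, pfn_start, nr);
>>         spin_unlock(&free_page_bitmap_lock);
>> }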
>>
>> Usage example with balloon:
>> 1) host requests to start ballooning;
>> 2) balloon driver calls get_free_page_hints and reports the hints to the host
>>    via report_vq;
>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free
>>    pages and puts pfn_start back on ack_vq;
>> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start)
>>    to have the related bits in the bitmap set, indicating that those free
>>    pages are ready to be allocated.
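>>
>> In guest terms, the flow above would roughly look like the sketch below
>> (report_hints_to_host() and wait_for_host_ack() are hypothetical stand-ins
>> for the report_vq/ack_vq plumbing, and the real flow would be asynchronous):
>>
>> static void balloon_report_free_pages(void)
>> {
>>         const unsigned int nr = 256;            /* chunk size, arbitrary */
>>         unsigned long pfn = 0;
>>
>>         while ((pfn = get_free_page_hints(pfn, nr)) != -1UL) {
>>                 report_hints_to_host(pfn, nr);  /* step 2: send via report_vq */
>>                 wait_for_host_ack(pfn);         /* step 3: host madvise()s + ack */
>>                 put_free_page_hints(pfn, nr);   /* step 4: pages usable again */
>>                 pfn += nr;
>>         }
>> }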
> This sounds more like "the host requests to get free pages once in a
> while" compared to "the host is always informed about free pages". At
> the time where the host actually has to ask the guest (e.g. because the
> host is low on memory), it might be too late to wait for guest action.
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
> as candidates for removal and if the host is low on memory, only
> scanning the guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example
> you describe.
>
>> In step 2) above, get_free_page_hints clears the bits, which indicates that
>> those pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from EPT.
>> Normally, when the guest re-visits those pages, EPT violations and QEMU page
>> faults will get a new host page to set up the related EPT entry. If the guest
>> uses that page before it gets unmapped (i.e. right before step 3), no
>> EPT violation happens and the guest will keep using the same physical page,
>> which will then be unmapped and given to other host threads. So we need to
>> make sure that a guest free page is usable only after step 3 finishes.
>>
>> Back to arch_alloc_page(): it needs to check if the allocated pages have "1"
>> set in the bitmap. If that's true, just clear the bits. Otherwise, it means
>> step 2) above has happened and step 4) hasn't been reached yet. In this case,
>> we can either have arch_alloc_page() busy-wait a bit till 4) is done for that
>> page, or better, have a balloon callback which prioritizes 3) and 4) to make
>> this page usable by the guest.
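>>
>> A sketch of that arch_alloc_page() check, continuing the illustrative bitmap
>> from above (it assumes the bits of all initially free pages start out set,
>> otherwise the wait below would never finish):
>>
>> void arch_alloc_page(struct page *page, int order)
>> {
>>         unsigned long pfn = page_to_pfn(page);
>>         unsigned int nr = 1U << order;
>>
>>         spin_lock(&free_page_bitmap_lock);
>>         /*
>>          * If some bit is 0, the chunk was handed to the host (step 2) but
>>          * not acked yet (step 4): busy-wait until put_free_page_hints()
>>          * sets the bits again.
>>          */
>>         while (find_next_zero_bit(free_page_bitmap, pfn + nr, pfn) < pfn + nr) {
>>                 spin_unlock(&free_page_bitmap_lock);
>>                 cpu_relax();
>>                 spin_lock(&free_page_bitmap_lock);
>>         }
>>         /* Normal case: the pages are now allocated, clear their bits. */
>>         bitmap_clear(free_page_bitmap, pfn, nr);
>>         spin_unlock(&free_page_bitmap_lock);
>> }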
> Regarding the latter, the VCPU allocating a page cannot do anything if
> the page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.
>
>> Using bitmaps to record free page hints doesn't require taking the free pages
>> off the buddy list and returning them later, which would need to go through
>> the long allocation/free code path.
>>
> Yes, but it means that any process can get stuck on such a page
> for as long as it takes to report the free pages to the hypervisor and
> for it to call madvise(pfn_start, DONTNEED) on any such page.
>
> Nice idea, but I think we definitely need something that can potentially
> be implemented per-cpu without any global locks involved.
>
> Thanks!
>
>> Best,
>> Wei
>>
Hi Wei,

For your comment, I agree with David. If we have one global bitmap shared by
all the CPUs, we will have to acquire a lock.
Also, as David mentioned, the idea is to derive the hints from the guest,
rather than the host asking for free pages.

However, I am wondering whether having per-CPU bitmaps is possible.
Using this I can possibly get rid of the fixed array size issue.
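
For illustration, a per-CPU bitmap could look roughly like the sketch below
(purely speculative; the names and the fixed window size are made up):

#include <linux/percpu.h>
#include <linux/types.h>

#define PER_CPU_HINT_BITS       (64 * 1024)     /* pfns tracked per CPU, arbitrary */

struct cpu_hint_bitmap {
        unsigned long base_pfn;                 /* pfn covered by bit 0 */
        DECLARE_BITMAP(bits, PER_CPU_HINT_BITS);
};
static DEFINE_PER_CPU(struct cpu_hint_bitmap, hint_bitmap);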

-- 
Regards
Nitesh
