On 12.02.19 18:24, Nitesh Narayan Lal wrote:
> 
> On 2/12/19 4:24 AM, David Hildenbrand wrote:
>> On 12.02.19 10:03, Wang, Wei W wrote:
>>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables guests with no page
>>>> cache to rapidly free memory to the host and reclaim it back from the
>>>> host.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test case, executed on a single system and
>>>> a single NUMA node with 15GB memory, we were able to successfully launch
>>>> at least 5 guests when page hinting was enabled and 3 without it. (A
>>>> detailed explanation of the test procedure is provided at the bottom.)
>>>>
>>>> Changelog in V8:
>>>> In this patch-series, the earlier approach [1] which was used to capture
>>>> and scan the pages freed by the guest has been changed. The new approach
>>>> is briefly described below:
>>>>
>>>> The patch-set still leverages the existing arch_free_page() to add this
>>>> functionality. It maintains a per-CPU array which is used to store the
>>>> pages freed by the guest. The maximum number of entries which it can hold
>>>> is defined by MAX_FGPT_ENTRIES (1000). When the array is completely
>>>> filled, it is scanned and only the pages which are available in the buddy
>>>> are kept. This process continues until the array is filled with pages
>>>> which are part of the buddy free list, after which a per-CPU kernel
>>>> thread is woken up. This kernel thread rescans the per-CPU array for any
>>>> re-allocation, and if a page has not been reallocated and is still
>>>> present in the buddy, the kernel thread attempts to isolate it from the
>>>> buddy. If it is successfully isolated, the page is added to another
>>>> per-CPU array. Once the entire scanning process is complete, all the
>>>> isolated pages are reported to the host through the existing
>>>> virtio-balloon driver.
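
Roughly, the capture path described above would look something like the
sketch below; all struct/function names here are made up for illustration
(this is not the actual patch code), and preemption/locking details are
omitted:

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/smp.h>

#define MAX_FGPT_ENTRIES        1000

/* per-CPU capture array: only the local CPU writes to it */
struct free_pages_obj {
        unsigned long pfn[MAX_FGPT_ENTRIES];
        unsigned int order[MAX_FGPT_ENTRIES];
        int idx;
};
static DEFINE_PER_CPU(struct free_pages_obj, fp_obj);

void arch_free_page(struct page *page, int order)
{
        struct free_pages_obj *obj = this_cpu_ptr(&fp_obj);

        if (obj->idx < MAX_FGPT_ENTRIES) {
                obj->pfn[obj->idx] = page_to_pfn(page);
                obj->order[obj->idx] = order;
                obj->idx++;
        }

        if (obj->idx == MAX_FGPT_ENTRIES)
                /*
                 * Prune entries that are no longer in the buddy; once only
                 * buddy pages remain, wake the per-CPU hinting thread which
                 * isolates them and hands them to virtio-balloon.
                 * (wakeup_hinting_thread() is a placeholder.)
                 */
                wakeup_hinting_thread(smp_processor_id());
}
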
>>>  Hi Nitesh,
>>>
>>> Have you guys thought about something like below, which would be simpler:
>> Responding because I'm the first to stumble over this mail, hah! :)
>>
>>> - use bitmaps to record free pages, e.g. xbitmap:
>>>   https://lkml.org/lkml/2018/1/9/304.
>>>   The bitmap can be indexed by the guest pfn, and it's globally accessed
>>>   by all the CPUs;
>> Global means all VCPUs will potentially be competing for a single lock
>> when freeing/allocating a page, no? What if you have 64 VCPUs
>> allocating/freeing memory like crazy?
>>
>> (I assume some kind of locking is required even if the bitmap were
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page?)
>>
>> That's the big benefit of taking the pages off the buddy free list. Other
>> VCPUs won't stumble over them, waiting for them to get freed in the
>> hypervisor.
>>
>>> - arch_free_page(): set the bits of the freed pages in the bitmap
>>>   (no per-CPU array with hardcoded fixed length and no per-CPU scanning
>>>   thread)
>>> - arch_alloc_page(): clear the related bits from the bitmap
>>> - expose 2 APIs for the callers:
>>>   -- unsigned long get_free_page_hints(unsigned long pfn_start,
>>>      unsigned int nr);
>>>      This API searches for the next free page chunk (@nr of pages),
>>>      starting from @pfn_start.
>>>      Bits of those free pages will be cleared after this function returns.
>>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>>      This API sets the @nr contiguous bits starting from @pfn_start.
>>>
>>> Usage example with balloon:
>>> 1) host requests to start ballooning;
>>> 2) balloon driver calls get_free_page_hints and reports the hints to the
>>>    host via report_vq;
>>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of
>>>    free pages and puts pfn_start back on ack_vq;
>>> 4) balloon driver receives pfn_start and calls
>>>    put_free_page_hints(pfn_start) to have the related bits in the bitmap
>>>    set, indicating that those free pages are ready to be allocated.
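
In code, that flow would look roughly like the sketch below; only the two
API signatures come from the proposal above. The chunk size, the
send_hint() helper and the -1UL "no further chunk" convention are made up
for illustration:

#include <linux/virtio.h>

/* the two APIs proposed above */
unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr);
void put_free_page_hints(unsigned long pfn_start, unsigned int nr);

/* placeholder for the actual report_vq plumbing */
static void send_hint(struct virtqueue *vq, unsigned long pfn, unsigned int nr);

#define HINT_CHUNK_PAGES        256     /* illustrative chunk size */

/* step 2: walk the bitmap and report free chunks via report_vq */
static void report_free_page_hints(struct virtqueue *report_vq)
{
        unsigned long pfn = 0;

        /* assume get_free_page_hints() returns -1UL when no chunk is found */
        while ((pfn = get_free_page_hints(pfn, HINT_CHUNK_PAGES)) != -1UL) {
                send_hint(report_vq, pfn, HINT_CHUNK_PAGES);
                pfn += HINT_CHUNK_PAGES;
        }
}

/* step 4: called when the host puts pfn_start back on ack_vq after
 * madvise(pfn_start, DONTNEED) has finished (step 3) */
static void free_page_hint_acked(unsigned long pfn_start)
{
        /* set the bits again so the guest may allocate these pages */
        put_free_page_hints(pfn_start, HINT_CHUNK_PAGES);
}
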
>> This sounds more like "the host requests to get free pages once in a
>> while" compared to "the host is always informed about free pages". At
>> the time when the host actually has to ask the guest (e.g. because the
>> host is low on memory), it might be too late to wait for guest action.
>> Nitesh uses MADV_FREE here (as far as I recall :) ) to only mark pages
>> as candidates for removal, and if the host is low on memory, scanning
>> the guest page tables alone is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example
>> you describe.
>>
>>> In 2) above, get_free_page_hints clears the bits, which indicates that
>>> those pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT.
>>> Normally, when the guest re-visits those pages, EPT violations and QEMU
>>> page faults will get a new host page to set up the related EPT entry. If
>>> the guest uses that page before the page gets unmapped (i.e. right before
>>> step 3), no EPT violation happens and the guest will keep using the same
>>> physical page that will be unmapped and given to other host threads. So
>>> we need to make sure that the guest free page is usable only after step 3
>>> finishes.
>>>
>>> Back to arch_alloc_page(): it needs to check if the allocated pages have
>>> "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it
>>> means step 2) above has happened and step 4) hasn't been reached yet. In
>>> this case, we can either have arch_alloc_page() busy-waiting a bit till
>>> 4) is done for that page, or, better, have a balloon callback which
>>> prioritizes 3) and 4) to make this page usable by the guest.
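
A minimal sketch of what that check could look like, assuming the busy-wait
variant; the free_page_bitmap_test/clear helpers are placeholders for
whatever bitmap implementation ends up being used:

#include <linux/mm.h>

/* placeholders for the (x)bitmap implementation */
bool free_page_bitmap_test(unsigned long pfn);
void free_page_bitmap_clear(unsigned long pfn);

void arch_alloc_page(struct page *page, int order)
{
        unsigned long pfn = page_to_pfn(page);
        unsigned long i;

        for (i = 0; i < (1UL << order); i++) {
                if (free_page_bitmap_test(pfn + i)) {
                        /* hint still set: just drop it */
                        free_page_bitmap_clear(pfn + i);
                        continue;
                }
                /*
                 * Bit already cleared by get_free_page_hints() (step 2) but
                 * not set again yet (step 4): busy-wait until the host has
                 * acked the madvise and put_free_page_hints() has run.
                 */
                while (!free_page_bitmap_test(pfn + i))
                        cpu_relax();
                free_page_bitmap_clear(pfn + i);
        }
}
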
>> Regarding the latter, the VCPU allocating a page cannot do anything if
>> the page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
>>
>>> Using bitmaps to record free page hints doesn't require taking the free
>>> pages off the buddy list and returning them later, which would have to go
>>> through the long allocation/free code path.
>>>
>> Yes, but it means that any process is able to get stuck on such a page
>> for as long as it takes to report the free pages to the hypervisor and
>> for it to call madvise(pfn_start, DONTNEED) on any such page.
>>
>> Nice idea, but I think we definitely need something that can potentially
>> be implemented per-CPU without any global locks involved.
>>
>> Thanks!
>>
>>> Best,
>>> Wei
>>>
> Hi Wei,
> 
> For your comment, I agree with David. If we have one global bitmap shared
> by all the CPUs, we will have to acquire a lock.
> Also, as David mentioned, the idea is to derive the hints from the guest
> itself, rather than having the host ask for free pages.
> 
> However, I am wondering whether having per-CPU bitmaps is possible.
> Using those, I could possibly get rid of the fixed array size issue.
> 

I assume we will have problems with dynamically sized bitmaps - memory
has to be allocated, similar to a dynamically sized list.

But it is definitely worth investigating.
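
Just to illustrate the trade-off: a fixed-size per-CPU bitmap would avoid
allocating memory from arch_free_page(), but it can then only cover a fixed
pfn range, so the "fixed size" problem just moves. All names and sizes
below are made up:

#include <linux/percpu.h>
#include <linux/bitops.h>
#include <linux/types.h>

/* covers a fixed range only, e.g. 2^18 4k pages = 1 GiB per CPU */
#define FREE_PAGE_HINT_BITS     (1UL << 18)

struct free_page_bm {
        DECLARE_BITMAP(bits, FREE_PAGE_HINT_BITS);
};
static DEFINE_PER_CPU(struct free_page_bm, free_page_bm);

static inline void mark_page_free(unsigned long pfn)
{
        /* only the local CPU writes its bitmap, so no lock is needed */
        if (pfn < FREE_PAGE_HINT_BITS)
                __set_bit(pfn, this_cpu_ptr(&free_page_bm)->bits);
        /* pages outside the covered range are simply not hinted */
}
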

-- 

Thanks,

David / dhildenb
