On 07.03.19 19:45, Alexander Duyck wrote:
> On Thu, Mar 7, 2019 at 5:09 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>
>>
>> On 3/6/19 5:05 PM, Alexander Duyck wrote:
>>> On Wed, Mar 6, 2019 at 11:07 AM Nitesh Narayan Lal <nit...@redhat.com> 
>>> wrote:
>>>>
>>>> On 3/6/19 1:00 PM, Alexander Duyck wrote:
>>>>> On Wed, Mar 6, 2019 at 7:51 AM Nitesh Narayan Lal <nit...@redhat.com> 
>>>>> wrote:
>>>>>> The following patch-set proposes an efficient mechanism for handing 
>>>>>> freed memory between the guest and the host. It enables the guests with 
>>>>>> no page cache to rapidly free and reclaims memory to and from the host 
>>>>>> respectively.
>>>>>>
>>>>>> Benefit:
>>>>>> With this patch-series, in our test-case, executed on a single system 
>>>>>> and single NUMA node with 15GB memory, we were able to successfully 
>>>>>> launch 5 guests(each with 5 GB memory) when page hinting was enabled and 
>>>>>> 3 without it. (Detailed explanation of the test procedure is provided at 
>>>>>> the bottom under Test - 1).
>>>>>>
>>>>>> Changelog in v9:
>>>>>>         * Guest free page hinting hook is now invoked after a page has 
>>>>>> been merged in the buddy.
>>>>>>         * Free pages only with order 
>>>>>> FREE_PAGE_HINTING_MIN_ORDER(currently defined as MAX_ORDER - 1) are 
>>>>>> captured.
>>>>>>         * Removed kthread which was earlier used to perform the 
>>>>>> scanning, isolation & reporting of free pages.
>>>>> Without a kthread this has the potential to get really ugly really
>>>>> fast. If we are going to run asynchronously we should probably be
>>>>> truly asynchonous and just place a few pieces of data in the page that
>>>>> a worker thread can use to identify which pages have been hinted and
>>>>> which pages have not.
>>>> Can you please explain what do you mean by truly asynchronous?
>>>>
>>>> With this implementation also I am not reporting the pages synchronously.
>>> The problem is you are making it pseudo synchronous by having to push
>>> pages off to a side buffer aren't you? In my mind we should be able to
>>> have the page hinting go on with little to no interference with
>>> existing page allocation and freeing.
>> We have to opt one of the two options:
>> 1. Block allocation by using a flag or acquire a lock to prevent the
>> usage of pages we are hinting.
>> 2. Remove the page set entirely from the buddy. (This is what I am doing
>> right now)
>>
>> The reason I would prefer the second approach is that we are not
>> blocking the allocation in any way and as we are only working with a
>> smaller set of pages we should be fine.
>> However, with the current approach as we are reporting asynchronously
>> there is a chance that we end up hinting more than 2-3 times for a
>> single workload run. In situation where this could lead to low memory
>> condition in the guest, the hinting will anyways fail as the guest will
>> not allow page isolation.
>> I can possibly try and test the same to ensure that we don't get OOM due
>> to hinting when the guest is under memory pressure.
> 
> So in either case you are essentially blocking allocation since the
> memory cannot be used. My concern is more with guaranteeing forward
> progress for as many CPUs as possible.
> 
> With your current design you have one minor issue in that you aren't
> taking the lock to re-insert the pages back into the buddy allocator.
> When you add that step in it means you are going to be blocking
> allocation on that zone while you are reinserting the pages.
> 
> Also right now you are using the calls to free_one_page to generate a
> list of hints where to search. I'm thinking that may not be the best
> approach since what we want to do is provide hints on idle free pages,
> not just pages that will be free for a short period of time.
> 
> To that end what I think w may want to do is instead just walk the LRU
> list for a given zone/order in reverse order so that we can try to
> identify the pages that are most likely to be cold and unused and
> those are the first ones we want to be hinting on rather than the ones
> that were just freed. If we can look at doing something like adding a
> jiffies value to the page indicating when it was last freed we could
> even have a good point for determining when we should stop processing
> pages in a given zone/order list.
> 
> In reality the approach wouldn't be too different from what you are
> doing now, the only real difference would be that we would just want
> to walk the LRU list for the given zone/order rather then pulling
> hints on what to free from the calls to free_one_page. In addition we
> would need to add a couple bits to indicate if the page has been
> hinted on, is in the middle of getting hinted on, and something such
> as the jiffies value I mentioned which we could use to determine how
> old the page is.
> 
>>>
>>>>> Then we can have that one thread just walking
>>>>> through the zone memory pulling out fixed size pieces at a time and
>>>>> providing hints on that. By doing that we avoid the potential of
>>>>> creating a batch of pages that eat up most of the system memory.
>>>>>
>>>>>>         * Pages, captured in the per cpu array are sorted based on the 
>>>>>> zone numbers. This is to avoid redundancy of acquiring zone locks.
>>>>>>         * Dynamically allocated space is used to hold the isolated guest 
>>>>>> free pages.
>>>>> I have concerns that doing this per CPU and allocating memory
>>>>> dynamically can result in you losing a significant amount of memory as
>>>>> it sits waiting to be hinted.
>>>> It should not as the buddy will keep merging the pages and we are only
>>>> capturing MAX_ORDER - 1.
>>>> This was the issue with the last patch-series when I was capturing all
>>>> order pages resulting in the per-cpu array to be filled with lower order
>>>> pages.
>>>>>>         * All the pages are reported asynchronously to the host via 
>>>>>> virtio driver.
>>>>>>         * Pages are returned back to the guest buddy free list only when 
>>>>>> the host response is received.
>>>>> I have been thinking about this. Instead of stealing the page couldn't
>>>>> you simply flag it that there is a hint in progress and simply wait in
>>>>> arch_alloc_page until the hint has been processed?
>>>> With the flag, I am assuming you mean to block the allocation until
>>>> hinting is going on, which is an issue. That was one of the issues
>>>> discussed earlier which I wanted to solve with this implementation.
>>> With the flag we would allow the allocation, but would have to
>>> synchronize with the hinting at that point. I got the idea from the
>>> way the s390 code works. They have both an arch_free_page and an
>>> arch_alloc_page. If I understand correctly the arch_alloc_page is what
>>> is meant to handle the case of a page that has been marked for
>>> hinting, but may not have been hinted on yet. My thought for now is to
>>> keep it simple and use a page flag to indicate that a page is
>>> currently pending a hint.
>> I am assuming this page flag will be located in the page structure.
>>> We should be able to spin in such a case and
>>> it would probably still perform better than a solution where we would
>>> not have the memory available and possibly be under memory pressure.
>> I had this same idea earlier. However, the thing about which I was not
>> sure is if adding a flag in the page structure will be acceptable upstream.
>>>
>>>>> The problem is in
>>>>> stealing pages you are going to introduce false OOM issues when the
>>>>> memory isn't available because it is being hinted on.
>>>> I think this situation will arise when the guest is under memory
>>>> pressure. In such situations any attempt to perform isolation will
>>>> anyways fail and we may not be reporting anything at that time.
>>> What I want to avoid is the scenario where an application grabs a
>>> large amount of memory, then frees said memory, and we are sitting on
>>> it for some time because we decide to try and hint on the large chunk.
>> I agree.
>>> By processing this sometime after the pages are sent to the buddy
>>> allocator in a separate thread, and by processing a small fixed window
>>> of memory at a time we can avoid making freeing memory expensive, and
>>> still provide the hints in a reasonable time frame.
>>
>> My impression is that the current window on which I am working may give
>> issues for smaller size guests. But otherwise, we are already working
>> with a smaller fixed window of memory.
>>
>> I can further restrict this to just 128 entries and test which would
>> bring down the window of memory. Let me know what you think.
> 
> The problem is 128 entries is still pretty big when you consider you
> are working with 4M pages. If I am not mistaken that is a half
> gigabyte of memory. For lower order pages 128 would probably be fine,
> but with the higher order pages we may want to contain things to
> something smaller like 16MB to 64MB worth of memory.
>

I agree, I also still consider it too big for 4MB pages. It would be
different e.g. for 128KB pages.

-- 

Thanks,

David / dhildenb

Reply via email to