On 07.03.19 19:45, Alexander Duyck wrote:
> On Thu, Mar 7, 2019 at 5:09 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>
>> On 3/6/19 5:05 PM, Alexander Duyck wrote:
>>> On Wed, Mar 6, 2019 at 11:07 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>>>
>>>> On 3/6/19 1:00 PM, Alexander Duyck wrote:
>>>>> On Wed, Mar 6, 2019 at 7:51 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>>>>> The following patch-set proposes an efficient mechanism for handing
>>>>>> freed memory between the guest and the host. It enables guests with
>>>>>> no page cache to rapidly free memory to, and reclaim memory from,
>>>>>> the host.
>>>>>>
>>>>>> Benefit:
>>>>>> With this patch-series, in our test case, executed on a single system
>>>>>> and single NUMA node with 15GB memory, we were able to successfully
>>>>>> launch 5 guests (each with 5GB memory) when page hinting was enabled,
>>>>>> and 3 without it. (A detailed explanation of the test procedure is
>>>>>> provided at the bottom under Test - 1.)
>>>>>>
>>>>>> Changelog in v9:
>>>>>> * The guest free page hinting hook is now invoked after a page has
>>>>>>   been merged in the buddy.
>>>>>> * Only free pages with order FREE_PAGE_HINTING_MIN_ORDER
>>>>>>   (currently defined as MAX_ORDER - 1) are captured.
>>>>>> * Removed the kthread which was earlier used to perform the
>>>>>>   scanning, isolation & reporting of free pages.
>>>>> Without a kthread this has the potential to get really ugly really
>>>>> fast. If we are going to run asynchronously we should probably be
>>>>> truly asynchronous and just place a few pieces of data in the page
>>>>> that a worker thread can use to identify which pages have been hinted
>>>>> and which pages have not.
>>>> Can you please explain what you mean by truly asynchronous?
>>>>
>>>> With this implementation also I am not reporting the pages synchronously.
>>> The problem is you are making it pseudo-synchronous by having to push
>>> pages off to a side buffer, aren't you?
>>> In my mind we should be able to have the page hinting go on with
>>> little to no interference with existing page allocation and freeing.
>> We have to opt for one of two options:
>> 1. Block allocation by using a flag or acquiring a lock to prevent the
>>    usage of pages we are hinting.
>> 2. Remove the page set entirely from the buddy. (This is what I am
>>    doing right now.)
>>
>> The reason I would prefer the second approach is that we are not
>> blocking the allocation in any way, and as we are only working with a
>> smaller set of pages we should be fine.
>> However, with the current approach, as we are reporting asynchronously
>> there is a chance that we end up hinting more than 2-3 times for a
>> single workload run. In situations where this could lead to a low
>> memory condition in the guest, the hinting will fail anyway as the
>> guest will not allow page isolation.
>> I can possibly try and test the same to ensure that we don't get OOM
>> due to hinting when the guest is under memory pressure.
>
> So in either case you are essentially blocking allocation since the
> memory cannot be used. My concern is more with guaranteeing forward
> progress for as many CPUs as possible.
>
> With your current design you have one minor issue in that you aren't
> taking the lock to re-insert the pages back into the buddy allocator.
> When you add that step in, it means you are going to be blocking
> allocation on that zone while you are reinserting the pages.
>
> Also, right now you are using the calls to free_one_page to generate a
> list of hints on where to search. I'm thinking that may not be the best
> approach since what we want to do is provide hints on idle free pages,
> not just pages that will be free for a short period of time.
>
> To that end, what I think we may want to do is instead just walk the
> LRU list for a given zone/order in reverse order so that we can try to
> identify the pages that are most likely to be cold and unused, and
> those are the first ones we want to be hinting on, rather than the ones
> that were just freed. If we can look at doing something like adding a
> jiffies value to the page indicating when it was last freed, we could
> even have a good point for determining when we should stop processing
> pages in a given zone/order list.
>
> In reality the approach wouldn't be too different from what you are
> doing now; the only real difference would be that we would just want
> to walk the LRU list for the given zone/order rather than pulling
> hints on what to free from the calls to free_one_page. In addition we
> would need to add a couple of bits to indicate if the page has been
> hinted on, is in the middle of getting hinted on, and something such
> as the jiffies value I mentioned which we could use to determine how
> old the page is.
>
>>>
>>>>> Then we can have that one thread just walking
>>>>> through the zone memory, pulling out fixed-size pieces at a time and
>>>>> providing hints on that. By doing that we avoid the potential of
>>>>> creating a batch of pages that eats up most of the system memory.
>>>>>
>>>>>> * Pages captured in the per-CPU array are sorted based on
>>>>>>   zone numbers. This is to avoid the redundancy of acquiring zone locks.
>>>>>> * Dynamically allocated space is used to hold the isolated guest
>>>>>>   free pages.
>>>>> I have concerns that doing this per CPU and allocating memory
>>>>> dynamically can result in you losing a significant amount of memory
>>>>> as it sits waiting to be hinted.
>>>> It should not, as the buddy will keep merging the pages and we are
>>>> only capturing MAX_ORDER - 1.
>>>> This was the issue with the last patch-series, when I was capturing
>>>> all order pages, resulting in the per-CPU array being filled with
>>>> lower order pages.
>>>>>> * All the pages are reported asynchronously to the host via the
>>>>>>   virtio driver.
>>>>>> * Pages are returned to the guest buddy free list only when
>>>>>>   the host response is received.
>>>>> I have been thinking about this. Instead of stealing the page,
>>>>> couldn't you simply flag it that there is a hint in progress and
>>>>> simply wait in arch_alloc_page until the hint has been processed?
>>>> With the flag, I am assuming you mean to block the allocation until
>>>> hinting is going on, which is an issue. That was one of the issues
>>>> discussed earlier which I wanted to solve with this implementation.
>>> With the flag we would allow the allocation, but would have to
>>> synchronize with the hinting at that point. I got the idea from the
>>> way the s390 code works. They have both an arch_free_page and an
>>> arch_alloc_page. If I understand correctly, the arch_alloc_page is
>>> what is meant to handle the case of a page that has been marked for
>>> hinting but may not have been hinted on yet. My thought for now is to
>>> keep it simple and use a page flag to indicate that a page is
>>> currently pending a hint.
>> I am assuming this page flag will be located in the page structure.
>>> We should be able to spin in such a case, and
>>> it would probably still perform better than a solution where we would
>>> not have the memory available and possibly be under memory pressure.
>> I had this same idea earlier. However, the thing about which I was not
>> sure is whether adding a flag in the page structure will be acceptable
>> upstream.
>>>
>>>>> The problem is that in
>>>>> stealing pages you are going to introduce false OOM issues when the
>>>>> memory isn't available because it is being hinted on.
>>>> I think this situation will arise when the guest is under memory
>>>> pressure.
>>>> In such situations any attempt to perform isolation will fail
>>>> anyway, and we may not be reporting anything at that time.
>>> What I want to avoid is the scenario where an application grabs a
>>> large amount of memory, then frees said memory, and we are sitting on
>>> it for some time because we decide to try and hint on the large chunk.
>> I agree.
>>> By processing this sometime after the pages are sent to the buddy
>>> allocator in a separate thread, and by processing a small fixed window
>>> of memory at a time, we can avoid making freeing memory expensive and
>>> still provide the hints in a reasonable time frame.
>>
>> My impression is that the current window on which I am working may give
>> issues for smaller-size guests. But otherwise, we are already working
>> with a smaller fixed window of memory.
>>
>> I can further restrict this to just 128 entries and test, which would
>> bring down the window of memory. Let me know what you think.
>
> The problem is 128 entries is still pretty big when you consider you
> are working with 4M pages. If I am not mistaken, that is half a
> gigabyte of memory. For lower order pages 128 would probably be fine,
> but with the higher order pages we may want to contain things to
> something smaller, like 16MB to 64MB worth of memory.
>
I agree, I also still consider it too big for 4MB pages. It would be
different, e.g., for 128KB pages.

-- 
Thanks,

David / dhildenb