On 3/20/19 9:18 AM, Nitesh Narayan Lal wrote:
> On 3/19/19 1:59 PM, Nitesh Narayan Lal wrote:
>> On 3/19/19 1:38 PM, Alexander Duyck wrote:
>>> On Tue, Mar 19, 2019 at 9:04 AM Nitesh Narayan Lal <[email protected]> wrote:
>>>> On 3/19/19 9:33 AM, David Hildenbrand wrote:
>>>>> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>>>>>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>>>>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal <[email protected]> wrote:
>>>>>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>>>> The following patch-set proposes an efficient mechanism for
>>>>>>>>>>>> handing freed memory between the guest and the host. It enables
>>>>>>>>>>>> guests with no page cache to rapidly free memory to the host
>>>>>>>>>>>> and reclaim it back when needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Benefit:
>>>>>>>>>>>> With this patch-series, in our test case, executed on a single
>>>>>>>>>>>> system and single NUMA node with 15GB memory, we were able to
>>>>>>>>>>>> successfully launch 5 guests (each with 5 GB memory) when page
>>>>>>>>>>>> hinting was enabled, and only 3 without it. (A detailed
>>>>>>>>>>>> explanation of the test procedure is provided at the bottom
>>>>>>>>>>>> under Test 1.)
>>>>>>>>>>>>
>>>>>>>>>>>> Changelog in v9:
>>>>>>>>>>>> * The guest free page hinting hook is now invoked after a page
>>>>>>>>>>>>   has been merged in the buddy.
>>>>>>>>>>>> * Only free pages with order >= FREE_PAGE_HINTING_MIN_ORDER
>>>>>>>>>>>>   (currently defined as MAX_ORDER - 1) are captured.
>>>>>>>>>>>> * Removed the kthread which was earlier used to perform the
>>>>>>>>>>>>   scanning, isolation & reporting of free pages.
>>>>>>>>>>>> * Pages captured in the per-CPU array are sorted by zone
>>>>>>>>>>>>   number, to avoid acquiring zone locks redundantly.
>>>>>>>>>>>> * Dynamically allocated space is used to hold the isolated
>>>>>>>>>>>>   guest free pages.
>>>>>>>>>>>> * All the pages are reported asynchronously to the host via
>>>>>>>>>>>>   the virtio driver.
>>>>>>>>>>>> * Pages are returned to the guest buddy free list only when
>>>>>>>>>>>>   the host response is received.
>>>>>>>>>>>>
>>>>>>>>>>>> Pending items:
>>>>>>>>>>>> * Make sure that the current implementation of guest free
>>>>>>>>>>>>   page hinting doesn't break hugepages or device-assigned
>>>>>>>>>>>>   guests.
>>>>>>>>>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side
>>>>>>>>>>>>   support. (It is currently missing.)
>>>>>>>>>>>> * Compare reporting free pages via vring with vhost.
>>>>>>>>>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>>>>>> * Analyze the overall performance impact of guest free page
>>>>>>>>>>>>   hinting.
>>>>>>>>>>>> * Come up with proper/traceable error messages/logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Tests:
>>>>>>>>>>>> 1. Use case - number of guests we can launch
>>>>>>>>>>>>
>>>>>>>>>>>> NUMA nodes = 1, with 15 GB memory
>>>>>>>>>>>> Guest memory = 5 GB
>>>>>>>>>>>> Number of cores in guest = 1
>>>>>>>>>>>> Workload = test allocation program; it allocates 4GB memory,
>>>>>>>>>>>> touches it via memset, and exits.
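[A minimal sketch of what the 4GB test allocation workload above could
look like; the actual test program is not posted in this thread, so
take this as an illustrative stand-in.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t len = 4UL << 30;         /* 4 GB */
        char *buf = malloc(len);

        if (!buf) {
                perror("malloc");
                return 1;
        }
        memset(buf, 1, len);    /* touch every page so it is faulted in */
        free(buf);              /* freed pages later become hinting candidates */
        return 0;
}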
>>>>>>>>>>>> Procedure =
>>>>>>>>>>>> The first guest is launched, and once its console is up, the
>>>>>>>>>>>> test allocation program is executed with a 4 GB memory request
>>>>>>>>>>>> (due to this, the guest occupies almost 4-5 GB of memory in the
>>>>>>>>>>>> host on a system without page hinting). Once this program
>>>>>>>>>>>> exits, another guest is launched in the host and the same
>>>>>>>>>>>> process is followed. We continue launching guests until a
>>>>>>>>>>>> guest gets killed due to a low-memory condition in the host.
>>>>>>>>>>>>
>>>>>>>>>>>> Results:
>>>>>>>>>>>> Without hinting = 3
>>>>>>>>>>>> With hinting    = 5
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Hackbench
>>>>>>>>>>>> Guest memory = 5 GB
>>>>>>>>>>>> Number of cores = 4
>>>>>>>>>>>>
>>>>>>>>>>>> Number of tasks   Time with hinting   Time without hinting
>>>>>>>>>>>> 4000              19.540              17.818
>>>>>>>>>>>>
>>>>>>>>>>> How about memhog btw?
>>>>>>>>>>> Alex reported:
>>>>>>>>>>>
>>>>>>>>>>> My testing up till now has consisted of setting up 4 8GB VMs on
>>>>>>>>>>> a system with 32GB of memory and 4GB of swap. To stress the
>>>>>>>>>>> memory on the system I would run "memhog 8G" sequentially on
>>>>>>>>>>> each of the guests and observe how long it took to complete the
>>>>>>>>>>> run. The observed behavior is that on the systems with these
>>>>>>>>>>> patches applied in both the guest and on the host I was able to
>>>>>>>>>>> complete the test with a time of 5 to 7 seconds per guest. On a
>>>>>>>>>>> system without these patches the time ranged from 7 to 49
>>>>>>>>>>> seconds per guest. I am assuming the variability is due to time
>>>>>>>>>>> being spent writing pages out to disk in order to free up space
>>>>>>>>>>> for the guest.
>>>>>>>>>>>
>>>>>>>>>> Here are the results:
>>>>>>>>>>
>>>>>>>>>> Procedure: 3 guests of size 5GB are launched on a single NUMA
>>>>>>>>>> node with a total memory of 15GB and no swap. In each of the
>>>>>>>>>> guests, memhog is run with 5GB. After the execution of memhog,
>>>>>>>>>> host memory usage is monitored using the free command.
>>>>>>>>>>
>>>>>>>>>> Without hinting:
>>>>>>>>>>          Time of execution   Host used memory
>>>>>>>>>> Guest 1: 45 seconds          5.4 GB
>>>>>>>>>> Guest 2: 45 seconds          10 GB
>>>>>>>>>> Guest 3: 1 minute            15 GB
>>>>>>>>>>
>>>>>>>>>> With hinting:
>>>>>>>>>>          Time of execution   Host used memory
>>>>>>>>>> Guest 1: 49 seconds          2.4 GB
>>>>>>>>>> Guest 2: 40 seconds          4.3 GB
>>>>>>>>>> Guest 3: 50 seconds          6.3 GB
>>>>>>>>>>
>>>>>>>>> OK, so no improvement. OTOH Alex's patches cut the time down to
>>>>>>>>> 5-7 seconds, which seems better. Want to try testing Alex's
>>>>>>>>> patches for comparison?
>>>>>>>>>
>>>>>>>> I realized that the last time I reported the memhog numbers, I
>>>>>>>> didn't enable swap, due to which the actual benefits of the series
>>>>>>>> were not shown.
>>>>>>>> I have re-run the test, including some of the changes suggested by
>>>>>>>> Alexander and David:
>>>>>>>> * Reduced the size of the per-CPU array to 32 and the minimum
>>>>>>>>   hinting threshold to 16.
>>>>>>>> * Reported the length of the isolated pages along with the start
>>>>>>>>   pfn from the guest, instead of the order.
>>>>>>>> * Used the reported length to madvise the entire range of memory
>>>>>>>>   instead of a single 4K page at a time.
>>>>>>>> * Replaced MADV_DONTNEED with MADV_FREE.
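[A minimal sketch of the host-side madvise change described in the
last two items above. The helper name is hypothetical, and it assumes
the reported start pfn has already been translated to a host virtual
address; the actual QEMU-side code is not shown in this thread.]

#include <sys/mman.h>
#include <unistd.h>

/* 'hva' and 'nr_pages' come from one reported (start pfn, length)
 * pair; both names are illustrative. */
static int hint_reported_range(void *hva, size_t nr_pages)
{
        size_t len = nr_pages * getpagesize();

        /* MADV_FREE only marks the pages as reclaimable; the host frees
         * them lazily, under memory pressure. MADV_DONTNEED would drop
         * them immediately. Choosing between the two is still a pending
         * item for the series. */
        return madvise(hva, len, MADV_FREE);
}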
>>>>>>>>
>>>>>>>> Setup for the test:
>>>>>>>> NUMA node: 1
>>>>>>>> Memory: 15GB
>>>>>>>> Swap: 4GB
>>>>>>>> Guest memory: 6GB
>>>>>>>> Number of cores: 1
>>>>>>>>
>>>>>>>> Process: A guest is launched and memhog is run with 6GB. As soon
>>>>>>>> as its execution is over, the next guest is launched. Each time,
>>>>>>>> the memhog execution time is monitored.
>>>>>>>>
>>>>>>>> Results:
>>>>>>>> Without hinting:
>>>>>>>>         Time of execution
>>>>>>>> Guest1: 22s
>>>>>>>> Guest2: 24s
>>>>>>>> Guest3: 1m29s
>>>>>>>>
>>>>>>>> With hinting:
>>>>>>>>         Time of execution
>>>>>>>> Guest1: 24s
>>>>>>>> Guest2: 25s
>>>>>>>> Guest3: 28s
>>>>>>>>
>>>>>>>> When hinting is enabled, swap space is not used until memhog with
>>>>>>>> 6GB is run in the 6th guest.
>>>>>>> So one change you may want to make to your test setup would be to
>>>>>>> launch the tests sequentially after all the guests are up, instead
>>>>>>> of combining the test and guest bring-up. In addition, you could
>>>>>>> run through the guests more than once to determine a more-or-less
>>>>>>> steady state in terms of the performance as you move between the
>>>>>>> guests after they have hit the point of having to either swap or
>>>>>>> pull MADV_FREE pages.
>>>>>> I tried running memhog as you suggested; here are the results:
>>>>>>
>>>>>> Setup for the test:
>>>>>> NUMA node: 1
>>>>>> Memory: 15GB
>>>>>> Swap: 4GB
>>>>>> Guest memory: 6GB
>>>>>> Number of cores: 1
>>>>>>
>>>>>> Process: 3 guests are launched and memhog is run with 6GB. Results
>>>>>> are monitored after the first execution of memhog. Memhog is
>>>>>> launched sequentially in each of the guests, and the time is
>>>>>> observed after the execution of all 3 memhogs is over.
>>>>>>
>>>>>> Results:
>>>>>> Without hinting
>>>>>>    Time of execution
>>>>>> 1. 6m48s
>>>>>> 2. 6m9s
>>>>>>
>>>>>> With hinting
>>>>>> Array size: 16, minimum threshold: 8
>>>>>> 1. 2m57s
>>>>>> 2. 2m20s
>>>>>>
>>>>>> The memhog execution time with hinting is still not as low as we
>>>>>> would have expected. This is due to the usage of swap space:
>>>>>> whereas without hinting the used swap space is around 3.5G, with
>>>>>> hinting it stays around 1.1-1.5G.
>>>>>> I did try using a zone free page barrier, which prevents hinting
>>>>>> when the number of free pages of order HINTING_ORDER drops below
>>>>>> 256. This further brings the swap usage down to 100-150 MB. The
>>>>>> tricky part of this approach is configuring this barrier condition
>>>>>> for different guests.
>>>>>>
>>>>>> Array size: 16, minimum threshold: 8
>>>>>> 1. 1m16s
>>>>>> 2. 1m41s
>>>>>>
>>>>>> Note: The memhog time does seem to vary a little on every boot,
>>>>>> with or without hinting.
>>>>>>
>>>>> I don't quite understand yet why "hinting more pages" (no free page
>>>>> barrier) should result in higher swap usage in the hypervisor
>>>>> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would
>>>>> have guessed that the runtime could get slower, but not that we
>>>>> would need more swap.
>>>>>
>>>>> One theory:
>>>>>
>>>>> If you hint all MAX_ORDER - 1 pages, at one point it could be that
>>>>> all "remaining" free pages are currently isolated to be hinted. As
>>>>> MM needs more pages for a process, it will fall back to using
>>>>> "MAX_ORDER - 2" pages and so on. These pages, when they are freed,
>>>>> you won't hint anymore unless they get merged. But they won't get
>>>>> merged, because they can't be merged (otherwise they wouldn't have
>>>>> been "MAX_ORDER - 2" right from the beginning).
>>>>>
>>>>> Try hinting at a smaller granularity to see if this could actually
>>>>> be the case.
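[A rough sketch of the "zone free page barrier" idea mentioned above,
in kernel-style C. The constants and the helper name are illustrative
and do not come from the posted series.]

#include <linux/mmzone.h>

#define HINTING_ORDER           (MAX_ORDER - 1)
#define HINTING_MIN_FREE        256

/* Skip hinting when a zone runs low on free pages of the hinting
 * order, so that isolating pages for hinting never drains the zone's
 * high-order free list completely. */
static bool zone_allows_hinting(struct zone *zone)
{
        return zone->free_area[HINTING_ORDER].nr_free > HINTING_MIN_FREE;
}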
>>>> So I have two questions in my mind after looking at the results now:
>>>> 1. Why does swap come into the picture when hinting is enabled?
>>>> 2. The same question that you have raised.
>>>>
>>>> For the 1st question, I think the answer is (correct me if I am
>>>> wrong): memhog, while writing the memory, does free memory, but the
>>>> pages it frees are of a lower order, and they don't get merged until
>>>> the memhog write completes. After that we do get MAX_ORDER - 1 pages
>>>> from the buddy, resulting in hinting.
>>>> As all 3 memhogs are running in parallel, we don't get free memory
>>>> until one of them completes.
>>>> This does explain why, when 3 guests of 6GB each on a 15GB host try
>>>> to run memhog with 6GB in parallel, swap comes into the picture even
>>>> if hinting is enabled.
>>> Are you running them in parallel or sequentially?
>> I was running them in parallel, but then I realized that to see any
>> benefits in that case, I should have run a smaller number of guests.
>>> I had suggested running them serially so that the previous one could
>>> complete and free the memory before the next one allocated memory. In
>>> that setup you should see the guests still swapping without hints,
>>> but with hints the guest should free the memory up before the next
>>> one starts using it.
>> Yeah, I just realized this. Thanks for the clarification.
>>> If you are running them in parallel then you are going to see things
>>> going to swap, because memhog does what the name implies and it will
>>> use all of the memory you give it. It isn't until it completes that
>>> the memory is freed.
>>>
>>>> This doesn't explain why putting a barrier or avoiding hinting
>>>> reduced the swap usage. It seems I possibly had a wrong impression
>>>> of the delayed-hinting idea which we discussed.
>>>> Also, I was observing the swap value at the end of the memhog
>>>> execution, which is logically incorrect. I will re-run the test and
>>>> observe the highest swap usage during the entire execution of memhog
>>>> for hinting vs. non-hinting.
>>> So one option you may look at, if you are wanting to run the tests in
>>> parallel, would be to limit the number of tests you have running at
>>> the same time. If you have 15G of memory and 6G per guest, you should
>>> be able to run 2 sessions at a time without going to swap; however,
>>> if you run all 3 then you are likely going to be going to swap even
>>> with hinting.
>>>
>>> - Alex
> Here are the updated numbers, excluding the guest bring-up cost:
>
> Setup for the test:
> NUMA node: 1
> Memory: 15GB
> Swap: 4GB
> Guest memory: 6GB
> Number of cores: 1
>
> Process: 3 guests are launched and memhog is run serially with 6GB.
>
> Results:
> Without hinting
>         Time of execution
> Guest1: 56s
> Guest2: 45s
> Guest3: 3m41s
>
> With hinting
> Guest1: 46s
> Guest2: 45s
> Guest3: 49s

I performed some experiments to see if the current implementation of
hinting breaks THP. I used AnonHugePages to track the THP pages
currently in use and memhog as the guest workload.

Setup:
Host size: 30GB (no swap)
Guest size: 15GB
THP size: 2MB

Process: The guest is installed with different kernels to hint at
different granularities (MAX_ORDER - 1, MAX_ORDER - 2 and
MAX_ORDER - 3). Memhog 15G is run multiple times in the same guest to
observe the AnonHugePages usage in the host.
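[For reference, a minimal sketch of how the host's AnonHugePages
counter can be sampled between memhog runs; it simply reads
/proc/meminfo, so any equivalent tool would do.]

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Print only the AnonHugePages line of /proc/meminfo */
        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "AnonHugePages:", 14) == 0)
                        fputs(line, stdout);
        fclose(f);
        return 0;
}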
Observation: There is no THP split for orders MAX_ORDER - 1 and
MAX_ORDER - 2, whereas for hinting granularity MAX_ORDER - 3, THP does
split, irrespective of MADV_FREE or MADV_DONTNEED.

--
Regards
Nitesh