On 3/19/19 1:38 PM, Alexander Duyck wrote:
> On Tue, Mar 19, 2019 at 9:04 AM Nitesh Narayan Lal <[email protected]> wrote:
>> On 3/19/19 9:33 AM, David Hildenbrand wrote:
>>> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>>>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal <[email protected]> wrote:
>>>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>> The following patch-set proposes an efficient mechanism for handing
>>>>>>>>>> freed memory between the guest and the host. It enables guests with
>>>>>>>>>> no page cache to rapidly free and reclaim memory to and from the
>>>>>>>>>> host, respectively.
>>>>>>>>>>
>>>>>>>>>> Benefit:
>>>>>>>>>> With this patch-series, in our test-case, executed on a single
>>>>>>>>>> system and single NUMA node with 15GB memory, we were able to
>>>>>>>>>> successfully launch 5 guests (each with 5 GB memory) when page
>>>>>>>>>> hinting was enabled and 3 without it. (A detailed explanation of
>>>>>>>>>> the test procedure is provided at the bottom under Test - 1.)
>>>>>>>>>>
>>>>>>>>>> Changelog in v9:
>>>>>>>>>> * Guest free page hinting hook is now invoked after a page has
>>>>>>>>>>   been merged in the buddy.
>>>>>>>>>> * Only free pages of order FREE_PAGE_HINTING_MIN_ORDER (currently
>>>>>>>>>>   defined as MAX_ORDER - 1) are captured.
>>>>>>>>>> * Removed the kthread which was earlier used to perform the
>>>>>>>>>>   scanning, isolation & reporting of free pages.
>>>>>>>>>> * Pages captured in the per-cpu array are sorted based on their
>>>>>>>>>>   zone numbers, to avoid acquiring zone locks redundantly.
>>>>>>>>>> * Dynamically allocated space is used to hold the isolated guest
>>>>>>>>>>   free pages.
>>>>>>>>>> * All the pages are reported asynchronously to the host via the
>>>>>>>>>>   virtio driver.
>>>>>>>>>> * Pages are returned to the guest buddy free list only when the
>>>>>>>>>>   host response is received.
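
[Aside: a minimal userspace model of the capture path described in the
changelog above. All names here (hint_entry, capture_free_page, the
array/threshold constants) are made up for illustration, not identifiers
from the series: freed high-order pages accumulate in a per-cpu array,
and each batch is sorted by zone before isolation so that a zone's lock
is taken once per batch rather than once per page.]

#include <stdio.h>
#include <stdlib.h>

#define HINTING_ARRAY_SIZE 32  /* per-cpu capacity */
#define HINTING_THRESHOLD  16  /* batch size that triggers reporting */

struct hint_entry {
    unsigned long pfn;     /* start pfn of the free buddy page */
    unsigned int  zone_id; /* zone the page belongs to */
};

static struct hint_entry cpu_array[HINTING_ARRAY_SIZE];
static unsigned int cpu_count;

static int cmp_by_zone(const void *a, const void *b)
{
    const struct hint_entry *x = a, *y = b;
    return (int)x->zone_id - (int)y->zone_id;
}

/* Stand-in for the hook invoked after a page merges in the buddy. */
static void capture_free_page(unsigned long pfn, unsigned int zone_id)
{
    cpu_array[cpu_count].pfn = pfn;
    cpu_array[cpu_count].zone_id = zone_id;
    if (++cpu_count < HINTING_THRESHOLD)
        return;

    /* Entries sharing a zone become adjacent, so they can be isolated
     * back-to-back under a single zone lock before going to virtio. */
    qsort(cpu_array, cpu_count, sizeof(cpu_array[0]), cmp_by_zone);
    for (unsigned int i = 0; i < cpu_count; i++)
        printf("isolate pfn %lu (zone %u)\n",
               cpu_array[i].pfn, cpu_array[i].zone_id);
    cpu_count = 0; /* pages go back to the buddy on the host's reply */
}

int main(void)
{
    /* Simulate MAX_ORDER - 1 frees landing in two zones, interleaved. */
    for (unsigned long i = 0; i < HINTING_THRESHOLD; i++)
        capture_free_page(i * 512, i & 1);
    return 0;
}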
>>>>>>>>>>
>>>>>>>>>> Pending items:
>>>>>>>>>> * Make sure that the guest free page hinting's current
>>>>>>>>>>   implementation doesn't break hugepages or device assigned guests.
>>>>>>>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side
>>>>>>>>>>   support. (It is currently missing.)
>>>>>>>>>> * Compare reporting free pages via vring with vhost.
>>>>>>>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>>>> * Analyze the overall performance impact due to guest free page
>>>>>>>>>>   hinting.
>>>>>>>>>> * Come up with proper/traceable error-messages/logs.
>>>>>>>>>>
>>>>>>>>>> Tests:
>>>>>>>>>> 1. Use-case - Number of guests we can launch
>>>>>>>>>>
>>>>>>>>>> NUMA Nodes = 1 with 15 GB memory
>>>>>>>>>> Guest Memory = 5 GB
>>>>>>>>>> Number of cores in guest = 1
>>>>>>>>>> Workload = test allocation program allocates 4GB memory, touches
>>>>>>>>>> it via memset and exits.
>>>>>>>>>> Procedure =
>>>>>>>>>> The first guest is launched and once its console is up, the test
>>>>>>>>>> allocation program is executed with a 4 GB memory request (due to
>>>>>>>>>> this, the guest occupies almost 4-5 GB of memory in the host on a
>>>>>>>>>> system without page hinting). Once this program exits, another
>>>>>>>>>> guest is launched in the host and the same process is followed.
>>>>>>>>>> We continue launching guests until a guest gets killed due to a
>>>>>>>>>> low-memory condition in the host.
>>>>>>>>>>
>>>>>>>>>> Results:
>>>>>>>>>> Without hinting = 3
>>>>>>>>>> With hinting = 5
>>>>>>>>>>
>>>>>>>>>> 2. Hackbench
>>>>>>>>>> Guest Memory = 5 GB
>>>>>>>>>> Number of cores = 4
>>>>>>>>>>
>>>>>>>>>> Number of tasks   Time with Hinting (s)   Time without Hinting (s)
>>>>>>>>>> 4000              19.540                  17.818
>>>>>>>>>>
>>>>>>>>> How about memhog btw?
>>>>>>>>> Alex reported:
>>>>>>>>>
>>>>>>>>> My testing up till now has consisted of setting up 4 8GB VMs on a
>>>>>>>>> system with 32GB of memory and 4GB of swap. To stress the memory
>>>>>>>>> on the system I would run "memhog 8G" sequentially on each of the
>>>>>>>>> guests and observe how long it took to complete the run. The
>>>>>>>>> observed behavior is that on the systems with these patches
>>>>>>>>> applied in both the guest and on the host I was able to complete
>>>>>>>>> the test with a time of 5 to 7 seconds per guest. On a system
>>>>>>>>> without these patches the time ranged from 7 to 49 seconds per
>>>>>>>>> guest. I am assuming the variability is due to time being spent
>>>>>>>>> writing pages out to disk in order to free up space for the guest.
>>>>>>>>>
>>>>>>>> Here are the results:
>>>>>>>>
>>>>>>>> Procedure: 3 guests of size 5GB are launched on a single NUMA node
>>>>>>>> with a total memory of 15GB and no swap. In each of the guests,
>>>>>>>> memhog is run with 5GB. Post-execution of memhog, host memory
>>>>>>>> usage is monitored using the free command.
>>>>>>>>
>>>>>>>> Without Hinting:
>>>>>>>>           Time of execution   Host used memory
>>>>>>>> Guest 1:  45 seconds          5.4 GB
>>>>>>>> Guest 2:  45 seconds          10 GB
>>>>>>>> Guest 3:  1 minute            15 GB
>>>>>>>>
>>>>>>>> With Hinting:
>>>>>>>>           Time of execution   Host used memory
>>>>>>>> Guest 1:  49 seconds          2.4 GB
>>>>>>>> Guest 2:  40 seconds          4.3 GB
>>>>>>>> Guest 3:  50 seconds          6.3 GB
>>>>>>>>
>>>>>>> OK so no improvement. OTOH Alex's patches cut time down to 5-7
>>>>>>> seconds, which seems better. Want to try testing Alex's patches for
>>>>>>> comparison?
>>>>>>>
>>>>>> I realized that the last time I reported the memhog numbers, I didn't
>>>>>> enable swap, due to which the actual benefits of the series were not
>>>>>> shown.
>>>>>> I have re-run the test by including some of the changes suggested by
>>>>>> Alexander and David:
>>>>>> * Reduced the size of the per-cpu array to 32 and the minimum
>>>>>>   hinting threshold to 16.
>>>>>> * Reported the length of the isolated pages along with the start
>>>>>>   pfn from the guest, instead of the order.
>>>>>> * Used the reported length to madvise the entire address range
>>>>>>   instead of a single 4K page.
>>>>>> * Replaced MADV_DONTNEED with MADV_FREE.
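
[Aside: a hedged host-side sketch of the last two changes above.
apply_hint and its plumbing are invented for illustration; madvise(2)
itself is the real interface. The whole reported range is passed in one
call, and MADV_FREE lets the host reclaim lazily -- a later guest write
simply cancels the hint, instead of MADV_DONTNEED's immediate zap and
guaranteed refault.]

#define _DEFAULT_SOURCE
#include <sys/mman.h>

/* Apply one guest hint on the host: 'hva' is the host virtual address
 * backing the reported start pfn, 'len' the reported length in bytes. */
static int apply_hint(void *hva, size_t len)
{
#ifdef MADV_FREE
    return madvise(hva, len, MADV_FREE);     /* lazy, pressure-driven */
#else
    return madvise(hva, len, MADV_DONTNEED); /* fallback: eager zap */
#endif
}

int main(void)
{
    size_t len = 64UL << 20; /* pretend a 64 MB range was reported */
    void *hva = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (hva == MAP_FAILED)
        return 1;
    return apply_hint(hva, len) ? 1 : 0;
}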
>>>>>>
>>>>>> Setup for the test:
>>>>>> NUMA node: 1
>>>>>> Memory: 15GB
>>>>>> Swap: 4GB
>>>>>> Guest memory: 6GB
>>>>>> Number of cores: 1
>>>>>>
>>>>>> Process: A guest is launched and memhog is run with 6GB. As its
>>>>>> execution finishes, the next guest is launched. Each time, the
>>>>>> memhog execution time is monitored.
>>>>>>
>>>>>> Results:
>>>>>> Without Hinting:
>>>>>>         Time of execution
>>>>>> Guest1: 22s
>>>>>> Guest2: 24s
>>>>>> Guest3: 1m29s
>>>>>>
>>>>>> With Hinting:
>>>>>>         Time of execution
>>>>>> Guest1: 24s
>>>>>> Guest2: 25s
>>>>>> Guest3: 28s
>>>>>>
>>>>>> When hinting is enabled, swap space is not used until memhog with
>>>>>> 6GB is run in the 6th guest.
>>>>>>
>>>>> So one change you may want to make to your test setup would be to
>>>>> launch the tests sequentially after all the guests are up, instead
>>>>> of combining the test and guest bring-up. In addition you could run
>>>>> through the guests more than once to determine a more-or-less steady
>>>>> state in terms of the performance as you move between the guests,
>>>>> after they have hit the point of having to either swap or pull
>>>>> MADV_FREE pages.
>>>> I tried running memhog as you suggested, here are the results:
>>>>
>>>> Setup for the test:
>>>> NUMA node: 1
>>>> Memory: 15GB
>>>> Swap: 4GB
>>>> Guest memory: 6GB
>>>> Number of cores: 1
>>>>
>>>> Process: 3 guests are launched and memhog is run with 6GB. Results
>>>> are monitored after the first execution of memhog. Memhog is launched
>>>> sequentially in each of the guests and the time is observed after all
>>>> 3 memhog executions are over.
>>>>
>>>> Results:
>>>> Without Hinting
>>>>    Time of Execution
>>>> 1. 6m48s
>>>> 2. 6m9s
>>>>
>>>> With Hinting
>>>> Array size: 16, Minimum Threshold: 8
>>>> 1. 2m57s
>>>> 2. 2m20s
>>>>
>>>> The memhog execution time in the case of hinting is still not as low
>>>> as we would have expected. This is due to the usage of swap space:
>>>> while without hinting the used swap space is around 3.5G, with
>>>> hinting it remains around 1.1-1.5G.
>>>> I did try using a zone free page barrier, which prevents hinting when
>>>> the number of free pages of order HINTING_ORDER goes below 256. This
>>>> further brings down the swap usage to 100-150 MB. The tricky part of
>>>> this approach is to configure this barrier condition for different
>>>> guests.
>>>>
>>>> Array size: 16, Minimum Threshold: 8
>>>> 1. 1m16s
>>>> 2. 1m41s
>>>>
>>>> Note: Memhog time does seem to vary a little bit on every boot, with
>>>> or without hinting.
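
[Aside: the barrier described above amounts to a watermark check before
isolating more pages. A sketch under assumed names (zone_stats and
zone_may_hint are illustrative; the 256 floor is the experimental value
from the mail, not a tuned constant):]

#include <stdbool.h>
#include <stdio.h>

#define HINTING_ZONE_BARRIER 256 /* floor tried in the experiment above */

struct zone_stats {
    unsigned long nr_free_hinting_order; /* free pages at HINTING_ORDER */
};

/* Skip hinting when a zone runs low on high-order free pages, so the
 * allocator is not starved of large blocks while hints are in flight. */
static bool zone_may_hint(const struct zone_stats *z)
{
    return z->nr_free_hinting_order > HINTING_ZONE_BARRIER;
}

int main(void)
{
    struct zone_stats healthy = { .nr_free_hinting_order = 512 };
    struct zone_stats tight   = { .nr_free_hinting_order = 100 };
    printf("healthy zone may hint: %d\n", zone_may_hint(&healthy));
    printf("tight zone may hint:   %d\n", zone_may_hint(&tight));
    return 0;
}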
>>>>
>>> I don't quite understand yet why "hinting more pages" (no free page
>>> barrier) should result in a higher swap usage in the hypervisor
>>> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would
>>> have guessed that runtime could get slower, but not that we need more
>>> swap.
>>>
>>> One theory:
>>>
>>> If you hint all MAX_ORDER - 1 pages, at one point it could be that
>>> all "remaining" free pages are currently isolated to be hinted. As MM
>>> needs more pages for a process, it will fall back to using
>>> "MAX_ORDER - 2" pages and so on. These pages, when they are freed,
>>> you won't hint anymore unless they get merged. But they won't get
>>> merged, because they can't be merged (otherwise they wouldn't have
>>> been "MAX_ORDER - 2" right from the beginning).
>>>
>>> Try hinting at a smaller granularity to see if this could actually be
>>> the case.
>> So I have two questions in my mind after looking at the results now:
>> 1. Why does swap come into the picture when hinting is enabled?
>> 2. The same as what you have raised.
>> For the 1st question, I think the answer is: (correct me if I am wrong)
>> memhog, while writing the memory, does free memory, but the pages it
>> frees are of a lower order, which don't merge until the memhog write
>> completes. After that we do get MAX_ORDER - 1 pages from the buddy,
>> resulting in hinting.
>> As all 3 memhog instances are running in parallel, we don't get free
>> memory until one of them completes.
>> This does explain why, when 3 guests each of 6GB on a 15GB host try to
>> run memhog with 6GB in parallel, swap comes into the picture even if
>> hinting is enabled.
> Are you running them in parallel or sequentially?
I was running them in parallel, but then I realized that to see any
benefits in that case I should have run a smaller number of guests.
> I had suggested running them serially so that the previous one could
> complete and free the memory before the next one allocated memory. In
> that setup you should see the guests still swapping without hints, but
> with hints the guest should free the memory up before the next one
> starts using it.
Yeah, I just realized this. Thanks for the clarification.
> If you are running them in parallel then you are going to see things
> going to swap, because memhog does what the name implies and will use
> all of the memory you give it. It isn't until it completes that the
> memory is freed.
>
>> This doesn't explain why putting a barrier or avoiding hinting reduced
>> the swap usage. It seems I possibly had a wrong impression of the
>> delayed hinting idea which we discussed.
>> As I was observing the value of the swap at the end of the memhog
>> execution, which is logically incorrect, I will re-run the test and
>> observe the highest swap usage during the entire execution of memhog,
>> for hinting vs non-hinting.
> So one option you may look at, if you are wanting to run the tests in
> parallel, would be to limit the number of tests you have running at
> the same time. If you have 15G of memory and 6G per guest you should
> be able to run 2 sessions at a time without going to swap; however, if
> you run all 3 then you are likely going to be going to swap even with
> hinting.
>
> - Alex
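
[For reference, the arithmetic behind Alex's capacity point, assuming
each memhog run touches the full 6 GB and ignoring guest kernel
overhead:]

    2 guests in flight: 2 x 6 GB = 12 GB <= 15 GB RAM -> fits without swap
    3 guests in flight: 3 x 6 GB = 18 GB >  15 GB RAM -> ~3 GB must spill
                                                         to the 4 GB swap,
                                                         with or without
                                                         hinting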
--
Regards
Nitesh