On 3/19/19 9:33 AM, David Hildenbrand wrote:
> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal <[email protected]> wrote:
>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>> The following patch-set proposes an efficient mechanism for handing
>>>>>>>> freed memory between the guest and the host. It enables guests with
>>>>>>>> no page cache to rapidly free and reclaim memory to and from the
>>>>>>>> host respectively.
>>>>>>>>
>>>>>>>> Benefit:
>>>>>>>> With this patch-series, in our test-case, executed on a single
>>>>>>>> system and single NUMA node with 15GB memory, we were able to
>>>>>>>> successfully launch 5 guests (each with 5 GB memory) when page
>>>>>>>> hinting was enabled and 3 without it. (A detailed explanation of
>>>>>>>> the test procedure is provided at the bottom under Test - 1.)
>>>>>>>>
>>>>>>>> Changelog in v9:
>>>>>>>> * The guest free page hinting hook is now invoked after a page has
>>>>>>>>   been merged in the buddy.
>>>>>>>> * Only free pages with order FREE_PAGE_HINTING_MIN_ORDER (currently
>>>>>>>>   defined as MAX_ORDER - 1) are captured.
>>>>>>>> * Removed the kthread which was earlier used to perform the
>>>>>>>>   scanning, isolation & reporting of free pages.
>>>>>>>> * Pages captured in the per-cpu array are sorted based on their
>>>>>>>>   zone numbers, to avoid acquiring zone locks redundantly (see the
>>>>>>>>   sketch below).
>>>>>>>> * Dynamically allocated space is used to hold the isolated guest
>>>>>>>>   free pages.
>>>>>>>> * All the pages are reported asynchronously to the host via the
>>>>>>>>   virtio driver.
>>>>>>>> * Pages are returned to the guest buddy free list only when the
>>>>>>>>   host response is received.
>>>>>>>>
>>>>>>>> Pending items:
>>>>>>>> * Make sure that the guest free page hinting's current
>>>>>>>>   implementation doesn't break hugepages or device-assigned guests.
>>>>>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>>>>>>>>   (It is currently missing.)
>>>>>>>> * Compare reporting free pages via vring with vhost.
>>>>>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>> * Analyze the overall performance impact due to guest free page
>>>>>>>>   hinting.
>>>>>>>> * Come up with proper/traceable error messages/logs.
>>>>>>>>
>>>>>>>> Tests:
>>>>>>>> 1. Use-case - Number of guests we can launch
>>>>>>>>
>>>>>>>> NUMA Nodes = 1 with 15 GB memory
>>>>>>>> Guest Memory = 5 GB
>>>>>>>> Number of cores in guest = 1
>>>>>>>> Workload = test allocation program allocates 4GB memory, touches it
>>>>>>>> via memset and exits.
>>>>>>>> Procedure =
>>>>>>>> The first guest is launched and, once its console is up, the test
>>>>>>>> allocation program is executed with a 4 GB memory request (due to
>>>>>>>> this the guest occupies almost 4-5 GB of memory in the host on a
>>>>>>>> system without page hinting). Once this program exits, another
>>>>>>>> guest is launched in the host and the same process is followed. We
>>>>>>>> continue launching guests until a guest gets killed due to a
>>>>>>>> low-memory condition in the host.
>>>>>>>>
>>>>>>>> Results:
>>>>>>>> Without hinting = 3
>>>>>>>> With hinting = 5
>>>>>>>>
>>>>>>>> 2. Hackbench
>>>>>>>> Guest Memory = 5 GB
>>>>>>>> Number of cores = 4
>>>>>>>> Number of tasks    Time with Hinting    Time without Hinting
>>>>>>>> 4000               19.540               17.818
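
For reference, the capture path described in the changelog quoted above
boils down to: hook the buddy free path, collect qualifying high-order
pages into a small per-cpu array, and sort that array by zone before
isolating and reporting, so each zone's lock only has to be taken once
per batch. Below is a rough standalone userspace model of just that
bookkeeping; all names, sizes and types are invented for illustration
and this is not the actual patch code.

/*
 * Rough userspace model of the capture path described in the changelog
 * above. All names, sizes and types are invented; the real
 * implementation lives in the guest kernel.
 */
#include <stdio.h>
#include <stdlib.h>

#define HINTING_MIN_ORDER   9    /* stand-in for FREE_PAGE_HINTING_MIN_ORDER */
#define CAPTURE_ARRAY_SIZE  32   /* per-cpu array size mentioned later in the thread */

struct captured_page {
	unsigned long pfn;
	unsigned int order;
	unsigned int zone;
};

static struct captured_page capture[CAPTURE_ARRAY_SIZE];
static unsigned int nr_captured;

/* Sort by zone so all pages of one zone can be isolated under a single
 * acquisition of that zone's lock. */
static int cmp_zone(const void *a, const void *b)
{
	const struct captured_page *x = a, *y = b;

	return (int)x->zone - (int)y->zone;
}

/* Would be called once a freed page has been merged in the buddy. */
static void hint_on_free(unsigned long pfn, unsigned int order, unsigned int zone)
{
	if (order < HINTING_MIN_ORDER)
		return;                 /* only high-order pages are captured */

	capture[nr_captured++] = (struct captured_page){ pfn, order, zone };

	if (nr_captured == CAPTURE_ARRAY_SIZE) {
		qsort(capture, nr_captured, sizeof(capture[0]), cmp_zone);
		/* The real code would now isolate the pages zone by zone and
		 * report them asynchronously to the host via the virtio
		 * driver, returning them to the buddy once the host responds. */
		nr_captured = 0;
	}
}

int main(void)
{
	hint_on_free(0x100000, 10, 0);
	hint_on_free(0x200000, 10, 1);
	hint_on_free(0x300000, 3, 0);   /* order too small, ignored */
	printf("%u pages currently captured\n", nr_captured);
	return 0;
}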
>>>>>>> How about memhog btw?
>>>>>>> Alex reported:
>>>>>>>
>>>>>>> My testing up till now has consisted of setting up 4 8GB VMs on a
>>>>>>> system with 32GB of memory and 4GB of swap. To stress the memory on
>>>>>>> the system I would run "memhog 8G" sequentially on each of the
>>>>>>> guests and observe how long it took to complete the run. The
>>>>>>> observed behavior is that on the systems with these patches applied
>>>>>>> in both the guest and on the host I was able to complete the test
>>>>>>> with a time of 5 to 7 seconds per guest. On a system without these
>>>>>>> patches the time ranged from 7 to 49 seconds per guest. I am
>>>>>>> assuming the variability is due to time being spent writing pages
>>>>>>> out to disk in order to free up space for the guest.
>>>>>>>
>>>>>> Here are the results:
>>>>>>
>>>>>> Procedure: 3 guests of size 5GB are launched on a single NUMA node
>>>>>> with a total memory of 15GB and no swap. In each of the guests,
>>>>>> memhog is run with 5GB. Post-execution of memhog, host memory usage
>>>>>> is monitored using the free command.
>>>>>>
>>>>>> Without Hinting:
>>>>>>              Time of execution    Host used memory
>>>>>> Guest 1:     45 seconds           5.4 GB
>>>>>> Guest 2:     45 seconds           10 GB
>>>>>> Guest 3:     1 minute             15 GB
>>>>>>
>>>>>> With Hinting:
>>>>>>              Time of execution    Host used memory
>>>>>> Guest 1:     49 seconds           2.4 GB
>>>>>> Guest 2:     40 seconds           4.3 GB
>>>>>> Guest 3:     50 seconds           6.3 GB
>>>>> OK so no improvement. OTOH Alex's patches cut time down to 5-7
>>>>> seconds, which seems better. Want to try testing Alex's patches for
>>>>> comparison?
>>>>>
>>>> I realized that the last time I reported the memhog numbers, I didn't
>>>> enable swap, due to which the actual benefits of the series were not
>>>> shown.
>>>> I have re-run the test by including some of the changes suggested by
>>>> Alexander and David:
>>>> * Reduced the size of the per-cpu array to 32 and the minimum hinting
>>>>   threshold to 16.
>>>> * Reported the length of the isolated pages along with the start pfn,
>>>>   instead of the order, from the guest.
>>>> * Used the reported length to madvise the entire length of the address
>>>>   range instead of a single 4K page (see the sketch below).
>>>> * Replaced MADV_DONTNEED with MADV_FREE.
>>>>
>>>> Setup for the test:
>>>> NUMA node: 1
>>>> Memory: 15GB
>>>> Swap: 4GB
>>>> Guest memory: 6GB
>>>> Number of cores: 1
>>>>
>>>> Process: A guest is launched and memhog is run with 6GB. As soon as
>>>> its execution is over, the next guest is launched. Each time, the
>>>> memhog execution time is monitored.
>>>>
>>>> Results:
>>>> Without Hinting:
>>>>            Time of execution
>>>> Guest1:    22s
>>>> Guest2:    24s
>>>> Guest3:    1m29s
>>>>
>>>> With Hinting:
>>>>            Time of execution
>>>> Guest1:    24s
>>>> Guest2:    25s
>>>> Guest3:    28s
>>>>
>>>> When hinting is enabled, swap space is not used until memhog with 6GB
>>>> is run in the 6th guest.
>>> So one change you may want to make to your test setup would be to
>>> launch the tests sequentially after all the guests are up, instead of
>>> combining the test and guest bring-up. In addition you could run
>>> through the guests more than once to determine a more-or-less steady
>>> state in terms of the performance as you move between the guests after
>>> they have hit the point of having to either swap or pull MADV_FREE
>>> pages.
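
As a side note on the madvise change listed in the bullets above
(covering the whole reported length with one call instead of one 4K page
at a time), here is a minimal, self-contained sketch of what the host
side could do with a reported range. The anonymous mapping only stands
in for the host virtual address obtained by translating the guest's
reported start pfn and length, and the choice between MADV_FREE and
MADV_DONTNEED is still one of the pending items:

/*
 * Minimal demonstration of madvising a whole reported range at once.
 * The anonymous mapping merely stands in for the translated guest range;
 * everything else is illustrative only.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8   /* Linux value; available since kernel 4.5 */
#endif

int main(void)
{
	size_t len = 64UL << 20;   /* pretend the guest reported a 64 MB range */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(addr, 1, len);      /* fault the pages in */

	/* One call over the full reported length, instead of one per 4K page.
	 * MADV_FREE lets the kernel reclaim lazily, whereas MADV_DONTNEED
	 * would drop the pages immediately; which one to use is still listed
	 * as a pending item above. */
	if (madvise(addr, len, MADV_FREE))
		perror("madvise");

	munmap(addr, len);
	return 0;
}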
>> I tried running memhog as you suggested, here are the results:
>>
>> Setup for the test:
>> NUMA node: 1
>> Memory: 15GB
>> Swap: 4GB
>> Guest memory: 6GB
>> Number of cores: 1
>>
>> Process: 3 guests are launched and memhog is run with 6GB. Results are
>> monitored after the 1st-time execution of memhog. Memhog is launched
>> sequentially in each of the guests and the time is observed after the
>> execution of all 3 memhogs is over.
>>
>> Results:
>> Without Hinting
>>    Time of Execution
>> 1. 6m48s
>> 2. 6m9s
>>
>> With Hinting
>> Array size: 16, Minimum Threshold: 8
>> 1. 2m57s
>> 2. 2m20s
>>
>> The memhog execution time in the case of hinting is still not as low
>> as we would have expected. This is due to the usage of swap space:
>> while without hinting the swap used space is around 3.5G, with hinting
>> it remains around 1.1-1.5G.
>> I did try using a zone free page barrier which prevents hinting when
>> free pages of order HINTING_ORDER go below 256 (sketched below). This
>> further brings down the swap usage to 100-150 MB. The tricky part of
>> this approach is to configure this barrier condition for different
>> guests.
>>
>> Array size: 16, Minimum Threshold: 8
>> 1. 1m16s
>> 2. 1m41s
>>
>> Note: Memhog time does seem to vary a little bit on every boot, with
>> or without hinting.
>>
> I don't quite understand yet why "hinting more pages" (no free page
> barrier) should result in a higher swap usage in the hypervisor
> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would have
> guessed that runtime could get slower, but not that we need more swap.
>
> One theory:
>
> If you hint all MAX_ORDER - 1 pages, at one point it could be that all
> "remaining" free pages are currently isolated to be hinted. As MM needs
> more pages for a process, it will fall back to using "MAX_ORDER - 2"
> pages and so on. These pages, when they are freed, you won't hint
> anymore unless they get merged. But after all they won't get merged,
> because they can't be merged (otherwise they wouldn't be "MAX_ORDER - 2"
> right from the beginning).
>
> Try hinting at a smaller granularity to see if this could actually be
> the case.
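
For context, the zone free page barrier mentioned in the results above
is essentially a guard that skips hinting whenever a zone runs low on
free pages of the hinting order. A schematic model follows; the struct
and names are invented, the kernel side would consult the zone's
free_area counters instead, and 256 is simply the threshold used in the
test:

/*
 * Schematic model of the zone free page barrier discussed above: skip
 * hinting for a zone whose free-page count at the hinting order has
 * dropped below a watermark.
 */
#include <stdbool.h>
#include <stdio.h>

#define HINTING_ORDER          9   /* stand-in for the hinting order */
#define HINTING_FREE_BARRIER 256   /* threshold used in the test above */

struct zone_stats {
	unsigned long nr_free_at_hinting_order;
};

static bool allowed_to_hint(const struct zone_stats *zone)
{
	/* Keep a reserve of high-order free pages so allocations don't have
	 * to fall back to lower orders while pages are isolated for hinting. */
	return zone->nr_free_at_hinting_order >= HINTING_FREE_BARRIER;
}

int main(void)
{
	struct zone_stats normal = { .nr_free_at_hinting_order = 300 };
	struct zone_stats dma32  = { .nr_free_at_hinting_order = 100 };

	printf("Normal zone: %s\n", allowed_to_hint(&normal) ? "hint" : "skip");
	printf("DMA32 zone:  %s\n", allowed_to_hint(&dma32) ? "hint" : "skip");
	return 0;
}

As noted above, the check itself is trivial; the tricky part is picking
a threshold that behaves well across differently sized guests.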
So I have two questions in my mind after looking at the results now:
1. Why does swap come into the picture when hinting is enabled?
2. The same as what you have raised.

For the 1st question, I think the answer is (correct me if I am wrong):
Memhog, while writing the memory, does free memory, but the pages it
frees are of a lower order which don't merge until the memhog write
completes. After that we do get MAX_ORDER - 1 pages from the buddy,
resulting in hinting. As all 3 memhogs are running in parallel, we don't
get free memory until one of them completes. This does explain why, when
3 guests each of 6GB on a 15GB host try to run memhog with 6GB in
parallel, swap comes into the picture even if hinting is enabled.

This doesn't explain why putting a barrier or avoiding hinting reduced
the swap usage. It seems I possibly had a wrong impression of the
delayed-hinting idea which we discussed, as I was observing the value of
swap at the end of the memhog execution, which is logically incorrect. I
will re-run the test and observe the highest swap usage during the
entire execution of memhog for hinting vs. non-hinting.

-- 
Regards
Nitesh