On 3/20/19 9:18 AM, Nitesh Narayan Lal wrote:
> On 3/19/19 1:59 PM, Nitesh Narayan Lal wrote:
>> On 3/19/19 1:38 PM, Alexander Duyck wrote:
>>> On Tue, Mar 19, 2019 at 9:04 AM Nitesh Narayan Lal <[email protected]> 
>>> wrote:
>>>> On 3/19/19 9:33 AM, David Hildenbrand wrote:
>>>>> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>>>>>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>>>>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal <[email protected]> 
>>>>>>> wrote:
>>>>>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>>>>>> The following patch-set proposes an efficient mechanism for
>>>>>>>>>>>> handing freed memory between the guest and the host. It enables
>>>>>>>>>>>> guests with no page cache to rapidly free memory to the host and
>>>>>>>>>>>> reclaim it from the host when needed.
>>>>>>>>>>>>
>>>>>>>>>>>> Benefit:
>>>>>>>>>>>> With this patch-series, in our test-case, executed on a single
>>>>>>>>>>>> system and a single NUMA node with 15 GB memory, we were able to
>>>>>>>>>>>> successfully launch 5 guests (each with 5 GB memory) when page
>>>>>>>>>>>> hinting was enabled, and only 3 without it. (A detailed explanation
>>>>>>>>>>>> of the test procedure is provided at the bottom under Test 1.)
>>>>>>>>>>>>
>>>>>>>>>>>> Changelog in v9:
>>>>>>>>>>>>    * The guest free page hinting hook is now invoked after a
>>>>>>>>>>>>      page has been merged in the buddy.
>>>>>>>>>>>>    * Only free pages of order FREE_PAGE_HINTING_MIN_ORDER
>>>>>>>>>>>>      (currently defined as MAX_ORDER - 1) are captured.
>>>>>>>>>>>>    * Removed the kthread which was earlier used to perform the
>>>>>>>>>>>>      scanning, isolation & reporting of free pages.
>>>>>>>>>>>>    * Pages captured in the per-CPU array are sorted based on
>>>>>>>>>>>>      their zone numbers, to avoid acquiring zone locks redundantly.
>>>>>>>>>>>>    * Dynamically allocated space is used to hold the isolated
>>>>>>>>>>>>      guest free pages.
>>>>>>>>>>>>    * All the pages are reported asynchronously to the host via
>>>>>>>>>>>>      the virtio driver.
>>>>>>>>>>>>    * Pages are returned to the guest buddy free list only when
>>>>>>>>>>>>      the host response is received.
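To illustrate the capture path described in the changelog above, here is a
minimal sketch of how freed high-order pages could be collected into a
per-CPU array before being reported. The structure, the array size and the
hint_pages_to_host() helper are hypothetical stand-ins, not the actual
patch code.

    #include <linux/mm.h>
    #include <linux/percpu.h>

    #define HINTING_ORDER       (MAX_ORDER - 1)
    #define HINTING_ARRAY_SIZE  32   /* illustrative capacity */

    struct hinting_buf {
            unsigned long pfn[HINTING_ARRAY_SIZE];
            int zone_idx[HINTING_ARRAY_SIZE];
            int cnt;
    };
    static DEFINE_PER_CPU(struct hinting_buf, hint_buf);

    /* Hypothetical helper: sorts the captured pages by zone (so each zone
     * lock is taken only once per batch), isolates them, reports them
     * asynchronously to the host via the virtio driver, and returns them
     * to the buddy free list only once the host response arrives. */
    static void hint_pages_to_host(struct hinting_buf *buf);

    /* Invoked from the buddy free path after a page has been merged. */
    static void capture_free_page(struct page *page, int order)
    {
            struct hinting_buf *buf;

            if (order < HINTING_ORDER)
                    return;   /* only MAX_ORDER - 1 blocks are captured */

            buf = &get_cpu_var(hint_buf);   /* disables preemption */
            buf->pfn[buf->cnt] = page_to_pfn(page);
            buf->zone_idx[buf->cnt] = page_zonenum(page);
            if (++buf->cnt == HINTING_ARRAY_SIZE) {
                    hint_pages_to_host(buf);
                    buf->cnt = 0;
            }
            put_cpu_var(hint_buf);
    }

Sorting the batch by zone before isolation is what keeps the zone lock from
being taken repeatedly for pages belonging to the same zone.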
>>>>>>>>>>>>
>>>>>>>>>>>> Pending items:
>>>>>>>>>>>>    * Make sure that the current guest free page hinting
>>>>>>>>>>>>      implementation doesn't break hugepages or device-assigned
>>>>>>>>>>>>      guests.
>>>>>>>>>>>>    * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side
>>>>>>>>>>>>      support. (It is currently missing.)
>>>>>>>>>>>>    * Compare reporting free pages via vring with vhost.
>>>>>>>>>>>>    * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>>>>>>    * Analyze the overall performance impact of guest free page
>>>>>>>>>>>>      hinting.
>>>>>>>>>>>>    * Come up with proper/traceable error messages/logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Tests:
>>>>>>>>>>>> 1. Use-case - Number of guests we can launch
>>>>>>>>>>>>
>>>>>>>>>>>>    NUMA Nodes = 1 with 15 GB memory
>>>>>>>>>>>>    Guest Memory = 5 GB
>>>>>>>>>>>>    Number of cores in guest = 1
>>>>>>>>>>>>    Workload = a test allocation program that allocates 4 GB of
>>>>>>>>>>>> memory, touches it via memset, and exits (a minimal sketch of
>>>>>>>>>>>> such a program follows the results below).
>>>>>>>>>>>>    Procedure =
>>>>>>>>>>>>    The first guest is launched, and once its console is up, the
>>>>>>>>>>>> test allocation program is executed with a 4 GB memory request
>>>>>>>>>>>> (due to this the guest occupies almost 4-5 GB of memory in the
>>>>>>>>>>>> host on a system without page hinting). Once this program exits,
>>>>>>>>>>>> another guest is launched in the host and the same process is
>>>>>>>>>>>> followed. We continue launching guests until a guest gets killed
>>>>>>>>>>>> due to a low-memory condition in the host.
>>>>>>>>>>>>
>>>>>>>>>>>>    Results:
>>>>>>>>>>>>    Without hinting = 3
>>>>>>>>>>>>    With hinting = 5
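For reference, the test allocation program used as the workload above is
essentially the following (a minimal userspace sketch under my assumptions;
details such as the argument handling are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            /* size in GB; defaults to the 4 GB used in the test above */
            size_t gb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 4;
            size_t sz = gb << 30;
            char *buf = malloc(sz);

            if (!buf) {
                    perror("malloc");
                    return 1;
            }
            memset(buf, 1, sz);   /* touch every page so the guest backs it */
            printf("allocated and touched %zu GB\n", gb);
            free(buf);
            return 0;   /* freed memory becomes a hinting candidate once merged */
    }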
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Hackbench
>>>>>>>>>>>>    Guest Memory = 5 GB
>>>>>>>>>>>>    Number of cores = 4
>>>>>>>>>>>>    Number of tasks    Time with Hinting    Time without Hinting
>>>>>>>>>>>>    4000               19.540               17.818
>>>>>>>>>>>>
>>>>>>>>>>> How about memhog btw?
>>>>>>>>>>> Alex reported:
>>>>>>>>>>>
>>>>>>>>>>>     My testing up till now has consisted of setting up 4 8GB VMs
>>>>>>>>>>>     on a system with 32GB of memory and 4GB of swap. To stress the
>>>>>>>>>>>     memory on the system I would run "memhog 8G" sequentially on
>>>>>>>>>>>     each of the guests and observe how long it took to complete
>>>>>>>>>>>     the run. The observed behavior is that on the systems with
>>>>>>>>>>>     these patches applied in both the guest and on the host I was
>>>>>>>>>>>     able to complete the test with a time of 5 to 7 seconds per
>>>>>>>>>>>     guest. On a system without these patches the time ranged from
>>>>>>>>>>>     7 to 49 seconds per guest. I am assuming the variability is
>>>>>>>>>>>     due to time being spent writing pages out to disk in order to
>>>>>>>>>>>     free up space for the guest.
>>>>>>>>>>>
>>>>>>>>>> Here are the results:
>>>>>>>>>>
>>>>>>>>>> Procedure: 3 guests of size 5 GB are launched on a single NUMA node
>>>>>>>>>> with a total memory of 15 GB and no swap. In each of the guests,
>>>>>>>>>> memhog is run with 5 GB. Post-execution of memhog, host memory usage
>>>>>>>>>> is monitored using the free command.
>>>>>>>>>>
>>>>>>>>>> Without Hinting:
>>>>>>>>>>                  Time of execution    Host used memory
>>>>>>>>>> Guest 1:        45 seconds            5.4 GB
>>>>>>>>>> Guest 2:        45 seconds            10 GB
>>>>>>>>>> Guest 3:        1 minute              15 GB
>>>>>>>>>>
>>>>>>>>>> With Hinting:
>>>>>>>>>>                 Time of execution     Host used memory
>>>>>>>>>> Guest 1:        49 seconds            2.4 GB
>>>>>>>>>> Guest 2:        40 seconds            4.3 GB
>>>>>>>>>> Guest 3:        50 seconds            6.3 GB
>>>>>>>>> OK so no improvement. OTOH Alex's patches cut time down to 5-7 seconds
>>>>>>>>> which seems better. Want to try testing Alex's patches for comparison?
>>>>>>>>>
>>>>>>>> I realized that the last time I reported the memhog numbers, I didn't
>>>>>>>> enable swap, due to which the actual benefits of the series were not
>>>>>>>> shown.
>>>>>>>> I have re-run the test after including some of the changes suggested by
>>>>>>>> Alexander and David:
>>>>>>>>     * Reduced the size of the per-CPU array to 32 and the minimum
>>>>>>>>       hinting threshold to 16.
>>>>>>>>     * Reported the length of the isolated pages along with the start
>>>>>>>>       pfn, instead of the order, from the guest.
>>>>>>>>     * Used the reported length to madvise the entire address range
>>>>>>>>       instead of a single 4K page.
>>>>>>>>     * Replaced MADV_DONTNEED with MADV_FREE.
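For context, the host-side effect of one such hint is roughly the following
(a simplified sketch that assumes a flat mapping of guest memory at
guest_base; the function and parameter names are illustrative, not the
actual QEMU/virtio-balloon code):

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Apply one guest hint: the guest reports a start PFN and a length in
     * pages, and the whole range is madvised in a single call instead of
     * one 4K page at a time. */
    int free_hinted_range(void *guest_base, uint64_t start_pfn,
                          uint64_t npages, size_t page_size)
    {
            void *addr = (char *)guest_base + start_pfn * page_size;
            size_t len = npages * page_size;

            /* MADV_FREE lets the host reclaim lazily; MADV_DONTNEED would
             * discard the pages immediately (that trade-off is still open). */
            return madvise(addr, len, MADV_FREE);
    }

Reporting a (start pfn, length) pair and madvising the whole range keeps it
to one madvise() call per hint, which is what the second and third bullets
above change.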
>>>>>>>>
>>>>>>>> Setup for the test:
>>>>>>>> NUMA node:1
>>>>>>>> Memory: 15GB
>>>>>>>> Swap: 4GB
>>>>>>>> Guest memory: 6GB
>>>>>>>> Number of cores: 1
>>>>>>>>
>>>>>>>> Process: A guest is launched and memhog is run with 6GB. Once its
>>>>>>>> execution is over, the next guest is launched. Every time, the memhog
>>>>>>>> execution time is monitored.
>>>>>>>> Results:
>>>>>>>>     Without Hinting:
>>>>>>>>                  Time of execution
>>>>>>>>     Guest1:    22s
>>>>>>>>     Guest2:    24s
>>>>>>>>     Guest3: 1m29s
>>>>>>>>
>>>>>>>>     With Hinting:
>>>>>>>>                 Time of execution
>>>>>>>>     Guest1:    24s
>>>>>>>>     Guest2:    25s
>>>>>>>>     Guest3:    28s
>>>>>>>>
>>>>>>>> When hinting is enabled, swap space is not used until memhog with 6GB
>>>>>>>> is run in the 6th guest.
>>>>>>> So one change you may want to make to your test setup would be to
>>>>>>> launch the tests sequentially after all the guests are up, instead of
>>>>>>> combining the test and guest bring-up. In addition, you could run
>>>>>>> through the guests more than once to determine a more-or-less steady
>>>>>>> state in terms of the performance as you move between the guests after
>>>>>>> they have hit the point of having to either swap or pull MADV_FREE
>>>>>>> pages.
>>>>>> I tried running memhog as you suggested; here are the results:
>>>>>> Setup for the test:
>>>>>> NUMA node:1
>>>>>> Memory: 15GB
>>>>>> Swap: 4GB
>>>>>> Guest memory: 6GB
>>>>>> Number of cores: 1
>>>>>>
>>>>>> Process: 3 guests are launched and memhog is run with 6GB. Results are
>>>>>> monitored after the first execution of memhog. Memhog is launched
>>>>>> sequentially in each of the guests, and the time is observed after the
>>>>>> execution of all 3 memhog runs is over.
>>>>>>
>>>>>> Results:
>>>>>> Without Hinting
>>>>>>     Time of Execution
>>>>>> 1.    6m48s
>>>>>> 2.    6m9s
>>>>>>
>>>>>> With Hinting
>>>>>> Array size: 16, minimum threshold: 8
>>>>>> 1.    2m57s
>>>>>> 2.    2m20s
>>>>>>
>>>>>> The memhog execution time in the case of hinting is still not as low
>>>>>> as we would have expected. This is due to the usage of swap space.
>>>>>> However, whereas without hinting the swap space used is around 3.5G,
>>>>>> with hinting it remains around 1.1-1.5G.
>>>>>> I did try using a zone free page barrier, which prevents hinting when
>>>>>> the number of free pages of order HINTING_ORDER goes below 256. This
>>>>>> further brings down the swap usage to 100-150 MB. The tricky part of
>>>>>> this approach is configuring this barrier condition for different
>>>>>> guests.
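A minimal sketch of such a barrier check, assuming the threshold of 256
mentioned above (the helper name is illustrative; free_area[order].nr_free
is the buddy allocator's count of free blocks of that order):

    #include <linux/types.h>
    #include <linux/mmzone.h>

    #define HINTING_ORDER     (MAX_ORDER - 1)
    #define HINTING_MIN_FREE  256   /* threshold used in the experiment above */

    /* Skip hinting when a zone is running low on HINTING_ORDER blocks, so
     * the allocator is not starved of large free blocks by isolation. */
    static bool zone_allows_hinting(struct zone *zone)
    {
            return zone->free_area[HINTING_ORDER].nr_free > HINTING_MIN_FREE;
    }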
>>>>>>
>>>>>> Array size: 16, minimum threshold: 8
>>>>>> 1.    1m16s
>>>>>> 2.    1m41s
>>>>>>
>>>>>> Note: Memhog time does seem to vary a little bit on every boot with or
>>>>>> without hinting.
>>>>>>
>>>>> I don't quite understand yet why "hinting more pages" (no free page
>>>>> barrier) should result in a higher swap usage in the hypervisor
>>>>> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would have
>>>>> guessed that runtime could get slower, but not that we need more swap.
>>>>>
>>>>> One theory:
>>>>>
>>>>> If you hint all MAX_ORDER - 1 pages, at one point it could be that all
>>>>> "remaining" free pages are currently isolated to be hinted. As MM needs
>>>>> more pages for a process, it will fall back to using "MAX_ORDER - 2"
>>>>> pages and so on. These pages, when they are freed, won't be hinted
>>>>> anymore unless they get merged. But they won't get merged, because they
>>>>> can't be (otherwise they wouldn't have been "MAX_ORDER - 2" right from
>>>>> the beginning).
>>>>>
>>>>> Try hinting at a smaller granularity to see if this could actually be
>>>>> the case.
>>>> So I have two questions in my mind after looking at the results now:
>>>> 1. Why is swap coming into the picture when hinting is enabled?
>>>> 2. The same one you have raised.
>>>> For the 1st question, I think the answer is (correct me if I am wrong):
>>>> Memhog, while writing the memory, does free memory, but the pages it
>>>> frees are of a lower order and don't merge until the memhog write
>>>> completes. After that we do get the MAX_ORDER - 1 pages from the buddy,
>>>> resulting in hinting.
>>>> As all 3 memhog instances are running in parallel, we don't get free
>>>> memory until one of them completes.
>>>> This does explain why, when 3 guests of 6GB each on a 15GB host try to
>>>> run memhog with 6GB in parallel, swap comes into the picture even if
>>>> hinting is enabled.
>>> Are you running them in parallel or sequentially? 
>> I was running them in parallel, but then I realized that to see any
>> benefits in that case, I should have run a smaller number of guests.
>>> I had suggested
>>> running them serially so that the previous one could complete and free
>>> the memory before the next one allocated memory. In that setup you
>>> should see the guests still swapping without hints, but with hints the
>>> guest should free the memory up before the next one starts using it.
>> Yeah, I just realized this. Thanks for the clarification.
>>> If you are running them in parallel then you are going to see things
>>> going to swap, because memhog does what its name implies and it
>>> will use all of the memory you give it. It isn't until it completes
>>> that the memory is freed.
>>>
>>>> This doesn't explain why putting a barrier or avoiding hinting reduced
>>>> the swap usage. It seems I possibly had a wrong impression of the
>>>> delayed-hinting idea which we discussed.
>>>> I was observing the swap value at the end of the memhog execution, which
>>>> is logically incorrect. I will re-run the test and observe the highest
>>>> swap usage during the entire execution of memhog for hinting vs.
>>>> non-hinting.
>>> So one option you may look at, if you are wanting to run the tests in
>>> parallel, would be to limit the number of tests you have running at the
>>> same time. If you have 15G of memory and 6G per guest you should be
>>> able to run 2 sessions at a time without going to swap; however, if you
>>> run all 3 then you are likely going to go to swap even with
>>> hinting.
>>>
>>> - Alex
> Here are the updated numbers excluding the guest bring-up cost:
> Setup for the test-
> NUMA node:1
> Memory: 15GB
> Swap: 4GB
> Guest memory: 6GB
> Number of cores: 1
> Process: 3 guests are launched and memhog is run serially with 6GB.
> Results:
> Without Hinting
>                        Time of Execution
> Guest1:                56s
> Guest2:                45s
> Guest3:                3m41s
>
> With Hinting
>                        Time of Execution
> Guest1:                46s
> Guest2:                45s
> Guest3:                49s
>
>
>
>
I performed some experiments to see if the current implementation of
hinting breaks THP. I used AnonHugePages to track the THP pages
currently in use and memhog as the guest workload.
Setup:
Host Size: 30GB (No swap)
Guest Size: 15GB
THP Size: 2MB
Process: The guest is installed with different kernels to hint at
different granularities (MAX_ORDER - 1, MAX_ORDER - 2 and MAX_ORDER - 3).
Memhog 15G is run multiple times in the same guest to observe
AnonHugePages usage in the host.

Observation:
There is no THP split for orders MAX_ORDER - 1 and MAX_ORDER - 2, whereas
for hinting granularity MAX_ORDER - 3, THP does split, irrespective of
MADV_FREE or MADV_DONTNEED.
-- 
Regards
Nitesh
