On 3/19/19 9:33 AM, David Hildenbrand wrote:
> On 18.03.19 16:57, Nitesh Narayan Lal wrote:
>> On 3/14/19 12:58 PM, Alexander Duyck wrote:
>>> On Thu, Mar 14, 2019 at 9:43 AM Nitesh Narayan Lal <[email protected]> 
>>> wrote:
>>>> On 3/6/19 1:12 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Mar 06, 2019 at 01:07:50PM -0500, Nitesh Narayan Lal wrote:
>>>>>> On 3/6/19 11:09 AM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>>>>>>>> The following patch-set proposes an efficient mechanism for handing
>>>>>>>> freed memory over from the guest to the host. It enables guests with
>>>>>>>> no page cache to rapidly free memory to, and reclaim memory from,
>>>>>>>> the host.
>>>>>>>>
>>>>>>>> Benefit:
>>>>>>>> With this patch-series, in our test-case, executed on a single
>>>>>>>> system and a single NUMA node with 15GB memory, we were able to
>>>>>>>> successfully launch 5 guests (each with 5 GB memory) when page
>>>>>>>> hinting was enabled, and 3 without it. (A detailed explanation of
>>>>>>>> the test procedure is provided at the bottom under Test - 1.)
>>>>>>>>
>>>>>>>> Changelog in v9:
>>>>>>>>    * The guest free page hinting hook is now invoked after a page
>>>>>>>> has been merged in the buddy allocator.
>>>>>>>>    * Only free pages of order FREE_PAGE_HINTING_MIN_ORDER (currently
>>>>>>>> defined as MAX_ORDER - 1) are captured.
>>>>>>>>    * Removed the kthread which was earlier used to perform the
>>>>>>>> scanning, isolation & reporting of free pages.
>>>>>>>>    * Pages captured in the per-CPU array are sorted by zone number,
>>>>>>>> to avoid acquiring a zone's lock more than once.
>>>>>>>>    * Dynamically allocated space is used to hold the isolated guest
>>>>>>>> free pages.
>>>>>>>>    * All pages are reported asynchronously to the host via the
>>>>>>>> virtio driver.
>>>>>>>>    * Pages are returned to the guest buddy free list only when the
>>>>>>>> host response is received.
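To make the capture flow above concrete, here is a rough sketch of that
path. This is illustrative only, not the code from the posted series;
the struct and function names are made up:

  #include <linux/mm.h>
  #include <linux/percpu.h>

  #define FREE_PAGE_HINTING_MIN_ORDER	(MAX_ORDER - 1)
  #define HINTING_ARRAY_SIZE		32	/* illustrative size */

  struct hint_entry {
  	unsigned long pfn;
  	int zone_id;			/* sort key: group by zone */
  };

  struct hint_cpu_state {
  	struct hint_entry entry[HINTING_ARRAY_SIZE];
  	int idx;
  };
  static DEFINE_PER_CPU(struct hint_cpu_state, hint_state);

  /* Hook invoked after buddy merging; lower orders are never captured. */
  void guest_free_page_hint(struct page *page, unsigned int order)
  {
  	struct hint_cpu_state *s = this_cpu_ptr(&hint_state);

  	if (order < FREE_PAGE_HINTING_MIN_ORDER)
  		return;
  	s->entry[s->idx].pfn = page_to_pfn(page);
  	s->entry[s->idx].zone_id = page_zonenum(page);
  	if (++s->idx == HINTING_ARRAY_SIZE) {
  		/* Sort entries by zone_id so each zone lock is taken
  		 * only once while isolating, then hand the isolated
  		 * pages to the virtio driver for async reporting. */
  		s->idx = 0;
  	}
  }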
>>>>>>>>
>>>>>>>> Pending items:
>>>>>>>>    * Make sure that the current implementation of guest free page
>>>>>>>> hinting doesn't break hugepages or device-assigned guests.
>>>>>>>>    * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side
>>>>>>>> support (currently missing).
>>>>>>>>    * Compare reporting free pages via vring with vhost.
>>>>>>>>    * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>>>    * Analyze the overall performance impact of guest free page
>>>>>>>> hinting.
>>>>>>>>    * Come up with proper/traceable error messages/logs.
>>>>>>>>
>>>>>>>> Tests:
>>>>>>>> 1. Use-case - Number of guests we can launch
>>>>>>>>
>>>>>>>>    NUMA Nodes = 1 with 15 GB memory
>>>>>>>>    Guest Memory = 5 GB
>>>>>>>>    Number of cores in guest = 1
>>>>>>>>    Workload = a test allocation program allocates 4GB memory,
>>>>>>>> touches it via memset, and exits (a sketch of such a program is
>>>>>>>> included after the results below).
>>>>>>>>    Procedure =
>>>>>>>>    The first guest is launched, and once its console is up, the test
>>>>>>>> allocation program is executed with a 4 GB memory request (due to
>>>>>>>> this, the guest occupies almost 4-5 GB of memory in the host in a
>>>>>>>> system without page hinting). Once this program exits, another guest
>>>>>>>> is launched in the host and the same process is followed. We
>>>>>>>> continue launching guests until one gets killed due to a low-memory
>>>>>>>> condition in the host.
>>>>>>>>
>>>>>>>>    Results:
>>>>>>>>    Without hinting = 3
>>>>>>>>    With hinting = 5
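For reference, the test allocation program is essentially the following
(a minimal sketch; the fill pattern and error handling are illustrative):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define ALLOC_SIZE	(4UL << 30)	/* 4 GB */

  int main(void)
  {
  	char *buf = malloc(ALLOC_SIZE);

  	if (!buf) {
  		fprintf(stderr, "malloc failed\n");
  		return 1;
  	}
  	/* Touch every page so the host actually has to back them. */
  	memset(buf, 0xa5, ALLOC_SIZE);
  	free(buf);
  	return 0;
  }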
>>>>>>>>
>>>>>>>> 2. Hackbench
>>>>>>>>    Guest Memory = 5 GB
>>>>>>>>    Number of cores = 4
>>>>>>>>    Number of tasks         Time with Hinting (s)   Time without Hinting (s)
>>>>>>>>    4000                    19.540                  17.818
>>>>>>>>
>>>>>>> How about memhog btw?
>>>>>>> Alex reported:
>>>>>>>
>>>>>>>     My testing up till now has consisted of setting up 4 8GB VMs on
>>>>>>>     a system with 32GB of memory and 4GB of swap. To stress the
>>>>>>>     memory on the system I would run "memhog 8G" sequentially on
>>>>>>>     each of the guests and observe how long it took to complete the
>>>>>>>     run. The observed behavior is that on the systems with these
>>>>>>>     patches applied in both the guest and on the host I was able to
>>>>>>>     complete the test with a time of 5 to 7 seconds per guest. On a
>>>>>>>     system without these patches the time ranged from 7 to 49
>>>>>>>     seconds per guest. I am assuming the variability is due to time
>>>>>>>     being spent writing pages out to disk in order to free up space
>>>>>>>     for the guest.
>>>>>>>
>>>>>> Here are the results:
>>>>>>
>>>>>> Procedure: 3 guests of size 5GB are launched on a single NUMA node
>>>>>> with a total memory of 15GB and no swap. In each of the guests, memhog
>>>>>> is run with 5GB. Post-execution of memhog, host memory usage is
>>>>>> monitored using the free command.
>>>>>>
>>>>>> Without Hinting:
>>>>>>                 Time of execution     Host used memory
>>>>>> Guest 1:        45 seconds            5.4 GB
>>>>>> Guest 2:        45 seconds            10 GB
>>>>>> Guest 3:        1 minute              15 GB
>>>>>>
>>>>>> With Hinting:
>>>>>>                 Time of execution     Host used memory
>>>>>> Guest 1:        49 seconds            2.4 GB
>>>>>> Guest 2:        40 seconds            4.3 GB
>>>>>> Guest 3:        50 seconds            6.3 GB
>>>>> OK so no improvement. OTOH Alex's patches cut time down to 5-7 seconds
>>>>> which seems better. Want to try testing Alex's patches for comparison?
>>>>>
>>>> I realized that the last time I reported the memhog numbers, I didn't
>>>> enable swap, due to which the actual benefits of the series were not
>>>> visible.
>>>> I have re-run the test by including some of the changes suggested by
>>>> Alexander and David:
>>>>     * Reduced the size of the per-cpu array to 32 and the minimum
>>>> hinting threshold to 16.
>>>>     * Reported the length of the isolated pages along with the start
>>>> pfn from the guest, instead of the order.
>>>>     * Used the reported length to madvise the entire address range
>>>> instead of a single 4K page at a time (a sketch follows below).
>>>>     * Replaced MADV_DONTNEED with MADV_FREE.
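To illustrate the madvise change, here is a hedged sketch of the
host-side handling; hva_from_gpa() is a hypothetical guest-physical to
host-virtual lookup, not an existing helper:

  #include <stdint.h>
  #include <sys/mman.h>

  #define GUEST_PAGE_SIZE	4096UL

  extern void *hva_from_gpa(uint64_t gpa);	/* hypothetical lookup */

  static int hint_range(uint64_t start_pfn, uint64_t nr_pages)
  {
  	void *hva = hva_from_gpa(start_pfn * GUEST_PAGE_SIZE);

  	/* One madvise() for the whole reported run instead of one call
  	 * per 4K page. MADV_FREE lets the host reclaim lazily under
  	 * memory pressure, so a guest re-touching the range soon
  	 * afterwards stays cheap, unlike MADV_DONTNEED which drops the
  	 * pages immediately. */
  	return madvise(hva, nr_pages * GUEST_PAGE_SIZE, MADV_FREE);
  }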
>>>>
>>>> Setup for the test:
>>>> NUMA node:1
>>>> Memory: 15GB
>>>> Swap: 4GB
>>>> Guest memory: 6GB
>>>> Number of cores: 1
>>>>
>>>> Process: A guest is launched and memhog is run with 6GB. Once its
>>>> execution is over, the next guest is launched. Each time, the memhog
>>>> execution time is monitored.
>>>> Results:
>>>>     Without Hinting:
>>>>                  Time of execution
>>>>     Guest1:    22s
>>>>     Guest2:    24s
>>>>     Guest3:    1m29s
>>>>
>>>>     With Hinting:
>>>>                 Time of execution
>>>>     Guest1:    24s
>>>>     Guest2:    25s
>>>>     Guest3:    28s
>>>>
>>>> When hinting is enabled, swap space is not used until memhog with 6GB
>>>> is run in the 6th guest.
>>> So one change you may want to make to your test setup would be to
>>> launch the tests sequentially after all the guests are up, instead of
>>> combining the test and guest bring-up. In addition, you could run
>>> through the guests more than once to determine a more-or-less steady
>>> state in terms of the performance as you move between the guests after
>>> they have hit the point of having to either swap or pull MADV_FREE
>>> pages.
>> I tried running memhog as you suggested, here are the results:
>> Setup for the test:
>> NUMA node:1
>> Memory: 15GB
>> Swap: 4GB
>> Guest memory: 6GB
>> Number of cores: 1
>>
>> Process: 3 guests are launched and memhog is run with 6GB in each.
>> Results are monitored after the first execution of memhog. memhog is
>> launched sequentially in each of the guests, and the time is recorded
>> after all 3 memhog runs are complete.
>>
>> Results:
>> Without Hinting
>>     Time of Execution   
>> 1.    6m48s                   
>> 2.    6m9s               
>>
>> With Hinting
>> Array size:16 Minimum Threshold:8
>> 1.    2m57s           
>> 2.    2m20s           
>>
>> The memhog execution time in the case of hinting is still not as low
>> as we would have expected. This is due to the usage of swap space:
>> while without hinting the swap space used is around 3.5G, with hinting
>> it remains around 1.1-1.5G.
>> I did try using a zone free page barrier, which prevents hinting when
>> the number of free pages of order HINTING_ORDER in a zone drops below
>> 256. This further brings the swap usage down to 100-150 MB. The tricky
>> part of this approach is configuring the barrier condition for
>> different guests (a rough sketch of the check follows the numbers
>> below).
>>
>> With the barrier (Array size:16, Minimum Threshold:8):
>> 1.    1m16s       
>> 2.    1m41s
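The barrier itself is conceptually just a per-zone check before
capturing; a rough sketch, where 256 was the experimental cut-off and
tuning it per guest is the open problem:

  /* Skip hinting when a zone runs low on HINTING_ORDER pages, so the
   * guest keeps a reserve it can allocate from without faulting hinted
   * memory back in from the host. */
  static bool zone_hinting_allowed(struct zone *zone)
  {
  	return zone->free_area[HINTING_ORDER].nr_free > 256;
  }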
>>
>> Note: Memhog time does seem to vary a little bit on every boot with or
>> without hinting.
>>
> I don't quite understand yet why "hinting more pages" (no free page
> barrier) should result in a higher swap usage in the hypervisor
> (1.1-1.5GB vs. 100-150 MB). If we are "hinting more pages" I would have
> guessed that runtime could get slower, but not that we need more swap.
>
> One theory:
>
> If you hint all MAX_ORDER - 1 pages, at one point it could be that all
> "remaining" free pages are currently isolated to be hinted. As MM needs
> more pages for a process, it will fall back to using "MAX_ORDER - 2"
> pages and so on. These pages, when they are freed, won't be hinted
> anymore unless they get merged. But they won't get merged, because if
> they could be merged they wouldn't have been "MAX_ORDER - 2" right from
> the beginning.
>
> Try hinting at a smaller granularity to see if this could actually be
> the case.
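If I restate your theory as a simplified order walk (not the real
__rmqueue; take_and_split() is a made-up helper), it looks like this:

  /* The allocator takes the smallest free order >= the request. If every
   * MAX_ORDER - 1 block is currently isolated for hinting, requests get
   * carved out of MAX_ORDER - 2 blocks; when those pieces are freed
   * later, their buddies are busy, so they never merge back up to
   * FREE_PAGE_HINTING_MIN_ORDER and are never hinted again. */
  static struct page *rmqueue_sketch(struct zone *zone, unsigned int want)
  {
  	unsigned int order;

  	for (order = want; order < MAX_ORDER; order++) {
  		if (zone->free_area[order].nr_free == 0)
  			continue;	/* e.g. MAX_ORDER - 1 all isolated */
  		return take_and_split(zone, order, want);
  	}
  	return NULL;	/* nothing free: fall into reclaim/swap */
  }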
So I have two questions in my mind after looking at the results now:
1. Why does swap come into the picture when hinting is enabled?
2. The same question you have raised.
For the 1st question, I think the answer is (correct me if I am wrong):
memhog, while writing the memory, does free memory, but the pages it
frees are of a lower order, and they don't merge until the memhog write
completes. Only after that do we get MAX_ORDER - 1 pages from the buddy,
resulting in hinting.
As all 3 memhog instances are running in parallel, we don't get free
memory until one of them completes.
This does explain why, when 3 guests of 6GB each on a 15GB host try to
run memhog with 6GB in parallel, swap comes into the picture even if
hinting is enabled.

This doesn't explain why putting a barrier in place, or avoiding hinting,
reduced the swap usage. It seems I possibly had a wrong impression of
the delayed-hinting idea which we discussed.
Also, I was observing the value of the swap usage at the end of the
memhog execution, which is logically incorrect. I will re-run the test
and observe the peak swap usage during the entire execution of memhog
for hinting vs. non-hinting.

-- 
Regards
Nitesh
