On 3/7/19 1:45 PM, Alexander Duyck wrote:
> On Thu, Mar 7, 2019 at 5:09 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>
>> On 3/6/19 5:05 PM, Alexander Duyck wrote:
>>> On Wed, Mar 6, 2019 at 11:07 AM Nitesh Narayan Lal <nit...@redhat.com> 
>>> wrote:
>>>> On 3/6/19 1:00 PM, Alexander Duyck wrote:
>>>>> On Wed, Mar 6, 2019 at 7:51 AM Nitesh Narayan Lal <nit...@redhat.com> 
>>>>> wrote:
>>>>>> The following patch-set proposes an efficient mechanism for handing 
>>>>>> freed memory between the guest and the host. It enables the guests with 
>>>>>> no page cache to rapidly free and reclaims memory to and from the host 
>>>>>> respectively.
>>>>>>
>>>>>> Benefit:
>>>>>> With this patch-series, in our test-case, executed on a single system 
>>>>>> and single NUMA node with 15GB memory, we were able to successfully 
>>>>>> launch 5 guests(each with 5 GB memory) when page hinting was enabled and 
>>>>>> 3 without it. (Detailed explanation of the test procedure is provided at 
>>>>>> the bottom under Test - 1).
>>>>>>
>>>>>> Changelog in v9:
>>>>>>         * Guest free page hinting hook is now invoked after a page has 
>>>>>> been merged in the buddy.
>>>>>>         * Free pages only with order 
>>>>>> FREE_PAGE_HINTING_MIN_ORDER(currently defined as MAX_ORDER - 1) are 
>>>>>> captured.
>>>>>>         * Removed kthread which was earlier used to perform the 
>>>>>> scanning, isolation & reporting of free pages.
>>>>> Without a kthread this has the potential to get really ugly really
>>>>> fast. If we are going to run asynchronously we should probably be
>>>>> truly asynchonous and just place a few pieces of data in the page that
>>>>> a worker thread can use to identify which pages have been hinted and
>>>>> which pages have not.
>>>> Can you please explain what do you mean by truly asynchronous?
>>>>
>>>> With this implementation also I am not reporting the pages synchronously.
>>> The problem is you are making it pseudo synchronous by having to push
>>> pages off to a side buffer aren't you? In my mind we should be able to
>>> have the page hinting go on with little to no interference with
>>> existing page allocation and freeing.
>> We have to opt one of the two options:
>> 1. Block allocation by using a flag or acquire a lock to prevent the
>> usage of pages we are hinting.
>> 2. Remove the page set entirely from the buddy. (This is what I am doing
>> right now)
>>
>> The reason I would prefer the second approach is that we are not
>> blocking the allocation in any way and as we are only working with a
>> smaller set of pages we should be fine.
>> However, with the current approach as we are reporting asynchronously
>> there is a chance that we end up hinting more than 2-3 times for a
>> single workload run. In situation where this could lead to low memory
>> condition in the guest, the hinting will anyways fail as the guest will
>> not allow page isolation.
>> I can possibly try and test the same to ensure that we don't get OOM due
>> to hinting when the guest is under memory pressure.
> So in either case you are essentially blocking allocation since the
> memory cannot be used. My concern is more with guaranteeing forward
> progress for as many CPUs as possible.
>
> With your current design you have one minor issue in that you aren't
> taking the lock to re-insert the pages back into the buddy allocator.
> When you add that step in it means you are going to be blocking
> allocation on that zone while you are reinserting the pages.
>
> Also right now you are using the calls to free_one_page to generate a
> list of hints where to search. I'm thinking that may not be the best
> approach since what we want to do is provide hints on idle free pages,
> not just pages that will be free for a short period of time.
>
> To that end what I think w may want to do is instead just walk the LRU
> list for a given zone/order in reverse order so that we can try to
> identify the pages that are most likely to be cold and unused and
> those are the first ones we want to be hinting on rather than the ones
> that were just freed. If we can look at doing something like adding a
> jiffies value to the page indicating when it was last freed we could
> even have a good point for determining when we should stop processing
> pages in a given zone/order list.
>
> In reality the approach wouldn't be too different from what you are
> doing now, the only real difference would be that we would just want
> to walk the LRU list for the given zone/order rather then pulling
> hints on what to free from the calls to free_one_page. In addition we
> would need to add a couple bits to indicate if the page has been
> hinted on, is in the middle of getting hinted on, and something such
> as the jiffies value I mentioned which we could use to determine how
> old the page is.
>
>>>>> Then we can have that one thread just walking
>>>>> through the zone memory pulling out fixed size pieces at a time and
>>>>> providing hints on that. By doing that we avoid the potential of
>>>>> creating a batch of pages that eat up most of the system memory.
>>>>>
>>>>>>         * Pages, captured in the per cpu array are sorted based on the 
>>>>>> zone numbers. This is to avoid redundancy of acquiring zone locks.
>>>>>>         * Dynamically allocated space is used to hold the isolated guest 
>>>>>> free pages.
>>>>> I have concerns that doing this per CPU and allocating memory
>>>>> dynamically can result in you losing a significant amount of memory as
>>>>> it sits waiting to be hinted.
>>>> It should not as the buddy will keep merging the pages and we are only
>>>> capturing MAX_ORDER - 1.
>>>> This was the issue with the last patch-series when I was capturing all
>>>> order pages resulting in the per-cpu array to be filled with lower order
>>>> pages.
>>>>>>         * All the pages are reported asynchronously to the host via 
>>>>>> virtio driver.
>>>>>>         * Pages are returned back to the guest buddy free list only when 
>>>>>> the host response is received.
>>>>> I have been thinking about this. Instead of stealing the page couldn't
>>>>> you simply flag it that there is a hint in progress and simply wait in
>>>>> arch_alloc_page until the hint has been processed?
>>>> With the flag, I am assuming you mean to block the allocation until
>>>> hinting is going on, which is an issue. That was one of the issues
>>>> discussed earlier which I wanted to solve with this implementation.
>>> With the flag we would allow the allocation, but would have to
>>> synchronize with the hinting at that point. I got the idea from the
>>> way the s390 code works. They have both an arch_free_page and an
>>> arch_alloc_page. If I understand correctly the arch_alloc_page is what
>>> is meant to handle the case of a page that has been marked for
>>> hinting, but may not have been hinted on yet. My thought for now is to
>>> keep it simple and use a page flag to indicate that a page is
>>> currently pending a hint.
>> I am assuming this page flag will be located in the page structure.
>>> We should be able to spin in such a case and
>>> it would probably still perform better than a solution where we would
>>> not have the memory available and possibly be under memory pressure.
>> I had this same idea earlier. However, the thing about which I was not
>> sure is if adding a flag in the page structure will be acceptable upstream.
>>>>> The problem is in
>>>>> stealing pages you are going to introduce false OOM issues when the
>>>>> memory isn't available because it is being hinted on.
>>>> I think this situation will arise when the guest is under memory
>>>> pressure. In such situations any attempt to perform isolation will
>>>> anyways fail and we may not be reporting anything at that time.
>>> What I want to avoid is the scenario where an application grabs a
>>> large amount of memory, then frees said memory, and we are sitting on
>>> it for some time because we decide to try and hint on the large chunk.
>> I agree.
>>> By processing this sometime after the pages are sent to the buddy
>>> allocator in a separate thread, and by processing a small fixed window
>>> of memory at a time we can avoid making freeing memory expensive, and
>>> still provide the hints in a reasonable time frame.
>> My impression is that the current window on which I am working may give
>> issues for smaller size guests. But otherwise, we are already working
>> with a smaller fixed window of memory.
>>
>> I can further restrict this to just 128 entries and test which would
>> bring down the window of memory. Let me know what you think.
> The problem is 128 entries is still pretty big when you consider you
> are working with 4M pages. If I am not mistaken that is a half
> gigabyte of memory. For lower order pages 128 would probably be fine,
> but with the higher order pages we may want to contain things to
> something smaller like 16MB to 64MB worth of memory.
This is something with which we can certainly play around or may even
make configurable.
For now, I think I will continue testing with 128.
>
>>>>>> Pending items:
>>>>>>         * Make sure that the guest free page hinting's current 
>>>>>> implementation doesn't break hugepages or device assigned guests.
>>>>>>         * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side 
>>>>>> support. (It is currently missing)
>>>>>>         * Compare reporting free pages via vring with vhost.
>>>>>>         * Decide between MADV_DONTNEED and MADV_FREE.
>>>>>>         * Analyze overall performance impact due to guest free page 
>>>>>> hinting.
>>>>>>         * Come up with proper/traceable error-message/logs.
>>>>> I'll try applying these patches and see if I can reproduce the results
>>>>> you reported.
>>>> Thanks. Let me know if you run into any issues.
>>>>> With the last patch set I couldn't reproduce the results
>>>>> as you reported them.
>>>> If I remember correctly then the last time you only tried with multiple
>>>> vcpus and not with 1 vcpu.
>>> I had tried 1 vcpu, however I ended up running into some other issues
>>> that made it difficult to even boot the system last week.
>>>
>>>>> It has me wondering if you were somehow seeing
>>>>> the effects of a balloon instead of the actual memory hints as I
>>>>> couldn't find any evidence of the memory ever actually being freed
>>>>> back by the hints functionality.
>>>> Can you please elaborate what kind of evidence you are looking for?
>>>>
>>>> I did trace the hints on the QEMU/host side.
>>> It looks like the new patches are working as I am seeing the memory
>>> freeing occurring this time around. Although it looks like this is
>>> still generating traces from free_pcpages_bulk if I enable multiple
>>> VCPUs:
>> I am assuming with the changes you suggested you were able to run this
>> patch-series. Is that correct?
> Yes, I got it working by disabling SMP. I think I found and pointed
> out the issue in your other patch where you were using __free_one_page
> without holding the zone lock.
Yeah. Thanks.
>
>>> [  175.823539] list_add corruption. next->prev should be prev
>>> (ffff947c7ffd61e0), but was ffffc7a29f9e0008. (next=ffffc7a29f4c0008).
>>> [  175.825978] ------------[ cut here ]------------
>>> [  175.826889] kernel BUG at lib/list_debug.c:25!
>>> [  175.827766] invalid opcode: 0000 [#1] SMP PTI
>>> [  175.828621] CPU: 5 PID: 1344 Comm: page_fault1_thr Not tainted
>>> 5.0.0-next-20190306-baseline+ #76
>>> [  175.830312] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>> BIOS Bochs 01/01/2011
>>> [  175.831885] RIP: 0010:__list_add_valid+0x35/0x70
>>> [  175.832784] Code: 18 48 8b 32 48 39 f0 75 39 48 39 c7 74 1e 48 39
>>> fa 74 19 b8 01 00 00 00 c3 48 89 c1 48 c7 c7 80 b5 0f a9 31 c0 e8 8f
>>> aa c8 ff <0f> 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 30 b6 0f a9 e8 79 aa
>>> c8 ff
>>> [  175.836379] RSP: 0018:ffffa717c40839b0 EFLAGS: 00010046
>>> [  175.837394] RAX: 0000000000000075 RBX: ffff947c7ffd61e0 RCX: 
>>> 0000000000000000
>>> [  175.838779] RDX: 0000000000000000 RSI: ffff947c5f957188 RDI: 
>>> ffff947c5f957188
>>> [  175.840162] RBP: ffff947c7ffd61d0 R08: 000000000000026f R09: 
>>> 0000000000000005
>>> [  175.841539] R10: 0000000000000000 R11: ffffa717c4083730 R12: 
>>> ffffc7a29f260008
>>> [  175.842932] R13: ffff947c7ffd5d00 R14: ffffc7a29f4c0008 R15: 
>>> ffffc7a29f260000
>>> [  175.844319] FS:  0000000000000000(0000) GS:ffff947c5f940000(0000)
>>> knlGS:0000000000000000
>>> [  175.845896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  175.847009] CR2: 00007fffe3421000 CR3: 000000051220e006 CR4: 
>>> 0000000000160ee0
>>> [  175.848390] Call Trace:
>>> [  175.848896]  free_pcppages_bulk+0x4bc/0x6a0
>>> [  175.849723]  free_unref_page_list+0x10d/0x190
>>> [  175.850567]  release_pages+0x103/0x4a0
>>> [  175.851313]  tlb_flush_mmu_free+0x36/0x50
>>> [  175.852105]  unmap_page_range+0x963/0xd50
>>> [  175.852897]  unmap_vmas+0x62/0xc0
>>> [  175.853549]  exit_mmap+0xb5/0x1a0
>>> [  175.854205]  mmput+0x5b/0x120
>>> [  175.854794]  do_exit+0x273/0xc30
>>> [  175.855426]  ? free_unref_page_commit+0x85/0xf0
>>> [  175.856312]  do_group_exit+0x39/0xa0
>>> [  175.857018]  get_signal+0x172/0x7c0
>>> [  175.857703]  do_signal+0x36/0x620
>>> [  175.858355]  ? percpu_counter_add_batch+0x4b/0x60
>>> [  175.859280]  ? __do_munmap+0x288/0x390
>>> [  175.860020]  exit_to_usermode_loop+0x4c/0xa8
>>> [  175.860859]  do_syscall_64+0x152/0x170
>>> [  175.861595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  175.862586] RIP: 0033:0x7ffff76a8ec7
>>> [  175.863292] Code: Bad RIP value.
>>> [  175.863928] RSP: 002b:00007ffff4422eb8 EFLAGS: 00000212 ORIG_RAX:
>>> 000000000000000b
>>> [  175.865396] RAX: 0000000000000000 RBX: 00007ffff7ff7280 RCX: 
>>> 00007ffff76a8ec7
>>> [  175.866799] RDX: 00007fffe3422000 RSI: 0000000008000000 RDI: 
>>> 00007fffdb422000
>>> [  175.868194] RBP: 0000000000001000 R08: ffffffffffffffff R09: 
>>> 0000000000000000
>>> [  175.869582] R10: 0000000000000022 R11: 0000000000000212 R12: 
>>> 00007ffff4422fc0
>>> [  175.870984] R13: 0000000000000001 R14: 00007fffffffc1b0 R15: 
>>> 00007ffff44239c0
>>> [  175.872350] Modules linked in: ip6t_rpfilter ip6t_REJECT
>>> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
>>> ebtable_broute bridge stp llc ip6table_nat ip6table_mangle
>>> ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack
>>> nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
>>> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
>>> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
>>> kvm_intel kvm ppdev irqbypass parport_pc parport virtio_balloon pcspkr
>>> i2c_piix4 joydev xfs libcrc32c cirrus drm_kms_helper ttm drm e1000
>>> crc32c_intel virtio_blk serio_raw ata_generic floppy pata_acpi
>>> qemu_fw_cfg
>>> [  175.883153] ---[ end trace 5b67f12a67d1f373 ]---
>>>
>>> I should be able to rebuild the kernels/qemu and test this patch set
>>> over the next day or two.
>> Thanks.
>>> Thanks.
>>>
>>> - Alex
>> --
>> Regards
>> Nitesh
>>
-- 
Regards
Nitesh

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to