On 3/6/19 5:05 PM, Alexander Duyck wrote:
> On Wed, Mar 6, 2019 at 11:07 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>
>> On 3/6/19 1:00 PM, Alexander Duyck wrote:
>>> On Wed, Mar 6, 2019 at 7:51 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables guests with no page
>>>> cache to rapidly free memory to, and reclaim memory from, the host.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test-case, executed on a single system and 
>>>> single NUMA node with 15GB memory, we were able to successfully launch 5 
>>>> guests (each with 5 GB memory) when page hinting was enabled and 3 without
>>>> it. (Detailed explanation of the test procedure is provided at the bottom 
>>>> under Test - 1).
>>>>
>>>> Changelog in v9:
>>>>         * The guest free page hinting hook is now invoked after a page
>>>> has been merged in the buddy.
>>>>         * Only free pages of order FREE_PAGE_HINTING_MIN_ORDER (currently
>>>> defined as MAX_ORDER - 1) are captured.
>>>>         * Removed the kthread that was earlier used to perform the
>>>> scanning, isolation & reporting of free pages.
>>> Without a kthread this has the potential to get really ugly really
>>> fast. If we are going to run asynchronously we should probably be
>>> truly asynchronous and just place a few pieces of data in the page that
>>> a worker thread can use to identify which pages have been hinted and
>>> which pages have not.
>> Can you please explain what you mean by truly asynchronous?
>>
>> With this implementation also I am not reporting the pages synchronously.
> The problem is you are making it pseudo-synchronous by having to push
> pages off to a side buffer, aren't you? In my mind we should be able to
> have the page hinting go on with little to no interference with
> existing page allocation and freeing.
We have to opt for one of two options:
1. Block allocation by using a flag or acquire a lock to prevent the
usage of pages we are hinting.
2. Remove the page set entirely from the buddy. (This is what I am doing
right now)

The reason I would prefer the second approach is that it does not block
allocation in any way, and since we are only working with a smaller set
of pages we should be fine.
However, with the current approach, since we report asynchronously,
there is a chance that we end up hinting more than 2-3 times for a
single workload run. In situations where this could lead to a low-memory
condition in the guest, the hinting will fail anyway, as the guest will
not allow page isolation.
I can try testing this to ensure that we don't hit an OOM due to
hinting while the guest is under memory pressure.
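To make option 2 concrete, here is a minimal sketch of the reporting
step (hint_captured_pages() and report_to_host() are illustrative names,
not the exact patch code; PageBuddy() and __isolate_free_page() are the
existing mm helpers):

#include <linux/mm.h>
#include <linux/list.h>
#include "internal.h"	/* for __isolate_free_page() */

static void hint_captured_pages(struct zone *zone,
				struct page **pages, unsigned int count)
{
	LIST_HEAD(isolated);
	unsigned int i;

	spin_lock_irq(&zone->lock);
	for (i = 0; i < count; i++) {
		struct page *page = pages[i];

		/* skip pages that were re-allocated since being captured */
		if (!PageBuddy(page))
			continue;
		/* returns 0 if the isolation fails (e.g. watermarks) */
		if (__isolate_free_page(page, FREE_PAGE_HINTING_MIN_ORDER))
			list_add(&page->lru, &isolated);
	}
	spin_unlock_irq(&zone->lock);

	/* asynchronous virtio report; the pages are returned to the
	 * buddy free list only from the host-response callback */
	report_to_host(&isolated);
}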


>
>>> Then we can have that one thread just walking
>>> through the zone memory pulling out fixed size pieces at a time and
>>> providing hints on that. By doing that we avoid the potential of
>>> creating a batch of pages that eat up most of the system memory.
>>>
>>>>         * Pages captured in the per-cpu array are sorted based on their
>>>> zone numbers, to avoid acquiring zone locks redundantly.
>>>>         * Dynamically allocated space is used to hold the isolated guest 
>>>> free pages.
>>> I have concerns that doing this per CPU and allocating memory
>>> dynamically can result in you losing a significant amount of memory as
>>> it sits waiting to be hinted.
>> It should not, as the buddy will keep merging the pages and we are only
>> capturing MAX_ORDER - 1 pages.
>> This was the issue with the last patch-series, when I was capturing
>> pages of all orders, resulting in the per-cpu array being filled with
>> lower-order pages.
>>>>         * All the pages are reported asynchronously to the host via virtio 
>>>> driver.
>>>>         * Pages are returned to the guest buddy free list only when
>>>> the host response is received.
>>> I have been thinking about this. Instead of stealing the page, couldn't
>>> you simply flag it to indicate that a hint is in progress and wait in
>>> arch_alloc_page until the hint has been processed?
>> With the flag, I am assuming you mean to block the allocation while
>> hinting is in progress, which is an issue. That was one of the issues
>> discussed earlier which I wanted to solve with this implementation.
> With the flag we would allow the allocation, but would have to
> synchronize with the hinting at that point. I got the idea from the
> way the s390 code works. They have both an arch_free_page and an
> arch_alloc_page. If I understand correctly the arch_alloc_page is what
> is meant to handle the case of a page that has been marked for
> hinting, but may not have been hinted on yet. My thought for now is to
> keep it simple and use a page flag to indicate that a page is
> currently pending a hint. 
I am assuming this page flag will be located in the page structure.
> We should be able to spin in such a case and
> it would probably still perform better than a solution where we would
> not have the memory available and possibly be under memory pressure.
I had this same idea earlier. However, I was not sure whether adding a
flag to the page structure would be acceptable upstream.
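If we go this route, I imagine something along these lines (a rough
sketch only; PG_hinting and its accessors are assumed here, they do not
exist today):

/* Assumed new page flag PG_hinting, set while a hint is in flight; the
 * PageHinting/SetPageHinting/ClearPageHinting accessors would be
 * generated from enum pageflags in the usual way. */

static inline void arch_free_page(struct page *page, int order)
{
	/* mark high-order frees so the hinting thread can pick them up */
	if (order >= FREE_PAGE_HINTING_MIN_ORDER)
		SetPageHinting(page);
}

static inline void arch_alloc_page(struct page *page, int order)
{
	/* allocation is not blocked; we only wait out an in-flight hint,
	 * which the hinting thread ends with ClearPageHinting() once the
	 * host has responded */
	while (PageHinting(page))
		cpu_relax();
}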
>
>>> The problem is that in
>>> stealing pages you are going to introduce false OOM issues when the
>>> memory isn't available because it is being hinted on.
>> I think this situation will arise when the guest is under memory
>> pressure. In such situations any attempt to perform isolation will
>> fail anyway, and we may not be reporting anything at that time.
> What I want to avoid is the scenario where an application grabs a
> large amount of memory, then frees said memory, and we are sitting on
> it for some time because we decide to try and hint on the large chunk.
I agree.
> By processing this sometime after the pages are sent to the buddy
> allocator in a separate thread, and by processing a small fixed window
> of memory at a time we can avoid making freeing memory expensive, and
> still provide the hints in a reasonable time frame.

My impression is that the current window I am working with may cause
issues for smaller guests. Otherwise, we are already working with a
small, fixed window of memory.

I can further restrict this to just 128 entries and test, which would
bring down the window of memory; a rough sketch of that is below. Let
me know what you think.
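Something like this (a sketch only; the struct, HINTING_CAPTURE_MAX and
flush_capture() are illustrative names, not the exact patch code):

#include <linux/mm.h>
#include <linux/percpu.h>

#define HINTING_CAPTURE_MAX	128

struct hinting_capture {
	struct page	*pages[HINTING_CAPTURE_MAX];
	unsigned int	count;
};
static DEFINE_PER_CPU(struct hinting_capture, hinting_capture);

/* Called from the buddy free path, i.e. with the zone lock held and
 * interrupts off, so this_cpu_ptr() is safe to use here. */
static void capture_free_page(struct page *page)
{
	struct hinting_capture *cap = this_cpu_ptr(&hinting_capture);

	cap->pages[cap->count++] = page;
	if (cap->count == HINTING_CAPTURE_MAX)
		flush_capture(cap);	/* isolate + report, then count = 0 */
}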

>
>>>> Pending items:
>>>>         * Make sure that the guest free page hinting's current
>>>> implementation doesn't break hugepages or device-assigned guests.
>>>>         * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side
>>>> support. (It is currently missing.)
>>>>         * Compare reporting free pages via vring with vhost.
>>>>         * Decide between MADV_DONTNEED and MADV_FREE.
>>>>         * Analyze overall performance impact due to guest free page 
>>>> hinting.
>>>>         * Come up with proper/traceable error-message/logs.
>>> I'll try applying these patches and see if I can reproduce the results
>>> you reported.
>> Thanks. Let me know if you run into any issues.
>>> With the last patch set I couldn't reproduce the results
>>> as you reported them.
>> If I remember correctly, last time you only tried with multiple
>> vcpus and not with 1 vcpu.
> I had tried 1 vcpu, however I ended up running into some other issues
> that made it difficult to even boot the system last week.
>
>>> It has me wondering if you were somehow seeing
>>> the effects of a balloon instead of the actual memory hints as I
>>> couldn't find any evidence of the memory ever actually being freed
>>> back by the hints functionality.
>> Can you please elaborate what kind of evidence you are looking for?
>>
>> I did trace the hints on the QEMU/host side.
> It looks like the new patches are working as I am seeing the memory
> freeing occurring this time around. Although it looks like this is
> still generating traces from free_pcppages_bulk if I enable multiple
> VCPUs:
I am assuming that with the changes you suggested you were able to run
this patch-series. Is that correct?
>
> [  175.823539] list_add corruption. next->prev should be prev
> (ffff947c7ffd61e0), but was ffffc7a29f9e0008. (next=ffffc7a29f4c0008).
> [  175.825978] ------------[ cut here ]------------
> [  175.826889] kernel BUG at lib/list_debug.c:25!
> [  175.827766] invalid opcode: 0000 [#1] SMP PTI
> [  175.828621] CPU: 5 PID: 1344 Comm: page_fault1_thr Not tainted
> 5.0.0-next-20190306-baseline+ #76
> [  175.830312] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  175.831885] RIP: 0010:__list_add_valid+0x35/0x70
> [  175.832784] Code: 18 48 8b 32 48 39 f0 75 39 48 39 c7 74 1e 48 39
> fa 74 19 b8 01 00 00 00 c3 48 89 c1 48 c7 c7 80 b5 0f a9 31 c0 e8 8f
> aa c8 ff <0f> 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 30 b6 0f a9 e8 79 aa
> c8 ff
> [  175.836379] RSP: 0018:ffffa717c40839b0 EFLAGS: 00010046
> [  175.837394] RAX: 0000000000000075 RBX: ffff947c7ffd61e0 RCX: 
> 0000000000000000
> [  175.838779] RDX: 0000000000000000 RSI: ffff947c5f957188 RDI: 
> ffff947c5f957188
> [  175.840162] RBP: ffff947c7ffd61d0 R08: 000000000000026f R09: 
> 0000000000000005
> [  175.841539] R10: 0000000000000000 R11: ffffa717c4083730 R12: 
> ffffc7a29f260008
> [  175.842932] R13: ffff947c7ffd5d00 R14: ffffc7a29f4c0008 R15: 
> ffffc7a29f260000
> [  175.844319] FS:  0000000000000000(0000) GS:ffff947c5f940000(0000)
> knlGS:0000000000000000
> [  175.845896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  175.847009] CR2: 00007fffe3421000 CR3: 000000051220e006 CR4: 
> 0000000000160ee0
> [  175.848390] Call Trace:
> [  175.848896]  free_pcppages_bulk+0x4bc/0x6a0
> [  175.849723]  free_unref_page_list+0x10d/0x190
> [  175.850567]  release_pages+0x103/0x4a0
> [  175.851313]  tlb_flush_mmu_free+0x36/0x50
> [  175.852105]  unmap_page_range+0x963/0xd50
> [  175.852897]  unmap_vmas+0x62/0xc0
> [  175.853549]  exit_mmap+0xb5/0x1a0
> [  175.854205]  mmput+0x5b/0x120
> [  175.854794]  do_exit+0x273/0xc30
> [  175.855426]  ? free_unref_page_commit+0x85/0xf0
> [  175.856312]  do_group_exit+0x39/0xa0
> [  175.857018]  get_signal+0x172/0x7c0
> [  175.857703]  do_signal+0x36/0x620
> [  175.858355]  ? percpu_counter_add_batch+0x4b/0x60
> [  175.859280]  ? __do_munmap+0x288/0x390
> [  175.860020]  exit_to_usermode_loop+0x4c/0xa8
> [  175.860859]  do_syscall_64+0x152/0x170
> [  175.861595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  175.862586] RIP: 0033:0x7ffff76a8ec7
> [  175.863292] Code: Bad RIP value.
> [  175.863928] RSP: 002b:00007ffff4422eb8 EFLAGS: 00000212 ORIG_RAX:
> 000000000000000b
> [  175.865396] RAX: 0000000000000000 RBX: 00007ffff7ff7280 RCX: 
> 00007ffff76a8ec7
> [  175.866799] RDX: 00007fffe3422000 RSI: 0000000008000000 RDI: 
> 00007fffdb422000
> [  175.868194] RBP: 0000000000001000 R08: ffffffffffffffff R09: 
> 0000000000000000
> [  175.869582] R10: 0000000000000022 R11: 0000000000000212 R12: 
> 00007ffff4422fc0
> [  175.870984] R13: 0000000000000001 R14: 00007fffffffc1b0 R15: 
> 00007ffff44239c0
> [  175.872350] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat ip6table_mangle
> ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> kvm_intel kvm ppdev irqbypass parport_pc parport virtio_balloon pcspkr
> i2c_piix4 joydev xfs libcrc32c cirrus drm_kms_helper ttm drm e1000
> crc32c_intel virtio_blk serio_raw ata_generic floppy pata_acpi
> qemu_fw_cfg
> [  175.883153] ---[ end trace 5b67f12a67d1f373 ]---
>
> I should be able to rebuild the kernels/qemu and test this patch set
> over the next day or two.
Thanks.
>
> Thanks.
>
> - Alex
-- 
Regards
Nitesh
