On 3/6/19 5:05 PM, Alexander Duyck wrote:
> On Wed, Mar 6, 2019 at 11:07 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>
>> On 3/6/19 1:00 PM, Alexander Duyck wrote:
>>> On Wed, Mar 6, 2019 at 7:51 AM Nitesh Narayan Lal <nit...@redhat.com> wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables a guest with no page
>>>> cache to rapidly free memory to the host and reclaim it back when needed.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test case, executed on a single system and
>>>> a single NUMA node with 15GB memory, we were able to successfully launch
>>>> 5 guests (each with 5 GB memory) when page hinting was enabled, and only
>>>> 3 without it. (A detailed explanation of the test procedure is provided
>>>> at the bottom under Test - 1.)
>>>>
>>>> Changelog in v9:
>>>> * The guest free page hinting hook is now invoked after a page has
>>>>   been merged in the buddy.
>>>> * Only free pages with order FREE_PAGE_HINTING_MIN_ORDER (currently
>>>>   defined as MAX_ORDER - 1) are captured.
>>>> * Removed the kthread which was earlier used to perform the scanning,
>>>>   isolation & reporting of free pages.
>>> Without a kthread this has the potential to get really ugly really
>>> fast. If we are going to run asynchronously we should probably be
>>> truly asynchronous and just place a few pieces of data in the page that
>>> a worker thread can use to identify which pages have been hinted and
>>> which pages have not.
>> Can you please explain what you mean by truly asynchronous?
>>
>> With this implementation also I am not reporting the pages synchronously.
> The problem is you are making it pseudo-synchronous by having to push
> pages off to a side buffer, aren't you? In my mind we should be able to
> have the page hinting go on with little to no interference with
> existing page allocation and freeing.

We have to opt for one of two options:

1. Block allocation by using a flag or acquiring a lock to prevent the use
   of pages we are hinting.
2. Remove the page set entirely from the buddy. (This is what I am doing
   right now.)
The reason I would prefer the second approach is that we are not blocking
allocation in any way, and since we are only working with a small set of
pages we should be fine. However, with the current approach, as we are
reporting asynchronously, there is a chance that we end up hinting more
than 2-3 times for a single workload run. In a situation where this could
lead to a low-memory condition in the guest, the hinting will fail anyway,
as the guest will not allow page isolation. I can try to test this to
ensure that we don't hit an OOM due to hinting when the guest is under
memory pressure.

>>> Then we can have that one thread just walking
>>> through the zone memory pulling out fixed size pieces at a time and
>>> providing hints on that. By doing that we avoid the potential of
>>> creating a batch of pages that eat up most of the system memory.
>>>
>>>> * Pages captured in the per-CPU array are sorted based on their
>>>>   zone numbers. This avoids the redundancy of acquiring zone locks.
>>>> * Dynamically allocated space is used to hold the isolated guest
>>>>   free pages.
>>> I have concerns that doing this per CPU and allocating memory
>>> dynamically can result in you losing a significant amount of memory as
>>> it sits waiting to be hinted.
>> It should not, as the buddy will keep merging the pages and we are only
>> capturing pages of order MAX_ORDER - 1.
>> This was the issue with the last patch-series, when I was capturing pages
>> of all orders, resulting in the per-CPU array being filled with
>> lower-order pages.
>>>> * All the pages are reported asynchronously to the host via the
>>>>   virtio driver.
>>>> * Pages are returned to the guest buddy free list only when the
>>>>   host response is received.
>>> I have been thinking about this. Instead of stealing the page couldn't
>>> you simply flag it that there is a hint in progress and simply wait in
>>> arch_alloc_page until the hint has been processed?
>> With the flag, I am assuming you mean to block the allocation while
>> hinting is going on, which is an issue. That was one of the issues
>> discussed earlier which I wanted to solve with this implementation.
> With the flag we would allow the allocation, but would have to
> synchronize with the hinting at that point. I got the idea from the
> way the s390 code works. They have both an arch_free_page and an
> arch_alloc_page. If I understand correctly, the arch_alloc_page is what
> is meant to handle the case of a page that has been marked for
> hinting, but may not have been hinted on yet. My thought for now is to
> keep it simple and use a page flag to indicate that a page is
> currently pending a hint.

I am assuming this page flag will be located in the page structure.

> We should be able to spin in such a case and
> it would probably still perform better than a solution where we would
> not have the memory available and possibly be under memory pressure.

I had this same idea earlier. However, I was not sure whether adding a flag
to the page structure would be acceptable upstream.

>>> The problem is in
>>> stealing pages you are going to introduce false OOM issues when the
>>> memory isn't available because it is being hinted on.
>> I think this situation will arise when the guest is under memory
>> pressure. In such situations any attempt to perform isolation will
>> fail anyway, and we may not be reporting anything at that time.
> What I want to avoid is the scenario where an application grabs a
> large amount of memory, then frees said memory, and we are sitting on
> it for some time because we decide to try and hint on the large chunk.

I agree.

> By processing this sometime after the pages are sent to the buddy
> allocator in a separate thread, and by processing a small fixed window
> of memory at a time, we can avoid making freeing memory expensive, and
> still provide the hints in a reasonable time frame.
My impression is that the current window I am working with may cause issues
for smaller guests; otherwise, we are already working with a small fixed
window of memory. I can further restrict this to just 128 entries and test,
which would bring down the window of memory. Let me know what you think.

>>>> Pending items:
>>>> * Make sure that the guest free page hinting's current
>>>>   implementation doesn't break hugepages or device-assigned guests.
>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>>>>   (It is currently missing.)
>>>> * Compare reporting free pages via vring with vhost.
>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>> * Analyze the overall performance impact of guest free page
>>>>   hinting.
>>>> * Come up with proper/traceable error messages/logs.
>>> I'll try applying these patches and see if I can reproduce the results
>>> you reported.
>> Thanks. Let me know if you run into any issues.
>>> With the last patch set I couldn't reproduce the results
>>> as you reported them.
>> If I remember correctly, the last time you only tried with multiple
>> vcpus and not with 1 vcpu.
> I had tried 1 vcpu, however I ended up running into some other issues
> that made it difficult to even boot the system last week.
>
>>> It has me wondering if you were somehow seeing
>>> the effects of a balloon instead of the actual memory hints, as I
>>> couldn't find any evidence of the memory ever actually being freed
>>> back by the hints functionality.
>> Can you please elaborate on what kind of evidence you are looking for?
>>
>> I did trace the hints on the QEMU/host side.
> It looks like the new patches are working, as I am seeing the memory
> freeing occurring this time around. Although it looks like this is
> still generating traces from free_pcppages_bulk if I enable multiple
> VCPUs:

I am assuming that with the changes you suggested you were able to run this
patch-series. Is that correct?

> [  175.823539] list_add corruption.
> next->prev should be prev (ffff947c7ffd61e0), but was ffffc7a29f9e0008. (next=ffffc7a29f4c0008).
> [  175.825978] ------------[ cut here ]------------
> [  175.826889] kernel BUG at lib/list_debug.c:25!
> [  175.827766] invalid opcode: 0000 [#1] SMP PTI
> [  175.828621] CPU: 5 PID: 1344 Comm: page_fault1_thr Not tainted 5.0.0-next-20190306-baseline+ #76
> [  175.830312] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [  175.831885] RIP: 0010:__list_add_valid+0x35/0x70
> [  175.832784] Code: 18 48 8b 32 48 39 f0 75 39 48 39 c7 74 1e 48 39 fa 74 19 b8 01 00 00 00 c3 48 89 c1 48 c7 c7 80 b5 0f a9 31 c0 e8 8f aa c8 ff <0f> 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 30 b6 0f a9 e8 79 aa c8 ff
> [  175.836379] RSP: 0018:ffffa717c40839b0 EFLAGS: 00010046
> [  175.837394] RAX: 0000000000000075 RBX: ffff947c7ffd61e0 RCX: 0000000000000000
> [  175.838779] RDX: 0000000000000000 RSI: ffff947c5f957188 RDI: ffff947c5f957188
> [  175.840162] RBP: ffff947c7ffd61d0 R08: 000000000000026f R09: 0000000000000005
> [  175.841539] R10: 0000000000000000 R11: ffffa717c4083730 R12: ffffc7a29f260008
> [  175.842932] R13: ffff947c7ffd5d00 R14: ffffc7a29f4c0008 R15: ffffc7a29f260000
> [  175.844319] FS:  0000000000000000(0000) GS:ffff947c5f940000(0000) knlGS:0000000000000000
> [  175.845896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  175.847009] CR2: 00007fffe3421000 CR3: 000000051220e006 CR4: 0000000000160ee0
> [  175.848390] Call Trace:
> [  175.848896]  free_pcppages_bulk+0x4bc/0x6a0
> [  175.849723]  free_unref_page_list+0x10d/0x190
> [  175.850567]  release_pages+0x103/0x4a0
> [  175.851313]  tlb_flush_mmu_free+0x36/0x50
> [  175.852105]  unmap_page_range+0x963/0xd50
> [  175.852897]  unmap_vmas+0x62/0xc0
> [  175.853549]  exit_mmap+0xb5/0x1a0
> [  175.854205]  mmput+0x5b/0x120
> [  175.854794]  do_exit+0x273/0xc30
> [  175.855426]  ? free_unref_page_commit+0x85/0xf0
> [  175.856312]  do_group_exit+0x39/0xa0
> [  175.857018]  get_signal+0x172/0x7c0
> [  175.857703]  do_signal+0x36/0x620
> [  175.858355]  ? percpu_counter_add_batch+0x4b/0x60
> [  175.859280]  ? __do_munmap+0x288/0x390
> [  175.860020]  exit_to_usermode_loop+0x4c/0xa8
> [  175.860859]  do_syscall_64+0x152/0x170
> [  175.861595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  175.862586] RIP: 0033:0x7ffff76a8ec7
> [  175.863292] Code: Bad RIP value.
> [  175.863928] RSP: 002b:00007ffff4422eb8 EFLAGS: 00000212 ORIG_RAX: 000000000000000b
> [  175.865396] RAX: 0000000000000000 RBX: 00007ffff7ff7280 RCX: 00007ffff76a8ec7
> [  175.866799] RDX: 00007fffe3422000 RSI: 0000000008000000 RDI: 00007fffdb422000
> [  175.868194] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  175.869582] R10: 0000000000000022 R11: 0000000000000212 R12: 00007ffff4422fc0
> [  175.870984] R13: 0000000000000001 R14: 00007fffffffc1b0 R15: 00007ffff44239c0
> [  175.872350] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel kvm_intel kvm ppdev irqbypass parport_pc parport virtio_balloon pcspkr i2c_piix4 joydev xfs libcrc32c cirrus drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk serio_raw ata_generic floppy pata_acpi qemu_fw_cfg
> [  175.883153] ---[ end trace 5b67f12a67d1f373 ]---
>
> I should be able to rebuild the kernels/qemu and test this patch set
> over the next day or two.

Thanks.

> Thanks.
>
> - Alex

--
Regards
Nitesh