On 2019/5/10 23:16, Markus Armbruster wrote:
> Xiang Zheng <zhengxia...@huawei.com> writes:
>
>> On 2019/5/9 19:59, Markus Armbruster wrote:
>>> Xiang Zheng <zhengxia...@huawei.com> writes:
>>>
>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
>>>>> Laszlo Ersek <ler...@redhat.com> writes:
>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>>>>> still reads the latter.
>>>>>>>
>>>>>>> Xiang Zheng <zhengxia...@huawei.com> writes:
>>>>>>>
>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>>>>> using persistent UEFI variables on the virt board.  Actually we
>>>>>>>> only use a very small (non-zero) part of the memory, while the
>>>>>>>> rest, a significantly larger (zero) part, is wasted.
>>>>>>>
>>>>>>> Neglects to mention that the "virt board" is ARM.
>>>>>>>
>>>>>>>> So this patch checks the block status and only writes the non-zero
>>>>>>>> part into memory.  This requires pflash devices to use sparse
>>>>>>>> files for backends.
>>>>>>>
>>>>>>> I started to draft an improved commit message, but then I realized
>>>>>>> this patch can't work.
>>>>>>>
>>>>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>>>>
>>>>>>>     memory_region_init_rom_device(
>>>>>>>         &pfl->mem, OBJECT(dev),
>>>>>>>         &pflash_cfi01_ops,
>>>>>>>         pfl,
>>>>>>>         pfl->name, total_len, &local_err);
>>>>>>>
>>>>>>> pflash_cfi02 is similar.
>>>>>>>
>>>>>>> memory_region_init_rom_device() calls
>>>>>>> memory_region_init_rom_device_nomigrate(), which calls
>>>>>>> qemu_ram_alloc(), which calls qemu_ram_alloc_internal(), which
>>>>>>> calls g_malloc0().  Thus, all the device memory gets written to
>>>>>>> even with this patch.
>>>>>>
>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only
>>>>>> to allocate the new RAMBlock object called "new_block".  The actual
>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also
>>>>>> called by qemu_ram_alloc_internal().
>>>>>
>>>>> You're right.  I should've read more attentively.
>>>>>
>>>>>> One frame outwards on the stack, qemu_ram_alloc() passes NULL to
>>>>>> qemu_ram_alloc_internal() for the 4th ("host") parameter.  Therefore,
>>>>>> in qemu_ram_alloc_internal(), we set "new_block->host" to NULL as
>>>>>> well.
>>>>>>
>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and
>>>>>> call phys_mem_alloc().
>>>>>>
>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>>>>> phys_mem_set_alloc().  The phys_mem_set_alloc() function is called
>>>>>> from "target/s390x/kvm.c" (setting the function pointer to
>>>>>> legacy_s390_alloc()), so it doesn't apply in this case.  Therefore we
>>>>>> end up calling the default qemu_anon_ram_alloc() function, through
>>>>>> the funcptr.  (I think, anyway.)
>>>>>>
>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>>>>> qemu_ram_mmap().  (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>>>>> passes (-1) for "fd".)
>>>>>>
>>>>>> I may have missed something, of course -- I obviously didn't test it,
>>>>>> just speculated from the source.
>>>>>
>>>>> Thanks for your sleuthing!
>>>>>
>>>>>>> I'm afraid you neglected to test.
>>>>>
>>>>> Accusation actually unsupported.  I apologize, and replace it by a
>>>>> question: have you observed the improvement you're trying to achieve,
>>>>> and if yes, how?
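(To make the approach concrete: below is a simplified, untested sketch of
the idea -- query the block status of the backing image and only read the
ranges that are not reported as zero, so the freshly mapped anonymous
memory behind the region is never touched for the holes.  This is not the
code in the patch; the helper names blk_bs(), bdrv_block_status() and
blk_pread() are my assumption of the block-layer interfaces one would use.)

    /*
     * Sketch only: fill the pflash memory region from its backing image,
     * skipping ranges the block layer reports as zero.
     * Assumed helpers: blk_bs(), bdrv_block_status(), blk_pread(),
     * BDRV_BLOCK_ZERO ("sysemu/block-backend.h", "block/block.h").
     */
    static int pflash_read_nonzero(BlockBackend *blk, uint8_t *dest,
                                   int64_t total_len)
    {
        BlockDriverState *bs = blk_bs(blk);
        int64_t offset = 0;

        while (offset < total_len) {
            int64_t pnum;
            int ret = bdrv_block_status(bs, offset, total_len - offset,
                                        &pnum, NULL, NULL);
            if (ret < 0) {
                return ret;
            }
            if (!(ret & BDRV_BLOCK_ZERO)) {
                /* Non-zero data: read it into the memory region. */
                ret = blk_pread(blk, offset, dest + offset, pnum);
                if (ret < 0) {
                    return ret;
                }
            }
            /* Zero/hole range: skip it and leave the pages untouched. */
            offset += pnum;
        }
        return 0;
    }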
>>>>>
>>>>
>>>> Yes, we need to create sparse files as the backing images for the
>>>> pflash device.  The sparse files are created like this:
>>>>
>>>>     dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>>>     dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>>
>>> This creates a copy of the firmware binary QEMU_EFI.fd padded with a
>>> hole to 64MiB.
>>>
>>>>     dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>>
>>> This creates the varstore as a 64MiB hole.  As far as I know (very
>>> little), you should use the varstore template that comes with the
>>> firmware binary.
>>>
>>> I use
>>>
>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
>>>
>>> These guys are already zero-padded, and I use cp to sparsify.
>>>
>>>> Start a VM with the command line below:
>>>>
>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>>>>
>>>> Then observe the memory usage of the qemu process (THP is on).
>>>>
>>>> 1) Without this patch:
>>>>
>>>>     # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>>     AnonHugePages:    706560 kB
>>>>     AnonHugePages:      2048 kB
>>>>     AnonHugePages:     65536 kB   // pflash memory device
>>>>     AnonHugePages:     65536 kB   // pflash memory device
>>>>     AnonHugePages:      2048 kB
>>>>
>>>>     # ps aux | grep qemu-system-aarch64
>>>>     RSS: 879684
>>>>
>>>> 2) After applying this patch:
>>>>
>>>>     # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
>>>>     AnonHugePages:    700416 kB
>>>>     AnonHugePages:      2048 kB
>>>>     AnonHugePages:      2048 kB   // pflash memory device
>>>>     AnonHugePages:      2048 kB   // pflash memory device
>>>>     AnonHugePages:      2048 kB
>>>>
>>>>     # ps aux | grep qemu-system-aarch64
>>>>     RSS: 744380
>>>
>>> Okay, this demonstrates the patch succeeds at mapping parts of the
>>> pflash memory as holes.
>>>
>>> Do the guests in these QEMU processes run?
>>
>> Yes.
>
> Good to know, thanks.
>
>>>> Obviously, there is at least 100MiB of memory saved for each guest.
>>>
>>> For a definition of "memory".
>>>
>>> Next question: what impact on system performance do you observe?
>>>
>>> Let me explain.
>>>
>>> Virtual memory holes get filled in by demand paging on access.  In
>>> other words, they remain holes only as long as nothing accesses the
>>> memory.
>>>
>>> Without your patch, we allocate pages at image read time and fill them
>>> with zeroes.  If we don't access them again, the kernel will eventually
>>> page them out (assuming you're running with swap).  So the steady state
>>> is "we waste some swap space", not "we waste some physical RAM".
>>>
>>
>> Not everybody wants to run with swap, because it may cause low
>> performance.
>
> Someone running without swap because he heard someone say someone said
> swap may be slow is probably throwing away performance.
>
> But I assume you mean people running without swap because they measured
> their workload and found it more performant without swap.  Legitimate.
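As a side note on the demand-paging behaviour described above: it is easy
to see with a small standalone program (plain C, nothing to do with the
patch itself; it maps 64MiB anonymously and prints the resident set from
/proc/self/statm before and after the pages are written, assuming 4KiB
pages for the printout):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Print the current resident set size, read from /proc/self/statm. */
    static void print_rss(const char *when)
    {
        long resident;
        FILE *f = fopen("/proc/self/statm", "r");

        if (f && fscanf(f, "%*ld %ld", &resident) == 1) {
            printf("%-24s RSS ~ %ld KiB\n", when, resident * 4);
        }
        if (f) {
            fclose(f);
        }
    }

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;    /* same size as one pflash image */
        unsigned char *p;

        print_rss("after start:");
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            return 1;
        }
        print_rss("mapped, untouched:");  /* still cheap: all holes */
        memset(p, 0, len);                /* like zero-filling at image read */
        print_rss("after writing zeroes:");
        return 0;
    }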
Yes, and I have suffered from high I/O waits with swap before. :)

>
>>> Your patch lets us map pflash memory pages containing only zeros as
>>> holes.
>>>
>>> For pages that never get accessed, your patch avoids page allocation,
>>> filling with zeroes, writing to swap (all one-time costs), and saves
>>> some swap space (not commonly an issue).
>>>
>>> For pflash memory that gets accessed, your patch merely delays page
>>> allocation from image read time to first access.
>>>
>>> I wonder how these savings and delays affect actual system performance.
>>> Without an observable change in system performance, all we'd accomplish
>>> is changing a bunch of numbers in /proc/$pid/.
>>>
>>> What improvement(s) can you observe?
>>
>> We only use the pflash device for UEFI, and we hardly care about its
>> performance.  I think the performance bottleneck is the MMIO emulation,
>> even though this patch delays page allocation to the first access.
>
> I wasn't inquiring about the performance of the pflash device.  I was
> inquiring about *system* performance.  But let me rephrase my question.
>
> Doing work to save resources is only worthwhile if something valuable
> gets better in a measurable way.  I'm asking you
>
> (1) to explain what exactly you value, and
>
> (2) to provide measurements that show improvement.
>

What we value is exactly the cost of memory resources, and that is the
only thing this patch aims to reduce.  I am confused about why you think
it would impact system performance.  Did I neglect something?

>>> I guess the best case for your patch is many guests with relatively
>>> small RAM sizes.
>
> .
>

-- 
Thanks,
Xiang