On 26.08.24 17:38, David Hildenbrand wrote:
On 20.08.24 10:50, Peter Maydell wrote:
On Mon, 19 Aug 2024 at 20:07, David Hildenbrand <da...@redhat.com> wrote:

On 19.08.24 18:24, Peter Maydell wrote:
Hi; I'm looking at a memory leak apparently in the host memory backend
code that you can see from the qmp-cmd-test. Repro instructions:

Hi Peter,


(1) build QEMU with '--cc=clang' '--cxx=clang++' '--enable-debug'
'--target-list=x86_64-softmmu' '--enable-sanitizers'
(2) run 'make check'. More specifically, to get just this
failure ('make check' on current head-of-tree produces some
other unrelated leak errors) you can run the relevant single test:

(cd build/asan && ASAN_OPTIONS="fast_unwind_on_malloc=0"
QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/qmp-cmd-test
--tap -k -p /x86_64/qmp/object-add-failure-modes)

The test case is doing a variety of object-add then object-del
of the "memory-backend-ram" object, and this add-del cycle seems
to result in a fairly large leak:

Direct leak of 1572864 byte(s) in 6 object(s) allocated from:
       #0 0x555c1336efd8 in __interceptor_calloc
(/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/asan/qemu-system-x86_64+0x218efd8)
(BuildId: fc7566a39db1253aed91d500b5b1784e0c438397)
       #1 0x7f5bf3472c50 in g_malloc0 
debian/build/deb/../../../glib/gmem.c:161:13
       #2 0x555c155bb134 in bitmap_new
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/include/qemu/bitmap.h:102:12
       #3 0x555c155ba4ee in dirty_memory_extend system/physmem.c:1831:37
       #4 0x555c15585a2b in ram_block_add system/physmem.c:1907:9
       #5 0x555c15589e50 in qemu_ram_alloc_internal system/physmem.c:2109:5
       #6 0x555c1558a096 in qemu_ram_alloc system/physmem.c:2129:12
       #7 0x555c15518b69 in memory_region_init_ram_flags_nomigrate
system/memory.c:1571:21
       #8 0x555c1464fd27 in ram_backend_memory_alloc 
backends/hostmem-ram.c:34:12
       #9 0x555c146510ac in host_memory_backend_memory_complete
backends/hostmem.c:345:10
       #10 0x555c1580bc90 in user_creatable_complete 
qom/object_interfaces.c:28:9
       #11 0x555c1580c6f8 in user_creatable_add_type 
qom/object_interfaces.c:125:10
       #12 0x555c1580ccc4 in user_creatable_add_qapi 
qom/object_interfaces.c:157:11
       #13 0x555c15ff0e2c in qmp_object_add qom/qom-qmp-cmds.c:227:5
       #14 0x555c161ce508 in qmp_marshal_object_add
/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/asan/qapi/qapi-commands-qom.c:337:5
       #15 0x555c162a7139 in do_qmp_dispatch_bh qapi/qmp-dispatch.c:128:5
       #16 0x555c16387921 in aio_bh_call util/async.c:171:5
       #17 0x555c163887fc in aio_bh_poll util/async.c:218:13
       #18 0x555c162e1288 in aio_dispatch util/aio-posix.c:423:5
       #19 0x555c1638f7be in aio_ctx_dispatch util/async.c:360:5
       #20 0x7f5bf3469d3a in g_main_dispatch
debian/build/deb/../../../glib/gmain.c:3419:28
       #21 0x7f5bf3469d3a in g_main_context_dispatch
debian/build/deb/../../../glib/gmain.c:4137:7
       #22 0x555c163935c9 in glib_pollfds_poll util/main-loop.c:287:9
       #23 0x555c16391f03 in os_host_main_loop_wait util/main-loop.c:310:5
       #24 0x555c16391acc in main_loop_wait util/main-loop.c:589:11
       #25 0x555c14614917 in qemu_main_loop system/runstate.c:801:9
       #26 0x555c16008b8c in qemu_default_main system/main.c:37:14
       #27 0x555c16008bd7 in main system/main.c:48:12
       #28 0x7f5bf12fbd8f in __libc_start_call_main
csu/../sysdeps/nptl/libc_start_call_main.h:58:16

My initial suspicion here is that the problem is that
TYPE_MEMORY_BACKEND has a UserCreatableClass::complete method which
calls HostMemoryBackend::alloc, but there is no corresponding
"now free this" in instance_finalize. So ram_backend_memory_alloc()
calls memory_region_init_ram_flags_nomigrate(), which allocates
RAM, dirty blocks, etc, but nothing ever destroys the MR and the
memory is leaked when the TYPE_MEMORY_BACKEND object is finalized.

But there isn't a "free" method in HostMemoryBackendClass,
only an "alloc", so this looks like an API with "leaks memory"
baked into it. How is the freeing of the memory on object
deletion intended to work?

I *think* during object_del(), we would be un-refing the contained
memory-region, which in turn will make the refcount go to 0 and end up
calling memory_region_finalize().

Oh, yes, I'd forgotten about the MemoryRegions being refcounted.
That explains why the MR itself doesn't show up as a leak, only
these dirty memory bitmaps.

In memory_region_finalize, we do various things, including calling
mr->destructor(mr).

For memory_region_init_ram_flags_nomigrate(), the destructor is set
to memory_region_destructor_ram(). That is where we call
qemu_ram_free(mr->ram_block);

There we clean up.
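
Roughly, paraphrasing from memory (so the details may not match
system/memory.c exactly):

static void memory_region_destructor_ram(MemoryRegion *mr)
{
    qemu_ram_free(mr->ram_block);
}

static void memory_region_finalize(Object *obj)
{
    MemoryRegion *mr = MEMORY_REGION(obj);

    /* other per-MR cleanup elided */
    mr->destructor(mr);   /* memory_region_destructor_ram() for RAM regions */
}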

What we *don't* clean up is the allocation you are seeing:
dirty_memory_extend() will extend the ram_list.dirty_memory bitmap as
needed. It is not stored in the RAMBlock; it's a global list.
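
For context, the data structure looks roughly like this (simplified
from include/exec/ramlist.h; fields I don't remember off-hand are only
noted in comments):

typedef struct {
    struct rcu_head rcu;
    /* one bitmap chunk per DIRTY_MEMORY_BLOCK_SIZE pages */
    unsigned long *blocks[];
} DirtyMemoryBlocks;

typedef struct RAMList {
    /* mutex, mru_block, version, notifiers, ... omitted */
    QLIST_HEAD(, RAMBlock) blocks;
    DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
} RAMList;

extern RAMList ram_list;   /* global -- shared by all RAMBlocks */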

It's not really a leak I think: when we object_del + object_add *I
think* that bitmap will simply get reused.

I think there probably is a leak here somewhere, though --
lsan will only report if the memory is unreachable from
anywhere on program exit, AIUI. If we still had the global
list available to reuse on the next object-creation
shouldn't it still be reachable from somewhere?

Yes, that's what's confusing me here. It's a global array that holds these
bitmap chunks. I don't see how there would be a leak, but maybe I'm
missing something. Let me have another look.


It's possible the leak only happens in some of the
"check failure cases of object-add" code paths that the
test is exercising, of course.

Right, but I think in any case we would keep the global array
consistent. Let me try to re-understand that code.


Ah, yes there is a memory leak. The issue is that on unplug, when the RAMBlock goes away, we don't free any entries that dirty_memory_extend() allocated.

The next time a RAMBlock is added, we call last_ram_page() again, to see if we have to extend. And we think we have to extend.

As last_ram_page() is derived from the RAMBlocks that currently exist, and the relevant RAMBlock has vanished, we call dirty_memory_extend() again, doing a bitmap_new() at spots where a bitmap_new() had already been done before. The bitmaps that already sat at those spots are never carried over and end up unreferenced, which is the leak lsan reports.
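
The extension logic is roughly the following (heavily simplified, from
memory; the RCU accessors, NULL checks and rounding details are left
out, so don't take it literally):

static void dirty_memory_extend(ram_addr_t old_ram_size,
                                ram_addr_t new_ram_size)
{
    ram_addr_t old_num_blocks = DIV_ROUND_UP(old_ram_size,
                                             DIRTY_MEMORY_BLOCK_SIZE);
    ram_addr_t new_num_blocks = DIV_ROUND_UP(new_ram_size,
                                             DIRTY_MEMORY_BLOCK_SIZE);

    for (int i = 0; i < DIRTY_MEMORY_NUM; i++) {
        DirtyMemoryBlocks *old_blocks = ram_list.dirty_memory[i];
        DirtyMemoryBlocks *new_blocks =
            g_malloc(sizeof(*new_blocks) +
                     sizeof(new_blocks->blocks[0]) * new_num_blocks);

        /* keep the bitmaps we think we already have ... */
        memcpy(new_blocks->blocks, old_blocks->blocks,
               old_num_blocks * sizeof(old_blocks->blocks[0]));
        /* ... and allocate fresh ones for the "new" part; if old_ram_size
         * shrank because a RAMBlock went away, the bitmaps that already
         * existed at these slots in old_blocks are never copied over and
         * become unreachable -> leak */
        for (ram_addr_t j = old_num_blocks; j < new_num_blocks; j++) {
            new_blocks->blocks[j] = bitmap_new(DIRTY_MEMORY_BLOCK_SIZE);
        }

        ram_list.dirty_memory[i] = new_blocks;
        g_free_rcu(old_blocks, rcu);   /* old array goes away, RCU-delayed */
    }
}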

That whole code is quite fragile, I'm afraid. It would probably be best to just store the number of allocated blocks per ram_list.dirty_memory[i] entry, and maybe try getting rid of this "old_ram_size" thingy completely.
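
Something along these lines, maybe (completely untested sketch; the
num_blocks field is made up for illustration):

typedef struct {
    struct rcu_head rcu;
    unsigned int num_blocks;   /* new: how many bitmap slots are allocated */
    unsigned long *blocks[];
} DirtyMemoryBlocks;

/* no old_ram_size parameter anymore -- base the decision on what is
 * actually allocated instead of on last_ram_page() */
static void dirty_memory_extend(ram_addr_t new_ram_size)
{
    ram_addr_t new_num_blocks = DIV_ROUND_UP(new_ram_size,
                                             DIRTY_MEMORY_BLOCK_SIZE);

    for (int i = 0; i < DIRTY_MEMORY_NUM; i++) {
        DirtyMemoryBlocks *old_blocks = ram_list.dirty_memory[i];
        unsigned int old_num_blocks = old_blocks ? old_blocks->num_blocks : 0;
        DirtyMemoryBlocks *new_blocks;

        if (new_num_blocks <= old_num_blocks) {
            continue;   /* bitmaps already cover this range, nothing to do */
        }

        new_blocks = g_malloc(sizeof(*new_blocks) +
                              sizeof(new_blocks->blocks[0]) * new_num_blocks);
        new_blocks->num_blocks = new_num_blocks;
        if (old_num_blocks) {
            memcpy(new_blocks->blocks, old_blocks->blocks,
                   old_num_blocks * sizeof(old_blocks->blocks[0]));
        }
        for (ram_addr_t j = old_num_blocks; j < new_num_blocks; j++) {
            new_blocks->blocks[j] = bitmap_new(DIRTY_MEMORY_BLOCK_SIZE);
        }

        ram_list.dirty_memory[i] = new_blocks;
        if (old_blocks) {
            g_free_rcu(old_blocks, rcu);
        }
    }
}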

Of course, freeing bitmaps where possible might also be an option when removing RAMBlocks ...

--
Cheers,

David / dhildenb

