Markus Armbruster <arm...@redhat.com> writes:

> Thomas Huth <th...@redhat.com> writes:
>
>> On 28/09/15 10:11, Markus Armbruster wrote:
>>> Thomas Huth <th...@redhat.com> writes:
>>>
>>>> On 25/09/15 16:17, Markus Armbruster wrote:
>>>>> Thomas Huth <th...@redhat.com> writes:
>>>>>
>>>>>> On 24/09/15 20:57, Markus Armbruster wrote:
>>>>>>> Several devices don't survive object_unref(object_new(T)): they crash
>>>>>>> or hang during cleanup, or they leave dangling pointers behind.
>>>>>>>
>>>>>>> This breaks at least device-list-properties, because
>>>>>>> qmp_device_list_properties() needs to create a device to find its
>>>>>>> properties. Broken in commit f4eb32b "qmp: show QOM properties in
>>>>>>> device-list-properties", v2.1. Example reproducer:
>>>>>>>
>>>>>>>     $ qemu-system-aarch64 -nodefaults -display none -machine none
>>>>>>>     -S -qmp stdio
>>>>>>>     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 4,
>>>>>>>     "major": 2}, "package": ""}, "capabilities": []}}
>>>>>>>     { "execute": "qmp_capabilities" }
>>>>>>>     {"return": {}}
>>>>>>>     { "execute": "device-list-properties", "arguments": {
>>>>>>>     "typename": "pxa2xx-pcmcia" } }
>>>>>>>     qemu-system-aarch64: /home/armbru/work/qemu/memory.c:1307:
>>>>>>>     memory_region_finalize: Assertion `((&mr->subregions)->tqh_first
>>>>>>>     == ((void *)0))' failed.
>>>>>>>     Aborted (core dumped)
>>>>>>>     [Exit 134 (SIGABRT)]
>>>>>>>
>>>>>>> Unfortunately, I can't fix the problems in these devices right now.
>>>>>>> Instead, add DeviceClass member cannot_even_create_with_object_new_yet
>>>>>>> to mark them:
>>>> ...
>>>>>>> static void pxa2xx_pcmcia_register_types(void)
>>>>>>> diff --git a/hw/ppc/spapr_rng.c b/hw/ppc/spapr_rng.c
>>>>>>> index ed43d5e..e1b115d 100644
>>>>>>> --- a/hw/ppc/spapr_rng.c
>>>>>>> +++ b/hw/ppc/spapr_rng.c
>>>>>>> @@ -169,6 +169,11 @@ static void spapr_rng_class_init(ObjectClass *oc,
>>>>>>> void *data)
>>>>>>>      dc->realize = spapr_rng_realize;
>>>>>>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>>>>>>      dc->props = spapr_rng_properties;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Reason: crashes device-introspect-test for unknown reason.
>>>>>>> +     */
>>>>>>> +    dc->cannot_even_create_with_object_new_yet = true;
>>>>>>>  }
>>>>>>
>>>>>> Please don't do that! That breaks the help output from
>>>>>> "-device spapr-rng,?" which should help the user to see how to use this
>>>>>> device!
>>>>>
>>>>> Well, device-introspection-test makes qemu crash, with the backtrace
>>>>> pointing squarely to this device. Stands to reason that device
>>>>> introspection could crash in normal usage, too. Until the crash is
>>>>> debugged, we better disable introspection of this device.
>>>>>
>>>>> I quite agree that disabling introspection hurts users. Just not as
>>>>> much as crashes :)
>>>>>
>>>>>> I tried to debug why this device breaks the test, but the test
>>>>>> environment is giving me a hard time ... how do you best hook a gdb into
>>>>>> that framework, so you can trace such problems?
>>>>>> Anyway, with some trial and error, I found out that it seems like the
>>>>>>
>>>>>>    object_resolve_path_type("", TYPE_SPAPR_RNG, NULL)
>>>>>>
>>>>>> in spapr_rng_instance_init() is causing the problems. Could it be that
>>>>>> object_resolve_path_type is not working with the test environment?
>>>>>
>>>>> I tried to figure out why this device breaks under this test, but
>>>>> couldn't, so I posted with the "for unknown reason" comment.
>>>>
>>>> I've debugged this now for a while (thanks for the tip with
>>>> MALLOC_PERTURB, by the way!)
>>>> and it seems to me that the problem is in
>>>> the macio object than in spapr-rng - the latter is just the victim of
>>>> some memory corruption caused by the first one: The
>>>> object_resolve_path_type() crashes while trying to go through the macio
>>>> object.
>>>>
>>>> So could you please add the "dc->cannot_even_create_with_object_new_yet
>>>> = true;" to macio_class_init() instead? ... that seems to fix the crash
>>>> for me, too, and is likely the better place.
>>>
>>> Hmm.
>>>
>>> For most of the devices my patch marks, we have a pretty good idea on
>>> what's wrong with them. spapr-rng is among the exceptions. You believe
>>> it's actually "the macio object". Which one? "macio" is abstract...
>>>
>>> You report introspecting "spapr-rng" crashes "while trying to go through
>>> the macio object". I wonder how omitting introspection of macio objects
>>> (that's what marking them does to this test) could affect the object
>>> we're going through when we crash.
>>
>> I have to correct myself: It's not going through the macio object, the
>> problem is actually the "macio[0]" property that is created during
>> memory_region_init() with object_property_add_child() ... the property
>> points to a free()d object when the crash happens.
>>
>>>> Or maybe we could get this also fixed? The problem could be the
>>>> memory_region_init(&s->bar, NULL, "macio", 0x80000) in
>>>> macio_instance_init() ... is this ok here? Or does this rather have to
>>>> go to the realize() function instead?
>>>
>>> Hmm, does creating and destroying a macio object leave the memory region
>>> behind?
>>>
>>> Paolo, is calling memory_region_init() in an instance_init() method
>>> okay?
>>
>> As Paolo mentioned, we likely need to pass an "owner" to
>> memory_region_init() or the macio memory region will get attached to
>> "/unattached" instead - and then leave a dangling link property behind
>> when the original macio object got destroyed.
>>
>> By the way, there are some more spots like this in the code, e.g. in
>> pxa2xx_fir_instance_init() in hw/arm/pxa2xx.c ...
>
> That's a memory_region_init_io(), so I should search for that pattern,
> too. Any memory_region_init*() in fact, I guess. >300 hits :(

I tracked down problematic devices in two ways:

1. I made device-introspection-test run "info qom-tree", which has a
   lovely propensity to crash when a crappy device left a dangling
   pointer behind. This led me to "cgthree", "cuda",
   "integrator_debug", "macio-oldworld", "macio-newworld",
   "pxa2xx-fir", "SUNW,tcx".

   They all create memory regions without owner in their
   instance_init() method.

   "pxa2xx-pcmcia" does, too. It's already marked in v3, because it
   actually crashes. Perhaps it has additional problems.

2. I instrumented memory_region_init() and object_init_with_type() to
   crash when the former is called with null owner from within
   ->instance_init(). I verified this catches cases like the above.
   It doesn't catch any new ones. This makes me reasonably confident
   I got them all.

I'll send out v4 shortly.
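
For illustration only, here is a rough sketch of the kind of check
described in item 2. It is a toy model, not the instrumentation that was
actually used and not QEMU code: every name in it (ToyObject,
ToyMemoryRegion, toy_memory_region_init(), toy_object_init(),
toy_demo()) is hypothetical. It also counts violations instead of
crashing, purely so the behaviour is easy to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT QEMU's Object/MemoryRegion. */
typedef struct ToyObject ToyObject;

typedef struct ToyMemoryRegion {
    ToyObject *owner;
    const char *name;
} ToyMemoryRegion;

struct ToyObject {
    const char *type_name;
    ToyMemoryRegion mr;
};

/* Non-zero while some instance_init() callback is running. */
int instance_init_depth;
/* Number of null-owner region inits seen inside instance_init(). */
int null_owner_violations;

/* Toy region init: flags a null owner when called from instance_init(). */
void toy_memory_region_init(ToyMemoryRegion *mr, ToyObject *owner,
                            const char *name)
{
    assert(mr != NULL);
    if (instance_init_depth > 0 && owner == NULL) {
        /* An instrumentation meant to crash could call abort() here. */
        fprintf(stderr, "region '%s' has no owner in instance_init()\n",
                name);
        null_owner_violations++;
    }
    mr->owner = owner;
    mr->name = name;
}

/* Toy object init: brackets the callback so the check above can fire. */
void toy_object_init(ToyObject *obj, const char *type_name,
                     void (*instance_init)(ToyObject *))
{
    obj->type_name = type_name;
    instance_init_depth++;
    instance_init(obj);
    instance_init_depth--;
}

/* A device that forgets the owner, like the ones listed in item 1. */
void toy_bad_instance_init(ToyObject *obj)
{
    toy_memory_region_init(&obj->mr, NULL, "bad-region");
}

/* A device that passes itself as owner. */
void toy_good_instance_init(ToyObject *obj)
{
    toy_memory_region_init(&obj->mr, obj, "good-region");
}

/* Initializes one bad and one good device; returns the violation count. */
int toy_demo(void)
{
    ToyObject bad, good;

    null_owner_violations = 0;
    toy_object_init(&bad, "toy-bad", toy_bad_instance_init);
    toy_object_init(&good, "toy-good", toy_good_instance_init);
    return null_owner_violations;
}
```

Calling toy_demo() should flag only the device that passes a null
owner. A region created with a null owner outside any instance_init()
callback is deliberately not flagged, matching the narrow condition
described in item 2.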