Markus Armbruster <arm...@redhat.com> writes:

> Thomas Huth <th...@redhat.com> writes:
>
>> On 28/09/15 10:11, Markus Armbruster wrote:
>>> Thomas Huth <th...@redhat.com> writes:
>>> 
>>>> On 25/09/15 16:17, Markus Armbruster wrote:
>>>>> Thomas Huth <th...@redhat.com> writes:
>>>>>
>>>>>> On 24/09/15 20:57, Markus Armbruster wrote:
>>>>>>> Several devices don't survive object_unref(object_new(T)): they crash
>>>>>>> or hang during cleanup, or they leave dangling pointers behind.
>>>>>>>
>>>>>>> This breaks at least device-list-properties, because
>>>>>>> qmp_device_list_properties() needs to create a device to find its
>>>>>>> properties.  Broken in commit f4eb32b "qmp: show QOM properties in
>>>>>>> device-list-properties", v2.1.  Example reproducer:
>>>>>>>
>>>>>>>     $ qemu-system-aarch64 -nodefaults -display none -machine none -S -qmp stdio
>>>>>>>     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 4, "major": 2}, "package": ""}, "capabilities": []}}
>>>>>>>     { "execute": "qmp_capabilities" }
>>>>>>>     {"return": {}}
>>>>>>>     { "execute": "device-list-properties", "arguments": { "typename": "pxa2xx-pcmcia" } }
>>>>>>>     qemu-system-aarch64: /home/armbru/work/qemu/memory.c:1307: memory_region_finalize: Assertion `((&mr->subregions)->tqh_first == ((void *)0))' failed.
>>>>>>>     Aborted (core dumped)
>>>>>>>     [Exit 134 (SIGABRT)]
>>>>>>>
>>>>>>> Unfortunately, I can't fix the problems in these devices right now.
>>>>>>> Instead, add DeviceClass member cannot_even_create_with_object_new_yet
>>>>>>> to mark them:
>>>> ...
>>>>>>>  static void pxa2xx_pcmcia_register_types(void)
>>>>>>> diff --git a/hw/ppc/spapr_rng.c b/hw/ppc/spapr_rng.c
>>>>>>> index ed43d5e..e1b115d 100644
>>>>>>> --- a/hw/ppc/spapr_rng.c
>>>>>>> +++ b/hw/ppc/spapr_rng.c
>>>>>>> @@ -169,6 +169,11 @@ static void spapr_rng_class_init(ObjectClass *oc, void *data)
>>>>>>>      dc->realize = spapr_rng_realize;
>>>>>>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>>>>>>      dc->props = spapr_rng_properties;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * Reason: crashes device-introspect-test for unknown reason.
>>>>>>> +     */
>>>>>>> +    dc->cannot_even_create_with_object_new_yet = true;
>>>>>>>  }
>>>>>>
>>>>>> Please don't do that! That breaks the help output from
>>>>>> "-device spapr-rng,?" which should help the user to see how to use this
>>>>>> device!
>>>>>
>>>>> Well, device-introspect-test makes qemu crash, with the backtrace
>>>>> pointing squarely to this device.  Stands to reason that device
>>>>> introspection could crash in normal usage, too.  Until the crash is
>>>>> debugged, we better disable introspection of this device.
>>>>>
>>>>> I quite agree that disabling introspection hurts users.  Just not as
>>>>> much as crashes :)
>>>>>
>>>>>> I tried to debug why this device breaks the test, but the test
>>>>>> environment is giving me a hard time ... how do you best hook a gdb into
>>>>>> that framework, so you can trace such problems?
>>>>>> Anyway, with some trial and error, I found out that it seems like the
>>>>>>
>>>>>>   object_resolve_path_type("", TYPE_SPAPR_RNG, NULL)
>>>>>>
>>>>>> in spapr_rng_instance_init() is causing the problems. Could it be that
>>>>>> object_resolve_path_type is not working with the test environment?
>>>>>
>>>>> I tried to figure out why this device breaks under this test, but
>>>>> couldn't, so I posted with the "for unknown reason" comment.
>>>>
>>>> I've debugged this now for a while (thanks for the tip with
>>>> MALLOC_PERTURB_, by the way!) and it seems to me that the problem is in
>>>> the macio object rather than in spapr-rng - the latter is just the victim of
>>>> some memory corruption caused by the first one: The
>>>> object_resolve_path_type() crashes while trying to go through the macio
>>>> object.
>>>>
>>>> So could you please add the "dc->cannot_even_create_with_object_new_yet
>>>> = true;" to macio_class_init() instead? ... that seems to fix the crash
>>>> for me, too, and is likely the better place.
>>> 
>>> Hmm.
>>> 
>>> For most of the devices my patch marks, we have a pretty good idea on
>>> what's wrong with them.  spapr-rng is among the exceptions.  You believe
>>> it's actually "the macio object".  Which one?  "macio" is abstract...
>>> 
>>> You report introspecting "spapr-rng" crashes "while trying to go through
>>> the macio object".  I wonder how omitting introspection of macio objects
>>> (that's what marking them does to this test) could affect the object
>>> we're going through when we crash.
>>
>> I have to correct myself: It's not going through the macio object, the
>> problem is actually the "macio[0]" property that is created during
>> memory_region_init() with object_property_add_child() ... the property
>> points to a free()d object when the crash happens.
>>
>>>> Or maybe we could get this also fixed? The problem could be the
>>>> memory_region_init(&s->bar, NULL, "macio", 0x80000) in
>>>> macio_instance_init() ... is this ok here? Or does this rather have to
>>>> go to the realize() function instead?
>>> 
>>> Hmm, does creating and destroying a macio object leave the memory region
>>> behind?
>>> 
>>> Paolo, is calling memory_region_init() in an instance_init() method
>>> okay?
>>
>> As Paolo mentioned, we likely need to pass an "owner" to
>> memory_region_init() or the macio memory region will get attached to
>> "/unattached" instead - and then leave a dangling link property behind
>> when the original macio object got destroyed.
>>
>> By the way, there are some more spots like this in the code, e.g. in
>> pxa2xx_fir_instance_init() in hw/arm/pxa2xx.c ...
>
> That's a memory_region_init_io(), so I should search for that pattern,
> too.  Any memory_region_init*() in fact, I guess.  >300 hits :(

I tracked down problematic devices in two ways:

1. I made device-introspect-test run "info qom-tree", which has a
   lovely propensity to crash when a crappy device leaves a dangling
   pointer behind.  This led me to "cgthree", "cuda", "integrator_debug",
   "macio-oldworld", "macio-newworld", "pxa2xx-fir", "SUNW,tcx".  They
   all create memory regions without owner in their instance_init()
   method.

   "pxa2xx-pcmcia" does, too.  It's already marked in v3, because it
   actually crashes.  Perhaps it has additional problems.

2. I instrumented memory_region_init() and object_init_with_type() to
   crash when the former is called with null owner from within
   ->instance_init().  I verified this catches cases like the above.  It
   doesn't catch any new ones.  This makes me reasonably confident I got
   them all.
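
For illustration, the idea behind approach 2 can be sketched as a
stand-alone toy model.  This is not the actual instrumentation: Object
and MemoryRegion below are simplified stand-ins, the function signatures
are cut down, the two example init functions and check_devices() are
hypothetical, and the sketch counts violations where the real hack
crashed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for QEMU's types (assumption, not the real ones). */
typedef struct Object { const char *type_name; } Object;
typedef struct MemoryRegion {
    Object *owner;
    const char *name;
    uint64_t size;
} MemoryRegion;

/* Number of ->instance_init() calls currently in progress. */
static int instance_init_depth;
/* Violations seen so far; the real instrumentation crashed instead. */
static int ownerless_inits;

static void memory_region_init(MemoryRegion *mr, Object *owner,
                               const char *name, uint64_t size)
{
    /* Flag regions created without owner from within instance_init. */
    if (instance_init_depth > 0 && owner == NULL) {
        ownerless_inits++;
        fprintf(stderr, "region '%s' has no owner in instance_init\n", name);
    }
    mr->owner = owner;
    mr->name = name;
    mr->size = size;
}

/* Cut-down model: just runs the init hook while tracking the depth. */
static void object_init_with_type(Object *obj, void (*instance_init)(Object *))
{
    instance_init_depth++;
    instance_init(obj);
    instance_init_depth--;
}

static MemoryRegion bad_bar, good_bar;

/* Hypothetical device that repeats the macio pattern: owner is NULL. */
static void bad_instance_init(Object *obj)
{
    (void)obj;
    memory_region_init(&bad_bar, NULL, "macio", 0x80000);
}

/* Hypothetical device that passes itself as the region's owner. */
static void good_instance_init(Object *obj)
{
    memory_region_init(&good_bar, obj, "macio", 0x80000);
}

/* Initialize both example devices and return the number of violations. */
int check_devices(void)
{
    Object bad = { "bad-device" }, good = { "good-device" };

    ownerless_inits = 0;
    object_init_with_type(&good, good_instance_init);
    object_init_with_type(&bad, bad_instance_init);
    return ownerless_inits;
}
```

Calling check_devices() from a test flags only the device that passes a
null owner.  Counting instead of aborting is a choice made for the
sketch, so that one run can report every offender.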

I'll send out v4 shortly.
