Re: [Xenomai] native/heap: "removing non-linked element"

Philippe Gerum Thu, 16 May 2013 00:56:19 -0700

On 05/15/2013 11:46 PM, Gilles Chanteperdrix wrote:
> On 05/15/2013 08:01 PM, Kai Bollue wrote:
> 
>> On 09.05.2013 18:31, Gilles Chanteperdrix wrote:
>>> On 05/09/2013 06:13 PM, Gilles Chanteperdrix wrote:
>>>
>>>> On 05/02/2013 08:45 PM, Kai Bollue wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> we experience a crash upon unbinding of a previously deleted (and
>>>>> cleaned up) shared heap.
>>>>> Scheme:
>>>>> - Process A calls rt_heap_create() (with H_SHARED flag), waits for some
>>>>> time and then terminates.
>>>>> - Process B calls rt_heap_bind() on that heap, uses it and calls
>>>>> rt_heap_unbind() (or terminates) after process A has terminated.
>>>>>
>>>>> Then the system crashes after the output of "Xenomai: removing
>>>>> non-linked element, holder=ffffc900125e4940, qslot=ffff880427aa90f8 at
>>>>> kernel/xenomai/skins/native/heap.c:374".
>>>>>
>>>>> The crash does not always happen, but can quite reliably be reproduced
>>>>> by starting process A in a loop from bash (while [ TRUE ]; do ...) and
>>>>> keeping process B running.
>>>>>
>>>>> Two aspects seem to be crucial:
>>>>> - Calling rt_heap_delete() in process A is not sufficient to reproduce
>>>>> the problem, the process has to terminate (the cleaning up seems to be
>>>>> relevant).
>>>>> - We could only reproduce the crash as long as process B accessed the
>>>>> heap after process A had terminated (e.g. using memcpy).
>>>>>
>>>>> As a workaround, it could be tried to avoid access to a deleted heap,
>>>>> but it is not always possible to detect the termination of process A on
>>>>> time in such a constellation.
>>>>>
>>>>> The system:
>>>>> - AMD AM3 FX-8350
>>>>> - Debian 6.0
>>>>> - Kernel 3.5.7
>>>>> - Xenomai 2.6.2.1
>>>>>
>>>>> We also tested this on an older system (Xenomai 2.6.0, Kernel 2.6.37):
>>>>> Here, both processes hung indefinitely and could not be killed, but the
>>>>> system did not crash.
>>>>>
>>>>> Any hints are appreciated.
>>>>>
>>>>> Attachments:
>>>>> - Console output
>>>>> - Code of process A
>>>>> - Code of process B
>>>>
>>>> Hi Kai,
>>>>
>>>> thank you very much for your test case, it allowed to reproduce the
>>>> issue and try and understand what happens.
>>>>
>>>>  From what I understand, processA creates the shared heap which is added
>>>> to the list of the objects it holds (xeno_get_rholder()), when processA
>>>> dies, the heap is removed from the list, but not destroyed because it is
>>>> also bound to processB.
>>>>
>>>> Then processB unbinds the heap, which triggers an auto-destruction,
>>>> which tries to remove the heap from processA list again. If processA
>>>> control block has not been re-used, this works, because the list is
>>>> still there, if processA has be re-launched, the control block has been
>>>> reinitialized, as well as the list, so removing the element from the
>>>> list fails.
>>
>> Hi Gilles,
>>
>> thank you very much for your analysis and suggestions.
>>
>>>> I see several possible corrections:
>>>> - get rt_heap_delete to return an error when the heap is currently bound
>>>> to another process (EBUSY for instance), while still unmapping it from
>>>> the current process. This will cause __xeno_flush_rq to move the heap to
>>>> the "global" ressource holder, where it can safely be deleted later
>>
>> I am not sure if this is the best solution as the the heap object itself 
>> can actually be deleted, only the underlying xnheap remains.
>>
>>>> - put any rt_heap with the H_MAPPABLE flag directly on the global
>>>> ressource holder, as it is a global object anyway, this means that when
>>>> a process which created a mappable heap dies, the heap survives, but
>>>> this is maybe what should be expected from shareable heaps.
>>
>> This is probably better, but:
>>
>>>
>>> - or remove the rt_heap from the list directly in rt_heap_delete, it
>>> does not seem to make sense to keep it in the list after it has been
>>> deleted: it will be automatically deleted when the last process bound to
>>> it unbinds it anyway.
>>>
>>
>> This is IMHO the most consistent solution. With the following change, we 
>> cannot reproduce the crash anymore:
>>
>> diff --git a/ksrc/skins/native/heap.c b/ksrc/skins/native/heap.c
>> index 4a39d07..be4aee9 100644
>> --- a/ksrc/skins/native/heap.c
>> +++ b/ksrc/skins/native/heap.c
>> @@ -371,8 +371,6 @@ static void __heap_post_release(struct xnheap *h)
>>
>>          xnlock_get_irqsave(&nklock, s);
>>
>> -       removeq(heap->rqueue, &heap->rlink);
>> -
>>          if (heap->handle)
>>                  xnregistry_remove(heap->handle);
>>
>> @@ -442,6 +440,8 @@ int rt_heap_delete_inner(RT_HEAP *heap, void __user 
>> *mapaddr)
>>
>>          xeno_mark_deleted(heap);
>>
>> +       removeq(heap->rqueue, &heap->rlink);
>> +
>>          /* Get out of the nklocked section before releasing the heap
>>             memory, since we are about to invoke Linux kernel
> 
>>             services. */
> 
> Yes, this is the fix I pushed:
> http://git.xenomai.org/?p=xenomai-2.6.git;a=commitdiff;h=ee28ad6936964ebf4198dcfd77dec4b8c5e8623c;hp=a2a6d456b23b9960f46964505668619b90b69400
>


RT_QUEUE implementation follows the same pattern. You may want to push a
fix for this one too.

-- 
Philippe.

_______________________________________________
Xenomai mailing list
[email protected]
http://www.xenomai.org/mailman/listinfo/xenomai

Re: [Xenomai] native/heap: "removing non-linked element"

Reply via email to