On 04.11.2010 09:45, Anders Blomdell wrote:
> Jan Kiszka wrote:
>> On 04.11.2010 01:13, Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> On 04.11.2010 00:56, Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> On 04.11.2010 00:44, Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> On 04.11.2010 00:18, Gilles Chanteperdrix wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> On 04.11.2010 00:11, Gilles Chanteperdrix wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> On 03.11.2010 23:11, Jan Kiszka wrote:
>>>>>>>>>>>>> On 03.11.2010 23:03, Jan Kiszka wrote:
>>>>>>>>>>>>>> But we do not always use atomic ops for manipulating
>>>>>>>>>>>>>> status bits (yet we do in other cases where there is no
>>>>>>>>>>>>>> need - a different story). This may fix the race:
>>>>>>>>>>>>> Err, nonsense. As we manipulate xnsched::status also
>>>>>>>>>>>>> outside of nklock
>>>>>>>>>>>>> protection, we must _always_ use atomic ops.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This screams for a cleanup: local-only bits like XNHTICK
>>>>>>>>>>>>> or XNINIRQ should be pushed into a separate status word
>>>>>>>>>>>>> that can then be safely modified non-atomically.
>>>>>>>>>>>> Second try to fix and clean up the sched status bits.
>>>>>>>>>>>> Anders, please
>>>>>>>>>>>> test.
>>>>>>>>>>>>
>>>>>>>>>>>> Jan
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
>>>>>>>>>>>> index 01ff0a7..5987a1f 100644
>>>>>>>>>>>> --- a/include/nucleus/pod.h
>>>>>>>>>>>> +++ b/include/nucleus/pod.h
>>>>>>>>>>>> @@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
>>>>>>>>>>>>       * context is active, or if we are caught in the middle
>>>>>>>>>>>> of a
>>>>>>>>>>>>       * unlocked context switch.
>>>>>>>>>>>>       */
>>>>>>>>>>>> -#if XENO_DEBUG(NUCLEUS)
>>>>>>>>>>>>      if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
>>>>>>>>>>>>          return;
>>>>>>>>>>>> -#else /* !XENO_DEBUG(NUCLEUS) */
>>>>>>>>>>>> -    if (testbits(sched->status,
>>>>>>>>>>>> -             XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
>>>>>>>>>>>> +#if !XENO_DEBUG(NUCLEUS)
>>>>>>>>>>>> +    if (!sched->resched)
>>>>>>>>>>>>          return;
>>>>>>>>>>>>  #endif /* !XENO_DEBUG(NUCLEUS) */
>>>>>>>>>>> Having only one test was really nice here; maybe we could
>>>>>>>>>>> simply issue a barrier before reading the status?
>>>>>>>>>>>
>>>>>>>>>> I agree - but the alternative is letting all modifications
>>>>>>>>>> of xnsched::status use atomic bitops (which is required when
>>>>>>>>>> folding all bits into a single word). And that would be much
>>>>>>>>>> more costly, especially on SMP.
>>>>>>>>> What about issuing a barrier before testing the status?
>>>>>>>>>
>>>>>>>> The problem is not about reading but about writing the status
>>>>>>>> concurrently, so it is not about the code you see above.
>>>>>>> The bits are modified under nklock, which implies a barrier
>>>>>>> when unlocking. Furthermore, an IPI is guaranteed to be
>>>>>>> received on the remote CPU after this barrier, so a barrier
>>>>>>> should be enough to see the modifications made remotely.
>>>>>> Check nucleus/intr.c for tons of unprotected status modifications.
>>>>> Ok. Then maybe we should reconsider the original decision to
>>>>> start fiddling with the XNRESCHED bit remotely.
>>>> ...which removed complexity and fixed a race? Let's rather review
>>>> the checks done in xnpod_schedule vs. its callers; I bet there is
>>>> more to save (IOW: remove the need to test for sched->resched).
>>> Not that much complexity... and the race was a false positive in
>>> debug code, no big deal. At least it worked, and it has done so for
>>> a long time. No atomics needed, no barrier, only one test in
>>> xnpod_schedule. And a nice invariant: sched->status is always
>>> accessed on the local CPU. What else?
>>
>> Take a step back and look at the root cause of this issue again. An
>> unlocked
>>
>>     if need-resched
>>         __xnpod_schedule
>>
>> is inherently racy and always will be (not only for the remote
>> reschedule case, BTW). So we either have to accept this and remove
>> the debugging check from the scheduler, or push the check back into
>> __xnpod_schedule where it once came from. Once this is cleaned up,
>> we can look into the remote resched protocol again.
> Probably being daft here; why not stop fiddling with remote CPU status
> bits and always do a reschedule on IPI IRQs?

Yes, that's what we did before. But the snippet above was, and still
is, the issue to resolve first. It motivated the change in the remote
reschedule protocol, to avoid one of the possible races around this
check. Two quick sketches of the underlying problem and of the
proposed split follow below.
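
To make the lost-update hazard concrete: a plain "status |= bit" is a
load/modify/store sequence, so a bit set by a remote CPU between the
load and the store is silently wiped out. Here is a minimal userspace
sketch that simulates this interleaving deterministically - all names
and bit values are made up for illustration, this is not actual
nucleus code:

#include <stdio.h>

#define XNRESCHED 0x1UL		/* illustrative bit values only */
#define XNHTICK   0x2UL

static unsigned long status;

int main(void)
{
	unsigned long tmp;

	tmp = status;		/* local CPU: load (status == 0)   */
	tmp |= XNHTICK;		/* local CPU: modify               */

	status |= XNRESCHED;	/* remote CPU: sets XNRESCHED now  */

	status = tmp;		/* local CPU: store - overwrites   */
				/* the remote XNRESCHED update     */

	printf("status = %#lx (XNRESCHED lost)\n", status);
	return 0;
}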
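
And a sketch of the split proposed earlier in the thread: cross-CPU
bits stay in a word that is only touched with atomic bitops, while
local-only bits move to a word that is only ever accessed from the
owning CPU and therefore needs no atomics. Field and helper names
below are hypothetical, not the actual Xenomai interface:

#include <linux/bitops.h>

#define XNRESCHED_BIT	0	/* bit numbers, illustrative only */
#define XNINIRQ_BIT	0
#define XNHTICK_BIT	1

struct xnsched_sketch {
	unsigned long status;	/* shared bits: may be set remotely,
				   atomic ops only */
	unsigned long lflags;	/* local-only bits: owning CPU only,
				   plain RMW is safe */
};

static inline void sched_set_resched(struct xnsched_sketch *sched)
{
	set_bit(XNRESCHED_BIT, &sched->status);	/* atomic RMW */
}

static inline void sched_set_inirq(struct xnsched_sketch *sched)
{
	__set_bit(XNINIRQ_BIT, &sched->lflags);	/* non-atomic is fine:
						   never touched remotely */
}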

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
