Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-16 Thread Jan Kiszka
On 2011-07-16 11:56, Philippe Gerum wrote:
> On Sat, 2011-07-16 at 11:15 +0200, Jan Kiszka wrote:
>> On 2011-07-16 10:52, Philippe Gerum wrote:
>>> On Sat, 2011-07-16 at 10:13 +0200, Jan Kiszka wrote:
 On 2011-07-15 15:10, Jan Kiszka wrote:
> But... right now it looks like we found our primary regression:
> "nucleus/shadow: shorten the uninterruptible path to secondary mode".
> It opens a short window during relax where the migrated task may be
> active under both schedulers. We are currently evaluating a revert
> (looks good so far), and I need to work out my theory in more
> detail.

 Looks like this commit just made a long-standing flaw in Xenomai's
 interrupt handling more visible: We reschedule over the interrupt stack
 in the Xenomai interrupt handler tails, at least on x86-64. Not sure if
 other archs have interrupt stacks; the point is that Xenomai's design wrongly
 assumes there are no such things.
>>>
>>> Fortunately, no, this is not a design issue, no such assumption was ever
>>> made, but the Xenomai core expects this to be handled on a per-arch
>>> basis with the interrupt pipeline.
>>
>> And that's already the problem: If Linux uses interrupt stacks, relying 
>> on ipipe to disable this during Xenomai interrupt handler execution is 
>> at best a workaround. A fragile one unless you increase the per-thread 
>> stack size by the size of the interrupt stack. Lacking support for a 
>> generic rescheduling hook became a problem by the time Linux introduced 
>> interrupt threads.
> 
> Don't assume too much. What was done for ppc64 was not meant as a
> general policy. Again, this is a per-arch decision.

Actually, it was the right decision, not only for ppc64: Reusing Linux
interrupt stacks for Xenomai does not work. If we interrupt Linux while
it is already running over the interrupt stack, the stack becomes taboo
on that CPU. From that point on, no RT IRQ may run over the Linux
interrupt stack, as it would smash it.

But then the question is why we should try to use the interrupt stacks
for Xenomai at all. It's better to increase the task kernel stacks and
disable interrupt stacks when ipipe is enabled. That's what I'm heading
for with x86-64 now (THREAD_ORDER 2, no stack switching).
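
To put rough numbers on that trade-off, here is a minimal userspace sketch; the
THREAD_SIZE = PAGE_SIZE << THREAD_ORDER rule follows the usual kernel convention,
but the 4 KiB page size and the 16 KiB interrupt-stack figure are assumptions for
illustration only:

/*
 * Back-of-the-envelope check of the "bigger task stacks, no interrupt
 * stacks" trade-off. Figures are assumptions: 4 KiB pages, 16 KiB IRQ stack.
 */
#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long thread_size(unsigned int thread_order)
{
	return PAGE_SIZE << thread_order;	/* THREAD_SIZE rule */
}

int main(void)
{
	unsigned long split_task = thread_size(1);	/* order-1 task stack */
	unsigned long split_irq = 16384UL;		/* separate IRQ stack (assumed size) */
	unsigned long merged = thread_size(2);		/* THREAD_ORDER 2, no stack switching */

	printf("split : %lu task + %lu irq\n", split_task, split_irq);
	printf("merged: %lu per task, interrupts included\n", merged);
	return 0;
}

With order-2 stacks every task pays for the worst-case interrupt nesting up
front, which is the price for never rescheduling over a shared stack.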

What we may do is introduce per-domain interrupt stacks. But that's at
best Xenomai 3 / I-pipe 3 stuff.

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-16 Thread Philippe Gerum
On Sat, 2011-07-16 at 11:15 +0200, Jan Kiszka wrote:
> On 2011-07-16 10:52, Philippe Gerum wrote:
> > On Sat, 2011-07-16 at 10:13 +0200, Jan Kiszka wrote:
> >> On 2011-07-15 15:10, Jan Kiszka wrote:
> >>> But... right now it looks like we found our primary regression:
> >>> "nucleus/shadow: shorten the uninterruptible path to secondary mode".
> >>> It opens a short window during relax where the migrated task may be
> >>> active under both schedulers. We are currently evaluating a revert
> >>> (looks good so far), and I need to work out my theory in more
> >>> detail.
> >>
> >> Looks like this commit just made a long-standing flaw in Xenomai's
> >> interrupt handling more visible: We reschedule over the interrupt stack
> >> in the Xenomai interrupt handler tails, at least on x86-64. Not sure if
> >> other archs have interrupt stacks; the point is that Xenomai's design wrongly
> >> assumes there are no such things.
> >
> > Fortunately, no, this is not a design issue, no such assumption was ever
> > made, but the Xenomai core expects this to be handled on a per-arch
> > basis with the interrupt pipeline.
> 
> And that's already the problem: If Linux uses interrupt stacks, relying 
> on ipipe to disable this during Xenomai interrupt handler execution is 
> at best a workaround. A fragile one unless you increase the per-thread 
> stack size by the size of the interrupt stack. Lacking support for a 
> generic rescheduling hook became a problem by the time Linux introduced 
> interrupt threads.

Don't assume too much. What was done for ppc64 was not meant as a
general policy. Again, this is a per-arch decision.

> 
> > As you pointed out, there is no way
> > to handle this via some generic Xenomai-only support.
> >
> > ppc64 now has separate interrupt stacks, which is why I disabled
> > IRQSTACKS which became the builtin default at some point. Blackfin goes
> > through a Xenomai-defined irq tail handler as well, because it may not
> > reschedule over nested interrupt stacks.
> 
> How does this arch prevent xnpod_schedule in the generic interrupt 
> handler tail from doing its normal work?

It polls some hw status to know whether a rescheduling would be safe.
See xnarch_escalate().

> 
> > Fact is that such pending
> > problem with x86_64 was overlooked since day #1 by /me.
> >
> >>   We were lucky so far that the values
> >> saved on this shared stack were apparently "compatible", meaning we were
> >> overwriting them with identical or harmless values. But that's no longer
> >> true when interrupts are hitting us in the xnpod_suspend_thread path of
> >> a relaxing shadow.
> >>
> >
> > Makes sense. It would be better to find a solution that does not make
> > the relax path uninterruptible again for a significant amount of time.
> > On low end platforms we support (i.e. non-x86* mainly), this causes
> > obvious latency spots.
> 
> I agree. Conceptually, the interruptible relaxation should be safe now 
> after recent fixes.
> 
> >
> >> Likely the only possible fix is establishing a reschedule hook for
> >> Xenomai in the interrupt exit path after the original stack is restored
> >> - just like Linux works. Requires changes to both ipipe and Xenomai
> >> unfortunately.
> >
> > __ipipe_run_irqtail() is in the I-pipe core for such purpose. If
> > instantiated properly for x86_64, and paired with xnarch_escalate() for
> > that arch as well, it could be an option for running the rescheduling
> > procedure when safe.
> 
> Nope, that doesn't work. The stack is switched later in the return path 
> in entry_64.S. We need a hook there, ideally a conditional one, 
> controlled by some per-cpu variable that is set by Xenomai on return 
> from its interrupt handlers to signal the rescheduling need.
> 

Yes, makes sense. The way to make it conditional without dragging bits
of Xenomai logic into the kernel innards is not obvious though.

It is probably time to officially introduce "exo-kernel" oriented bits
into the Linux thread info. PTDs have too loose semantics to be practical
if we want to avoid trashing the I-cache by calling probe hooks within
the dual kernel, each time we want to check some basic condition (e.g.
resched needed). A backlink to a foreign TCB there would help too.

Which leads us to killing the ad hoc kernel threads (and stacks) at some
point, which are an absolute pain.
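
As a sketch of that hook, with entirely made-up names (xeno_resched_pending,
xeno_irq_tail, irq_exit_after_stack_restore) standing in for whatever the I-pipe
would actually provide: the RT interrupt tail only flags the rescheduling need,
and the flag, which could just as well be a TIF-style thread_info bit, is
consumed only after the interrupt stack has been left.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a per-CPU variable or a thread_info flag. */
static bool xeno_resched_pending;

/*
 * What the Xenomai interrupt handler tail would do instead of calling
 * xnpod_schedule() while still running on the interrupt stack.
 */
static void xeno_irq_tail(void)
{
	xeno_resched_pending = true;
}

/*
 * What the arch return path (entry_64.S in this discussion) would do
 * once the original task stack is back in place.
 */
static void irq_exit_after_stack_restore(void)
{
	if (xeno_resched_pending) {
		xeno_resched_pending = false;
		printf("reschedule the RT domain now, on the task stack\n");
	}
}

int main(void)
{
	xeno_irq_tail();			/* IRQ handled, resched wanted */
	irq_exit_after_stack_restore();		/* safe point reached */
	return 0;
}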

> Jan

-- 
Philippe.





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-16 Thread Jan Kiszka

On 2011-07-16 10:52, Philippe Gerum wrote:
> On Sat, 2011-07-16 at 10:13 +0200, Jan Kiszka wrote:
>> On 2011-07-15 15:10, Jan Kiszka wrote:
>>> But... right now it looks like we found our primary regression:
>>> "nucleus/shadow: shorten the uninterruptible path to secondary mode".
>>> It opens a short window during relax where the migrated task may be
>>> active under both schedulers. We are currently evaluating a revert
>>> (looks good so far), and I need to work out my theory in more
>>> detail.
>>
>> Looks like this commit just made a long-standing flaw in Xenomai's
>> interrupt handling more visible: We reschedule over the interrupt stack
>> in the Xenomai interrupt handler tails, at least on x86-64. Not sure if
>> other archs have interrupt stacks; the point is that Xenomai's design wrongly
>> assumes there are no such things.
>
> Fortunately, no, this is not a design issue, no such assumption was ever
> made, but the Xenomai core expects this to be handled on a per-arch
> basis with the interrupt pipeline.

And that's already the problem: If Linux uses interrupt stacks, relying
on ipipe to disable this during Xenomai interrupt handler execution is
at best a workaround. A fragile one unless you increase the per-thread
stack size by the size of the interrupt stack. Lacking support for a
generic rescheduling hook became a problem by the time Linux introduced
interrupt threads.

> As you pointed out, there is no way
> to handle this via some generic Xenomai-only support.
>
> ppc64 now has separate interrupt stacks, which is why I disabled
> IRQSTACKS which became the builtin default at some point. Blackfin goes
> through a Xenomai-defined irq tail handler as well, because it may not
> reschedule over nested interrupt stacks.

How does this arch prevent xnpod_schedule in the generic interrupt
handler tail from doing its normal work?

> Fact is that such pending
> problem with x86_64 was overlooked since day #1 by /me.
>
>> We were lucky so far that the values
>> saved on this shared stack were apparently "compatible", meaning we were
>> overwriting them with identical or harmless values. But that's no longer
>> true when interrupts are hitting us in the xnpod_suspend_thread path of
>> a relaxing shadow.
>
> Makes sense. It would be better to find a solution that does not make
> the relax path uninterruptible again for a significant amount of time.
> On low end platforms we support (i.e. non-x86* mainly), this causes
> obvious latency spots.

I agree. Conceptually, the interruptible relaxation should be safe now
after recent fixes.

>> Likely the only possible fix is establishing a reschedule hook for
>> Xenomai in the interrupt exit path after the original stack is restored
>> - just like Linux works. Requires changes to both ipipe and Xenomai
>> unfortunately.
>
> __ipipe_run_irqtail() is in the I-pipe core for such purpose. If
> instantiated properly for x86_64, and paired with xnarch_escalate() for
> that arch as well, it could be an option for running the rescheduling
> procedure when safe.

Nope, that doesn't work. The stack is switched later in the return path
in entry_64.S. We need a hook there, ideally a conditional one,
controlled by some per-cpu variable that is set by Xenomai on return
from its interrupt handlers to signal the rescheduling need.

Jan



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-16 Thread Philippe Gerum
On Sat, 2011-07-16 at 10:13 +0200, Jan Kiszka wrote:
> 
> On 2011-07-15 15:10, Jan Kiszka wrote:
> > But... right now it looks like we found our primary regression: 
> > "nucleus/shadow: shorten the uninterruptible path to secondary mode".
> > It opens a short window during relax where the migrated task may be
> > active under both schedulers. We are currently evaluating a revert
> > (looks good so far), and I need to work out my theory in more
> > detail.
> 
> Looks like this commit just made a long-standing flaw in Xenomai's
> interrupt handling more visible: We reschedule over the interrupt stack
> in the Xenomai interrupt handler tails, at least on x86-64. Not sure if
> other archs have interrupt stacks; the point is that Xenomai's design wrongly
> assumes there are no such things.

Fortunately, no, this is not a design issue, no such assumption was ever
made, but the Xenomai core expects this to be handled on a per-arch
basis with the interrupt pipeline. As you pointed out, there is no way
to handle this via some generic Xenomai-only support.

ppc64 now has separate interrupt stacks, which is why I disabled
IRQSTACKS which became the builtin default at some point. Blackfin goes
through a Xenomai-defined irq tail handler as well, because it may not
reschedule over nested interrupt stacks. Fact is that such pending
problem with x86_64 was overlooked since day #1 by /me.

>  We were lucky so far that the values
> saved on this shared stack were apparently "compatible", meaning we were
> overwriting them with identical or harmless values. But that's no longer
> true when interrupts are hitting us in the xnpod_suspend_thread path of
> a relaxing shadow.
> 

Makes sense. It would be better to find a solution that does not make
the relax path uninterruptible again for a significant amount of time.
On low end platforms we support (i.e. non-x86* mainly), this causes
obvious latency spots.

> Likely the only possible fix is establishing a reschedule hook for
> Xenomai in the interrupt exit path after the original stack is restored
> - just like Linux works. Requires changes to both ipipe and Xenomai
> unfortunately.

__ipipe_run_irqtail() is in the I-pipe core for such purpose. If
instantiated properly for x86_64, and paired with xnarch_escalate() for
that arch as well, it could be an option for running the rescheduling
procedure when safe.

> 
> Jan
> 

-- 
Philippe.





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-16 Thread Jan Kiszka

On 2011-07-15 15:10, Jan Kiszka wrote:
> But... right now it looks like we found our primary regression: 
> "nucleus/shadow: shorten the uninterruptible path to secondary mode".
> It opens a short window during relax where the migrated task may be
> active under both schedulers. We are currently evaluating a revert
> (looks good so far), and I need to work out my theory in more
> detail.

Looks like this commit just made a long-standing flaw in Xenomai's
interrupt handling more visible: We reschedule over the interrupt stack
in the Xenomai interrupt handler tails, at least on x86-64. Not sure if
other archs have interrupt stacks; the point is that Xenomai's design wrongly
assumes there are no such things. We were lucky so far that the values
saved on this shared stack were apparently "compatible", meaning we were
overwriting them with identical or harmless values. But that's no longer
true when interrupts are hitting us in the xnpod_suspend_thread path of
a relaxing shadow.

Likely the only possible fix is establishing a reschedule hook for
Xenomai in the interrupt exit path after the original stack is restored
- just like Linux works. Requires changes to both ipipe and Xenomai
unfortunately.

Jan




Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-15 Thread Jan Kiszka
On 2011-07-15 14:30, Gilles Chanteperdrix wrote:
> On 07/14/2011 10:57 PM, Jan Kiszka wrote:
>> On 2011-07-13 21:12, Gilles Chanteperdrix wrote:
>>> On 07/13/2011 09:04 PM, Jan Kiszka wrote:
 On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
>> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
 On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>  xnlock_put_irqrestore(&nklock, s);
>>  xnpod_schedule();
>>  }
>> @@ -1036,6 +1043,7 @@ redo:
>>   * to process this signal anyway.
>>   */
>>  if (rthal_current_domain == rthal_root_domain) {
>> +XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
>> XNATOMIC));
>
> Misleading dead code again, XNATOMIC is cleared not ten lines above.

 Nope, I forgot to remove that line.

>
>>  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>  || this_task->state != TASK_RUNNING))
>>  xnpod_fatal
>> @@ -1044,6 +1052,8 @@ redo:
>>  return -ERESTARTSYS;
>>  }
>>  
>> +xnthread_clear_info(thread, XNATOMIC);
>
> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
> place at the point it currently is.

 Nope. Now we either clear XNATOMIC after successful migration or when
 the signal is about to be sent (ie. in the hook). That way we can test
 more reliably (TM) in the gatekeeper if the thread can be migrated.
>>>
>>> Ok for adding the XNATOMIC test, because it improves the robustness, but
>>> why change the way XNATOMIC is set and cleared? Chances of breaking
>>> things while changing code in this area are really high...
>>
>> The current code is (most probably) broken as it does not properly
>> synchronize the gatekeeper against a signaled and "runaway" target
>> Linux task.
>>
>> We need an indication whether a Linux signal will wake up (or has already
>> woken up) the to-be-migrated task. That task may have continued over its context,
>> potentially on a different CPU. Providing this indication is the purpose
>> of changing where XNATOMIC is cleared.
>
> What about synchronizing with the gatekeeper with a semaphore, as done
> in the first patch you sent, but doing it in xnshadow_harden, as soon as
> we detect that we are not back from schedule in primary mode? It seems
> it would avoid any further issue, as we would then be guaranteed that
> the thread could not switch to TASK_INTERRUPTIBLE again before the
> gatekeeper is finished.

 The problem is that the gatekeeper tests the task state without holding
 the task's rq lock (which is not available to us without a kernel
 patch). That cannot work reliably as long as we accept signals. That's
 why I'm trying to move state change and test under nklock.

>
> What worries me is the comment in xnshadow_harden:
>
>* gatekeeper sent us to primary mode. Since
>* TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>* the runqueue's count of uniniterruptible tasks, we just
>* notice the issue and gracefully fail; the caller will have
>* to process this signal anyway.
>*/
>
> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
> business of xnshadow_harden?
>

 TASK_UNINTERRUPTIBLE is not available without patching the kernel's
 scheduler for the reason mentioned in the comment (the scheduler becomes
 confused and may pick the wrong tasks, IIRC).
>>>
>>> Doesn't using down/up in the taskexit event handler risk causing the
>>> same issue?
>>
>> Yes, and that means the first patch is incomplete without something like
>> the second.
>>
>>>

 But I would refrain from trying to "improve" the gatekeeper design. I've
 recently mentioned this to Philippe offlist: For Xenomai 3 with some
 ipipe v3, we must rather patch schedule() to enable zero-switch domain
 migration. Means: enter the scheduler, let it suspend current and pick
 another task, but then simply escalate to the RT domain before doing any
 context switch. That's much cheaper than the current design and
 hopefully also less error-prone.
>>>
>>> So, do you want me to merge your for-upstream branch?
>>
>> You may merge up to for-upstream^, ie. without any gatekeeper fixes.
>>
>> I strongly suspect that there are still more races in the migration
>> path. The crashes we face even with all patches applied may be related
>> to a shadow task being executed under Linux and Xenomai at the same time.

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-15 Thread Gilles Chanteperdrix
On 07/14/2011 10:57 PM, Jan Kiszka wrote:
> On 2011-07-13 21:12, Gilles Chanteperdrix wrote:
>> On 07/13/2011 09:04 PM, Jan Kiszka wrote:
>>> On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:43 PM, Jan Kiszka wrote:
> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
 On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>   xnlock_put_irqrestore(&nklock, s);
>   xnpod_schedule();
>   }
> @@ -1036,6 +1043,7 @@ redo:
>* to process this signal anyway.
>*/
>   if (rthal_current_domain == rthal_root_domain) {
> + XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
> XNATOMIC));

 Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>>
>>> Nope, I forgot to remove that line.
>>>

>   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>   || this_task->state != TASK_RUNNING))
>   xnpod_fatal
> @@ -1044,6 +1052,8 @@ redo:
>   return -ERESTARTSYS;
>   }
>  
> + xnthread_clear_info(thread, XNATOMIC);

 Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
 place at the point it currently is.
>>>
>>> Nope. Now we either clear XNATOMIC after successful migration or when
>>> the signal is about to be sent (ie. in the hook). That way we can test
>>> more reliably (TM) in the gatekeeper if the thread can be migrated.
>>
>> Ok for adding the XNATOMIC test, because it improves the robustness, but
>> why change the way XNATOMIC is set and cleared? Chances of breaking
>> things while changing code in this area are really high...
>
> The current code is (most probably) broken as it does not properly
> synchronize the gatekeeper against a signaled and "runaway" target
> Linux task.
>
> We need an indication whether a Linux signal will wake up (or has already
> woken up) the to-be-migrated task. That task may have continued over its context,
> potentially on a different CPU. Providing this indication is the purpose
> of changing where XNATOMIC is cleared.

 What about synchronizing with the gatekeeper with a semaphore, as done
 in the first patch you sent, but doing it in xnshadow_harden, as soon as
 we detect that we are not back from schedule in primary mode? It seems
 it would avoid any further issue, as we would then be guaranteed that
 the thread could not switch to TASK_INTERRUPTIBLE again before the
 gatekeeper is finished.
>>>
>>> The problem is that the gatekeeper tests the task state without holding
>>> the task's rq lock (which is not available to us without a kernel
>>> patch). That cannot work reliably as long as we accept signals. That's
>>> why I'm trying to move state change and test under nklock.
>>>

 What worries me is the comment in xnshadow_harden:

 * gatekeeper sent us to primary mode. Since
 * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
 * the runqueue's count of uniniterruptible tasks, we just
 * notice the issue and gracefully fail; the caller will have
 * to process this signal anyway.
 */

 Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
 point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
 business of xnshadow_harden?

>>>
>>> TASK_UNINTERRUPTIBLE is not available without patching the kernel's
>>> scheduler for the reason mentioned in the comment (the scheduler becomes
>>> confused and may pick the wrong tasks, IIRC).
>>
>> Doesn't using down/up in the taskexit event handler risk causing the
>> same issue?
> 
> Yes, and that means the first patch is incomplete without something like
> the second.
> 
>>
>>>
>>> But I would refrain from trying to "improve" the gatekeeper design. I've
>>> recently mentioned this to Philippe offlist: For Xenomai 3 with some
>>> ipipe v3, we must rather patch schedule() to enable zero-switch domain
>>> migration. Means: enter the scheduler, let it suspend current and pick
>>> another task, but then simply escalate to the RT domain before doing any
>>> context switch. That's much cheaper than the current design and
>>> hopefully also less error-prone.
>>
>> So, do you want me to merge your for-upstream branch?
> 
> You may merge up to for-upstream^, ie. without any gatekeeper fixes.
> 
> I strongly suspect that there are still more races in the migration
> path. The crashes we face even with all patches applied may be related
> to a shadow task being executed under Linux and Xenomai at the same time.

Maybe we could try the following patch instead?

diff --gi

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-14 Thread Jan Kiszka
On 2011-07-13 21:12, Gilles Chanteperdrix wrote:
> On 07/13/2011 09:04 PM, Jan Kiszka wrote:
>> On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
 On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
xnlock_put_irqrestore(&nklock, s);
xnpod_schedule();
}
 @@ -1036,6 +1043,7 @@ redo:
 * to process this signal anyway.
 */
if (rthal_current_domain == rthal_root_domain) {
 +  XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
 XNATOMIC));
>>>
>>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>
>> Nope, I forgot to remove that line.
>>
>>>
if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
|| this_task->state != TASK_RUNNING))
xnpod_fatal
 @@ -1044,6 +1052,8 @@ redo:
return -ERESTARTSYS;
}
  
 +  xnthread_clear_info(thread, XNATOMIC);
>>>
>>> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>>> place at the point it currently is.
>>
>> Nope. Now we either clear XNATOMIC after successful migration or when
>> the signal is about to be sent (ie. in the hook). That way we can test
>> more reliably (TM) in the gatekeeper if the thread can be migrated.
>
> Ok for adding the XNATOMIC test, because it improves the robustness, but
> why change the way XNATOMIC is set and cleared? Chances of breaking
> things while changing code in this area are really high...

 The current code is (most probably) broken as it does not properly
 synchronize the gatekeeper against a signaled and "runaway" target
 Linux task.

 We need an indication whether a Linux signal will wake up (or has already
 woken up) the to-be-migrated task. That task may have continued over its context,
 potentially on a different CPU. Providing this indication is the purpose
 of changing where XNATOMIC is cleared.
>>>
>>> What about synchronizing with the gatekeeper with a semaphore, as done
>>> in the first patch you sent, but doing it in xnshadow_harden, as soon as
>>> we detect that we are not back from schedule in primary mode? It seems
>>> it would avoid any further issue, as we would then be guaranteed that
>>> the thread could not switch to TASK_INTERRUPTIBLE again before the
>>> gatekeeper is finished.
>>
>> The problem is that the gatekeeper tests the task state without holding
>> the task's rq lock (which is not available to us without a kernel
>> patch). That cannot work reliably as long as we accept signals. That's
>> why I'm trying to move state change and test under nklock.
>>
>>>
>>> What worries me is the comment in xnshadow_harden:
>>>
>>>  * gatekeeper sent us to primary mode. Since
>>>  * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>>>  * the runqueue's count of uniniterruptible tasks, we just
>>>  * notice the issue and gracefully fail; the caller will have
>>>  * to process this signal anyway.
>>>  */
>>>
>>> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
>>> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
>>> business of xnshadow_harden?
>>>
>>
>> TASK_UNINTERRUPTIBLE is not available without patching the kernel's
>> scheduler for the reason mentioned in the comment (the scheduler becomes
>> confused and may pick the wrong tasks, IIRC).
> 
> Doesn't using down/up in the taskexit event handler risk causing the
> same issue?

Yes, and that means the first patch is incomplete without something like
the second.

> 
>>
>> But I would refrain from trying to "improve" the gatekeeper design. I've
>> recently mentioned this to Philippe offlist: For Xenomai 3 with some
>> ipipe v3, we must rather patch schedule() to enable zero-switch domain
>> migration. Means: enter the scheduler, let it suspend current and pick
>> another task, but then simply escalate to the RT domain before doing any
>> context switch. That's much cheaper than the current design and
>> hopefully also less error-prone.
> 
> So, do you want me to merge your for-upstream branch?

You may merge up to for-upstream^, ie. without any gatekeeper fixes.

I strongly suspect that there are still more races in the migration
path. The crashes we face even with all patches applied may be related
to a shadow task being executed under Linux and Xenomai at the same time.

Jan




Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Philippe Gerum
On Wed, 2011-07-13 at 20:39 +0200, Gilles Chanteperdrix wrote:
> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
> > On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
> >> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
> >>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>  On 07/12/2011 02:57 PM, Jan Kiszka wrote:
> > xnlock_put_irqrestore(&nklock, s);
> > xnpod_schedule();
> > }
> > @@ -1036,6 +1043,7 @@ redo:
> >  * to process this signal anyway.
> >  */
> > if (rthal_current_domain == rthal_root_domain) {
> > +   XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
> > XNATOMIC));
> 
>  Misleading dead code again, XNATOMIC is cleared not ten lines above.
> >>>
> >>> Nope, I forgot to remove that line.
> >>>
> 
> > if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
> > || this_task->state != TASK_RUNNING))
> > xnpod_fatal
> > @@ -1044,6 +1052,8 @@ redo:
> > return -ERESTARTSYS;
> > }
> >  
> > +   xnthread_clear_info(thread, XNATOMIC);
> 
>  Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>  place at the point it currently is.
> >>>
> >>> Nope. Now we either clear XNATOMIC after successful migration or when
> >>> the signal is about to be sent (ie. in the hook). That way we can test
> >>> more reliably (TM) in the gatekeeper if the thread can be migrated.
> >>
> >> Ok for adding the XNATOMIC test, because it improves the robustness, but
> >> why change the way XNATOMIC is set and cleared? Chances of breaking
> >> things while changing code in this area are really high...
> > 
> > The current code is (most probably) broken as it does not properly
> > synchronize the gatekeeper against a signaled and "runaway" target
> > Linux task.
> > 
> > We need an indication whether a Linux signal will wake up (or has already
> > woken up) the to-be-migrated task. That task may have continued over its context,
> > potentially on a different CPU. Providing this indication is the purpose
> > of changing where XNATOMIC is cleared.
> 
> What about synchronizing with the gatekeeper with a semaphore, as done
> in the first patch you sent, but doing it in xnshadow_harden, as soon as
> we detect that we are not back from schedule in primary mode? It seems
> it would avoid any further issue, as we would then be guaranteed that
> the thread could not switch to TASK_INTERRUPTIBLE again before the
> gatekeeper is finished.
> 
> What worries me is the comment in xnshadow_harden:
> 
>* gatekeeper sent us to primary mode. Since
>* TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>* the runqueue's count of uniniterruptible tasks, we just
>* notice the issue and gracefully fail; the caller will have
>* to process this signal anyway.
>*/
> 
> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
> business of xnshadow_harden?

Second interpretation is correct.

> 

-- 
Philippe.





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Gilles Chanteperdrix
On 07/13/2011 09:04 PM, Jan Kiszka wrote:
> On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
>> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
>>> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
 On 07/12/2011 07:34 PM, Jan Kiszka wrote:
> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>> xnlock_put_irqrestore(&nklock, s);
>>> xnpod_schedule();
>>> }
>>> @@ -1036,6 +1043,7 @@ redo:
>>>  * to process this signal anyway.
>>>  */
>>> if (rthal_current_domain == rthal_root_domain) {
>>> +   XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
>>> XNATOMIC));
>>
>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
>
> Nope, I forgot to remove that line.
>
>>
>>> if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>> || this_task->state != TASK_RUNNING))
>>> xnpod_fatal
>>> @@ -1044,6 +1052,8 @@ redo:
>>> return -ERESTARTSYS;
>>> }
>>>  
>>> +   xnthread_clear_info(thread, XNATOMIC);
>>
>> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>> place at the point it currently is.
>
> Nope. Now we either clear XNATOMIC after successful migration or when
> the signal is about to be sent (ie. in the hook). That way we can test
> more reliably (TM) in the gatekeeper if the thread can be migrated.

 Ok for adding the XNATOMIC test, because it improves the robustness, but
 why change the way XNATOMIC is set and cleared? Chances of breaking
 things while changing code in this area are really high...
>>>
>>> The current code is (most probably) broken as it does not properly
>>> synchronize the gatekeeper against a signaled and "runaway" target
>>> Linux task.
>>>
>>> We need an indication whether a Linux signal will wake up (or has already
>>> woken up) the to-be-migrated task. That task may have continued over its context,
>>> potentially on a different CPU. Providing this indication is the purpose
>>> of changing where XNATOMIC is cleared.
>>
>> What about synchronizing with the gatekeeper with a semaphore, as done
>> in the first patch you sent, but doing it in xnshadow_harden, as soon as
>> we detect that we are not back from schedule in primary mode? It seems
>> it would avoid any further issue, as we would then be guaranteed that
>> the thread could not switch to TASK_INTERRUPTIBLE again before the
>> gatekeeper is finished.
> 
> The problem is that the gatekeeper tests the task state without holding
> the task's rq lock (which is not available to us without a kernel
> patch). That cannot work reliably as long as we accept signals. That's
> why I'm trying to move state change and test under nklock.
> 
>>
>> What worries me is the comment in xnshadow_harden:
>>
>>   * gatekeeper sent us to primary mode. Since
>>   * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>>   * the runqueue's count of uniniterruptible tasks, we just
>>   * notice the issue and gracefully fail; the caller will have
>>   * to process this signal anyway.
>>   */
>>
>> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
>> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
>> business of xnshadow_harden?
>>
> 
> TASK_UNINTERRUPTIBLE is not available without patching the kernel's
> scheduler for the reason mentioned in the comment (the scheduler becomes
> confused and may pick the wrong tasks, IIRC).

Doesn't using down/up in the taskexit event handler risk causing the
same issue?

> 
> But I would refrain from trying to "improve" the gatekeeper design. I've
> recently mentioned this to Philippe offlist: For Xenomai 3 with some
> ipipe v3, we must rather patch schedule() to enable zero-switch domain
> migration. Means: enter the scheduler, let it suspend current and pick
> another task, but then simply escalate to the RT domain before doing any
> context switch. That's much cheaper than the current design and
> hopefully also less error-prone.

So, do you want me to merge your for-upstream branch?
-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Jan Kiszka
On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
>> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
 On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>  xnlock_put_irqrestore(&nklock, s);
>>  xnpod_schedule();
>>  }
>> @@ -1036,6 +1043,7 @@ redo:
>>   * to process this signal anyway.
>>   */
>>  if (rthal_current_domain == rthal_root_domain) {
>> +XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
>> XNATOMIC));
>
> Misleading dead code again, XNATOMIC is cleared not ten lines above.

 Nope, I forgot to remove that line.

>
>>  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>  || this_task->state != TASK_RUNNING))
>>  xnpod_fatal
>> @@ -1044,6 +1052,8 @@ redo:
>>  return -ERESTARTSYS;
>>  }
>>  
>> +xnthread_clear_info(thread, XNATOMIC);
>
> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
> place at the point it currently is.

 Nope. Now we either clear XNATOMIC after successful migration or when
 the signal is about to be sent (ie. in the hook). That way we can test
 more reliably (TM) in the gatekeeper if the thread can be migrated.
>>>
>>> Ok for adding the XNATOMIC test, because it improves the robustness, but
>>> why change the way XNATOMIC is set and cleared? Chances of breaking
>>> things while changing code in this area are really high...
>>
>> The current code is (most probably) broken as it does not properly
>> synchronize the gatekeeper against a signaled and "runaway" target
>> Linux task.
>>
>> We need an indication whether a Linux signal will wake up (or has already
>> woken up) the to-be-migrated task. That task may have continued over its context,
>> potentially on a different CPU. Providing this indication is the purpose
>> of changing where XNATOMIC is cleared.
> 
> What about synchronizing with the gatekeeper with a semaphore, as done
> in the first patch you sent, but doing it in xnshadow_harden, as soon as
> we detect that we are not back from schedule in primary mode? It seems
> it would avoid any further issue, as we would then be guaranteed that
> the thread could not switch to TASK_INTERRUPTIBLE again before the
> gatekeeper is finished.

The problem is that the gatekeeper tests the task state without holding
the task's rq lock (which is not available to us without a kernel
patch). That cannot work reliably as long as we accept signals. That's
why I'm trying to move state change and test under nklock.
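
The locking idea can be modelled in userspace without assuming anything about
the nucleus internals; nklock_model and atomic_flag below are placeholders, not
Xenomai names. Both paths take the same lock, and the gatekeeper only resumes
the shadow if the migration flag survived:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t nklock_model = PTHREAD_MUTEX_INITIALIZER;
static bool atomic_flag = true;	/* set when the task asked to be hardened */

/* Signal path (do_sigwake_event in the patch): a wakeup is about to happen. */
static void *signal_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&nklock_model);
	atomic_flag = false;	/* the migration request is void from here on */
	pthread_mutex_unlock(&nklock_model);
	return NULL;
}

/* Gatekeeper path: resume the shadow only if no signal won the race. */
static void *gatekeeper(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&nklock_model);
	if (atomic_flag)
		printf("resume shadow in primary mode\n");
	else
		printf("signal raced us, leave the task to Linux\n");
	pthread_mutex_unlock(&nklock_model);
	return NULL;
}

int main(void)
{
	pthread_t s, g;

	pthread_create(&s, NULL, signal_path, NULL);
	pthread_create(&g, NULL, gatekeeper, NULL);
	pthread_join(s, NULL);
	pthread_join(g, NULL);
	return 0;
}

Whichever side takes the lock first wins, and the loser sees a consistent flag.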

> 
> What worries me is the comment in xnshadow_harden:
> 
>* gatekeeper sent us to primary mode. Since
>* TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>* the runqueue's count of uniniterruptible tasks, we just
>* notice the issue and gracefully fail; the caller will have
>* to process this signal anyway.
>*/
> 
> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
> business of xnshadow_harden?
> 

TASK_UNINTERRUPTIBLE is not available without patching the kernel's
scheduler for the reason mentioned in the comment (the scheduler becomes
confused and may pick the wrong tasks, IIRC).

But I would refrain from trying to "improve" the gatekeeper design. I've
recently mentioned this to Philippe offlist: For Xenomai 3 with some
ipipe v3, we must rather patch schedule() to enable zero-switch domain
migration. Means: enter the scheduler, let it suspend current and pick
another task, but then simply escalate to the RT domain before doing any
context switch. That's much cheaper than the current design and
hopefully also less error-prone.
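
As a control-flow skeleton only, with every name invented for the example
(nothing below is an existing ipipe or Xenomai hook), the proposal amounts to:

#include <stdio.h>

struct task { const char *name; int wants_hardening; };

/* Hypothetical hand-off: make the task runnable under the RT scheduler. */
static void rt_escalate(struct task *t)
{
	printf("%s: escalated to the RT domain before any root switch\n", t->name);
}

static void root_context_switch(struct task *next)
{
	printf("root domain context switch to %s\n", next->name);
}

/*
 * Conceptual tail of a patched schedule(): Linux has already suspended
 * prev and picked next, but a hardening prev is handed to the RT domain
 * before any context switch, so it never runs under both schedulers.
 */
static void schedule_tail_sketch(struct task *prev, struct task *next)
{
	if (prev->wants_hardening) {
		rt_escalate(prev);
		return;	/* the one and only switch then happens under the RT domain */
	}
	root_context_switch(next);
}

int main(void)
{
	struct task shadow = { "shadow", 1 }, other = { "other", 0 };

	schedule_tail_sketch(&shadow, &other);
	return 0;
}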

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-13 Thread Gilles Chanteperdrix
On 07/12/2011 07:43 PM, Jan Kiszka wrote:
> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
 On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>   xnlock_put_irqrestore(&nklock, s);
>   xnpod_schedule();
>   }
> @@ -1036,6 +1043,7 @@ redo:
>* to process this signal anyway.
>*/
>   if (rthal_current_domain == rthal_root_domain) {
> + XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));

 Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>>
>>> Nope, I forgot to remove that line.
>>>

>   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>   || this_task->state != TASK_RUNNING))
>   xnpod_fatal
> @@ -1044,6 +1052,8 @@ redo:
>   return -ERESTARTSYS;
>   }
>  
> + xnthread_clear_info(thread, XNATOMIC);

 Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
 place at the point it currently is.
>>>
>>> Nope. Now we either clear XNATOMIC after successful migration or when
>>> the signal is about to be sent (ie. in the hook). That way we can test
>>> more reliably (TM) in the gatekeeper if the thread can be migrated.
>>
>> Ok for adding the XNATOMIC test, because it improves the robustness, but
>> why change the way XNATOMIC is set and cleared? Chances of breaking
>> things while changing code in this area are really high...
> 
> The current code is (most probably) broken as it does not properly
> synchronize the gatekeeper against a signaled and "runaway" target
> Linux task.
> 
> We need an indication whether a Linux signal will wake up (or has already
> woken up) the to-be-migrated task. That task may have continued over its context,
> potentially on a different CPU. Providing this indication is the purpose
> of changing where XNATOMIC is cleared.

What about synchronizing with the gatekeeper with a semaphore, as done
in the first patch you sent, but doing it in xnshadow_harden, as soon as
we detect that we are not back from schedule in primary mode? It seems
it would avoid any further issue, as we would then be guaranteed that
the thread could not switch to TASK_INTERRUPTIBLE again before the
gatekeeper is finished.

What worries me is the comment in xnshadow_harden:

 * gatekeeper sent us to primary mode. Since
 * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
 * the runqueue's count of uniniterruptible tasks, we just
 * notice the issue and gracefully fail; the caller will have
 * to process this signal anyway.
 */

Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
business of xnshadow_harden?

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
xnlock_put_irqrestore(&nklock, s);
xnpod_schedule();
}
 @@ -1036,6 +1043,7 @@ redo:
 * to process this signal anyway.
 */
if (rthal_current_domain == rthal_root_domain) {
 +  XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
>>>
>>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>
>> Nope, I forgot to remove that line.
>>
>>>
if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
|| this_task->state != TASK_RUNNING))
xnpod_fatal
 @@ -1044,6 +1052,8 @@ redo:
return -ERESTARTSYS;
}
  
 +  xnthread_clear_info(thread, XNATOMIC);
>>>
>>> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>>> place at the point it currently is.
>>
>> Nope. Now we either clear XNATOMIC after successful migration or when
>> the signal is about to be sent (ie. in the hook). That way we can test
>> more reliably (TM) in the gatekeeper if the thread can be migrated.
> 
> Ok for adding the XNATOMIC test, because it improves the robustness, but
> why change the way XNATOMIC is set and cleared? Chances of breaking
> things while changing code in this area are really high...

The current code is (most probably) broken as it does not properly
synchronize the gatekeeper against a signaled and "runaway" target
Linux task.

We need an indication whether a Linux signal will wake up (or has already
woken up) the to-be-migrated task. That task may have continued over its context,
potentially on a different CPU. Providing this indication is the purpose
of changing where XNATOMIC is cleared.

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 07:34 PM, Jan Kiszka wrote:
> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>> xnlock_put_irqrestore(&nklock, s);
>>> xnpod_schedule();
>>> }
>>> @@ -1036,6 +1043,7 @@ redo:
>>>  * to process this signal anyway.
>>>  */
>>> if (rthal_current_domain == rthal_root_domain) {
>>> +   XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
>>
>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
> 
> Nope, I forgot to remove that line.
> 
>>
>>> if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>> || this_task->state != TASK_RUNNING))
>>> xnpod_fatal
>>> @@ -1044,6 +1052,8 @@ redo:
>>> return -ERESTARTSYS;
>>> }
>>>  
>>> +   xnthread_clear_info(thread, XNATOMIC);
>>
>> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>> place at the point it currently is.
> 
> Nope. Now we either clear XNATOMIC after successful migration or when
> the signal is about to be sent (ie. in the hook). That way we can test
> more reliably (TM) in the gatekeeper if the thread can be migrated.

Ok for adding the XNATOMIC test, because it improves the robustness, but
why change the way XNATOMIC is set and cleared? Chances of breaking
things while changing code in this area are really high...

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>  xnlock_put_irqrestore(&nklock, s);
>>  xnpod_schedule();
>>  }
>> @@ -1036,6 +1043,7 @@ redo:
>>   * to process this signal anyway.
>>   */
>>  if (rthal_current_domain == rthal_root_domain) {
>> +XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
> 
> Misleading dead code again, XNATOMIC is cleared not ten lines above.

Nope, I forgot to remove that line.

> 
>>  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>  || this_task->state != TASK_RUNNING))
>>  xnpod_fatal
>> @@ -1044,6 +1052,8 @@ redo:
>>  return -ERESTARTSYS;
>>  }
>>  
>> +xnthread_clear_info(thread, XNATOMIC);
> 
> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
> place at the point it currently is.

Nope. Now we either clear XNATOMIC after successful migration or when
the signal is about to be sent (ie. in the hook). That way we can test
more reliably (TM) in the gatekeeper if the thread can be migrated.

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>   xnlock_put_irqrestore(&nklock, s);
>   xnpod_schedule();
>   }
> @@ -1036,6 +1043,7 @@ redo:
>* to process this signal anyway.
>*/
>   if (rthal_current_domain == rthal_root_domain) {
> + XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));

Misleading dead code again, XNATOMIC is cleared not ten lines above.

>   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>   || this_task->state != TASK_RUNNING))
>   xnpod_fatal
> @@ -1044,6 +1052,8 @@ redo:
>   return -ERESTARTSYS;
>   }
>  
> + xnthread_clear_info(thread, XNATOMIC);

Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
place at the point it currently is.

> +
>   /* "current" is now running into the Xenomai domain. */
>   thread->gksched = NULL;
>   sched = xnsched_finish_unlocked_switch(thread->sched);
> @@ -2650,6 +2660,8 @@ static inline void do_sigwake_event(struct task_struct 
> *p)
>  
>   xnlock_get_irqsave(&nklock, s);
>  
> + xnthread_clear_info(thread, XNATOMIC);
> +

Ditto.

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 17:48, Philippe Gerum wrote:
> On Tue, 2011-07-12 at 14:57 +0200, Jan Kiszka wrote:
>> On 2011-07-12 14:13, Jan Kiszka wrote:
>>> On 2011-07-12 14:06, Gilles Chanteperdrix wrote:
 On 07/12/2011 01:58 PM, Jan Kiszka wrote:
> On 2011-07-12 13:56, Jan Kiszka wrote:
>> However, this parallel unsynchronized execution of the gatekeeper and
>> its target thread leaves an increasingly bad feeling on my side. Did we
>> really catch all corner cases now? I wouldn't guarantee that yet.
>> Specifically as I still have an obscure crash of a Xenomai thread on
>> Linux schedule() on my table.
>>
>> What if the target thread woke up due to a signal, continued much
>> further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
>> gatekeeper continued? I wish we could already eliminate this complexity
>> and do the migration directly inside schedule()...
>
> BTW, why do we mask out TASK_ATOMICSWITCH when checking the task state in
> the gatekeeper? What would happen if we included it (state ==
> (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?

 I would tend to think that what we should check is
 xnthread_test_info(XNATOMIC). Or maybe check both, the interruptible
 state and the XNATOMIC info bit.
>>>
>>> Actually, neither the info bits nor the task state is sufficiently
>>> synchronized against the gatekeeper yet. We need to hold a shared lock
>>> when testing and resetting the state. I'm not sure yet if that is
>>> fixable given the gatekeeper architecture.
>>>
>>
>> This may work (on top of the exit-race fix):
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 50dcf43..90feb16 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -913,20 +913,27 @@ static int gatekeeper_thread(void *data)
>>  if ((xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH) == 
>> TASK_INTERRUPTIBLE) {
>>  rpi_pop(target);
>>  xnlock_get_irqsave(&nklock, s);
>> -#ifdef CONFIG_SMP
>> +
>>  /*
>> - * If the task changed its CPU while in
>> - * secondary mode, change the CPU of the
>> - * underlying Xenomai shadow too. We do not
>> - * migrate the thread timers here, it would
>> - * not work. For a "full" migration comprising
>> - * timers, using xnpod_migrate_thread is
>> - * required.
>> + * Recheck XNATOMIC to avoid waking the shadow if the
>> + * Linux task received a signal meanwhile.
>>   */
>> -if (target->sched != sched)
>> -xnsched_migrate_passive(target, sched);
>> +if (xnthread_test_info(target, XNATOMIC)) {
>> +#ifdef CONFIG_SMP
>> +/*
>> + * If the task changed its CPU while in
>> + * secondary mode, change the CPU of the
>> + * underlying Xenomai shadow too. We do not
>> + * migrate the thread timers here, it would
>> + * not work. For a "full" migration comprising
>> + * timers, using xnpod_migrate_thread is
>> + * required.
>> + */
>> +if (target->sched != sched)
>> +xnsched_migrate_passive(target, sched);
>>  #endif /* CONFIG_SMP */
>> -xnpod_resume_thread(target, XNRELAX);
>> +xnpod_resume_thread(target, XNRELAX);
>> +}
>>  xnlock_put_irqrestore(&nklock, s);
>>  xnpod_schedule();
>>  }
>> @@ -1036,6 +1043,7 @@ redo:
>>   * to process this signal anyway.
>>   */
>>  if (rthal_current_domain == rthal_root_domain) {
>> +XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
>>  if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>  || this_task->state != TASK_RUNNING))
>>  xnpod_fatal
>> @@ -1044,6 +1052,8 @@ redo:
>>  return -ERESTARTSYS;
>>  }
>>  
>> +xnthread_clear_info(thread, XNATOMIC);
>> +
>>  /* "current" is now running into the Xenomai domain. */
>>  thread->gksched = NULL;
>>  sched = xnsched_finish_unlocked_switch(thread->sched);
>> @@ -2650,6 +2660,8 @@ static inline void do_sigwake_event(struct task_struct 
>> *p)
>>  
>>  xnlock_get_irqsave(&nklock, s);
>>  
>> +xnthread_clear_info(thread, XNATOMIC);
>> +
>>  if ((p->ptrace & PT_PTRACED) && !xnthread_test_state(thread, XNDEBUG)) {
>>  sigset_t pending;
>>  
>>
>> It totally ignores RPI and PREEMPT_RT for now. RPI is broken anyway,

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Philippe Gerum
On Tue, 2011-07-12 at 14:57 +0200, Jan Kiszka wrote:
> On 2011-07-12 14:13, Jan Kiszka wrote:
> > On 2011-07-12 14:06, Gilles Chanteperdrix wrote:
> >> On 07/12/2011 01:58 PM, Jan Kiszka wrote:
> >>> On 2011-07-12 13:56, Jan Kiszka wrote:
>  However, this parallel unsynchronized execution of the gatekeeper and
>  its target thread leaves an increasingly bad feeling on my side. Did we
>  really catch all corner cases now? I wouldn't guarantee that yet.
>  Specifically as I still have an obscure crash of a Xenomai thread on
>  Linux schedule() on my table.
> 
>  What if the target thread woke up due to a signal, continued much
>  further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
>  gatekeeper continued? I wish we could already eliminate this complexity
>  and do the migration directly inside schedule()...
> >>>
> >>> BTW, why do we mask out TASK_ATOMICSWITCH when checking the task state in
> >>> the gatekeeper? What would happen if we included it (state ==
> >>> (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?
> >>
> >> I would tend to think that what we should check is
> >> xnthread_test_info(XNATOMIC). Or maybe check both, the interruptible
> >> state and the XNATOMIC info bit.
> > 
> > Actually, neither the info bits nor the task state is sufficiently
> > synchronized against the gatekeeper yet. We need to hold a shared lock
> > when testing and resetting the state. I'm not sure yet if that is
> > fixable given the gatekeeper architecture.
> > 
> 
> This may work (on top of the exit-race fix):
> 
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 50dcf43..90feb16 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -913,20 +913,27 @@ static int gatekeeper_thread(void *data)
>   if ((xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH) == 
> TASK_INTERRUPTIBLE) {
>   rpi_pop(target);
>   xnlock_get_irqsave(&nklock, s);
> -#ifdef CONFIG_SMP
> +
>   /*
> -  * If the task changed its CPU while in
> -  * secondary mode, change the CPU of the
> -  * underlying Xenomai shadow too. We do not
> -  * migrate the thread timers here, it would
> -  * not work. For a "full" migration comprising
> -  * timers, using xnpod_migrate_thread is
> -  * required.
> +  * Recheck XNATOMIC to avoid waking the shadow if the
> +  * Linux task received a signal meanwhile.
>*/
> - if (target->sched != sched)
> - xnsched_migrate_passive(target, sched);
> + if (xnthread_test_info(target, XNATOMIC)) {
> +#ifdef CONFIG_SMP
> + /*
> +  * If the task changed its CPU while in
> +  * secondary mode, change the CPU of the
> +  * underlying Xenomai shadow too. We do not
> +  * migrate the thread timers here, it would
> +  * not work. For a "full" migration comprising
> +  * timers, using xnpod_migrate_thread is
> +  * required.
> +  */
> + if (target->sched != sched)
> + xnsched_migrate_passive(target, sched);
>  #endif /* CONFIG_SMP */
> - xnpod_resume_thread(target, XNRELAX);
> + xnpod_resume_thread(target, XNRELAX);
> + }
>   xnlock_put_irqrestore(&nklock, s);
>   xnpod_schedule();
>   }
> @@ -1036,6 +1043,7 @@ redo:
>* to process this signal anyway.
>*/
>   if (rthal_current_domain == rthal_root_domain) {
> + XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
>   if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>   || this_task->state != TASK_RUNNING))
>   xnpod_fatal
> @@ -1044,6 +1052,8 @@ redo:
>   return -ERESTARTSYS;
>   }
>  
> + xnthread_clear_info(thread, XNATOMIC);
> +
>   /* "current" is now running into the Xenomai domain. */
>   thread->gksched = NULL;
>   sched = xnsched_finish_unlocked_switch(thread->sched);
> @@ -2650,6 +2660,8 @@ static inline void do_sigwake_event(struct task_struct 
> *p)
>  
>   xnlock_get_irqsave(&nklock, s);
>  
> + xnthread_clear_info(thread, XNATOMIC);
> +
>   if ((p->ptrace & PT_PTRACED) && !xnthread_test_state(thread, XNDEBUG)) {
>   sigset_t pending;
>  
> 
> It totally ignores RPI and PREEMPT_RT for now. RPI is broken anyway,

I want to drop RPI in v3 for sure because it is misleading people

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 14:13, Jan Kiszka wrote:
> On 2011-07-12 14:06, Gilles Chanteperdrix wrote:
>> On 07/12/2011 01:58 PM, Jan Kiszka wrote:
>>> On 2011-07-12 13:56, Jan Kiszka wrote:
 However, this parallel unsynchronized execution of the gatekeeper and
 its target thread leaves an increasingly bad feeling on my side. Did we
 really catch all corner cases now? I wouldn't guarantee that yet.
 Specifically as I still have an obscure crash of a Xenomai thread on
 Linux schedule() on my table.

 What if the target thread woke up due to a signal, continued much
 further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
 gatekeeper continued? I wish we could already eliminate this complexity
 and do the migration directly inside schedule()...
>>>
>>> BTW, we do we mask out TASK_ATOMICSWITCH when checking the task state in
>>> the gatekeeper? What would happen if we included it (state ==
>>> (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?
>>
>> I would tend to think that what we should check is
>> xnthread_test_info(XNATOMIC). Or maybe check both, the interruptible
>> state and the XNATOMIC info bit.
> 
> Actually, neither the info bits nor the task state is sufficiently
> synchronized against the gatekeeper yet. We need to hold a shared lock
> when testing and resetting the state. I'm not sure yet if that is
> fixable given the gatekeeper architecture.
> 

This may work (on top of the exit-race fix):

diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 50dcf43..90feb16 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -913,20 +913,27 @@ static int gatekeeper_thread(void *data)
if ((xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH) == 
TASK_INTERRUPTIBLE) {
rpi_pop(target);
xnlock_get_irqsave(&nklock, s);
-#ifdef CONFIG_SMP
+
/*
-* If the task changed its CPU while in
-* secondary mode, change the CPU of the
-* underlying Xenomai shadow too. We do not
-* migrate the thread timers here, it would
-* not work. For a "full" migration comprising
-* timers, using xnpod_migrate_thread is
-* required.
+* Recheck XNATOMIC to avoid waking the shadow if the
+* Linux task received a signal meanwhile.
 */
-   if (target->sched != sched)
-   xnsched_migrate_passive(target, sched);
+   if (xnthread_test_info(target, XNATOMIC)) {
+#ifdef CONFIG_SMP
+   /*
+* If the task changed its CPU while in
+* secondary mode, change the CPU of the
+* underlying Xenomai shadow too. We do not
+* migrate the thread timers here, it would
+* not work. For a "full" migration comprising
+* timers, using xnpod_migrate_thread is
+* required.
+*/
+   if (target->sched != sched)
+   xnsched_migrate_passive(target, sched);
 #endif /* CONFIG_SMP */
-   xnpod_resume_thread(target, XNRELAX);
+   xnpod_resume_thread(target, XNRELAX);
+   }
xnlock_put_irqrestore(&nklock, s);
xnpod_schedule();
}
@@ -1036,6 +1043,7 @@ redo:
 * to process this signal anyway.
 */
if (rthal_current_domain == rthal_root_domain) {
+   XENO_BUGON(NUCLEUS, xnthread_test_info(thread, XNATOMIC));
if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
|| this_task->state != TASK_RUNNING))
xnpod_fatal
@@ -1044,6 +1052,8 @@ redo:
return -ERESTARTSYS;
}
 
+   xnthread_clear_info(thread, XNATOMIC);
+
/* "current" is now running into the Xenomai domain. */
thread->gksched = NULL;
sched = xnsched_finish_unlocked_switch(thread->sched);
@@ -2650,6 +2660,8 @@ static inline void do_sigwake_event(struct task_struct *p)
 
xnlock_get_irqsave(&nklock, s);
 
+   xnthread_clear_info(thread, XNATOMIC);
+
if ((p->ptrace & PT_PTRACED) && !xnthread_test_state(thread, XNDEBUG)) {
sigset_t pending;
 

It totally ignores RPI and PREEMPT_RT for now. RPI is broken anyway,
ripping it out would allow to use solely XNATOMIC as condition in the
gatekeeper.

/me is now looking to get this applied to a testbox that still crashes
suspiciously.
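
For reference, with the above applied the gatekeeper's wake-up path boils
down to something like this (a simplified sketch; RPI handling and the SMP
migration branch are left out, all names as in the patch):

    /* gatekeeper side, after being kicked for a hardening request */
    if ((xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH)
        == TASK_INTERRUPTIBLE) {
            xnlock_get_irqsave(&nklock, s);
            /*
             * do_sigwake_event clears XNATOMIC when a Linux signal
             * hits the task, so only resume the shadow if the bit is
             * still set, i.e. the migration is still wanted.
             */
            if (xnthread_test_info(target, XNATOMIC))
                    xnpod_resume_thread(target, XNRELAX);
            xnlock_put_irqrestore(&nklock, s);
            xnpod_schedule();
    }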

Jan




Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 14:06, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:58 PM, Jan Kiszka wrote:
>> On 2011-07-12 13:56, Jan Kiszka wrote:
>>> However, this parallel unsynchronized execution of the gatekeeper and
>>> its target thread leaves an increasingly bad feeling on my side. Did we
>>> really catch all corner cases now? I wouldn't guarantee that yet.
>>> Specifically as I still have an obscure crash of a Xenomai thread on
>>> Linux schedule() on my table.
>>>
>>> What if the target thread woke up due to a signal, continued much
>>> further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
>>> gatekeeper continued? I wish we could already eliminate this complexity
>>> and do the migration directly inside schedule()...
>>
>> BTW, we do we mask out TASK_ATOMICSWITCH when checking the task state in
>> the gatekeeper? What would happen if we included it (state ==
>> (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?
> 
> I would tend to think that what we should check is
> xnthread_test_info(XNATOMIC). Or maybe check both, the interruptible
> state and the XNATOMIC info bit.

Actually, neither the info bits nor the task state is sufficiently
synchronized against the gatekeeper yet. We need to hold a shared lock
when testing and resetting the state. I'm not sure yet if that is
fixable given the gatekeeper architecture.

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 01:58 PM, Jan Kiszka wrote:
> On 2011-07-12 13:56, Jan Kiszka wrote:
>> However, this parallel unsynchronized execution of the gatekeeper and
>> its target thread leaves an increasingly bad feeling on my side. Did we
>> really catch all corner cases now? I wouldn't guarantee that yet.
>> Specifically as I still have an obscure crash of a Xenomai thread on
>> Linux schedule() on my table.
>>
>> What if the target thread woke up due to a signal, continued much
>> further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
>> gatekeeper continued? I wish we could already eliminate this complexity
>> and do the migration directly inside schedule()...
> 
> BTW, we do we mask out TASK_ATOMICSWITCH when checking the task state in
> the gatekeeper? What would happen if we included it (state ==
> (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?

I would tend to think that what we should check is
xnthread_test_info(XNATOMIC). Or maybe check both, the interruptible
state and the XNATOMIC info bit.
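
Spelled out, checking both would give the gatekeeper something like this
(just a sketch):

    if (xnthread_test_info(target, XNATOMIC) &&
        (xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH)
        == TASK_INTERRUPTIBLE)
            xnpod_resume_thread(target, XNRELAX);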

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 13:56, Jan Kiszka wrote:
> However, this parallel unsynchronized execution of the gatekeeper and
> its target thread leaves an increasingly bad feeling on my side. Did we
> really catch all corner cases now? I wouldn't guarantee that yet.
> Specifically as I still have an obscure crash of a Xenomai thread on
> Linux schedule() on my table.
> 
> What if the target thread woke up due to a signal, continued much
> further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
> gatekeeper continued? I wish we could already eliminate this complexity
> and do the migration directly inside schedule()...

BTW, why do we mask out TASK_ATOMICSWITCH when checking the task state in
the gatekeeper? What would happen if we included it (state ==
(TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))?
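
Spelled out, the two variants are (sketch only, state bits as set up by
xnshadow_harden()):

    /* current check: require TASK_INTERRUPTIBLE, ignore whether
       TASK_ATOMICSWITCH is still set */
    if ((xnthread_user_task(target)->state & ~TASK_ATOMICSWITCH)
        == TASK_INTERRUPTIBLE)
            /* proceed with the hardening */;

    /* stricter variant: require both bits, i.e. the task is still parked
       exactly where xnshadow_harden() left it */
    if (xnthread_user_task(target)->state
        == (TASK_ATOMICSWITCH | TASK_INTERRUPTIBLE))
            /* proceed with the hardening */;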

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 13:41, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:29 PM, Jan Kiszka wrote:
>>> I find all this complicated for a very small corner-case, so, I keep
>>> looking for a simpler solution. Let us try something else.
>>
>> It's rather the contrary: this solution is straightforward IMHO.
>>
>>>
>>> If the thread is woken up for whatever reason before the gatekeeper,
>>> then we will hit the following "if" in xnshadow_harden, can we set the
>>> target to NULL at this point if it is the current thread? With a cmpxchg
>>> perhaps?
>>
>> Gatekeeper and target task aren't synchronized, are they?
> 
> That is another reason why the solution to synchronize using the
> semaphore may not work. If the thread hits the call to down before the

"thread" means do_taskexit_event here?

> gatekeeper is woken up, then the gatekeeper test may see the target
> thread as suspended, and we may return from the call to "down" in
> xenomai domain.

Good point, but fortunately not possible: If we block on down() in
do_taskexit_event, we enter TASK_UNINTERRUPTIBLE state. But the
gatekeeper checks for TASK_INTERRUPTIBLE.

However, this parallel unsynchronized execution of the gatekeeper and
its target thread leaves an increasingly bad feeling on my side. Did we
really catch all corner cases now? I wouldn't guarantee that yet.
Specifically as I still have an obscure crash of a Xenomai thread on
Linux schedule() on my table.

What if the target thread woke up due to a signal, continued much
further on a different CPU, blocked in TASK_INTERRUPTIBLE, and then the
gatekeeper continued? I wish we could already eliminate this complexity
and do the migration directly inside schedule()...

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 01:29 PM, Jan Kiszka wrote:
>> I find all this complicated for a very small corner-case, so, I keep
>> looking for a simpler solution. Let us try something else.
> 
> It's rather the contrary: this solution is straightforward IMHO.
> 
>>
>> If the thread is woken up for whatever reason before the gatekeeper,
>> then we will hit the following "if" in xnshadow_harden, can we set the
>> target to NULL at this point if it is the current thread? With a cmpxchg
>> perhaps?
> 
> Gatekeeper and target task aren't synchronized, are they?

That is another reason why the solution to synchronize using the
semaphore may not work. If the thread hits the call to down before the
gatekeeper is woken up, then the gatekeeper test may see the target
thread as suspended, and we may return from the call to "down" in
the Xenomai domain.

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 13:26, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:10 PM, Jan Kiszka wrote:
>> On 2011-07-12 13:08, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 01:06 PM, Jan Kiszka wrote:
 On 2011-07-12 13:04, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:00 PM, Jan Kiszka wrote:
>> On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
 On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
 On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
 On 2011-07-11 21:10, Jan Kiszka wrote:
> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>> On 07/08/2011 06:29 PM, GIT version control wrote:
>>> @@ -2528,6 +2534,22 @@ static inline void 
>>> do_taskexit_event(struct task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>  
>>> xnlock_get_irqsave(&nklock, s);
>>> +
>>> +   gksched = thread->gksched;
>>> +   if (gksched) {
>>> +   xnlock_put_irqrestore(&nklock, s);
>>
>> Are we sure irqs are on here? Are you sure that what is 
>> needed is not an
>> xnlock_clear_irqon?
>
> We are in the context of do_exit. Not only IRQs are on, also 
> preemption.
> And surely no nklock is held.
>
>> Furthermore, I do not understand how we
>> "synchronize" with the gatekeeper, how is the gatekeeper 
>> garanteed to
>> wait for this assignment?
>
> The gatekeeper holds the gksync token while it's active. We 
> request it,
> thus we wait for the gatekeeper to become idle again. While 
> it is idle,
> we reset the queued reference - but I just realized that this 
> may tramp
> on other tasks' values. I need to add a check that the value 
> to be
> null'ified is actually still ours.

 Thinking again, that's actually not a problem: gktarget is 
 only needed
 while gksync is zero - but then we won't get hold of it anyway 
 and,
 thus, can't cause any damage.
>>>
>>> Well, you make it look like it does not work. From what I 
>>> understand,
>>> what you want is to set gktarget to null if a task being 
>>> hardened is
>>> destroyed. But by waiting for the semaphore, you actually wait 
>>> for the
>>> harden to be complete, so setting to NULL is useless. Or am I 
>>> missing
>>> something else?
>>
>> Setting to NULL is probably unneeded but still better than rely 
>> on the
>> gatekeeper never waking up spuriously and then dereferencing a 
>> stale
>> pointer.
>>
>> The key element of this fix is waitng on gksync, thus on the 
>> completion
>> of the non-RT part of the hardening. Actually, this part usually 
>> fails
>> as the target task received a termination signal at this point.
>
> Yes, but since you wait on the completion of the hardening, the 
> test
> if (target &&...) in the gatekeeper code will always be true, 
> because at
> this point the cleanup code will still be waiting for the 
> semaphore.

 Yes, except we will ever wake up the gatekeeper later on without an
 updated gktarget, ie. spuriously. Better safe than sorry, this is 
 hairy
 code anyway (hopefully obsolete one day).
>>>
>>> The gatekeeper is not woken up by posting the semaphore, the 
>>> gatekeeper
>>> is woken up by the thread which is going to be hardened (and this 
>>> thread
>>> is the one which waits for the semaphore).
>>
>> All true. And what is the point?
>
> The point being, would not something like this patch be sufficient?
>
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 01f4200..4742c02 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 01:10 PM, Jan Kiszka wrote:
> On 2011-07-12 13:08, Gilles Chanteperdrix wrote:
>> On 07/12/2011 01:06 PM, Jan Kiszka wrote:
>>> On 2011-07-12 13:04, Gilles Chanteperdrix wrote:
 On 07/12/2011 01:00 PM, Jan Kiszka wrote:
> On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
>> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
>>> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
 On 07/11/2011 10:12 PM, Jan Kiszka wrote:
> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:59 PM, Jan Kiszka wrote:
> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:10, Jan Kiszka wrote:
 On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
> On 07/08/2011 06:29 PM, GIT version control wrote:
>> @@ -2528,6 +2534,22 @@ static inline void 
>> do_taskexit_event(struct task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>  
>>  xnlock_get_irqsave(&nklock, s);
>> +
>> +gksched = thread->gksched;
>> +if (gksched) {
>> +xnlock_put_irqrestore(&nklock, s);
>
> Are we sure irqs are on here? Are you sure that what is 
> needed is not an
> xnlock_clear_irqon?

 We are in the context of do_exit. Not only IRQs are on, also 
 preemption.
 And surely no nklock is held.

> Furthermore, I do not understand how we
> "synchronize" with the gatekeeper, how is the gatekeeper 
> garanteed to
> wait for this assignment?

 The gatekeeper holds the gksync token while it's active. We 
 request it,
 thus we wait for the gatekeeper to become idle again. While it 
 is idle,
 we reset the queued reference - but I just realized that this 
 may tramp
 on other tasks' values. I need to add a check that the value 
 to be
 null'ified is actually still ours.
>>>
>>> Thinking again, that's actually not a problem: gktarget is only 
>>> needed
>>> while gksync is zero - but then we won't get hold of it anyway 
>>> and,
>>> thus, can't cause any damage.
>>
>> Well, you make it look like it does not work. From what I 
>> understand,
>> what you want is to set gktarget to null if a task being 
>> hardened is
>> destroyed. But by waiting for the semaphore, you actually wait 
>> for the
>> harden to be complete, so setting to NULL is useless. Or am I 
>> missing
>> something else?
>
> Setting to NULL is probably unneeded but still better than rely 
> on the
> gatekeeper never waking up spuriously and then dereferencing a 
> stale
> pointer.
>
> The key element of this fix is waitng on gksync, thus on the 
> completion
> of the non-RT part of the hardening. Actually, this part usually 
> fails
> as the target task received a termination signal at this point.

 Yes, but since you wait on the completion of the hardening, the 
 test
 if (target &&...) in the gatekeeper code will always be true, 
 because at
 this point the cleanup code will still be waiting for the 
 semaphore.
>>>
>>> Yes, except we will ever wake up the gatekeeper later on without an
>>> updated gktarget, ie. spuriously. Better safe than sorry, this is 
>>> hairy
>>> code anyway (hopefully obsolete one day).
>>
>> The gatekeeper is not woken up by posting the semaphore, the 
>> gatekeeper
>> is woken up by the thread which is going to be hardened (and this 
>> thread
>> is the one which waits for the semaphore).
>
> All true. And what is the point?

 The point being, would not something like this patch be sufficient?

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 01f4200..4742c02 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
 task_struct *p)
magic = xnthread_get_ma

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 13:08, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:06 PM, Jan Kiszka wrote:
>> On 2011-07-12 13:04, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 01:00 PM, Jan Kiszka wrote:
 On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
>> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
 On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
 On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:10, Jan Kiszka wrote:
>>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
 On 07/08/2011 06:29 PM, GIT version control wrote:
> @@ -2528,6 +2534,22 @@ static inline void 
> do_taskexit_event(struct task_struct *p)
>   magic = xnthread_get_magic(thread);
>  
>   xnlock_get_irqsave(&nklock, s);
> +
> + gksched = thread->gksched;
> + if (gksched) {
> + xnlock_put_irqrestore(&nklock, s);

 Are we sure irqs are on here? Are you sure that what is needed 
 is not an
 xnlock_clear_irqon?
>>>
>>> We are in the context of do_exit. Not only IRQs are on, also 
>>> preemption.
>>> And surely no nklock is held.
>>>
 Furthermore, I do not understand how we
 "synchronize" with the gatekeeper, how is the gatekeeper 
 garanteed to
 wait for this assignment?
>>>
>>> The gatekeeper holds the gksync token while it's active. We 
>>> request it,
>>> thus we wait for the gatekeeper to become idle again. While it 
>>> is idle,
>>> we reset the queued reference - but I just realized that this 
>>> may tramp
>>> on other tasks' values. I need to add a check that the value to 
>>> be
>>> null'ified is actually still ours.
>>
>> Thinking again, that's actually not a problem: gktarget is only 
>> needed
>> while gksync is zero - but then we won't get hold of it anyway 
>> and,
>> thus, can't cause any damage.
>
> Well, you make it look like it does not work. From what I 
> understand,
> what you want is to set gktarget to null if a task being hardened 
> is
> destroyed. But by waiting for the semaphore, you actually wait 
> for the
> harden to be complete, so setting to NULL is useless. Or am I 
> missing
> something else?

 Setting to NULL is probably unneeded but still better than rely on 
 the
 gatekeeper never waking up spuriously and then dereferencing a 
 stale
 pointer.

 The key element of this fix is waitng on gksync, thus on the 
 completion
 of the non-RT part of the hardening. Actually, this part usually 
 fails
 as the target task received a termination signal at this point.
>>>
>>> Yes, but since you wait on the completion of the hardening, the test
>>> if (target &&...) in the gatekeeper code will always be true, 
>>> because at
>>> this point the cleanup code will still be waiting for the semaphore.
>>
>> Yes, except we will ever wake up the gatekeeper later on without an
>> updated gktarget, ie. spuriously. Better safe than sorry, this is 
>> hairy
>> code anyway (hopefully obsolete one day).
>
> The gatekeeper is not woken up by posting the semaphore, the 
> gatekeeper
> is woken up by the thread which is going to be hardened (and this 
> thread
> is the one which waits for the semaphore).

 All true. And what is the point?
>>>
>>> The point being, would not something like this patch be sufficient?
>>>
>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>> index 01f4200..4742c02 100644
>>> --- a/ksrc/nucleus/shadow.c
>>> +++ b/ksrc/nucleus/shadow.c
>>> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
>>> task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>
>>> xnlock_get_irqsave(&nklock, s);
>>> +   if (xnthread_test_info(thread, XNATOMIC)) {
>>> +   struct xnsched *g

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 01:06 PM, Jan Kiszka wrote:
> On 2011-07-12 13:04, Gilles Chanteperdrix wrote:
>> On 07/12/2011 01:00 PM, Jan Kiszka wrote:
>>> On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
 On 07/12/2011 09:22 AM, Jan Kiszka wrote:
> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
 On 07/11/2011 10:06 PM, Jan Kiszka wrote:
> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:16 PM, Jan Kiszka wrote:
> On 2011-07-11 21:10, Jan Kiszka wrote:
>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>>> On 07/08/2011 06:29 PM, GIT version control wrote:
 @@ -2528,6 +2534,22 @@ static inline void 
 do_taskexit_event(struct task_struct *p)
magic = xnthread_get_magic(thread);
  
xnlock_get_irqsave(&nklock, s);
 +
 +  gksched = thread->gksched;
 +  if (gksched) {
 +  xnlock_put_irqrestore(&nklock, s);
>>>
>>> Are we sure irqs are on here? Are you sure that what is needed 
>>> is not an
>>> xnlock_clear_irqon?
>>
>> We are in the context of do_exit. Not only IRQs are on, also 
>> preemption.
>> And surely no nklock is held.
>>
>>> Furthermore, I do not understand how we
>>> "synchronize" with the gatekeeper, how is the gatekeeper 
>>> garanteed to
>>> wait for this assignment?
>>
>> The gatekeeper holds the gksync token while it's active. We 
>> request it,
>> thus we wait for the gatekeeper to become idle again. While it 
>> is idle,
>> we reset the queued reference - but I just realized that this 
>> may tramp
>> on other tasks' values. I need to add a check that the value to 
>> be
>> null'ified is actually still ours.
>
> Thinking again, that's actually not a problem: gktarget is only 
> needed
> while gksync is zero - but then we won't get hold of it anyway 
> and,
> thus, can't cause any damage.

 Well, you make it look like it does not work. From what I 
 understand,
 what you want is to set gktarget to null if a task being hardened 
 is
 destroyed. But by waiting for the semaphore, you actually wait for 
 the
 harden to be complete, so setting to NULL is useless. Or am I 
 missing
 something else?
>>>
>>> Setting to NULL is probably unneeded but still better than rely on 
>>> the
>>> gatekeeper never waking up spuriously and then dereferencing a stale
>>> pointer.
>>>
>>> The key element of this fix is waitng on gksync, thus on the 
>>> completion
>>> of the non-RT part of the hardening. Actually, this part usually 
>>> fails
>>> as the target task received a termination signal at this point.
>>
>> Yes, but since you wait on the completion of the hardening, the test
>> if (target &&...) in the gatekeeper code will always be true, 
>> because at
>> this point the cleanup code will still be waiting for the semaphore.
>
> Yes, except we will ever wake up the gatekeeper later on without an
> updated gktarget, ie. spuriously. Better safe than sorry, this is 
> hairy
> code anyway (hopefully obsolete one day).

 The gatekeeper is not woken up by posting the semaphore, the gatekeeper
 is woken up by the thread which is going to be hardened (and this 
 thread
 is the one which waits for the semaphore).
>>>
>>> All true. And what is the point?
>>
>> The point being, would not something like this patch be sufficient?
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 01f4200..4742c02 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>
>>  xnlock_get_irqsave(&nklock, s);
>> +if (xnthread_test_info(thread, XNATOMIC)) {
>> +struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
>
> That's not reliable, the task might have been migrated by Linux in the
> meantime. We must use the stored gksched.
>

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 13:04, Gilles Chanteperdrix wrote:
> On 07/12/2011 01:00 PM, Jan Kiszka wrote:
>> On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
>>> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
 On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
 On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
 On 2011-07-11 21:10, Jan Kiszka wrote:
> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>> On 07/08/2011 06:29 PM, GIT version control wrote:
>>> @@ -2528,6 +2534,22 @@ static inline void 
>>> do_taskexit_event(struct task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>  
>>> xnlock_get_irqsave(&nklock, s);
>>> +
>>> +   gksched = thread->gksched;
>>> +   if (gksched) {
>>> +   xnlock_put_irqrestore(&nklock, s);
>>
>> Are we sure irqs are on here? Are you sure that what is needed 
>> is not an
>> xnlock_clear_irqon?
>
> We are in the context of do_exit. Not only IRQs are on, also 
> preemption.
> And surely no nklock is held.
>
>> Furthermore, I do not understand how we
>> "synchronize" with the gatekeeper, how is the gatekeeper 
>> garanteed to
>> wait for this assignment?
>
> The gatekeeper holds the gksync token while it's active. We 
> request it,
> thus we wait for the gatekeeper to become idle again. While it is 
> idle,
> we reset the queued reference - but I just realized that this may 
> tramp
> on other tasks' values. I need to add a check that the value to be
> null'ified is actually still ours.

 Thinking again, that's actually not a problem: gktarget is only 
 needed
 while gksync is zero - but then we won't get hold of it anyway and,
 thus, can't cause any damage.
>>>
>>> Well, you make it look like it does not work. From what I 
>>> understand,
>>> what you want is to set gktarget to null if a task being hardened is
>>> destroyed. But by waiting for the semaphore, you actually wait for 
>>> the
>>> harden to be complete, so setting to NULL is useless. Or am I 
>>> missing
>>> something else?
>>
>> Setting to NULL is probably unneeded but still better than rely on 
>> the
>> gatekeeper never waking up spuriously and then dereferencing a stale
>> pointer.
>>
>> The key element of this fix is waitng on gksync, thus on the 
>> completion
>> of the non-RT part of the hardening. Actually, this part usually 
>> fails
>> as the target task received a termination signal at this point.
>
> Yes, but since you wait on the completion of the hardening, the test
> if (target &&...) in the gatekeeper code will always be true, because 
> at
> this point the cleanup code will still be waiting for the semaphore.

 Yes, except we will ever wake up the gatekeeper later on without an
 updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
 code anyway (hopefully obsolete one day).
>>>
>>> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
>>> is woken up by the thread which is going to be hardened (and this thread
>>> is the one which waits for the semaphore).
>>
>> All true. And what is the point?
>
> The point being, would not something like this patch be sufficient?
>
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 01f4200..4742c02 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
> task_struct *p)
>   magic = xnthread_get_magic(thread);
>
>   xnlock_get_irqsave(&nklock, s);
> + if (xnthread_test_info(thread, XNATOMIC)) {
> + struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));

 That's not reliable, the task might have been migrated by Linux in the
 meantime. We must use the stored gksched.

> + xnlock_put_irqrestore(&nklock, s);
> +
> + /* Thread is in flight to primary mode, wait for the
> +gatekeeper to be done with it. */
> + down(&gksched->gksync);
> +   

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 01:00 PM, Jan Kiszka wrote:
> On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
>> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
>>> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
 On 07/11/2011 10:12 PM, Jan Kiszka wrote:
> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:59 PM, Jan Kiszka wrote:
> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:10, Jan Kiszka wrote:
 On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
> On 07/08/2011 06:29 PM, GIT version control wrote:
>> @@ -2528,6 +2534,22 @@ static inline void 
>> do_taskexit_event(struct task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>  
>>  xnlock_get_irqsave(&nklock, s);
>> +
>> +gksched = thread->gksched;
>> +if (gksched) {
>> +xnlock_put_irqrestore(&nklock, s);
>
> Are we sure irqs are on here? Are you sure that what is needed is 
> not an
> xnlock_clear_irqon?

 We are in the context of do_exit. Not only IRQs are on, also 
 preemption.
 And surely no nklock is held.

> Furthermore, I do not understand how we
> "synchronize" with the gatekeeper, how is the gatekeeper 
> garanteed to
> wait for this assignment?

 The gatekeeper holds the gksync token while it's active. We 
 request it,
 thus we wait for the gatekeeper to become idle again. While it is 
 idle,
 we reset the queued reference - but I just realized that this may 
 tramp
 on other tasks' values. I need to add a check that the value to be
 null'ified is actually still ours.
>>>
>>> Thinking again, that's actually not a problem: gktarget is only 
>>> needed
>>> while gksync is zero - but then we won't get hold of it anyway and,
>>> thus, can't cause any damage.
>>
>> Well, you make it look like it does not work. From what I understand,
>> what you want is to set gktarget to null if a task being hardened is
>> destroyed. But by waiting for the semaphore, you actually wait for 
>> the
>> harden to be complete, so setting to NULL is useless. Or am I missing
>> something else?
>
> Setting to NULL is probably unneeded but still better than rely on the
> gatekeeper never waking up spuriously and then dereferencing a stale
> pointer.
>
> The key element of this fix is waitng on gksync, thus on the 
> completion
> of the non-RT part of the hardening. Actually, this part usually fails
> as the target task received a termination signal at this point.

 Yes, but since you wait on the completion of the hardening, the test
 if (target &&...) in the gatekeeper code will always be true, because 
 at
 this point the cleanup code will still be waiting for the semaphore.
>>>
>>> Yes, except we will ever wake up the gatekeeper later on without an
>>> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
>>> code anyway (hopefully obsolete one day).
>>
>> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
>> is woken up by the thread which is going to be hardened (and this thread
>> is the one which waits for the semaphore).
>
> All true. And what is the point?

 The point being, would not something like this patch be sufficient?

 diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
 index 01f4200..4742c02 100644
 --- a/ksrc/nucleus/shadow.c
 +++ b/ksrc/nucleus/shadow.c
 @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
 task_struct *p)
magic = xnthread_get_magic(thread);

xnlock_get_irqsave(&nklock, s);
 +  if (xnthread_test_info(thread, XNATOMIC)) {
 +  struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
>>>
>>> That's not reliable, the task might have been migrated by Linux in the
>>> meantime. We must use the stored gksched.
>>>
 +  xnlock_put_irqrestore(&nklock, s);
 +
 +  /* Thread is in flight to primary mode, wait for the
 + gatekeeper to be done with it. */
 +  down(&gksched->gksync);
 +  up(&gksched->gksync);
 +
 +  xnlock_get_irqsave(&nklock, s);
 +  }
 +
/* Prevent wakeup call from xnshadow_unmap(). */
xnshadow_thrptd(p) = NULL;

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 12:59, Gilles Chanteperdrix wrote:
> On 07/12/2011 09:22 AM, Jan Kiszka wrote:
>> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
 On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
 On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:10, Jan Kiszka wrote:
>>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
 On 07/08/2011 06:29 PM, GIT version control wrote:
> @@ -2528,6 +2534,22 @@ static inline void 
> do_taskexit_event(struct task_struct *p)
>   magic = xnthread_get_magic(thread);
>  
>   xnlock_get_irqsave(&nklock, s);
> +
> + gksched = thread->gksched;
> + if (gksched) {
> + xnlock_put_irqrestore(&nklock, s);

 Are we sure irqs are on here? Are you sure that what is needed is 
 not an
 xnlock_clear_irqon?
>>>
>>> We are in the context of do_exit. Not only IRQs are on, also 
>>> preemption.
>>> And surely no nklock is held.
>>>
 Furthermore, I do not understand how we
 "synchronize" with the gatekeeper, how is the gatekeeper garanteed 
 to
 wait for this assignment?
>>>
>>> The gatekeeper holds the gksync token while it's active. We request 
>>> it,
>>> thus we wait for the gatekeeper to become idle again. While it is 
>>> idle,
>>> we reset the queued reference - but I just realized that this may 
>>> tramp
>>> on other tasks' values. I need to add a check that the value to be
>>> null'ified is actually still ours.
>>
>> Thinking again, that's actually not a problem: gktarget is only 
>> needed
>> while gksync is zero - but then we won't get hold of it anyway and,
>> thus, can't cause any damage.
>
> Well, you make it look like it does not work. From what I understand,
> what you want is to set gktarget to null if a task being hardened is
> destroyed. But by waiting for the semaphore, you actually wait for the
> harden to be complete, so setting to NULL is useless. Or am I missing
> something else?

 Setting to NULL is probably unneeded but still better than rely on the
 gatekeeper never waking up spuriously and then dereferencing a stale
 pointer.

 The key element of this fix is waitng on gksync, thus on the completion
 of the non-RT part of the hardening. Actually, this part usually fails
 as the target task received a termination signal at this point.
>>>
>>> Yes, but since you wait on the completion of the hardening, the test
>>> if (target &&...) in the gatekeeper code will always be true, because at
>>> this point the cleanup code will still be waiting for the semaphore.
>>
>> Yes, except we will ever wake up the gatekeeper later on without an
>> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
>> code anyway (hopefully obsolete one day).
>
> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
> is woken up by the thread which is going to be hardened (and this thread
> is the one which waits for the semaphore).

 All true. And what is the point?
>>>
>>> The point being, would not something like this patch be sufficient?
>>>
>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>> index 01f4200..4742c02 100644
>>> --- a/ksrc/nucleus/shadow.c
>>> +++ b/ksrc/nucleus/shadow.c
>>> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
>>> task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>
>>> xnlock_get_irqsave(&nklock, s);
>>> +   if (xnthread_test_info(thread, XNATOMIC)) {
>>> +   struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
>>
>> That's not reliable, the task might have been migrated by Linux in the
>> meantime. We must use the stored gksched.
>>
>>> +   xnlock_put_irqrestore(&nklock, s);
>>> +
>>> +   /* Thread is in flight to primary mode, wait for the
>>> +  gatekeeper to be done with it. */
>>> +   down(&gksched->gksync);
>>> +   up(&gksched->gksync);
>>> +
>>> +   xnlock_get_irqsave(&nklock, s);
>>> +   }
>>> +
>>> /* Prevent wakeup call from xnshadow_unmap(). */
>>> xnshadow_thrptd(p) = NULL;
>>> xnthread_archtcb(thread)->user_task = NULL;
>>>
>>
>> Again, setting gktarget to NULL and testing for NULL is simply safer,
>> and I see no gain in skipping that. But if you prefer the
>> micro-op

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 09:22 AM, Jan Kiszka wrote:
> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
 On 07/11/2011 10:06 PM, Jan Kiszka wrote:
> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:16 PM, Jan Kiszka wrote:
> On 2011-07-11 21:10, Jan Kiszka wrote:
>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>>> On 07/08/2011 06:29 PM, GIT version control wrote:
 @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
 task_struct *p)
magic = xnthread_get_magic(thread);
  
xnlock_get_irqsave(&nklock, s);
 +
 +  gksched = thread->gksched;
 +  if (gksched) {
 +  xnlock_put_irqrestore(&nklock, s);
>>>
>>> Are we sure irqs are on here? Are you sure that what is needed is 
>>> not an
>>> xnlock_clear_irqon?
>>
>> We are in the context of do_exit. Not only IRQs are on, also 
>> preemption.
>> And surely no nklock is held.
>>
>>> Furthermore, I do not understand how we
>>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed 
>>> to
>>> wait for this assignment?
>>
>> The gatekeeper holds the gksync token while it's active. We request 
>> it,
>> thus we wait for the gatekeeper to become idle again. While it is 
>> idle,
>> we reset the queued reference - but I just realized that this may 
>> tramp
>> on other tasks' values. I need to add a check that the value to be
>> null'ified is actually still ours.
>
> Thinking again, that's actually not a problem: gktarget is only needed
> while gksync is zero - but then we won't get hold of it anyway and,
> thus, can't cause any damage.

 Well, you make it look like it does not work. From what I understand,
 what you want is to set gktarget to null if a task being hardened is
 destroyed. But by waiting for the semaphore, you actually wait for the
 harden to be complete, so setting to NULL is useless. Or am I missing
 something else?
>>>
>>> Setting to NULL is probably unneeded but still better than rely on the
>>> gatekeeper never waking up spuriously and then dereferencing a stale
>>> pointer.
>>>
>>> The key element of this fix is waitng on gksync, thus on the completion
>>> of the non-RT part of the hardening. Actually, this part usually fails
>>> as the target task received a termination signal at this point.
>>
>> Yes, but since you wait on the completion of the hardening, the test
>> if (target &&...) in the gatekeeper code will always be true, because at
>> this point the cleanup code will still be waiting for the semaphore.
>
> Yes, except we will ever wake up the gatekeeper later on without an
> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
> code anyway (hopefully obsolete one day).

 The gatekeeper is not woken up by posting the semaphore, the gatekeeper
 is woken up by the thread which is going to be hardened (and this thread
 is the one which waits for the semaphore).
>>>
>>> All true. And what is the point?
>>
>> The point being, would not something like this patch be sufficient?
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 01f4200..4742c02 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>
>>  xnlock_get_irqsave(&nklock, s);
>> +if (xnthread_test_info(thread, XNATOMIC)) {
>> +struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
> 
> That's not reliable, the task might have been migrated by Linux in the
> meantime. We must use the stored gksched.
> 
>> +xnlock_put_irqrestore(&nklock, s);
>> +
>> +/* Thread is in flight to primary mode, wait for the
>> +   gatekeeper to be done with it. */
>> +down(&gksched->gksync);
>> +up(&gksched->gksync);
>> +
>> +xnlock_get_irqsave(&nklock, s);
>> +}
>> +
>>  /* Prevent wakeup call from xnshadow_unmap(). */
>>  xnshadow_thrptd(p) = NULL;
>>  xnthread_archtcb(thread)->user_task = NULL;
>>
> 
> Again, setting gktarget to NULL and testing for NULL is simply safer,
> and I see no gain in skipping that. But if you prefer the
> micro-optimization, I'll drop it.

Could not we use an info bit instead of adding a pointer?

-- 
Gilles.

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Gilles Chanteperdrix
On 07/12/2011 09:22 AM, Jan Kiszka wrote:
> On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
 On 07/11/2011 10:06 PM, Jan Kiszka wrote:
> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:16 PM, Jan Kiszka wrote:
> On 2011-07-11 21:10, Jan Kiszka wrote:
>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>>> On 07/08/2011 06:29 PM, GIT version control wrote:
 @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
 task_struct *p)
magic = xnthread_get_magic(thread);
  
xnlock_get_irqsave(&nklock, s);
 +
 +  gksched = thread->gksched;
 +  if (gksched) {
 +  xnlock_put_irqrestore(&nklock, s);
>>>
>>> Are we sure irqs are on here? Are you sure that what is needed is 
>>> not an
>>> xnlock_clear_irqon?
>>
>> We are in the context of do_exit. Not only IRQs are on, also 
>> preemption.
>> And surely no nklock is held.
>>
>>> Furthermore, I do not understand how we
>>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed 
>>> to
>>> wait for this assignment?
>>
>> The gatekeeper holds the gksync token while it's active. We request 
>> it,
>> thus we wait for the gatekeeper to become idle again. While it is 
>> idle,
>> we reset the queued reference - but I just realized that this may 
>> tramp
>> on other tasks' values. I need to add a check that the value to be
>> null'ified is actually still ours.
>
> Thinking again, that's actually not a problem: gktarget is only needed
> while gksync is zero - but then we won't get hold of it anyway and,
> thus, can't cause any damage.

 Well, you make it look like it does not work. From what I understand,
 what you want is to set gktarget to null if a task being hardened is
 destroyed. But by waiting for the semaphore, you actually wait for the
 harden to be complete, so setting to NULL is useless. Or am I missing
 something else?
>>>
>>> Setting to NULL is probably unneeded but still better than rely on the
>>> gatekeeper never waking up spuriously and then dereferencing a stale
>>> pointer.
>>>
>>> The key element of this fix is waitng on gksync, thus on the completion
>>> of the non-RT part of the hardening. Actually, this part usually fails
>>> as the target task received a termination signal at this point.
>>
>> Yes, but since you wait on the completion of the hardening, the test
>> if (target &&...) in the gatekeeper code will always be true, because at
>> this point the cleanup code will still be waiting for the semaphore.
>
> Yes, except we will ever wake up the gatekeeper later on without an
> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
> code anyway (hopefully obsolete one day).

 The gatekeeper is not woken up by posting the semaphore, the gatekeeper
 is woken up by the thread which is going to be hardened (and this thread
 is the one which waits for the semaphore).
>>>
>>> All true. And what is the point?
>>
>> The point being, would not something like this patch be sufficient?
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 01f4200..4742c02 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>
>>  xnlock_get_irqsave(&nklock, s);
>> +if (xnthread_test_info(thread, XNATOMIC)) {
>> +struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
> 
> That's not reliable, the task might have been migrated by Linux in the
> meantime. We must use the stored gksched.
> 
>> +xnlock_put_irqrestore(&nklock, s);
>> +
>> +/* Thread is in flight to primary mode, wait for the
>> +   gatekeeper to be done with it. */
>> +down(&gksched->gksync);
>> +up(&gksched->gksync);
>> +
>> +xnlock_get_irqsave(&nklock, s);
>> +}
>> +
>>  /* Prevent wakeup call from xnshadow_unmap(). */
>>  xnshadow_thrptd(p) = NULL;
>>  xnthread_archtcb(thread)->user_task = NULL;
>>
> 
> Again, setting gktarget to NULL and testing for NULL is simply safer,

From my point of view, testing for NULL is misleading dead code, since
it will never happen.

-- 
Gilles.


Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-12 Thread Jan Kiszka
On 2011-07-12 08:41, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:12 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
 On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
 On 2011-07-11 21:10, Jan Kiszka wrote:
> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>> On 07/08/2011 06:29 PM, GIT version control wrote:
>>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>>> task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>  
>>> xnlock_get_irqsave(&nklock, s);
>>> +
>>> +   gksched = thread->gksched;
>>> +   if (gksched) {
>>> +   xnlock_put_irqrestore(&nklock, s);
>>
>> Are we sure irqs are on here? Are you sure that what is needed is 
>> not an
>> xnlock_clear_irqon?
>
> We are in the context of do_exit. Not only IRQs are on, also 
> preemption.
> And surely no nklock is held.
>
>> Furthermore, I do not understand how we
>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
>> wait for this assignment?
>
> The gatekeeper holds the gksync token while it's active. We request 
> it,
> thus we wait for the gatekeeper to become idle again. While it is 
> idle,
> we reset the queued reference - but I just realized that this may 
> tramp
> on other tasks' values. I need to add a check that the value to be
> null'ified is actually still ours.

 Thinking again, that's actually not a problem: gktarget is only needed
 while gksync is zero - but then we won't get hold of it anyway and,
 thus, can't cause any damage.
>>>
>>> Well, you make it look like it does not work. From what I understand,
>>> what you want is to set gktarget to null if a task being hardened is
>>> destroyed. But by waiting for the semaphore, you actually wait for the
>>> harden to be complete, so setting to NULL is useless. Or am I missing
>>> something else?
>>
>> Setting to NULL is probably unneeded but still better than rely on the
>> gatekeeper never waking up spuriously and then dereferencing a stale
>> pointer.
>>
>> The key element of this fix is waitng on gksync, thus on the completion
>> of the non-RT part of the hardening. Actually, this part usually fails
>> as the target task received a termination signal at this point.
>
> Yes, but since you wait on the completion of the hardening, the test
> if (target &&...) in the gatekeeper code will always be true, because at
> this point the cleanup code will still be waiting for the semaphore.

 Yes, except we will ever wake up the gatekeeper later on without an
 updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
 code anyway (hopefully obsolete one day).
>>>
>>> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
>>> is woken up by the thread which is going to be hardened (and this thread
>>> is the one which waits for the semaphore).
>>
>> All true. And what is the point?
> 
> The point being, would not something like this patch be sufficient?
> 
> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
> index 01f4200..4742c02 100644
> --- a/ksrc/nucleus/shadow.c
> +++ b/ksrc/nucleus/shadow.c
> @@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct
> task_struct *p)
>   magic = xnthread_get_magic(thread);
> 
>   xnlock_get_irqsave(&nklock, s);
> + if (xnthread_test_info(thread, XNATOMIC)) {
> + struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));

That's not reliable, the task might have been migrated by Linux in the
meantime. We must use the stored gksched.

> + xnlock_put_irqrestore(&nklock, s);
> +
> + /* Thread is in flight to primary mode, wait for the
> +gatekeeper to be done with it. */
> + down(&gksched->gksync);
> + up(&gksched->gksync);
> +
> + xnlock_get_irqsave(&nklock, s);
> + }
> +
>   /* Prevent wakeup call from xnshadow_unmap(). */
>   xnshadow_thrptd(p) = NULL;
>   xnthread_archtcb(thread)->user_task = NULL;
> 

Again, setting gktarget to NULL and testing for NULL is simply safer,
and I see no gain in skipping that. But if you prefer the
micro-optimization, I'll drop it.
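
For clarity, the exit-path handshake defended here looks roughly like this
(a sketch reconstructed from this discussion, not the literal committed
hunk; gksched is the scheduler recorded on the thread when the hardening
request was queued):

    gksched = thread->gksched;
    if (gksched) {
            xnlock_put_irqrestore(&nklock, s);

            /* Wait for the gatekeeper to be done with the thread. */
            down(&gksched->gksync);
            /* Drop the queued reference, but only if it is still ours. */
            if (gksched->gktarget == thread)
                    gksched->gktarget = NULL;
            up(&gksched->gksync);

            xnlock_get_irqsave(&nklock, s);
    }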

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Gilles Chanteperdrix
On 07/11/2011 10:12 PM, Jan Kiszka wrote:
> On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
>> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:59 PM, Jan Kiszka wrote:
> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:10, Jan Kiszka wrote:
 On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
> On 07/08/2011 06:29 PM, GIT version control wrote:
>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>  
>>  xnlock_get_irqsave(&nklock, s);
>> +
>> +gksched = thread->gksched;
>> +if (gksched) {
>> +xnlock_put_irqrestore(&nklock, s);
>
> Are we sure irqs are on here? Are you sure that what is needed is not 
> an
> xnlock_clear_irqon?

 We are in the context of do_exit. Not only IRQs are on, also 
 preemption.
 And surely no nklock is held.

> Furthermore, I do not understand how we
> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
> wait for this assignment?

 The gatekeeper holds the gksync token while it's active. We request it,
 thus we wait for the gatekeeper to become idle again. While it is idle,
 we reset the queued reference - but I just realized that this may tramp
 on other tasks' values. I need to add a check that the value to be
 null'ified is actually still ours.
>>>
>>> Thinking again, that's actually not a problem: gktarget is only needed
>>> while gksync is zero - but then we won't get hold of it anyway and,
>>> thus, can't cause any damage.
>>
>> Well, you make it look like it does not work. From what I understand,
>> what you want is to set gktarget to null if a task being hardened is
>> destroyed. But by waiting for the semaphore, you actually wait for the
>> harden to be complete, so setting to NULL is useless. Or am I missing
>> something else?
>
> Setting to NULL is probably unneeded but still better than rely on the
> gatekeeper never waking up spuriously and then dereferencing a stale
> pointer.
>
> The key element of this fix is waitng on gksync, thus on the completion
> of the non-RT part of the hardening. Actually, this part usually fails
> as the target task received a termination signal at this point.

 Yes, but since you wait on the completion of the hardening, the test
 if (target &&...) in the gatekeeper code will always be true, because at
 this point the cleanup code will still be waiting for the semaphore.
>>>
>>> Yes, except we will ever wake up the gatekeeper later on without an
>>> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
>>> code anyway (hopefully obsolete one day).
>>
>> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
>> is woken up by the thread which is going to be hardened (and this thread
>> is the one which waits for the semaphore).
> 
> All true. And what is the point?

The point being, would not something like this patch be sufficient?

diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 01f4200..4742c02 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -2527,6 +2527,18 @@ static inline void do_taskexit_event(struct task_struct *p)
 	magic = xnthread_get_magic(thread);
 
 	xnlock_get_irqsave(&nklock, s);
+	if (xnthread_test_info(thread, XNATOMIC)) {
+		struct xnsched *gksched = xnpod_sched_slot(task_cpu(p));
+		xnlock_put_irqrestore(&nklock, s);
+
+		/* Thread is in flight to primary mode, wait for the
+		   gatekeeper to be done with it. */
+		down(&gksched->gksync);
+		up(&gksched->gksync);
+
+		xnlock_get_irqsave(&nklock, s);
+	}
+
 	/* Prevent wakeup call from xnshadow_unmap(). */
 	xnshadow_thrptd(p) = NULL;
 	xnthread_archtcb(thread)->user_task = NULL;

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Jan Kiszka
On 2011-07-11 22:09, Gilles Chanteperdrix wrote:
> On 07/11/2011 10:06 PM, Jan Kiszka wrote:
>> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
 On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:10, Jan Kiszka wrote:
>>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
 On 07/08/2011 06:29 PM, GIT version control wrote:
> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
> task_struct *p)
>   magic = xnthread_get_magic(thread);
>  
>   xnlock_get_irqsave(&nklock, s);
> +
> + gksched = thread->gksched;
> + if (gksched) {
> + xnlock_put_irqrestore(&nklock, s);

 Are we sure irqs are on here? Are you sure that what is needed is not 
 an
 xnlock_clear_irqon?
>>>
>>> We are in the context of do_exit. Not only IRQs are on, also preemption.
>>> And surely no nklock is held.
>>>
 Furthermore, I do not understand how we
 "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
 wait for this assignment?
>>>
>>> The gatekeeper holds the gksync token while it's active. We request it,
>>> thus we wait for the gatekeeper to become idle again. While it is idle,
>>> we reset the queued reference - but I just realized that this may tramp
>>> on other tasks' values. I need to add a check that the value to be
>>> null'ified is actually still ours.
>>
>> Thinking again, that's actually not a problem: gktarget is only needed
>> while gksync is zero - but then we won't get hold of it anyway and,
>> thus, can't cause any damage.
>
> Well, you make it look like it does not work. From what I understand,
> what you want is to set gktarget to null if a task being hardened is
> destroyed. But by waiting for the semaphore, you actually wait for the
> harden to be complete, so setting to NULL is useless. Or am I missing
> something else?

 Setting to NULL is probably unneeded but still better than rely on the
 gatekeeper never waking up spuriously and then dereferencing a stale
 pointer.

 The key element of this fix is waitng on gksync, thus on the completion
 of the non-RT part of the hardening. Actually, this part usually fails
 as the target task received a termination signal at this point.
>>>
>>> Yes, but since you wait on the completion of the hardening, the test
>>> if (target &&...) in the gatekeeper code will always be true, because at
>>> this point the cleanup code will still be waiting for the semaphore.
>>
>> Yes, except we will ever wake up the gatekeeper later on without an
>> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
>> code anyway (hopefully obsolete one day).
> 
> The gatekeeper is not woken up by posting the semaphore, the gatekeeper
> is woken up by the thread which is going to be hardened (and this thread
> is the one which waits for the semaphore).

All true. And what is the point?

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Gilles Chanteperdrix
On 07/11/2011 10:06 PM, Jan Kiszka wrote:
> On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
 On 07/11/2011 09:16 PM, Jan Kiszka wrote:
> On 2011-07-11 21:10, Jan Kiszka wrote:
>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>>> On 07/08/2011 06:29 PM, GIT version control wrote:
 @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
 task_struct *p)
magic = xnthread_get_magic(thread);
  
xnlock_get_irqsave(&nklock, s);
 +
 +  gksched = thread->gksched;
 +  if (gksched) {
 +  xnlock_put_irqrestore(&nklock, s);
>>>
>>> Are we sure irqs are on here? Are you sure that what is needed is not an
>>> xnlock_clear_irqon?
>>
>> We are in the context of do_exit. Not only IRQs are on, also preemption.
>> And surely no nklock is held.
>>
>>> Furthermore, I do not understand how we
>>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
>>> wait for this assignment?
>>
>> The gatekeeper holds the gksync token while it's active. We request it,
>> thus we wait for the gatekeeper to become idle again. While it is idle,
>> we reset the queued reference - but I just realized that this may tramp
>> on other tasks' values. I need to add a check that the value to be
>> null'ified is actually still ours.
>
> Thinking again, that's actually not a problem: gktarget is only needed
> while gksync is zero - but then we won't get hold of it anyway and,
> thus, can't cause any damage.

 Well, you make it look like it does not work. From what I understand,
 what you want is to set gktarget to null if a task being hardened is
 destroyed. But by waiting for the semaphore, you actually wait for the
 harden to be complete, so setting to NULL is useless. Or am I missing
 something else?
>>>
>>> Setting to NULL is probably unneeded but still better than rely on the
>>> gatekeeper never waking up spuriously and then dereferencing a stale
>>> pointer.
>>>
>>> The key element of this fix is waitng on gksync, thus on the completion
>>> of the non-RT part of the hardening. Actually, this part usually fails
>>> as the target task received a termination signal at this point.
>>
>> Yes, but since you wait on the completion of the hardening, the test
>> if (target &&...) in the gatekeeper code will always be true, because at
>> this point the cleanup code will still be waiting for the semaphore.
> 
> Yes, except we will ever wake up the gatekeeper later on without an
> updated gktarget, ie. spuriously. Better safe than sorry, this is hairy
> code anyway (hopefully obsolete one day).

The gatekeeper is not woken up by posting the semaphore; it is woken up by the
thread which is going to be hardened (and that thread is the one which waits
for the semaphore).
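
Roughly, the handshake looks like this (a sketch of how I read it, assuming a
wait queue called gkwaitq next to gksync/gktarget - the exact names and
ordering in shadow.c may differ):

	/* thread about to be hardened, still a plain Linux task */
	down(&sched->gksync);			/* take the request token */
	sched->gktarget = thread;		/* queue ourselves */
	set_current_state(TASK_INTERRUPTIBLE);
	wake_up_interruptible(&sched->gkwaitq);	/* this, not the semaphore, wakes the gatekeeper */
	schedule();				/* suspend; the gatekeeper resumes us in primary mode */

	/* gatekeeper loop */
	for (;;) {
		up(&sched->gksync);		/* idle again: hand the token back */
		wait_event_interruptible(sched->gkwaitq, sched->gktarget != NULL);
		/* harden sched->gktarget here, then clear it and loop */
	}

So whoever does down(&sched->gksync) from the outside simply waits until the
gatekeeper has looped back to its idle point.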

> 
> Jan
> 


-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Jan Kiszka
On 2011-07-11 22:02, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:59 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
 On 2011-07-11 21:10, Jan Kiszka wrote:
> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>> On 07/08/2011 06:29 PM, GIT version control wrote:
>>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>>> task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>  
>>> xnlock_get_irqsave(&nklock, s);
>>> +
>>> +   gksched = thread->gksched;
>>> +   if (gksched) {
>>> +   xnlock_put_irqrestore(&nklock, s);
>>
>> Are we sure irqs are on here? Are you sure that what is needed is not an
>> xnlock_clear_irqon?
>
> We are in the context of do_exit. Not only IRQs are on, also preemption.
> And surely no nklock is held.
>
>> Furthermore, I do not understand how we
>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
>> wait for this assignment?
>
> The gatekeeper holds the gksync token while it's active. We request it,
> thus we wait for the gatekeeper to become idle again. While it is idle,
> we reset the queued reference - but I just realized that this may tramp
> on other tasks' values. I need to add a check that the value to be
> null'ified is actually still ours.

 Thinking again, that's actually not a problem: gktarget is only needed
 while gksync is zero - but then we won't get hold of it anyway and,
 thus, can't cause any damage.
>>>
>>> Well, you make it look like it does not work. From what I understand,
>>> what you want is to set gktarget to null if a task being hardened is
>>> destroyed. But by waiting for the semaphore, you actually wait for the
>>> harden to be complete, so setting to NULL is useless. Or am I missing
>>> something else?
>>
>> Setting to NULL is probably unneeded but still better than rely on the
>> gatekeeper never waking up spuriously and then dereferencing a stale
>> pointer.
>>
>> The key element of this fix is waitng on gksync, thus on the completion
>> of the non-RT part of the hardening. Actually, this part usually fails
>> as the target task received a termination signal at this point.
> 
> Yes, but since you wait on the completion of the hardening, the test
> if (target &&...) in the gatekeeper code will always be true, because at
> this point the cleanup code will still be waiting for the semaphore.

Yes, except if we ever wake up the gatekeeper later on without an updated
gktarget, i.e. spuriously. Better safe than sorry; this is hairy code anyway
(hopefully obsolete one day).

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Gilles Chanteperdrix
On 07/11/2011 09:59 PM, Jan Kiszka wrote:
> On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
>> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>>> On 2011-07-11 21:10, Jan Kiszka wrote:
 On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
> On 07/08/2011 06:29 PM, GIT version control wrote:
>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>  
>>  xnlock_get_irqsave(&nklock, s);
>> +
>> +gksched = thread->gksched;
>> +if (gksched) {
>> +xnlock_put_irqrestore(&nklock, s);
>
> Are we sure irqs are on here? Are you sure that what is needed is not an
> xnlock_clear_irqon?

 We are in the context of do_exit. Not only IRQs are on, also preemption.
 And surely no nklock is held.

> Furthermore, I do not understand how we
> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
> wait for this assignment?

 The gatekeeper holds the gksync token while it's active. We request it,
 thus we wait for the gatekeeper to become idle again. While it is idle,
 we reset the queued reference - but I just realized that this may tramp
 on other tasks' values. I need to add a check that the value to be
 null'ified is actually still ours.
>>>
>>> Thinking again, that's actually not a problem: gktarget is only needed
>>> while gksync is zero - but then we won't get hold of it anyway and,
>>> thus, can't cause any damage.
>>
>> Well, you make it look like it does not work. From what I understand,
>> what you want is to set gktarget to null if a task being hardened is
>> destroyed. But by waiting for the semaphore, you actually wait for the
>> harden to be complete, so setting to NULL is useless. Or am I missing
>> something else?
> 
> Setting to NULL is probably unneeded but still better than rely on the
> gatekeeper never waking up spuriously and then dereferencing a stale
> pointer.
> 
> The key element of this fix is waitng on gksync, thus on the completion
> of the non-RT part of the hardening. Actually, this part usually fails
> as the target task received a termination signal at this point.

Yes, but since you wait on the completion of the hardening, the test
if (target &&...) in the gatekeeper code will always be true, because at
this point the cleanup code will still be waiting for the semaphore.
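
To illustrate, on the gatekeeper side (sketch; the resume step is a
placeholder, not the actual code):

	struct xnthread *target = sched->gktarget;

	if (target && xnthread_test_info(target, XNATOMIC)) {
		/* Always reached in the exit scenario: the dying task is still
		   blocked in down(&gksched->gksync) inside do_taskexit_event(),
		   so it can neither have been freed nor have touched gktarget
		   yet. */
		resume_shadow(sched, target);	/* placeholder */
	}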

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Jan Kiszka
On 2011-07-11 21:51, Gilles Chanteperdrix wrote:
> On 07/11/2011 09:16 PM, Jan Kiszka wrote:
>> On 2011-07-11 21:10, Jan Kiszka wrote:
>>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
 On 07/08/2011 06:29 PM, GIT version control wrote:
> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
> task_struct *p)
>   magic = xnthread_get_magic(thread);
>  
>   xnlock_get_irqsave(&nklock, s);
> +
> + gksched = thread->gksched;
> + if (gksched) {
> + xnlock_put_irqrestore(&nklock, s);

 Are we sure irqs are on here? Are you sure that what is needed is not an
 xnlock_clear_irqon?
>>>
>>> We are in the context of do_exit. Not only IRQs are on, also preemption.
>>> And surely no nklock is held.
>>>
 Furthermore, I do not understand how we
 "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
 wait for this assignment?
>>>
>>> The gatekeeper holds the gksync token while it's active. We request it,
>>> thus we wait for the gatekeeper to become idle again. While it is idle,
>>> we reset the queued reference - but I just realized that this may tramp
>>> on other tasks' values. I need to add a check that the value to be
>>> null'ified is actually still ours.
>>
>> Thinking again, that's actually not a problem: gktarget is only needed
>> while gksync is zero - but then we won't get hold of it anyway and,
>> thus, can't cause any damage.
> 
> Well, you make it look like it does not work. From what I understand,
> what you want is to set gktarget to null if a task being hardened is
> destroyed. But by waiting for the semaphore, you actually wait for the
> harden to be complete, so setting to NULL is useless. Or am I missing
> something else?

Setting to NULL is probably unneeded but still better than relying on the
gatekeeper never waking up spuriously and then dereferencing a stale
pointer.

The key element of this fix is waiting on gksync, thus on the completion
of the non-RT part of the hardening. Actually, this part usually fails
as the target task has received a termination signal at this point.
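
Spelled out, the exit-path barrier is just the down()/up() pair on gksync from
the patch:

	down(&gksched->gksync);	/* blocks until the gatekeeper is done with any
				   in-flight target and releases the token */
	up(&gksched->gksync);	/* hand the token straight back - we only needed
				   the rendezvous, not the resource */

Using the semaphore as a pure completion barrier this way is why the fix still
holds even though the hardening itself usually fails here.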

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Gilles Chanteperdrix
On 07/11/2011 09:16 PM, Jan Kiszka wrote:
> On 2011-07-11 21:10, Jan Kiszka wrote:
>> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>>> On 07/08/2011 06:29 PM, GIT version control wrote:
 @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
 task_struct *p)
magic = xnthread_get_magic(thread);
  
xnlock_get_irqsave(&nklock, s);
 +
 +  gksched = thread->gksched;
 +  if (gksched) {
 +  xnlock_put_irqrestore(&nklock, s);
>>>
>>> Are we sure irqs are on here? Are you sure that what is needed is not an
>>> xnlock_clear_irqon?
>>
>> We are in the context of do_exit. Not only IRQs are on, also preemption.
>> And surely no nklock is held.
>>
>>> Furthermore, I do not understand how we
>>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
>>> wait for this assignment?
>>
>> The gatekeeper holds the gksync token while it's active. We request it,
>> thus we wait for the gatekeeper to become idle again. While it is idle,
>> we reset the queued reference - but I just realized that this may tramp
>> on other tasks' values. I need to add a check that the value to be
>> null'ified is actually still ours.
> 
> Thinking again, that's actually not a problem: gktarget is only needed
> while gksync is zero - but then we won't get hold of it anyway and,
> thus, can't cause any damage.

Well, you make it look like it does not work. From what I understand,
what you want is to set gktarget to NULL if a task being hardened is
destroyed. But by waiting for the semaphore, you actually wait for the
hardening to complete, so setting it to NULL is useless. Or am I missing
something else?

-- 
Gilles.



Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Jan Kiszka
On 2011-07-11 21:10, Jan Kiszka wrote:
> On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
>> On 07/08/2011 06:29 PM, GIT version control wrote:
>>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>>> task_struct *p)
>>> magic = xnthread_get_magic(thread);
>>>  
>>> xnlock_get_irqsave(&nklock, s);
>>> +
>>> +   gksched = thread->gksched;
>>> +   if (gksched) {
>>> +   xnlock_put_irqrestore(&nklock, s);
>>
>> Are we sure irqs are on here? Are you sure that what is needed is not an
>> xnlock_clear_irqon?
> 
> We are in the context of do_exit. Not only IRQs are on, also preemption.
> And surely no nklock is held.
> 
>> Furthermore, I do not understand how we
>> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
>> wait for this assignment?
> 
> The gatekeeper holds the gksync token while it's active. We request it,
> thus we wait for the gatekeeper to become idle again. While it is idle,
> we reset the queued reference - but I just realized that this may tramp
> on other tasks' values. I need to add a check that the value to be
> null'ified is actually still ours.

Thinking again, that's actually not a problem: gktarget is only needed
while gksync is zero - but then we won't get hold of it anyway and,
thus, can't cause any damage.

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Jan Kiszka
On 2011-07-11 20:53, Gilles Chanteperdrix wrote:
> On 07/08/2011 06:29 PM, GIT version control wrote:
>> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
>> task_struct *p)
>>  magic = xnthread_get_magic(thread);
>>  
>>  xnlock_get_irqsave(&nklock, s);
>> +
>> +gksched = thread->gksched;
>> +if (gksched) {
>> +xnlock_put_irqrestore(&nklock, s);
> 
> Are we sure irqs are on here? Are you sure that what is needed is not an
> xnlock_clear_irqon?

We are in the context of do_exit. Not only are IRQs on, preemption is enabled as well.
And surely no nklock is held.

> Furthermore, I do not understand how we
> "synchronize" with the gatekeeper, how is the gatekeeper garanteed to
> wait for this assignment?

The gatekeeper holds the gksync token while it's active. We request it,
thus we wait for the gatekeeper to become idle again. While it is idle,
we reset the queued reference - but I just realized that this may trample
on other tasks' values. I need to add a check that the value to be
null'ified is actually still ours.
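
Something along these lines (untested sketch; gksched is the gatekeeper's
scheduler slot, as elsewhere in this thread):

	/* exit path (do_taskexit_event) */
	down(&gksched->gksync);			/* wait until the gatekeeper is idle again */
	up(&gksched->gksync);

	xnlock_get_irqsave(&nklock, s);
	if (gksched->gktarget == thread)	/* the ownership check mentioned above */
		gksched->gktarget = NULL;	/* only reset the queued reference if it is still ours */
	xnlock_put_irqrestore(&nklock, s);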

Jan





Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

2011-07-11 Thread Gilles Chanteperdrix
On 07/08/2011 06:29 PM, GIT version control wrote:
> @@ -2528,6 +2534,22 @@ static inline void do_taskexit_event(struct 
> task_struct *p)
>   magic = xnthread_get_magic(thread);
>  
>   xnlock_get_irqsave(&nklock, s);
> +
> + gksched = thread->gksched;
> + if (gksched) {
> + xnlock_put_irqrestore(&nklock, s);

Are we sure IRQs are on here? Are you sure that what is needed is not an
xnlock_clear_irqon? Furthermore, I do not understand how we
"synchronize" with the gatekeeper; how is the gatekeeper guaranteed to
wait for this assignment?


-- 
Gilles.
