[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-15 Thread Philippe Gerum
On Fri, 2006-09-15 at 00:12 +0200, Dmitry Adamushko wrote:

[...]

 
 And it's not a good idea to have ipipe_catch_event() busy-spinning,
 since (as I understand it) ipipe_dispatch_event() may take, in general,
 an unbounded amount of time.
 

Actually, this is not an issue, since there is no requirement for
bounded execution time in ipipe_catch_event(). This call is part of the
initialization chores; it is not meant to be deterministic.

 Ok, some more weird thoughts off the top of my head. I hope the idea is
 clear, notwithstanding my description, which may not be that clear, but
 hey... it's a late hour :)
 
 
 
 ipipe_dispatch_event()
 {

 ...

        ipipe_lock_cpu(flags);

        start_domain = this_domain = ipipe_percpu_domain[cpuid];

        list_for_each_safe(pos,npos,&__ipipe_pipeline) {

                next_domain = list_entry(pos,struct ipipe_domain,p_link);

 +              event_handler = next_domain->evhand[event];

                if (next_domain->evhand[event] != NULL) {
                        ipipe_percpu_domain[cpuid] = next_domain;
 +                      atomic_inc(&somewhere_stored->counter);
                        ipipe_unlock_cpu(flags);

 -                      propagate = !next_domain->evhand[event](event,start_domain,data);
 +                      propagate = !event_handler(...);

                        ipipe_lock_cpu(flags);
 +                      if (atomic_dec(&somewhere_stored->counter) == 0)
 +                              send_virtual_irq(virt_irq, EVENT_TYPE, arg); // do it per interested domain
 ...

This is going to hurt whenever the handler suspends or even kills the
current context, e.g. think of fault handling in xnpod_trap_fault over
a kernel-based Xenomai thread context. In such a case, the counter would
never be decremented; we could clear the counter/flag of the outgoing
domain upon domain switch, so that it disappears when it is no longer
relevant.
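
A minimal sketch of that clearing idea, with hypothetical helper and
field names (the actual patch below tracks in-flight calls in a per-CPU
evsync bitmask rather than a counter):

        /*
         * On domain switch, forget any in-flight event marks owned by
         * the outgoing domain on this CPU, so that a handler which
         * suspends or kills the current context cannot leave its
         * counter/flag stuck forever.
         */
        static inline void __ipipe_flush_event_marks(struct ipipe_domain *out,
                                                     int cpuid)
        {
                out->cpudata[cpuid].evsync = 0;
        }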

 }
 
 then ipipe_catch_event(..., NULL) should do something along the
 following lines:

 ipipe_catch_event()
 {
 ...

        lock(); // not sure it's even necessary
 

We could take the critical lock, just for the section that actually
clears the handler and resets the event flags, so that either
__ipipe_dispatch_event on any CPU sees a null handler and discards the
invocation, or a non-null one and fires it, with the added guarantee
that the target code won't be unmapped under its feet. The locking would
not cover the event handler invocation proper, to keep latencies low.
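
As a sketch of that locking scope, assuming the stock critical section
helpers (simplified; not the actual ipipe_catch_event() body):

        /*
         * Only the handler update runs under the critical inter-CPU
         * lock; __ipipe_dispatch_event() keeps invoking handlers
         * outside of it, so worst-case latencies stay low.
         */
        unsigned long flags = ipipe_critical_enter(NULL);
        ipd->evhand[event] = handler;   /* NULL uninstalls the handler */
        /* ...reset the associated event flags here as well... */
        ipipe_critical_exit(flags);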

        set ipd->evhand[event] to NULL;
 
 unlock();
 
 
 // gets blocked
 ipipe_get_synched(EVENT_TYPE, arg);
 
 ...
 }
 
 ipipe_get_synched()
 - lock
 - if somewhere_stored->counter != 0
        - add the caller to some wait queue (impl. depends on domain)
 - unlock

 - gets blocked.


 virtual_irq_handler()

 - lock
 - wakeup_all_blocked for a given EVENT_TYPE and domain
 - unlock

The virtual IRQ looks overkill, if we admit that we only want to solve
the Linux domain + null handler case in ipipe_catch_event(); the latter
could simply poll the counter using schedule_timeout(), since we don't
care how long it needs to complete anyway. Another idea is to provide a
per-CPU, per-event flag for detecting ongoing calls to each event
handler, which we could test in ipipe_catch_event(); once a handler has
been cleared, the associated flag could never be raised again,
protecting the code against livelocking (unless some guy spuriously
calls ipipe_catch_event() again for the same event with a non-null
handler in the meantime, but in such a case, it would be more of a
usage issue than an Adeos bug).
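
For illustration, the Linux-domain polling could look something like
this (evsync layout as in the patch below; the loop itself is an
assumption, not the posted code):

        #include <linux/sched.h>

        /* Wait until no CPU still runs the just-cleared handler. */
        static void __ipipe_event_sync_wait(struct ipipe_domain *ipd,
                                            unsigned event)
        {
                int cpuid, busy;

                do {
                        busy = 0;
                        for (cpuid = 0; cpuid < num_online_cpus(); cpuid++)
                                if (ipd->cpudata[cpuid].evsync & (1LL << event))
                                        busy = 1;
                        if (busy) {
                                set_current_state(TASK_UNINTERRUPTIBLE);
                                schedule_timeout(HZ / 50); /* latency is irrelevant here */
                        }
                } while (busy);
        }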

Here is a tentative implementation that seems to work on UP, but still
needs to be tested over SMP. Patch is against 2.6.17-1.3-10:

--- 2.6.17-ipipe-1.3-10/include/linux/ipipe.h  2006-09-15 10:13:15.0 +0200
+++ 2.6.17-ipipe/include/linux/ipipe.h  2006-09-14 20:06:38.0 +0200
@@ -50,12 +50,14 @@
 #define IPIPE_SPRINTK_FLAG 0   /* Synchronous printk() allowed */
 #define IPIPE_AHEAD_FLAG   1   /* Domain always heads the pipeline */
 
+/* Per-cpu pipeline status */
 #define IPIPE_STALL_FLAG   0   /* Stalls a pipeline stage -- guaranteed at bit #0 */
 #define IPIPE_SYNC_FLAG    1   /* The interrupt syncer is running for the domain */
 #define IPIPE_NOSTACK_FLAG 2   /* Domain currently runs on a foreign stack */
 
 #define IPIPE_SYNC_MASK    (1 << IPIPE_SYNC_FLAG)
 
+/* Interrupt control bits */
 #define IPIPE_HANDLE_FLAG  0
 #define IPIPE_PASS_FLAG    1
 #define IPIPE_ENABLE_FLAG  2
@@ -149,6 +151,7 @@ struct ipipe_domain {
                unsigned long pending_hits;
                unsigned long total_hits;
        } irq_counters[IPIPE_NR_IRQS];
+       unsigned long long evsync;
 } cacheline_aligned_in_smp cpudata[IPIPE_NR_CPUS];
 
        struct {
@@ -409,6 +412,7 @@ static inline void __ipipe_switch_to(str
  * pipeline (and obviously different).
  */
 
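The quoted hunks stop before the point where the flag is dropped again;
the matching clear after the handler returns would presumably look
something like this (an assumption about the unposted part of the
patch, not a quote from it):

        ipipe_lock_cpu(flags);  /* back under the CPU lock */
        ipipe_load_cpuid();     /* the handler may have migrated us */
        next_domain->cpudata[cpuid].evsync &= ~(1LL << event);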

[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-15 Thread Jan Kiszka
Philippe Gerum wrote:
 diff -uNrp 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c 2.6.17-ipipe/kernel/ipipe/core.c
 --- 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c   2006-09-15 10:13:15.0 +0200
 +++ 2.6.17-ipipe/kernel/ipipe/core.c  2006-09-14 20:09:22.0 +0200
 ...
 @@ -638,6 +642,7 @@ int ipipe_control_irq(unsigned irq, unsi
  int fastcall __ipipe_dispatch_event (unsigned event, void *data)
  {
         struct ipipe_domain *start_domain, *this_domain, *next_domain;
 +       ipipe_event_handler_t evhand;
         struct list_head *pos, *npos;
         unsigned long flags;
         ipipe_declare_cpuid;
 @@ -649,8 +654,6 @@ int fastcall __ipipe_dispatch_event (uns
 
         list_for_each_safe(pos,npos,&__ipipe_pipeline) {
 
 -               next_domain = list_entry(pos,struct ipipe_domain,p_link);
 -
                 /*
                  * Note: Domain migration may occur while running
                  * event or interrupt handlers, in which case the
 @@ -659,11 +662,22 @@ int fastcall __ipipe_dispatch_event (uns
                  * care for that, always tracking the current domain
                  * descriptor upon return from those handlers.
                  */
 -               if (next_domain->evhand[event] != NULL) {
 +               next_domain = list_entry(pos,struct ipipe_domain,p_link);
 +
 +               /*
 +                * Keep a cached copy of the handler's address since
 +                * ipipe_catch_event() may clear it under our feet.
 +                */
 +
 +               evhand = next_domain->evhand[event];
 +
 +               if (evhand != NULL) {
                         ipipe_percpu_domain[cpuid] = next_domain;
 +                       next_domain->cpudata[cpuid].evsync |= (1LL << event);

Isn't there still a race window between grabbing evhand and setting this
bit here? If yes, can this be solved by moving set/clear flag out of the
condition?

If this is not the solution, I wonder if we should really care that
much. Unregistering an event handler remains a rarely used scenario
which actually never happens with the nucleus built in. Shouldn't we
better put some grace period after it and live with the fact that on a
*very busy* system it *may* cause troubles once in a while? Reminds me
of the fact that procfs handler unregistration with private data
attached is also racy by design...

Jan





[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-15 Thread Philippe Gerum
On Fri, 2006-09-15 at 10:33 +0200, Jan Kiszka wrote:
 Philippe Gerum wrote:
  diff -uNrp 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c 2.6.17-ipipe/kernel/ipipe/core.c
  --- 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c  2006-09-15 10:13:15.0 +0200
  +++ 2.6.17-ipipe/kernel/ipipe/core.c 2006-09-14 20:09:22.0 +0200
  ...
  @@ -638,6 +642,7 @@ int ipipe_control_irq(unsigned irq, unsi
   int fastcall __ipipe_dispatch_event (unsigned event, void *data)
   {
          struct ipipe_domain *start_domain, *this_domain, *next_domain;
  +       ipipe_event_handler_t evhand;
          struct list_head *pos, *npos;
          unsigned long flags;
          ipipe_declare_cpuid;
  @@ -649,8 +654,6 @@ int fastcall __ipipe_dispatch_event (uns
  
          list_for_each_safe(pos,npos,&__ipipe_pipeline) {
  
  -               next_domain = list_entry(pos,struct ipipe_domain,p_link);
  -
                  /*
                   * Note: Domain migration may occur while running
                   * event or interrupt handlers, in which case the
  @@ -659,11 +662,22 @@ int fastcall __ipipe_dispatch_event (uns
                   * care for that, always tracking the current domain
                   * descriptor upon return from those handlers.
                   */
  -               if (next_domain->evhand[event] != NULL) {
  +               next_domain = list_entry(pos,struct ipipe_domain,p_link);
  +
  +               /*
  +                * Keep a cached copy of the handler's address since
  +                * ipipe_catch_event() may clear it under our feet.
  +                */
  +
  +               evhand = next_domain->evhand[event];
  +
  +               if (evhand != NULL) {
                          ipipe_percpu_domain[cpuid] = next_domain;
  +                       next_domain->cpudata[cpuid].evsync |= (1LL << event);
 
 Isn't there still a race window between grabbing evhand and setting this
 bit here? If yes, can this be solved by moving set/clear flag out of the
 condition?
 

ipipe_catch_event(), which changes the handler's address, must hold the
critical lock to do so, and since we are running with masked IRQs in the
code above, we prevent that lock from being obtained (i.e. through the
critical IPI) for the current CPU, so this is ok.
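
In other words, the serialization relies on the dispatcher touching
evsync only inside its IRQs-off section; a simplified view of the flow
(condensed from the patch, not additional code):

        ipipe_lock_cpu(flags);  /* hw IRQs masked: the critical IPI cannot land */
        evhand = next_domain->evhand[event];    /* sees the old or the new handler, atomically */
        next_domain->cpudata[cpuid].evsync |= (1LL << event);
        ipipe_unlock_cpu(flags);        /* ipipe_critical_enter() may only proceed from here on */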

 If this is not the solution, I wonder if we should really care that
 much. Unregistering an event handler remains a rarely used scenario
 which actually never happens with the nucleus built in. Shouldn't we
 better put some grace period after it and live with the fact that on a
 *very busy* system it *may* cause troubles once in a while? Reminds me
 of the fact that procfs handler unregistration with private data
 attached is also racy by design...
 

I agree, this is a corner case which is extremely unlikely to happen in
production systems, since we would likely never unplug the nucleus in
the first place. Still, I think that if we can fix this issue with
minimum overhead, it should be done. So far, the cost involved is
changing a 64bit value twice in the event dispatcher, and once in the
domain switch code. I'd prefer to limit this to 32bit values, but
unfortunately can't yet, until some of the x86-specific exceptions are
grouped under a single event code, so that we don't need more than 32
event bits to flag them.

 Jan
 
-- 
Philippe.





[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-15 Thread Dmitry Adamushko
On 15/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:
 On Fri, 2006-09-15 at 00:12 +0200, Dmitry Adamushko wrote:
  [...]
  And it's not a good idea to have ipipe_catch_event() busy-spinning,
  since (as I understand it) ipipe_dispatch_event() may take, in
  general, an unbounded amount of time.

 Actually, this is not an issue, since there is no requirement for
 bounded execution time in ipipe_catch_event(). This call is part of the
 initialization chores; it is not meant to be deterministic.

I meant it's not ok to keep a CPU spinning all this time. But your
approach with schedule_timeout() is really better. And probably, we
don't need to try being more general than is really required, as we
only need to get synched with ipipe_catch_event(..., NULL) from the
Linux domain.


--
Best regards,
Dmitry Adamushko


[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-14 Thread Dmitry Adamushko

On 13/09/06, Jan Kiszka [EMAIL PROTECTED] wrote:

 It's the indirect call to the event handler.

     8b3:   8b 55 e4                mov    0xffe4(%ebp),%edx
     8b6:   50                      push   %eax
 ->  8b7:   ff 94 93 80 22 00 00    call   *0x2280(%ebx,%edx,4)
     8be:   83 c4 0c                add    $0xc,%esp
     8c1:   85 c0                   test   %eax,%eax

 In my case the kernel tries to access the address 0xd09bc5e5 which
 seems like it used to be a valid one.
I suppose you have >= 256 MB on this machine? If so, 0xd09bc5e5
belongs to the vmalloc()'ed area (in the past). So that was likely some
module (e.g. the nucleus).
 So this looks like we really need some mechanism to make sure all CPUs
 use the updated pointers after unhooking some event handler and before
 proceeding with further cleanups.
It's more complicated, I'm afraid. We need to be sure the event handler
function itself will not disappear in the meantime. Hence, module
unloading should be delayed, iow something akin to synchronize_rcu()
that blocks a cleanup caller (one which calls ipipe_unregister_domain())
until all the readers are done with their activities.

Maybe it's wrong. Some more code reading would be required.
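
The shape of the synchronization being described, as a rough sketch
(synchronize_rcu() stands in for whatever primitive would actually
cover non-Linux domains; this is not an existing Adeos API):

        /* Unhook, drain in-flight calls, then let the code vanish. */
        static void example_unhook_event(struct ipipe_domain *ipd,
                                         unsigned event)
        {
                ipipe_catch_event(ipd, event, NULL); /* no new invocations */
                synchronize_rcu();      /* wait out callers still inside the handler */
                /* only now may module unloading safely proceed */
        }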

 Jan

--
Best regards,
Dmitry Adamushko



[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-14 Thread Dmitry Adamushko

 The proper fix is to synchronize ipipe_catch_event(..., NULL) with the
 event dispatcher, so that any caller could legitimately assume that no
 subsequent call to the former handler will happen on any CPU after this
 service has returned. Since ipipe_unregister_domain() may already
 legitimately assume that all event handlers have been cleared using
 ipipe_catch_event() by the caller before proceeding, the issue would be
 solved.

That's what I understand. I just said a cleanup caller which, amongst
other things, does ipipe_unregister_domain(). Although, yep, the latter
one is not necessarily involved.

 The difficult part is to refrain from masking the hw interrupts when
 running the event handlers for the sake of keeping short latencies
 (nothing would prevent those handlers from re-enabling interrupts
 anyway). IOW, using a big stick like the critical inter-CPU lock is not
 the preferred option.
And it's not a good idea to have ipipe_catch_event() busy-spinning,
since (as I understand it) ipipe_dispatch_event() may take, in general,
an unbounded amount of time.

Ok, some more weird thoughts off the top of my head. I hope the idea is
clear, notwithstanding my description, which may not be that clear, but
hey... it's a late hour :)



ipipe_dispatch_event()
{

...

        ipipe_lock_cpu(flags);

        start_domain = this_domain = ipipe_percpu_domain[cpuid];

        list_for_each_safe(pos,npos,&__ipipe_pipeline) {

                next_domain = list_entry(pos,struct ipipe_domain,p_link);

+               event_handler = next_domain->evhand[event];

                if (next_domain->evhand[event] != NULL) {
                        ipipe_percpu_domain[cpuid] = next_domain;
+                       atomic_inc(&somewhere_stored->counter);
                        ipipe_unlock_cpu(flags);

-                       propagate = !next_domain->evhand[event](event,start_domain,data);
+                       propagate = !event_handler(...);

                        ipipe_lock_cpu(flags);
+                       if (atomic_dec(&somewhere_stored->counter) == 0)
+                               send_virtual_irq(virt_irq, EVENT_TYPE, arg); // do it per interested domain
...
}

then ipipe_catch_event(..., NULL) should do something along the following lines:

ipipe_catch_event()
{
...

        lock(); // not sure it's even necessary

        set ipd->evhand[event] to NULL;

        unlock();


        // gets blocked
        ipipe_get_synched(EVENT_TYPE, arg);

...
}

ipipe_get_synched()
- lock
- if somewhere_stored->counter != 0
        - add the caller to some wait queue (impl. depends on domain)
- unlock

- gets blocked.


virtual_irq_handler()

- lock
- wakeup_all_blocked for a given EVENT_TYPE and domain
- unlock
--
Best regards,
Dmitry Adamushko


[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-12 Thread Dmitry Adamushko
On 12/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:
 On Tue, 2006-09-12 at 15:24 +0200, Nils Kemper wrote:
  Hi,
  I want to use Xenomai, but I get (sometimes, but every time the same)
  kernel-Oops just by running xeno-test:
 
  [..]
  Xenomai: stopping native API services.
  I-pipe: Domain Xenomai unregistered.
  Xenomai: hal/x86 stopped.
  Xenomai: real-time nucleus unloaded.

 Does the issue still pop up if you set the Xenomai nucleus as static
 (i.e. not as a module) in the kernel configuration?


Just a weird supposition.

In __ipipe_dispatch_event():

        ipipe_lock_cpu(flags);

        start_domain = this_domain = ipipe_percpu_domain[cpuid];

        list_for_each_safe(pos,npos,&__ipipe_pipeline) {

                next_domain = list_entry(pos,struct ipipe_domain,p_link);

                //...

                if (next_domain->evhand[event] != NULL) {

                        ipipe_percpu_domain[cpuid] = next_domain;

                        ipipe_unlock_cpu(flags);
(1)
                        propagate = !next_domain->evhand[event](event,start_domain,data);
 
Does anything prevent another thread from preempting the current one at (1) and making next_domain invalid?

then:

if next_domain == rthal_domain (aka Xenomai) - e.g. someone unloaded all
the modules.

if it's static:

rthal_domain is still kind of a valid object - it's at least in a valid
memory region, and evhand points to a valid function. It's even possible
to jump to the next element if the rthal_domain fields were not
cleared...

if non-static:

the module image was unloaded, so next_domain doesn't point to anything
reasonable.

Jan or Nils, what instructions does objdump -d kernel/ipipe/core.o
show for a given offset in __ipipe_dispatch_event()?

0xcd in the case of Nils:

[c013f158] __ipipe_dispatch_event+0xcd/0xeb

--
Best regards,
Dmitry Adamushko




[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops

2006-09-12 Thread Jan Kiszka
Dmitry Adamushko wrote:
 On 12/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:

 On Tue, 2006-09-12 at 15:24 +0200, Nils Kemper wrote:
  Hi,
  I want to use Xenomai, but I get (sometimes, but every time the same)
  kernel-Oops just by running xeno-test:
 
  [..]
  Xenomai: stopping native API services.
  I-pipe: Domain Xenomai unregistered.
  Xenomai: hal/x86 stopped.
  Xenomai: real-time nucleus unloaded.

 Does the issue still pop up if you set the Xenomai nucleus as static
 (i.e. not as a module) in the kernel configuration?
 
 
 
 Just a weird presupposition.
 
 In __ipipe_dispatch_event()
 
        ipipe_lock_cpu(flags);
 
        start_domain = this_domain = ipipe_percpu_domain[cpuid];
 
        list_for_each_safe(pos,npos,&__ipipe_pipeline) {
 
                next_domain = list_entry(pos,struct ipipe_domain,p_link);
 
  //...
                if (next_domain->evhand[event] != NULL) {
                        ipipe_percpu_domain[cpuid] = next_domain;
                        ipipe_unlock_cpu(flags);
  (1)
                        propagate = !next_domain->evhand[event](event,start_domain,data);
 
 Does anything prevent another thread from preempting the current one at (1)
 and making next_domain invalid?

That could explain it. I only read ipipe_lock_cpu during my first scan
of this code earlier today, missing the unlock. One had better save the
handler in a local variable before releasing the lock...
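
A minimal sketch of that suggestion (essentially what Philippe's patch,
quoted earlier in this digest, ends up doing):

        ipipe_event_handler_t evhand;

        ipipe_lock_cpu(flags);
        evhand = next_domain->evhand[event];    /* snapshot under the lock */
        if (evhand != NULL) {
                ipipe_percpu_domain[cpuid] = next_domain;
                ipipe_unlock_cpu(flags);
                /* call through the local copy only */
                propagate = !evhand(event, start_domain, data);
                ipipe_lock_cpu(flags);
        }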

 
 then:
 
 if next_domain == rthal_domain (aka Xenomai) - e.g. someone unloaded
 all the modules.
 
 if it's static:
 
 rthal_domain is still kind of a valid object - it's at least in a valid
 memory region, and evhand points to a valid function. It's even
 possible to jump to the next element if the rthal_domain fields were
 not cleared...
 
 if non-static:
 
 the module image was unloaded, so next_domain doesn't point to anything
 reasonable.

Mmh, we probably need some grace period on unload to avoid such races.
Reminds me of issues with the IRQ proc output or the shared IRQ
deregistration...

 
 Jan or Nils, what instructions does objdump -d kernel/ipipe/core.o
 show for a given offset in __ipipe_dispatch_event()?
 
 0xcd in the case of Nils:
 
 [c013f158] __ipipe_dispatch_event+0xcd/0xeb
 
 

Will check this tomorrow.

Jan


