[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
On Fri, 2006-09-15 at 00:12 +0200, Dmitry Adamushko wrote:
> [...] And it's not a good idea to get ipipe_catch_event() busy-spinning
> as (as I understand) ipipe_dispatch_event() may take, in general, an
> unbounded amount of time.

Actually, this is not an issue, since there is no requirement for
bounded execution time in ipipe_catch_event(). This call is part of the
initialization chores; it is not meant to be deterministic.

> Ok, some more weird thoughts off the top of my mind. I hope the idea
> is clear, notwithstanding my description, which may not be that clear,
> but hey... it's a late hour :)
>
> ipipe_dispatch_event()
> {
>     ...
>     ipipe_lock_cpu(flags);
>
>     start_domain = this_domain = ipipe_percpu_domain[cpuid];
>
>     list_for_each_safe(pos,npos,&__ipipe_pipeline) {
>         next_domain = list_entry(pos,struct ipipe_domain,p_link);
> +       event_handler = next_domain->evhand[event];
>         if (next_domain->evhand[event] != NULL) {
>             ipipe_percpu_domain[cpuid] = next_domain;
> +           atomic_inc(somewhere_stored->counter);
>             ipipe_unlock_cpu(flags);
> -           propagate = !next_domain->evhand[event](event,start_domain,data);
> +           propagate = !event_handler(...);
>             ipipe_lock_cpu(flags);
> +           if (atomic_dec(somewhere_stored->counter) == 0)
> +               send_virtual_irq(virt_irq, EVENT_TYPE, arg); // do it per interested domain
>     ...

This is going to hurt whenever the handler suspends or even kills the
current context, e.g. think of fault handling in xnpod_trap_fault over
a kernel-based Xenomai thread context. In such a case, the counter
would never be decremented; we could clear the counter/flag of the
outgoing domain upon domain switch, so that it disappears when it is no
longer relevant.

> }
>
> then ipipe_catch_event(..., NULL); should do something along the
> following lines:
>
> ipipe_catch_event()
> {
>     ...
>     lock(); // not sure it's even necessary

We could take the critical lock just for the section that actually
clears the handler and resets the event flags, so that either
__ipipe_dispatch_event on any CPU sees a null handler and discards the
invocation, or a non-null one and fires it, with the added guarantee
that the target code won't be unmapped under its feet. The locking
would not cover the event handler invocation proper, to keep latencies
low.

>     set ipd->evhand[event] to NULL;
>     unlock();
>     // gets blocked
>     ipipe_get_synched(EVENT_TYPE, arg);
>     ...
> }
>
> ipipe_gets_synched()
>     - lock
>     - if somewhere_stored->counter != 0
>         - adds the caller to some wait queue (impl. depends on domain)
>     - unlock
>     - gets blocked.
>
> virtual_irq_handler()
>     - lock
>     - wakeup_all_blocked for a given EVENT_TYPE and domain
>     - unlock

The virtual IRQ looks like overkill if we admit that we only want to
solve the Linux domain + null handler case in ipipe_catch_event(); the
latter could simply poll the counter using schedule_timeout(), since we
don't care how long it needs to complete anyway.

Another idea is to provide a per-CPU, per-event flag for detecting
ongoing calls to each event handler, which we could test in
ipipe_catch_event(); once a handler has been cleared, the associated
flag can never be raised again, protecting the code against livelocking
(unless somebody spuriously calls ipipe_catch_event again for the same
event with a non-null handler in the meantime, but in such a case it
would be more of a usage issue than an Adeos bug).

Here is a tentative implementation that seems to work on UP, but still
needs to be tested over SMP.
Patch is against 2.6.17-1.3-10:

--- 2.6.17-ipipe-1.3-10/include/linux/ipipe.h   2006-09-15 10:13:15.0 +0200
+++ 2.6.17-ipipe/include/linux/ipipe.h  2006-09-14 20:06:38.0 +0200
@@ -50,12 +50,14 @@
 #define IPIPE_SPRINTK_FLAG  0   /* Synchronous printk() allowed */
 #define IPIPE_AHEAD_FLAG    1   /* Domain always heads the pipeline */
 
+/* Per-cpu pipeline status */
 #define IPIPE_STALL_FLAG    0   /* Stalls a pipeline stage -- guaranteed at bit #0 */
 #define IPIPE_SYNC_FLAG     1   /* The interrupt syncer is running for the domain */
 #define IPIPE_NOSTACK_FLAG  2   /* Domain currently runs on a foreign stack */
 
 #define IPIPE_SYNC_MASK     (1 << IPIPE_SYNC_FLAG)
 
+/* Interrupt control bits */
 #define IPIPE_HANDLE_FLAG   0
 #define IPIPE_PASS_FLAG     1
 #define IPIPE_ENABLE_FLAG   2
@@ -149,6 +151,7 @@ struct ipipe_domain {
 			unsigned long pending_hits;
 			unsigned long total_hits;
 		} irq_counters[IPIPE_NR_IRQS];
+		unsigned long long evsync;
 	} ____cacheline_aligned_in_smp cpudata[IPIPE_NR_CPUS];
 
 	struct {
@@ -409,6 +412,7 @@ static inline void __ipipe_switch_to(str
  * pipeline (and obviously different). */
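The evsync field added above, together with the schedule_timeout() polling discussed earlier, can be modeled in user space. The sketch below is illustrative only: all names (event_enter, event_in_flight, NR_CPUS value) are made up for the demonstration and are not part of the I-pipe API; the dispatcher raises the event bit before invoking a handler and clears it afterwards, and the unregistering side polls until no CPU still runs the old handler.

```c
#include <assert.h>

/* Hypothetical user-space model of the per-CPU, per-event "evsync"
 * flags from the patch above. Names are illustrative, not the real
 * I-pipe API. */

#define NR_CPUS 2

static unsigned long long evsync[NR_CPUS]; /* one 64-bit mask per CPU */

static void event_enter(int cpu, unsigned event)
{
    evsync[cpu] |= (1ULL << event);   /* handler is now running */
}

static void event_exit(int cpu, unsigned event)
{
    evsync[cpu] &= ~(1ULL << event);  /* handler returned */
}

/* What ipipe_catch_event(..., NULL) would poll -- in the kernel via
 * schedule_timeout(), since the wait need not be bounded -- until it
 * reads 0 for every CPU. */
static int event_in_flight(unsigned event)
{
    int cpu;
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (evsync[cpu] & (1ULL << event))
            return 1;
    return 0;
}
```

Once a handler is cleared, the dispatcher can never raise the bit again, so the poll terminates; this is the livelock protection mentioned above.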
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
Philippe Gerum wrote:
> diff -uNrp 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c 2.6.17-ipipe/kernel/ipipe/core.c
> --- 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c    2006-09-15 10:13:15.0 +0200
> +++ 2.6.17-ipipe/kernel/ipipe/core.c   2006-09-14 20:09:22.0 +0200
> ...
> @@ -638,6 +642,7 @@ int ipipe_control_irq(unsigned irq, unsi
>  int fastcall __ipipe_dispatch_event (unsigned event, void *data)
>  {
>  	struct ipipe_domain *start_domain, *this_domain, *next_domain;
> +	ipipe_event_handler_t evhand;
>  	struct list_head *pos, *npos;
>  	unsigned long flags;
>  	ipipe_declare_cpuid;
> @@ -649,8 +654,6 @@ int fastcall __ipipe_dispatch_event (uns
>  	list_for_each_safe(pos,npos,&__ipipe_pipeline) {
> -		next_domain = list_entry(pos,struct ipipe_domain,p_link);
> -
>  		/*
>  		 * Note: Domain migration may occur while running
>  		 * event or interrupt handlers, in which case the
> @@ -659,11 +662,22 @@ int fastcall __ipipe_dispatch_event (uns
>  		 * care for that, always tracking the current domain
>  		 * descriptor upon return from those handlers.
>  		 */
> -		if (next_domain->evhand[event] != NULL) {
> +		next_domain = list_entry(pos,struct ipipe_domain,p_link);
> +
> +		/*
> +		 * Keep a cached copy of the handler's address since
> +		 * ipipe_catch_event() may clear it under our feet.
> +		 */
> +
> +		evhand = next_domain->evhand[event];
> +
> +		if (evhand != NULL) {
>  			ipipe_percpu_domain[cpuid] = next_domain;
> +			next_domain->cpudata[cpuid].evsync |= (1LL << event);

Isn't there still a race window between grabbing evhand and setting
this bit here? If yes, can this be solved by moving the set/clear of
the flag out of the condition?

If this is not the solution, I wonder if we should really care that
much. Unregistering an event handler remains a rarely used scenario
which actually never happens with the nucleus built in. Shouldn't we
rather put some grace period after it and live with the fact that on a
*very busy* system it *may* cause trouble once in a while? Reminds me
of the fact that procfs handler unregistration with private data
attached is also racy by design...
Jan

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
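The caching trick in the patch quoted above can be shown in isolation. The following single-threaded sketch uses invented names (evhand_table, catch_event, dispatch_event) and simply models the interleaving: the dispatcher samples the handler pointer once, and thereafter calls only the cached copy, so a concurrent NULL-ing of the table slot can no longer turn into a call through a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of "cache the handler's address" from the patch;
 * not the real I-pipe code. The critical lock / masked-IRQ section is
 * implied by the single-threaded setting. */

typedef int (*evhand_t)(unsigned event);

static evhand_t evhand_table[64];
static int calls;

static int demo_handler(unsigned event) { (void)event; calls++; return 1; }

/* Stand-in for ipipe_catch_event(): installs or clears a handler. */
static void catch_event(unsigned event, evhand_t h)
{
    evhand_table[event] = h;
}

/* Stand-in for __ipipe_dispatch_event(): samples the slot once. */
static int dispatch_event(unsigned event)
{
    evhand_t cached = evhand_table[event];

    if (cached == NULL)
        return 0;      /* null handler: invocation discarded */

    /* even if catch_event(event, NULL) ran between the sample and this
     * call, we invoke the copy we took, never a freshly-cleared slot */
    return cached(event);
}
```

Without the cached copy, re-reading `evhand_table[event]` for the call itself reproduces the original race: the second read can yield NULL (or a stale module address) after the first read passed the test.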
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
On Fri, 2006-09-15 at 10:33 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > diff -uNrp 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c 2.6.17-ipipe/kernel/ipipe/core.c
> > --- 2.6.17-ipipe-1.3-10/kernel/ipipe/core.c    2006-09-15 10:13:15.0 +0200
> > +++ 2.6.17-ipipe/kernel/ipipe/core.c   2006-09-14 20:09:22.0 +0200
> > ...
> > @@ -638,6 +642,7 @@ int ipipe_control_irq(unsigned irq, unsi
> >  int fastcall __ipipe_dispatch_event (unsigned event, void *data)
> >  {
> >  	struct ipipe_domain *start_domain, *this_domain, *next_domain;
> > +	ipipe_event_handler_t evhand;
> >  	struct list_head *pos, *npos;
> >  	unsigned long flags;
> >  	ipipe_declare_cpuid;
> > @@ -649,8 +654,6 @@ int fastcall __ipipe_dispatch_event (uns
> >  	list_for_each_safe(pos,npos,&__ipipe_pipeline) {
> > -		next_domain = list_entry(pos,struct ipipe_domain,p_link);
> > -
> >  		/*
> >  		 * Note: Domain migration may occur while running
> >  		 * event or interrupt handlers, in which case the
> > @@ -659,11 +662,22 @@ int fastcall __ipipe_dispatch_event (uns
> >  		 * care for that, always tracking the current domain
> >  		 * descriptor upon return from those handlers.
> >  		 */
> > -		if (next_domain->evhand[event] != NULL) {
> > +		next_domain = list_entry(pos,struct ipipe_domain,p_link);
> > +
> > +		/*
> > +		 * Keep a cached copy of the handler's address since
> > +		 * ipipe_catch_event() may clear it under our feet.
> > +		 */
> > +
> > +		evhand = next_domain->evhand[event];
> > +
> > +		if (evhand != NULL) {
> >  			ipipe_percpu_domain[cpuid] = next_domain;
> > +			next_domain->cpudata[cpuid].evsync |= (1LL << event);
>
> Isn't there still a race window between grabbing evhand and setting
> this bit here? If yes, can this be solved by moving the set/clear of
> the flag out of the condition?

ipipe_catch_event(), which changes the handler's address, must hold the
critical lock to do so, and since we are running with masked IRQs in
the code above, we prevent the lock from being obtained (i.e. through
the critical IPI) for the current CPU, so this is ok.

> If this is not the solution, I wonder if we should really care that
> much.
> Unregistering an event handler remains a rarely used scenario which
> actually never happens with the nucleus built in. Shouldn't we rather
> put some grace period after it and live with the fact that on a *very
> busy* system it *may* cause trouble once in a while? Reminds me of the
> fact that procfs handler unregistration with private data attached is
> also racy by design...

I agree, this is a corner case which is extremely unlikely to happen in
production systems, since we would likely never unplug the nucleus in
the first place. Still, I think that if we can fix this issue with
minimum overhead, it should be done. So far, the cost involved is
changing a 64-bit value twice in the event dispatcher, and once in the
domain switch code. I'd prefer to limit this to 32-bit values, but
unfortunately can't yet, until some of the x86-specific exceptions are
grouped under a single event code, so that we don't need more than 32
event bits to flag them.

-- 
Philippe.
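The 64-bit-vs-32-bit point above can be illustrated. The sketch below is a made-up numbering, not the real Adeos event map: if each of the 32 x86 exception vectors gets its own event code, the event space exceeds 32 bits and the per-CPU sync word must be an `unsigned long long`; folding all exceptions into one grouped code keeps every event below bit 32.

```c
#include <assert.h>

/* Hypothetical event numbering to illustrate the 32/64-bit trade-off
 * discussed above; not the real Adeos event codes. */

#define NR_X86_EXCEPTIONS 32   /* vectors 0..31 */

enum { EV_SCHEDULE = 0, EV_SIGNAL = 1, EV_FAULT = 2, EV_GROUPED_MAX };

/* One-event-per-exception scheme: exception vectors start after the
 * generic events, so the highest event index is >= 32 and its flag bit
 * no longer fits in a 32-bit mask. */
static unsigned event_of_exception_flat(unsigned trapnr)
{
    return EV_GROUPED_MAX + trapnr;
}

/* Grouped scheme: every exception maps to the single EV_FAULT code;
 * the trap number would travel in the event data instead. */
static unsigned event_of_exception_grouped(unsigned trapnr)
{
    (void)trapnr;
    return EV_FAULT;
}

static int fits_in_32bits(unsigned event) { return event < 32; }
```

With the grouped scheme every event bit stays below 32, which is what would allow shrinking `evsync` from `unsigned long long` to `unsigned long`.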
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
On 15/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:
> On Fri, 2006-09-15 at 00:12 +0200, Dmitry Adamushko wrote:
> > [...] And it's not a good idea to get ipipe_catch_event()
> > busy-spinning as (as I understand) ipipe_dispatch_event() may take,
> > in general, an unbounded amount of time.
>
> Actually, this is not an issue, since there is no requirement for
> bounded execution time in ipipe_catch_event(). This call is part of
> the initialization chores; it is not meant to be deterministic.

I meant it's not ok to keep a CPU spinning all this time. But your
approach with schedule_timeout() is really better. And probably we
don't need to try being more general than is really required, as we
need to get synched only with ipipe_catch_event(..., NULL) from the
Linux domain.

-- 
Best regards,
Dmitry Adamushko
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
On 13/09/06, Jan Kiszka [EMAIL PROTECTED] wrote:
> It's the indirect call to the event handler.
>
>   8b3:	8b 55 e4             	mov    0xffe4(%ebp),%edx
>   8b6:	50                   	push   %eax
> - 8b7:	ff 94 93 80 22 00 00 	call   *0x2280(%ebx,%edx,4)
>   8be:	83 c4 0c             	add    $0xc,%esp
>   8c1:	85 c0                	test   %eax,%eax
>
> In my case the kernel tries to access the address 0xd09bc5e5 which
> seems like it used to be a valid one.

I suppose you have >= 256 MB on this machine? If so, 0xd09bc5e5 belongs
to the vmalloc()'ed area (in the past), so that was likely some module
(e.g. the nucleus).

> So this looks like we really need some mechanism to make sure all CPUs
> use the updated pointers after unhooking some event handler and before
> proceeding with further cleanups.

It's more complicated, I'm afraid. We need to be sure the event handler
function itself will not disappear in the meantime. Hence, module
unloading should be delayed, iow something akin to synchronize_rcu()
that blocks a cleanup caller (which calls unregister_domain()) until
all the readers are done with their activities. Maybe it's wrong; some
more code reading would be required.

-- 
Best regards,
Dmitry Adamushko
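The synchronize_rcu()-like delay suggested above reduces to a reader-count handshake: each in-flight handler invocation counts as a reader, and the cleanup path may only tear the handler down once the count has drained. A minimal model, with invented names (handler_enter, try_synchronize) and no claim to match any real RCU API:

```c
#include <assert.h>

/* Toy model of a grace period for event-handler teardown: invocations
 * are "readers", and cleanup must wait until none remain. Illustrative
 * names only. */

static int readers;

static void handler_enter(void) { readers++; }  /* dispatcher, pre-call  */
static void handler_exit(void)  { readers--; }  /* dispatcher, post-call */

/* Returns 1 when the module text may safely disappear; a kernel
 * version would block on a wait queue or poll with schedule_timeout()
 * instead of returning 0. */
static int try_synchronize(void)
{
    return readers == 0;
}
```

The real problem, as the thread notes, is harder: a handler may suspend or kill its context and never reach handler_exit(), which is why the final patch clears the per-domain flags on domain switch instead of relying on balanced enter/exit pairs.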
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
> The proper fix is to synchronize ipipe_catch_event(..., NULL) with the
> event dispatcher, so that any caller could legitimately assume that no
> subsequent call to the former handler will happen on any CPU after
> this service has returned. Since ipipe_unregister_domain() may already
> legitimately assume that all event handlers have been cleared using
> ipipe_catch_event() by the caller before proceeding, the issue would
> be solved.

That's what I understand. I just said a cleanup caller, which amongst
other things does ipipe_unregister_domain(). Although, yep, the latter
is not necessarily involved.

> The difficult part is to refrain from masking the hw interrupts when
> running the event handlers for the sake of keeping short latencies
> (nothing would prevent those handlers from re-enabling interrupts
> anyway). IOW, using a big stick like the critical inter-CPU lock is
> not the preferred option.

And it's not a good idea to get ipipe_catch_event() busy-spinning as
(as I understand) ipipe_dispatch_event() may take, in general, an
unbounded amount of time.

Ok, some more weird thoughts off the top of my mind. I hope the idea is
clear, notwithstanding my description, which may not be that clear, but
hey... it's a late hour :)

ipipe_dispatch_event()
{
    ...
    ipipe_lock_cpu(flags);

    start_domain = this_domain = ipipe_percpu_domain[cpuid];

    list_for_each_safe(pos,npos,&__ipipe_pipeline) {
        next_domain = list_entry(pos,struct ipipe_domain,p_link);
+       event_handler = next_domain->evhand[event];
        if (next_domain->evhand[event] != NULL) {
            ipipe_percpu_domain[cpuid] = next_domain;
+           atomic_inc(somewhere_stored->counter);
            ipipe_unlock_cpu(flags);
-           propagate = !next_domain->evhand[event](event,start_domain,data);
+           propagate = !event_handler(...);
            ipipe_lock_cpu(flags);
+           if (atomic_dec(somewhere_stored->counter) == 0)
+               send_virtual_irq(virt_irq, EVENT_TYPE, arg); // do it per interested domain
    ...
}

then ipipe_catch_event(..., NULL); should do something along the
following lines:

ipipe_catch_event()
{
    ...
    lock(); // not sure it's even necessary
    set ipd->evhand[event] to NULL;
    unlock();
    // gets blocked
    ipipe_get_synched(EVENT_TYPE, arg);
    ...
}

ipipe_gets_synched()
    - lock
    - if somewhere_stored->counter != 0
        - adds the caller to some wait queue (impl. depends on domain)
    - unlock
    - gets blocked.

virtual_irq_handler()
    - lock
    - wakeup_all_blocked for a given EVENT_TYPE and domain
    - unlock

-- 
Best regards,
Dmitry Adamushko
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
On 12/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:
> On Tue, 2006-09-12 at 15:24 +0200, Nils Kemper wrote:
> > Hi, I want to use Xenomai, but I get (sometimes, but every time the
> > same) kernel-Oops just by running xeno-test: [..]
> > Xenomai: stopping native API services.
> > I-pipe: Domain Xenomai unregistered.
> > Xenomai: hal/x86 stopped.
> > Xenomai: real-time nucleus unloaded.
>
> Does the issue still pop up if you set the Xenomai nucleus as static
> (i.e. not as a module) in the kernel configuration?

Just a weird presupposition. In __ipipe_dispatch_event():

    ipipe_lock_cpu(flags);

    start_domain = this_domain = ipipe_percpu_domain[cpuid];

    list_for_each_safe(pos,npos,&__ipipe_pipeline) {
        next_domain = list_entry(pos,struct ipipe_domain,p_link);
        //...
        if (next_domain->evhand[event] != NULL) {
            ipipe_percpu_domain[cpuid] = next_domain;
            ipipe_unlock_cpu(flags);
(1)         propagate = !next_domain->evhand[event](event,start_domain,data);

Does anything prevent another thread from preempting the current one at
(1) and making next_domain invalid? Then, if next_domain == rthal_domain
(aka Xenomai), e.g. because someone unloaded all the modules:

- if it's static: rthal_domain is still kind of a valid object; it's at
  least in a valid memory region, and evhand points to a valid
  function. It's even possible to jump to the next element if the
  rthal_domain fields were not cleared...

- non-static: the module image was unloaded, and next_domain doesn't
  point to anything reasonable.

Jan or Nils, what instructions does "objdump -d kernel/ipipe/core.o"
show for a given offset in __ipipe_dispatch_event()? 0xcd in the case
of Nils:

[c013f158] __ipipe_dispatch_event+0xcd/0xeb ?

-- 
Best regards,
Dmitry Adamushko
[Xenomai-core] Re: [Xenomai-help] Bad EIP kernel-Oops
Dmitry Adamushko wrote:
> On 12/09/06, Philippe Gerum [EMAIL PROTECTED] wrote:
> > On Tue, 2006-09-12 at 15:24 +0200, Nils Kemper wrote:
> > > Hi, I want to use Xenomai, but I get (sometimes, but every time
> > > the same) kernel-Oops just by running xeno-test: [..]
> > > Xenomai: stopping native API services.
> > > I-pipe: Domain Xenomai unregistered.
> > > Xenomai: hal/x86 stopped.
> > > Xenomai: real-time nucleus unloaded.
> >
> > Does the issue still pop up if you set the Xenomai nucleus as static
> > (i.e. not as a module) in the kernel configuration?
>
> Just a weird presupposition. In __ipipe_dispatch_event():
>
>     ipipe_lock_cpu(flags);
>
>     start_domain = this_domain = ipipe_percpu_domain[cpuid];
>
>     list_for_each_safe(pos,npos,&__ipipe_pipeline) {
>         next_domain = list_entry(pos,struct ipipe_domain,p_link);
>         //...
>         if (next_domain->evhand[event] != NULL) {
>             ipipe_percpu_domain[cpuid] = next_domain;
>             ipipe_unlock_cpu(flags);
> (1)         propagate = !next_domain->evhand[event](event,start_domain,data);
>
> Does anything prevent another thread from preempting the current one
> at (1) and making next_domain invalid?

That could explain it. I only read ipipe_lock_cpu during my first scan
of this code earlier today, missing the unlock. One had better save the
handler in a local variable before releasing the lock...

> Then, if next_domain == rthal_domain (aka Xenomai), e.g. because
> someone unloaded all the modules:
>
> - if it's static: rthal_domain is still kind of a valid object; it's
>   at least in a valid memory region, and evhand points to a valid
>   function. It's even possible to jump to the next element if the
>   rthal_domain fields were not cleared...
>
> - non-static: the module image was unloaded, and next_domain doesn't
>   point to anything reasonable.

Mmh, we probably need some grace period on unload to avoid such races.
Reminds me of issues with the IRQ proc output or the shared IRQ
deregistration...

> Jan or Nils, what instructions does "objdump -d kernel/ipipe/core.o"
> show for a given offset in __ipipe_dispatch_event()? 0xcd in the case
> of Nils:
> [c013f158] __ipipe_dispatch_event+0xcd/0xeb ?

Will check this tomorrow.

Jan